Re: [HACKERS] [PATCHES] Patch(es) to expose n_live_tuples and

2006-12-26 Thread Joshua D. Drake

> The current terminology of live and dead is already used in many places in 
> the 
> documentation and in userspace; mostly around the need for maintainance of 
> dead tuples within tables, reindex cleaning up dead pages, and even in the 
> vacuum commands output (n dead tuples cannot be removed yet). Given this 
> patch came from userland, istm people are comfortable enough with this 
> terminology there is no need to change it. 

+1

> 
-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate




---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] [PATCHES] Patch(es) to expose n_live_tuples and

2006-12-26 Thread Robert Treat
On Tuesday 26 December 2006 23:12, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <[EMAIL PROTECTED]> writes:
> > > Alvaro Herrera wrote:
> > >> I'm not really convinced that Bruce's proposed names seem any better
> > >> to me.  What's wrong with "dead" and "live"?
> > >
> > > In my mind, visible really means "visible to anyone", and expired means
> > > visible to no one.
> >
> > Um ... surely, visibility is in the eye of the beholder (no smiley).
> >
> > I don't have an immediate suggestion for better terminology, but IMHO
> > the whole point of visible/invisible terminology is that it depends on
> > who's looking.  Dead and live seem to convey a more appropriate air
> > of finality.
> >
> > "Expired" is OK as a synonym for "dead", but there is no thesaurus
> > anywhere in the world that will suggest it as an antonym for "visible".
>
> OK, so we need new terminology and we need it to be used consistenly in
> our documentation, whatever we choose.

The current terminology of live and dead is already used in many places in the 
documentation and in userspace; mostly around the need for maintainance of 
dead tuples within tables, reindex cleaning up dead pages, and even in the 
vacuum commands output (n dead tuples cannot be removed yet). Given this 
patch came from userland, istm people are comfortable enough with this 
terminology there is no need to change it. 

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] effective_cache_size vs units

2006-12-26 Thread Tom Lane
"Joshua D. Drake" <[EMAIL PROTECTED]> writes:
> On 12/19/06, Tom Lane <[EMAIL PROTECTED]> wrote:
>>> I think we should just accept the strings case-insensitively, too.

> Where we at on this?

Anyone against making it case-insensitive, speak now or hold your peace.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PATCHES] Patch(es) to expose n_live_tuples and

2006-12-26 Thread Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > Alvaro Herrera wrote:
> >> I'm not really convinced that Bruce's proposed names seem any better to
> >> me.  What's wrong with "dead" and "live"?
> 
> > In my mind, visible really means "visible to anyone", and expired means
> > visible to no one.
> 
> Um ... surely, visibility is in the eye of the beholder (no smiley).
> 
> I don't have an immediate suggestion for better terminology, but IMHO
> the whole point of visible/invisible terminology is that it depends on
> who's looking.  Dead and live seem to convey a more appropriate air
> of finality.
> 
> "Expired" is OK as a synonym for "dead", but there is no thesaurus
> anywhere in the world that will suggest it as an antonym for "visible".

OK, so we need new terminology and we need it to be used consistenly in
our documentation, whatever we choose.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Load distributed checkpoint

2006-12-26 Thread Bruce Momjian
ITAGAKI Takahiro wrote:
> 
> Bruce Momjian <[EMAIL PROTECTED]> wrote:
> 
> > I assume write() is not our checkpoint performance problem, but the
> > transfer to disk via fsync().  Perhaps a simple solution is to do the
> > write()'s of all dirty buffers as we do now at checkpoint time, but
> > delay 30 seconds and then do fsync() on all the files.
> 
> I think there are two platforms that have different problems in checkpoints.
> It's in fsync() on one platform, and in write() on another. It is complex
> depending on OS, the amount of memory, disks, writeback-cache and so on.
> 
> > I think the basic difference between this and the proposed patch is that
> > we do not put delays in the buffer write() or fsync() phases --- we just
> > put a delay _between_ the phases, and wait for the kernel to smooth it
> > out for us.  The kernel certainly knows more about what needs to get to
> > disk, so it seems logical to let it do the I/O smoothing.
> 
> Both proposals do not conflict each other. Also, solutions for either
> platform do not have bad effect on the other platform. Can we employ
> both of them?
> 
> I tested your proposal but it did not work on write-critical machine.
> However, if the idea works well on BSD or some platforms, we would be
> better off buying it.
> 
> [pgbench results]
> ...
> 566.973777
> 327.158222 <- (1) write()
> 560.773868 <- (2) sleep
> 544.106645 <- (3) fsync()

OK, so you are saying that performance dropped only during the write(),
and not during the fsync()?  Interesting.  I would like to know the
results of a few tests just like you reported them above:

1a) write spread out over 30 seconds
1b) write with no delay

2a) sleep(0)
2b) sleep(30)

 3) fsync

I would like to know the performance at each stage for each combination,
e.g. when using 1b, 2a, 3, performance during the write() phase was X,
during the sleep it was Y, and during the fsync it was Z. (Of course,
sleep(0) has no stage timing.)

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] [PATCHES] Patch(es) to expose n_live_tuples and

2006-12-26 Thread Tom Lane
Bruce Momjian <[EMAIL PROTECTED]> writes:
> Alvaro Herrera wrote:
>> I'm not really convinced that Bruce's proposed names seem any better to
>> me.  What's wrong with "dead" and "live"?

> In my mind, visible really means "visible to anyone", and expired means
> visible to no one.

Um ... surely, visibility is in the eye of the beholder (no smiley).

I don't have an immediate suggestion for better terminology, but IMHO
the whole point of visible/invisible terminology is that it depends on
who's looking.  Dead and live seem to convey a more appropriate air
of finality.

"Expired" is OK as a synonym for "dead", but there is no thesaurus
anywhere in the world that will suggest it as an antonym for "visible".

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Bitmap index thoughts

2006-12-26 Thread Gavin Sherry
Hey Heikki,

On Tue, 26 Dec 2006, Heikki Linnakangas wrote:

> I've been skimming through the bitmap index patch...
>
> A scan needs to access at least five pages:
>
> 1. B-tree index (root+others, depending on depth)
> 2. The auxiliary heap page
> 3. bitmap index meta page
> 4. LOV page
> 5. bitmap page
>
> That seems like a lot of indirection. A high startup cost is probably ok

You're right, this is excessive and it was something I'd hoped to have
ripped out, but...

> for typical bitmap index use cases and most of the needed pages should
> stay in memory, but could we simplify this? Why do we need the auxiliary
> heap, couldn't we just store the blk+offset of the LOV item directly in
> the b-tree index item?

The problem is, the b-tree code is very much tied to the heap. I don't
want to modify the b-tree code to make bitmap indexes work (better).
What's really tempting is to just manage our own balanced tree within the
bitmap index file(s) itself. It would start from the metapage and simply
spill to other 'special' index pages when necessary. The problem is, we do
not have b-tree code generic enough that it would allow us to do this
trivially -- consider concurrency and WAL in particular, which we
currently get for free. I guess this is why I've been ignoring this issue
:-).

> And instead of having separate LOV pages that store a number of LOV
> items, how about storing each LOV item on a page of it's own, and using
> the rest of the page to store the last chunk of the bitmap. That would
> eliminate one page access, but more importantly, maybe we could then get
> rid of all the bm_last_* attributes in BMLOVItemData that complicate the
> patch quite a bit, while preserving the performance.

That's an interesting approach. We would still need a concept of
last_word, at the very least, and probably last_comp_word for convenience.
Your idea doesn't require any extra space, either, which is good.
Something I've been working through is the idea of a 'bitmap data
segment'. Currently, we store the HRL compressed bitmap data to the extent
of the page. That is, sizeof(BMBitmap) is as close to BLCKSZ as we can
make it. The problem is that we may have some bitmaps where a few values
occur only a small number of times and are well clustered at the beginning
of the heap. In that circumstance, we use a whole page to store a small
number of words and the free space cannot be used by any other vector.
Now, say a segment was some fraction the size of BLCKSZ, we use less space
for those vectors with few tuples in the heap. Just an idea...

Thanks,

Gavin

PS: Another versio of the patch shall be forthcoming shortly. I've been
working on compressing the data in memory during CREATE INDEX instead of
just managing arrays of TIDs in memory as we did previously. The array of
TIDs works great for well clustered data but it stinks for poorly
clustered data as we approach maintenance_word_mem and have to swap a lot.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Win32 WEXITSTATUS too simplistic

2006-12-26 Thread ITAGAKI Takahiro

Tom Lane <[EMAIL PROTECTED]> wrote:

>   server process exited with exit code -1073741819
> from what I suspect is really the equivalent of a SIGSEGV trap,
> ie, attempted access to already-deallocated memory.  My calculator
> says the above is equivalent to hex C005, and I say that this
> makes it pretty clear that *some* parts of Windows put flag bits into
> the process exit code.  Anyone want to run down what we should really
> be using instead of the above macros?

C005 equals to EXCEPTION_ACCESS_VIOLATION. The value returned by
GetExceptionCode() seems to be the exit code in unhandeled exception cases.

AFAICS, all EXCEPTION_xxx (or STATUS_xxx) values are defined as 0xCxxx.
I think we can use the second high bit to distinguish exit by exception
from normal exits.

#define WEXITSTATUS(w)  ((int) ((w) & 0x4000))
#define WIFEXITED(w)((w) & 0x4000) == 0)
#define WIFSIGNALED(w)  ((w) & 0x4000) != 0)
#define WTERMSIG(w) (w) // or ((w) & 0x3FFF)

However, it comes from reverse engineering of the headers of Windows.
I cannot find any official documentation.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[HACKERS] SPAR Simple PostgreSQL AddOn Replication System

2006-12-26 Thread org
Couldnt find a replication system that worked and did what I wanted, so I made 
one.

If you would like to give my humble creation a try...

http://spar.orgfree.com/index.html

Its working for me oh yes... its free... naturally :)

Regards
Johnny

[HACKERS] SPAR Simple PostgreSQL AddOn Replication System

2006-12-26 Thread org
Couldnt find a replication system that worked and did what I wanted, so I made 
one.

If you would like to give my humble creation a try...

http://spar.orgfree.com/index.html
Regards
Johnny

Re: [HACKERS] effective_cache_size vs units

2006-12-26 Thread Joshua D. Drake
On Tue, 2006-12-19 at 22:06 -0800, Steve Atkins wrote:
> On Dec 19, 2006, at 9:50 PM, Jonah H. Harris wrote:
> 
> > On 12/19/06, Tom Lane <[EMAIL PROTECTED]> wrote:
> >> I think we should just accept the strings case-insensitively, too.
> >
> > While acknowledging Peter's pedantically-correct points, I say +1 for
> > ease of use.
> 
> +1. I spend some time walking people through tuning issues
> by phone or IM. Anything that complicates supporting users or
> frustrates users for no actual benefit is a bad thing.
> 
> (And this is unrelated to any theoretical units-aware data type -
> we might well be interested in milliwatts and megawatts in a
> datatype,  but in the configuration file we're unlikely to ever
> need to configure things in units of millibits).

Where we at on this?


Joshua D. Drake


> 
> Cheers,
>Steve
> 
> 
> ---(end of broadcast)---
> TIP 4: Have you searched our list archives?
> 
>http://archives.postgresql.org
> 
-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate




---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [PATCHES] Patch(es) to expose n_live_tuples and

2006-12-26 Thread Tom Lane
"Joshua D. Drake" <[EMAIL PROTECTED]> writes:
> On Tue, 2006-12-26 at 13:59 -0800, Glen Parker wrote:
>> I'd love to see this back patched into 8.2.1 if possible.

> Probably not. We typically do not introduce new features into back
> releases.

And since this one would require an initdb, there is exactly zero chance
of it being back-patched ...

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Rare corruption of pg_class index

2006-12-26 Thread Tom Lane
"Greg Sabino Mullane" <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Does the mentioned OID actually correspond to the OID of the table it's
>> supposed to be opening, or is it wrong?  Is anything being done to
>> the table schema in parallel?

> Yes, it is the correct OID. No, nothing done to the schema in parallel,
> although there is a process that disables/re-enables triggers and rules
> on that table via pg_class tweaking (inside a txn, of course).

Oh!  Duh, that's your issue right there, I'll bet.  The problem is that
relcache-open tries to read the pg_class row under SnapshotNow rules,
and if there is another xact concurrently modifying the row, it is
entirely possible for none of the row versions to be committed good at
the instant they are visited.  (The new row version either isn't seen at
all or isn't committed good yet when it's visited, and later when the
old row version is visited, it has become committed dead.)  This results
in ScanPgRelation failing (returning NULL) which leads to exactly the 
"could not open relation with OID xxx" symptom --- and in fact I see no
other code path that yields that failure.

As of 8.2 we have this problem fixed for system-initiated changes to the
pg_class row, but you're still going to be at risk if you are doing
manual "UPDATE pg_class" operations.  Can you get away from needing to
do that?  ALTER TABLE DISABLE TRIGGER might help, but we haven't got
anything like ALTER TABLE DISABLE RULE.  In any case the important point
is that you have to take AccessExclusive lock on a relation whose
pg_class row you would like to change, and you need to be on 8.2 because
prior releases weren't careful about obtaining lock *before* reading the
row.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Win32 WEXITSTATUS too simplistic

2006-12-26 Thread Andrew Dunstan
Tom Lane wrote:
> "Andrew Dunstan" <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> Anyone want to run down what we should really
>>> be using instead of the above macros?
>
>> The exit code is apparently what is reported from GetExitCodeProcess().
>> For info on that see
>> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/getexitcodeprocess.asp
>
>> I think we are possibly seeing the third case, i.e. the code from an
>> unhandled exception.  I haven't managed to find an API to handle them
>> though ...
>
> Right ... but I don't think we want to "handle the exception".  The
> right question to be asking is "what is the encoding of these 'exception
> values' it's talking about?"
>


Yes, sorry for my loose expression. That's what I meant - I didn't find an
API that would translate the exception values.

cheers

andrew


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Win32 WEXITSTATUS too simplistic

2006-12-26 Thread Tom Lane
"Andrew Dunstan" <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Anyone want to run down what we should really
>> be using instead of the above macros?

> The exit code is apparently what is reported from GetExitCodeProcess().
> For info on that see
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/getexitcodeprocess.asp

> I think we are possibly seeing the third case, i.e. the code from an
> unhandled exception.  I haven't managed to find an API to handle them
> though ...

Right ... but I don't think we want to "handle the exception".  The
right question to be asking is "what is the encoding of these 'exception
values' it's talking about?"

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[HACKERS] Bitmap index thoughts

2006-12-26 Thread Heikki Linnakangas

I've been skimming through the bitmap index patch...

A scan needs to access at least five pages:

1. B-tree index (root+others, depending on depth)
2. The auxiliary heap page
3. bitmap index meta page
4. LOV page
5. bitmap page

That seems like a lot of indirection. A high startup cost is probably ok 
for typical bitmap index use cases and most of the needed pages should 
stay in memory, but could we simplify this? Why do we need the auxiliary 
heap, couldn't we just store the blk+offset of the LOV item directly in 
the b-tree index item?


And instead of having separate LOV pages that store a number of LOV 
items, how about storing each LOV item on a page of it's own, and using 
the rest of the page to store the last chunk of the bitmap. That would 
eliminate one page access, but more importantly, maybe we could then get 
rid of all the bm_last_* attributes in BMLOVItemData that complicate the 
patch quite a bit, while preserving the performance.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Win32 WEXITSTATUS too simplistic

2006-12-26 Thread Andrew Dunstan
Tom Lane wrote:
> win32.h says
>
> /*
>  *Signal stuff
>  *WIN32 doesn't have wait(), so the return value for children
>  *is simply the return value specified by the child, without
>  *any additional information on whether the child terminated
>  *on its own or via a signal.  These macros are also used
>  *to interpret the return value of system().
>  */
> #define WEXITSTATUS(w)(w)
> #define WIFEXITED(w)  (true)
> #define WIFSIGNALED(w)(false)
> #define WTERMSIG(w)   (0)
>
> I think this supposition has been pretty much proven false by recent
> reports of silly "exit code" numbers from Win32 users, as for instance
> here
>   http://archives.postgresql.org/pgsql-bugs/2006-12/msg00163.php
> where the postmaster reports
>   server process exited with exit code -1073741819
> from what I suspect is really the equivalent of a SIGSEGV trap,
> ie, attempted access to already-deallocated memory.  My calculator
> says the above is equivalent to hex C005, and I say that this
> makes it pretty clear that *some* parts of Windows put flag bits into
> the process exit code.  Anyone want to run down what we should really
> be using instead of the above macros?
>

The exit code is apparently what is reported from GetExitCodeProcess().
For info on that see
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/getexitcodeprocess.asp

I think we are possibly seeing the third case, i.e. the code from an
unhandled exception.  I haven't managed to find an API to handle them
though ...

cheers

andrew




---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Possible documentation error

2006-12-26 Thread Martijn van Oosterhout
On Tue, Dec 26, 2006 at 12:49:55PM -0500, D'Arcy J.M. Cain wrote:
> On Tue, 26 Dec 2006 18:12:45 +0100
> Martijn van Oosterhout  wrote:
> > On Tue, Dec 26, 2006 at 12:04:40PM -0500, D'Arcy J.M. Cain wrote:
> > > Now it certainly seems to me that it should behave as described given
> > > the definition of VACUUM FULL so I am a little confused by my tests.
> > > My test table only has two entries in it.  Is that the issue?  In fact,
> > > I find the same behaviour if I do a simple VACUUM on the table.
> > 
> > On a table with two entries, VACUUM FULL is going to do nothing of
> > interest. Moving tuples within a page is useless, generally.
> 
> I thought that that might be the issue.  The docs should probably say
> "can" instead of "will" then.

The doumenttion is accurate as is. It says when "moved by VACUUM FULL".
In your case they wern't moved. If you change the word "will" to "can",
it will be wrong.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] Possible documentation error

2006-12-26 Thread D'Arcy J.M. Cain
On Tue, 26 Dec 2006 18:12:45 +0100
Martijn van Oosterhout  wrote:
> On Tue, Dec 26, 2006 at 12:04:40PM -0500, D'Arcy J.M. Cain wrote:
> > Now it certainly seems to me that it should behave as described given
> > the definition of VACUUM FULL so I am a little confused by my tests.
> > My test table only has two entries in it.  Is that the issue?  In fact,
> > I find the same behaviour if I do a simple VACUUM on the table.
> 
> On a table with two entries, VACUUM FULL is going to do nothing of
> interest. Moving tuples within a page is useless, generally.

I thought that that might be the issue.  The docs should probably say
"can" instead of "will" then.

-- 
D'Arcy J.M. Cain  |  Democracy is three wolves
http://www.druid.net/darcy/|  and a sheep voting on
+1 416 425 1212 (DoD#0082)(eNTP)   |  what's for dinner.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Possible documentation error

2006-12-26 Thread Michael Fuhr
On Tue, Dec 26, 2006 at 06:12:45PM +0100, Martijn van Oosterhout wrote:
> On Tue, Dec 26, 2006 at 12:04:40PM -0500, D'Arcy J.M. Cain wrote:
> > Now it certainly seems to me that it should behave as described given
> > the definition of VACUUM FULL so I am a little confused by my tests.
> > My test table only has two entries in it.  Is that the issue?  In fact,
> > I find the same behaviour if I do a simple VACUUM on the table.
> 
> On a table with two entries, VACUUM FULL is going to do nothing of
> interest. Moving tuples within a page is useless, generally.

A test on a larger table shows the behavior:

test=> CREATE TABLE foo (x integer);
CREATE TABLE
test=> INSERT INTO foo SELECT * FROM generate_series(1, 1000);
INSERT 0 1000
test=> DELETE FROM foo WHERE x BETWEEN 2 AND 999;
DELETE 998
test=> SELECT ctid, x FROM foo;
  ctid  |  x   
+--
 (0,1)  |1
 (4,92) | 1000
(2 rows)

test=> VACUUM foo;
VACUUM
test=> SELECT ctid, x FROM foo;
  ctid  |  x   
+--
 (0,1)  |1
 (4,92) | 1000
(2 rows)

test=> VACUUM FULL foo;
VACUUM
test=> SELECT ctid, x FROM foo;
 ctid  |  x   
---+--
 (0,1) |1
 (0,2) | 1000
(2 rows)

-- 
Michael Fuhr

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Possible documentation error

2006-12-26 Thread Martijn van Oosterhout
On Tue, Dec 26, 2006 at 12:04:40PM -0500, D'Arcy J.M. Cain wrote:
> I have been testing this statement and find that it seems not quite
> true. Although ctid changes on update, VACUUM FULL does not change it.
> What it does do is make lower areas available again so an update after a
> VACUUM FULL can reuse lower numbers rather than higher ones before.

A VACUUM FULL will try to compact a table. Thus if there's a lot of
free space at the beginning, it will move tuples near the end to the
beginning.

> Now it certainly seems to me that it should behave as described given
> the definition of VACUUM FULL so I am a little confused by my tests.
> My test table only has two entries in it.  Is that the issue?  In fact,
> I find the same behaviour if I do a simple VACUUM on the table.

On a table with two entries, VACUUM FULL is going to do nothing of
interest. Moving tuples within a page is useless, generally.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.


signature.asc
Description: Digital signature


[HACKERS] Possible documentation error

2006-12-26 Thread D'Arcy J.M. Cain
http://www.postgresql.org/docs/8.2/interactive/ddl-system-columns.html
has the following statement about ctid:

"The physical location of the row version within its table. Note that
although the ctid can be used to locate the row version very quickly, a
row's ctid will change each time it is updated or moved by VACUUM FULL.
Therefore ctid is useless as a long-term row identifier. The OID, or
even better a user-defined serial number, should be used to identify
logical rows."

I have been testing this statement and find that it seems not quite
true. Although ctid changes on update, VACUUM FULL does not change it.
What it does do is make lower areas available again so an update after a
VACUUM FULL can reuse lower numbers rather than higher ones before.

Now it certainly seems to me that it should behave as described given
the definition of VACUUM FULL so I am a little confused by my tests.
My test table only has two entries in it.  Is that the issue?  In fact,
I find the same behaviour if I do a simple VACUUM on the table.

-- 
D'Arcy J.M. Cain  |  Democracy is three wolves
http://www.druid.net/darcy/|  and a sheep voting on
+1 416 425 1212 (DoD#0082)(eNTP)   |  what's for dinner.

---(end of broadcast)---
TIP 6: explain analyze is your friend


[HACKERS] Recent SIGSEGV failures in buildfarm HEAD

2006-12-26 Thread Tom Lane
Several of the buildfarm machines are exhibiting repeatable signal 11
crashes in what seem perfectly ordinary queries.  This started about
four days ago so I suppose it's got something to do with my
operator-families patch :-( ... but I dunno what, and none of my own
machines show the failure.  Can someone provide a stack trace?

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[HACKERS] Win32 WEXITSTATUS too simplistic

2006-12-26 Thread Tom Lane
win32.h says

/*
 *  Signal stuff
 *  WIN32 doesn't have wait(), so the return value for children
 *  is simply the return value specified by the child, without
 *  any additional information on whether the child terminated
 *  on its own or via a signal.  These macros are also used
 *  to interpret the return value of system().
 */
#define WEXITSTATUS(w)  (w)
#define WIFEXITED(w)(true)
#define WIFSIGNALED(w)  (false)
#define WTERMSIG(w) (0)

I think this supposition has been pretty much proven false by recent
reports of silly "exit code" numbers from Win32 users, as for instance
here
http://archives.postgresql.org/pgsql-bugs/2006-12/msg00163.php
where the postmaster reports 
server process exited with exit code -1073741819
from what I suspect is really the equivalent of a SIGSEGV trap,
ie, attempted access to already-deallocated memory.  My calculator
says the above is equivalent to hex C005, and I say that this
makes it pretty clear that *some* parts of Windows put flag bits into
the process exit code.  Anyone want to run down what we should really
be using instead of the above macros?

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[HACKERS] TupleDescs and refcounts and such, again

2006-12-26 Thread Tom Lane
I looked into the bug reported by Jean-Pierre Pelletier here:
http://archives.postgresql.org/pgsql-bugs/2006-12/msg00163.php
The cause of the crash is a reference to an already-deallocated
TupleDesc.  The problem query has the structure of

SELECT sum(x) FROM (complicated-subselect) GROUP BY ...

which gets planned as HashAggregate atop a SubqueryScan, and the
reason for the crash is this coding in nodeAgg.c:

/* if first time through, initialize hashslot by cloning input slot */
if (hashslot->tts_tupleDescriptor == NULL)
{
ExecSetSlotDescriptor(hashslot, inputslot->tts_tupleDescriptor);

This means the upper query's tupletable contains a reference to the
result tuple descriptor of the subquery, which has been allocated in a
separate memory context (because the subquery has its own ExecutorState
and hence its own es_query_cxt).  During plan shutdown, the sub-query's
memory is freed before the upper query's tupletable is deallocated.
In an assert-enabled build this reliably causes a failure like "tupdesc
reference 401901f8 is not owned by resource owner Portal", because the
lower tupdesc has been overwritten by the memory-clobber code, and that
makes it look like it should be reference-counted.  (Pre-8.2 it
accidentally failed to malfunction because tupletable shutdown didn't
touch the referenced tupdescs at all.)

I can see a couple of possibilities for fixing this:

1. The most localized fix would be to introduce a CreateTupleDescCopy()
call into the above ExecSetSlotDescriptor() call.  But I have zero
confidence in this way because of the likelihood that there are similar
usages elsewhere.  Moreover a large part of the point of the tupdesc
refcounting changes was to avoid extra tupdesc-copying steps --- if we
have to copy tupdescs anywhere they might have come from a subplan, that
patch is a failure.

2. Somehow persuade the subplan to allocate its result tupdesc in the
upper query's query context.  Could probably be done with localized
copying in nodeSubqueryscan, but seems like a wart.

3. Rejigger CreateExecutorState so that a subquery does not have its own
es_query_cxt but shares the parent's context.  Then anything created at
query lifespan in the subquery lives just as long as stuff created in
the upper query.  AFAICS this wouldn't make any practical difference in
memory lifespan since subqueries are only destroyed when their parent is
... but it would solve this particular problem as well as any related
ones.

As you can probably guess, I'm leaning to #3, but wanted to see if
anyone had an objection or a better idea.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] pg_standby and build farm

2006-12-26 Thread Doug Knight
Hi all,
I'm new to the forums, so bear with me on my questions. I've set up an
auto-archive and auto-recover pair of databases using pg_standby, which
I'm prototyping various products for high availability. I've noticed
that when I attempt to fail from the primary archiver to the secondary
recovery db using the pg_standby trigger file, the secondary detects the
trigger file, flags that it couldn't read the current WAL file
pg_standby was waiting on, then attempts to read in the previous WAL
file. I use the -m option in pg_standby, so the previous WAL file no
longer exists, which causes the secondary postgres to "panic" on not
being able to open the previous WAL and terminates. Is there a way to
prevent the looking for the previous, or preserving the previous WAL
file so that when the trigger file is detected, the secondary will come
all the way up, completely its recovery mode? 

Thanks,
Doug Knight


Re: [HACKERS] Load distributed checkpoint

2006-12-26 Thread ITAGAKI Takahiro

Bruce Momjian <[EMAIL PROTECTED]> wrote:

> I assume write() is not our checkpoint performance problem, but the
> transfer to disk via fsync().  Perhaps a simple solution is to do the
> write()'s of all dirty buffers as we do now at checkpoint time, but
> delay 30 seconds and then do fsync() on all the files.

I think there are two platforms that have different problems in checkpoints.
It's in fsync() on one platform, and in write() on another. It is complex
depending on OS, the amount of memory, disks, writeback-cache and so on.

> I think the basic difference between this and the proposed patch is that
> we do not put delays in the buffer write() or fsync() phases --- we just
> put a delay _between_ the phases, and wait for the kernel to smooth it
> out for us.  The kernel certainly knows more about what needs to get to
> disk, so it seems logical to let it do the I/O smoothing.

Both proposals do not conflict each other. Also, solutions for either
platform do not have bad effect on the other platform. Can we employ
both of them?

I tested your proposal but it did not work on write-critical machine.
However, if the idea works well on BSD or some platforms, we would be
better off buying it.

[pgbench results]
...
566.973777
327.158222 <- (1) write()
560.773868 <- (2) sleep
544.106645 <- (3) fsync()
...

[changes in codes]
  (This is a bad implementation because shutdown takes long time!)
void
FlushBufferPool(void)
{
BufferSync();   // (1) write -- about 20s

time_t start = time(NULL);
while (time(NULL) - start < 30) // (2) sleep -- 30s
{
pg_usleep(BgWriterDelay * 1000L);
BgBufferSync();
AbsorbFsyncRequests();
}

smgrsync(); // (3) fsync -- less than 200ms
}

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match