Re: [HACKERS] Fixed length datatypes. WAS [GENERAL] UUID's as

2006-06-29 Thread Thomas Hallgren

Josh Berkus wrote:
> Jim,
>
>> I agree about splitting the utilities, except that I think the database
>> should be able to generate UUIDs somehow.
>
> There is a GUID add-in, and someone is working on a 2nd one.  UUIDs 
are not part of the SQL standard, and we've only seen sporadic demand 
for them (and different types each time) so I can't imagine one making 
it further than contrib real soon.

>
> Also, one could argue that UUIDs are a foot gun, so they're not 
exactly the type of thing we want to advocate in advance of demand.

>

Martijn van Oosterhout wrote:
> It seems to me that maybe the backend should include a 16-byte fixed
> length object (after all, we've got 1, 2, 4 and 8 bytes already) and
> then people can use that to build whatever they like, using domains,
> for example...
>
So how about the split? I.e. just add a 16 byte data type and forget all 
about UUID's for now.


Regards,
Thomas Hallgren



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Fixed length datatypes. WAS [GENERAL] UUID's as

2006-06-29 Thread Josh Berkus

Jim,


I agree about splitting the utilities, except that I think the database
should be able to generate UUIDs somehow.


There is a GUID add-in, and someone is working on a 2nd one.  UUIDs are 
not part of the SQL standard, and we've only seen sporadic demand for 
them (and different types each time) so I can't imagine one making it 
further than contrib real soon.


Also, one could argue that UUIDs are a foot gun, so they're not exactly 
the type of thing we want to advocate in advance of demand.


--Josh Berkus


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Index corruption

2006-06-29 Thread Tom Lane
Marc Munro <[EMAIL PROTECTED]> writes:
> I'll get back to you with kernel build information tomorrow.  We'll also
> try to talk to some kernel hackers about this.

Some googling turned up recent discussions about race conditions in
Linux NFS code:

http://threebit.net/mail-archive/linux-kernel/msg05313.html
http://lkml.org/lkml/2006/3/1/381

I don't know the kernel nearly well enough to guess if these are related
...

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: Longer startup delay (was Re: [HACKERS] Single Index Tuple Chain (SITC) method)

2006-06-29 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> Speaking of which, I think I've noticed a longer delay in server start
> after initdb.  I haven't measured nor profiled it, but I think it may be
> because of the heap_inplace_update xlogging that we weren't doing
> previously.

Can't say that I've noticed anything like that.  I have gotten a nagging
sense lately that there are strange delays happening at random times
during the regression tests.  Like you I'm not quite sure about this,
but I run "make installcheck" often enough to be fairly attuned to it,
and it just seems like it's sometimes hanging up for a couple of seconds
for no apparent reason.  OTOH we've stuck some new, slow queries into
the tests recently and maybe I just haven't acclimated to that.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Index corruption

2006-06-29 Thread Marc Munro
On Thu, 2006-06-29 at 21:59 -0400, Tom Lane wrote:
> [ back to the start of the thread... ]
> 
> BTW, a couple of thoughts here:
> 
> * If my theory about the low-level cause is correct, then reindexing
> sl_log_1 would make the "duplicate key" errors go away, but nonetheless
> you'd have lost data --- the overwritten rows would be gone.  I suppose
> that this would result in the slave missing some rows that are present
> on the master.  Have you tried comparing slave and master databases to
> see if you can find any discrepancies?

Haven't done that yet - in test we tend to restart the old subscriber as
the new provider and rebuild the cluster.  I'll check the logs from our
production failure to figure out what to compare and see what I can
discover. 

> * One way that the problem could happen would be if a race condition in
> the kernel allowed an lseek(fd, 0, SEEK_END) to return a value less than
> the true end-of-file (as determined by another process' write()
> extending the EOF just slightly earlier --- ie, lseek fails to note the
> effects of the just-completed write, and returns the prior EOF value).
> PG does have internal locking that should guarantee that the lseek is
> not done until after the write completes ... but could there be a bug in
> the kernel allowing stale data to be returned?  The SMP hardware is
> relevant (maybe one processor sees different data than the other) and
> frankly I don't trust NFS very far at all for questions such as this.
> It'd be interesting to see if you can reproduce the problem in a
> database on local storage.

Unfortunately we haven't got any local storage that can stand the sort
of loads we are putting through.  With slower storage the CPUs mostly
sit idle and we are very unlikely to trigger a timing-based bug if
that's what it is.

I'll get back to you with kernel build information tomorrow.  We'll also
try to talk to some kernel hackers about this.

Many thanks for your efforts so far.
__
Marc


signature.asc
Description: This is a digitally signed message part


Re: [HACKERS] Index corruption

2006-06-29 Thread Tom Lane
Marc Munro <[EMAIL PROTECTED]> writes:
> By dike out, you mean remove?  Please confirm and I'll try it.

Right, just remove (or comment out) the lines I quoted.

> We ran this system happily for nearly a year on the
> previous kernel without experiencing this problem (tcp lockups are a
> different matter).

I'm starting to think that a kernel bug is a realistic explanation,
see other message.  Whose kernel build is this exactly?

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Index corruption

2006-06-29 Thread Marc Munro
On Thu, 2006-06-29 at 21:47 -0400, Tom Lane wrote:
> One easy thing that would be worth trying is to build with
> --enable-cassert and see if any Asserts get provoked during the
> failure case.  I don't have a lot of hope for that, but it's
> something that would require only machine time not people time.

I'll try this tomorrow.

> A couple other things to try, given that you can provoke the failure
> fairly easily:
> 
> 1. In studying the code, it bothers me a bit that P_NEW is the same as
> InvalidBlockNumber.  The intended uses of P_NEW appear to be adequately
> interlocked, but it's fairly easy to see how something like this could
> happen if there are any places where InvalidBlockNumber is
> unintentionally passed to ReadBuffer --- that would look like a P_NEW
> call and it *wouldn't* be interlocked.  So it would be worth changing
> P_NEW to "(-2)" (this should just take a change in bufmgr.h and
> recompile) and adding an "Assert(blockNum != InvalidBlockNumber)"
> at the head of ReadBufferInternal().  Then rebuild with asserts enabled
> and see if the failure case provokes that assert.

I'll try this too.

> 2. I'm also eyeing this bit of code in hio.c:
> 
> /*
>  * If the FSM knows nothing of the rel, try the last page before
>  * we give up and extend.  This avoids one-tuple-per-page syndrome
>  * during bootstrapping or in a recently-started system.
>  */
> if (targetBlock == InvalidBlockNumber)
> {
> BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
> 
> if (nblocks > 0)
> targetBlock = nblocks - 1;
> }
> 
> If someone else has just extended the relation, it's possible that this
> will allow a process to get to the page before the intended extender has
> finished initializing it.  AFAICT that's not harmful because the page
> will look like it has no free space ... but it seems a bit fragile.
> If you dike out the above-mentioned code, can you still provoke the
> failure?

By dike out, you mean remove?  Please confirm and I'll try it.

> A different line of attack is to see if you can make a self-contained
> test case so other people can try to reproduce it.  More eyeballs on the
> problem are always better.

Can't really see this being possible.  This is clearly a very unusual
problem and without similar hardware I doubt that anyone else will
trigger it.  We ran this system happily for nearly a year on the
previous kernel without experiencing this problem (tcp lockups are a
different matter).  Also the load is provided by a bunch of servers and
robots simulating rising and falling load.

> Lastly, it might be interesting to look at the WAL logs for the period
> leading up to a failure.  This would give us an idea of what was
> happening concurrently with the processes that seem directly involved.

Next time we reproduce it, I'll take a copy of the WAL files too.

__
Marc


signature.asc
Description: This is a digitally signed message part


Re: [HACKERS] Index corruption

2006-06-29 Thread Tom Lane
[ back to the start of the thread... ]

Marc Munro <[EMAIL PROTECTED]> writes:
> We have now experienced index corruption on two separate but identical
> slony clusters.  In each case the slony subscriber failed after
> attempting to insert a duplicate record.  In each case reindexing the
> sl_log_1 table on the provider fixed the problem.

> Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
> Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
> ...
> NetApp FAS270 OnTap 7.0.3
> Mounted with the NFS options
> rw,nfsvers=3D3,hard,rsize=3D32768,wsize=3D32768,timeo=3D600,tcp,noac
> Jumbo frames 8192 MTU.
> All postgres data and logs are stored on the netapp.

BTW, a couple of thoughts here:

* If my theory about the low-level cause is correct, then reindexing
sl_log_1 would make the "duplicate key" errors go away, but nonetheless
you'd have lost data --- the overwritten rows would be gone.  I suppose
that this would result in the slave missing some rows that are present
on the master.  Have you tried comparing slave and master databases to
see if you can find any discrepancies?

* One way that the problem could happen would be if a race condition in
the kernel allowed an lseek(fd, 0, SEEK_END) to return a value less than
the true end-of-file (as determined by another process' write()
extending the EOF just slightly earlier --- ie, lseek fails to note the
effects of the just-completed write, and returns the prior EOF value).
PG does have internal locking that should guarantee that the lseek is
not done until after the write completes ... but could there be a bug in
the kernel allowing stale data to be returned?  The SMP hardware is
relevant (maybe one processor sees different data than the other) and
frankly I don't trust NFS very far at all for questions such as this.
It'd be interesting to see if you can reproduce the problem in a
database on local storage.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Index corruption

2006-06-29 Thread Tom Lane
Marc Munro <[EMAIL PROTECTED]> writes:
> If there's anything we can do to help debug this we will.  We can apply
> patches, different build options, etc.

One easy thing that would be worth trying is to build with
--enable-cassert and see if any Asserts get provoked during the
failure case.  I don't have a lot of hope for that, but it's
something that would require only machine time not people time.

A couple other things to try, given that you can provoke the failure
fairly easily:

1. In studying the code, it bothers me a bit that P_NEW is the same as
InvalidBlockNumber.  The intended uses of P_NEW appear to be adequately
interlocked, but it's fairly easy to see how something like this could
happen if there are any places where InvalidBlockNumber is
unintentionally passed to ReadBuffer --- that would look like a P_NEW
call and it *wouldn't* be interlocked.  So it would be worth changing
P_NEW to "(-2)" (this should just take a change in bufmgr.h and
recompile) and adding an "Assert(blockNum != InvalidBlockNumber)"
at the head of ReadBufferInternal().  Then rebuild with asserts enabled
and see if the failure case provokes that assert.

2. I'm also eyeing this bit of code in hio.c:

/*
 * If the FSM knows nothing of the rel, try the last page before
 * we give up and extend.  This avoids one-tuple-per-page syndrome
 * during bootstrapping or in a recently-started system.
 */
if (targetBlock == InvalidBlockNumber)
{
BlockNumber nblocks = RelationGetNumberOfBlocks(relation);

if (nblocks > 0)
targetBlock = nblocks - 1;
}

If someone else has just extended the relation, it's possible that this
will allow a process to get to the page before the intended extender has
finished initializing it.  AFAICT that's not harmful because the page
will look like it has no free space ... but it seems a bit fragile.
If you dike out the above-mentioned code, can you still provoke the
failure?

A different line of attack is to see if you can make a self-contained
test case so other people can try to reproduce it.  More eyeballs on the
problem are always better.

Lastly, it might be interesting to look at the WAL logs for the period
leading up to a failure.  This would give us an idea of what was
happening concurrently with the processes that seem directly involved.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] index corruption

2006-06-29 Thread Marc Munro
On Thu, 2006-06-29 at 19:59 -0400, Tom Lane wrote:

> Ummm ... you did restart the server?  "select version();" would be
> the definitive test.

Can't say I blame you for the skepticism but I have confirmed it again.

test=# select version();
 version 
-
 PostgreSQL 8.0.8 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC)
3.4.6 20060404 (Red Hat 3.4.6-3)
(1 row)

__
Marc



signature.asc
Description: This is a digitally signed message part


[HACKERS] [Re: Index corruption]

2006-06-29 Thread Marc Munro
Ooops: forgot to cc this to the list.

On Thu, 2006-06-29 at 19:27 -0400, Tom Lane wrote:
> Are you *certain* this slave isn't running 8.0.2 or older?  If you can
> verify that, then I guess we need to look for another mechanism that
> could cause the same kind of thing.

Certain.  We built new rpms for 8.0.8 and installed them on both
database servers (and I just checked them again to make sure).

If there's anything we can do to help debug this we will.  We can apply
patches, different build options, etc.

__
Marc



signature.asc
Description: This is a digitally signed message part


Re: [HACKERS] Index corruption

2006-06-29 Thread Tom Lane
I wrote:
> What I speculate right at the moment is that we are not looking at index
> corruption at all, but at heap corruption: somehow, the first insertion
> into ctid (27806,2) got lost and the same ctid got re-used for the next
> inserted row.  We fixed one bug like this before ...

Further study makes this look more probable.  It seems that most of the
activity around the trouble spot consists of transactions inserting
exactly 13 rows into the table; for instance here are traces of two
successive transactions:

Block   ItemXMINCMIN

27800   7   600921860   3
27800   8   600921860   7
27800   9   600921860   11
27800   10  600921860   15
27800   11  600921860   19
27800   12  600921860   23
27800   13  600921860   27
27800   14  600921860   31
27800   15  600921860   35
27800   16  600921860   39
27800   17  600921860   43
27800   18  600921860   47
27800   19  600921860   51

27800   20  600921870   3
27800   21  600921870   7
27800   22  600921870   11
27800   23  600921870   15
27800   24  600921870   19
27800   25  600921870   23
27800   26  600921870   27
27800   27  600921870   31
27800   28  600921870   35
27800   29  600921870   39
27800   30  600921870   43
27800   31  600921870   47
27800   32  600921870   51

The pattern of CMIN values is the same for all these transactions.
But look at the rows inserted by 600921856 and 600921858, the two
transactions that seem to be involved with the problem on page
27086:

27787   50  600921856   3
27795   41  600921858   3
27795   42  600921858   7
27795   43  600921858   11
27795   44  600921858   15
27795   45  600921858   19
27795   46  600921858   23
27795   47  600921858   27
27795   48  600921858   31
27795   49  600921858   35
27795   50  600921858   39
27795   51  600921858   43
27806   1   600921858   47
27806   2   600921856   15
27806   3   600921856   19
27806   4   600921856   23
27806   5   600921856   27
27806   6   600921856   31
27806   7   600921856   35
27806   8   600921856   39
27806   9   600921856   43
27806   10  600921856   47
27806   11  600921856   51
27806   12  600921858   51

(27787,50) and (27795,51) are both the last rows on their pages.  What's
evidently happened is that the two transactions filled those pages and
then both seized on 27806 as the next page to insert into.  I think that
600921856 tried to insert its CMIN 7 and 11 rows as items 1 and 2 on
that page, and then something wiped the page, then 600921858 inserted
its CMIN 47 row as item 1, and then 600921856 got control back and
finished inserting its rows.  Further study of the indexes shows that
there are two entries in each index pointing to each of (27806,1) and
(27806,2) --- but since the xxid values are different for the two
(27806,1) entries, those didn't show up as duplicates in my first look.

Given the apparent connection to vacuuming, this is looking a WHOLE
lot like this bug fixed in 8.0.3:

2005-05-07 17:32  tgl

* src/backend/: access/heap/hio.c, commands/vacuumlazy.c
(REL7_3_STABLE), access/heap/hio.c, access/nbtree/nbtpage.c,
access/nbtree/nbtree.c, commands/vacuumlazy.c (REL7_4_STABLE),
access/heap/hio.c, commands/vacuumlazy.c (REL7_2_STABLE),
access/heap/hio.c, access/nbtree/nbtpage.c, access/nbtree/nbtree.c,
commands/vacuumlazy.c (REL8_0_STABLE), access/heap/hio.c,
access/nbtree/nbtpage.c, access/nbtree/nbtree.c,
commands/vacuumlazy.c: Repair very-low-probability race condition
between relation extension and VACUUM: in the interval between
adding a new page to the relation and formatting it, it was
possible for VACUUM to come along and decide it should format the
page too.  Though not harmful in itself, this would cause data loss
if a third transaction were able to insert tuples into the vacuumed
page before the original extender got control back.

Are you *certain* this slave isn't running 8.0.2 or older?  If you can
verify that, then I guess we need to look for another mechanism that
could cause the same kind of thing.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Index corruption

2006-06-29 Thread Marc Munro
On Fri, 2006-06-30 at 00:37 +0300, Hannu Krosing wrote:
> Marc: do you have triggers on some replicated tables ?
> 
We have a non-slony trigger on only 2 tables, neither of them involved
in this transaction.  We certainly have no circular trigger structures.

> I remember having some corruption in a database with weird circular
> trigger structures, some of them being slony log triggers. 
> 
> The thing that seemed to mess up something inside there, was when change
> on parent rownt fired a trigger that changes child table rows and there
> rows fired another trigger that changed the same parent row again.
> 

__
Marc


signature.asc
Description: This is a digitally signed message part


Re: [HACKERS] Index corruption

2006-06-29 Thread Hannu Krosing
Ühel kenal päeval, N, 2006-06-29 kell 17:23, kirjutas Tom Lane:
> Marc Munro <[EMAIL PROTECTED]> writes:
> > Tom,
> > we have a newer and much smaller (35M) file showing the same thing:
> 
> Thanks.  Looking into this, what I find is that *both* indexes have
> duplicated entries for the same heap tuple:
> 
...
> However, the two entries in idx1 contain different data!!
> 
> What I speculate right at the moment is that we are not looking at index
> corruption at all, but at heap corruption: somehow, the first insertion
> into ctid (27806,2) got lost and the same ctid got re-used for the next
> inserted row.  We fixed one bug like this before ...

Marc: do you have triggers on some replicated tables ?

I remember having some corruption in a database with weird circular
trigger structures, some of them being slony log triggers. 

The thing that seemed to mess up something inside there, was when change
on parent rownt fired a trigger that changes child table rows and there
rows fired another trigger that changed the same parent row again.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Index corruption

2006-06-29 Thread Hannu Krosing
Ühel kenal päeval, N, 2006-06-29 kell 16:42, kirjutas Chris Browne:
> [EMAIL PROTECTED] (Marc Munro) writes:
> > As you see, slony is attempting to enter one tuple
> > ('374520943','22007','0') two times.
> >
> > Each previous time we have had this problem, rebuilding the indexes on
> > slony log table (sl_log_1) has fixed the problem.  I have not reindexed
> > the table this time as I do not want to destroy any usable evidence.
> 
> We have seen this phenomenon on 7.4.8 several times; pulled dumps of
> sl_log_1 and index files that Jan Wieck looked at, which alas hasn't
> led to a fix.
> 
> He did, mind you, find some concurrency pattern that led, if memory
> serves, to 7.4.12's release.  We had experienced cases where there was
> some worse corruption that required that we rebuild replicas from
> scratch :-(.

How well did you check the C-language triggers and special slony
functions for possibly corrupting some backend/shared-mem structures ?

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Index corruption

2006-06-29 Thread Tom Lane
Marc Munro <[EMAIL PROTECTED]> writes:
> Tom,
> we have a newer and much smaller (35M) file showing the same thing:

Thanks.  Looking into this, what I find is that *both* indexes have
duplicated entries for the same heap tuple:

idx1:

 Item 190 -- Length:   24  Offset: 3616 (0x0e20)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 24
  Has Nulls: 0  Has Varwidths: 0

  0e20: 9e6c 02001800 0100 0057d123  ...l.W.#
  0e30: 1a781200 .x..

 Item 191 -- Length:   24  Offset: 3592 (0x0e08)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 24
  Has Nulls: 0  Has Varwidths: 0

  0e08: 9e6c 02001800 0100 0057d123  ...l.W.#
  0e18: 2e781200 .x..

idx2:

 Item 127 -- Length:   16  Offset: 6144 (0x1800)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 16
  Has Nulls: 0  Has Varwidths: 0

  1800: 9e6c 02001000 0057d123   ...l.W.#

 Item 128 -- Length:   16  Offset: 6128 (0x17f0)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 16
  Has Nulls: 0  Has Varwidths: 0

  17f0: 9e6c 02001000 0057d123   ...l.W.#

However, the two entries in idx1 contain different data!!

What I speculate right at the moment is that we are not looking at index
corruption at all, but at heap corruption: somehow, the first insertion
into ctid (27806,2) got lost and the same ctid got re-used for the next
inserted row.  We fixed one bug like this before ...

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Index corruption

2006-06-29 Thread Chris Browne
[EMAIL PROTECTED] (Marc Munro) writes:
> As you see, slony is attempting to enter one tuple
> ('374520943','22007','0') two times.
>
> Each previous time we have had this problem, rebuilding the indexes on
> slony log table (sl_log_1) has fixed the problem.  I have not reindexed
> the table this time as I do not want to destroy any usable evidence.

We have seen this phenomenon on 7.4.8 several times; pulled dumps of
sl_log_1 and index files that Jan Wieck looked at, which alas hasn't
led to a fix.

He did, mind you, find some concurrency pattern that led, if memory
serves, to 7.4.12's release.  We had experienced cases where there was
some worse corruption that required that we rebuild replicas from
scratch :-(.
-- 
output = reverse("gro.mca" "@" "enworbbc")
http://cbbrowne.com/info/advocacy.html
"There  is no  psychiatrist in  the world  like a  puppy  licking your
face."  -- Ben Williams

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Some questions to developers

2006-06-29 Thread Jim C. Nasby
Adding -advocacy

On Thu, Jun 29, 2006 at 04:48:01PM +0400, Nikolay Samokhvalov wrote:
> Recently an interview with Marten Mickos (mysql's ceo) appeared in
> russian IT news (eg http://citcity.ru/12776/). I'd like ask some
> [similar] questions to PostgreSQL core developers and write an article
> in Russian. Actually, I'm sure that there is a great lack of PG news
> in our mass media. Moreover, it seems that Anniversary is a good
> opportunity to rise a wave of interest.

Would you be willing to also publish in English?

I should be able to get support from Pervasive for this, too. If
developers would rather do a phone interview rather than an email
interview, for example. We can also help with editing, as well as
producing collatoral for trade shows and what-not.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Index corruption

2006-06-29 Thread Marc Munro
We have reproduced the problem again.  This time it looks like vacuum is
not part of the picture.  From the provider's log:

2006-06-29 14:02:41 CST DEBUG2 cleanupThread:  101.057 seconds for vacuuming

And from the subscriber's:

2006-06-29 13:00:43 PDT ERROR  remoteWorkerThread_1: "insert into 
"public"."table_trans_attribute" (table_transaction_id,attribute_type,value) 
values ('374740387','22008','4000');

If my maths is correct and the logs are honest, the vacuum would have
started at 14:01:00 CST (13:01:PDT), about 20 seconds after we first
encounter the problem.  The clocks on the two machines, though in
different timezones, are currently synced.

Tom, I will create another tarball of the sl_log_1 table and indexes.
Should be quite a bit smaller than the previous one.

__
Marc


signature.asc
Description: This is a digitally signed message part


Re: [HACKERS] Single Index Tuple Chain (SITC) method

2006-06-29 Thread Bruce Momjian
Martijn van Oosterhout wrote:
> > You can't truncate a tuple to just the header, or at least it's not
> > going to be very useful to do it, unless you can also move other tuples
> > to coalesce the free space on the page.  Which means you need a
> > VACUUM-strength page lock.  If you're trying to do this in foreground
> > queries, you get into the same performance and deadlock issues I already
> > mentioned.  And I think the net-increase-in-WAL-traffic point would
> > apply too, since VACUUM will still need to clean the page when it
> > removes the header.
> 
> Well, I was only thinking of having the bgwriter do it in the
> background, just bfore writing the block to disk. I'm hoping that it
> only tries to write out pages not recently used, so hopefully there
> would be very little contention there.
> 
> And perhaps you can avoid the xlogging for the same reason as I
> suggested above.

Non-visible members of the SITC chains could also removed by the
background writer.  Adding to the chain does not require a vacuum-level
lock; only reusing the space requires it.

The thing that has always held us back from more aggressive tuple reuse
is the inability of vacuum to do small cleanups --- it has to scan
indexes so it wants to do many pages at once.  If you have SITC, there
are new possibilities for tuple reuse.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Single Index Tuple Chain (SITC) method

2006-06-29 Thread Bruce Momjian
Martijn van Oosterhout wrote:
> One thing I am confused about, currently the ctid chain follows tuple
> history so that transactions can find the latest version of any tuple,
> even if the key fields have changed. This proposal breaks that, I'm not
> sure how important that is though.

No, SITC doesn't break the UPDATE chain, it merely doesn't set the
SITC_NOT_TAIL bit on the tuple, so an index scan knows that is the last
tuple for that index entry.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Single Index Tuple Chain (SITC) method

2006-06-29 Thread Martijn van Oosterhout
On Thu, Jun 29, 2006 at 12:35:12PM -0400, Tom Lane wrote:
> Another issue is that this would replace a simple hint-bit setting with
> an index change that requires a WAL entry.  There'll be more WAL traffic
> altogether from backends retail-deleting index tuples than there would
> be from VACUUM cleaning the whole page at once --- and it won't cut the
> I/O demand from VACUUM any, either, since VACUUM still has to scan the
> index.  AFAICS this wouldn't make VACUUM either cheaper or less
> necessary, so I'm not sure I see the point.

Ok, I'm going to suggest something that's either radical, or very dumb:
does this truncation really need to be xlogged? After all, just like
hint bits, nothing is being done that won't be recovered from later if
it doesn't happen. The only possible thing I can imagine is that during
xlog replay you end up trying to add more tuples than appear to fit.
But if you have a vacuum page procedure you could call it again to try
to compress it down.

Then again, perhaps visibility checks are not safe within xlog replay
state.

I'm hoping that overall disk traffic reduces because total disk space
used by tables/indexes reduces.

> You can't truncate a tuple to just the header, or at least it's not
> going to be very useful to do it, unless you can also move other tuples
> to coalesce the free space on the page.  Which means you need a
> VACUUM-strength page lock.  If you're trying to do this in foreground
> queries, you get into the same performance and deadlock issues I already
> mentioned.  And I think the net-increase-in-WAL-traffic point would
> apply too, since VACUUM will still need to clean the page when it
> removes the header.

Well, I was only thinking of having the bgwriter do it in the
background, just bfore writing the block to disk. I'm hoping that it
only tries to write out pages not recently used, so hopefully there
would be very little contention there.

And perhaps you can avoid the xlogging for the same reason as I
suggested above.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.


signature.asc
Description: Digital signature


Longer startup delay (was Re: [HACKERS] Single Index Tuple Chain (SITC) method)

2006-06-29 Thread Alvaro Herrera
Tom Lane wrote:

> Another issue is that this would replace a simple hint-bit setting with
> an index change that requires a WAL entry.  There'll be more WAL traffic
> altogether from backends retail-deleting index tuples than there would
> be from VACUUM cleaning the whole page at once

Speaking of which, I think I've noticed a longer delay in server start
after initdb.  I haven't measured nor profiled it, but I think it may be
because of the heap_inplace_update xlogging that we weren't doing
previously.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Single Index Tuple Chain (SITC) method

2006-06-29 Thread Hannu Krosing
Ühel kenal päeval, N, 2006-06-29 kell 12:35, kirjutas Tom Lane:
> Martijn van Oosterhout  writes:
> >> Tom - what do you think of the other related idea, that of reusing dead
> >> index entries ?
> 
> Possibly workable for btree now that we do page-at-a-time index scans;
> however I'm pretty hesitant to build any large infrastructure atop that
> change until we've got more performance results.  We might yet end up
> reverting it.
> 
> Another issue is that this would replace a simple hint-bit setting with
> an index change that requires a WAL entry.  There'll be more WAL traffic
> altogether from backends retail-deleting index tuples than there would
> be from VACUUM cleaning the whole page at once --- and it won't cut the
> I/O demand from VACUUM any, either, since VACUUM still has to scan the
> index.  AFAICS this wouldn't make VACUUM either cheaper or less
> necessary, so I'm not sure I see the point.

How can it generate more traffic ? 

When you replace a dead index entry with a live one, you just reuse
space - you would have to WAL log the index in both cases (adding a new
entry or replacing dead entry)

Espacially in the case, where you replace an index entryu with the same
value.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [GENERAL] UUID's as primary keys

2006-06-29 Thread Martijn van Oosterhout
On Thu, Jun 29, 2006 at 06:40:13PM +0200, Thomas Hallgren wrote:
> Martijn van Oosterhout wrote:
> >On Thu, Jun 29, 2006 at 03:54:36PM +0200, Thomas Hallgren wrote:
> >  
> >>I have to concur with this. Assume you use a bytea for a UUID that in 
> >>turn is used as a primary key. The extra overhead will be reflected in 
> >>all indexes, all foreign keys, etc. In a normalized database some tables 
> >>may consist of UUID columns only.
> >>
> >
> >So you create a UUID type. It's cheap enough to create new types after
> >all, that's one of postgresql's strengths.
> It would be a whole lot easier if I could use a domain.

It seems to me that maybe the backend should include a 16-byte fixed
length object (after all, we've got 1, 2, 4 and 8 bytes already) and
then people can use that to build whatever they like, using domains,
for example...

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] Index corruption

2006-06-29 Thread Tom Lane
Marc Munro <[EMAIL PROTECTED]> writes:
> As you see, slony is attempting to enter one tuple
> ('374520943','22007','0') two times.
> Each previous time we have had this problem, rebuilding the indexes on
> slony log table (sl_log_1) has fixed the problem.  I have not reindexed
> the table this time as I do not want to destroy any usable evidence.

Good.  How big is the log table --- would it be possible for you to send
me the physical table and index files?  (Not a pg_dump, the actual disk
files)

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [GENERAL] UUID's as primary keys

2006-06-29 Thread Thomas Hallgren

Martijn van Oosterhout wrote:

On Thu, Jun 29, 2006 at 03:54:36PM +0200, Thomas Hallgren wrote:
  
I have to concur with this. Assume you use a bytea for a UUID that in 
turn is used as a primary key. The extra overhead will be reflected in 
all indexes, all foreign keys, etc. In a normalized database some tables 
may consist of UUID columns only.



So you create a UUID type. It's cheap enough to create new types after
all, that's one of postgresql's strengths.

It would be a whole lot easier if I could use a domain.


 What I'm saying is that it's
easier to create new fixed length types for the cases that need it,
than it is to redo the entire type handling of the backend.

  
Of course. But it's a matter of who does what. Your reasoning push the 
burden to the users.


Regards,
Thomas Hallgren


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] Single Index Tuple Chain (SITC) method

2006-06-29 Thread Tom Lane
Martijn van Oosterhout  writes:
>> Tom - what do you think of the other related idea, that of reusing dead
>> index entries ?

Possibly workable for btree now that we do page-at-a-time index scans;
however I'm pretty hesitant to build any large infrastructure atop that
change until we've got more performance results.  We might yet end up
reverting it.

Another issue is that this would replace a simple hint-bit setting with
an index change that requires a WAL entry.  There'll be more WAL traffic
altogether from backends retail-deleting index tuples than there would
be from VACUUM cleaning the whole page at once --- and it won't cut the
I/O demand from VACUUM any, either, since VACUUM still has to scan the
index.  AFAICS this wouldn't make VACUUM either cheaper or less
necessary, so I'm not sure I see the point.

> I'd like to know about this too, including ideas about truncating
> tuples to just the header.

You can't truncate a tuple to just the header, or at least it's not
going to be very useful to do it, unless you can also move other tuples
to coalesce the free space on the page.  Which means you need a
VACUUM-strength page lock.  If you're trying to do this in foreground
queries, you get into the same performance and deadlock issues I already
mentioned.  And I think the net-increase-in-WAL-traffic point would
apply too, since VACUUM will still need to clean the page when it
removes the header.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Fixed length datatypes. WAS [GENERAL] UUID's as

2006-06-29 Thread Jim C. Nasby
On Thu, Jun 29, 2006 at 09:12:32AM +0200, Thomas Hallgren wrote:
> The split make sense since clients often have powerful UUID utilities handy 
> and hence have limited or no use for such utilities in the database (true 
> for all .NET and Java clients). Some PL's will also enable such packages 
> out of the box.

I agree about splitting the utilities, except that I think the database
should be able to generate UUIDs somehow.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Index corruption

2006-06-29 Thread Marc Munro
On Thu, 2006-06-29 at 12:11 -0400, Tom Lane wrote:
> OK, so it's not an already-known problem.
> 
> > We were able to corrupt the index within 90 minutes of starting up our
> > cluster.  A slony-induced vacuum was under way on the provider at the
> > time the subscriber failed.
> 
> You still haven't given any details.  What do you mean by "corrupt the
> index" --- what actual symptoms are you seeing?

We have a two node slony cluster with load tests running againt the
provider node.

After resyncing the subscriber, and running load tests for about an
hour, the slony subscriber begins to fail.  From the log file:

2006-06-28 17:58:43 PDT ERROR  remoteWorkerThread_1: "insert into 
"public"."table_trans_attribute" (table_transaction_id,attribute_type,value) 
values ('374520943','22001','0');
insert into "public"."table_trans_attribute" 
(table_transaction_id,attribute_type,value) values ('374520943','22002','0');
insert into "public"."table_trans_attribute" 
(table_transaction_id,attribute_type,value) values ('374520943','22003','0');
insert into "public"."table_trans_attribute" 
(table_transaction_id,attribute_type,value) values ('374520943','22004','0');
insert into "public"."table_trans_attribute" 
(table_transaction_id,attribute_type,value) values ('374520943','22005','0');
insert into "public"."table_trans_attribute" 
(table_transaction_id,attribute_type,value) values ('374520943','22006','0');
insert into "public"."table_trans_attribute" 
(table_transaction_id,attribute_type,value) values ('374520943','22007','0');
insert into "public"."table_trans_attribute" 
(table_transaction_id,attribute_type,value) values ('374520943','22007','0');
insert into "public"."table_trans_attribute" 
(table_transaction_id,attribute_type,value) values ('374520942','22009','0');
insert into "public"."table_trans_attribute" 
(table_transaction_id,attribute_type,value) values ('374520942','22010','0');
" ERROR:  duplicate key violates unique constraint "table_trans_attr_pk"

As you see, slony is attempting to enter one tuple
('374520943','22007','0') two times.

Each previous time we have had this problem, rebuilding the indexes on
slony log table (sl_log_1) has fixed the problem.  I have not reindexed
the table this time as I do not want to destroy any usable evidence.

__
Marc


signature.asc
Description: This is a digitally signed message part


Re: [HACKERS] [GENERAL] UUID's as primary keys

2006-06-29 Thread Tom Lane
Greg Stark <[EMAIL PROTECTED]> writes:
> In the current setup the only reason for Postgres to have this data type at
> all is purely for legacy compatibility.

Yes.  So?

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Index corruption

2006-06-29 Thread Tom Lane
Marc Munro <[EMAIL PROTECTED]> writes:
> On Tom Lane's advice, we upgraded to Postgres 8.0.8.

OK, so it's not an already-known problem.

> We were able to corrupt the index within 90 minutes of starting up our
> cluster.  A slony-induced vacuum was under way on the provider at the
> time the subscriber failed.

You still haven't given any details.  What do you mean by "corrupt the
index" --- what actual symptoms are you seeing?

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Index corruption

2006-06-29 Thread Marc Munro
On Tom Lane's advice, we upgraded to Postgres 8.0.8.  We also upgraded
slony to 1.1.5, due to some rpm issues.  Apart from that everything is
as described below.

We were able to corrupt the index within 90 minutes of starting up our
cluster.  A slony-induced vacuum was under way on the provider at the
time the subscriber failed.

What can we do to help identify the cause of this?  We have a test
system that seems able to reproduce this fairly easily.

__
Marc

On Wed, 2006-06-28 at 09:28 -0700, Marc Munro wrote:
> We have now experienced index corruption on two separate but identical
> slony clusters.  In each case the slony subscriber failed after
> attempting to insert a duplicate record.  In each case reindexing the
> sl_log_1 table on the provider fixed the problem.
> 
> The latest occurrence was on our production cluster yesterday.  This has
> only happened since we performed kernel upgrades and we are uncertain
> whether this represents a kernel bug, or a postgres bug exposed by
> different timings in the new kernel.
> 
> Our systems are:
> 
> Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
> Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
> kernel boot option: elevator=deadline
> 16 Gigs of RAM
> postgresql-8.0.3-1PGDG
> Bonded e1000/tg3 NICs with 8192 MTU.
> Slony 1.1.0
> 
> NetApp FAS270 OnTap 7.0.3
> Mounted with the NFS options
> rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
> Jumbo frames 8192 MTU.
> 
> All postgres data and logs are stored on the netapp.
> 
> In the latest episode, the index corruption was coincident with a
> slony-induced vacuum.  I don't know if this was the case with our test
> system failures.
> 

> __
> Marc


signature.asc
Description: This is a digitally signed message part


Re: [HACKERS] [GENERAL] UUID's as primary keys

2006-06-29 Thread Greg Stark

Martijn van Oosterhout  writes:

> On Thu, Jun 29, 2006 at 03:54:36PM +0200, Thomas Hallgren wrote:
> > I have to concur with this. Assume you use a bytea for a UUID that in 
> > turn is used as a primary key. The extra overhead will be reflected in 
> > all indexes, all foreign keys, etc. In a normalized database some tables 
> > may consist of UUID columns only.
> 
> So you create a UUID type. It's cheap enough to create new types after
> all, that's one of postgresql's strengths. What I'm saying is that it's
> easier to create new fixed length types for the cases that need it,
> than it is to redo the entire type handling of the backend.

I guess my motivation here is that I feel currently char(n) is basically
broken in Postgres. Sure it satisfies the letter of the specification, but
it's failing to actually achieve anything for the users. There's no point at
all in using char(n) in Postgres since it takes exactly the same amount of
space as varchar() if you're always stuffing it full and more space if you're
not.

In the current setup the only reason for Postgres to have this data type at
all is purely for legacy compatibility. It doesn't actually "work" in that it
doesn't provide the space savings it's intended to and that would give users
an actual reason to use it in new databases.

-- 
greg


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-29 Thread Bruce Momjian
Hannu Krosing wrote:
> > > But we still have to think about similar cases (index entries pointing
> > > inside CITC chains), unless we plan to disallow adding indexes to
> > > tables.
> > 
> > CREATE INDEX has to undo any chains where the new indexed columns change
> > in the chain, and add index entries to remove the chain.
> 
> Yes, that would be the most straightforward solution.
> 
> It could be better in some cases, if we could avoid adding entries to
> other indexes. Maybe we can just reset some flags, so that some SITC ops
> like finding tuples by the CITC index pointer still work while adding
> new entries wont. 
> 
> But it will be tricky to make this work for bitmap index scans. 
> 
> So yes, index build is a slop operation anyway, so making it even a
> little slower is probably not a big problem. And most CITC chains will
> have only one visible row at a time, this will probably not be a big
> issue. 

Consider that index scans coming from different indexes have to span the
same SITC.  I see no clean way to avoid making all the indexes
consistent, and as you said, CREATE INDEX isn't a frequent operation.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [GENERAL] UUID's as primary keys

2006-06-29 Thread Martijn van Oosterhout
On Thu, Jun 29, 2006 at 03:54:36PM +0200, Thomas Hallgren wrote:
> I have to concur with this. Assume you use a bytea for a UUID that in 
> turn is used as a primary key. The extra overhead will be reflected in 
> all indexes, all foreign keys, etc. In a normalized database some tables 
> may consist of UUID columns only.

So you create a UUID type. It's cheap enough to create new types after
all, that's one of postgresql's strengths. What I'm saying is that it's
easier to create new fixed length types for the cases that need it,
than it is to redo the entire type handling of the backend.

And for people that want char(1), they should be using "char", which
really is one byte (ex padding ofcourse).

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] Compilatiuon of source code for windows

2006-06-29 Thread Hiroshi Saito
Hi.

Probably,
Supposing you are using source of 8.1.4, it is necessary to bring Stable of CVS 
or to apply PATCH with reference to the following. 
http://archives.postgresql.org/pgsql-patches/2006-05/msg00232.php

Regards,
Hiroshi Saito

- Original Message - 
From: Sandeep Jakkaraju(Navolve) 
To: pgsql-hackers@postgresql.org 
Sent: Thursday, June 29, 2006 4:29 PM
Subject: [HACKERS] Compilatiuon of source code for windows


Hi All

I am unable to compile the source code for windows.
the command in the INSTALL file  nmake /f win32.mak
is not working !!!


Can any one help ???


Thanks in advance

sandeep

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] session id and global storage

2006-06-29 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Rodrigo De Leon wrote:
>> You could do this:
>> 
>> SELECT procpid||' '||backend_start
>> FROM pg_stat_activity
>> WHERE datname = current_database()
>> AND usename = session_user
>> AND client_addr = inet_client_addr()
>> AND client_port = inet_client_port();

> That's pretty roundabout.

Indeed.  Use pg_backend_pid() instead:

SELECT whatever FROM pg_stat_activity WHERE procpid = pg_backend_pid();

A difficulty with this in existing releases is that pg_stat_activity
lags behind reality, so that you won't see your session listed in it
until you've been connected at least half a second or so.  8.2 won't
have that problem.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [GENERAL] UUID's as primary keys

2006-06-29 Thread Thomas Hallgren

Greg Stark wrote:

Martijn van Oosterhout  writes:

  

To be honest, it seems like a lot of work to save the four bytes of
overhead for the varlena structure on disk if you're going to need it
in memory anyway. And anything like RAW(16) which people want for
UUIDs, if it's going to have a lot of functions associated with it, may
as well just be a new type. 



For large databases storage density leads directly to speed. Saving four bytes
of overhead on a 16-byte data structure would mean a 20% speed increase. Even
if that's only helpful on a tenth of the columns you're still talking about a
2% speed increase for all queries on the table. A lot of databases use CHAR(1)
for flags. The overhead is even worse there.

  
I have to concur with this. Assume you use a bytea for a UUID that in 
turn is used as a primary key. The extra overhead will be reflected in 
all indexes, all foreign keys, etc. In a normalized database some tables 
may consist of UUID columns only.


Regards,
Thomas Hallgren


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] session id and global storage

2006-06-29 Thread Rodrigo De Leon

That's pretty roundabout. We already expose (hex coded) pid.starttime as
a session identifier in log_line_prefix (it's the %c escape) so I don't
see any reason not to provide either the same thing directly in a
function, or at least to expose the backend pid.


That would be nice.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [GENERAL] UUID's as primary keys

2006-06-29 Thread Greg Stark

Martijn van Oosterhout  writes:

> A tuple is just an array of datums, with some header information. The
> problems come when you don't have a tuple anymore, but only the datum,
> like in arguments for functions.
> 
> I think it's more a case that most places that deal with datums simply
> don't know about typmods. For example, the return type of a function
> can only be char, not char(16). If you consider the case of a function
> returning a RAW, the caller will have no way of knowing the typmod,
> they do know the type though.
> 
> To be honest, it seems like a lot of work to save the four bytes of
> overhead for the varlena structure on disk if you're going to need it
> in memory anyway. And anything like RAW(16) which people want for
> UUIDs, if it's going to have a lot of functions associated with it, may
> as well just be a new type. 

For large databases storage density leads directly to speed. Saving four bytes
of overhead on a 16-byte data structure would mean a 20% speed increase. Even
if that's only helpful on a tenth of the columns you're still talking about a
2% speed increase for all queries on the table. A lot of databases use CHAR(1)
for flags. The overhead is even worse there.

> Consider where we currently we have a "Filter Cond" on a "Seq Scan".
> Currently the filter can access the datums directly on the disk page, with
> what you're proposing, it can't.

Well it only can't if the data type has conversion functions. I'm not sure how
complex it would be having pointers that *getattr sometimes return pointers to
the disk page and sometimes return pointers to a palloced copy though.



-- 
greg


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] session id and global storage

2006-06-29 Thread Andrew Dunstan

Rodrigo De Leon wrote:


Hi, I cant find any function, which tells me something like session
id. Is there something like that? I need it in my AM, because I need
to know, if something which I wrote in file was written in this
current session or previously.



How about
select procpid||' '||backend_start from pg_stat_activity;
Yours,
Laurenz Albe



Something like this would be maybe possible, but this select can
return more rows, when the user is connected with more instances...



You could do this:

SELECT procpid||' '||backend_start
FROM pg_stat_activity
WHERE datname = current_database()
AND usename = session_user
AND client_addr = inet_client_addr()
AND client_port = inet_client_port();



That's pretty roundabout. We already expose (hex coded) pid.starttime as 
a session identifier in log_line_prefix (it's the %c escape) so I don't 
see any reason not to provide either the same thing directly in a 
function, or at least to expose the backend pid.


If you need it in backend C code, the data can be fetched from MyProcPid 
and MyProcPort->session_start.tv_sec


cheers

andrew


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] session id and global storage

2006-06-29 Thread Rodrigo De Leon

Hi, I cant find any function, which tells me something like session
id. Is there something like that? I need it in my AM, because I need
to know, if something which I wrote in file was written in this
current session or previously.


How about
select procpid||' '||backend_start from pg_stat_activity;
Yours,
Laurenz Albe


Something like this would be maybe possible, but this select can
return more rows, when the user is connected with more instances...

David Hoksza


You could do this:

SELECT procpid||' '||backend_start
FROM pg_stat_activity
WHERE datname = current_database()
AND usename = session_user
AND client_addr = inet_client_addr()
AND client_port = inet_client_port();

Regards,

Rodrigo

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


[HACKERS] Some questions to developers

2006-06-29 Thread Nikolay Samokhvalov

Hello,

Recently an interview with Marten Mickos (mysql's ceo) appeared in
russian IT news (eg http://citcity.ru/12776/). I'd like ask some
[similar] questions to PostgreSQL core developers and write an article
in Russian. Actually, I'm sure that there is a great lack of PG news
in our mass media. Moreover, it seems that Anniversary is a good
opportunity to rise a wave of interest.

--
Best regards,
Nikolay

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Single Index Tuple Chain (SITC) method

2006-06-29 Thread Zeugswetter Andreas DCP SD

> > And anyway, ctid is a usable unique row identifier only within 
> > read-only transactions, or not ?

actually for as long as no vacuum comes along. This would change
with SITC. (Maybe it would help to only reuse old versions of the same
row,
then anybody holding a ctid would at least be still looking at a version
of
the same row, and should thus be able to follow the update chain)
 
> Err, no. The ctid is the only identifer of a tuple in any 
> case. When you do a delete, the tuple to be deleted is 
> indicated by the ctid field which has been passed up from the 
> base table through the rest of the query. When you reach the 
> top the ctid better refer to the same tuple or you'll delete 
> the wrong one. UPDATE is the same.

For all these purposes you will be holding the ctid of a visible
(to someone) tuple. Those don't qualify for a new SITC tuple anyway.

> For all intents and purposes, the CTID of tuple can't change 
> unless you're 100% certain no-one is using it in any way.

For all I know, noone is using dead tuples except for visibility
lookup. We would need to make sure that other backends see the new 
tuple eighter as dead or txopen as long as the contents are not valid.
I think we could do that without a vacuum lock on platforms that support
4 byte atomic operations.

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Single Index Tuple Chain (SITC) method

2006-06-29 Thread Martijn van Oosterhout
On Thu, Jun 29, 2006 at 01:39:51AM +0300, Hannu Krosing wrote:
> And anyway, ctid is a usable unique row identifier only within read-only
> transactions, or not ?

Err, no. The ctid is the only identifer of a tuple in any case. When
you do a delete, the tuple to be deleted is indicated by the ctid field
which has been passed up from the base table through the rest of the
query. When you reach the top the ctid better refer to the same tuple
or you'll delete the wrong one. UPDATE is the same.

In READ COMMITTED mode, the tuple is rechecked for visibility and if
it's invisible, the ctid chain is followed to find the visible one
(which may not necessarily be the last one).

For all intents and purposes, the CTID of tuple can't change unless
you're 100% certain no-one is using it in any way. That's what the
vacuum lock is for.

> Is it correct to assume, that only one row version can be in process of
> being modified at any one time?

No, different transactions may be updating differing versions, depending
on what was visible at the time. In serialisable transactions you'll
get a serialisation failure though, and for read committed, the query
will be rerun for the latest version of the tuple.

One thing I am confused about, currently the ctid chain follows tuple
history so that transactions can find the latest version of any tuple,
even if the key fields have changed. This proposal breaks that, I'm not
sure how important that is though.

> Tom - what do you think of the other related idea, that of reusing dead
> index entries ?

I'd like to know about this too, including ideas about truncating
tuples to just the header.

Have a nice day.
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-29 Thread Hannu Krosing
Ühel kenal päeval, T, 2006-06-27 kell 12:16, kirjutas Bruce Momjian:
> Hannu Krosing wrote:
> > ?hel kenal p?eval, T, 2006-06-27 kell 10:38, kirjutas Hannu Krosing:
> > > ?hel kenal p?eval, E, 2006-06-26 kell 23:08, kirjutas Bruce Momjian:
> > > > Jim C. Nasby wrote:
> > > > > On Mon, Jun 26, 2006 at 02:32:59PM -0400, Bruce Momjian wrote:
> > > > > > 
> > > > > > It is certainly possible to do what you are suggesting, that is 
> > > > > > have two
> > > > > > index entries point to same chain head, and have the index access
> > > > > > routines figure out if the index qualifications still hold, but that
> > > > > > seems like a lot of overhead.
> > > 
> > > I think Jim meant not 2 pointing to the same head, but 2 pointing into
> > > the same chain. Say we have table with (id serial, ts timestamp) where
> > > ts changes at each update and id does not.
> > > 
> > > So after 3 updates on one page we have one CITC/ITPC head with pointers
> > > from both indexes and two follow-up tuples with pointers from only ts
> > > index.
> > > 
> > > The problem with this setup is, that we can't reuse any of those
> > > follow-up tuples without index cleanup.
> > 
> > But we still have to think about similar cases (index entries pointing
> > inside CITC chains), unless we plan to disallow adding indexes to
> > tables.
> 
> CREATE INDEX has to undo any chains where the new indexed columns change
> in the chain, and add index entries to remove the chain.

Yes, that would be the most straightforward solution.

It could be better in some cases, if we could avoid adding entries to
other indexes. Maybe we can just reset some flags, so that some SITC ops
like finding tuples by the CITC index pointer still work while adding
new entries wont. 

But it will be tricky to make this work for bitmap index scans. 

So yes, index build is a slop operation anyway, so making it even a
little slower is probably not a big problem. And most CITC chains will
have only one visible row at a time, this will probably not be a big
issue. 

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Single Index Tuple Chain (SITC) method

2006-06-29 Thread Hannu Krosing
Ühel kenal päeval, K, 2006-06-28 kell 18:19, kirjutas Tom Lane:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > Here is an overview of the SITC method:
> > http://momjian.us/cgi-bin/pgsitc
> 
> A pretty fundamental problem is that the method assumes it's OK to
> change the CTID of a live tuple (by swapping its item pointer with some
> expired version).  It is not --- this will break:
>   * active UPDATEs and DELETEs that may have fetched the CTID
> but not yet completed processing to decide whether to change
> the tuple;
>   * pending AFTER ROW triggers, such as foreign key checks;
>   * ODBC as well as other applications that assume CTID is a
> usable unique row identifier within transactions.

We should *always* return the ctid of CITC head, as this is the one that
does not change. 

And anyway, ctid is a usable unique row identifier only within read-only
transactions, or not ?

> VACUUM FULL can get away with moving tuples to new CTIDs because it takes
> AccessExclusiveLock, so there can be no open transactions with knowledge
> of current CTIDs in the table.  This is not OK for something that's
> supposed to happen in plain UPDATEs, though.

Would it still be a problem, if we *always* refer to the whole CITC
chain by its externally visible ctid, an look up the real tuple inside
tuple fetch op at every access.

(1) If we had some special bits for tuples at CITC chain head and inside
CITC but not at head, then even seqscan can ignore non-head CITC chain
members at its find next tuple op and do the real tuple lookup in some
inner function when it hits CITC head.

Is it correct to assume, that only one row version can be in process of
being modified at any one time?

> Another problem is you can't recycle tuples, nor item ids, without
> taking a VACUUM-style lock on the page (LockBufferForCleanup).  If
> anyone else is holding a pin on the page they risk getting totally
> confused --- for instance, a seqscan will either miss a tuple or scan it
> twice depending on which direction you're juggling item ids around it.

I think (1) above solves this, at cost of looking twice at CITC internal
tuple headers.

> The concurrency loss involved in LockBufferForCleanup is OK for
> background-maintenance operations like VACUUM, but I seriously doubt
> anyone will find it acceptable for UPDATE.  It could easily create
> application-level deadlocks, too.  (VACUUM is safe against that because
> it only holds one lock.)


Tom - what do you think of the other related idea, that of reusing dead
index entries ?

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [GENERAL] UUID's as primary keys

2006-06-29 Thread Martijn van Oosterhout
On Thu, Jun 29, 2006 at 02:40:15AM -0400, Greg Stark wrote:
> > Greg Stark <[EMAIL PROTECTED]> writes:
> > No, it doesn't, and we'd pay a nonzero price for allowing that.
> > Currently the executor doesn't have to care (much) about whether a
> > tuple is on-disk or in-memory --- the individual datums look the same
> > either way.  Allowing them to be different would force a lot of
> > format conversion steps that currently need not happen.
> 
> Is there ever a case where an entire tuple is passed around without knowing
> the typmod of an attribute in the tuple?

A tuple is just an array of datums, with some header information. The
problems come when you don't have a tuple anymore, but only the datum,
like in arguments for functions.

I think it's more a case that most places that deal with datums simply
don't know about typmods. For example, the return type of a function
can only be char, not char(16). If you consider the case of a function
returning a RAW, the caller will have no way of knowing the typmod,
they do know the type though.

To be honest, it seems like a lot of work to save the four bytes of
overhead for the varlena structure on disk if you're going to need it
in memory anyway. And anything like RAW(16) which people want for
UUIDs, if it's going to have a lot of functions associated with it, may
as well just be a new type. 

I think time would be much better spent finding a way of allowing
user-defined types to be created without using C functions.

> The conversion would only really have to happen when the attribute is fetched
> or stored, not when the tuple is being passed around wholesale. But I have a
> feeling that would be more intrusive than just making the entire system typmod
> aware.

I'm not sure if tuples are ever passed wholesale very far. The first
node to actually do anything with it (any join, expression or condition
test) is going to need to deconstruct it. Consider where we currently we
have a "Filter Cond" on a "Seq Scan". Currently the filter can access
the datums directly on the disk page, with what you're proposing, it
can't.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] Single Index Tuple Chain (SITC) method

2006-06-29 Thread Zeugswetter Andreas DCP SD

> > Here is an overview of the SITC method:
> > http://momjian.us/cgi-bin/pgsitc
> 
> A pretty fundamental problem is that the method assumes it's 
> OK to change the CTID of a live tuple (by swapping its item 
> pointer with some expired version).  It is not --- this will break:

I am having difficulty visualizing that. The plan is not to change
CTID's
(only the CTID's offset into the page is to be changed).
The CTID of the new version is one that is up to now invisible to all
backends,
so noone can actually have remembered that CTID.

Also you would first insert the slot content and then change the CTID
offset
(this offset change might need to be made atomic).

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[HACKERS] Compilatiuon of source code for windows

2006-06-29 Thread Sandeep Jakkaraju(Navolve)
Hi AllI am unable to compile the source code for windows.the command in the INSTALL file  nmake /f win32.makis not working !!!Can any one help ???Thanks in advancesandeep



Re: [HACKERS] Fixed length datatypes. WAS [GENERAL] UUID's as

2006-06-29 Thread Christopher Kings-Lynne

If there is interest - I'm sure Nathan and I would be willing to put
it on pgfoundry, and at some point give it up for inclusion into
PostgreSQL.

One requirement would be that it runs on Windows. Is that something you 
have tested?


In case it influences anyone, MySQL 5 already has built-in UUID support:

http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#id2899901

Chris


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Fixed length datatypes. WAS [GENERAL] UUID's as

2006-06-29 Thread Thomas Hallgren

[EMAIL PROTECTED] wrote:

On Thu, Jun 29, 2006 at 02:02:32AM -0400, [EMAIL PROTECTED] wrote:

It was written by Nathan Wagner <[EMAIL PROTECTED]> and myself, and
is based off the OSSP ( http://www.ossp.org/ ) UUID implementation.
I'm not an expert on the license, but it seems acceptable to me:
...
If there is interest - I'm sure Nathan and I would be willing to put
it on pgfoundry, and at some point give it up for inclusion into
PostgreSQL.


This might require a little bit of research. It appears that the
development version of OSSP UUID may provide its own PostgreSQL
'bindings'. I may try and contact the author of the OSSP UUID and
see whether any changes we have that he does not, can be rolled
into his version...

Cheers,
mark

I'm thinking ahead on possible objections to inclusion in core. One objection might be that 
a fully blown UUID implementation is a lot of code. Code that needs to be maintained and it 
increases the size of the binary etc. A solution to that might be to break the whole thing 
up in two:


1 The actual type
A plain scalar type that stores 16 bytes. It's complete with standard operators for 
comparison (natural order) and the text representation would be a 32 character hexadecimal 
string. This type should make no interpretation whatsoever on what it stores and its only 
association with UUID's is the storage size.


2 UUID utilities
Various ways of representing, generating, and extract partial information from UUID's. 
Should have support for variants #0, #1, and #2 (the OSSP based code sounds like a good 
candidate).



The split make sense since clients often have powerful UUID utilities handy and hence have 
limited or no use for such utilities in the database (true for all .NET and Java clients). 
Some PL's will also enable such packages out of the box.


The actual type would be extremely generic, lightweight, and easy to implement. No 
portability issues whatsoever. The only difficulty that I see is naming it :-).


Regards,
Thomas Hallgren


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] Fixed length datatypes. WAS [GENERAL] UUID's as

2006-06-29 Thread Thomas Hallgren

[EMAIL PROTECTED] wrote:

On Wed, Jun 28, 2006 at 01:12:17PM -0500, Jim C. Nasby wrote:

On Wed, Jun 28, 2006 at 01:49:55PM -0400, Andrew Dunstan wrote:
Personally I don't buy the misuse objection - we already have plenty of 
things that can be misused. As long as there is a reasonable valid use 
and we can make it portable enough, I think there is a good case for 
including it.

Well, since Mark has one, how about we consider adding it in?
If nothing else, can you please put your stuff on pgFoundry so others
can find it, Mark?


It was written by Nathan Wagner <[EMAIL PROTECTED]> and myself, and
is based off the OSSP ( http://www.ossp.org/ ) UUID implementation.
I'm not an expert on the license, but it seems acceptable to me:

 "Permission to use, copy, modify, and distribute this software for
  any purpose with or without fee is hereby granted, provided that
  the above copyright notice and this permission notice appear in all
  copies."

I haven't tested to see how portable the OSSP UUID implementation is.
This is their words:

 "OSSP uuid was already written with maximum portability in mind, so
  there should be no great effort required to get it running on any Unix
  platform with a reasonable POSIX API. Additionally, the portability
  was tested by successfully building and running it on the following
  particular Unix platforms (syntax is "- ()"):

  alpha-tru644.0 (cc)
  alpha-tru645.1 (gcc, cc)
  hppa-hpux11.11 (cc)
  ia64-hpux11.23 (cc)
  ix86-debian2.2 (gcc, icc)
  ix86-debian3.0 (gcc)
  ix86-debian3.1 (gcc)
  ix86-freebsd4.9 (gcc)
  ix86-freebsd5.2 (gcc, icc)
  ix86-netbsd1.6 (gcc)
  ix86-qnx6.2 (gcc)
  ix86-solaris10 (gcc)
  ix86-unixware7.1.3 (cc)
  mips64-irix6.5 (gcc)
  sparc64-solaris8 (gcc, forte)
  sparc64-solaris9 (gcc)"

I've put it through a fair amount of testing, including using it
within compound indexes, expecting the index to be used for at
least '=', constructing many UUIDs quickly, in a sequence, and
converting it to and from string form. We chose to implement our
own encode / decode routines for performance reasons. With the
exception of testing it on a wider range of platforms, I would
call the module stable.

If there is interest - I'm sure Nathan and I would be willing to put
it on pgfoundry, and at some point give it up for inclusion into
PostgreSQL.


One requirement would be that it runs on Windows. Is that something you have 
tested?

Regards,
Thomas Hallgren


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org