Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

2002-06-23 Thread Curt Sampson

On 23 Jun 2002, J. R. Nield wrote:

> So since we have all this buffering designed especially to meet our
> needs, and since the OS buffering is in the way, can someone explain to
> me why postgresql would ever open a file without the O_DSYNC flag if the
> platform supports it?

It's more code, if there are platforms out there that don't support
O_DYSNC. (We still have to keep the old fsync code.) On the other hand,
O_DSYNC could save us a disk arm movement over fsync() because it
appears to me that fsync is also going to force a metadata update, which
means that the inode blocks have to be written as well.

> Maybe fsync would be slower with two files, but I don't see how
> fdatasync would be, and most platforms support that.

Because, if both files are on the same disk, you still have to move
the disk arm from the cylinder at the current log file write point
to the cylinder at the current ping-pong file write point. And then back
again to the log file write point cylinder.

In the end, having a ping-pong file as well seems to me unnecessary
complexity, especially when anyone interested in really good
performance is going to buy a disk subsystem that guarantees no
torn pages and thus will want to turn off the ping-pong file writes
entirely, anyway.

cjs
-- 
Curt Sampson  <[EMAIL PROTECTED]>   +81 90 7737 2974   http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light.  --XTC


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

2002-06-23 Thread J. R. Nield

On Sun, 2002-06-23 at 12:10, Curt Sampson wrote:
> 
> So what we would really need to implement, if we wanted to be more
> efficient with trusted disk subsystems, would be the option of writing
> to the log only the changed row or changed part of the row, or writing
> the entire changed page. I don't know how hard this would be
> 
We already log that stuff. The page images are in addition to the
"Logical Changes", so we could just stop logging the page images.

-- 
J. R. Nield
[EMAIL PROTECTED]




---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

2002-06-23 Thread J. R. Nield

On Sun, 2002-06-23 at 11:19, Tom Lane wrote: 
> Curt Sampson <[EMAIL PROTECTED]> writes:
> > This should also allow us to disable completely the ping-pong writes
> > if we have a disk subsystem that we trust.
> 
> If we have a disk subsystem we trust, we just disable fsync on the
> WAL and the performance issue largely goes away.

It wouldn't work because the OS buffering interferes, and we need those
WAL records on disk up to the greatest LSN of the Buffer we will be writing.


We already buffer WAL ourselves. We also already buffer regular pages.
Whenever we write a Buffer out of the buffer cache, it is because we
really want that page on disk and wanted to start an IO. If thats not
the case, then we should have more block buffers! 

So since we have all this buffering designed especially to meet our
needs, and since the OS buffering is in the way, can someone explain to
me why postgresql would ever open a file without the O_DSYNC flag if the
platform supports it? 



> 
> I concur with Bruce: the reason we keep page images in WAL is to
> minimize the number of places we have to fsync, and thus the amount of
> head movement required for a commit.  Putting the page images elsewhere
> cannot be a win AFAICS.


Why not put all the page images in a single pre-allocated file and treat
it as a ring? How could this be any worse than flushing them in the WAL
log? 

Maybe fsync would be slower with two files, but I don't see how
fdatasync would be, and most platforms support that. 

What would improve performance would be to have a dbflush process that
would work in the background flushing buffers in groups and trying to
stay ahead of ReadBuffer requests. That would let you do the temporary
side of the ping-pong as a huge O_DSYNC writev(2) request (or
fdatasync() once) and then write out the other buffers. It would also
tend to prevent the other backends from blocking on write requests. 

A dbflush could also support aio_read/aio_write on platforms like
Solaris and WindowsNT that support it. 

Am I correct that right now, buffers only get written when they get
removed from the free list for reuse? So a released dirty buffer will
sit in the buffer free list until it becomes the Least Recently Used
buffer, and will then cause a backend to block for IO in a call to
BufferAlloc? 

This would explain why we like using the OS buffer cache, and why our
performance is troublesome when we have to do synchronous IO writes, and
why fsync() takes so long to complete. All of the backends block for
each call to BufferAlloc() after a large table update by a single
backend, and then the OS buffers are always full of our "written" data. 

Am I reading the bufmgr code correctly? I already found an imaginary
race condition there once :-) 

;jnield 


> 
> > Well, whether or not there's a cheap way depends on whether you consider
> > fsync to be cheap. :-)
> 
> It's never cheap :-(
> 
-- 
J. R. Nield
[EMAIL PROTECTED]


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

2002-06-23 Thread Curt Sampson

On Sun, 23 Jun 2002, Tom Lane wrote:

> Curt Sampson <[EMAIL PROTECTED]> writes:
> > This should also allow us to disable completely the ping-pong writes
> > if we have a disk subsystem that we trust.
>
> If we have a disk subsystem we trust, we just disable fsync on the
> WAL and the performance issue largely goes away.

No, you can't do this. If you don't fsync(), there's no guarantee
that the write ever got out of the computer's buffer cache and to
the disk subsystem in the first place.

> I concur with Bruce: the reason we keep page images in WAL is to
> minimize the number of places we have to fsync, and thus the amount of
> head movement required for a commit.

An fsync() does not necessarially cause head movement, or any real
disk writes at all. If you're writing to many external disk arrays,
for example, the fsync() ensures that the data are in the disk array's
non-volatile or UPS-backed RAM, no more. The array might hold the data
for quite some time before it actually writes it to disk.

But you're right that it's faster, if you're going to write out changed
pages and have have the ping-pong file and the transaction log on the
same disk, just to write out the entire page to the transaction log.

So what we would really need to implement, if we wanted to be more
efficient with trusted disk subsystems, would be the option of writing
to the log only the changed row or changed part of the row, or writing
the entire changed page. I don't know how hard this would be

> > Well, whether or not there's a cheap way depends on whether you consider
> > fsync to be cheap. :-)
>
> It's never cheap :-(

Actually, with a good external RAID system with non-volatile RAM,
it's a good two to four orders of magnitude cheaper than writing to a
directly connected disk that doesn't claim the write is complete until
it's physically on disk. I'd say that it qualifies as at least "not
expensive." Not that you want to do it more often than you have to
anyway

cjs
-- 
Curt Sampson  <[EMAIL PROTECTED]>   +81 90 7737 2974   http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light.  --XTC


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Suggestions for implementing IS DISTINCT FROM?

2002-06-23 Thread Tom Lane

Thomas Lockhart <[EMAIL PROTECTED]> writes:
> I'm looking at implementing IS DISTINCT FROM, among other things.
> ...
> I was thinking to implement this by simply expanding these rules within
> gram.y to be a tree of comparison tests.

Please, please, do not do that.  Make a new expression node tree type,
instead.  We've made this mistake before (eg for BETWEEN) and I don't
want to do it again.

Aside from the points you make, a direct expansion approach cannot
reverse-list properly in rules/views, and it will force multiple
evaluations of arguments that should not be multiply evaluated.

Adding a new expression node tree type is not too difficult these days;
see for example Joe Conway's recent NullTest and BooleanTest additions.

I believe the existing expansions of row comparison operators
(makeRowExpr) should be replaced by specialized nodes, too.  That would
give us a shot at implementing row '<', '>' comparisons in a
spec-compliant fashion...

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

2002-06-23 Thread Tom Lane

Curt Sampson <[EMAIL PROTECTED]> writes:
> This should also allow us to disable completely the ping-pong writes
> if we have a disk subsystem that we trust.

If we have a disk subsystem we trust, we just disable fsync on the
WAL and the performance issue largely goes away.

I concur with Bruce: the reason we keep page images in WAL is to
minimize the number of places we have to fsync, and thus the amount of
head movement required for a commit.  Putting the page images elsewhere
cannot be a win AFAICS.

> Well, whether or not there's a cheap way depends on whether you consider
> fsync to be cheap. :-)

It's never cheap :-(

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] ecpg and bison again

2002-06-23 Thread Thomas Lockhart

A couple of notes:

...
> Then, update *only* the ecpg source directory to the branch:
> cd pgsql/src/interfaces
> cvs update -r ecpg_big_bison ecpg

cvs will respect any changes you have made to the sources in your
directory and the changes will be preserved in the move to the branch.

Here is what the update looks like on my machine:

myst$ cvs update -r ecpg_big_bison ecpg
cvs update: Updating ecpg
cvs update: Updating ecpg/include
cvs update: Updating ecpg/lib
? ecpg/lib/libecpg.so.3.2.0
? ecpg/lib/libecpg.so.3.3.0
? ecpg/lib/libecpg.so.3.4.0
cvs update: Updating ecpg/preproc
? ecpg/preproc/ecpg
cvs update: Updating ecpg/test


If you want to check on the branch status of a particular file use the
"status" command. For example, checking on ecpg/Makefile looks like:

myst$ cvs status ecpg/Makefile
===
File: Makefile  Status: Up-to-date

   Working revision:1.14Mon Feb  4 15:37:13 2002
   Repository revision: 1.14
/home/thomas/cvs/repository/pgsql/src/interfaces/ecpg/Makefile,v
   Sticky Tag:  ecpg_big_bison (branch: 1.14.4)
   Sticky Date: (none)
   Sticky Options:  (none)


(I'm using cvsup so the repository path is local to my machine in this
example.)

 - Thomas

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] ecpg and bison again

2002-06-23 Thread Thomas Lockhart

> > I'm happy setting up the branch if that would be helpful. Let me know if
> > this is the way you want to proceed, and if so what you would like the
> That would be nice. I do not really knwo cvs myself.

Done. And here is how you would use it...

> > branch to be called.
> No idea. "new-bison" maybe?

OK, the tag name is "ecpg_new_bison" (note the underscores; dashes are
not allowed afaicr).

I created the branch *only* on src/interfaces/ecpg. To get ready to work
on the branch, you will want to update your entire tree:

cvs update -PdA pgsql

If you have uncommitted changes to code cvs will respect that and
preserve the changes, but you may want to make a tarball just in case ;)

Then, update *only* the ecpg source directory to the branch:

cd pgsql/src/interfaces
cvs update -r ecpg_big_bison ecpg

At that point, *all* cvs files in your ecpg directory will be living on
the branch. If you make changes and commit them, the changes will only
be visible on the branch. The branch tag is "sticky", so *unless* you
explicitly change the tag or branch by, say, an unfortunate "update -A"
or "update -rHEAD" then all files and any new files will stay on the
branch.

Since the tag is only on the src/interfaces/ecpg directory, if you
accidentally try updating other directories to that tag the files may
"vanish", since they do not have that tag. Update those directories back
to the head and the files will reappear.

You can easily work on both the HEAD and ecpg_big_bison by renaming your
already-branched ecpg directory to, say, ecpg.big, then doing

cd pgsql/src/interfaces
cvs update -PdA ecpg

which will recover files from the tip of the cvs tree. You would then
have two directories, ecpg/ and ecpg.big/, and the sticky tags in
ecpg.big/ will be respected by CVS and will still refer to the correct
branch of ecpg/ within the CVS repository.

If you commit changes to the tip which you want to pull up to the
branch, use

cd pgsql/src/interfaces
# name may be ecpg/ if you have a completely separate tree
cvs update -j HEAD ecpg.big

which will merge changes from HEAD (the cvs tip) into your source
directory. You would still need to commit them to have them in the
repository:

cd pgsql/src/interfaces
cvs commit ecpg.big

When it comes time, you will want to merge your branch back down to the
main tree. You can do this as was done above for the other direction:

cd pgsql/src/interfaces
cvs update -j ecpg_big_bison ecpg

where the ecpg directory is, as described above, already on the cvs tip.

Let me know if you have any questions!

- Thomas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])



Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

2002-06-23 Thread Curt Sampson

On 23 Jun 2002, J. R. Nield wrote:

> On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote:
> > J. R. Nield wrote:
> > > One other point:
> > >
> > > Page pre-image logging is fundamentally the same as what Jim Grey's
> > > book[1] would call "careful writes". I don't believe they should be in
> > > the XLOG, because we never need to keep the pre-images after we're sure
> > > the buffer has made it to the disk. Instead, we should have the buffer
> > > IO routines implement ping-pong writes of some kind if we want
> > > protection from partial writes.
> >
> > Ping-pong writes to where?  We have to fsync, and rather than fsync that
> > area and WAL, we just do WAL.  Not sure about a win there.

Presumably the win is that, "we never need to keep the pre-images
after we're sure the buffer has made it to the disk." So the
pre-image log can be completely ditched when we shut down the
server, so a full system sync, or whatever. This keeps the log file
size down, which means faster recovery, less to back up (when we
start getting transaction logs that can be backed up), etc.

This should also allow us to disable completely the ping-pong writes
if we have a disk subsystem that we trust. (E.g., a disk array with
battery backed memory.) That would, in theory, produce a nice little
performance increase when lots of inserts and/or updates are being
committed, as we have much, much less to write to the log file.

Are there stats that track, e.g., the bandwidth of writes to the
log file? I'd be interested in knowing just what kind of savings
one might see by doing this.

> The key question is: do we have some method to ensure that the OS
> doesn't do the writes in parallel?...
> It would seem to me that if the same process does both ping-pong writes,
> then there should be a cheap way to enforce a serial order. I could be
> wrong though.

Well, whether or not there's a cheap way depends on whether you consider
fsync to be cheap. :-)

cjs
-- 
Curt Sampson  <[EMAIL PROTECTED]>   +81 90 7737 2974   http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light.  --XTC


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] ecpg and bison again

2002-06-23 Thread Michael Meskes

On Wed, Jun 19, 2002 at 06:23:24PM -0700, Thomas Lockhart wrote:
> Michael, is this acceptable to you? If you use remote cvs, then you

Yes, it is.

> I'm happy setting up the branch if that would be helpful. Let me know if
> this is the way you want to proceed, and if so what you would like the

That would be nice. I do not really knwo cvs myself.

> branch to be called.

No idea. "new-bison" maybe?

Michael
-- 
Michael Meskes
[EMAIL PROTECTED]
Go SF 49ers! Go Rhein Fire!
Use Debian GNU/Linux! Use PostgreSQL!

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

2002-06-23 Thread J. R. Nield

On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote:
> J. R. Nield wrote:
> > One other point:
> > 
> > Page pre-image logging is fundamentally the same as what Jim Grey's
> > book[1] would call "careful writes". I don't believe they should be in
> > the XLOG, because we never need to keep the pre-images after we're sure
> > the buffer has made it to the disk. Instead, we should have the buffer
> > IO routines implement ping-pong writes of some kind if we want
> > protection from partial writes.
> 
> Ping-pong writes to where?  We have to fsync, and rather than fsync that
> area and WAL, we just do WAL.  Not sure about a win there.
> 

The key question is: do we have some method to ensure that the OS
doesn't do the writes in parallel?

If the OS will ensure that one of the two block writes of a ping-pong
completes before the other starts, then we don't need to fsync() at 
all. 

The only thing we are protecting against is the possibility of both
writes being partial. If neither is done, that's fine because WAL will
protect us. If the first write is partial, we will detect that and use
the old data from the other, then recover from WAL. If the first is
complete but the second is partial, then we detect that and use the
newer block from the first write. If the second is complete but the
first is partial, we detect that and use the newer block from the second
write.

So does anyone know a way to prevent parallel writes in one of the
common unix standards? Do they say anything about this?

It would seem to me that if the same process does both ping-pong writes,
then there should be a cheap way to enforce a serial order. I could be
wrong though.

As to where the first block of the ping-pong should go, maybe we could
reserve a file with nBlocks space for them, and write the information
about which block was being written to the XLOG for use in recovery.
There are many other ways to do it.

;jrnield

-- 
J. R. Nield
[EMAIL PROTECTED]




---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] Code questions

2002-06-23 Thread Gavin Sherry

On Sun, 23 Jun 2002, Christopher Kings-Lynne wrote:

> Hi All,
> 
> Whereabouts in the code is the '*' expanded into the list of valid columns 

See the rule 'target_el' in gram.y. The SelectStmt node is processed
further down the parser in analyze.c: see transformStmt(),
transformSelectStmt() and transformTargetList().

> and also where are the columns specified in the select arguments (or
> whereever) checked for validity?

This is pretty easy to discover by working backward from the
elog(ERROR) produced when you select a non-existent attribute from a
relation:

ERROR:  Attribute 'nonexistent' not found

This is generated by transformIdent(), called from transformExpr, called
from transformTargetEntry. The latter is called by transformTargetList()
when it the attribute is not of the form '*' or 'relation.*' or when we
don't know if the attribute is actually an attribute.

> Chris

Gavin


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



[HACKERS] Code questions

2002-06-23 Thread Christopher Kings-Lynne



Hi All,
 
Whereabouts in the code is the '*' expanded into 
the list of valid columns and also where are the columns specified in the select 
arguments (or whereever) checked for validity?
 
Chris