Re: [HACKERS] stats_command_string default?

2003-02-16 Thread Kevin Brown
I wrote:

> > Would it make more sense to enable stats_command_string by default?
> > It could be a problem if doing so would have a significant impact on
> > performance, but that's the only reason I can think of for not doing
> > it.  Are there others?

and Christopher Kings-Lynne responded:

> You can just turn it on in one second in your postgresql.conf
> file...

Sure.  But my initial perusal of the documentation didn't reveal that
option (once I knew the name of the option to enable, I knew better
where to look in the documentation, but by that time it was too late).
I ended up figuring it out by (a) hearing that it was possible, (b)
looking at the definition of pg_stat_activity, and (c) looking through
the source code to find out how the pg_stat_get_backend_activity()
function worked, and from there how to make it display something.
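
For anyone else hunting for it, the end result looks something like this
(a sketch -- the exact pg_stat_activity column names vary by version):

-- after setting  stats_command_string = true  in postgresql.conf:
SELECT procpid, usename, current_query FROM pg_stat_activity;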

Since we've been talking about fixing the defaults, it seems to me
a good opportunity to address little things like this as well.  It
doesn't cost us anything and it'll make someone's life easier.

It seems to me that it makes the most sense for the defaults to be
whatever is the most useful for the most people.  I can see lots of
use in enabling stats_command_string by default and little use in
disabling it by default.  It seems to me that most DBAs will want it
turned on unless there's a big performance loss as a result.

That's why I asked the question: is there a really good reason that
most DBAs would want it disabled?  My sense is that there isn't, but I
don't know, which is why I'm asking.


-- 
Kevin Brown   [EMAIL PROTECTED]




Re: [HACKERS] stats_command_string default?

2003-02-16 Thread Kevin Brown
Tom Lane wrote:
> Kevin Brown <[EMAIL PROTECTED]> writes:
> > Would it make more sense to enable stats_command_string by default?
> 
> I'd vote against it.  If we turn it on by default, people are paying
> for a feature they may not even know exists.  Once they find out about
> it and decide they want it, they can turn it on easily enough.
> 
> If you can show that the overhead is unmeasurable, that'd indicate that
> this argument is bogus; but I suspect it's not negligible, at least on
> simple queries.

It's not unmeasurable, but it is reasonably low (guess it depends on
your definition of "reasonable" :-).  I wrote a small perl script
which would do a "SELECT 1" in a loop as many times as I specified on
the command line (autocommit was turned off).  I measured the amount
of wall clock time it took to do 100,000 passes on an unloaded system
with stats_command_string enabled, and then the same thing with it
disabled.

The difference in time over 100,000 passes was 20 seconds (44 seconds
with stats_command_string turned on, 24 with it turned off), for an
impact of 0.2 milliseconds per command executed.  This was on a 1.5GHz
P4 with 1G of RAM running Linux 2.4.20 on ReiserFS.  The data is
stored on a software RAID-5 across 3 Seagate ST-380021A IDE drives,
each connected to a separate channel on a Promise ATA100 card.

I have no idea if that's small enough to be considered negligible or
not, considering the hardware it was running on.



-- 
Kevin Brown   [EMAIL PROTECTED]




[HACKERS] online reindex

2003-02-16 Thread Greg Stark

It sounds to me like it should be fairly straightforward to implement online
reindex. That is, reindex without locking the table. This is an important
feature for 24x7 operation. Since postgres doesn't modify data in place, the
only thing required for online reindex is to ensure that the reindex
operation sees all of the latest data.

If reindex sets a flag that causes all new inserts and updates to allocate new
space at the end of the heap without checking for free space, then a simple
linear scan should be guaranteed to catch all the data.

(I'm not sure how the indexing operation works, if it reads in the whole table
and then sorts it there would have to be an extra step where any new tuples
are read in and inserted.)

There would only have to be a small window with the table locked while the
indexes are swapped at the end.
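
For comparison, the closest manual approximation available today is
something like this (a sketch with hypothetical table and index names;
CREATE INDEX still blocks writers for the whole build, which is exactly
the lock the proposal would avoid):

BEGIN;
-- build a replacement index: readers are unaffected, but writers
-- block for the entire build in current releases
CREATE INDEX accounts_balance_new ON accounts (balance);
-- the swap itself is quick: drop the old index
DROP INDEX accounts_balance_idx;
COMMIT;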

-- 
greg





Re: [HACKERS] Linux.conf.au 2003 Report

2003-02-16 Thread Gavin Sherry
On Sat, 15 Feb 2003, Christopher Kings-Lynne wrote:

> I think so - Gavin?  As far as I'm aware there's not really anything else
> on the open source circuit.  There is often a MySQL rep there as well
> apparently.

Chris is right. David Axmark (MySQL AB) usually turns up, but he didn't
this year.

The conference has an attendance of 400 people. The audience is fairly
technical -- getting less so each year though :-(.

If I go next year, I think I will give a tutorial focusing on migrating
MySQL applications to PostgreSQL. Many attendees of my talk this year were
looking for information like that -- something I didn't cover :-\.

Gavin






Re: [HACKERS] client_encoding directive is ignored in

2003-02-16 Thread Tatsuo Ishii
> > Actually the problem can be divided into two parts:
> > 1) backend does not process GUC client_encoding.
> > 2) libpq does not ask for the backend's client_encoding; instead it asks
> >    for the database encoding when it starts up the connection. This is
> >    just a mistake.
> 
> > I think we could fix 1) without any backward compatibility problem, and
> > it should be applied to both 7.3-STABLE and current.
> 
> If we change the backend behavior without changing libpq, aren't we
> breaking things even worse?  As long as libpq behaves as in (2), hadn't
> the backend better init its idea of client_encoding to match
> database_encoding?

Why? No matter how the backend's behavior regarding client_encoding
changes, libpq won't be affected by it, since the 7.2 and 7.3 libpq does
not use client_encoding anyway.
--
Tatsuo Ishii




Re: [HACKERS] location of the configuration files

2003-02-16 Thread Peter Eisentraut
Tom Lane writes:

> AFAICS, you can either set -C to /etc if you want your PG config files
> loose in /etc, or you can set it to /etc/postgresql/ if you want them
> in a privately-owned directory.  Which other arrangements are needed?

People might want to share them between servers, or allow a user to select
from a few pre-configured ones that reside in the same directory.

-- 
Peter Eisentraut   [EMAIL PROTECTED]





Re: [HACKERS] psql and readline

2003-02-16 Thread Patrick Welche
On Sat, Feb 15, 2003 at 03:10:19PM -0600, Ross J. Reedstrom wrote:
> On Fri, Feb 14, 2003 at 11:32:02AM -0500, Tom Lane wrote:
> > Patrick Welche <[EMAIL PROTECTED]> writes:
> > > On Thu, Feb 13, 2003 at 10:25:52AM -0500, Tom Lane wrote:
> > >> Well, is that a bug in your wrapper?  Or must we add a configure test
> > >> for the presence of replace_history_entry()?
> > 
> > > Good question. Easiest for now for me would be add a configure test.
> > 
> > Okay with me --- Ross, can you handle that?
> 
> I'll take a crack at it. Testing will be a problem, since all my machines
> have libreadline installed, but I've got one I can probably pull it off
> of. Patrick, is 'your libedit wrapper' your personal code, or something
> standard on some set of systems (e.g. *BSD) ?

It's the standard NetBSD libedit, which comes with built-in readline
emulation; it's just that replace_history_entry() isn't one of the emulated
functions ({read,write,add}_history() are included).
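
Presumably the test itself is a one-liner in configure.in -- a sketch,
assuming the usual Autoconf macro, which would define
HAVE_REPLACE_HISTORY_ENTRY for psql to check:

# sketch: defines HAVE_REPLACE_HISTORY_ENTRY if the function links
AC_CHECK_FUNCS([replace_history_entry])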

> Ross "one line patches always grow" Reedstrom

:-)

I'm of course very happy to test the growing patch!

Cheers,

Patrick




Re: [HACKERS] client_encoding directive is ignored in

2003-02-16 Thread Tom Lane
Tatsuo Ishii <[EMAIL PROTECTED]> writes:
> Why? No matter how the backend's behavior regarding client_encoding
> changes, libpq won't be affected by it, since the 7.2 and 7.3 libpq does
> not use client_encoding anyway.

Doesn't client_encoding determine what encoding the backend sends to
the client?

It's probable that libpq itself is not affected by the selected
client_encoding, since it only passes data through.  But I think that
applications are quite likely to be broken if we alter the backend's
behavior in a way that changes the default client encoding.  That
strikes me as a change that should be made at a major release, not
a minor one --- people don't expect to get hit by compatibility
problems when they do a minor upgrade.

But this argument is mostly irrelevant if the proposed change will not
affect behavior in a default installation.  I guess I'm not entirely
clear on exactly which cases it will affect.  What will your proposed
change do in each possible combination (database encoding is SQL_ASCII
or not, client_encoding is defined in postgresql.conf or not,
PGCLIENTENCODING is set in postmaster's environment or not, etc)?

regards, tom lane




Re: [HACKERS] location of the configuration files

2003-02-16 Thread Tom Lane
Peter Eisentraut <[EMAIL PROTECTED]> writes:
> Tom Lane writes:
>> AFAICS, you can either set -C to /etc if you want your PG config files
>> loose in /etc, or you can set it to /etc/postgresql/ if you want them
>> in a privately-owned directory.  Which other arrangements are needed?

> People might want to share them between servers, or allow a user to select
> from a few pre-configured ones that reside in the same directory.

You can accomplish that without the need to customize the .conf file
names; you just make, eg,

/etc/postgres/myconfig/postgresql.conf
/etc/postgres/yourconfig/postgresql.conf
/etc/postgres/herconfig/postgresql.conf

(plus additional config files as needed in each of these directories)
and then the postmaster start command is

postmaster -C /etc/postgres/myconfig

I see no real gain in flexibility in allowing people to choose random
names for the individual config files.  Also, it'd defeat the
ultimate-fallback approach of doing "find / -name postgresql.conf"
to figure out where the config files are hiding in an unfamiliar
installation.

regards, tom lane




Re: [HACKERS] location of the configuration files

2003-02-16 Thread Tom Lane
Lamar Owen <[EMAIL PROTECTED]> writes:
> The script's been out there for a while.  It does some things well, and
> some things not so well.  The config files are still coresident with the
> database, and backup is more difficult than it needs to be.  Meeting all
> these needs (with configure switches, configuration file directives, etc.)
> would be a good thing.

Sure.  I'm happy to change the software in a way that *allows* moving the
config files elsewhere.  But it's not apparent to me why you insist on
forcing people who are perfectly happy with their existing configuration
arrangements to change them.  I have not seen any reason in this
discussion why we can't support both a separate-config-location approach
and the traditional single-location one.

Please remember that the existing approach has been evolved over quite
a few releases.  It may not satisfy the dictates of the FHS religion,
but it does meet some people's needs perfectly well.  Let's look for a
solution that permits coexistence, rather than one that forces change
on people who don't need or want change.

regards, tom lane




Re: [HACKERS] location of the configuration files

2003-02-16 Thread Tom Lane
Kevin Brown <[EMAIL PROTECTED]> writes:
> So in your case, what's the advantage of having initdb write anything
> to a config file, when you're probably also relying on PGDATA or -D to
> start the database (if you're not, then fair enough.  But see below)?

Keep in mind that initdb doesn't currently *need* to put the datadir
location into the config file.  It *will* need to do so if we separate
config and data dirs.  Or at least, *somebody* will need to do so.
It's not apparent to me how it simplifies life not to have initdb do it.
Especially when there are other configuration items that initdb should
or already does record: locale settings, database encoding.  And we
have already been talking about improving PG's self-tuning capability.
initdb would be the natural place to look around for information like
available RAM and adjust the config-file settings like sort_mem
accordingly.

Basically, the notion that initdb shouldn't write a config file seems
like a complete dead end to me.  It cannot possibly be more convenient
than the alternatives.  We'd be giving up a lot of current and future
functionality --- and for what?

> I'd expect initdb to initialize a database.  If I were running initdb
> without a lot of foreknowledge of its side effects, I think I'd
> probably be a bit surprised to find that it had touched my config
> file.

If we do it the way I suggested (dump into the datadir, which is
initially empty, same as always) then it cannot overwrite your existing
config files.  Think of it as providing a suggested config file to
compare against what you have.

regards, tom lane




Re: [HACKERS] stats_command_string default?

2003-02-16 Thread Tom Lane
Kevin Brown <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Kevin Brown <[EMAIL PROTECTED]> writes:
>>> Would it make more sense to enable stats_command_string by default?
>> 
>> I'd vote against it.  If we turn it on by default, people are paying
>> for a feature they may not even know exists.  Once they find out about
>> it and decide they want it, they can turn it on easily enough.


> The difference in time over 100,000 passes was 20 seconds (44 seconds
> with stats_command_string turned on, 24 with it turned off), for an
> impact of 0.2 milliseconds per command executed.

In other words, more than an eighty percent penalty on simple commands.
Not negligible in my book.

> I have no idea if that's small enough to be considered negligible or
> not, considering the hardware it was running on.

I would imagine that the CPU-time ratio would not depend all that much
on the particular hardware.

regards, tom lane




Re: [HACKERS] online reindex

2003-02-16 Thread Tom Lane
Greg Stark <[EMAIL PROTECTED]> writes:
> If reindex sets a flag that causes all new inserts and updates to allocate new
> space at the end of the heap without checking for free space, then a simple
> linear scan should be guaranteed to catch all the data.

Oh?  If people are inserting tuples at the same time you're reading, I'm
not sure this holds good at all.

> There would only have to be a small window with the table locked while the
> indexes are swapped at the end.

Can you say "deadlock"?  Upgrading a lock from shared to exclusive won't
work in general, because there may be other transactions trying to do
the same thing.

regards, tom lane




Re: [HACKERS] stats_command_string default?

2003-02-16 Thread Tom Lane
Kevin Brown <[EMAIL PROTECTED]> writes:
> It's not unmeasurable, but it is reasonably low (guess it depends on
> your definition of "reasonable" :-).  I wrote a small perl script
> which would do a "SELECT 1" in a loop as many times as I specified on
> the command line (autocommit was turned off).  I measured the amount
> of wall clock time it took to do 100,000 passes on an unloaded system
> with stats_command_string enabled, and then the same thing with it
> disabled.

FWIW, I did a comparable test using a slightly more reasonable example
(ie, a query that does useful work): in the regression database,
PREPARE q(int) AS SELECT * FROM tenk1 WHERE unique1 = $1
followed by 10,000 executions of
EXECUTE q(42)
This was with autocommit on (ie, each EXECUTE is its own transaction)
and using a C-coded client (a small tweak of src/test/examples/testlibpq.c).

Averaging over three trials on an unloaded system, I got 21.0 seconds
with stats_command_string off, 27.7 with it on, or about 32% overhead.

My conclusion is that stats_command_string overhead is non-negligible
for simple commands.  So I stand by my previous opinion that it should
not be turned on without the DBA taking explicit action to turn it on.
Do you want it on in every future benchmark, for example?

regards, tom lane




Re: [HACKERS] psql and readline

2003-02-16 Thread Peter Eisentraut
Ross J. Reedstrom writes:

> O.K., I found the 'editline' wrapper around 'libedit' that provides
> a subset of readline functionality, and used that for testing. On my
> Debian Linux systems, editline installs readline compatibility headers
> (readline.h, history.h) into /usr/include/editline/, so I added tests
> for those into configure.in, and src/include/pg_config.h.in, and usage
> in src/bin/psql/input.h

I don't think this is what we were after.  We've certainly been running
with libedit for a long time without anyone ever mentioning
/usr/include/editline.  I suggest this part be taken out.

-- 
Peter Eisentraut   [EMAIL PROTECTED]





Re: [HACKERS] online reindex

2003-02-16 Thread Greg Stark
Tom Lane <[EMAIL PROTECTED]> writes:

> Greg Stark <[EMAIL PROTECTED]> writes:
> > If reindex sets a flag that causes all new inserts and updates to allocate new
> > space at the end of the heap without checking for free space, then a simple
> > linear scan should be guaranteed to catch all the data.
> 
> Oh?  If people are inserting tuples at the same time you're reading, I'm
> not sure this holds good at all.

The premise is that reusing space within the heap is disabled during this
scan, so any new tuples are at the end. When you've reached the end you've
read all the ones inserted during your scan as well. 

There is a window between the end of the scan and locking the table in which
more tuples could be added, but presumably that window would be small and the
extra tuples few.  They would have to be processed while the table is locked.

I don't know much about postgres's locking yet, so I can't comment on how to
arrange to get a lock on the table.  At first blush it sounds like there
needs to be an "upgradable lock" that only one process can hold, but that
allows other processes to read while it's held, until it's upgraded.  But as
I said, I don't know many details about the locking policies currently in use.

-- 
greg





Re: [HACKERS] stats_command_string default?

2003-02-16 Thread Christopher Kings-Lynne
> Averaging over three trials on an unloaded system, I got 21.0 seconds
> with stats_command_string off, 27.7 with it on, or about 32% overhead.
>
> My conclusion is that stats_command_string overhead is non-negligible
> for simple commands.  So I stand by my previous opinion that it should
> not be turned on without the DBA taking explicit action to turn it on.
> Do you want it on in every future benchmark, for example?

How about with the stats_collector on?  I.e., recording block and row level
stats?

Chris





[HACKERS] Hard problem with concurrency

2003-02-16 Thread Christopher Kings-Lynne
OK, this is the problem:

I want to write a bit of SQL such that if a row exists in a table, it gets
updated, otherwise it gets inserted.  Problem is, there is a very high chance
of simultaneous execution of this query on the same row (the rows have a
unique index).

So, strategy one:

begin;
update row;
if (no rows affected) insert row;
commit;

Problem - race condition!  If the two transactions run at the same time,
the second will end up doing an insert that violates the unique index,
which will cause query failure.

Strategy two:

begin;
select row for update;
if (row returned) update;
else insert;
commit;

Problem - race condition.  The row-level locking doesn't allow me to lock
'potential rows', so if the row does not yet exist and two transactions run
simultaneously, then the second will die with a unique violation.

Strategy three:

begin;
lock table in exclusive mode;
update row;
if (no rows affected) insert row;
commit;

Problem - Works, but this table needs high concurrency.  Every time a member
hits a page of the site that needs authentication, this function is called.
In particular, the login transaction can take a little time sometimes and we
can't halt everyone else's activities for that duration...

So what is the solution???

I'm not sure if acquiring a ROW EXCLUSIVE MODE lock will help at all.  Also,
I can't try the insert and then the update, because a failed INSERT, in
Postgres, will cause an outright transaction failure.

What the heck is the solution??

Chris






Re: [HACKERS] stats_command_string default?

2003-02-16 Thread Tom Lane
"Christopher Kings-Lynne" <[EMAIL PROTECTED]> writes:
>> My conclusion is that stats_command_string overhead is non-negligible
>> for simple commands.  So I stand by my previous opinion that it should
>> not be turned on without the DBA taking explicit action to turn it on.

> How about with the stats_collector on?  ie. Recording block and row level
> stats?

Didn't measure that, but I believe the block/row stats are dumped to the
collector once per transaction, so the overhead ought to be roughly
comparable to this test.

regards, tom lane




Re: [HACKERS] Hard problem with concurrency

2003-02-16 Thread greg



Solution one: use sequences for the unique value.

Solution two: use another table to effect the exclusive locking 
and use it to store the "unique" values:

begin;
update row;
if (no rows affected) {
  lock table foo in exclusive mode;
  find a unique value that is not already in foo
  store this value inside of foo
  insert row;
}
commit;

Solution three: use your strategy two, but throw a loop around it and have 
it try again (with a new value) if it gets a unique violation.


--
Greg Sabino Mullane  [EMAIL PROTECTED]
PGP Key: 0x14964AC8 200302162143







Re: [HACKERS] Hard problem with concurrency

2003-02-16 Thread Tom Lane
"Christopher Kings-Lynne" <[EMAIL PROTECTED]> writes:
> I can't try the insert and then the update because the INSERT, in Postgres,
> will cause an outright transaction failure.

Do the update, then try to insert if the update found nothing, and put
a retry loop around the whole transaction in case you fail because of
concurrent inserts.

Realistically you will need a retry loop in all but the most trivial
cases anyway --- certainly so if you want to use serializable
transaction mode.  So I don't think this solution is unworkably complex.
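
In SQL terms, the transaction to retry looks something like this (a
sketch with hypothetical table and column names; the retry loop itself
lives in the client):

BEGIN;
UPDATE members SET last_login = now() WHERE username = 'chris';
-- client checks the command tag: on "UPDATE 0", insert instead
INSERT INTO members (username, last_login) VALUES ('chris', now());
COMMIT;
-- if the INSERT dies with a unique-key error because a concurrent
-- transaction got there first, roll back and run the whole thing again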

regards, tom lane




Re: [HACKERS] location of the configuration files

2003-02-16 Thread Lamar Owen
On Sunday 16 February 2003 13:15, Tom Lane wrote:
> Sure.  I'm happy to change the software in a way that *allows* moving the
> config files elsewhere.

So we agree.  Perfect.

>  But it's not apparent to me why you insist on
> forcing people who are perfectly happy with their existing configuration
> arrangements to change them.

Me? Trying to force things to change?  You misunderstand me.  No, I'm trying
to understand the rationale for a configuration process that is different and
non-standard relative to the way other similarly designed daemons do things.
I understand better now; the exercise was a success.  Many thanks.
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11





Re: [HACKERS] location of the configuration files

2003-02-16 Thread mlw


Tom Lane wrote:

> Peter Eisentraut <[EMAIL PROTECTED]> writes:
>> Tom Lane writes:
>>> I would favor a setup that allows a -C *directory* (not file) to be
>>> specified as a postmaster parameter separately from the -D directory;
>
>> A directory is not going to satisfy people.
>
> Why not?  Who won't it satisfy, and what's their objection?
>
> AFAICS, you can either set -C to /etc if you want your PG config files
> loose in /etc, or you can set it to /etc/postgresql/ if you want them
> in a privately-owned directory.  Which other arrangements are needed?

The idea of using a "directory" puts us back to using symlinks to share 
files.

While I know the core development team thinks that symlinks are a
viable configuration option, most admins, myself included, do not like
to use symlinks because they do not have the ability to carry
documentation, i.e. comments in a configuration file, and are DANGEROUS
in a production environment.

Any configuration strategy that depends on symlinks is inadequate and 
poorly designed.



Re: [HACKERS] location of the configuration files

2003-02-16 Thread Tom Lane
mlw <[EMAIL PROTECTED]> writes:
> The idea of using a "directory" puts us back to using symlinks to share 
> files.

So?  If you want to share files, you're probably sharing all three
config files and don't need a separate directory at all.  This is
not a sufficient argument to make me buy into the mess of letting
people choose nonstandard configuration file names --- especially
when most of the opposite camp seems to be more interested in choosing
*standard* names for things.  Why does that policy stop short at the
directory name?

regards, tom lane




Re: [HACKERS] Hard problem with concurrency

2003-02-16 Thread Christopher Kings-Lynne
> Do the update, then try to insert if the update found nothing, and put
> a retry loop around the whole transaction in case you fail because of
> concurrent inserts.
>
> Realistically you will need a retry loop in all but the most trivial
> cases anyway --- certainly so if you want to use serializable
> transaction mode.  So I don't think this solution is unworkably complex.

I guess that will work, but it will not prevent our log from being spammed
up with error messages.  Also, this is a somewhat simplified case.  Some
transactions, such as our login transaction, have rather large numbers of
operations in them, and we don't want to have to roll back the whole thing.
I guess we'll have to live with it.

REPLACE INTO anyone? ;)

Chris






Re: [HACKERS] location of the configuration files

2003-02-16 Thread Kevin Brown
Tom Lane wrote:
> Keep in mind that initdb doesn't currently *need* to put the datadir
> location into the config file.  It *will* need to do so if we separate
> config and data dirs.  Or at least, *somebody* will need to do so.
> It's not apparent to me how it simplifies life not to have initdb do it.
> Especially when there are other configuration items that initdb should
> or already does record: locale settings, database encoding.  

Is it possible for the database engine to properly deal with a
database when it is told to use a different database encoding than the
one the database was initdb'd with?

If it's not, then that suggests to me that the database encoding is
something that doesn't belong in the configuration file but rather in
some other place that is intimately tied with the database itself and
which is difficult/impossible to change, like perhaps a read-only
system table that gets created at initdb time.
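
(As it happens, the encoding is already recorded per-database in the
pg_database catalog; a quick sketch of how to read it:)

SELECT datname, pg_encoding_to_char(encoding) FROM pg_database;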

> And we have already been talking about improving PG's self-tuning
> capability.  initdb would be the natural place to look around for
> information like available RAM and adjust the config-file settings
> like sort_mem accordingly.

I agree here, and since you're thinking of just putting the resulting
config file in the database data directory, then as a DBA I wouldn't
be terribly surprised by it ... especially if it came back with a
message that told me what it had done.

> If we do it the way I suggested (dump into the datadir, which is
> initially empty, same as always) then it cannot overwrite your existing
> config files.  Think of it as providing a suggested config file to
> compare against what you have.

There is one minor complication: what if there's an existing config
file in the target directory?

One use for initdb would be as a quick way to completely wipe the
database and start over (e.g., if the encoding were found to be
incorrect), but the config file that's already there could easily
contain a lot of customization that the administrator would want to
retain.  Which suggests that we should consider writing to a file
using a slightly different name (e.g., postgresql.conf.initdb), at
least in the event that a config file already exists in the target
directory.

Not sure what the overall right thing to do here is...


-- 
Kevin Brown   [EMAIL PROTECTED]




Re: [HACKERS] stats_command_string default?

2003-02-16 Thread Kevin Brown
Tom Lane wrote:
> Kevin Brown <[EMAIL PROTECTED]> writes:
> > Tom Lane wrote:
> >> Kevin Brown <[EMAIL PROTECTED]> writes:
> >>> Would it make more sense to enable stats_command_string by default?
> >> 
> >> I'd vote against it.  If we turn it on by default, people are paying
> >> for a feature they may not even know exists.  Once they find out about
> >> it and decide they want it, they can turn it on easily enough.
> 
> 
> > The difference in time over 100,000 passes was 20 seconds (44 seconds
> > with stats_command_string turned on, 24 with it turned off), for an
> > impact of 0.2 milliseconds per command executed.
> 
> In other words, more than an eighty percent penalty on simple commands.
> Not negligible in my book.

Guess that's true when looked at that way.  :-)

Then again, that's the worst possible case: a simple command that only
invokes the parser and executor, doesn't reference any tables, doesn't
call any functions, and doesn't even write to anything.  As a
percentage of the work actually done on real systems, how often are
such commands executed?

In any case, it *does* show that there is a very high penalty for the
option relative to the operations that should be much more complex,
like parsing the command.  Why in the world is the penalty so high?  I
thought it would be a simple matter of copying the command to an
element of a structure that's overlaid onto a bit of shared memory
allocated to the backend process for its statistics.  In short, a
simple memory to memory copy, with perhaps the acquisition of a write
lock on the structure.  I'd expect such an operation to take a few
microseconds at most (especially on the kind of hardware I was testing
on), but it's orders of magnitude worse.  I have trouble believing
that the locking protocols required for this operation are that
inefficient unless we're doing something drastically wrong on that
front, and it's almost impossible for me to believe that the simple
operation of copying data to a shared memory segment would be that
inefficient.



-- 
Kevin Brown   [EMAIL PROTECTED]




Re: [HACKERS] location of the configuration files

2003-02-16 Thread Kevin Brown
mlw wrote:
> symlinks suck. Sorry Tom, but they are *BAD* in a production
> server. 

Well, at least they're better than hard links.  ;-)


-- 
Kevin Brown   [EMAIL PROTECTED]




Re: [HACKERS] Hard problem with concurrency

2003-02-16 Thread Greg Stark

Hm, odd, nobody mentioned this solution:

If you don't have a primary key already, create a unique index on the
combination you want to be unique. Then:

. Try to insert the record
. If you get a duplicate key error
  then do update instead

No possibilities of duplicate records due to race conditions. If two people
try to insert/update at the same time you'll only get one of the two results,
but that's the downside of the general approach you've taken. It's a tad
inefficient if the usual case is updates, but certainly not less efficient
than doing table locks.

I'm not sure what you're implementing here. Depending on what it is you might
consider having a table of raw data that you _only_ insert into. Then you
process those results into a table with the consolidated data you're trying to
gather. I've usually found that's more flexible later because then you have
all the raw data in the database even if you only present a limited view.

-- 
greg





Re: [HACKERS] Hard problem with concurrency

2003-02-16 Thread Christopher Kings-Lynne
> If you don't have a primary key already, create a unique index on the
> combination you want to be unique. Then:
>
> . Try to insert the record
> . If you get a duplicate key error
>   then do update instead
>
> No possibilities of duplicate records due to race conditions. If two
> people try to insert/update at the same time you'll only get one of
> the two results, but that's the downside of the general approach
> you've taken. It's a tad inefficient if the usual case is updates,
> but certainly not less efficient than doing table locks.

The idea was to stop our postgres logs being spammed up with unique
constraint violation warnings... in which case your solution above is
identical to our current one: update and, if it fails, insert.  Except that
since the row is likely to already be there, our current way will be a bit
more efficient.

Chris





[HACKERS] Questions about indexes?

2003-02-16 Thread Ryan Bradetich
Hello postgres hackers,

Been a while since I have participated on this list ... but I have a new
itch to scratch

Although the table schema is immaterial, I will provide it so we have a
common framework for this discussion:


host_id     integer   (not null)
timestamp   datetime  (not null)
category    text      (not null)  [<= 5 chars]
anomaly     text      (not null)  [<= 1024 chars]

This table is used to store archived data, so each row in the table must
be unique.  Currently I am using a primary key across all the columns to
enforce this uniqueness.  This table currently has ~86 million rows and
is 16+ GB in size.  This primary key index is also 16+ GB in size,
because it appears all the data is duplicated in the index.  (I have
only done some preliminary looking at the database file with strings,
etc ... so this assumption is purely based on these observations.)

I am not sure why all the data is duplicated in the index ... but I bet
it has to do with performance, since it would save a lookup in the main
table.  Are there any benchmarks or papers related to this topic I should
locate and read?  I am curious about this because it seems the only
advantage gained is searching the index for the specified values.
Once the entry is found, the full entry needs to be pulled from the main
table anyhow, since the index does not contain all the data.  Also, with
the increased size, it seems additional pressure would be put on the
shared memory caches (no idea how this really works, just guessing! :))


Since my only requirement is that the rows be unique, I have developed a
custom MD5 function in C, and created an index on the MD5 hash of the
concatenation of all the fields.  This has reduced the disk space usage
considerably, as shown below against my test database of ~6 million rows
at 1+ GB.
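
The index definition is along these lines (a sketch: row_md5() stands
in for the custom C function, and the 7.2/7.3 functional-index syntax
only allows a function applied directly to the columns):

CREATE UNIQUE INDEX archive_md5_key
ON archive (row_md5(host_id, "timestamp", category, anomaly));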

All this data is based off the test database running 7.3.2:

Type                     Size (bytes)
-----------------------  ------------
Database table             1188642816
All-columns primary key    1510252544
MD5-column primary key      370999296

Just using the MD5 hash data instead of all the columns is a considerable
disk-space win, going from 1.5 GB to 370 MB.

Has anyone else solved this problem?  Has anyone else looked into
something like this and mind sharing, so I do not have to re-invent the
wheel? :)  Also (assuming there are no papers / benchmarks proving data
in the index is a good idea), how difficult would it be to implement an
index type that fetches the data from the main table?


Thanks for reading.  I will be happy to field any questions that I can,
or read any papers, research, etc. that relate to this topic.

- Ryan

P.S. the production database is running 7.2.4 if that makes a
difference.

-- 
Ryan Bradetich <[EMAIL PROTECTED]>





Re: [HACKERS] Questions about indexes?

2003-02-16 Thread Tom Lane
Ryan Bradetich <[EMAIL PROTECTED]> writes:
> Although the table schema is immaterial, I will provide it so we have a
> common framework for this discussion:

>   host_id     integer   (not null)
>   timestamp   datetime  (not null)
>   category    text      (not null)  [<= 5 chars]
>   anomaly     text      (not null)  [<= 1024 chars]

> This table is used to store archived data, so each row in the table must
> be unique.  Currently I am using a primary key across each column to
> enforce this uniqueness.

It's not real clear to me why you bother enforcing a constraint that the
complete row be unique.  Wouldn't a useful constraint be that the first
three columns be unique?  Even if that's not correct, what's wrong with
tolerating a few duplicates?  You can't tell me it's to save on storage
;-)
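
(That is, something like this sketch, with a hypothetical table name,
instead of a four-column primary key:)

CREATE UNIQUE INDEX archive_key ON archive (host_id, "timestamp", category);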

> I am not sure why all the data is duplicated in the index ... but I bet
> it has to do with performance since it would save a lookup in the main
> table.

An index that can't prevent looking into the main table wouldn't be
worth anything AFAICS ...

regards, tom lane




Re: [HACKERS] location of the configuration files

2003-02-16 Thread mlw


Tom Lane wrote:

> mlw <[EMAIL PROTECTED]> writes:
>> The idea of using a "directory" puts us back to using symlinks to
>> share files.
>
> So?  If you want to share files, you're probably sharing all three
> config files and don't need a separate directory at all.  This is
> not a sufficient argument to make me buy into the mess of letting
> people choose nonstandard configuration file names --- especially
> when most of the opposite camp seems to be more interested in choosing
> *standard* names for things.  Why does that policy stop short at the
> directory name?

symlinks suck. Sorry Tom, but they are *BAD* in a production server. You
cannot add comments to symlinks. Most of the admins I know, myself
included, HATE symlinks and use them as a last resort. Requiring
symlinks is just pointless; we are talking about a few lines of code that
have nothing to do with performance.

The patch that I submitted allows PostgreSQL to work as it always has, 
but adds the ability for a configuration file to do what is normally 
done with fixed names in $PGDATA.

I have said before, I do not like policy, I like flexibility; forcing a
directory is just as restrictive as requiring the files in $PGDATA.

Why is this such a problem? MANY people want to configure PostgreSQL
this way; the patch I submitted allows it, but does not force
anything. Any configuration solution that requires symlinks is flawed.




Re: [HACKERS] Questions about indexes?

2003-02-16 Thread Ryan Bradetich
On Sun, 2003-02-16 at 23:34, Tom Lane wrote:
> Ryan Bradetich <[EMAIL PROTECTED]> writes:
> > Although the table schema is immaterial, I will provide it so we have a
> > common framework for this discussion:
> 
> > host_id     integer   (not null)
> > timestamp   datetime  (not null)
> > category    text      (not null)  [<= 5 chars]
> > anomaly     text      (not null)  [<= 1024 chars]
> 
> > This table is used to store archived data, so each row in the table must
> > be unique.  Currently I am using a primary key across each column to
> > enforce this uniqueness.
> 
> It's not real clear to me why you bother enforcing a constraint that the
> complete row be unique.  Wouldn't a useful constraint be that the first
> three columns be unique?  Even if that's not correct, what's wrong with
> tolerating a few duplicates?  You can't tell me it's to save on storage
> ;-)

The table holds system policy compliance data.  The category is
basically the policy, and the anomaly is the detailed text explaining
why the system is out of compliance.  So the anomaly data is important
(and often the reason why the key is unique).  The reason we are
archiving the data is to generate reports and graphs showing policy
compliance over time.  Duplicated rows would artificially inflate the
numbers in the reports and graphs.  The other option we had was to
perform a DISTINCT select at report / graph time; we chose not to go this
route because of the sort added to the query.  (Also, it just seemed
tidier to only store good data :))

The disk storage is a minor concern :), but I was actually looking at it
as a possible performance enhancement.  I am curious how it affects the
shared buffer cache, and also there should be fewer pages on average to read
since the index size was smaller.

Does this make sense? Or am I out in left field again? :)

> > I am not sure why all the data is duplicated in the index ... but I bet
> > it has to do with performance since it would save a lookup in the main
> > table.
> 
> An index that can't prevent looking into the main table wouldn't be
> worth anything AFAICS ...

Ok, scratch that idea then :)  I will continue looking at other ideas
like the MD5 data hashing etc.  

Thanks for your input Tom!

- Ryan

-- 
Ryan Bradetich <[EMAIL PROTECTED]>





Re: [HACKERS] Questions about indexes?

2003-02-16 Thread Tom Lane
Ryan Bradetich <[EMAIL PROTECTED]> writes:
> On Sun, 2003-02-16 at 23:34, Tom Lane wrote:
>> It's not real clear to me why you bother enforcing a constraint that the
>> complete row be unique.  Wouldn't a useful constraint be that the first
>> three columns be unique?

> The table holds system policy compliance data.  The category is
> basically the policy, and the anomaly is the detailed text explaining
> why the system is out of compliance.  So the anomaly data is important
> (and often the reason why the key is unique).

Well, sure the anomaly is important: it's the payload, the reason why
you bother to have the table in the first place.  But that doesn't mean
it's part of the key.  Generally the key would be the info you use to
look up a particular anomaly text.  In this example, it's not clear to
me why you'd need/want two different anomaly texts entered for the same
host_id and the same category at the same instant of time.  ISTM there's
something inadequate about your category column if you need that.

regards, tom lane




Re: [HACKERS] Questions about indexes?

2003-02-16 Thread Ryan Bradetich
On Mon, 2003-02-17 at 00:15, Tom Lane wrote:
> Ryan Bradetich <[EMAIL PROTECTED]> writes:
> > On Sun, 2003-02-16 at 23:34, Tom Lane wrote:
> >> It's not real clear to me why you bother enforcing a constraint that the
> >> complete row be unique.  Wouldn't a useful constraint be that the first
> >> three columns be unique?
> 
> > The table holds system policy compliance data.  The category is
> > basically the policy, and the anomaly is the detailed text explaining
> > why the system is out of compliance.  So the anomaly data is important
> > (and often the reason why the key is unique).
> 
> Well, sure the anomaly is important: it's the payload, the reason why
> you bother to have the table in the first place.  But that doesn't mean
> it's part of the key.  Generally the key would be the info you use to
> look up a particular anomaly text.  In this example, it's not clear to
> me why you'd need/want two different anomaly texts entered for the same
> host_id and the same category at the same instant of time.  ISTM there's
> something inadequate about your category column if you need that.

Ok, I understand what you are asking now :)

Let me make up a contrived example to show how the table is used.

host_id 1 = hosta.somewhere.com
host_id 2 = hostb.somewhere.com

The categories are coded like so (made-up examples):
cat p101 = /etc/passwd check
cat f101 = filesystem check.

the table would look like:
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has expired password.
2 | Mon Feb 17 00:34:24 MST 2003 | f101 | file /foo has improper owner.
etc...

So I do not need the anomaly to be part of the index; I only need it to
guarantee that the rows are unique.

I agree with you that I would not normally add the anomaly to the
index, except for the unique row requirement.  Thinking about it now,
maybe I should guarantee unique rows via a check constraint...

Thanks for making me think about this in a different way!

- Ryan

>   regards, tom lane
-- 
Ryan Bradetich <[EMAIL PROTECTED]>

