Re: [HACKERS] libpq and psql not on same page about SIGPIPE

2004-12-02 Thread Manfred Spraul
Tom Lane wrote:
Not really: it only solves the problem *if you change the application*,
which is IMHO not acceptable.  In particular, why should a non-threaded
app expect to have to change to deal with this issue?  But we can't
safely build a thread-safe libpq.so for general use if it breaks
non-threaded apps that haven't been changed.
 

No. Non-threaded apps do not need to change. The default is the old 7.3 
code: change the signal handler around the write calls. This means that 
non-threaded apps are guaranteed to work without any changes, regardless 
of the libpq thread safety setting.
Threaded apps would have to change, but how many threaded apps use 
libpq? They have to check their code anyway: either just add PQinitLib() or 
review (and potentially update) their signal handling code if it matches 
any of the gotchas of the transparent handling.

--
   Manfred
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] libpq and psql not on same page about SIGPIPE

2004-12-01 Thread Manfred Spraul
Bruce Momjian wrote:
Comments?  This seems like our only solution.
 

This would be a transparent solution. Another approach would be:
- Use the old 7.3 approach by default. This means perfect backward 
compatibility for single-threaded apps and broken multithreaded apps.
- Add a new PQinitDB(int disableSigpipeHandler) initialization function. 
Document that multithreaded apps must call the function with 
disableSigpipeHandler=1 and handle SIGPIPE for libpq. Perhaps with a 
reference implementation in libpq (i.e. a sigpipeMode with 0 for the old 
approach, 1 for do nothing, 2 for install our own handler).

I would prefer that approach:
It means that the multithreaded libpq apps must be updated [are there 
any?], but the solution is simpler and less fragile than calling 4 
signal handling functions in a row to selectively block SIGPIPE per-thread.

--
   Manfred


Re: [HACKERS] libpq and psql not on same page about SIGPIPE

2004-12-01 Thread Manfred Spraul
Tom Lane wrote:
Bruce Momjian [EMAIL PROTECTED] writes:
 

His idea of pthread_sigmask/send/sigpending/sigwait/restore-mask.  Seems
we could also check errno for SIGPIPE rather than calling sigpending.
   

 

He has a concern about an application that already blocked SIGPIPE and
has a pending SIGPIPE signal waiting already.  One idea would be to
check for sigpending() before the send() and clear the signal only if
SIGPIPE wasn't pending before the call.  I realize that if our send()
also generates a SIGPIPE it would remove the previous realtime signal
info but that seems a minor problem.
   

Supposing that we don't touch the signal handler at all, then it is
possible that the application has set it to SIG_IGN, in which case a
SIGPIPE would be discarded rather than going into the pending mask.
So I think the logic has to be:
    pthread_sigmask to block SIGPIPE and save existing signal mask
    send();
    if (errno == EPIPE)
    {
        if (sigpending indicates SIGPIPE pending)
            use sigwait to clear SIGPIPE;
    }
    pthread_sigmask to restore prior signal mask
The only case that is problematic is where the application had
already blocked SIGPIPE and there is a pending SIGPIPE signal when
we are entered, *and* we get SIGPIPE ourselves.
If the C library does not support queued signals then our sigwait will
clear both our own EPIPE and the pending signal.  This is annoying but
it doesn't seem fatal --- if the app is writing on a closed pipe then
it'll probably try it again later and get the signal again.
If the C library does support queued signals then we will read the
existing SIGPIPE condition and leave our own signal in the queue.  This
is no problem to the extent that one pending SIGPIPE looks just like
another --- does anyone know of platforms where there is additional info
carried by a SIGPIPE event?
 

Linux stores pid/uid together with the signal. pid doesn't matter and no 
sane programmer will look at the uid, so it seems to be possible.

This seems workable as long as we document the possible gotchas.
 

Is that really worthwhile? There are half a dozen assumptions about the 
C library and the kernel-internal efficiency of the signal handling 
functions in the proposal. Adding a PQinitLib function is obviously a 
larger change, but it solves the problem.
I'm aware of one minor gotcha: PQinSend() is not usable right now: it 
relies on the initialization of pq_thread_in_send, which is only created 
in the middle of the first connectDB(). That makes proper signal 
handling for the first connection impossible.

--
   Manfred


Re: [HACKERS] Why frequently updated tables are an issue

2004-10-20 Thread Manfred Spraul
[EMAIL PROTECTED] wrote a few months ago:
PostgreSQL's behavior on these cases is poor. I don't think anyone who has
tried to use PG for this sort of thing will disagree, and yes it is
getting better. Does anyone else consider this to be a problem? If so, I'm
open for suggestions on what can be done. I've suggested a number of
things, and admittedly they have all been pretty weak ideas, but they were
potentially workable.
 

What about a dblink style interface to a non-MVCC SQL database?
I think someone on this list mentioned that there are open source 
in-memory SQL databases.

--
   Manfred


Re: [HACKERS] tweaking MemSet() performance - 7.4.5

2004-09-25 Thread Manfred Spraul
[EMAIL PROTECTED] wrote:
If the memset 
bypasses the cache then the following access will cause a cache line 
miss, which can be so slow that using the faster memset can result in a 
net performance loss.

   

Could you suggest some structs to test? If I get your meaning, I would make a loop that sets then reads from the structure. 

 

Read the sources and the cpu specs. Benchmarking such problems is 
virtually impossible.
I don't have OS-X, thus I checked the Linux-kernel sources: It seems 
that the power architecture doesn't have the same problem as x86.
There is a special clear cacheline instruction for large memsets and the 
rest is done through carefully optimized store byte/halfword/word/double 
word sequences.

Thus I'd check what happens if you memset not perfectly aligned buffers. 
That's another point where over-optimized functions sometimes break 
down. If there is no slowdown, then I'd replace the postgres function 
with the OS provided function.

I'd add some __builtin_constant_p() optimizations, but I guess Tom won't 
like gcc hacks ;-)
--
   Manfred



Re: [HACKERS] tweaking MemSet() performance - 7.4.5

2004-09-18 Thread Manfred Spraul
Marc Colosimo wrote:
Oops, I used the same setting as in the old hacking message (-O2, gcc 
3.3). If I understand what you are saying, then it turns out yes, PG's 
MemSet is faster for smaller blocksizes (see below, between 32 and 
64). I just replaced the whole MemSet with memset and it is not very 
low when I profile.
Could you check what the OS-X memset function does internally?
One trick to speed up memset is to bypass the cache and bulk-write 
directly from write buffers to main memory. i386 cpus support that and 
in microbenchmarks it's 3 times faster (or something like that). 
Unfortunately it's a loss in real-world tests: Typically a structure is 
initialized with memset and then immediately accessed. If the memset 
bypasses the cache then the following access will cause a cache line 
miss, which can be so slow that using the faster memset can result in a 
net performance loss.

I could squeeze more out of it if I spent more time trying to 
understand it (change MEMSET_LOOP_LIMIT to 32 and then add memset 
after that?). I'm now working on understanding Spin Locks and 
friends. Putting in a sync call (in s_lock.h) is really a time killer 
and bad for performance (it takes up 35 cycles).

That's the price you pay for weakly ordered memory access.
Linux on ppc uses eieio, on ppc64 lwsync is used. Could you check if 
they are faster?

--
   Manfred


Re: [HACKERS] futex

2004-08-25 Thread Manfred Spraul
Josh Berkus wrote:
Gaetano,
 

I knew there was an evaluation on the futex vs spinlock,
and Josh Berkus on IRC told me that there was only a 20%
performance increase, is this increase to throw away ?
   

Before we get totally off track here 
I evaluated futexes strictly as an attempt to solve the context switch storm 
bug.   I did NOT test whether they improved performance overall.

 

What did you test exactly and could you explain a bit about the context 
switch storm?
Did you use the futex interface directly or pthread_rwlock_rdlock?

--
   Manfred


Re: [HACKERS] fsync and hardware write cache

2004-08-23 Thread Manfred Spraul
[EMAIL PROTECTED] wrote:
Something to think about:
if you run PostgreSQL with fsync on, but you use the hardware write cache
on your disk drives, how likely are you to lose data? Obviously, this is a
fairly limited problem, as it only applies to power down (which you can
control) or power loss where the risks may be reduced but not eliminated
with a UPS.
Does it make sense to add a platform specific call that will flush a write
cache when fsync is enabled?
 

Pete Zaitsev from mysql wrote that there is a special call on Mac OS:
Quoting him:
Mac OS X also has this optimization, but at least it provides an
alternative flush method for Database Servers:
fcntl(fd, F_FULLFSYNC, NULL)
can be used instead of fsync() to get true fsync() behavior. 

I couldn't confirm this with a quick google search - perhaps someone 
with MacOS docs (or mysql sources) should check it.
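
If the flag exists, a wrapper along these lines would be enough. This is a sketch, not tested on a Mac: flush_to_platter is a made-up name, and the code simply falls back to plain fsync() wherever F_FULLFSYNC is not defined or the call fails.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: ask for a full flush where the platform offers one. */
static int
flush_to_platter(int fd)
{
#ifdef F_FULLFSYNC
    /* Mac OS X: request that the drive's write cache is flushed as well. */
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* Not supported for this file: fall through to the ordinary call. */
#endif
    return fsync(fd);
}
```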

What might be useful is a test tool that benchmarks fsync: if it's 
faster than the rotational speed of a 15k rpm disk then probably someone 
caches the write calls.

--
   Manfred


Re: [HACKERS] NOT LOGGED options (was Point in Time Recovery )

2004-08-18 Thread Manfred Spraul
[EMAIL PROTECTED] wrote:
Tom Lane wrote
   

NOT LOGGED options on CREATE INDEX and COPY, to allow users to take
advantage of the no logging optimization without turning off PITR system
wide. (Just as this is possible in Oracle and Teradata).
 

Isn't this in direct conflict with your opinion above?  And I cannot say
that I think this one is a good idea.  We do not have support for
selective catalog xlogging;
Is it possible to skip the xlog fsync for NOT LOGGED transactions?
--
   Manfred


Re: [HACKERS] hot spare / log shipping work on

2004-08-13 Thread Manfred Spraul
Gaetano Mendola wrote:
a1) If exist check that is a 16MB file ( the request can
arrive during the copy ),
I think this will fail under windows: copy first sets the file size 
and then transfers the data. I wouldn't rule out that some Unices use 
the same implementation.

a2) If the file not exist this mean that is not yet recycled and
is a partial file present on the partial directory,
check if the alive file is older then 2 minutes.
   a21) If the file is older than 2 minutes I assume that
the master is dead:
I'd concentrate on cold failover: the user (or the OS) must call a 
script to cause a fail-over. The tricky part is the various partial 
connection losses between master and spare: perhaps the alive file is 
not updated anymore due to a net split, but the master is still alive. 
Unless you are really careful, both master and spare could run.

I think SAP DB / MaxDB supports failover - perhaps it would be 
interesting to check their failover scripts.

--
   Manfred


Re: [HACKERS] fsync vs open_sync

2004-08-10 Thread Manfred Spraul
[EMAIL PROTECTED] wrote:
I have been considering a full sweep in my test lab off client time later on.
ext2, ext3, jfs, xfs, and ReiserFS, fsync on with fdatasync or open_sync,
and fsync off.
 

Before you start: double check that the disks are not lying:
At least the SuSE 2.4 kernel sends cache flush commands to IDE disks on 
fsync(), but not with O_SYNC:

http://marc.theaimsgroup.com/?l=linux-kernel&m=107964507113585
--
   Manfred


Re: [HACKERS] switch WAL segment

2004-08-09 Thread Manfred Spraul
Andreas Pflug wrote:
Tom Lane wrote:
Do we have a TODO for allowing users to
force switching to a new WAL file segment?

Together with PITR, this might make sense?
Another idea:
Has anyone tried to put the WAL segment directory on a cluster 
filesystem and use that for cold (perhaps even hot) failover?
The archive script could apply completed wal segments to the backup 
node. If the primary node fails, the last (partial) segment is applied 
as well and the backup node is activated.

--
   Manfred



Re: [HACKERS] fsync vs open_sync

2004-08-09 Thread Manfred Spraul
Tom Lane wrote:
[EMAIL PROTECTED] writes:
 

The improvements were REALLY astounding, and I would like to know if other
Linux users see this performance increase, I mean, it is almost 8~10 times
faster than using fsync.
Furthermore, it seems to also have the added benefit of reducing the I/O
storm at checkpoints over a system running with fsync off.
   

What size transactions are you using in your tests?
For a system with small transactions (not much more than 1 page worth of
WAL traffic per transaction) I'd be pretty surprised if there was any
real difference at all.  There certainly should not be any difference in
terms of the number of physical writes.  We have seen some platforms
where fsync() is inefficiently implemented and requires more kernel
overhead than is reasonable --- not for I/O, but just to look through
the kernel buffers and confirm that none of them need flushing.  But I
didn't think Linux was one of these.
 

IDE or scsi? If IDE: Write cache on or off? Which 2.4 kernel?
The numbers are very high - it could be a side effect of write caching 
by the disks. I think some Suse 2.4 kernels have partial support for 
reliable fsync even if the write cache is on (i.e. fsync issues a cache 
flush command to the disk), but not all code paths are handled. Perhaps 
fsync is handled and O_SYNC is not handled.
I could try to find the details.
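
To make clear which two code paths I mean, here is a small sketch of both write methods. The function names are made up; the only difference is whether the sync is requested with a separate fsync() call or through the O_SYNC open flag, and a kernel can handle the cache flush in one path but not the other.

```c
#include <fcntl.h>
#include <unistd.h>

/* Path 1: ordinary write, then an explicit fsync(). */
static int
write_then_fsync(const char *path, const void *buf, size_t len)
{
    int         fd = open(path, O_WRONLY | O_CREAT, 0600);
    int         ok;

    if (fd < 0)
        return -1;
    ok = (write(fd, buf, len) == (ssize_t) len) && (fsync(fd) == 0);
    close(fd);
    return ok ? 0 : -1;
}

/* Path 2: the file is opened with O_SYNC, each write() is synchronous. */
static int
write_with_o_sync(const char *path, const void *buf, size_t len)
{
    int         fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
    int         ok;

    if (fd < 0)
        return -1;
    ok = (write(fd, buf, len) == (ssize_t) len);
    close(fd);
    return ok ? 0 : -1;
}
```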

--
   Manfred


Re: [HACKERS] xeon processors

2004-07-01 Thread Manfred Spraul
Christopher Browne wrote:
The fix for this problem is to rewrite all of your applications so
that they become conscious of which bits of memory they're using so
they can tune their own behaviour.  This, of course, requires
discarding useful notions such as virtual memory that are _assumed_
by most modern operating systems.
 

This is misleading: PAE means that a 32-bit cpu can have more than 4 GB of 
physical memory. Each process can map at most 4 (in reality: ~2) GB of memory.
Many databases manage their own, huge buffer pool and read/write the 
database tables with O_DIRECT. These apps must support buffer pools > 2 
GB, which requires some work. Linux and Solaris contain a special 
syscall that helps Oracle to manage its buffer pool for such setups 
(remap_page_rage()).
OTOH postgres has a small user space buffer pool, the majority of the 
file buffers are handled by the OS. Thus no changes are required inside 
postgres for PAE, all it needs is an OS that supports PAE for the buffer 
pool.

Regarding hyperthreading: I'm aware of two changes:
- busy loops must contain PAUSE instructions. Postgres does that.
- virtual aliases should be avoided: If two processes access memory at 
the same virtual address, then this can cause cache collisions and then 
misses. I think this is handled by the C library by randomizing the 
return addresses of malloc() and Intel mitigated the issue by improving 
the cache.
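
For the first point, a busy loop with PAUSE looks roughly like this. It is a generic C11 sketch and not the postgres s_lock code; demo_spinlock and cpu_relax are names I made up, and the PAUSE is only emitted on x86 with gcc-compatible compilers.

```c
#include <stdatomic.h>

/* PAUSE tells a hyperthreaded cpu that this thread is only spinning. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define cpu_relax() __builtin_ia32_pause()
#else
#define cpu_relax() ((void) 0)
#endif

/* Hypothetical minimal spinlock, for illustration only. */
typedef struct
{
    atomic_flag locked;
} demo_spinlock;

static void
demo_spin_lock(demo_spinlock *l)
{
    /* Spin until the flag was clear; relax the cpu on every failed attempt. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        cpu_relax();
}

static void
demo_spin_unlock(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```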

--
   Manfred


Re: [HACKERS] [PATCHES] Compiling libpq with VisualC

2004-06-13 Thread Manfred Spraul
[EMAIL PROTECTED] wrote:
What is the recommended way to create mutex objects (CreateMutex) from
Win32 libraries?  There must be a clean way like there is in pthreads.
   

A mutex is inherently a global object. CreateMutex(NULL, FALSE, NULL) will
return a handle to an unowned mutex.
 

That's not the problem. Under pthreads, it's possible to initialize a 
mutex at compile time:

   static pthread_mutex_t init_mutex = PTHREAD_MUTEX_INITIALIZER;
This means that the mutex is immediately valid, no races with the 
initialization. I couldn't find an equivalent Win32 feature.
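
To show why this matters, a short sketch of one-time initialization guarded by such a mutex. The flag, the counter and ensure_initialized are hypothetical names; the point is that no thread ever has to create the mutex at run time, so two threads cannot race on creating it.

```c
#include <pthread.h>

/* Valid from program start, no run-time creation step needed. */
static pthread_mutex_t init_mutex = PTHREAD_MUTEX_INITIALIZER;

static int  initialized = 0;    /* hypothetical "library is set up" flag */
static int  init_calls = 0;     /* counts how often the setup really ran */

static void
ensure_initialized(void)
{
    pthread_mutex_lock(&init_mutex);
    if (!initialized)
    {
        init_calls++;           /* the one-time setup work would go here */
        initialized = 1;
    }
    pthread_mutex_unlock(&init_mutex);
}
```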

--
   Manfred


Re: [HACKERS] Table Spaces

2004-05-17 Thread Manfred Spraul
Bruce Momjian wrote:
The only downside to removal is that folks without symlinks (I believe
Win32 only) will lose that functionality with nothing to replace it. 
However, I think the clarity of removing it is worth it.  Also, I think
someone had a special way to do symlinks on Win32 and we should look
into that.
 

Windows 2000 and later support mount points - you can attach a new 
partition as C:\pgsql\data\xlog instead of D:\. That might be enough for 
most users. IIRC there was a tool to create arbitrary links, but it was 
removed just before W2K final.

--
   Manfred



Re: [HACKERS] Linux 2.6.6 also

2004-05-12 Thread Manfred Spraul
Gregory Stark wrote:

This patch also looks relevant to Postgres for two reasons. 

This part seems like it might expose some bugs that otherwise might have
remained hidden:
  This affects I/O scheduling potentially quite significantly.  It is no
  longer the case that the kernel will submit pages for I/O in the order in
  which the application dirtied them.  We instead submit them in file-offset
  order all the time.
The part about part-file fdatasync calls seems like could be really useful.
It seems like that's just speculation about future directions though?
 

Correct. The kernel could do that now, but it's not exposed to user space.

But the change highlights one point: the order in which file blocks are 
written to disk is undefined. Theoretically the wal checkpoint record 
could be on the platter while the preceding pages have not been written yet.
Is that case handled by the wal replay code?

--
   Manfred


Re: [HACKERS] Flush to Disk

2004-03-28 Thread Manfred Spraul
Diego Montenegro wrote:

Hello all,

Can anyone point me to where in the code does Postgres Flush all the
Data to disk???
When XLogFlush is called, it only flushes the XLOG to disk, right? Does
the entire Data get flushed at the same time as the Log? 
 

in src/backend/storage/smgr/md.c, mdsync(): During a checkpoint, the 
whole system cache is synced to the disk.
Note that checkpoints should be rare - I think every few minutes. The 
xlog contains enough data to recover a transaction after a system crash, 
therefore only the xlog is forced to the disk during transaction commit.

--
   Manfred


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-25 Thread Manfred Spraul
[EMAIL PROTECTED] wrote:

Compare file sync methods with one 8k write:
   (o_dsync unavailable)
   open o_sync, write    6.270724
   write, fdatasync     13.275225
   write, fsync,        13.359847
 

Odd. Which filesystem, which kernel? It seems fdatasync is broken and 
syncs the inode, too.

--
   Manfred


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-24 Thread Manfred Spraul
Tom Lane wrote:

[EMAIL PROTECTED] writes:
 

I could certainly do some testing if you want to see how DBT-2 does.
Just tell me what to do. ;)
   

Just do some runs that are identical except for the wal_sync_method
setting.  Note that this should not have any impact on SELECT
performance, only insert/update/delete performance.
 

I've made a test run that compares fsync and fdatasync. The performance 
was identical:
- with fdatasync:
http://khack.osdl.org/stp/290607/
- with fsync:
http://khack.osdl.org/stp/290483/

I don't understand why. Mark - is there a battery backed write cache in 
the raid controller, or something similar that might skew the results? 
The test generates quite a lot of wal traffic - around 1.5 MB/sec. 
Perhaps the writes are so large that the added overhead of syncing the 
inode is not noticeable?
Is the pg_xlog directory on a separate drive?

Btw, it's possible to request such tests through the web-interface, see
http://www.osdl.org/lab_activities/kernel_testing/stp/script_param.html
--
   Manfred


Re: [HACKERS] Why O_SYNC is faster than fsync on ext3

2004-03-21 Thread Manfred Spraul
Yusuf Goolamabbas wrote:

I sent this to Bruce but forgot to cc pgsql-hackers, The patches are
likely to go into 2.6.6. People interested in extremely safe fsync
writes should also follow the IDE barrier thread and the true fsync() in
Linux on IDE thread
 

Actually the most interesting part of the thread was the initial post 
from Peter Zaitsev on fcntl(fd, F_FULLFSYNC, NULL): He wrote that this 
is necessary on Mac OS X to force a flush of the write caches in the 
disks. Unfortunately I can't find anything about this flag with google.

Another interesting point is that right now, IDE write caches must be 
disabled for reliable fsync operations with Linux. Recent SuSE kernels 
contain partial support. If the existing patches are completed and 
merged, it will be safe to enable write caching.

Perhaps Bruce's cache flush test could be modified slightly to check 
that the OS isn't lying about fsync: if fsync is faster than the 
rotational delay of the disks, then the setup is not suitable for 
postgres. This could be recommended as a setup test in the install document.

--
   Manfred


Re: [HACKERS] WAL write of full pages

2004-03-15 Thread Manfred Spraul
Marty Scholes wrote:

2. Put them on an actual (or mirrored actual) spindle
Pros:
* Keeps WAL and data file I/O separate
Cons:
* All of the non array drives are still slower than the array
Are you sure this is a problem? The dbt-2 benchmarks from osdl run on an 
8-way Intel computer with several raid arrays distributed over 40 disks. 
IIRC it generates around 1.5 MB of wal logs per second - well within the 
capability of a single drive. My laptop can write around 10 MB/sec 
(measured with dd if=/dev/zero of=fill and vmstat), fast drives should 
be above 20 MB/sec.
How much wal data is generated by large postgres setups? Are there any 
setups that are limited by the wal logs?

--
   Manfred




Re: [HACKERS] libpq thread safety

2004-03-14 Thread Manfred Spraul
Bruce Momjian wrote:

Your patch has been added to the PostgreSQL unapplied patches list at:

	http://momjian.postgresql.org/cgi-bin/pgpatches

I will try to apply it within the next 48 hours.
 

You are too fast: the patch was a proof of concept, not really tested 
(actually quite buggy).
Attached are two patches:

- ready-sigpipe: check_sigpipe_handler skips pthread_key_create if a 
signal handler was installed. This is wrong - the key is always required.
- ready-locking: locking around kerberos and openssl.

The patches pass the regression tests on i386 linux. Kerberos is 
untested, ssl only partially tested due to the lack of a test setup.
I'm still not sure if the new code is the right thing for the openssl 
initialization: libpq calls SSL_library_init() unconditionally. If the 
calling app uses ssl, too, this might confuse openssl.

Could you replace my initial proposal with these two patches?

Btw, is it intentional that THREAD_SUPPORT is not set in src/template/linux?

--
   Manfred
Index: src/backend/libpq/md5.c
===
RCS file: /projects/cvsroot/pgsql-server/src/backend/libpq/md5.c,v
retrieving revision 1.22
diff -c -r1.22 md5.c
*** src/backend/libpq/md5.c 29 Nov 2003 19:51:49 -  1.22
--- src/backend/libpq/md5.c 14 Mar 2004 10:46:54 -
***
*** 271,277 
  static void
  bytesToHex(uint8 b[16], char *s)
  {
!   static char *hex = "0123456789abcdef";
int q,
w;
  
--- 271,277 
  static void
  bytesToHex(uint8 b[16], char *s)
  {
!   static const char *hex = "0123456789abcdef";
int q,
w;
  
Index: src/interfaces/libpq/fe-auth.c
===
RCS file: /projects/cvsroot/pgsql-server/src/interfaces/libpq/fe-auth.c,v
retrieving revision 1.89
diff -c -r1.89 fe-auth.c
*** src/interfaces/libpq/fe-auth.c  7 Jan 2004 18:56:29 -   1.89
--- src/interfaces/libpq/fe-auth.c  14 Mar 2004 10:46:55 -
***
*** 590,595 
--- 590,596 
  
case AUTH_REQ_KRB4:
  #ifdef KRB4
+   pglock_thread();
if (pg_krb4_sendauth(PQerrormsg, conn->sock,
   (struct sockaddr_in *) &conn->laddr.addr,
   (struct sockaddr_in *) &conn->raddr.addr,
***
*** 597,604 
--- 598,607 
{
snprintf(PQerrormsg, PQERRORMSG_LENGTH,
libpq_gettext("Kerberos 4 authentication failed\n"));
+   pgunlock_thread();
return STATUS_ERROR;
}
+   pgunlock_thread();
break;
  #else
snprintf(PQerrormsg, PQERRORMSG_LENGTH,
***
*** 608,620 
--- 611,626 
  
case AUTH_REQ_KRB5:
  #ifdef KRB5
+   pglock_thread();
if (pg_krb5_sendauth(PQerrormsg, conn->sock,
 hostname) != STATUS_OK)
{
snprintf(PQerrormsg, PQERRORMSG_LENGTH,
libpq_gettext("Kerberos 5 authentication failed\n"));
+   pgunlock_thread();
return STATUS_ERROR;
}
+   pgunlock_thread();
break;
  #else
snprintf(PQerrormsg, PQERRORMSG_LENGTH,
***
*** 722,727 
--- 728,734 
if (authsvc == 0)
return NULL;    /* leave original error message in place */
  
+   pglock_thread();
  #ifdef KRB4
if (authsvc == STARTUP_KRB4_MSG)
name = pg_krb4_authname(PQerrormsg);
***
*** 759,763 
--- 766,771 
  
if (name && (authn = (char *) malloc(strlen(name) + 1)))
strcpy(authn, name);
+   pgunlock_thread();
return authn;
  }
Index: src/interfaces/libpq/fe-connect.c
===
RCS file: /projects/cvsroot/pgsql-server/src/interfaces/libpq/fe-connect.c,v
retrieving revision 1.268
diff -c -r1.268 fe-connect.c
*** src/interfaces/libpq/fe-connect.c   10 Mar 2004 21:12:47 -  1.268
--- src/interfaces/libpq/fe-connect.c   14 Mar 2004 10:46:56 -
***
*** 2902,2908 
  PQsetClientEncoding(PGconn *conn, const char *encoding)
  {
charqbuf[128];
!   static char query[] = "set client_encoding to '%s'";
PGresult   *res;
int status;
  
--- 2902,2908 
  

Re: [HACKERS] libpq thread safety

2004-03-14 Thread Manfred Spraul
Bruce Momjian wrote:

How can we test if libpq needs to call that?  Seems that is an issue
whether we are threaded or not, no?
 

I think it's always an issue: in the non-threaded case, it's just not 
fatal. At least some openssl init functions are protected with "if 
(done) return; done = 1;", and in the worst case, it's a memory leak.
With threaded apps, it might corrupt a concurrent ssl transaction. 
Perhaps PQenableSSLLocks could handle that case, too - a special flag 
to skip SSL_library_init().

There is a new test program in src/tools/thread that needs to be run for
every platform for 7.5.  We can't use the 7.4.X tests because it didn't
report individual function tests, just one general value.  We need
individual test reports for 7.5.  Run the test program and post the
results and I will get it updated.  The test output on my bsd/os machine
is:
 

RedHat Fedora Core 1 and Debian 3.0 both report


Make sure you have added any needed 'THREAD_CPPFLAGS' and 'THREAD_LIBS'
defines to your template/$port file before compiling this program.
Add this to your template/$port file:

STRERROR_THREADSAFE=yes
GETPWUID_THREADSAFE=no
GETHOSTBYNAME_THREADSAFE=no

The uname's are
Linux snip 2.4.25-1-686 #1 Tue Feb 24 10:55:59 EST 2004 i686 unknown 
unknown GNU/Linux
and
Linux ab 2.4.22-1.2174.nptl #1 Wed Feb 18 16:38:32 EST 2004 i686 i686 
i386 GNU/Linux

Both glibc 2.3.2, one with nptl, one with linuxthreads as the pthread 
library.

--
   Manfred


Re: [HACKERS] Log rotation

2004-03-14 Thread Manfred Spraul
Bruce Momjian wrote:

Which basically shows one fsync, no O_SYNC's, and setting of the flag
only for klog reads.
 

Which sysklogd do you look at? The version from RedHat 9 contains this 
block:

/*
 * Crack a configuration file line
 */
void cfline(line, f)
    char *line;
    register struct filed *f;
{
    register char *p;
[snip]
    if (*p == '-')
    {
        syncfile = 0;
        p++;
    } else
        syncfile = 1;
[snip]
    if (syncfile)
        f->f_flags |= SYNC_FILE;
And the fsync depends on SYNC_FILE. As documented in man syslog.conf:

   You may prefix each entry with the minus ``-'' sign to omit syncing the
   file after every logging.  Note that you might lose information if the
   system crashes right behind a write attempt.  Nevertheless this might
   give you back some performance, especially if you run programs that use
   logging in a very verbose manner.


It's sysklogd-1.4.1rh, I'm not sure which parts of it are Red Hat specific.

--
   Manfred




Re: [HACKERS] libpq thread safety

2004-03-12 Thread Manfred Spraul
Bruce Momjian wrote:

What killed the idea of doing ssl or kerberos locking inside libpq was
that there was no way to be sure that outside code didn't also access
those routines.
A callback based implementation can handle that: libpq has a default 
implementation for apps that do not use openssl or kerberos themselves. If 
the app wants to use the libraries, too, then it must replace the hooks 
with its own locks.

I've attached a simple proposal, just for kerberos 4. If you agree on 
the general approach, I'll add it to all functions that are not thread safe.
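
Seen from the application, the hook idea works roughly as in this self-contained sketch. lock_hook_t and register_lock_hook are stand-ins modeled on the pgthreadlock_t and PQregisterThreadLock of the attached patch, and app_lock is a hypothetical application-side lock; none of this is an existing libpq interface.

```c
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for the hook type: called with true to lock, false to unlock. */
typedef void (lock_hook_t) (bool acquire);

/* Library default, used when the app does not touch Kerberos/OpenSSL itself. */
static void
default_lock(bool acquire)
{
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    if (acquire)
        pthread_mutex_lock(&m);
    else
        pthread_mutex_unlock(&m);
}

static lock_hook_t *current_hook = default_lock;

/* Install a new hook (NULL restores the default) and return the old one. */
static lock_hook_t *
register_lock_hook(lock_hook_t *newhook)
{
    lock_hook_t *prev = current_hook;

    current_hook = newhook ? newhook : default_lock;
    return prev;
}

/* Application side: the lock the app already uses for its own Kerberos calls. */
static pthread_mutex_t app_krb_mutex = PTHREAD_MUTEX_INITIALIZER;
static int  app_lock_calls = 0;

static void
app_lock(bool acquire)
{
    if (acquire)
    {
        pthread_mutex_lock(&app_krb_mutex);
        app_lock_calls++;
    }
    else
        pthread_mutex_unlock(&app_krb_mutex);
}

/* Library side: every non-thread-safe call is bracketed by the hook. */
static void
library_call_needing_lock(void)
{
    current_hook(true);
    /* the Kerberos or OpenSSL call would go here */
    current_hook(false);
}
```

An app that uses the libraries itself registers its lock once at startup; from then on the library and the app serialize on the same mutex.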

 I have documented that SSL and Kerberos are not
thread-safe in the libpq docs.  Let's wait and see if we need additional
work in this area.
 

It means that multithreading is not usable: As Tom explained, the 
connect string is often set directly by the end user. Setting sslmode 
would result in races - impossible to support. At the very least, 
sslmode and Kerberos would have to fail if the app is multithreaded.

--
   Manfred
Index: src/interfaces/libpq/fe-auth.c
===
RCS file: /projects/cvsroot/pgsql-server/src/interfaces/libpq/fe-auth.c,v
retrieving revision 1.89
diff -u -r1.89 fe-auth.c
--- src/interfaces/libpq/fe-auth.c  7 Jan 2004 18:56:29 -   1.89
+++ src/interfaces/libpq/fe-auth.c  12 Mar 2004 20:07:02 -
@@ -590,6 +590,7 @@
 
case AUTH_REQ_KRB4:
 #ifdef KRB4
+   pglock_thread();
if (pg_krb4_sendauth(PQerrormsg, conn->sock,
   (struct sockaddr_in *) &conn->laddr.addr,
   (struct sockaddr_in *) &conn->raddr.addr,
@@ -597,8 +598,10 @@
{
snprintf(PQerrormsg, PQERRORMSG_LENGTH,
libpq_gettext("Kerberos 4 authentication failed\n"));
+   pgunlock_thread();
return STATUS_ERROR;
}
+   pgunlock_thread();
break;
 #else
snprintf(PQerrormsg, PQERRORMSG_LENGTH,
@@ -722,6 +725,7 @@
if (authsvc == 0)
return NULL;    /* leave original error message in place */
 
+   pglock_thread();
 #ifdef KRB4
if (authsvc == STARTUP_KRB4_MSG)
name = pg_krb4_authname(PQerrormsg);
@@ -759,5 +763,6 @@
 
if (name && (authn = (char *) malloc(strlen(name) + 1)))
strcpy(authn, name);
+   pgunlock_thread();
return authn;
 }
Index: src/interfaces/libpq/fe-connect.c
===
RCS file: /projects/cvsroot/pgsql-server/src/interfaces/libpq/fe-connect.c,v
retrieving revision 1.268
diff -u -r1.268 fe-connect.c
--- src/interfaces/libpq/fe-connect.c   10 Mar 2004 21:12:47 -  1.268
+++ src/interfaces/libpq/fe-connect.c   12 Mar 2004 20:07:03 -
@@ -3163,4 +3163,34 @@
 
 #undef LINELEN
 }
+/*
+ * To keep the API consistent, the locking stubs are always provided, even
+ * if they are not required.
+ */
+pgthreadlock_t *g_threadlock;
 
+static pgthreadlock_t default_threadlock;
+static void
+default_threadlock(bool acquire)
+{
+#if defined(ENABLE_THREAD_SAFETY)
+   static pthread_mutex_t singlethread_lock = PTHREAD_MUTEX_INITIALIZER;
+   if (acquire)
+       pthread_mutex_lock(&singlethread_lock);
+   else
+       pthread_mutex_unlock(&singlethread_lock);
+#endif
+}
+
+pgthreadlock_t *
+PQregisterThreadLock(pgthreadlock_t *newhandler)
+{
+   pgthreadlock_t *prev;
+
+   prev = g_threadlock;
+   if (newhandler)
+   g_threadlock = newhandler;
+   else
+   g_threadlock = default_threadlock;
+   return prev;
+}
Index: src/interfaces/libpq/libpq-fe.h
===
RCS file: /projects/cvsroot/pgsql-server/src/interfaces/libpq/libpq-fe.h,v
retrieving revision 1.102
diff -u -r1.102 libpq-fe.h
--- src/interfaces/libpq/libpq-fe.h 9 Jan 2004 02:02:43 -   1.102
+++ src/interfaces/libpq/libpq-fe.h 12 Mar 2004 20:07:03 -
@@ -274,6 +274,22 @@
 PQnoticeProcessor proc,
 void *arg);
 
+typedef void (pgsigpipehandler_t)(bool enable, void **state);
+
+extern pgsigpipehandler_t *
+PQregisterSigpipeCallback(pgsigpipehandler_t *newhandler);
+
+/*
+ * Used to set callback that prevents concurrent access to
+ * non-thread safe functions that libpq needs.
+ * The default implementation uses a libpq internal mutex.
+ * Only required for multithreaded apps that use kerberos
+ * both within their app and for postgresql connections.
+ */
+typedef void (pgthreadlock_t)(bool acquire);
+
+extern pgthreadlock_t * 

Re: [HACKERS] friday 13 bug?

2004-02-15 Thread Manfred Spraul
zohn_ming wu wrote:

swap_free: Bad swap file entry 0004

Do you use ECC memory, is ECC enabled in the BIOS [and does it work - 
some vendors lie about ECC support]?

I would bet that it's a soft memory error:  means not used. One 
bit differs, and the kernel complains about the invalid value. I think 
the following oops is a side effect of the bad swap entry.
Do you have timestamps in the system log? Is the swap error just before 
the BUG in buffer.c?

--
   Manfred




Re: [HACKERS] libpq thread safety

2004-02-10 Thread Manfred Spraul
Bruce Momjian wrote:

However, we really have two types of function tested. 
The first, strerror, can be thread safe by using thread-local storage
_or_ by returning pointers to static strings.  The other two function
tests require thread-local storage to be thread-safe.
 

You are completely ignoring that libpq is a library: what if the app 
itself wants to call gethostbyname or strerror, too?
Right now libpq has its own private mutex. This doesn't work - the 
locking must be process-wide. The current implementation could be the 
default, and apps that want to use gethostbyname [or kerberos 
authentication, etc.] outside libpq must fill in appropriate callbacks.

--
   Manfred


Re: [HACKERS] Mixing threaded and non-threaded

2004-01-27 Thread Manfred Spraul
Bruce Momjian wrote:

Woh, as far as I know, any application should run fine with -lpthread,
threaded or not.  What OS are you on?  This is the first I have heard of
this problem.
 

Perhaps we should try to figure out how other packages handle 
multithreaded/singlethreaded libraries? I'm looking at openssl right 
now, and openssl never links against libpthread: The caller is 
responsible for registering the locking primitives.

--
   Manfred


Re: [HACKERS] Disaster!

2004-01-25 Thread Manfred Spraul
Greg Stark wrote:

I do know that AFS returns quota failures on close. This was unusual enough
that when AFS was deployed at school unix tools failed left and right over
precisely this issue. Though it mostly just meant they returned the wrong exit
status.
That means
   open();
   write();
   sync();
could succeed, but the data is not stored on disk, correct?
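
As a minimal sketch of the defensive pattern this question points at: check
the result of every step, including close(), because a filesystem that
defers errors (as described above for AFS) may report them only there.
Whether fsync() would already surface the quota failure on AFS is not
answered in this thread; write_file_checked is a hypothetical helper name.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes from buf to path, checking every step.
 * Returns 0 on success and -1 if open, write, fsync or close failed.
 */
int
write_file_checked(const char *path, const char *buf, size_t len)
{
    size_t done = 0;
    int    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;

    while (done < len)
    {
        ssize_t n = write(fd, buf + done, len - done);

        if (n < 0)
        {
            close(fd);
            return -1;
        }
        done += (size_t) n;
    }

    /* Ask the kernel to push this file's data to stable storage. */
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }

    /* Deferred errors may show up only here, so do not ignore the result. */
    if (close(fd) != 0)
        return -1;
    return 0;
}
```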

--
   Manfred




Re: [HACKERS] LWLock/ShmemIndex startup question

2004-01-13 Thread Manfred Spraul
Tom Lane wrote:

Claudio Natoli [EMAIL PROTECTED] writes:
 

Or, maybe we'll just use the tas() implementation that already exists for
__i386__/__x86_64__ in s_lock.h. How did I miss that?
Move along. Nothing to see here.
   

Actually, I was expecting you to complain that the s_lock.h coding is
gcc-specific.  Which compilers do we need to support on Windows?
 

I think Intel's compiler supports the gcc syntax. At least the Linux 
version can compile the Linux kernel.
MSVC has its own syntax that is very primitive, and AFAIK not supported 
by the 64-bit windows versions. The AMD64 version definitely doesn't 
support inline assembly at all.

What are the chances for Win64 support? sizeof(unsigned long) remains 4, 
sizeof(void*) is 8.

--
   Manfred




Re: [HACKERS] LWLock/ShmemIndex startup question

2004-01-13 Thread Manfred Spraul
Tom Lane wrote:

Manfred Spraul [EMAIL PROTECTED] writes:
 

What are the chances for Win64 support? sizeof(unsigned long) remains 4, 
sizeof(void*) is 8.
   

If you can tell me what type Datum should be (unsigned long long
maybe?), we could probably handle that.
Probably uintptr_t: That's the official C99 integer type for storing 
pointers. I'm not sure if it's guaranteed to be wide enough for 
ULONG_MAX (or only UINT_MAX).
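
A small sketch of the idea, with DemoDatum as a hypothetical stand-in for
the real Datum typedef. As far as I know C99 only requires UINTPTR_MAX to
be at least 65535, so the "wide enough for unsigned long" property has to
be checked per platform rather than assumed; the helper below does that
check explicitly.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for the Datum typedef discussed above. */
typedef uintptr_t DemoDatum;

/* The type must at least be able to hold a pointer. */
_Static_assert(sizeof(DemoDatum) >= sizeof(void *),
               "DemoDatum must be able to hold a pointer");

static inline DemoDatum
pointer_to_datum(void *p)
{
    return (DemoDatum) p;
}

static inline void *
datum_to_pointer(DemoDatum d)
{
    return (void *) d;
}

/* Returns 1 if every unsigned long value also fits into a DemoDatum. */
static inline int
datum_holds_ulong(void)
{
    return UINTPTR_MAX >= ULONG_MAX;
}
```

On Win64, where unsigned long stays 32 bits and pointers are 64 bits, the
pointer round trip is the property that unsigned long cannot provide.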

--
   Manfred


[HACKERS] libpq thread safety

2004-01-11 Thread Manfred Spraul
libpq needs additional changes for complete thread safety:
- openssl needs different initialization.
- kerberos is not thread safe.
- functions such as gethostbyname are not thread safe, and could be used 
by kerberos. Right now they are protected with a libpq-specific mutex.
- ditto for getpwuid and strerror.

openssl is trivial: the init function just needs the proper flags.
But what about kerberos: I'm a bit reluctant to add a fourth mutex: what 
if kerberos calls gethostbyname or getpwuid internally?
Usually I would use one single_thread mutex and use that mutex for all 
operations - races are just too difficult to debug. Any better ideas? 
Otherwise I'd start searching for the non-threadsafe functions and add 
pthread_mutex_lock calls around them.
Actually I'm not even sure if it should be a libpq specific mutex: what 
if the calling app needs to access openssl or kerberos as well? Perhaps 
libpq should use a system similar to openssl:

http://www.openssl.org/docs/crypto/threads.html

--
   Manfred


[HACKERS] PQinSend question

2004-01-11 Thread Manfred Spraul
From fe-secure.c:

/*
 *  Indicates whether the current thread is in send()
 *  For use by SIGPIPE signal handlers;  they should
 *  ignore SIGPIPE when libpq is in send().  This means
 *  that the backend has died unexpectedly.
 */
pqbool
PQinSend(void)
{
#ifdef ENABLE_THREAD_SAFETY
return (pthread_getspecific(thread_in_send) /* has it been set? */ &&
*(char *)pthread_getspecific(thread_in_send) == 't') ? true : false;
#else
return false;   /* No threading, so we can't be in send() */

Why not? Signal delivery can interrupt send() even with single-threaded 
users.

I really like the openssl interface: what about something like

typedef void (*pgsigpipehandler_t)(bool enable);

void PQregisterSignalCallback(pgsigpipehandler_t new);

The callback is global, and called around the send() calls.
The default handler uses the sigaction code from 7.4. The current 
autodetection code is less flexible than a callback, and it's not 100% 
backward compatible.

--
   Manfred


Re: [HACKERS] libpq thread safety

2004-01-11 Thread Manfred Spraul
Tom Lane wrote:

Manfred Spraul [EMAIL PROTECTED] writes:
 

But what about kerberos: I'm a bit reluctant to add a fourth mutex: what 
if kerberos calls gethostbyname or getpwuid internally?
   

Wouldn't help anyway, if some other part of the app also calls kerberos.

That's why I've proposed to use the system from openssl: The libpq user 
must implement a lock callback, and libpq calls it around the critical 
sections.
Attached is an untested prototype patch. What do you think?

--
   Manfred
Index: src/interfaces/libpq/fe-connect.c
===
RCS file: /projects/cvsroot/pgsql-server/src/interfaces/libpq/fe-connect.c,v
retrieving revision 1.267
diff -u -r1.267 fe-connect.c
--- src/interfaces/libpq/fe-connect.c   9 Jan 2004 02:02:43 -   1.267
+++ src/interfaces/libpq/fe-connect.c   11 Jan 2004 16:54:06 -
@@ -885,12 +885,6 @@
struct addrinfo hint;
const char *node = NULL;
int ret;
-#ifdef ENABLE_THREAD_SAFETY
-   static pthread_once_t check_sigpipe_once = PTHREAD_ONCE_INIT;
-
-   /* Check only on first connection request */
-   pthread_once(&check_sigpipe_once, check_sigpipe_handler);
-#endif
 
if (!conn)
return 0;
Index: src/interfaces/libpq/fe-secure.c
===
RCS file: /projects/cvsroot/pgsql-server/src/interfaces/libpq/fe-secure.c,v
retrieving revision 1.36
diff -u -r1.36 fe-secure.c
--- src/interfaces/libpq/fe-secure.c9 Jan 2004 02:17:15 -   1.36
+++ src/interfaces/libpq/fe-secure.c11 Jan 2004 16:54:07 -
@@ -146,11 +146,6 @@
 static SSL_CTX *SSL_context = NULL;
 #endif
 
-#ifdef ENABLE_THREAD_SAFETY
-static void sigpipe_handler_ignore_send(int signo);
-pthread_key_t thread_in_send;
-#endif
-
 /*  */
 /*  Hardcoded values  
 */
 /*  */
@@ -212,6 +207,26 @@
 /*  */
 
 /*
+ * Sigpipe handling.
+ * Dummy provided even for WIN32 to keep the API consistent
+ */
+pgsigpipehandler_t default_sigpipehandler;
+
+void default_sigpipehandler(bool enable, void **state)
+{
+#ifndef WIN32
+   if (enable) {
+   *state = (void*) pqsignal(SIGPIPE, SIG_IGN);
+   } else {
+   pqsignal(SIGPIPE, (pqsigfunc)*state);
+   }
+#endif
+}
+
+static pgsigpipehandler_t *g_sigpipehandler = default_sigpipehandler;
+
+
+/*
  * Initialize global context
  */
 int
@@ -356,12 +371,9 @@
 {
ssize_t n;
 
-#ifdef ENABLE_THREAD_SAFETY
-   pthread_setspecific(thread_in_send, "t");
-#else
 #ifndef WIN32
-   pqsigfunc   oldsighandler = pqsignal(SIGPIPE, SIG_IGN);
-#endif
+   void *sigstate;
+   g_sigpipehandler(true, &sigstate);
 #endif
 
 #ifdef USE_SSL
@@ -420,12 +432,8 @@
 #endif
n = send(conn->sock, ptr, len, 0);
 
-#ifdef ENABLE_THREAD_SAFETY
-   pthread_setspecific(thread_in_send, "f");
-#else
 #ifndef WIN32
-   pqsignal(SIGPIPE, oldsighandler);
-#endif
+   g_sigpipehandler(false, &sigstate);
 #endif
 
return n;
@@ -1066,62 +1074,18 @@
 
 #endif   /* USE_SSL */
 
-
-#ifdef ENABLE_THREAD_SAFETY
 /*
- * Check SIGPIPE handler and perhaps install our own.
+ * PQregisterSigpipeCallback
  */
-void
-check_sigpipe_handler(void)
+pgsigpipehandler_t *
+PQregisterSigpipeCallback(pgsigpipehandler_t *newhandler)
 {
-   pqsigfunc pipehandler;
+   pgsigpipehandler_t *prev;
 
-   /*
-*  If the app hasn't set a SIGPIPE handler, define our own
-*  that ignores SIGPIPE on libpq send() and does SIG_DFL
-*  for other SIGPIPE cases.
-*/
-   pipehandler = pqsignalinquire(SIGPIPE);
-   if (pipehandler == SIG_DFL) /* not set by application */
-   {
-   /*
-*  Create key first because the signal handler might be called
-*  right after being installed.
-*/
-   pthread_key_create(&thread_in_send, NULL);
-   pqsignal(SIGPIPE, sigpipe_handler_ignore_send);
-   }
-}
-
-/*
- * Threaded SIGPIPE signal handler
- */
-void
-sigpipe_handler_ignore_send(int signo)
-{
-   /*
-*  If we have gotten a SIGPIPE outside send(), exit.
-*  Synchronous signals are delivered to the thread
-*  that caused the signal.
-*/
-   if (!PQinSend())
-   exit(128 + SIGPIPE);/* typical return value for SIG_DFL */
-}
-#endif
- 
-/*
- * Indicates whether the current thread is in send()
- * For use by SIGPIPE signal handlers;  they should
- * ignore SIGPIPE when libpq is in send().  This means
- * that the backend has died unexpectedly

Re: [HACKERS] libpq thread safety

2004-01-11 Thread Manfred Spraul
Tom Lane wrote:

Personally I find diff -u format completely unreadable :-(.  Send
diff -c if you want useful commentary.
 

diff -c is attached. I've removed the signal changes, they are 
unrelated. I'll resend them separately.

--
   Manfred
Index: src/interfaces/libpq/libpq-fe.h
===
RCS file: /projects/cvsroot/pgsql-server/src/interfaces/libpq/libpq-fe.h,v
retrieving revision 1.102
diff -c -r1.102 libpq-fe.h
*** src/interfaces/libpq/libpq-fe.h 9 Jan 2004 02:02:43 -   1.102
--- src/interfaces/libpq/libpq-fe.h 11 Jan 2004 17:29:38 -
***
*** 458,463 
--- 458,480 
   */
  pqbool PQinSend(void);
  
+ /* === in thread.c === */
+ 
+ /*
+  *Used to set callback that prevents concurrent access to
+  *non-thread safe functions that libpq needs.
+  *The default implementation uses a libpq internal mutex.
+  *Only required for multithreaded apps on platforms that
+  *do not support the thread-safe equivalents and that want
+  *to use the functions, too.
+  *List of functions:
+  *- strerror, getpwuid, gethostbyname.
+  *TODO: the mutex must be used around kerberos calls, too.
+  */
+ typedef void (pgthreadlock_t)(bool acquire);
+ 
+ extern pgthreadlock_t * PQregisterThreadLock(pgthreadlock_t *newhandler);
+ 
  #ifdef __cplusplus
  }
  #endif
Index: src/interfaces/libpq/libpq-int.h
===
RCS file: /projects/cvsroot/pgsql-server/src/interfaces/libpq/libpq-int.h,v
retrieving revision 1.84
diff -c -r1.84 libpq-int.h
*** src/interfaces/libpq/libpq-int.h9 Jan 2004 02:02:43 -   1.84
--- src/interfaces/libpq/libpq-int.h11 Jan 2004 17:29:38 -
***
*** 448,453 
--- 448,460 
  #ifdef ENABLE_THREAD_SAFETY
  extern void check_sigpipe_handler(void);
  extern pthread_key_t thread_in_send;
+ 
+ extern pgthreadlock_t *g_threadlock;
+ #define pglock_thread() g_threadlock(true);
+ #define pgunlock_thread() g_threadlock(false);
+ #else
+ #define pglock_thread() ((void)0)
+ #define pgunlock_thread() ((void)0)
  #endif
  
  /*
Index: src/port/thread.c
===
RCS file: /projects/cvsroot/pgsql-server/src/port/thread.c,v
retrieving revision 1.14
diff -c -r1.14 thread.c
*** src/port/thread.c   29 Nov 2003 22:41:31 -  1.14
--- src/port/thread.c   11 Jan 2004 17:29:38 -
***
*** 65,70 
--- 65,105 
   *non-*_r functions.
   */
   
+ #if defined(FRONTEND)
+ #include "libpq-fe.h"
+ #include "libpq-int.h"
+ /*
+  * To keep the API consistent, the locking stubs are always provided, even
+  * if they are not required.
+  */
+ pgthreadlock_t *g_threadlock;
+ 
+ static pgthreadlock_t default_threadlock;
+ static void
+ default_threadlock(bool acquire)
+ {
+ #if defined(ENABLE_THREAD_SAFETY)
+   static pthread_mutex_t singlethread_lock = PTHREAD_MUTEX_INITIALIZER;
+   if (acquire)
+       pthread_mutex_lock(&singlethread_lock);
+   else
+       pthread_mutex_unlock(&singlethread_lock);
+ #endif
+ }
+ 
+ pgthreadlock_t *
+ PQregisterThreadLock(pgthreadlock_t *newhandler)
+ {
+   pgthreadlock_t *prev;
+ 
+   prev = g_threadlock;
+   if (newhandler)
+   g_threadlock = newhandler;
+   else
+   g_threadlock = default_threadlock;
+   return prev;
+ }
+ #endif
  
  /*
   * Wrapper around strerror and strerror_r to use the former if it is
***
*** 82,96 
  #else
  
  #if defined(FRONTEND) && defined(ENABLE_THREAD_SAFETY) && defined(NEED_REENTRANT_FUNCS) && !defined(HAVE_STRERROR_R)
!   static pthread_mutex_t strerror_lock = PTHREAD_MUTEX_INITIALIZER;
!   pthread_mutex_lock(&strerror_lock);
  #endif
  
/* no strerror_r() available, just use strerror */
StrNCpy(strerrbuf, strerror(errnum), buflen);
  
  #if defined(FRONTEND) && defined(ENABLE_THREAD_SAFETY) && defined(NEED_REENTRANT_FUNCS) && !defined(HAVE_STRERROR_R)
!   pthread_mutex_unlock(&strerror_lock);
  #endif
  
return strerrbuf;
--- 117,130 
  #else
  
  #if defined(FRONTEND) && defined(ENABLE_THREAD_SAFETY) && defined(NEED_REENTRANT_FUNCS) && !defined(HAVE_STRERROR_R)
!   g_threadlock(true);
  #endif
  
/* no strerror_r() available, just use strerror */
StrNCpy(strerrbuf, strerror(errnum), buflen);
  
  #if defined(FRONTEND) && defined(ENABLE_THREAD_SAFETY) && defined(NEED_REENTRANT_FUNCS) && !defined(HAVE_STRERROR_R)
!   g_threadlock(false);
  #endif
  
return strerrbuf;
***
*** 118,125 
  #else
  
  #if defined(FRONTEND) && defined(ENABLE_THREAD_SAFETY) && defined(NEED_REENTRANT_FUNCS) && !defined(HAVE_GETPWUID_R)
!   static pthread_mutex_t getpwuid_lock = PTHREAD_MUTEX_INITIALIZER;
!   pthread_mutex_lock(&getpwuid_lock);
  #endif
  
/* no getpwuid_r() available, just use getpwuid() */
--- 152,158 
  #else
  

Re: [HACKERS] libpq thread safety

2004-01-11 Thread Manfred Spraul
Tom Lane wrote:

Wait a minute.  I am *not* buying into any proposal that we need to
support ENABLE_THREAD_SAFETY on machines where libc is not thread-safe.
We have other things to do than adopt an open-ended commitment to work
around threading bugs on obsolete platforms.  I don't believe that any
sane application programmer is going to try to implement a
multi-threaded app on such a platform anyway.
I'd agree - convince Bruce and I'll replace the mutexes in thread.c with 
#error. But I think libpq should support a mutex around kerberos (or at 
least fail at runtime) - right now it's too easy to corrupt the kerberos 
authentication state.

--
   Manfred


Re: [HACKERS] using stp for dbt2 + postgresql

2004-01-01 Thread Manfred Spraul
Bruce Momjian wrote:

[EMAIL PROTECTED] wrote:
 

Hi Manfred,

Just wanted to let you know I tried your patch-spinlock-i386 patch on
our STP (our automated test platform) 8-way systems and saw a 5.5%
improvement with Pentium III Xeons. If you want to see those results:
PostgreSQL 7.4.1:
http://khack.osdl.org/stp/285062/
PostgreSQL 7.4.1 w/ your patch:
	http://khack.osdl.org/stp/285087/
   

Impressive.  Thanks.
 

The best thing is that we can try our own postgres patches with STP now: 
this gives us a chance to run tests on up to 8-way systems, with 4 GB 
memory, 40 spindles. From my experience, the typical turnaround time is 
half a day - submit patch [web interface], start benchmark run, and 
after a few hours you get a mail that contains the output. With oprofile, 
it's very detailed - % cpu time for each function, down to individual 
asm instructions, plus the ability for custom logging into the 
postmaster log.
I think we should try to use that to find a cache replacement policy 
that is SMP scalable, i.e. doesn't need a global lock - I searched a few 
minutes on citeseer, but couldn't find anything that doesn't rely on 
global lists.

--
   Manfred




Re: [HACKERS] [PATCHES] update i386 spinlock for hyperthreading

2003-12-30 Thread Manfred Spraul
Bruce Momjian wrote:

Anyone see an attack path here?



Should we have one lock per hash bucket rather than one for the entire
hash?
  

That's the simple part. The problem is the aging strategy: we need a
strategy that doesn't rely on a global list that's updated after every
lookup. If I understand the ARC code correctly, there is a
STRAT_MRU_INSERT(cdb, STRAT_LIST_T2) that happens on every lookup.


--
Manfred





Re: [HACKERS] [PATCHES] update i386 spinlock for hyperthreading

2003-12-30 Thread Manfred Spraul
Jan Wieck wrote:

Moving the Cache Directory Block (cdb) on a hit to the MRU position of
the appropriate queue is the bookkeeping of this strategy. The whole
algorithm is based on it, and I don't see yet how to avoid that without
opening a huge can of worms that look like deadlocks. But I'll think
about it for a while.
I feared that.
Are there strategies that do not rely on a global lock? The Linux kernel 
uses a lazy LRU with referenced bits: on access, the referenced bit is 
set. The freespace logic takes pages from the end of a linked list, and 
checks that bit: if it's set, then the page is moved back to the top of 
the list. Otherwise it's a candidate for replacement. Pages start at the 
head of that pseudo-lru list, with the reference bit clear: that way a 
page that is accessed only once has a lower priority than a frequently 
accessed page. At least that's how I understand the algorithm.
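
To make the description concrete, here is a minimal single-threaded sketch
of such a list with referenced bits. It follows the steps described above
and is not the actual Linux kernel code or a proposal for the postgres
buffer manager; all names (DemoPage, DemoList, demo_insert, demo_touch,
demo_evict) are hypothetical. The point is that a hit only sets a flag, so
the list itself is touched only on insert and reclaim.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct DemoPage
{
    int              id;
    bool             referenced;
    struct DemoPage *prev;      /* towards the head */
    struct DemoPage *next;      /* towards the tail */
} DemoPage;

typedef struct
{
    DemoPage *head;
    DemoPage *tail;
} DemoList;

static void
list_unlink(DemoList *l, DemoPage *p)
{
    if (p->prev)
        p->prev->next = p->next;
    else
        l->head = p->next;
    if (p->next)
        p->next->prev = p->prev;
    else
        l->tail = p->prev;
    p->prev = p->next = NULL;
}

static void
list_push_head(DemoList *l, DemoPage *p)
{
    p->prev = NULL;
    p->next = l->head;
    if (l->head)
        l->head->prev = p;
    else
        l->tail = p;
    l->head = p;
}

/* New pages enter at the head with the referenced bit clear. */
void
demo_insert(DemoList *l, DemoPage *p, int id)
{
    p->id = id;
    p->referenced = false;
    list_push_head(l, p);
}

/* A hit only sets the bit; the list is not modified. */
void
demo_touch(DemoPage *p)
{
    p->referenced = true;
}

/* Reclaim scans from the tail; referenced pages get a second chance. */
DemoPage *
demo_evict(DemoList *l)
{
    while (l->tail != NULL)
    {
        DemoPage *victim = l->tail;

        list_unlink(l, victim);
        if (!victim->referenced)
            return victim;
        victim->referenced = false;
        list_push_head(l, victim);
    }
    return NULL;
}
```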

--
   Manfred


Re: [HACKERS] [PATCHES] update i386 spinlock for hyperthreading

2003-12-27 Thread Manfred Spraul
[EMAIL PROTECTED] wrote:

Hi Manfred,

I'm using unixware 7 but couldn't compile your source with native cc, I
had to compile it with gcc.
here are the results:
 

Thanks. The test app compares the time needed for three different short 
loops: a loop with six empty function calls, a loop with six function 
calls and one nop in the middle, and a loop with a rep;nop; in the middle.

Result:
- nop needs 0 cycles - executed in parallel.
- rep;nop between 24 and 60 cycles - long enough that the pipeline is 
emptied.

I've searched around for further info regarding the recommended spinlock 
algorithm:
- The optimization manual (google for Intel 248966) contains a section 
about pause instructions: The memory ordering violation is from the 
multiple simultaneous reads that are executed due to pipelining the busy 
loop.
- It references the Application Note AP-949 "Using Spin-Loops on Intel 
Pentium 4 Processor and Intel Xeon Processor" for further details. 
Unfortunately the app notes are stored on cedar.intel.com, and that 
server appears to be down :-(

--
   Manfred


Re: [HACKERS] Issue with Linux+Pentium SMP Context Switching

2003-12-19 Thread Manfred Spraul
Josh Berkus wrote:

	Initial debug logging of a test on one Xeon system demonstrating this issue 
showed a very large number of unattributed semop() calls.   We are still 
following up on this.

Postgres has its own user space spinlock and semaphore implementation. 
Both fall back to semop if there is contention.

Hmm. You wrote that the problem is Xeon specific, and that AthlonMP are 
unaffected. Perhaps Xeon cpus do not like the s_lock implementation? It 
doesn't follow Intel's recommendations:
- no pause instructions.
- always TAS. The recommended approach is nonatomic tests until the 
value is 0, then an atomic TAS.
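
The two recommendations can be sketched as a test-and-test-and-set loop
with a pause in the spin. This is an illustration written with C11
atomics, not the s_lock.h code: the names (DemoSpinLock, demo_spin_*,
demo_cpu_relax) are hypothetical, and the fallback to sleeping or semop
after too many spins is left out.

```c
#include <stdatomic.h>

typedef struct
{
    atomic_int locked;          /* 0 = free, 1 = held */
} DemoSpinLock;

/* PAUSE ("rep; nop") on x86 with gcc-style asm; a no-op elsewhere. */
static inline void
demo_cpu_relax(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("rep; nop" ::: "memory");
#endif
}

void
demo_spin_init(DemoSpinLock *l)
{
    atomic_init(&l->locked, 0);
}

/* One atomic test-and-set; returns 1 if the lock was acquired. */
int
demo_spin_trylock(DemoSpinLock *l)
{
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}

void
demo_spin_lock(DemoSpinLock *l)
{
    for (;;)
    {
        if (demo_spin_trylock(l))
            return;
        /* Contended: spin on plain loads until the lock looks free. */
        while (atomic_load_explicit(&l->locked,
                                    memory_order_relaxed) != 0)
            demo_cpu_relax();
    }
}

void
demo_spin_unlock(DemoSpinLock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The inner loop only reads, so waiting CPUs do not keep pulling the cache
line in exclusive mode; the atomic exchange is retried only once the lock
looks free.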

Attached is a gross hack that adds pause instructions. If this doesn't 
magically fix your problem, then we must figure out what causes the 
semop calls, and avoid them.
Could you ask your Linux hackers why they blame the shared memory 
implementation in postgres? I don't see any link between shared memory 
and lock contention.

--
   Manfred

Index: backend/storage/lmgr/s_lock.c
===
RCS file: /projects/cvsroot/pgsql-server/src/backend/storage/lmgr/s_lock.c,v
retrieving revision 1.16
diff -c -r1.16 s_lock.c
*** backend/storage/lmgr/s_lock.c   8 Aug 2003 21:42:00 -   1.16
--- backend/storage/lmgr/s_lock.c   19 Dec 2003 20:01:33 -
***
*** 111,116 
--- 111,117 
  
spins = 0;
}
+   __asm__ __volatile__("rep;nop\n": : : "memory");
}
  }
  



Re: [HACKERS] fsync method checking

2003-12-12 Thread Manfred Spraul
Bruce Momjian wrote:

	write                  0.000360
	write & fsync          0.001391
	write, close & fsync   0.001308
	open o_fsync, write    0.000924
 

That's 1 millisecond vs. 1.3 milliseconds. Neither value is realistic - 
I guess the hw write cache is on and the os doesn't issue cache flush 
commands. Realistic values are probably 5 ms vs 5.3 ms - 6%, not 30%. 
How large is the syscall latency with BSD/OS 4.3?

One advantage of a separate write and fsync call is better performance 
for the writes that are triggered within AdvanceXLInsertBuffer: I'm not 
sure how often that's necessary, but it's a write while holding both the 
WALWriteLock and WALInsertLock. If every write contains an implicit 
sync, that call would be much more expensive than necessary.

--
   Manfred


Re: [HACKERS] Double linked list with one pointer

2003-12-07 Thread Manfred Spraul
Tom Lane wrote:

Greg Stark [EMAIL PROTECTED] writes:
 

Treating pointers as integers is technically nonportable but
realistically you would be pretty hard pressed to find any
architecture anyone runs postgres on where there isn't some integer
datatype that you can cast both directions from pointers safely.
   

... like, say, Datum.  We already make that assumption, so there's no
new portability risk involved.
 

There is a new type in C99 for an integer that can hold a pointer value. 
I think it's called intptr_t resp. uintptr_t, but I don't have the 
standard around.
It will be necessary for a 64-bit Windows port: Microsoft decided that 
pointers are 64-bit on WIN64, while int and long remain 32-bit. 
Microsoft's own typedefs are UINT_PTR, DWORD_PTR, INT_PTR.

--
   Manfred


[HACKERS] libpq thread safety

2003-11-17 Thread Manfred Spraul
Hi,

I've searched through libpq and looked for global or static variables as 
indicators of non-threadsafe code. I found:
- Win32 and BeOS: there is a global ioctlsocket_ret variable, but it 
seems to be a dummy variable that is always discarded.
- pg_krb4_init(): Are the kerberos libraries thread safe? Additionally, 
setting init_done is racy.
- pg_krb4_authname(): uses a static buffer.
- kerberos 5: Is the library thread safe? The initialization could run 
twice, I'm not sure if that's intentional.
- pg_krb5_authname(): relies on the global variable pg_krb5_name.
- PQoidStatus: uses a static buffer.
- libpq_gettext: setting already_bound is racy.
- openssl: According to
http://www.openssl.org/docs/crypto/threads.html
libpq must register locking callbacks within openssl, otherwise there 
will be random corruptions. Additionally the SSL_context initialization 
is not properly synchronized, and SSLerrmessage relies on a static buffer.

PQoidStatus is already documented as not thread safe, but what about 
OpenSSL and kerberos? It seems openssl needs support with callbacks, and 
according to google searches MIT kerberos 5 is not thread safe, and 
libpq must use mutexes to prevent concurrent calls into the kerberos 
library.

--
   Manfred


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-09 Thread Manfred Spraul
Greg Stark wrote:

I'm assuming fsync syncs writes issued by other processes on the same file,
which isn't necessarily true though.
 

It was already pointed out that we can't rely on that assumption.
   

So the NetBSD and Sun developers I checked with both asserted fsync does in
fact guarantee this. And SUSv2 seems to back them up:
 

At least Linux had one problem: fsync() syncs the inode to disk, but not 
the directory entry: if you rename a file, open it, write to it, fsync, 
and the computer crashes, then it's not guaranteed that the file rename 
is on the disk.
I think only the old ext2 is affected, not the journaling filesystems.
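
A common way to close this gap is to fsync the containing directory after
the rename. The sketch below assumes that both names live in the same
directory (dirpath) and that the platform allows fsync() on a directory
file descriptor, as Linux does; demo_durable_rename is a hypothetical
helper name.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Rename oldpath to newpath, then fsync the directory that contains
 * them so that the directory entry itself is pushed to disk.
 * Returns 0 on success and -1 on failure.
 */
int
demo_durable_rename(const char *oldpath, const char *newpath,
                    const char *dirpath)
{
    int dirfd;
    int rc;

    if (rename(oldpath, newpath) != 0)
        return -1;

    dirfd = open(dirpath, O_RDONLY);
    if (dirfd < 0)
        return -1;

    rc = fsync(dirfd);
    if (close(dirfd) != 0)
        rc = -1;
    return rc;
}
```

The file's own data still needs its own fsync(); this only covers the
name.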

--
   Manfred


Re: [HACKERS] Performance features the 4th

2003-11-05 Thread Manfred Spraul
Jan Wieck wrote:

_Vacuum page delay_:

Tom Lane's napping during vacuums with another tuning option. I 
replaced the usleep() call with a PG_DELAY(msec) macro in miscadmin.h, 
which does use select(2) instead. That should address the possible 
portability problems.
What about skipping the delay if there are no outstanding disk 
operations? Then vacuum would get the full disk bandwidth if the system 
is idle.

--
   Manfred




Re: [HACKERS] Performance features the 4th

2003-11-05 Thread Manfred Spraul
Tom Lane wrote:

Manfred's idea is interesting but AFAICS completely unimplementable
in any portable fashion.  You'd have to have hooks into the kernel.
 

I thought about outstanding operations from postgres - I don't know 
enough about the buffer layer if it's possible to keep a counter of the 
currently running read() and write() operations, or something similar.

--
   Manfred


Re: [HACKERS] OSDL DBT-2 w/ PostgreSQL 7.3.4 and 7.4beta5

2003-11-04 Thread Manfred Spraul
[EMAIL PROTECTED] wrote:

On  1 Nov, Tom Lane wrote:
 

Manfred Spraul [EMAIL PROTECTED] writes:
   

signal handlers are a process property, not a thread property - that 
code is broken for multi-threaded apps.
 

Yeah, that's been mentioned before, but I don't see any way around it.
What we really want is to turn off SIGPIPE delivery on our socket
(only), but AFAIK there is no API to do that.
   

Will this be a problem for multi-threaded apps with any of the client
interfaces?
Anyone working on making it threadsafe?
 

The POSIX API is not thread safe: signal handlers are per process, and 
libpq would like to block SIGPIPE for its send() calls. For single 
threaded apps, libpq just calls sigaction and sets the handler to 
SIG_IGN around the syscalls.
For multithreaded apps, this is not possible: sigaction is per process.
Thus the calling application must handle the SIGPIPE signals for libpq - 
either by blocking or ignoring them. We are still discussing the exact 
API. Probably a global state that is accessible through a new function.

One thread-safe alternative might be the combination of sigprocmask / 
pthread_sigmask and sigwait, but I think this would be too fragile.

--
   Manfred


Re: [HACKERS] OSDL DBT-2 w/ PostgreSQL 7.3.4 and 7.4beta5

2003-11-04 Thread Manfred Spraul
Tom Lane wrote:

Manfred Spraul [EMAIL PROTECTED] writes:
 

For multithreaded apps, this is not possible: sigaction is per process.
Thus the calling application must handle the SIGPIPE signals for libpq - 
either by blocking or ignoring them. We are still discussing the exact 
API. Probably a global state that is accessible through a new function.
   

I think we should also take a hard look at avoiding the problem by using
MSG_NOSIGNAL on platforms that have it,

I think that's the second step. First we need a portable solution, then 
we can optimize it.
The fastest solution is one signal(SIGPIPE, SIG_IGN) in main(), but that 
requires a change in all libpq users. OTOH there shouldn't be that many 
multithreaded users.
sigprocmask + sigwait could work, but sigprocmask is undefined if 
multiple threads are running. Is there a portable approach for weak 
links? libpq would have to call proc_sigmask if linked against 
libpthread, and sigprocmask if not linked against libpthread. With gcc, 
I could use 'void proc_sigmask () __attribute__ ((weak, alias 
(_sigprocmask)));' or something similar, but this wouldn't be portable 
either.

--
   Manfred


Re: [HACKERS] adding support for posix_fadvise()

2003-11-03 Thread Manfred Spraul
Neil Conway wrote:

The present Linux implementation doesn't do this, AFAICS -- all it does
is increase the readahead for this file:
 

AFAIK Linux uses a modified LRU that automatically puts pages that were 
touched only once at a lower priority than frequently accessed pages.

Neil: what about calling posix_fadvise for the whole file immediately 
after issue_xlog_fsync() in XLogWrite? According to the comment, it's 
guaranteed that this will happen only once.
Or: add a posix_fadvise() call to issue_xlog_fsync(), for the range just 
sync'ed.

Btw, how much xlog traffic does a busy postgres site generate?

--
   Manfred


Re: Avoiding SIGPIPE (was Re: [HACKERS] OSDL DBT-2 w/ PostgreSQL

2003-11-03 Thread Manfred Spraul
Tom Lane wrote:

It strikes me that sigpipe handling will be a global affair in any
particular application --- it's unlikely that it would be correct for
some PG connections and wrong for others.  So one possibility is to make
the control variable be global (static) and thus it could be set before
creating the first PGconn.
 

What about the attached patches?
I hope I found all places that must be updated when a new function is 
added to libpq.

--
   Manfred
Index: doc/src/sgml/libpq.sgml
===
RCS file: /projects/cvsroot/pgsql-server/doc/src/sgml/libpq.sgml,v
retrieving revision 1.141
diff -c -r1.141 libpq.sgml
*** doc/src/sgml/libpq.sgml 1 Nov 2003 01:56:29 -   1.141
--- doc/src/sgml/libpq.sgml 3 Nov 2003 20:35:57 -
***
*** 645,650 
--- 645,693 
/listitem
   /varlistentry
  
+  varlistentry
+   
termfunctionPQsetsighandling/functionindextermprimaryPQsetsighandling///term
+   
termfunctionPQgetsighandling/functionindextermprimaryPQgetsighandling///term
+   listitem
+para
+Set/query SIGPIPE signal handling.
+ synopsis
+ void PQsetsighandling(int internal_sigign);
+ /synopsis
+ synopsis
+ int PQgetsighandling(void);
+ /synopsis
+ /para
+ 
+ para
+ These functions allow to query and set the SIGPIPE signal handling
+ of libpq: by default, Unix systems generate a (fatal) SIGPIPE signal
+ on a send to a socket that lost it's connection. Most callers expect
+ a normal error return instead of the signal. A normal error return
+ can be achieved by blocking or ignoring the SIGPIPE signal. This can
+ be done either globally in the application or inside libpq.
+/para
+para
+ If internal signal handling is enabled (this is the default), then
+ libpq sets the SIGPIPE handler to SIG_IGN before every socket send
+ operation and restores it afterwards. This prevents libpq from
+ killing the application, at the cost of a slight performance
+ decrease. This approach is not reliable for multithreaded applications.
+/para
+para
+ If internal signal handling is disabled, then the caller is
+ responsible for blocking or handling SIGPIPE signals. This is
+ recommended for multithreaded applications.
+/para
+para
+ The signal handler setting is a global flag, it affects all
+ connections. The setting has no effect for Win32 clients - Win32
+ doesn't generate SIGPIPE events.
+/para
+   /listitem
+  /varlistentry
+ 
+ 
   /variablelist
  /para
  /sect1
Index: src/interfaces/libpq/blibpqdll.def
===
RCS file: /projects/cvsroot/pgsql-server/src/interfaces/libpq/blibpqdll.def,v
retrieving revision 1.9
diff -c -r1.9 blibpqdll.def
*** src/interfaces/libpq/blibpqdll.def  13 Aug 2003 16:29:03 -  1.9
--- src/interfaces/libpq/blibpqdll.def  3 Nov 2003 20:35:59 -
***
*** 113,118 
--- 113,120 
  _PQfformat   @ 109
  _PQexecPrepared  @ 110
  _PQsendQueryPrepared @ 111
+ _PQsetsighandling@ 112
+ _PQgetsighandling@ 113
  
  ; Aliases for MS compatible names
  PQconnectdb = _PQconnectdb
***
*** 226,228 
--- 228,232 
  PQfformat   = _PQfformat
  PQexecPrepared  = _PQexecPrepared
  PQsendQueryPrepared = _PQsendQueryPrepared
+ PQsetsighandling= _PQsetsighandling
+ PQgetsighandling= _PQgetsighandling
Index: src/interfaces/libpq/fe-secure.c
===
RCS file: /projects/cvsroot/pgsql-server/src/interfaces/libpq/fe-secure.c,v
retrieving revision 1.32
diff -c -r1.32 fe-secure.c
*** src/interfaces/libpq/fe-secure.c29 Sep 2003 16:38:04 -  1.32
--- src/interfaces/libpq/fe-secure.c3 Nov 2003 20:35:59 -
***
*** 198,203 
--- 198,204 
  -END DH PARAMETERS-\n;
  #endif
  
+ static int do_sigaction = 1;
  /*  */
  /* Procedures common to all secure sessions  
 */
  /*  */
***
*** 348,354 
ssize_t n;
  
  #ifndef WIN32
!   pqsigfunc   oldsighandler = pqsignal(SIGPIPE, SIG_IGN);
  #endif
  
  #ifdef USE_SSL
--- 349,358 
ssize_t n;
  
  #ifndef WIN32
!   pqsigfunc   oldsighandler = NULL;
! 
!   if (do_sigaction)
!   oldsighandler = pqsignal(SIGPIPE, SIG_IGN);
  #endif
  
  #ifdef USE_SSL
***
*** 408,417 
n = send(conn-sock, ptr, len, 0);
  
  #ifndef WIN32
!   pqsignal(SIGPIPE, oldsighandler);
  #endif
  
return n;
  }
  
  /*  */
--- 412,432 
   

Re: [HACKERS] OSDL DBT-2 w/ PostgreSQL 7.3.4 and 7.4beta5

2003-11-02 Thread Manfred Spraul
Mark Wong wrote:

On Sat, Nov 01, 2003 at 10:29:34PM +0100, Manfred Spraul wrote:
 

Mark Wong wrote:

   

Yeah, my dbt2 applications are multithreaded.

 

Do you need SIGPIPE delivery in your app? If no, could you try what 
happens if you apply the attached patch to postgres, and perform the
   signal(SIGPIPE, SIG_IGN);
once in your dbt2 app?
   

Wow, that patch made a pretty big difference:
http://developer.osdl.org/markw/dbt2-pgsql/191/
- metric 1605.51
So no one has to look for older mail before I applied that patch:
http://developer.osdl.org/markw/dbt2-pgsql/190/
- metric 1427.24
Looks like about a 12% improvement in the overall metric.  The first thing I
noticed is that do_sigaction in the kernel profile almost disappeared.
Cool.

 The
top few functions in the database profile doesn't appear to have changed much.
 

I've looked at the profile:
The only unusual line is the memcpy(cur_skey, cache->cc_skey, 
sizeof(cur_skey)): it copies 144 bytes and needs ~5.3% global cpu time, 
from the 12.1% in SearchCatCache. The cachelines (line size 128 bytes) 
of cc_skey are shared with cc_bucket. 1.8% cpu time is spent in 
DLMoveToFront, the function that moves cache entries around.

Perhaps a scalability problem of the hash table? The implementation 
moves the entries around all the time, i.e. the worst case for cache 
line transfers.
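
To illustrate why move-to-front is unfriendly to cache lines, here is a small self-contained sketch of a list where every successful lookup relinks the entry at the head. It is not the catcache or Dllist code; the struct and function names are invented for illustration.

```c
#include <stddef.h>

/* Hypothetical doubly linked list with move-to-front on lookup. */
struct node { int key; struct node *prev, *next; };
struct list { struct node *head; };

void list_push_front(struct list *l, struct node *n)
{
    n->prev = NULL;
    n->next = l->head;
    if (l->head)
        l->head->prev = n;
    l->head = n;
}

void list_move_to_front(struct list *l, struct node *n)
{
    if (l->head == n)
        return;                    /* already at the front: no writes */
    /* n is not the head here, so n->prev is non-NULL. */
    n->prev->next = n->next;       /* write to the predecessor */
    if (n->next)
        n->next->prev = n->prev;   /* write to the successor */
    list_push_front(l, n);         /* writes to n, the old head, the list */
}

/* A pure read turns into up to five pointer writes on a hit. */
struct node *list_lookup(struct list *l, int key)
{
    struct node *n;

    for (n = l->head; n != NULL; n = n->next) {
        if (n->key == key) {
            list_move_to_front(l, n);
            return n;
        }
    }
    return NULL;
}
```

With several CPUs doing lookups, those writes force the touched cache lines to bounce between processors even though nothing is logically modified.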

--
   Manfred


Re: Avoiding SIGPIPE (was Re: [HACKERS] OSDL DBT-2 w/ PostgreSQL

2003-11-02 Thread Manfred Spraul
AgentM wrote:

That wouldn't offer a solution for people who use SIGPIPE for other 
things during the lifetime of the program (after creating the 
connection) and if a SIGPIPE handler is called due to the connection, 
the handler won't be expecting the source, and polling signal for 
state is essentially what you do now. Instead, I propose a 
PQsigpipeOK/PQacceptsigpipe/PQrecvsigpipe(PGconn*) or something to 
that effect which skips this check for the connection. That way, 
programmers are aware that the connection could call their SIGPIPE 
handler because they explicitly request it and the library remains 
backwards-compatible.

If I understand the libpq sources correctly, the first packets are sent 
during connection setup - PQsigpipeOK(PGconn *) would be too late.
That's why I added sigpipe=caller as a new flag for PQconnectdb.

--
   Manfred


Re: [HACKERS] OSDL DBT-2 w/ PostgreSQL 7.3.4 and 7.4beta5

2003-11-01 Thread Manfred Spraul
[EMAIL PROTECTED] wrote:

Results from 7.4beta5
	http://developer.osdl.org/markw/dbt2-pgsql/188/
	- metric 1446.01
 

CPU: P4 / Xeon with 2 hyper-threads, speed 1497.51 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a 
unit mask of 0x01 (count cycles when processor is active) count 10
samples   %        app name  symbol name
15369575  9.6780   postgres  SearchCatCache
13714258  8.6357   vmlinux   .text.lock.signal
10611912  6.6822   vmlinux   do_sigaction
4400461   2.7709   vmlinux   rm_from_queue

18% cpu time in the kernel signal handlers.

What are signals used for by postgres? I've seen the sigalarm to 
implement timeouts, what else?

--
   Manfred


Re: [HACKERS] OSDL DBT-2 w/ PostgreSQL 7.3.4 and 7.4beta5

2003-11-01 Thread Manfred Spraul
I've straced
$ pgbench -c 5 -s 6 -t 1000
total 157k syscalls, 70k of them are rt_sigaction(SIGPIPE):

1754  poll([{fd=3, events=POLLOUT|POLLERR, revents=POLLOUT}], 1, -1) = 1
1754  rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
1754  send(3, \0\0\0%\0\3\0\0user\0postgres\0database\0t..., 37, 0) = 37
1754  rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
1754  poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
1754  recv(3, R\0\0\0\10\0\0\0\0S\0\0\0\36client_encoding\0SQ..., 
16384, 0) = 169
1754  rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
1754  send(3, Q\0\0\0\35SET search_path = public\0, 30, 0) = 30
1754  rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
1754  poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
1754  recv(3, C\0\0\0\10SET\0Z\0\0\0\5I, 16384, 0) = 15
1754  rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0

and so on. Is that really necessary?

Mark: could you strace your dbt2 app? I guess your app creates a similar 
streams of rt_sigaction calls.

--
   Manfred


Re: [HACKERS] OSDL DBT-2 w/ PostgreSQL 7.3.4 and 7.4beta5

2003-11-01 Thread Manfred Spraul
Tom Lane wrote:

Manfred Spraul [EMAIL PROTECTED] writes:
 

signal handlers are a process property, not a thread property - that 
code is broken for multi-threaded apps.
   

Yeah, that's been mentioned before, but I don't see any way around it.

Do not handle SIGPIPE on multithreaded apps, and ask the caller to do 
that? The current code doesn't block SIGPIPE reliably, which makes it 
totally useless (except that it's a debugging nightmare, because 
triggering it depends on the right timing).

What we really want is to turn off SIGPIPE delivery on our socket
(only), but AFAIK there is no API to do that.
 

Linux has a MSG_NOSIGNAL flag for send(), but that seems to be 
Linux-specific.
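
A minimal sketch of the per-send flag, for platforms that define it. The wrapper name send_nosignal is made up for illustration; on a platform without MSG_NOSIGNAL the sketch simply reports failure instead of falling back to anything.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: ask the kernel not to raise SIGPIPE for this one
 * send(). A broken connection then shows up as -1 with errno == EPIPE. */
ssize_t send_nosignal(int fd, const void *buf, size_t len)
{
#ifdef MSG_NOSIGNAL
    return send(fd, buf, len, MSG_NOSIGNAL);
#else
    /* Assumption of this sketch: no fallback where the flag is missing. */
    (void) fd; (void) buf; (void) len;
    errno = ENOTSUP;
    return -1;
#endif
}
```

The flag has to be passed on every call, so it does not help for writes issued by a library that calls send() or write() itself.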

--
   Manfred


Re: Avoiding SIGPIPE (was Re: [HACKERS] OSDL DBT-2 w/ PostgreSQL

2003-11-01 Thread Manfred Spraul
Tom Lane wrote:

A bigger objection is that we couldn't get libssl to use it (AFAIK).
The flag really needs to be settable on the socket (eg, via fcntl),
not per-send.

It's a per-send flag, it's not possible to force it on with a fcntl :-(

What about an option to skip the sigaction calls for apps that can 
handle SIGPIPE? I'm not sure if an option at connect time, or a flag 
accessible through a function like PQsetnonblocking() is the better 
approach.

Attached is a patch that adds a connstr option, but I don't like it.

--
   Manfred
Index: fe-connect.c
===
RCS file: /projects/cvsroot/pgsql-server/src/interfaces/libpq/fe-connect.c,v
retrieving revision 1.260
diff -c -r1.260 fe-connect.c
*** fe-connect.c5 Sep 2003 02:08:36 -   1.260
--- fe-connect.c1 Nov 2003 21:02:04 -
***
*** 65,70 
--- 65,71 
  #else
  #define DefaultSSLModedisable
  #endif
+ #define DefaultSIGPIPEModesigaction
  
  
  /* --
***
*** 152,157 
--- 153,161 
{sslmode, PGSSLMODE, DefaultSSLMode, NULL,
SSL-Mode, , 8}, /* sizeof(disable) == 8 */
  
+   {sigpipemode, PGSIGPIPEMODE, DefaultSIGPIPEMode, NULL,
+   SIGPIPE-Mode, , 10},/* sizeof(sigaction) == 10 */
+ 
/* Terminating entry --- MUST BE LAST */
{NULL, NULL, NULL, NULL,
NULL, NULL, 0}
***
*** 369,374 
--- 373,380 
conn-sslmode = strdup(require);
}
  #endif
+   tmp = conninfo_getval(connOptions, sigpipemode);
+   conn-sigpipemode = tmp ? strdup(tmp) : NULL;
  
/*
 * Free the option info - all is in conn now
***
*** 478,483 
--- 484,508 
else
conn-sslmode = strdup(DefaultSSLMode);
  
+   /*
+* validate sigpipemode option
+*/
+   if (conn-sigpipemode)
+   {
+   if (strcmp(conn-sigpipemode, caller) != 0
+strcmp(conn-sigpipemode, sigaction) != 0)
+   {
+   conn-status = CONNECTION_BAD;
+   printfPQExpBuffer(conn-errorMessage,
+libpq_gettext(unrecognized 
sigpipemode: \%s\\n),
+ conn-sigpipemode);
+   return false;
+   }
+   }
+   else
+   conn-sigpipemode = strdup(DefaultSIGPIPEMode);
+ 
+ 
return true;
  }
  
***
*** 951,956 
--- 976,986 
else if (conn-sslmode[0] == 'a')   /* allow */
conn-wait_ssl_try = true;
  #endif
+   if (conn-sigpipemode[0] == 's') /* sigaction */
+   conn-do_sigaction = true;
+   else
+   conn-do_sigaction = false;
+ 
  
/*
 * Set up to try to connect, with protocol 3.0 as the first attempt.
***
*** 2033,2038 
--- 2063,2070 
free(conn-pgpass);
if (conn-sslmode)
free(conn-sslmode);
+   if (conn-sigpipemode)
+   free(conn-sigpipemode);
/* Note that conn-Pfdebug is not ours to close or free */
if (conn-notifyList)
DLFreeList(conn-notifyList);
Index: fe-secure.c
===
RCS file: /projects/cvsroot/pgsql-server/src/interfaces/libpq/fe-secure.c,v
retrieving revision 1.30
diff -c -r1.30 fe-secure.c
*** fe-secure.c 5 Sep 2003 02:08:36 -   1.30
--- fe-secure.c 1 Nov 2003 21:02:06 -
***
*** 348,354 
ssize_t n;
  
  #ifndef WIN32
!   pqsigfunc   oldsighandler = pqsignal(SIGPIPE, SIG_IGN);
  #endif
  
  #ifdef USE_SSL
--- 348,357 
ssize_t n;
  
  #ifndef WIN32
!   pqsigfunc   oldsighandler = NULL;
!
!   if (conn-do_sigaction)
!   oldsighandler = pqsignal(SIGPIPE, SIG_IGN);
  #endif
  
  #ifdef USE_SSL
***
*** 408,414 
n = send(conn-sock, ptr, len, 0);
  
  #ifndef WIN32
!   pqsignal(SIGPIPE, oldsighandler);
  #endif
  
return n;
--- 411,418 
n = send(conn-sock, ptr, len, 0);
  
  #ifndef WIN32
!   if (conn-do_sigaction)
!   pqsignal(SIGPIPE, oldsighandler);
  #endif
  
return n;
Index: libpq-int.h
===
RCS file: /projects/cvsroot/pgsql-server/src/interfaces/libpq/libpq-int.h,v
retrieving revision 1.82
diff -c -r1.82 libpq-int.h
*** libpq-int.h 5 Sep 2003 02:08:36 -   1.82
--- libpq-int.h 1 Nov 2003 21:02:07 -
***
*** 250,255 
--- 250,256 
char   *pguser; /* Postgres username and password, if 
any */
char   *pgpass;
char   *sslmode;/* SSL mode 

Re: [HACKERS] OSDL DBT-2 w/ PostgreSQL 7.3.4 and 7.4beta5

2003-11-01 Thread Manfred Spraul
Mark Wong wrote:

Yeah, my dbt2 applications are multithreaded.
 

Do you need SIGPIPE delivery in your app? If no, could you try what 
happens if you apply the attached patch to postgres, and perform the
   signal(SIGPIPE, SIG_IGN);
once in your dbt2 app?

--
   Manfred
--- pgsql.orig/src/interfaces/libpq/fe-secure.c 2003-11-01 22:28:13.0 +0100
+++ pgsql/src/interfaces/libpq/fe-secure.c  2003-11-01 22:27:21.0 +0100
@@ -348,7 +348,7 @@
ssize_t n;
 
 #ifndef WIN32
-   pqsigfunc   oldsighandler = pqsignal(SIGPIPE, SIG_IGN);
+/* pqsigfunc   oldsighandler = pqsignal(SIGPIPE, SIG_IGN); */
 #endif
 
 #ifdef USE_SSL
@@ -408,7 +408,7 @@
n = send(conn-sock, ptr, len, 0);
 
 #ifndef WIN32
-   pqsignal(SIGPIPE, oldsighandler);
+/* pqsignal(SIGPIPE, oldsighandler); */
 #endif
 
return n;



Re: Avoiding SIGPIPE (was Re: [HACKERS] OSDL DBT-2 w/ PostgreSQL

2003-11-01 Thread Manfred Spraul
Tom Lane wrote:

Manfred Spraul [EMAIL PROTECTED] writes:
 

What about an option to skip the sigaction calls for apps that can 
handle SIGPIPE?
   

If the app is ignoring SIGPIPE globally, then our calls will have no
effect anyway.

Wrong. From the opengroup manpage:

SIG_IGN - ignore signal
[snip]
- Setting a signal action to SIG_IGN for a signal that is pending will 
cause the pending signal to be discarded, whether or not it is blocked
   
   This is why the kernel spends 20% cpu time processing the SIG_IGN:
   it must walk through all threads of the process and check if there
   are any SIGPIPE signals pending.

 I don't see that this proposal adds any security.
 

It's not about security: Right now multithreaded apps must call 
signal(SIGPIPE, SIG_IGN), otherwise they could get killed by sudden 
SIGPIPE signals. Additionally, they can't rely on sigpending, because 
the pending bits are cleared regularly. On top of that, they get a 
noticeable performance hit.

My proposal means that apps that know what they are doing (SIGPIPE 
either SIG_IGN, or blocked, or a suitable handler) can avoid the 
signal(SIGPIPE, SIG_IGN) in pqsecure_write. With backward compatibility, 
because the current system works for single threaded apps.

--
   Manfred


Re: [HACKERS] O_DIRECT in freebsd

2003-10-30 Thread Manfred Spraul
Greg Stark wrote:

Manfred Spraul [EMAIL PROTECTED] writes:

 

One problem for WAL is that O_DIRECT would disable the write cache -
each operation would block until the data arrived on disk, and that might block
other backends that try to access WALWriteLock.
Perhaps a dedicated backend that does the writeback could fix that.
   

aio seems a better fit.

 

Has anyone tried to use posix_fadvise for the wal logs?
http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
Linux supports posix_fadvise, it seems to be part of xopen2k.
   

Odd, I don't see it anywhere in the kernel. I don't know what syscall it's
using to do this tweaking.
 

At least in 2.6: linux/mm/fadvise.c, the syscall is fadvise64 or fadvise64_64

This is the only option that seems useful for postgres for both the WAL and
vacuum (though in other threads it seems the problems with vacuum lie
elsewhere):
  POSIX_FADV_DONTNEED attempts to free cached pages associated with the
  specified region. This is useful, for example, while streaming large
  files. A program may periodically request the kernel to free cached
  data that has already been used, so that more useful cached pages are
  not discarded instead.
  Pages that have not yet been written out will be unaffected, so if the
  application wishes to guarantee that pages will be released, it should
  call fsync or fdatasync first.
 

I agree. Either immediately after each flush syscall, or just before 
closing a log file and switching to the next.

Perhaps POSIX_FADV_RANDOM and POSIX_FADV_SEQUENTIAL could be useful in a
backend before starting a sequential scan or index scan, but I kind of doubt
it.
 

IIRC the recommendation is ~20% total memory for the postgres user space 
buffers. That's quite a lot - it might be sufficient to protect that 
cache from vacuum or sequential scans. AddBufferToFreeList already 
contains a comment that this is the right place to try buffer 
replacement strategies.

--
   Manfred


Re: [HACKERS] O_DIRECT in freebsd

2003-10-29 Thread Manfred Spraul
Tom Lane wrote:

Not for WAL --- we never read the WAL at all in normal operation.  (If
it works for writes, then we would want to use it for writing WAL, but
that's not apparent from what Christopher quoted.)

At least under Linux, it works for writes. Oracle uses O_DIRECT to 
access (both read and write) disks that are shared between multiple 
nodes in a cluster - their database kernel must know when the data is 
visible to the other nodes.
One problem for WAL is that O_DIRECT would disable the write cache - 
each operation would block until the data arrived on disk, and that 
might block other backends that try to access WALWriteLock.
Perhaps a dedicated backend that does the writeback could fix that.

Has anyone tried to use posix_fadvise for the wal logs?
http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
Linux supports posix_fadvise, it seems to be part of xopen2k.

--
   Manfred


Re: [HACKERS] Database Kernels and O_DIRECT

2003-10-15 Thread Manfred Spraul
Andrew Dunstan wrote:

I have wondered (somewhat fruitlessly) for several years about the 
possibilities of special purpose lightweight file systems that could 
relax some of the assumptions and checks used in general purpose file 
systems. Such a thing might provide most of the benefits of a 
database kernel without imposing anything extra on the database 
application layer.

CPU is usually cheap compared to disk io.

There are two things that might be worth looking into:
Oracle released their cluster filesystem (ocfs) as a GPL driver for 
Linux. It might be interesting to check how it performs if used for 
postgres, but I fear that it implicitly assumes that the bulk of the 
caching is performed by the database in user space.
And using O_DIRECT for the WAL logs - the logs are never read.

--
   Manfred




Re: [HACKERS] compile warning

2003-10-10 Thread Manfred Spraul
Andrew Dunstan wrote:

Bruce Momjian wrote:

This seems to be a bug in gcc-3.3.1.  -fstrict-aliasing is enabled by
-O2 or higher optimization in gcc 3.3.1.

According to the C standard, it's illegal to access an object through a 
pointer of the wrong type. The main exception is char *.
Compilers can use this to pipeline loops, or to reorder instructions.
For example

void dummy(double *out, int *in, int len)
{
   int j;
   for (j = 0; j < len; j++)
      out[j] = 1.0 / in[j];
}
Can be pipelined if a compiler relies on strict aliasing: it's 
guaranteed that writing to out[5] won't overwrite in[6].

I think MemSet violates strict aliasing: it writes to the given address 
with (int32*). gcc might move the instructions around.
I would disable strict aliasing with -fno-strict-aliasing.
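
For code that has to stay correct with strict aliasing enabled, the usual alternatives are byte-wise access and memcpy. The sketch below shows both; the function names are invented for illustration, and the float example assumes IEEE 754 single precision with a 32-bit float.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: zero a buffer through unsigned char *, which is
 * allowed to alias any object type. */
void zero_fill(void *dst, size_t len)
{
    unsigned char *p = dst;

    while (len--)
        *p++ = 0;
}

/* Hypothetical helper: read the bit pattern of a float without casting
 * its address to uint32_t *. Compilers typically turn this memcpy into
 * a plain register move. */
uint32_t float_bits(float f)
{
    uint32_t u;

    memcpy(&u, &f, sizeof u);
    return u;
}
```

A word-at-a-time fill like MemSet is faster than the byte loop, but that is exactly the access pattern the aliasing rules do not cover.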

  In the Linux kernel, you can see this in include/linux/tcp.h:

   /*
*  The union cast uses a gcc extension to avoid aliasing problems
*  (union is compatible to any of its members)
*  This means this part of the code is -fstrict-aliasing safe now.
*/
The kernel is still compiled with -fno-strict-aliasing - I'm not sure if 
there are outstanding problems, or if it's just a safety precaution.

--
   Manfred


Re: [HACKERS] IDE Drives and fsync

2003-10-08 Thread Manfred Spraul
scott.marlowe wrote:

OK, I've done some more testing on our IDE drive machine.

First, some background.  The hard drives we're using are Seagate 
drives, model number ST380023A.  Firmware version is 3.33.  The machine 
they are in is running RH9.  The setup string I'm feeding them on startup 
right now is:  hdparm -c3 -f -W1 /dev/hdx

where:

-c3 sets I/O to 32 bit w/sync (uh huh, sure...)

sync has nothing to do with sync to disk. The sync means read from three 
magic io ports before transferring data to or from the device.


-f sets the drive to flush buffer cache on exit

-f shouldn't have any effect: it means that the buffer cache in the OS 
is flushed after hdparm exits, it has no long-term effect on the disk.

-W1 turns on write caching

That's the problem: turning on write caching causes corruptions.
What's needed is partial write caching: write cache on, and fsync() 
sends a barrier to the disk, and only after the disk reports that the 
barrier is completed, then fsync() returns.
I consider that an OS/driver problem, not a problem for postgres.

The drives come up using DMA.  Turning unmask IRQ on / off has no effect 
on the tests I've been performing.
 

Of course. irq unmasking is about interrupt latency if DMA is not used: 
DMA off and dma masking off results in dropped bytes on serial links.

Without the -f switch, data corruption due to sudden power down is 
almost certain.

It's odd that adding -f reduces the corruptions - probably it changes 
available memory, and thus the writeback of data from kernel to disk.

Tom, you had mentioned adding a delay of some kind to the fsync logic, and 
I'd be more than willing to try out any patch you'd like to toss out to me 
to see if we can get a semi-stable behaviour out of IDE drives with the 
-W1 and -f switches turned on.

I'm not aware that there is any safe delay. Disks with write caches 
reorder io operations, and some hold back write operations indefinitely.

Unfortunately Linux doesn't implement write barriers, and the support in 
some IDE disks is missing, too :-(

--
   Manfred


Re: [HACKERS] 2-phase commit

2003-09-29 Thread Manfred Spraul
Peter Eisentraut wrote:

Tom Lane writes:

 

No.  The real problem with 2PC in my mind is that its failure modes
occur *after* you have promised commit to one or more parties.  In
multi-master, if you fail you know it before you have told the client
his data is committed.
   

I have a book here which claims that the solution to the problems of
2-phase commit is 3-phase commit, which goes something like this:
coordinator                     participant
-----------                     -----------
INITIAL                         INITIAL
        prepare -->
WAIT
                                <-- vote commit
                                READY
(all voted commit)
        prepare-to-commit -->
PRE-COMMIT
                                <-- ready-to-commit
                                PRE-COMMIT
        global-commit -->
COMMIT                          COMMIT
If the coordinator fails and all participants are in state READY, they can
safely decide to abort after some timeout.  If some participant is already
in state PRE-COMMIT, it becomes the new coordinator and sends the
global-commit message.
Details are left as an exercise. :-)
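
The recovery rule for a failed coordinator, as stated above, can be written down as a small decision function. This is only a sketch with invented names; treating a participant that already reached COMMIT like PRE-COMMIT is an assumption of the sketch, not something the book excerpt says.

```c
/* Hypothetical participant states, following the diagram above. */
enum tpc_state { TPC_INITIAL, TPC_READY, TPC_PRE_COMMIT, TPC_COMMIT };
enum tpc_decision { TPC_DECIDE_ABORT, TPC_DECIDE_COMMIT };

/* Decide what the surviving participants do after the coordinator failed
 * and the timeout expired. */
enum tpc_decision tpc_recover(const enum tpc_state *states, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        /* Someone got prepare-to-commit: every participant voted commit,
         * so the transaction can be completed. */
        if (states[i] == TPC_PRE_COMMIT || states[i] == TPC_COMMIT)
            return TPC_DECIDE_COMMIT;
    }
    /* Nobody went past READY: aborting is safe. */
    return TPC_DECIDE_ABORT;
}
```

The function assumes that the states of all participants can actually be collected; a silent participant is not covered by this rule.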
 

Ok. Let's assume one coordinator, two participants.
The coordinator sends global-commit to both. One replies with ok, the 
other one remains silent.
What should the coordinator do? It can't fail the transaction - the 
first participant has committed its part. It can't complete the 
transaction, because the ok from the 2nd participant is still outstanding.
I think Bruce is right: It's an admin decision. If a timeout expires, a 
user-supplied app should be called, with a safe default (database 
shutdown?).

--
   Manfred


Re: [HACKERS] Threads vs Processes (was: NuSphere and PostgreSQL

2003-09-25 Thread Manfred Spraul
Tom Lane wrote:

Claudio Natoli [EMAIL PROTECTED] writes:
 

How are you dealing with the issue of wanting some static variables to
be per-thread and others not?
 

 

To be perfectly honest, I'm still trying to familiarize myself with the code
sufficiently well so that I can tell which variables need to be per-thread
and which are shared (and, in turn, which of these need to be protected from
concurrent access).

No. Not protected from concurrent access. Each thread must have its own 
copy.

   

Well, the first-order approximation would be to duplicate the current
fork semantics: *all* static variables are per-thread, and should be
copied from the parent thread at thread creation.  If there is some
reasonably non-invasive way to do that, we'd have a long leg up on the
problem.

There is a __declspec(thread) that makes a global variable per-thread. 
AFAIK it uses linker magic to replace the actual memory accesses with 
calls to TlsAlloc() etc. Note that __declspec(thread) doesn't work from 
within dynamic link libraries, but that shouldn't be a big problem.
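
For comparison, the portable C11 spelling of a per-thread global is _Thread_local. The sketch below uses POSIX threads rather than the Win32 API, and all names in it are invented for illustration. Note that it does not give fork semantics: a new thread starts from the initializer, not from a copy of the parent's current value.

```c
#include <pthread.h>

/* Hypothetical per-thread variable: every thread gets its own instance,
 * initialized to 1. */
static _Thread_local int per_thread_value = 1;

static void *worker(void *arg)
{
    int *seen = arg;

    per_thread_value += 5;      /* touches only the worker's copy */
    *seen = per_thread_value;
    return NULL;
}

/* Returns 0 on success and reports what each thread saw. */
int thread_local_demo(int *worker_value, int *main_value)
{
    pthread_t tid;

    per_thread_value = 100;     /* the calling thread's copy */
    if (pthread_create(&tid, NULL, worker, worker_value) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;

    *main_value = per_thread_value;
    return 0;
}
```

So duplicating fork behaviour would still need an explicit copy step at thread creation for every such variable.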

--
   Manfred




Re: [HACKERS] semtimedop instead of setitimer/semop/setitimer

2003-09-20 Thread Manfred Spraul
Tom Lane wrote:

AFAIK, semops are not done unless we actually have to yield the
processor, so saving a syscall or two in that path doesn't sound like a
big win.  I'd be more interested in asking why you're seeing long series
of semops in the first place.
 

Virtually all semops yield the processor, that part works.
I couldn't figure out what exactly causes the long series of semops. I 
tried to track it down (enable LOCK_DEBUG):
- postgres 7.3.3.
- pgbench -c 30 -t 300
- database stored on ramdisk - laptop disks are just too slow.

The long series of semops are caused by lots of processes that try to 
acquire a lock that is held exclusively by another process.
Something like
* 10 processes are waiting for a ShareLock on lock c568c (lock A). One of 
them already owns an ExclusiveLock on lock c91b4 (lock B).
* everyone receives the ShareLock on lock A, does something, and drops it.
* then the other 9 processes try to acquire a ShareLock on lock B, and go 
to sleep.

Is there a simple way to figure out what lock c91b4 is?

Here is the log: I've added getpid() to the elog calls and I've 
overridden LOCK_DEBUG_ENABLED to write out everything always. 
Additionally, I've printed the caller address for LockAcquire

 Process 29420 acquires a lock exclusively:
LockAcquire for pid 29420 called by 0x81147d6 (XactLockTableInsert)
LockAcquire: new: 29420 lock(c91b4) tbl(1) rel(376) db(0) obj(1439) 
grantMask(0) req(0,0,0,0,0,0,0)=0 grant(0,0,0,0,0,0,0)=0 wait(0) 
type(ExclusiveLock)
LockAcquire: new: 29420 holder(c95e8) lock(c91b4) tbl(1) proc(a47b0) 
xid(1439) hold(0,0,0,0,0,0,0)=0
LockCheckConflicts: no conflict: 29420 holder(c95e8) lock(c91b4) tbl(1) 
proc(a47b0) xid(1439) hold(0,0,0,0,0,0,0)=0
GrantLock: 29420 lock(c91b4) tbl(1) rel(376) db(0) obj(1439) 
grantMask(80) req(0,0,0,0,0,0,1)=1 grant(0,0,0,0,0,0,1)=1 wait(0) 
type(ExclusiveLock)
[ Snip]
 Process 29420 acquires another lock shared, goes to sleep.
LockAcquire for pid 29420 called by 0x811484a (XactLockTableWait)
LockAcquire: found: 29420 lock(c568c) tbl(1) rel(376) db(0) obj(1421) 
grantMask(80) req(0,0,0,0,2,0,1)=3 grant(0,0,0,0,0,0,1)=1 wait(2) 
type(ShareLock)
LockAcquire: new: 29420 holder(c62c0) lock(c568c) tbl(1) proc(a47b0) 
xid(1439) hold(0,0,0,0,0,0,0)=0
LockCheckConflicts: conflicting: 29420 holder(c62c0) lock(c568c) tbl(1) 
proc(a47b0) xid(1439) hold(0,0,0,0,0,0,0)=0
WaitOnLock: sleeping on lock: 29420 lock(c568c) tbl(1) rel(376) db(0) 
obj(1421) grantMask(80) req(0,0,0,0,3,0,1)=4 grant(0,0,0,0,0,0,1)=1 
wait(2) type(ShareLock)
ProcSleep from 0x8115763, pid 29420, proc 0xbf2f57b0 for 0xbf31668c, mode 5.
 omitted: several other processes sleep on the same lock.
 omitted: LockReleaseAll grants the lock to everyone that was 
sleeping on c568c
 For several threads:
LOG:  ProcSleep from 0x8115763, pid 29436, proc 0xbf2f52f0 for 
0xbf31668c done.

LOG:  WaitOnLock: wakeup on lock: 29436 lock(c568c) tbl(1) rel(376) 
db(0) obj(1421) grantMask(20) req(0,0,0,0,3,0,0)=3 
grant(0,0,0,0,3,0,0)=3 wait(0) type(ShareLock)
LOG:  LockAcquire: granted: 29436 holder(c6274) lock(c568c) tbl(1) 
proc(a42f0) xid(1446) hold(0,0,0,0,1,0,0)=1
LOG:  LockAcquire: granted: 29436 lock(c568c) tbl(1) rel(376) db(0) 
obj(1421) grantMask(20) req(0,0,0,0,3,0,0)=3 grant(0,0,0,0,3,0,0)=3 
wait(0) type(ShareLock)
LOG:  LockRelease: found: 29436 lock(c568c) tbl(1) rel(376) db(0) 
obj(1421) grantMask(20) req(0,0,0,0,3,0,0)=3 grant(0,0,0,0,3,0,0)=3 
wait(0) type(ShareLock)
LOG:  LockRelease: found: 29436 holder(c6274) lock(c568c) tbl(1) 
proc(a42f0) xid(1446) hold(0,0,0,0,1,0,0)=1
LOG:  LockRelease: updated: 29436 lock(c568c) tbl(1) rel(376) db(0) 
obj(1421) grantMask(20) req(0,0,0,0,2,0,0)=2 grant(0,0,0,0,2,0,0)=2 
wait(0) type(ShareLock)
LOG:  LockRelease: updated: 29436 holder(c6274) lock(c568c) tbl(1) 
proc(a42f0) xid(1446) hold(0,0,0,0,0,0,0)=0
LOG:  LockRelease: deleting: 29436 holder(c6274) lock(c568c) tbl(1) 
proc(a42f0) xid(1446) hold(0,0,0,0,0,0,0)=0
LOG:  LockAcquire for pid 29436 called by 0x811484a. (XactLockTableWait)

LOG:  LockAcquire: found: 29436 lock(c91b4) tbl(1) rel(376) db(0) 
obj(1439) grantMask(80) req(0,0,0,0,2,0,1)=3 grant(0,0,0,0,0,0,1)=1 
wait(2) type(ShareLock)
LOG:  LockAcquire: new: 29436 holder(c6274) lock(c91b4) tbl(1) 
proc(a42f0) xid(1446) hold(0,0,0,0,0,0,0)=0
LOG:  LockCheckConflicts: conflicting: 29436 holder(c6274) lock(c91b4) 
tbl(1) proc(a42f0) xid(1446) hold(0,0,0,0,0,0,0)=0
LOG:  WaitOnLock: sleeping on lock: 29436 lock(c91b4) tbl(1) rel(376) 
db(0) obj(1439) grantMask(80) req(0,0,0,0,3,0,1)=4 
grant(0,0,0,0,0,0,1)=1 wait(2) type(ShareLock)
LOG:  ProcSleep from 0x8115763, pid 29436, proc 0xbf2f52f0 for 
0xbf31a1b4, mode 5.


Hmm. The initial exclusive lock is from XactLockTableInsert, the 
ShareLock waits are from XactLockTableWait. Everyone tries to start a 
transaction on the same entry?
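
A note on reading the numbers in the log, inferred only from the excerpt above: grantMask looks like a hexadecimal bitmask with bit N set for lock mode N. ExclusiveLock is the 7th slot of the req/grant arrays and shows up as grantMask(80); ShareLock is the 5th slot ("mode 5" in ProcSleep) and shows up as grantMask(20). Also, lock c91b4 carries obj(1439), the same number as the xid(1439) of process 29420, which would fit XactLockTableInsert taking a lock on the process's own transaction ID. A minimal sketch of the mask arithmetic, with helper names that are hypothetical and not taken from the backend source:

```c
#include <stdio.h>

/* Lock mode numbers as they appear to be used in the log above
 * (assumption drawn from the log: ShareLock = 5, ExclusiveLock = 7). */
enum { SHARE_LOCK = 5, EXCLUSIVE_LOCK = 7 };

/* Bit that represents one lock mode in a grantMask-style bitmask. */
static unsigned lock_bit(int mode)
{
    return 1u << mode;
}

/* Nonzero if the mask contains the given mode. */
static int mask_has_mode(unsigned mask, int mode)
{
    return (mask & lock_bit(mode)) != 0;
}

/* Print every mode number that is set in a mask such as 0x80 or 0x20. */
static void print_modes(unsigned mask)
{
    for (int mode = 1; mode <= 8; mode++)
        if (mask_has_mode(mask, mode))
            printf("mask 0x%x holds mode %d\n", mask, mode);
}
```

Under this reading, a ShareLock request against a lock whose grantMask is 0x80 has to wait until the ExclusiveLock holder releases it, which matches the "conflicting" and "sleeping on lock" lines.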

I've uploaded a larger part (500 kB) of the log to 
http://www.colorfullife.com/~manfred/sql-log.gz

--
   Manfred

Re: [HACKERS] semtimedop instead of setitimer/semop/setitimer

2003-09-20 Thread Manfred Spraul
Tom Lane wrote:

Oh, pgbench ;-).  Are you aware that you need a scale factor (-s)
larger than the number of clients to avoid unreasonable levels of
contention in pgbench?

No. What about adding a few reasonable examples to README? I've switched 
to pgbench -c 10 -s 11 -t 1000 test. Is that ok?
Now the semop calls are virtually gone. That leaves open why SysV 
semaphores showed up so high in the dbt2 benchmarks, but that's a separate question.

I'm back to my original idea: align the data buffers to speed up the 
user space/kernel space transfers. It looks good:
before: (with/without connection)
  105.031776 / 105.093682
  105.201246 / 105.260008
after aligning:
  112.664320 / 112.730542
  111.031901 / 111.098496
  111.685869 / 111.751130

Tested with 7.3.4. Initially I tried to increase MAX_ALIGNOF to 16, but 
the result didn't work: pgbench failed with:

ERROR:  CREATE DATABASE cannot be executed from a function
createdb: database creation failed

For my test I've manually edited shmem and aligned all allocations to 
16-byte offsets. I'll try to compile the 7.4 CVS tree; probably some code 
makes wrong assumptions about the alignment values.

--
   Manfred


Re: [HACKERS] semtimedop instead of setitimer/semop/setitimer

2003-09-20 Thread Manfred Spraul
Tom Lane wrote:

Manfred Spraul [EMAIL PROTECTED] writes:
 

... Initially I tried to increase MAX_ALIGNOF to 16, but 
the result didn't work:
   

You would need to do a full recompile and initdb to alter MAX_ALIGNOF.

I think I did that, but it still failed. 7.4cvs works, so I'll ignore it.
MAX_ALIGNOF affects the on-disk format, correct? Then I agree that it's 
wrong to change it.

However, if you are wanting to raise it past about 8, that's probably
not the way to go anyway; it would create padding wastage in too many
places.  It would make more sense to allocate the buffers using a
variant ShmemAlloc that could be told to align this particular object
on an N-byte boundary.  Then it costs you no more than N bytes in the
one place.

I agree, I'll write a patch.
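
The aligning allocator suggested above could look roughly like the following. This is only a sketch of the idea, not the patch: alloc_aligned and align_up are hypothetical names, and malloc() stands in for the shared memory allocator so that the example runs outside the backend.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round addr up to the next multiple of align (align must be a power of two). */
static uintptr_t align_up(uintptr_t addr, size_t align)
{
    return (addr + (align - 1)) & ~((uintptr_t) align - 1);
}

/*
 * Hypothetical stand-in for an aligning ShmemAlloc variant: over-allocate by
 * align bytes and return the first suitably aligned address inside the block.
 * The raw pointer is handed back through *raw so the caller can free it.
 */
static void *alloc_aligned(size_t size, size_t align, void **raw)
{
    *raw = malloc(size + align);
    if (*raw == NULL)
        return NULL;
    return (void *) align_up((uintptr_t) *raw, align);
}
```

The padding is paid once per call, so aligning only the buffer pool costs at most N bytes in total, whereas raising MAX_ALIGNOF would pad every aligned object.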

(BTW, I wonder whether there would be any win in allocating the buffers
on a 4K or 8K page boundary... do any kernels use virtual memory mapping
tricks to replace data copying in such cases?)

Linux doesn't. Page table games are considered evil, because TLB 
flushing is expensive, especially on SMP.

--
   Manfred


[HACKERS] semtimedop instead of setitimer/semop/setitimer

2003-09-19 Thread Manfred Spraul
I've noticed that postgres strace output contains long groups of 
setitimer/semop/setitimer.
Just FYI: semtimedop is a special syscall that implements a semop with 
a timeout. It was added just for the purpose of avoiding the setitimer 
calls.
I know that it's supported by Solaris and recent Linux versions; I'm not 
sure about other operating systems.

Has anyone tried to use it? Oracle pushed it into Linux, so it seems to be 
worth the effort:
http://www.ussg.iu.edu/hypermail/linux/kernel/0211.3/0485.html
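
For illustration, a single semtimedop call replacing the setitimer/semop/setitimer triple might look like this. It is a minimal sketch assuming Linux with glibc (where the declaration needs _GNU_SOURCE); sem_wait_timeout is a hypothetical helper name, and how the backend would map its deadlock timeout onto the timespec is not addressed here.

```c
#define _GNU_SOURCE            /* semtimedop() is exposed as a GNU extension */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>
#include <errno.h>

/*
 * Try to decrement semaphore 0 of semid, giving up after timeout_ms.
 * Returns 0 on success, or -1 with errno set to EAGAIN on timeout.
 */
static int sem_wait_timeout(int semid, long timeout_ms)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    struct timespec ts = {
        .tv_sec  = timeout_ms / 1000,
        .tv_nsec = (timeout_ms % 1000) * 1000000L
    };

    /* One syscall: the kernel handles both the wait and the timeout. */
    return semtimedop(semid, &op, 1, &ts);
}
```

A caller would treat errno == EAGAIN as "timer expired" and errno == EINTR as an ordinary interrupted wait, much as it does today when SIGALRM interrupts semop.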

--
   Manfred


[HACKERS] Memory buffer alignment

2003-09-18 Thread Manfred Spraul
Hi,

When analyzing the kernel profile from osdl dbt benchmarks, I noticed 
that around 50% of the kernel time is spent in __copy_user_intel.
http://khack.osdl.org/stp/280060/profile/

This function is one of two functions that do the actual memory copy 
between kernel space and user space.
Unfortunately it's the slower one: Intel CPUs have a microcode fast path 
for memory copies that are 8-byte aligned. This fast path is around 50% 
faster than the manual copy that is used for misaligned (i.e. only 
4-byte aligned) pointers. I don't know enough about other CPUs, but I'd 
expect that most CPUs prefer well-aligned buffers.
How are the user space buffers allocated?
So far I found buffile.c, but struct BufFile.buffer is at offset 32, 
i.e. aligned, although by chance. What is the alignment of the output of 
palloc? Is buffile.c the main code that reads/writes data to disk?
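
To make the "offset 32, aligned by chance" observation concrete, here is a small sketch of how such a check can be done. struct demo_file is a made-up stand-in, not the real BufFile layout, and is_aligned is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p sits on an align-byte boundary (align must be a power of two). */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t) p & (align - 1)) == 0;
}

/*
 * Hypothetical struct standing in for BufFile: the header fields are invented
 * and only serve to push the buffer member away from offset 0.
 */
struct demo_file {
    int  header[8];        /* 32 bytes of bookkeeping with 4-byte int */
    char buffer[8192];     /* the I/O buffer handed to read()/write() */
};
```

offsetof(struct demo_file, buffer) gives the member offset at compile time, but the buffer is only 8-byte aligned at run time if the struct itself starts on an 8-byte boundary, which is why the alignment of palloc's return value matters.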

--
   Manfred


Re: [HACKERS] [PATCHES] Reorganization of spinlock defines

2003-09-12 Thread Manfred Spraul
Bruce Momjian wrote:

Tom Lane wrote:
 

Bruce Momjian [EMAIL PROTECTED] writes:
   

He is uncomfortable with the port/*.h changes at this point, so it seems
I am going to have to add Itanium/Opteron tests to most of those files.
 

Why don't you try to put together a proposed patch of that kind, and
then we can look to see how big and ugly it is compared to the other?
If the alternative is shown to be really messy, that would sway my
opinion, maybe Marc's too.
   

OK, here is an Opteron/Itanium patch that might work.  I say might
because I don't have a lot of confidence in the current spinlock
detection code.  There is an uncoupling between the definition of
HAS_TEST_AND_SET, the data type used by slock_t, and the assembler code.
 

Is the Itanium tas implementation correct? I think it should be 
xchg4.acq instead of just xchg4 - as far as I know a normal atomic 
exchange is not a memory barrier on Itanium. At least the Linux 
kernel version contains cmpxchg4.acq.

--
   Manfred




Re: [HACKERS] [PATCHES] Reorganization of spinlock defines

2003-09-12 Thread Manfred Spraul
Manfred Spraul wrote:

Is the Itanium tas implementation correct? I think it should be 
xchg4.acq instead of just xchg4 - as far as I know a normal atomic 
exchange is not a memory barrier on Itanium. At least the Linux 
kernel version contains cmpxchg4.acq.

Sorry for the noise, I'm wrong:
Itanium automatically uses acquire semantics with xchg.
See top of page 16 on
http://h21007.www2.hp.com/dspp/files/unprotected/itanium/spinlocks.pdf
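
For comparison, the acquire requirement can also be expressed without hand-written assembler. The sketch below uses GCC's __sync builtins rather than the xchg4 code under discussion; GCC documents __sync_lock_test_and_set as an acquire barrier and __sync_lock_release as a release barrier. The int-based slock_t and the s_unlock name are assumptions made for this example.

```c
/* Spinlock word; using a plain int here is an assumption for the sketch. */
typedef volatile int slock_t;

/*
 * Test-and-set: atomically store 1 and return the previous value.
 * Returns 0 if the caller obtained the lock, nonzero if it was already held.
 */
static int tas(slock_t *lock)
{
    return __sync_lock_test_and_set(lock, 1);
}

/* Release the lock by storing 0 with release semantics. */
static void s_unlock(slock_t *lock)
{
    __sync_lock_release(lock);
}
```

Acquire semantics on the exchange are what keep loads and stores inside the critical section from being reordered ahead of taking the lock.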
--
   Manfred


Re: [HACKERS] FreeBSD/i386 thread test

2003-09-08 Thread Manfred Spraul
Jeroen Ruigrok/asmodai wrote:

-On [20030908 23:52], Peter Eisentraut ([EMAIL PROTECTED]) wrote:
 

Why would FreeBSD have a library of thread-safe libc functions (libc_r)
if the functions weren't thread-safe?  I think the test is faulty.
   

A thread-safe library has a per-thread errno value (i.e. errno is a 
#define to a function call), thread-safe I/O buffers for stdio, etc. Some 
of these changes cause a noticeable overhead, hence a separate library for 
those users who want to avoid that overhead.

Reentrancy is independent of _r: if you look at the prototype of 
gethostbyname(), it's just not possible to make that thread-safe with 
reasonable effort - the C library would have to keep one buffer per 
thread around.

Having libc_r is not a guarantee that all functions of libc are
represented in that library as thread-safe functions.
gethostbyname_r() is a notable reentrant function which is absent in
FreeBSD.
 

Is there a thread-safe alternative to gethostbyname() for FreeBSD?
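
One candidate is getaddrinfo(), which writes its result into freshly allocated memory instead of a static buffer and which POSIX requires to be thread-safe. Whether a particular FreeBSD release actually honours that is an assumption that would need checking. A minimal sketch, with resolve_ipv4 as a hypothetical helper name:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <string.h>

/*
 * Resolve host into caller-owned storage instead of a shared static buffer.
 * Returns 0 on success or a getaddrinfo() error code.
 */
static int resolve_ipv4(const char *host, struct sockaddr_in *out)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int rc;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;          /* IPv4 only, to keep the sketch short */
    hints.ai_socktype = SOCK_STREAM;

    rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc != 0)
        return rc;

    memcpy(out, res->ai_addr, sizeof(*out));   /* copy the first result */
    freeaddrinfo(res);
    return 0;
}
```

Because each call owns its result list until freeaddrinfo(), two threads resolving names at the same time do not overwrite each other's answers, which is exactly what the gethostbyname() prototype cannot offer.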

--
   Manfred




Re: [osdldbt-general] Re: [HACKERS] Prelimiary DBT-2 Test results

2003-09-05 Thread Manfred Spraul
Another question:
Is it possible to apply patches to postgresql before a DBT-2 run, or is 
only patching the kernel supported?

--
   Manfred


Re: [HACKERS] Prelimiary DBT-2 Test results

2003-09-04 Thread Manfred Spraul
[EMAIL PROTECTED] wrote:

http://developer.osdl.org/markw/44/

I threw together (kind of sloppily) a web page of the data I was
starting to collect for our DBT-2 workload (TPC-C derivative) on
PostgreSQL 7.3.4. Keep in mind not much database tuning has been done
yet.  Feel free to ask any questions.
 

The kernel readprofile output is very odd:
sys_ipc receives lots of hits, but that function is a trivial multiplexer.
sys_timedsemop and try_atomic_semop got 0 hits, yet those are the main 
implementation of SysV semaphores. Could you double check your 
readprofile scripts?

--
   Manfred


Re: [pgsql-advocacy] [HACKERS] [GENERAL] Postgresql AMD x86-64

2003-07-18 Thread Manfred Spraul
Bruce Momjian wrote:

 if test $enable_debug = yes  test $ac_cv_prog_cc_g = yes; then
   CFLAGS=$CFLAGS -g
 fi
+ 
+ /* Compile AMD Opteron using gcc in 64-bit mode */
+ if test $GCC = yes; then
+ case $host in
+   ia64-*)  CFLAGS=$CFLAGS -m64
+LDFLAGS=$LDFLAGS -melf_x86_64;;
+ esac
+ fi
+ 

Sorry, I think I confused you:
ia64-* is Intel's Itanium system. These are 64-bit-only CPUs (the 32-bit 
emulation is too slow to be usable). It's supported by multiple 
operating systems, among them HP-UX, Linux and Windows. As far as I can see 
it's supported directly by 7.3.3; at least Red Hat builds its ia64 
version without any patches.
x86_64 is AMD's Opteron/Athlon 64 system. These CPUs run 32-bit and 64-bit 
code concurrently. Right now it is only supported by Linux; BSD and Windows 
support is expected soon.
Thus the test must be for x86_64-*.
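
The same distinction is visible from C through the macros that GCC predefines for each target, which can be a quick sanity check of what a build is really targeting. This is only an illustrative sketch; target_arch is a hypothetical helper and is not part of the configure patch.

```c
/* Report the target architecture from compiler-defined macros. */
static const char *target_arch(void)
{
#if defined(__x86_64__)
    return "x86_64";       /* AMD Opteron / Athlon 64 */
#elif defined(__ia64__)
    return "ia64";         /* Intel Itanium */
#elif defined(__i386__)
    return "i386";         /* 32-bit x86 */
#else
    return "other";
#endif
}
```

On an Opteron, a compiler that defaults to 32-bit output would report i386 here, which is the case where the -m64 switch matters.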

Martin: you are using debian-testing, correct? I've asked a Suse 
developer, and on their Linux distribution, -m64 is the default, i.e. 
you don't need any switches.

--
   Manfred


Re: [HACKERS] ECPG thread-safety

2003-06-01 Thread Manfred Spraul
Shridhar Daithankar wrote:

2) Native freeBSD threads
pthread.h in /usr/include and lc_r
 

Do you know if FreeBSD supports pthread_rwlock with 
PTHREAD_PROCESS_SHARED? I'm trying to replace the LWLocks with 
pthread_rwlocks.

What about other Unices?
--
   Manfred