Re: [HACKERS] sync_file_range()

2006-06-18 Thread ITAGAKI Takahiro
Christopher Kings-Lynne <[EMAIL PROTECTED]> wrote:

> http://lwn.net/Articles/178199/
> Check out the article on sync_file_range():

> Is that at all useful for PostgreSQL's purposes?

I'm interested in it; we could use it to improve responsiveness during
checkpoints. It is a Linux-specific system call, but we could use the
combination of mmap() and msync() instead of it; I mean we would use
mmap() only to flush dirty pages, not to read or write pages.
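
A minimal sketch of that idea (a rough illustration only, not tested; it
assumes a page-aligned offset and POSIX mmap()/msync()):

#include <sys/mman.h>
#include <sys/types.h>

/* Flush a range of an already-open file by mapping it and msync()ing it;
 * we never read or write through the mapping itself. */
static int
flush_file_range(int fd, off_t offset, size_t nbytes)
{
    void *p = mmap(NULL, nbytes, PROT_READ, MAP_SHARED, fd, offset);

    if (p == MAP_FAILED)
        return -1;
    /* MS_ASYNC schedules writeback of dirty pages in the range;
     * MS_SYNC would also wait for it to complete. */
    if (msync(p, nbytes, MS_ASYNC) != 0)
    {
        munmap(p, nbytes);
        return -1;
    }
    return munmap(p, nbytes);
}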

---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] kill_prior_tuple for bitmap scan

2006-06-18 Thread Qingqing Zhou

"Tom Lane" <[EMAIL PROTECTED]> wrote
>
> Really?  An indexscan will release pin before returning no-more-tuples,
> and had better do so else we leak pins during queries involving many
> indexscans.
>

I guess I see your point. For scan stages that haven't yet returned
no-more-tuples we can do the kill, but the problem is that most bitmap index
scans finish in just one round :-(.

>
> Not sure I believe that either.  Even granting the assumption that the
> pages are still in cache, this implies a big increase in bufmgr traffic.
>

If you mean the bufmgr traffic on the BufMappingLock, then I don't worry
too much. Notice that since we can keep a list of the buffer_ids we are
interested in, we can pin and recheck the buftags of those targets without
asking the bufmgr where they are. If we miss, unpinning and forgetting them
is fine.

Regards,
Qingqing





[HACKERS] sync_file_range()

2006-06-18 Thread Christopher Kings-Lynne

http://lwn.net/Articles/178199/

Check out the article on sync_file_range():


long sync_file_range(int fd, loff_t offset, loff_t nbytes, int flags);

This call will synchronize a file's data to disk, starting at the given 
offset and proceeding for nbytes bytes (or to the end of the file if 
nbytes is zero). How the synchronization is done is controlled by flags:


* SYNC_FILE_RANGE_WAIT_BEFORE blocks the calling process until any 
already in-progress writeout of pages (in the given range) completes.


* SYNC_FILE_RANGE_WRITE starts writeout of any dirty pages in the 
given range which are not already under I/O.


* SYNC_FILE_RANGE_WAIT_AFTER blocks the calling process until the 
newly-initiated writes complete.


An application which wants to initiate writeback of all dirty pages 
should provide the first two flags. Providing all three flags guarantees 
that those pages are actually on disk when the call returns.
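
For illustration, a minimal sketch of initiating writeback this way (my
assumption, not from the article: it presumes a libc that exposes
sync_file_range(), e.g. via <fcntl.h> with _GNU_SOURCE defined):

#define _GNU_SOURCE
#include <fcntl.h>

/* Start writeback of a file's dirty pages without waiting for completion;
 * nbytes = 0 means "to the end of the file", per the description above. */
static int
start_writeback(int fd)
{
    return sync_file_range(fd, 0, 0,
                           SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE);
}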



Is that at all useful for PostgreSQL's purposes?

Chris





Re: [HACKERS] kill_prior_tuple for bitmap scan

2006-06-18 Thread Tom Lane
"Qingqing Zhou" <[EMAIL PROTECTED]> writes:
> As I read, the kill_prior_tuple optimization doesn't work for bitmap scan
> code. To fix this problem, we have two choices.

> One is still use the kill_prior_tuple trick in a modified way. Heap TIDs
> recycling should not be a problem. This is because generally we always hold
> pin of the last index page we scanned,

Really?  An indexscan will release pin before returning no-more-tuples,
and had better do so else we leak pins during queries involving many
indexscans.  (IIRC, I broke that around 7.3 :-()

> Another smaller problem is that we have
> to scan the heap pages again to identify the index pointers, but that's
> affordable since these pages should be in the buffer pool with big chance.

Not sure I believe that either.  Even granting the assumption that the
pages are still in cache, this implies a big increase in bufmgr traffic.

This certainly is an interesting problem, but there is not a simple
solution.

regards, tom lane



[HACKERS] kill_prior_tuple for bitmap scan

2006-06-18 Thread Qingqing Zhou
As I read it, the kill_prior_tuple optimization doesn't work for the bitmap
scan code. To fix this problem, we have two choices.

One is to still use the kill_prior_tuple trick, in a modified way. Heap TID
recycling should not be a problem, because generally we always hold a pin on
the last index page we scanned, so vacuum has to wait and the underlying heap
can't get vacuumed. Another, smaller problem is that we have to scan the heap
pages again to identify the index pointers, but that's affordable since those
pages are likely still in the buffer pool.

An alternative is to leave the job to autovacuum -- once we find a bitmap
scan seeing a lot of dead tuples, notify it to do the job. But this doesn't
look very attractive -- autovacuum works at a coarse, blind level and costs a
lot of CPU and I/O.

Anyway, it is a performance loss on a frequently updated table if we do
nothing. I observed an OLTP server optimized for index scans on 7.4
experience a performance problem when the plan changed to a bitmap index
scan.

Are there any show-stopping reasons not to do either of these?

Regards,
Qingqing







Re: [HACKERS] Rethinking stats communication mechanisms

2006-06-18 Thread Tom Lane
"Bort, Paul" <[EMAIL PROTECTED]> writes:
>> Anyone know a variant of this that really works?

> Here's a theory: If the counter is bumped to an odd number before
> modification, and an even number after it's done, then the reader will
> know it needs to re-read if the counter is an odd number. 

Great minds think alike ;-) ... I just committed exactly that protocol.
I believe it is correct, because AFAICS there are only four possible
risk cases:

* reader's read starts before and ends after writer's update: reader
will certainly note a change in update counter.

* reader's read starts before and ends within writer's update: reader
will note a change in update counter.

* reader's read starts within and ends after writer's update: reader
will note a change in update counter.

* reader's read starts within and ends within writer's update: reader
will see update counter as odd.

Am I missing anything?

Note in particular that this protocol does not assume atomic update
of the counter, so we don't need to worry about whether int is
sig_atomic_t.  If any of the bytes change, we have what we need.
We could use a counter narrower than int, but then there's some tiny
risk that the counter could wrap all the way around while the reader
is blocked.
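
A minimal sketch of the protocol (illustration only, not the committed code;
the struct and field names are invented, and memory-ordering details are
glossed over):

/* Writer: bump the counter to odd before modifying, even after. */
typedef struct
{
    volatile int changecount;   /* odd while an update is in progress */
    int          some_field;    /* stands in for the fixed-size status data */
} SharedStatusEntry;

static void
writer_update(SharedStatusEntry *st, int newval)
{
    st->changecount++;          /* now odd: update in progress */
    st->some_field = newval;
    st->changecount++;          /* now even: update complete */
}

/* Reader: retry until the counter is even and unchanged across the copy. */
static int
reader_read(SharedStatusEntry *st)
{
    int before, after, val;

    do
    {
        before = st->changecount;
        val = st->some_field;
        after = st->changecount;
    } while (before != after || (before & 1));

    return val;
}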

regards, tom lane



Re: [HACKERS] Rethinking stats communication mechanisms

2006-06-18 Thread Bort, Paul
> > BTW, I think the writer would actually need to bump the counter twice,
> > once before and once after it modifies its stats area.  Else there's
> > no way to detect that you've copied a partially-updated stats entry.
>
> Actually, neither of these ideas works: it's possible that the reader
> copies the entry between the two increments of the counter.  Then, it
> won't see any reason to re-read, but nonetheless it has copied an
> inconsistent partially-modified entry.
>
> Anyone know a variant of this that really works?

Here's a theory: If the counter is bumped to an odd number before
modification, and an even number after it's done, then the reader will
know it needs to re-read if the counter is an odd number. 

This might be assuming too much about what the writer knows about the
current contents of the counter, but since it's per-backend, I think it
would work. 

Regards,
Paul Bort



Re: [HACKERS] regresssion script hole

2006-06-18 Thread Michael Fuhr
On Sun, Jun 18, 2006 at 07:18:07PM -0600, Michael Fuhr wrote:
> Maybe I'm misreading the packet, but I think the query is for
> ''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc)

Correction: ''.kaltenbrunner.cc

-- 
Michael Fuhr



Re: [HACKERS] regresssion script hole

2006-06-18 Thread Michael Fuhr
On Sun, Jun 18, 2006 at 07:50:04PM -0400, Tom Lane wrote:
>  24583 postgres CALL  
> recvfrom(0x3,0x477e4000,0x1,0,0xfffe4da0,0xfffe4d5c)
>  24583 postgres GIO   fd 3 read 37 bytes
>"\M-Sr\M^A\M^B\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"
>  24583 postgres RET   recvfrom 37/0x25
>  24583 postgres CALL  close(0x3)
>  24583 postgres RET   close 0
> 
> I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
> bet this is it.  I think it's trying to look up "kaltenbrunner.cc" and
> failing.

Maybe I'm misreading the packet, but I think the query is for
''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc)
and the DNS server is responding with SERVFAIL.

-- 
Michael Fuhr



Re: [HACKERS] regresssion script hole

2006-06-18 Thread Andrew Dunstan



Tom Lane wrote:

> Anyway, the tail end of the trace
> shows it repeatedly sending off a UDP packet and getting practically the
> same data back:
>
> I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
> bet this is it.  I think it's trying to look up "kaltenbrunner.cc" and
> failing.

Why are we actually looking up anything? Just so we can bind to a 
listening socket?


Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to 
direct it to use files first, something like


 lookup file bind

Stefan, can you look into that? It would be a bit ugly if it's calling 
DNS (and failing) to resolve localhost.


cheers

andrew



Re: [HACKERS] regresssion script hole

2006-06-18 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> One idea that comes to mind is a DNS lookup timeout.  Can you strace the
>> postmaster to see what it's doing?

> There is ktrace output I managed to capture at 
> http://developer.postgresql.org/~adunstan/ktrace.txt
> Not sure what it tells us. I do see it reading in the whole timezone db, 
> which isn't pretty.

Yeah... I think we fixed that in 8.1.  Anyway, the tail end of the trace
shows it repeatedly sending off a UDP packet and getting practically the
same data back:

 24583 postgres CALL  socket(0x2,0x2,0)
 24583 postgres RET   socket 3
 24583 postgres CALL  sendto(0x3,0x43e1e000,0x25,0,0x493a6338,0x10)
 24583 postgres GIO   fd 3 wrote 37 bytes
   "\M-Sr\^A\0\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"
 24583 postgres RET   sendto 37/0x25
 24583 postgres CALL  getpid()
 24583 postgres RET   getpid 24583/0x6007
 24583 postgres CALL  select(0x4,0x40739600,0,0,0xfffe4d90)
 24583 postgres RET   select 1
 24583 postgres CALL  
recvfrom(0x3,0x477e4000,0x1,0,0xfffe4da0,0xfffe4d5c)
 24583 postgres GIO   fd 3 read 37 bytes
   "\M-Sr\M^A\M^B\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"
 24583 postgres RET   recvfrom 37/0x25
 24583 postgres CALL  close(0x3)
 24583 postgres RET   close 0

I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
bet this is it.  I think it's trying to look up "kaltenbrunner.cc" and
failing.

regards, tom lane



Re: [HACKERS] regresssion script hole

2006-06-18 Thread Andrew Dunstan



Tom Lane wrote:

> Andrew Dunstan <[EMAIL PROTECTED]> writes:
>> We need both, I think. I am still trying to find out why it's taking so
>> long. This is on the 8.0 branch, though. Later branches seem to be working.
>
> One idea that comes to mind is a DNS lookup timeout.  Can you strace the
> postmaster to see what it's doing?

There is ktrace output I managed to capture at
http://developer.postgresql.org/~adunstan/ktrace.txt

Not sure what it tells us. I do see it reading in the whole timezone db,
which isn't pretty.


cheers

andrew



Re: [HACKERS] regresssion script hole

2006-06-18 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> We need both, I think. I am still trying to find out why it's taking so 
> long. This is on the 8.0 branch, though. Later branches seem to be working.

One idea that comes to mind is a DNS lookup timeout.  Can you strace the
postmaster to see what it's doing?

regards, tom lane



Re: [HACKERS] regresssion script hole

2006-06-18 Thread Andrew Dunstan



Tom Lane wrote:

> Andrew Dunstan <[EMAIL PROTECTED]> writes:
>> The problem is that if the postmaster takes more than 60 seconds to
>> start listening (as is apparently happening on spoonbill - don't yet
>> know why) this code falls through.
>
> If the postmaster takes that long to start listening, I'd say we need to
> fix the postmaster not pg_regress.

We need both, I think. I am still trying to find out why it's taking so
long. This is on the 8.0 branch, though. Later branches seem to be working.

>> I'm inclined to run the psql test one more time to make sure we can
>> actually connect, and if not then fail at this point.
>
> How does that differ from just iterating the loop one more time?

There is no guarantee that at the end of the loop we have connected
successfully to postgres.


I will post a patch that shows what I suggest.

cheers

andrew



Re: [HACKERS] regresssion script hole

2006-06-18 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> The problem is that if the postmaster takes more than 60 seconds to 
> start listening (as is apparently happening on spoonbill - don't yet 
> know why) this code falls through.

If the postmaster takes that long to start listening, I'd say we need to
fix the postmaster not pg_regress.

> I'm inclined to run the psql test one more time to make sure we can 
> actually connect, and if not then fail at this point.

How does that differ from just iterating the loop one more time?

regards, tom lane



Re: [HACKERS] MultiXacts & WAL

2006-06-18 Thread paolo romano
> There is no "regular shared locks" in postgres in that sense. Shared locks
> are only used for maintaining FK integrity. Or by manually issuing a
> SELECT FOR SHARE, but that's also for maintaining integrity. MVCC
> rules take care of the "plain reads". If you're not familiar with MVCC,
> it's explained in chapter 12 of the manual.
>
> The source code in heapam.c also mentions Point In Time Recovery to
> require logging the locks, though I'm not sure why.

Thanks for your explanations; now I can see what was confusing me.

> The problem with that is that we don't know beforehand if a transaction
> is a distributed one or not.
>
> Feel free to write a benchmark to see how much difference the logging
> makes! If it's significant, I'm sure we can figure out ways to improve it.

Now that I finally see that multixacts are due only to explicit shared lock
requests or to FKs, I tend to agree with Tom's original doubts about the
actual impact of the multixact-related logging. Of course, in practice such
an impact would vary from application to application, so it may still make
sense for some classes of workloads to avoid multixact logging, assuming
they contain no distributed transactions and we find a hack to know
beforehand whether a transaction is distributed or not... BTW, if I manage
to find some free time to do some performance tests, I'll be sure to let
you know!

Thanks again,
    Paolo

[HACKERS] regresssion script hole

2006-06-18 Thread Andrew Dunstan


While investigating some problems with buildfarm member spoonbill I came 
across this piece of code in pg_regress.sh, which seems less than robust:



    # Wait till postmaster is able to accept connections (normally only
    # a second or so, but Cygwin is reportedly *much* slower).  Don't
    # wait forever, however.
    i=0
    max=60
    until "$bindir/psql" -X $psql_options postgres </dev/null 2>/dev/null
    do
        i=`expr $i + 1`
        if [ $i -ge $max ]
        then
            break
        fi
        if kill -0 $postmaster_pid >/dev/null 2>&1
        then
            : still starting up
        else
            break
        fi
        sleep 1
    done

    if kill -0 $postmaster_pid >/dev/null 2>&1
    then
        echo "running on port $PGPORT with pid $postmaster_pid"
    else
        echo
        echo "$me: postmaster did not start"
        echo "Examine $LOGDIR/postmaster.log for the reason."
        echo
        (exit 2); exit
    fi

The problem is that if the postmaster takes more than 60 seconds to 
start listening (as is apparently happening on spoonbill - don't yet 
know why) this code falls through.


I'm inclined to run the psql test one more time to make sure we can 
actually connect, and if not then fail at this point. I wouldn't bother 
but it did confuse the heck out of both Stefan and me when createlang 
failed.


Thoughts?

cheers

andrew



Re: [HACKERS] Rethinking stats communication mechanisms

2006-06-18 Thread Tom Lane
I wrote:
> PFC <[EMAIL PROTECTED]> writes:
>> So, the proposal :
>> On executing a command, Backend stores the command string, then  
>> overwrites the counter with (counter + 1) and with the timestamp of  
>> command start.
>> Periodically, like every N seconds, a separate process reads the counter,  
>> then reads the data, then reads the counter again.

> BTW, I think the writer would actually need to bump the counter twice,
> once before and once after it modifies its stats area.  Else there's
> no way to detect that you've copied a partially-updated stats entry.

Actually, neither of these ideas works: it's possible that the reader
copies the entry between the two increments of the counter.  Then, it
won't see any reason to re-read, but nonetheless it has copied an
inconsistent partially-modified entry.

Anyone know a variant of this that really works?

regards, tom lane



Re: [HACKERS] Rethinking stats communication mechanisms

2006-06-18 Thread Hannu Krosing
On one fine day, Sunday 2006-06-18 at 15:09, Tom Lane wrote:
> "Magnus Hagander" <[EMAIL PROTECTED]> writes:
> > Might it not be a win to also store "per backend global values" in the
> > shared memory segment? Things like "time of last command", "number of
> > transactions executed in this backend", "backend start time" and other
> > values that are fixed-size?

One thing that is doable in constant-size memory and would be enormously
useful for us is counting top-level function calls and storing their
total (and possibly also max) duration.

The reason being that our production databases are accessed by clients
using functions only (with some uninteresting exceptions, of course),
that is, calls of the form "SELECT x,y,z FROM myfunc(i,j,k)".

So reserving N*1.5 slots (N being the number of functions defined at
database startup) would be ok. If more than N*0.5 additional functions
are defined during the database's lifetime, then the rest are simply
ignored (not counted).

Or maybe a better approach would be to have a conf variable
"number-of-functions-to-track" and a special boolean flag track_me in
pg_functions. That way you don't accidentally run out of shared memory
by defining lots of new functions and then restarting the cluster.
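
A rough sketch of the kind of fixed-size structure this implies (an
illustration only, not a patch; the names and the 1.5x sizing are just the
idea above):

#include <stdint.h>

typedef struct FunctionCallStats
{
    unsigned int func_oid;     /* OID of the tracked function, 0 = free slot */
    uint64_t     calls;        /* number of top-level calls */
    uint64_t     total_usec;   /* total elapsed time, in microseconds */
    uint64_t     max_usec;     /* slowest single call, in microseconds */
} FunctionCallStats;

/* Allocated once in shared memory at cluster start, with
 * n_slots = 1.5 * (number of functions defined at that time).
 * Functions created later that don't fit are simply not counted. */
static FunctionCallStats *funcStats;
static int n_slots;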


-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] Rethinking stats communication mechanisms

2006-06-18 Thread Tom Lane
"Magnus Hagander" <[EMAIL PROTECTED]> writes:
> Might it not be a win to also store "per backend global values" in the
> shared memory segment? Things like "time of last command", "number of
> transactions executed in this backend", "backend start time" and other
> values that are fixed-size?

I'm including backend start time, command start time, etc under the
heading of "current status" which'll be in the shared memory.  However,
I don't believe in trying to count events (like transaction commits)
that way.  If we do then we risk losing events whenever a backend quits
and is replaced by another.

I haven't yet looked through the stats in detail, but this approach
basically presumes that we are only going to count events per-table and
per-database --- I am thinking that the background stats collector
process won't even keep track of individual backends anymore.  (So,
we'll fix the old problem of loss of backend-exit messages resulting
in bogus displays.)

regards, tom lane



Re: [HACKERS] Rethinking stats communication mechanisms

2006-06-18 Thread Magnus Hagander
> The existing stats collection mechanism seems OK for event 
> counts, although I'd propose two changes: one, get rid of the 
> separate buffer process, and two, find a way to emit event 
> reports in a time-driven way rather than once per transaction 
> commit.  I'm a bit vague about how to do the latter at the moment.

Might it not be a win to also store "per backend global values" in the
shared memory segment? Things like "time of last command", "number of
transactions executed in this backend", "backend start time" and other
values that are fixed-size?

You can obviously not do it for things like per-table values, since the
size can't be predicted, but all per-backend counters that are fixed
size should be able to do this, I think. And if it's just a counter, it
should be reasonably safe to just do the increment operation without
locking, since there's only one writer for each process. That should
have a much lower overhead than UDP or whatever to the stats process,
no?
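
As a rough illustration of the fixed-size, single-writer idea (a sketch
only; the struct and field names here are made up, not a concrete proposal):

#include <stdint.h>
#include <time.h>

/* One slot per backend in shared memory; only the owning backend writes it. */
typedef struct BackendCounters
{
    time_t   backend_start_time;
    time_t   last_command_time;
    uint64_t xact_count;        /* transactions executed in this backend */
} BackendCounters;

static void
count_transaction(BackendCounters *mine)
{
    /* Single writer per slot, so an unlocked increment suffices; readers
     * may see a slightly stale value, which is fine for statistics. */
    mine->xact_count++;
}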

It might be worthwhile to add a section for things like the bgwriter (and
possibly the archiver?) to deliver statistics that we can expose in
statistics views. (They obviously can't use a standard backend "struct" for
this since they'd have completely different values to report.)

//Magnus



Re: [HACKERS] MultiXacts & WAL

2006-06-18 Thread Heikki Linnakangas

On Sun, 18 Jun 2006, paolo romano wrote:

> Anyway, again in theory, if one wanted to minimize logging overhead for
> shared locks, one might adopt a different treatment for (i) regular
> shared locks (i.e. locks due to plain reads not requiring durability in
> case of 2PC) and (ii) shared locks held because some SQL command is
> referencing a tuple via a FK, which have to be persisted until the 2-nd
> 2PC phase (There is no any other scenario in which you *must* persist
> shared locks, is there?)


There are no "regular shared locks" in postgres in that sense. Shared locks
are only used for maintaining FK integrity. Or by manually issuing a 
SELECT FOR SHARE, but that's also for maintaining integrity. MVCC 
rules take care of the "plain reads". If you're not familiar with MVCC, 
it's explained in chapter 12 of the manual.


The source code in heapam.c also mentions Point In Time Recovery to 
require logging the locks, though I'm not sure why.


> By the way, postgresql is detailedly logging *every* single shared
> lock, even though this is actually needed only if (i) the transaction
> turns out to be a distributed one (i.e. prepare is issued on that
> transactions), AND (ii) the shared lock is due to ensure validity of a
> FK. AFAICS, in most practical workloads (i) local transactions dominate
> distributed ones and (ii) shared locks due to plain reads dominate locks
> due to FK, so the current implementaion does not seem to be optimizing
> the most frequent scenario.


The problem with that is that we don't know beforehand if a transaction is
a distributed one or not.


Feel free to write a benchmark to see how much difference the logging 
makes! If it's significant, I'm sure we can figure out ways to improve it.


- Heikki



Re: [HACKERS] Segfault Exiting psql

2006-06-18 Thread Tom Lane
Jason Essington <[EMAIL PROTECTED]> writes:
> Has there been any movement on this? as of 8.1.2 psql still whines on  
> OS X tiger when you exit.
> I realize it is not significant, but I'd still rather not see it.

I've committed that fix into CVS HEAD.

regards, tom lane



Re: [HACKERS] Rethinking stats communication mechanisms

2006-06-18 Thread Greg Stark

Douglas McNaught <[EMAIL PROTECTED]> writes:

> (a) and (b): of course you would only do it on a temporary basis for
> problem diagnosis.

Temporary or not it isn't really an option when you're dealing with high
volumes. You could imagine a setup where say, 1% of page requests randomly
turn on debugging to get a random sample of database traffic. There are always
solutions. 

But my point is that there's no reason to think that only long queries are
useful for snapshots. Short queries are equally capable of consuming
resources in aggregate. Hiding some subset of queries on the assumption that
they're uninteresting will only produce deceptive results.

> Ideally, you'd find the query storm problem in load testing before you
> ever got to production.  I hope to someday visit that planet--it must
> be nice.

Indeed, when you get there send me a postcard :)

-- 
greg




Re: [HACKERS] union all bug?

2006-06-18 Thread Tom Lane
"Gurjeet Singh" <[EMAIL PROTECTED]> writes:
> Probably this explains the ERROR for the last query... The ORDER BY
> and LIMIT clauses are expected to end a query (except for subqueries,
> of course), and hence the keyword UNION is not expected after the
> LIMIT clause...

Yeah.  In theory that's unambiguous, but IIRC we couldn't persuade
bison of the fact, so you have to add parens.

regards, tom lane



Re: [HACKERS] union all bug?

2006-06-18 Thread Gurjeet Singh

Probably this explains the ERROR for the last query... The ORDER BY
and LIMIT clauses are expected to end a query (except for subqueries,
of course), and hence the keyword UNION is not expected after the
LIMIT clause...

On 6/18/06, Tom Lane <[EMAIL PROTECTED]> wrote:

> Joe Conway <[EMAIL PROTECTED]> writes:
>> I was trying to work around limitations with "partitioning" of tables
>> using constraint exclusion, when I ran across this little oddity:
>
> I think you're under a misimpression about the syntax behavior of ORDER
> BY and UNION.  Per spec, ORDER BY binds less tightly than UNION, thus
>
> select foo union select bar order by x
>
> means
>
> (select foo union select bar) order by x
>
> If you want to apply ORDER BY to either arm of a union individually,
> you need parentheses, eg
>
> (select foo order by x) union (select bar order by x)
>
> (Note that this construct fails to guarantee that the output of the
> union will be sorted by x!)  LIMIT is not in the spec but we treat
> it like ORDER BY for this purpose.


To guarantee the ordering, one can use

select * from (select foo from A union select bar from B) as ss order by x



Re: [HACKERS] union all bug?

2006-06-18 Thread Tom Lane
Joe Conway <[EMAIL PROTECTED]> writes:
> I was trying to work around limitations with "partitioning" of tables 
> using constraint exclusion, when I ran across this little oddity:

I think you're under a misimpression about the syntax behavior of ORDER
BY and UNION.  Per spec, ORDER BY binds less tightly than UNION, thus

select foo union select bar order by x

means

(select foo union select bar) order by x

If you want to apply ORDER BY to either arm of a union individually,
you need parentheses, eg

(select foo order by x) union (select bar order by x)

(Note that this construct fails to guarantee that the output of the
union will be sorted by x!)  LIMIT is not in the spec but we treat
it like ORDER BY for this purpose.

regards, tom lane



[HACKERS] union all bug?

2006-06-18 Thread Joe Conway
I was trying to work around limitations with "partitioning" of tables 
using constraint exclusion, when I ran across this little oddity:


-- works
test=# select * from (select time from url_access_2006_06_07 order by 1 
limit 2) as ss1;

        time
---------------------
 2006-06-07 15:07:41
 2006-06-07 15:07:41
(2 rows)

-- works
test=# select time from url_access_2006_06_08 order by 1 limit 2;
        time
---------------------
 2006-06-08 15:07:41
 2006-06-08 15:07:41
(2 rows)

-- huh ?!?
test=# select * from (select time from url_access_2006_06_07 order by 1 
limit 2) as ss1 union all select time from url_access_2006_06_08 order 
by 1 limit 2;

        time
---------------------
 2006-06-07 15:07:41
 2006-06-07 15:07:41
(2 rows)

-- works
test=# select * from (select time from url_access_2006_06_07 order by 1 
limit 2) as ss1 union all select * from (select time from 
url_access_2006_06_08 order by 1 limit 2) as ss2;

        time
---------------------
 2006-06-07 15:07:41
 2006-06-07 15:07:41
 2006-06-08 15:07:41
 2006-06-08 15:07:41
(4 rows)


I get an error if I try to eliminate the first FROM clause subselect:

test=# select time from url_access_2006_06_07 order by 1 limit 2 union 
all select * from (select time from url_access_2006_06_08 order by 1 
limit 2) as ss2;

ERROR:  syntax error at or near "all" at character 65
LINE 1: ...om url_access_2006_06_07 order by 1 limit 2 union all select...

So I'm wondering whether the second FROM clause subselect is really 
required, but not getting enforced as it should?


Joe



[HACKERS] Slightly bogus regression test for contrib/dblink

2006-06-18 Thread Tom Lane
Lines 509-512 of contrib/dblink/expected/dblink.out read:

-- this should fail because there is no open transaction
SELECT dblink_exec('myconn','DECLARE xact_test CURSOR FOR SELECT * FROM foo');
ERROR:  sql error
DETAIL:  ERROR:  cursor "xact_test" already exists

The error message is not consistent with what the comment claims.
Looking at the test case in total, I think the code is responding
correctly given the rules stated in your last commit message:

2005-10-17 22:55  joe

* contrib/dblink/: dblink.c, expected/dblink.out, sql/dblink.sql:
When a cursor is opened using dblink_open, only start a transaction
if there isn't one already open. Upon dblink_close, only commit the
open transaction if it was started by dblink_open, and only then
when all cursors opened by dblink_open are closed. The transaction
accounting is done individually for all named connections, plus the
persistent unnamed connection.

However, this comment is wrong and so is the preceding one, and I think
maybe you want to alter the test case so it does actually exercise
closing the transaction completely.

BTW, I was led to notice this while examining the current buildfarm
failure report from osprey,
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=osprey&dt=2006-06-17%2004:00:16
It looks to me like the diffs are consistent with the idea that the test
is using a copy of dblink that predates this patch ... do you agree?
If so, anyone have an idea how that could happen?  I thought we'd fixed
all the rpath problems, and anyway osprey wasn't failing like this
before today.

regards, tom lane



Re: [HACKERS] MultiXacts & WAL

2006-06-18 Thread Tom Lane
paolo romano <[EMAIL PROTECTED]> writes:
> Anyway, again in theory, if one wanted to minimize logging overhead for 
> shared locks, one might adopt a different treatment for (i) regular shared 
> locks (i.e. locks due to plain reads not requiring durability in case of 2PC) 
>  and (ii) shared locks held because some SQL command is referencing a tuple 
> via a FK, which have to be persisted until the 2-nd 2PC phase (There is no 
> any other scenario in which you *must* persist shared locks, is there?)

I can't see any basis at all for asserting that you don't need to
persist particular types of locks.  In the current system, a multixact
lock might arise from either FK locking, or a user-issued SELECT FOR SHARE.
In either case it's possible that the lock was taken to guarantee the
integrity of a data change made somewhere else.  So we can't release it
before commit.

regards, tom lane



Re: [HACKERS] Unable to initdb using HEAD on Windows XP

2006-06-18 Thread Thomas Hallgren

Never mind. I scrubbed my folders and obtained a new fresh copy from CVS. Now 
it works.

Regards,
Thomas Hallgren




Re: [HACKERS] MultiXacts & WAL

2006-06-18 Thread paolo romano
> No, it's not safe to release them until 2nd phase commit.
>
> Imagine table foo and table bar. Table bar has a foreign key reference
> to foo.
>
> 1. Transaction A inserts a row to bar, referencing row R in foo. This
>    acquires a shared lock on R.
> 2. Transaction A precommits, releasing the lock.
> 3. Transaction B deletes R. The new row inserted by A is not visible to
>    B, so the delete succeeds.
> 4. Transaction A and B commit. Oops, the new row in bar references R
>    that doesn't exist anymore.
>
> Holding the lock until the true end of transaction, the 2nd phase of
> commit, blocks B from deleting R.
>
> - Heikki

Heikki, thanks for the clarifications. I was not considering the additional
issues arising with referential integrity constraints... in fact I was
citing a known result from theory books on 2PC, which did not consider FKs.
But as usual, in theory things always look much simpler than in practice!

Anyway, again in theory, if one wanted to minimize logging overhead for
shared locks, one might adopt a different treatment for (i) regular shared
locks (i.e. locks due to plain reads, not requiring durability in case of
2PC) and (ii) shared locks held because some SQL command is referencing a
tuple via a FK, which have to be persisted until the second 2PC phase.
(There is no other scenario in which you *must* persist shared locks, is
there?) Of course, in practice distinguishing the two situations may not be
so simple, and it still has to be shown whether such an optimization is
really worthwhile...

By the way, postgresql logs *every* single shared lock in detail, even
though this is actually needed only if (i) the transaction turns out to be
a distributed one (i.e. prepare is issued on that transaction), AND (ii)
the shared lock is needed to ensure the validity of a FK. AFAICS, in most
practical workloads (i) local transactions dominate distributed ones and
(ii) shared locks due to plain reads dominate locks due to FKs, so the
current implementation does not seem to optimize the most frequent
scenario.

regards,
  paolo

Re: [HACKERS] Unable to initdb using HEAD on Windows XP

2006-06-18 Thread Thomas Hallgren

Some more info. If I manually create the data directory first, the output is 
different:

  C:\Tada\Workspace>mkdir data

  C:\Tada\Workspace>initdb -D data
  The files belonging to this database system will be owned by user "postgres".
  This user must also own the server process.

  The database cluster will be initialized with locale English_United 
States.1252.


  fixing permissions on existing directory data ... ok
  creating subdirectories ... ok
  selecting default max_connections ... 100
  selecting default shared_buffers/max_fsm_pages ... 4000/20
  creating configuration files ... ok
  creating template1 database in data/base/1 ... ok
  initializing pg_authid ... child process was terminated by signal 5
  initdb: removing contents of data directory "data"


Thomas Hallgren wrote:
I just compiled a fresh copy from CVS head. I then tried to do an initdb 
as user 'postgres' (non admin user on my system). I get the following 
error:


 C:\Tada\Workspace>initdb -D data
 The files belonging to this database system will be owned by user 
"postgres".

 This user must also own the server process.

 The database cluster will be initialized with locale English_United 
States.1252.



 creating directory data ... ok
 creating subdirectories ... initdb: could not create directory 
"data/global": Permission denied

 initdb: removing data directory "data"
 could not open directory "data": No such file or directory
 initdb: failed to remove data directory

AFAICS, no data directory is ever created so the 'creating directory 
data ... ok' message is probably incorrect. I even tried to change the 
permissions on the parent directory so that user 'postgres' has full 
control. It doesn't help. I didn't think it would since I am able to 
create a database in this directory if I'm using version 8.1.4.


Regards,
Thomas Hallgren









[HACKERS] Unable to initdb using HEAD on Windows XP

2006-06-18 Thread Thomas Hallgren
I just compiled a fresh copy from CVS head. I then tried to do an initdb 
as user 'postgres' (non admin user on my system). I get the following error:


 C:\Tada\Workspace>initdb -D data
 The files belonging to this database system will be owned by user 
"postgres".

 This user must also own the server process.

 The database cluster will be initialized with locale English_United 
States.1252.



 creating directory data ... ok
 creating subdirectories ... initdb: could not create directory 
"data/global": Permission denied

 initdb: removing data directory "data"
 could not open directory "data": No such file or directory
 initdb: failed to remove data directory

AFAICS, no data directory is ever created so the 'creating directory 
data ... ok' message is probably incorrect. I even tried to change the 
permissions on the parent directory so that user 'postgres' has full 
control. It doesn't help. I didn't think it would since I am able to 
create a database in this directory if I'm using version 8.1.4.


Regards,
Thomas Hallgren





Re: [HACKERS] Rethinking stats communication mechanisms

2006-06-18 Thread Douglas McNaught
Greg Stark <[EMAIL PROTECTED]> writes:

> Douglas McNaught <[EMAIL PROTECTED]> writes:
>
>> Yeah, but if you turn on query logging in that case you'll see the
>> bajillions of short queries, so you don't need the accurate snapshot
>> to diagnose that.
>
> Query logging on a production OLTP machine? a) that would be a huge
> performance drain on the production system b) it would produce so much logging
> that it would take a significant amount of disk and i/o resources just to
> handle and c) you would need to implement special purpose tools just to make
> sense of these huge logs.

(a) and (b): of course you would only do it on a temporary basis for
problem diagnosis.  We do that with our production apps where I work
(when absolutely necessary).

(c): Perl.  :)

Ideally, you'd find the query storm problem in load testing before you
ever got to production.  I hope to someday visit that planet--it must
be nice.

-Doug



Re: [HACKERS] [PERFORM] Sun Donated a Sun Fire T2000 to the PostgreSQL

2006-06-18 Thread Arjen van der Meijden

On 17-6-2006 1:24, Josh Berkus wrote:

> Arjen,
>
>> I can already confirm very good scalability (with our workload) on
>> postgresql on that machine. We've been testing a 32thread/16G-version
>> and it shows near-linear scaling when enabling 1, 2, 4, 6 and 8 cores
>> (with all four threads enabled).
>
> Keen.   We're trying to keep the linear scaling going up to 32 cores of
> course (which doesn't happen, presently).  Would you be interested in
> helping us troubleshoot some of the performance issues?


You can ask your questions; if I happen to know the answer, you're a
step further in the right direction.


But actually, I didn't do much to get this scalability... so I won't be
of much help to you; it's not that I spent hours on getting this performance.
I just started out with the "normal" attempts to get a good config.
Currently shared_buffers is set to 30k. Larger settings didn't seem
to make much difference on our previous 4-core version, so I didn't even
check it out on this one. I noticed I forgot to set the effective cache
size to more than 6G for this one too, but since our database is smaller
than that, that shouldn't make any difference. The work memory was
increased a bit to 2K. So there are no magic tricks here.


I do have to add its a recent checkout of 8.2devel compiled using Sun 
Studio 11. It was compiled using this as CPPFLAGS: -xtarget=ultraT1 
-fast -xnolibmopt


The -xnolibmopt was added because we couldn't figure out why it yielded 
several linking errors at the end of the compilation when the -xlibmopt 
from -fast was enabled, so we disabled that particular setting from the 
-fast macro.



The workload generated is an abstraction and simplification of our
website's workload, used for benchmarking. It's basically a news and
price comparison site and it runs on LAMP (with the M of MySQL), i.e. a
lot of light queries, many primary-key or indexed "foreign-key" lookups
for small numbers of records, some aggregations for summaries, etc.
There are few writes and hardly any on the most-read tables.
The database easily fits in memory; the total size of the actively read
tables is about 3G.
This PostgreSQL version is not a direct copy of the queries and tables,
but I made an effort to make it as PostgreSQL-minded as possible:
I combined a few queries, I changed "boolean" enums in
MySQL to real booleans in Postgres, I added specific indexes (including
partials), etc.


We use apache+php as clients and just open X apache processes using 'ab' 
at the same time to generate various amounts of concurrent workloads. 
Solaris scales really well to higher concurrencies and PostgreSQL 
doesn't seem to have problems with it either in our workload.


So it's not really a real-life scenario, but it's not a synthetic
benchmark either.


Here is a graph of our performance measured on PostgreSQL:
http://achelois.tweakers.net/~acm/pgsql-t2000/T2000-schaling-postgresql.png

What you see are three lines. Each represents the total number of "page
views" processed in 600 seconds for a specific number of Niagara cores
(i.e. 1, 2, 4, 6 and 8). Each core had all its threads enabled, so it's
actually 4, 8, 16, 24 and 32 virtual CPUs you're looking at.
The "Max" line displays the maximum generated "page views" on a specific
core count for any concurrency, respectively: 5, 13, 35, 45 and 60.
The "Bij 50" line is the number of "page views" generated with 50
apache processes working at the same time (on two dual-Xeon machines, so
25 each). I took 50 a bit arbitrarily, but all core configs seemed to do
pretty well under that workload.


The "perfect" line is based on the "Max" value for 1 core and then just
multiplied by the number of cores to give a linear reference. The "Bij
50" and the "perfect" lines don't differ too much in color, but the
top one is the "perfect" line.


In the near future we'll be presenting an article on this on our
website; although that will be in Dutch, the graphs should still be easy
to read for you guys.
And because of that I can't promise too much detailed information until
then.


I hope I clarified things a bit now, if not ask me about it,
Best regards,

Arjen



Re: [HACKERS] Rethinking stats communication mechanisms

2006-06-18 Thread Greg Stark

Douglas McNaught <[EMAIL PROTECTED]> writes:

> Yeah, but if you turn on query logging in that case you'll see the
> bajillions of short queries, so you don't need the accurate snapshot
> to diagnose that.

Query logging on a production OLTP machine? a) that would be a huge
performance drain on the production system b) it would produce so much logging
that it would take a significant amount of disk and i/o resources just to
handle and c) you would need to implement special purpose tools just to make
sense of these huge logs.


-- 
greg

