Re: [HACKERS] [PATCHES] plpgsql, return can contains any expression

2007-02-09 Thread Pavel Stehule




OK, where are we on this patch?


No changes. This task needs to be done by somebody who knows the PostgreSQL
cache system better than I do.


Regards
Pavel
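
For readers joining the thread, a hedged, hypothetical sketch of the kind of RETURN the subject line refers to; before this patch, plpgsql's RETURN in a composite-returning function generally had to name a row or record variable rather than accept an arbitrary expression:

 CREATE TYPE pair AS (a integer, b integer);

 CREATE OR REPLACE FUNCTION make_pair(x integer) RETURNS pair AS $$
 BEGIN
     -- an arbitrary row expression instead of a pre-declared row variable
     RETURN (x, x + 1);
 END;
 $$ LANGUAGE plpgsql;

 SELECT * FROM make_pair(41);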


---

Pavel Stehule wrote:


 
 Pavel Stehule [EMAIL PROTECTED] writes:
   This patch doesn't seem to cope with cases where the supplied tuple
   has the wrong number of columns, and it doesn't look like it's being
   careful about dropped columns either.  Also, that's a mighty
   bizarre-looking choice of cache memory context in coerce_to_tuple ...
   but then again, why are you bothering with a cache at all for
   temporary arrays?

   I am sorry, Tom. But I don't understand. I can check the number of
   columns, of course, and I'll do it. What cache for temporary arrays
   do you mean?

 
 I thought that making coerce_to_tuple depend on estate->err_func was
 pretty bizarre, and that there was no need for any cache at all for
 arrays that need only live as long as the function runs.  All you are
 saving here is a palloc/pfree cycle, which is not worth the
 obscurantism and risk of bugs (are you sure natts can never change?).

 No, the cache there is ugly. But I don't have an idea for a more
 effective implementation of it :-(. The first version of this patch was
 cleaner, and a little bit slower. This cache saves 10%.

 
 BTW, if you want this patch to make it into 8.2, it needs to be fixed
 and resubmitted *very* soon.

 I understand, but I am not able to work on it in the next four days. And
 I need help with it from Neil. It will be for 8.3.

 Thank you
 Pavel




--
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +







Re: [HACKERS] [PATCHES] [pgsql-patches] Phantom Command IDs,updated patch

2007-02-09 Thread Heikki Linnakangas

Tom Lane wrote:

I'm testing the patch currently.  I was a bit surprised to find the
without_oid test failing, but it makes sense because I'm using a
MAXALIGN=8 machine.  I suppose Heikki tested on MAXALIGN=4.


That's right.

Thanks for the review!

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [PATCHES] [pgsql-patches] Phantom Command IDs, updated patch

2007-02-09 Thread Zeugswetter Andreas ADI SD

 As for what I think we *should* do near-term, I'm pretty strongly
 tempted to suggest that we just throw an error if a subtransaction
tries
 to upgrade an upper transaction's shared lock to exclusive.

So when an RI check locks a parent, you would not be able to update the
parent in a later subtransaction. I can imagine that the error would be a
problem in a SELECT FOR UPDATE loop, because there you usually want to
update the row.

Andreas
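
For illustration, a hedged sketch (hypothetical tables) of the scenario above: the outer transaction's RI check takes a shared lock on the parent row, and a later subtransaction then needs an exclusive lock for an UPDATE, which is exactly the upgrade the proposal would reject with an error.

 BEGIN;
 INSERT INTO child (parent_id) VALUES (1);        -- FK check locks the parent row FOR SHARE
 SAVEPOINT s1;
 UPDATE parent SET note = 'touched' WHERE id = 1; -- needs an exclusive row lock: the upgrade
 ROLLBACK TO s1;
 COMMIT;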



Re: [HACKERS] [COMMITTERS] pgsql: Add lock matrix to documentation.

2007-02-09 Thread Peter Eisentraut
On Thursday, 8 February 2007 at 16:32, Bruce Momjian wrote:
 Log Message:
 ---
 Add lock matrix to documentation.

This needs some revisions.  The table needs to be mentioned somewhere in the 
text, so the reader knows when or why to refer to it.  Also, the cryptic 
abbreviations need to be expanded or explained.  And then the concept of 
lock compatibility, as the table puts it, is not used anywhere else in the 
documentation.  The table should be put in terms of conflicts instead.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/
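
As illustration of the "conflicts" framing, a minimal sketch (hypothetical table name) of two lock modes the matrix marks as conflicting; SHARE and ROW EXCLUSIVE cannot be held on the same table at the same time:

 -- session 1
 BEGIN;
 LOCK TABLE accounts IN SHARE MODE;

 -- session 2 (blocks until session 1 commits or rolls back)
 BEGIN;
 LOCK TABLE accounts IN ROW EXCLUSIVE MODE;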



Re: [HACKERS] [PATCHES] How can I use 2GB of shared buffers on Windows?

2007-02-09 Thread Takayuki Tsunakawa
From: Magnus Hagander [EMAIL PROTECTED]
 Right. Which is why you're likely to see better performance if you keep
 shared buffers smaller. There is something in dealing with it that's
 slow on win32, per reports from the field. It needs to be investigated
 further...
 We've had reports that it's slow with large shared_buffers, yes.

That's shocking news.  I'm sad.
I wonder whether the users in the field you are talking about set Windows
to use more memory for programs than for the filesystem cache, which is
selectable from the [System] applet of Control Panel (Oh, I wonder how my
machine is set in this respect... have to check.)  If the filesystem cache
is preferred, the following scenario may be possible:

1. PostgreSQL tries to read data from disk into database cache.
2. The kernel tries to allocate filesystem buffers by paging out
PostgreSQL's memory (possibly shared buffers).
3. PostgreSQL finds data requested by its clients in database cache,
and tries to get it in memory.
4. But the shared buffers are paged out, and page-ins happen.

 Are you sure you're not running this on, for example, IDE disks with a
 write-cache that lies? Windows will write through that write-cache even
 if the disk lies, whereas most Linux versions won't. At least that used
 to be the case not too long ago, but there has also been talk about
 fixing that in Linux, so maybe that's done...

I'm using a PC server whose disks are all SCSI.  It has no IDE disk.

 Also note that when you run pgbench on the local machine, you take a
 much higher hit from the fact that context switching between processes
 is a lot more expensive on Windows than it is on Linux. But it shouldn't
 be big enough to explain the huge difference you had in your test.

Yes, I suspect it, too.  That is why Oracle uses one multi-threaded server
process on Windows, while it employs a multi-process architecture on other
platforms.  SQL Server is of course multi-threaded.  SRA's original
PostgreSQL for Windows (based on 7.x) was also multi-threaded.







Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Jan Wieck

On 2/8/2007 11:41 PM, Richard Troy wrote:

On Thu, 8 Feb 2007, Joshua D. Drake wrote:


Well how deep are we talking here? My understanding of what Jan wants to
do is simple.

Be able to declare which triggers are fired depending on the state of
the cluster.

In Jan's terms, the Origin or Subscriber. In Replicator terms the Master
or Slave.

This is useful because I may have a trigger on the Master and the same
trigger on the Slave. You do not want the trigger to fire on the Slave
because we are doing data replication. In short, we replicate the
result, not the action.

However, you may want triggers that are on the Slave to fire separately.
A reporting server that generates materialized views is a good example.
Don't tie up the Master with what a Slave can do.



It'd be great if Jan considers the blending of replication; 


Please elaborate. I would really like to get all you can contribute.




 I have no clue what got you into what you are doing here.


Jan, some sleep now and then might be helpful to your public disposition.


Richard, don't embarrass Bruce. He doesn't need your help.

I have been with this project and know Bruce Momjian for more than 10 
years. Every now and then, Bruce and I get into some sort of eventually 
publicly visible dispute that doesn't really mean much. I'll probably 
spend next Christmas with him and his family again, play a few rounds of 
backgammon with Wilma (who I really owe a revenge), hopefully don't 
interfere too much with Christine's work (especially when it involves 
handling food over a white carpet) and none of us will even remember 
this crap. Our friendship has been through some real tests. Any real 
problem we would have, we'd never discuss here. We would just meet and 
talk.



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] Archive log compression keeping physical log availablein the crash recovery

2007-02-09 Thread Zeugswetter Andreas ADI SD
 Our evaluation result is as follows:
 Database size: 2GB
 WAL size (after 10hours pgbench run): 48.3GB
 gzipped size: 8.8GB
 removal of the physical log: 2.36GB
 fullpage_writes=off log size: 2.42GB

 I'm planning to do archive log size evaluation with other benchmarks 
 such as DBT-2 as well.

Looks promising :-) 

Did you use the standard 5-minute checkpoint_timeout?
It would be very nice to see a run with checkpoint_timeout increased
to 30 min, because that is what you would tune if you are concerned
about full-page overhead.

Andreas



Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Jan Wieck

On 2/7/2007 7:13 AM, José Orlando Pereira wrote:

On Saturday 03 February 2007, Bruce Momjian wrote:

Jan Wieck wrote:
 I don't have any such paper and the proof of concept will be the
 implementation of the system. I do however see enough resistance against
 this proposal to withdraw the commit timestamp at this time. The new
 replication system will therefore require the installation of a patched,
 non-standard PostgreSQL version, compiled from sources cluster wide in
 order to be used. I am aware that this will dramatically reduce its
 popularity, but it is impossible to develop this essential feature as an
 external module.

 I thank everyone for their attention.

Going and working on it on your own doesn't seem like the proper
solution.  I don't see people objecting to adding it, but they want it
work, which I am sure you want too.  You have to show how it will work
and convince others of that, and then you have a higher chance it will
work, and be in the PostgreSQL codebase.


Hi,

Would it be possible to solve the problem using the GORDA on-commit hook?

Jan would be able to reliably obtain a commit timestamp with the desired 
semantics and store it in a regular table within transaction boundaries.


I am not sure; I would have to look at what exactly that hook provides.
The key to a Lamport timestamp is that it is advancing in commit order
(plus some other things ... of course). If the hook can guarantee that
the calls are always made in commit order, serialized without any race
condition possible, it would probably be suitable.



Jan



PostgreSQL would not have to commit to a specific timestamp semantics and the 
patch is quite small.


Regards,




--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] Variable length varlena headers redux

2007-02-09 Thread Bruce Momjian
Bruce Momjian wrote:
 
 Uh, I thought the approach was to create type-specific in/out functions,
 and add casting so that every time they were referenced, they would expand
 to a varlena structure in memory.

Oh, one more thing.  You are going to need to teach the code that walks
through a tuple's attributes about the short header types.  I think you
should set pg_type.typlen = -3 (vs. -1 for varlena) and put your macro
code there too.  (As an example, see the macro att_addlength().)

I know it is kind of odd to have a data type that is only used on disk,
and not in memory, but I see this as a baby varlena type, used only to
store and get varlena values using less disk space.

---
 
 Gregory Stark wrote:
  
  I've been looking at this again and had a few conversations about it.
  This may be easier than I had originally thought but there's one major
  issue that's bugging me. Do you see any way to avoid having every user
  function everywhere use a new macro API instead of VARDATA/VARATT_DATA
  and VARSIZE/VARATT_SIZEP?
  
  The two approaches I see are either
  
  a) To have two sets of macros, one of which, VARATT_DATA and
  VARATT_SIZEP, are for constructing new tuples and behaves exactly as it
  does now. So you always construct a four-byte header datum. Then in
  heap_form*tuple we check if you can use a shorter header and convert.
  VARDATA/VARSIZE would be for looking at existing datums and would
  interpret the header bits.
  
  This seems very fragile since one stray call site using VARATT_DATA to
  find the data in an existing datum would cause random bugs that only
  occur rarely in certain circumstances. It would even work as long as
  the size is filled in with VARATT_SIZEP first, which it usually is, but
  fail if someone changes the order of the statements.
  
  or
  
  b) throw away VARATT_DATA and VARATT_SIZEP and make all user functions
  everywhere change over to a new macro API. That seems like a pretty big
  burden. It's safer but means every contrib module would have to be
  updated and so on.
  
  I'm hoping I'm missing something and there's a way to do this without
  breaking the API for every user function.
  
  
 
 -- Start of included mail From: Tom Lane [EMAIL PROTECTED]
 
  To: Gregory Stark [EMAIL PROTECTED]
  cc: Gregory Stark [EMAIL PROTECTED], Bruce Momjian [EMAIL PROTECTED], 
  Peter Eisentraut [EMAIL PROTECTED], 
  pgsql-hackers@postgresql.org, 
  Martijn van Oosterhout kleptog@svana.org
  Subject: Re: [HACKERS] Fixed length data types issue 
  Date: Mon, 11 Sep 2006 13:15:43 -0400
  Lines: 64
  Xref: stark.xeocode.com work.enterprisedb:683
 
  Gregory Stark [EMAIL PROTECTED] writes:
   In any case it seems a bit backwards to me. Wouldn't it be better to
   preserve bits in the case of short length words where they're precious
   rather than long ones? If we make 0xxx the 1-byte case it means ...
  
  Well, I don't find that real persuasive: you're saying that it's
  important to have a 1-byte not 2-byte header for datums between 64 and
  127 bytes long.  Which is by definition less than a 2% savings for those
  values.  I think it's more important to pick bitpatterns that reduce
  the number of cases heap_deform_tuple has to think about while decoding
  the length of a field --- every if in that inner loop is expensive.
  
  I realized this morning that if we are going to preserve the rule that
  4-byte-header and compressed-header cases can be distinguished from the
  data alone, there is no reason to be very worried about whether the
  2-byte cases can represent the maximal length of an in-line datum.
  If you want to do 16K inline (and your page is big enough for that)
  you can just fall back to the 4-byte-header case.  So there's no real
  disadvantage if the 2-byte headers can only go up to 4K or so.  This
  gives us some more flexibility in the bitpattern choices.
  
  Another thought that occurred to me is that if we preserve the
  convention that a length word's value includes itself, then for a
  1-byte header the bit pattern 1000 is meaningless --- the count
  has to be at least 1.  So one trick we could play is to take over
  this value as the signal for toast pointer follows, with the
  assumption that the tuple-decoder code knows a-priori how big a
  toast pointer is.  I am not real enamored of this, because it certainly
  adds one case to the inner heap_deform_tuple loop and it'll give us
  problems if we ever want more than one kind of toast pointer.  But
  it's a possibility.
  
  Anyway, a couple of encodings that I'm thinking about now involve
  limiting uncompressed data to 1G (same as now), so that we can play
  with the first 2 bits instead of just 1:
  
  00xx  4-byte length word, aligned, uncompressed data (up to 1G)
  01xx  4-byte length word, aligned, compressed data (up to 1G)
  100x  1-byte length word, 

[HACKERS] Database backup mechanism

2007-02-09 Thread RaviKumar . Mandala
Hi Folks,

We have a requirement to deal with large databases, on the order of 
terabytes, when we go into production. What is the best database back-up 
mechanism and what are the possible issues?

pg_dump can back up a database, but the dump file is limited by the OS 
file-size limit. What about the option of compressing the dump file? How 
much time does it generally take for large databases? I heard that it 
would be way too long (even a day or more). I haven't tried it out, though.

What about taking a zipped back-up of the database directory? We tried this 
out, but the checkpoint data in the pg_xlog directory is also being backed up. 
Since these logs keep on increasing from day 1 of database creation, the 
back-up size is increasing drastically.
Can we back up certain subdirectories without loss of information or 
consistency?

Any quick comments/suggestions in this regard would be very helpful.

Thanks in advance,
Ravi Kumar Mandala

Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Kris Jurka

Richard Troy wrote:


It'd be great if Jan considers the blending of replication; any given DB
instance shouldn't be only a master/originator or only a slave/subscriber.
A solution that lets you blend replication strategies in a single db is,
from my point of view, very important.



Perhaps if more people read Jan's posts he wouldn't be so frustrated.

http://archives.postgresql.org/pgsql-hackers/2007-01/msg01302.php

He clearly describes that the master/slave setting is per session, not 
per database.


Kris Jurka



Re: [HACKERS] doxygen.postgresql.org

2007-02-09 Thread Josh Berkus

Bruce Momjian wrote:

Luke Lonergan wrote:

Yay!

This rocks IMO, but I'm a borderline PHB so what do I know ;-)


You think, "Oh, this will make my developers more productive."  :-)



Well, I just had a university contact thank us for setting it up.  So, 
thanks!


--Josh Berkus



Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread J. Andrew Rogers

On Feb 8, 2007, at 8:41 PM, Richard Troy wrote:
It'd be great if Jan considers the blending of replication; any given DB
instance shouldn't be only a master/originator or only a slave/subscriber.
A solution that lets you blend replication strategies in a single db is,
from my point of view, very important.



It might be constructive to define what a minimal complete set of  
replication primitives actually is in addition to which ones should  
be implemented.  In addition to master/slave models, you have Paxos  
algorithms and dynamic reconfiguration models in literature that can  
utilize many of the same primitives but which are very different in  
implementation.  I see the value of Jan's proposal, but perhaps it  
would be better to step back and make some assertions about the  
nature of the core capabilities that will be supported in some  
broader picture.  Having a theoretically (mostly) complete set of  
usable primitives would be an incredibly powerful feature set.


Cheers,

J. Andrew Rogers
[EMAIL PROTECTED]




Re: [HACKERS] [PATCHES] How can I use 2GB of shared buffers on Windows?

2007-02-09 Thread Magnus Hagander
On Fri, Feb 09, 2007 at 01:06:13PM +0900, Takayuki Tsunakawa wrote:
  On Thu, Feb 08, 2007 at 09:50:26PM +0900, Takayuki Tsunakawa wrote:
  When I try to start PostgreSQL 8.2.1 on Windows 2003 Server with
  shared_buffers=1024MB, I get the following error messages in the Event
  Log (with log_min_messages=debug5) and can't start PostgreSQL:
 
  Is this for testing, or for production? From what I've heard, you would
  normally never want that much shared memory - I've seen more reports
  that you should keep it as low as possible, really. For performance
  reasons.
 
 For testing.  I wanted to place all data in shared buffers to
 eliminate reads from disk while I run pgbench repeatedly (actually
 most reads should come from kernel cache, though.)

Right. Which is why you're likely to see better performance if you keep
shared buffers smaller. There is something in dealing with it that's
slow on win32, per reports from the field. It needs to be investigated
further...


 Does PostgreSQL for Windows have any problem when using a large
 database cache unlike UNIX versions? 

We've had reports that it's slow with large shared_buffers, yes.

 I'm excited about your current great work to enable building all of
 PostgreSQL with MSVC.  I thought you were aiming at making PostgreSQL
 64-bit on Windows in the near future (though you may not have said so on
 the list.)  I'm afraid MinGW will not extend to 64-bit (for x64 and
 Itanium) at least reliably and immediately, due to the difference in data
 model -- 'long' is still 32-bit in 64-bit applications on Windows.  I
 thought Magnus-san got worried about it and started the activity of
 completely switching to MSVC.

Well, that's one of the reasons, yes. But being able to build 64-bit
won't automatically mean that larger shared buffers is the way to go.

 BTW, the current PostgreSQL for Windows is very slow, isn't it?  I
 compared the performance of PostgreSQL 8.2.x for Linux (RHEL4 for x86,
 kernel 2.6.x) and Windows Server 2003.  I ran 'pgbench -c32 -t500' on
 the same machine with the same disk layout for data files and WAL,
 i.e. they are stored on separate disks.  The settings in
 postgresql.conf is the same, except for wal_sync_method -- it is set
 to open_sync on Linux and open_datasync on Windows, because they are
 the best for each platform.
 Linux version shows 1100 tps, but Windows version shows only 450 tps.

It's certainly slower than on Linux, but that's a larger difference than
I've usually seen. Are you sure you're not running this on, for example,
IDE disks with a write-cache that lies? Windows will write through that
write-cache even if the disk lies, whereas most Linux versions won't. At
least that used to be the case not too long ago, but there has also been
talk about fixing that in Linux, so maybe that's done...

Also note that when you run pgbench on the local machine, you take a
much higher hit from the fact that context switching between processes
is a lot more expensive on Windows than it is on Linux. But it shouldn't
be big enough to explain the huge difference you had in your test.

//Magnus



Re: [HACKERS] Archive log compression keeping physical log available in the crash recovery

2007-02-09 Thread Koichi Suzuki

Further information about the following evaluation:

Pgbench throughput was as follows:
Full WAL archiving (full_page_writes=on), 48.3GB archive: 123TPS
Gzip WAL compress, 8.8GB archive: 145TPS
Physical log removal, 2.36GB archive: 148TPS
full_page_writes=off, 2.42GB archive: 161TPS

Koichi Suzuki wrote:

Sorry for the late response.

Gzip can reduce the archive log size to about one fourth.  My point is 
that it can still be quite large.  Removing physical log records (by 
replacing them with logical log records) from the archive log will shrink 
its size to one twentieth, in the case of a pgbench test of about ten 
hours (3,600,000 transactions) with a database size of about 2GB.  In the 
case of gzip, maybe because of the higher CPU load, total throughput is 
less than just copying WAL to the archive.  In our case, throughput seems 
to be slightly higher than just copying (preserving the physical log) or 
gzip.  I'll gather the measurement results and try to post them.


The size of the archive log seems not to be affected by the size of the 
database, but just by the number of transactions.  In the case of 
full_page_writes=on and full_page_compress=on, the compressed archive log 
size seems to depend only on the number of transactions and the 
transaction characteristics.


Our evaluation result is as follows:
Database size: 2GB
WAL size (after 10hours pgbench run): 48.3GB
gzipped size: 8.8GB
removal of the physical log: 2.36GB
fullpage_writes=off log size: 2.42GB

The reason why the archive log size in our case is slightly smaller than 
full_page_writes=off is that we remove not only the physical logs but 
also each page header and the dummy part at the tail of each log segment.


Further, we can apply gzip to this archive (2.36GB).   Final size is 
0.75GB, less than one sixtieth of the original WAL.


The overall duration to gzip the WAL (48.3GB to 8.8GB) was about 4,000 sec, 
and our compression to 2.36GB needed about 1,010 sec, slightly less than 
a plain cat command (1,386 sec).  When gzip is combined with our compression 
(48.3GB to 0.75GB), the total duration was about 1,330 sec.


This shows that physical log removal is a good choice in the following 
cases:

1) We need the same crash recovery possibility as full_page_writes=on, and
2) We need to shrink the size of the archive log for a longer storage period.

Of course, if we care about crash recovery on a PITR slave, we still need 
physical log records in the archive log.  In this case, because the archive 
log is not intended to be kept long, its size will not be an issue.


I'm planning to do archive log size evaluation with other benchmarks 
such as DBT-2 as well.


Materials for this have already been posted to HACKERS and PATCHES.   I 
hope you will try it.



Jim Nasby wrote:
I thought the drive behind full_page_writes = off was to reduce the 
amount of data being written to pg_xlog, not to shrink the size of a 
PITR log archive.


ISTM that if you want to shrink a PITR log archive you'd be able to 
get good results by (b|g)zip'ing the WAL files in the archive. A quick 
test on my laptop shows over a 4x reduction in size. Presumably that'd 
be even larger if you increased the size of WAL segments.


On Jan 29, 2007, at 2:15 AM, Koichi Suzuki wrote:

This is a proposal for archive log compression keeping physical log 
in WAL.


In PostgreSQL 8.2, the full_page_writes option came back to cut physical
log out of both WAL and the archive log.   To deal with partial writes
during the online backup, the physical log is written only during the
online backup.

Although this dramatically reduces the log size, it can put crash
recovery at risk.   If any page is inconsistent because of a fault, crash
recovery doesn't work, because full page images are necessary to recover
the page in such a case.  For critical use, especially commercial use,
we don't like to risk the chance of crash recovery, while reducing the
archive log size will be crucial too for larger databases.   WAL size
itself may be less critical, because the segments are reused cyclically.

Here, I have a simple idea to reduce archive log size while keeping
physical log in xlog:

1. Create new GUC: full_page_compress,

2. Turn on both the full_page_writes and full_page_compress: physical
log will be written to WAL at the first write to a page after the
checkpoint, just as conventional full_page_writes ON.

3. Unless physical log is written during the online backup, this can be
removed from the archive log.   One bit in XLR_BKP_BLOCK_MASK
(XLR_BKP_REMOVABLE) is available to indicate this (out of four, only
three of them are in use) and this mark can be set in XLogInsert().
With both full_page_writes and full_page_compress on, both the logical
log and the physical log will also be written to WAL with the XLR_BKP_REMOVABLE
flag on.  Having both physical and logical log in the same WAL is not
harmful to crash recovery.  In crash recovery, the physical log is
used if it's available.  Logical log is used in the archive recovery, as
the corresponding physical log 

Re: [HACKERS] HOT for PostgreSQL 8.3

2007-02-09 Thread Pavan Deolasee

On 2/9/07, Simon Riggs [EMAIL PROTECTED] wrote:


On Wed, 2007-02-07 at 14:17 -0500, Tom Lane wrote:

 ISTM we could fix that by extending the index VACUUM interface to
 include two concepts: aside from "remove these TIDs when you find them",
 there could be "replace these TIDs with those TIDs when you find them".
 This would allow pointer-swinging to one of the child tuples, after
 which the old root could be removed.  This has got the same atomicity
 problem as for CREATE INDEX, because it's the same thing: you're
 de-HOT-ifying the child.  So if you can solve the former, I think you
 can make this work too.

This is looking like the best option out of the many, since it doesn't
have any serious restrictions or penalties. Let's see what Pavan thinks,
since he's been working on this aspect.



ISTM that there are two related issues that we need to solve to make
progress:

- We must make de-HOTifying, or CHILLing, crash safe
- Concurrent index scans should work correctly with CHILLing operations

I think the first issue can be addressed along the lines of what Heikki
suggested. We can CHILL one tuple at a time. I am thinking of a two-step
process. In the first step, the root tuple and the heap-only tuple (which
needs CHILLing) are marked with a special flag, CHILL_IN_PROGRESS. This
operation is WAL-logged. We then insert appropriate index entries for the
tuple under consideration.

In the second step, the HEAP_UPDATED_ROOT and HEAP_ONLY_TUPLE
flags on the heap tuples are adjusted and CHILL_IN_PROGRESS flags are
cleared.

During normal operations, if the CHILL_IN_PROGRESS flag is found set, we
might need to do some more work to figure out whether the index insert
operations were successful or not. If we find that there are missing index
entries for the tuple under consideration for CHILLing, then those could be
added now and the flags set/reset appropriately.

The second problem of concurrent index scans seems a bit more complex.
We need a mechanism so that no tuples are missed or tuples are
not returned twice. Since CHILLing of a tuple adds a new access path to the
tuple from the index, a concurrent index scan may return a tuple twice.

How about grabbing an AccessExclusiveLock during the CHILLing
operation? This would prevent any concurrent index scans. Since CHILLing
of a large table can take a long time, the operation can be spread across
time with periodic acquire/release of the lock. This would prevent
starvation of other backends. Since CHILLing is required only for CREATE
INDEX and stub cleanup, I am assuming that it's ok for it to be lazy in
nature.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] HOT for PostgreSQL 8.3

2007-02-09 Thread Simon Riggs
On Thu, 2007-02-08 at 14:47 +, Heikki Linnakangas wrote:

 However, the easiest solution would be to make CREATE INDEX wait until 
 the old tuple is dead. That should be ok at least for concurrent CREATE 
 INDEX, because it already has that kind of a wait between 1st and 2nd 
 phase.

I'm not sure this part of the idea is possible; the rest sounded good.
Looking at DefineIndex() the wait happens only for transactions that
already have a lock on the table being indexed, which may not be very
many. AFAICS the ref page for CREATE INDEX CONCURRENTLY isn't fully
accurate (any more?) when it says "[CREATE INDEX] must wait for all
existing transactions to terminate."

Waiting until an arbitrary Xid dies could be deadlock-prone, if the lock
isn't taken carefully. Imagine a pg_dump coming towards you and then
waiting on the locked table. You'd need to wait at the beginning of the
command, before locks were taken. However, that would mean CREATE INDEX
would only be possible outside of transaction blocks, which I don't
think is acceptable.

I wanted this for VACUUM FULL also, but same problem exists.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Variable length varlena headers redux

2007-02-09 Thread Greg Stark

Bruce Momjian [EMAIL PROTECTED] writes:

 Bruce Momjian wrote:
  
  Uh, I thought the approach was to create type-specific in/out functions,
  and add casting so every time there were referenced, they would expand
  to a varlena structure in memory.

Are you talking about actual casts? Because that would lead to all kinds of
strange places with indexes and function lookups and so on. Or are you just
talking about code in the macro API for datums?

 Oh, one more thing.  You are going to need to teach the code that walks
 through a tuple attributes about the short header types.  I think you
 should set pg_type.typlen = -3 (vs -1 for varlena) and put your macro
 code there too.  (As an example, see the macro att_addlength().)

I thought of doing this. It would let us, for example, handle text/varchar,
bpchar, and numeric while leaving other data types unchanged.

That does help somewhat but unfortunately text is the problem case. There's
tons of code that generates text without using textin. All of pgcrypto for
example.

 I know it is kind of odd to have a data type that is only used on disk,
 and not in memory, but I see this as a baby varlena type, used only to
 store and get varlena values using less disk space.

I was leaning toward generating the short varlena headers primarily in
heap_form*tuple and just having the datatype specific code generate 4-byte
headers much as you describe.

However that doesn't get us away from having VARDATA/VARSIZE aware of the new
headers. Since heap_deform*tuple and the other entry points which extract
individual attributes return pointers to the datum in the tuple. They can't
expand the header to a 4-byte header on the fly.

I thought of doing it in DETOAST_DATUM on the theory that everyone's going to
be calling it on their arguments. However there are other cases than just
arguments. Other functions might call, say, text_concat() and then call
VARDATA() on the result. 

Even if we only ever generate short headers on heap_form*tuple and always
expand them on DETOAST we could have code that passes around tuples that it
knows are entirely in memory and therefore not toasted. I'm thinking of
plpgsql here primarily. Perhaps it would be enough to outlaw this behaviour
but it still seems sort of fragile to me.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] Proposal: TABLE functions

2007-02-09 Thread Pavel Stehule

Hello,

It can be simpler than I thought. I need only one flag, and if it is true 
then I don't create language variables for the OUT params. But I need one 
more column in pg_proc.


Currently a lot of columns in pg_proc are bool. What about one binary column 
for the other options? I hope future versions can support autonomous 
transactions, which need a flag too.


Regards
Pavel Stehule





Pavel Stehule wrote:
 Hello,

 Currently PostgreSQL supports set-returning functions.

 ANSI SQL 2003 adds a new type of function - table functions - with
 this syntax:

 CREATE FUNCTION foo() RETURNS TABLE (c1 t1, ... )


Yeah, this should be pretty easy because a table is just a composite
type. You can already do this:

CREATE TABLE foo (id bigint, first_name text);

CREATE FUNCTION foo() RETURNS SETOF foo ...

 The equivalent PostgreSQL statements are:

 CREATE TYPE tmptype AS (c1 t1, ...)
 CREATE FUNCTION ... RETURNS SETOF tmptype AS ...

 All necessary infrastructure is done. The implementation probably needs
 only small changes in the parser.

 This feature doesn't need any changes in SQL functions. I expect they
 will be more readable and consistent.

 CREATE OR REPLACE FUNCTION foo(f integer)
 RETURNS TABLE(a int, b int) AS
 $$
   SELECT a, b
     FROM footab
    WHERE a < f;
 $$ LANGUAGE sql;

 plpgsql RETURN has to be enhanced for table expressions.

 CREATE OR REPLACE FUNCTION foo(f integer)
 RETURNS TABLE(a int, b int) AS -- they are not variables!
 $$
  BEGIN
    RETURN TABLE(SELECT a, b   -- it's safe, a and b are not variables
                   FROM footab
                  WHERE a < f);
  END;
 $$ LANGUAGE plpgsql;

 RETURN NEXT can be used without changes. This feature doesn't allow
 combining the RETURN TABLE and RETURN NEXT statements.

 Table functions can have only IN arguments.

 Advantages:
 * conformance with ANSI SQL 2003
 * lower probability of collisions between variable names and column names

 Regards
 Pavel Stehule
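
For context, a hedged sketch of how such a table function would be used; to the caller it behaves like any set-returning function:

 SELECT a, b
   FROM foo(10)
  WHERE b > 0;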






--

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/








Re: [HACKERS] [COMMITTERS] pgsql: Add lock matrix to documentation.

2007-02-09 Thread Oleg Bartunov

On Fri, 9 Feb 2007, Peter Eisentraut wrote:


On Thursday, 8 February 2007 at 16:32, Bruce Momjian wrote:

Log Message:
---
Add lock matrix to documentation.


This needs some revisions.  The table needs to be mentioned somewhere in the
text, so the reader knows when or why to refer to it.  Also, the cryptic
abbreviations need to be expanded or explained.  And then the concept of
lock compatibility, as the table puts it, is not used anywhere else in the
documentation.  The table should be put in terms of conflicts instead.



Another version, with expanded abbreviations, is 
http://mira.sai.msu.su/~megera/pgsql/lockmatrix/c2.html, if we remove 
UPDATE EXCLUSIVE.


While "compatibility matrix" is a commonly accepted term, I agree that 
using "conflicts" would be better in the context of our docs.


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [HACKERS] Database backup mechanism

2007-02-09 Thread Heikki Linnakangas

[EMAIL PROTECTED] wrote:
We have a requirement to deal with large databases of the size Terabytes 
when we go into production. What is the best database back-up mechanism 
and possible issues?


It depends.

Make sure you read Chapter 23, "Backup and Restore", of the user manual:

http://www.postgresql.org/docs/8.2/interactive/backup.html

It discusses pg_dump and restore, as well as file system level backup. 
You'll probably want to set up continuous archiving, which allows you to 
take a file-system level backup without shutting down the database.
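
A minimal sketch of the SQL side of such an online base backup, assuming WAL archiving (archive_command) has already been configured as the manual describes:

 SELECT pg_start_backup('nightly');
 -- copy the data directory with an external tool (tar, rsync, ...)
 SELECT pg_stop_backup();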


What about taking a zipped back-up of the database directory? We tried this 
out, but the checkpoint data in the pg_xlog directory is also being backed up. 
Since these logs keep on increasing from day 1 of database creation, the 
back-up size is increasing drastically.


The amount of WAL files in pg_xlog directory is controlled by the 
checkpoint_segments configuration parameter.


Can we back-up certain subdirectories without loss of information or 
consistency..?


No.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Database backup mechanism

2007-02-09 Thread Doug Knight
I would also be interested in any creative ways to reduce the size and
time needed to back up databases/clusters. We were just having a conversation
about this yesterday. We were mulling over things like using rsync to
only back up files in the database directory tree that actually changed,
or maybe doing a selective backup of files based on modified times, etc.,
but we were unsure if this would be a safe, reliable way to back up a
reduced set of data.

Doug Knight
WSI Inc.
Andover, MA
 
On Fri, 2007-02-09 at 12:45 +0530, [EMAIL PROTECTED] wrote:
 
 Hi Folks, 
 
 We have a requirement to deal with large databases of the size
 Terabytes when we go into production. What is the best database
 back-up mechanism and possible issues? 
 
 pg_dump can back-up database but the dump file is limited by OS
 file-size limit. What about the option of compressing the dump file?
 How much time does it generally take for large databases? I heard,
 that it would be way too long (even one or days). I haven't tried it
 out, though. 
 
 What about taking a zipped back-up of the database directory? We tried
 this out, but the checkpoint data in the pg_xlog directory is also being
 backed up. Since these logs keep on increasing from day 1 of database
 creation, the back-up size is increasing drastically. 
 Can we back-up certain subdirectories without loss of information or
 consistency..? 
 
 Any quick comments/suggestions in this regard would be very helpful. 
 
 Thanks in advance, 
 Ravi Kumar Mandala


Re: [HACKERS] HOT for PostgreSQL 8.3

2007-02-09 Thread Heikki Linnakangas

Tom Lane wrote:

ISTM we could fix that by extending the index VACUUM interface to
include two concepts: aside from "remove these TIDs when you find them",
there could be "replace these TIDs with those TIDs when you find them".
This would allow pointer-swinging to one of the child tuples, after
which the old root could be removed.  


Implementing the "replace these TIDs" operation atomically would be 
simple, except for the new bitmap index AM. It should be possible there 
as well, but if the old and new tid happen to be on different bitmap 
pages, it requires some care to avoid deadlocks.


Also, we'd need more work mem for vacuum.


This has got the same atomicity
problem as for CREATE INDEX, because it's the same thing: you're
de-HOT-ifying the child.


Not exactly. De-HOT-ifying, or chilling, a child means inserting new 
index entries. But if we're just replacing the tids from the existing 
index entries, it's ok if we crash after replacing some but not all of 
them. The next vacuum would replace the rest of the pointers, and remove 
the old root tuple.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Archive log compression keeping physical log available in the crash recovery

2007-02-09 Thread Martijn van Oosterhout
On Fri, Feb 09, 2007 at 01:00:10PM +0900, Koichi Suzuki wrote:
 Further, we can apply gzip to this archive (2.36GB).   Final size is 
 0.75GB, less than one sixtieth of the original WAL.

Note that if you were compressing on the fly, you'll have to tell gzip
to regularly flush its buffers to make sure all the data actually hits
disk. That cuts into your compression ratio...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Hierarchical Queries--Status

2007-02-09 Thread Mark Cave-Ayland
On Thu, 2007-02-08 at 20:49 -0500, Bruce Momjian wrote:
 Who is working on this item?

Jonah was trying to complete this for 8.3, but I believe that he has
handed it on to Gregory Stark - I think
http://archives.postgresql.org/pgsql-hackers/2007-01/msg01586.php is the
latest update.


Kind regards,

Mark.





Re: [HACKERS] Database backup mechanism

2007-02-09 Thread Andrew Dunstan

[EMAIL PROTECTED] wrote:


Hi Folks,

We have a requirement to deal with large databases of the size 
Terabytes when we go into production. What is the best database 
back-up mechanism and possible issues?


pg_dump can back-up database but the dump file is limited by OS 
file-size limit. What about the option of compressing the dump file? 
How much time does it generally take for large databases? I heard, 
that it would be way too long (even one or days). I haven't tried it 
out, though.


What about taking a zipped back-up of the database directory? We tried 
this out, but the checkpoint data in the pg_xlog directory is also being 
backed up. Since these logs keep on increasing from day 1 of database 
creation, the back-up size is increasing drastically.
Can we back-up certain subdirectories without loss of information or 
consistency..?


Any quick comments/suggestions in this regard would be very helpful.



Please ask in the correct forum, either pgsql-general or pgsql-admin. 
This list is strictly for discussion of development of postgres, not 
usage questions.


(If all you need is a pg_dump backup, maybe you could just pipe its 
output to something like 'split -a 5 -b 1000m - mybackup')


cheers

andrew



Re: [HACKERS] Variable length varlena headers redux

2007-02-09 Thread Bruce Momjian
Greg Stark wrote:
 
 Bruce Momjian [EMAIL PROTECTED] writes:
 
  Bruce Momjian wrote:
   
   Uh, I thought the approach was to create type-specific in/out functions,
   and add casting so every time there were referenced, they would expand
   to a varlena structure in memory.
 
 Are you talking about actual casts? Because that would lead to all kinds of
 strange places with indexes and function lookups and so on. Or are you just
 talking about code in the macro api to datum?

I am thinking of auto-casts, sort of like how varchar, char, and text
are all internally treated as interchangeable.

  Oh, one more thing.  You are going to need to teach the code that walks
  through a tuple attributes about the short header types.  I think you
  should set pg_type.typlen = -3 (vs -1 for varlena) and put your macro
  code there too.  (As an example, see the macro att_addlength().)
 
 I thought of doing this. It would let us, for example, handle text/varchar,
 bpchar, and numeric while leaving other data types unchanged.
 
 That does help somewhat but unfortunately text is the problem case. There's
 tons of code that generates text without using textin. All of pgcrypto for
 example.

Well, TEXT can't use short headers.

  I know it is kind of odd to have a data type that is only used on disk,
  and not in memory, but I see this as a baby varlena type, used only to
  store and get varlena values using less disk space.
 
 I was leaning toward generating the short varlena headers primarily in
 heap_form*tuple and just having the datatype specific code generate 4-byte
 headers much as you describe.

Yep.

 However that doesn't get us away from having VARDATA/VARSIZE aware of the new
 headers. Since heap_deform*tuple and the other entry points which extract
 individual attributes return pointers to the datum in the tuple. They can't
 expand the header to a 4-byte header on the fly.

Yep, you are going to have to special-case those call points to test
typlen and use your short macros there if == -3.

 I thought of doing it in DETOAST_DATUM on the theory that everyone's going to
 be calling it on their arguments. However there are other cases than just
 arguments. Other functions might call, say, text_concat() and then call
 VARDATA() on the result. 

Right, I think all the in-memory stuff has to be varlena.

 Even if we only ever generate short headers on heap_form*tuple and always
 expand them on DETOAST we could have code that passes around tuples that it
 knows are entirely in memory and therefore not toasted. I'm thinking of
 plpgsql here primarily. Perhaps it would be enough to outlaw this behaviour
 but it still seems sort of fragile to me.

Yea, we might need some cleanup, but the cleanup is just to do things
properly.  I am unclear on the case you are describing.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] [COMMITTERS] pgsql: Add lock matrix to documentation.

2007-02-09 Thread Merlin Moncure

On 2/9/07, Oleg Bartunov oleg@sai.msu.su wrote:

On Fri, 9 Feb 2007, Peter Eisentraut wrote:

 Am Donnerstag, 8. Februar 2007 16:32 schrieb Bruce Momjian:
 Log Message:
 ---
 Add lock matrix to documentation.

 This needs some revisions.  The table needs to be mentioned somewhere in the
 text, so the reader knows when or why to refer to it.  Also, the cryptic
 abbreviations need to be expanded or explained.  And then the concept of
 lock compatibility, as the table puts it, is not used anywhere else in the
 documentation.  The table should be put in terms of conflicts instead.


Another version, with expanded abbreviations, is
http://mira.sai.msu.su/~megera/pgsql/lockmatrix/c2.html, if we remove
UPDATE EXCLUSIVE.

While "compatibility matrix" is a commonly accepted term, I agree that
using "conflicts" would be better in the context of our docs.


How about changing 'current lock mode' to 'opposing lock mode'?
'current' kind of suggests that you are escalating your own lock.

merlin



Re: [HACKERS] [PATCHES] [pgsql-patches] Phantom Command IDs, updated patch

2007-02-09 Thread Tom Lane
Zeugswetter Andreas ADI SD [EMAIL PROTECTED] writes:
 So when a RI check locks a parent, you would not be able to update the
 parent in a later subtrans.  I can imagine, that the error would be a
 problem in a select for update loop, because there you usually want to
 update the row.

No, it would not, because select for update would acquire exclusive lock
in the first place.

regards, tom lane
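
For illustration, a hedged sketch (hypothetical table) of Tom's point: SELECT ... FOR UPDATE acquires the exclusive row lock up front, so a later UPDATE inside a subtransaction is not a shared-to-exclusive upgrade and would not hit the proposed error.

 BEGIN;
 SELECT * FROM parent WHERE id = 1 FOR UPDATE;  -- exclusive row lock taken here
 SAVEPOINT s1;
 UPDATE parent SET note = 'ok' WHERE id = 1;    -- no lock upgrade needed
 COMMIT;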



Re: [HACKERS] [COMMITTERS] pgsql: Add lock matrix to documentation.

2007-02-09 Thread Andrew Dunstan

Oleg Bartunov wrote:

On Fri, 9 Feb 2007, Peter Eisentraut wrote:


On Thursday, 8 February 2007 at 16:32, Bruce Momjian wrote:

Log Message:
---
Add lock matrix to documentation.


This needs some revisions.  The table needs to be mentioned somewhere in
the text, so the reader knows when or why to refer to it.  Also, the
cryptic abbreviations need to be expanded or explained.  And then the
concept of lock compatibility, as the table puts it, is not used anywhere
else in the documentation.  The table should be put in terms of conflicts
instead.



Another version, with expanded abbreviations, is 
http://mira.sai.msu.su/~megera/pgsql/lockmatrix/c2.html, if we remove 
UPDATE EXCLUSIVE.


While "compatibility matrix" is a commonly accepted term, I agree 
that using "conflicts" would be better in the context of our docs.


  


If UE is moved 2 or 3 places down/right, the matrix will be closer to 
triangular.


It would look better with non-conflicts being blank (or &nbsp; in HTML) 
rather than 'O', I think - the conflicts will stand out more.


cheers

andrew



Re: [HACKERS] [PATCHES] How can I use 2GB of shared buffers on Windows?

2007-02-09 Thread Martijn van Oosterhout
On Fri, Feb 09, 2007 at 07:31:33PM +0900, Takayuki Tsunakawa wrote:
 1. PostgreSQL tries to read data from disk into database cache.
 2. The kernel tries to allocate filesystem buffers by paging out
 PostgreSQL's memory (possibly shared buffers).
 3. PostgreSQL finds data requested by its clients in database cache,
 and tries to get it in memory.
 4. But the shared buffers are paged out, and page-ins happen.

One of the biggest issues with large shared memory segments is that we
can't be sure they'll actually stay in memory. So your shared memory
block should not be too much larger than your working set.

So yes, if you make a really large segment, the problem you describe
may happen.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] [PATCHES] How can I use 2GB of shared buffers on Windows?

2007-02-09 Thread Tom Lane
Takayuki Tsunakawa [EMAIL PROTECTED] writes:
 I wonder whether the users in the field you are talking about set Windows
 to use more memory for programs than for the filesystem cache, which is
 selectable from the [System] applet of Control Panel (Oh, I wonder how my
 machine is set in this respect... have to check.)  If the filesystem cache
 is preferred, the following scenario may be possible:

 1. PostgreSQL tries to read data from disk into database cache.
 2. The kernel tries to allocate filesystem buffers by paging out
 PostgreSQL's memory (possibly shared buffers).
 3. PostgreSQL finds data requested by its clients in database cache,
 and tries to get it in memory.
 4. But the shared buffers are paged out, and page-ins happen.

It's certainly true that if shared_buffers is large enough to make the
kernel try to swap out parts of the shared buffer array, then you've got
a counterproductive situation resulting in net *more* I/O than if you'd
used a smaller setting.  On some Unixen shared memory is implicitly
locked in RAM, and on others it's possible to request locking it (though
I'm not sure we try to at the moment).  Perhaps it's always swappable on
Windows?  Or maybe Windows is just more eager to swap it out?

regards, tom lane



Re: [HACKERS] [PATCHES] How can I use 2GB of shared buffers on Windows?

2007-02-09 Thread Magnus Hagander
On Fri, Feb 09, 2007 at 09:49:25AM -0500, Tom Lane wrote:
 Takayuki Tsunakawa [EMAIL PROTECTED] writes:
  I wonder whether the users in the field you are talking about set Windows
  to use more memory for programs than for the filesystem cache, which is
  selectable from the [System] applet of Control Panel (Oh, I wonder how my
  machine is set in this respect... have to check.)  If the filesystem cache
  is preferred, the following scenario may be possible:
 
  1. PostgreSQL tries to read data from disk into database cache.
  2. The kernel tries to allocate filesystem buffers by paging out
  PostgreSQL's memory (possibly shared buffers).
  3. PostgreSQL finds data requested by its clients in database cache,
  and tries to get it in memory.
  4. But the shared buffers are paged out, and page-ins happen.
 
 It's certainly true that if shared_buffers is large enough to make the
 kernel try to swap out parts of the shared buffer array, then you've got
 a counterproductive situation resulting in net *more* I/O than if you'd
 used a smaller setting.  On some Unixen shared memory is implicitly
 locked in RAM, and on others it's possible to request locking it (though
 I'm not sure we try to at the moment).  Perhaps it's always swappable on
 Windows?  Or maybe Windows is just more eager to swap it out?

The way it is now, it is definitely always swappable. I've been thinking of
digging into that, but haven't had the time. There are API calls to mark
memory as non-swappable, but I'm not sure they work on shared memory the
way we do it.

Apart from saying that, I will refrain from speculating more on *why*
it's slower with more shared memory before someone (yeah, I realise that
could be me) does some actual investigation into what happens.

//Magnus



Re: [HACKERS] [PATCHES] plpgsql, return can contains any expression

2007-02-09 Thread Alvaro Herrera
Pavel Stehule wrote:

 OK, where are we on this patch?
 
 without changes. This task have to do anybody who better know PostgreSQL 
 cache system than me.

How about you submit a version without any caching, but which works
correctly; and we worry about optimizations later?

 ---
 
 Pavel Stehule wrote:
 
 
  
  Pavel Stehule [EMAIL PROTECTED] writes:
This patch doesn't seem to cope with cases where the supplied tuple 
 has
the wrong number of columns, and it doesn't look like it's being
  careful
about dropped columns either.  Also, that's a mighty 
 bizarre-looking
choice of cache memory context in coerce_to_tuple ... but then 
 again,
why are you bothering with a cache at all for temporary arrays?
  
I am sorry, Tom. But I don't understand. I can check number of 
 columns,
ofcourse and I'll do it. What cache for temporary arrays do you 
 mean?
  
  I thought that making coerce_to_tuple depend on estate-err_func was
  pretty bizarre, and that there was no need for any cache at all for
  arrays that need only live as long as the function runs.  All you are
  saving here is a palloc/pfree cycle, which is not worth the 
 obscurantism
  and risk of bugs (are you sure natts can never change?).
 
  No, cache there is ugly. But I don't have idea about more efective
  implementation of it :-(. First version of this patch was more clean. 
 and
  little bit slow. This cache save 10%.
 
  
  BTW, if you want this patch to make it into 8.2, it needs to be fixed
  and resubmitted *very* soon.
 
  I understand, but I am not able work on it in next four days. And I need
  help with it from Neil. It will be for 8.3.


-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Proposal: TABLE functions

2007-02-09 Thread Tom Lane
Pavel Stehule [EMAIL PROTECTED] writes:
 it can by more simple than I though. I need only one flag, and if its true 
 then I don't create language variables for OUT params. But I need one next 
 column in pg_proc.

I thought you said this was just syntactic sugar for capabilities we
already had?

 Currently a lot of columns in pg_proc is bool. What about one binary columns 
 for other options? I hope so next versions can support autonomous 
 transaction, which need flag too.

I think stored procedures of that sort aren't functions at all, and
probably don't belong in pg_proc.

regards, tom lane



Re: [HACKERS] Variable length varlena headers redux

2007-02-09 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Bruce Momjian [EMAIL PROTECTED] writes:
 I know it is kind of odd to have a data type that is only used on disk,
 and not in memory, but I see this as a baby varlena type, used only to
 store and get varlena values using less disk space.

 I was leaning toward generating the short varlena headers primarily in
 heap_form*tuple and just having the datatype specific code generate 4-byte
 headers much as you describe.

I thought we had a solution for all this, namely to make the short-form
headers be essentially a TOAST-compressed representation.  The format
with 4-byte headers is still legal but just not compressed.  Anyone who
fails to detoast an input argument is already broken, so there's no code
compatibility hit taken.

regards, tom lane



Re: [HACKERS] referential Integrity and SHARE locks

2007-02-09 Thread Jan Wieck

On 2/8/2007 2:46 PM, Marc Munro wrote:

On Thu, 2007-08-02 at 14:33 -0500, Tom Lane wrote:

Marc Munro [EMAIL PROTECTED] writes:
 Yes in this case, T1 must abort because the record it was going to
 update has disappeared from underneath it.  I don't see how this is
 significantly different from the same race for the record if the table
 had no RI constraints.  The only difference that I can see, is that T1
 now has some locks that it must relinquish as the transaction aborts.

No, the difference is there would have been no error at all before;
if the record were deleted before T1 got to it then it wouldn't have
attempted to update it.  I really don't think you can make it work
to perform updates or deletes on a record you have not yet locked.


The record would be locked before the update or delete is attempted,
however it would not be locked until the referential integrity
constraints have succeeded in acquiring their locks.

It is becoming clear to me that I am missing something but I still don't
know what it is.  If anyone can see it and explain it I'd really
appreciate it.


I think you are missing the fact that the exclusive row lock on UPDATE 
is taken before any triggers are fired at all, even BEFORE ROW triggers. 
This is necessary in order to prevent the row being updated or removed 
concurrently while the triggers are executing. Since BEFORE ROW triggers 
can modify the content of the row (including the foreign key), the RI 
check and lock of the referenced row cannot happen before other BR 
triggers are completed.


In order to make your idea fly, the RI check trigger on INSERT or UPDATE 
would have to be fired before taking the row lock considering the NEW 
values for referencing columns as they are thus far. Since the row isn't 
locked at this time, it can change or disappear while the RI trigger is 
executing, so the check and lock has to be redone later with the actual 
row that got locked and after all BR triggers are done with it.



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] Database backup mechanism

2007-02-09 Thread Richard Huxton

Doug Knight wrote:

I would also be interested in any creative ways to reduce the size and
time to backup databases/clusters. We were just having a conversation
about this yesterday. We were mulling over things like using rsync to
only backup files in the database directory tree that actually changed.
Or maybe doing a selective backup of files based on modified times, etc,
but were unsure if this would be a safe, reliable way to backup a
reduced set of data.


Most of your virtual-layer filesystems should offer a snapshot facility 
that lets the database think it's writing to the files while you see a 
static version. You could rsync that against an older file-level copy as 
the base copy in a PITR backup.


Note - even with a snapshot facility you need to use PITR or stop the 
database to get a guaranteed working copy.


--
  Richard Huxton
  Archonet Ltd



Re: [HACKERS] HOT for PostgreSQL 8.3

2007-02-09 Thread Teodor Sigaev
Implementing the replace these TIDs operation atomically would be 
simple, except for the new bitmap index am. It should be possible there 


That isn't simple (maybe not even possible) for GIN.
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] HOT for PostgreSQL 8.3

2007-02-09 Thread Tom Lane
Teodor Sigaev [EMAIL PROTECTED] writes:
 Implementing the replace these TIDs operation atomically would be 
 simple, except for the new bitmap index am. It should be possible there 

 That isn't simple (may be, even possible) from GIN.

I suspect that those pushing this idea only care about btrees anyway,
so one possible answer is that HOT is only possible when the table has
only btree indexes --- or at least, only indexes of AMs that support the
replace-these-TIDs operation.  (Yet another pg_am flag...)
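
As a rough, purely hypothetical sketch of what such a per-AM "replace these
TIDs" entry point could look like (the types and names below are invented
for illustration and are not part of the real index AM API):

#include <stdint.h>

/* Illustrative only: a light-weight TID pair handed to an index AM. */
typedef struct TidLite
{
    uint32_t    block;      /* heap block number */
    uint16_t    offset;     /* line pointer within the block */
} TidLite;

typedef struct TidSwap
{
    TidLite     old_tid;    /* HOT root tuple being retired */
    TidLite     new_tid;    /* surviving child tuple */
} TidSwap;

/*
 * An AM advertising this capability would repoint any leaf entry matching
 * old_tid to new_tid during VACUUM, one entry at a time.
 */
typedef void (*am_replace_tids_fn) (void *index_rel,
                                    const TidSwap *swaps,
                                    int nswaps);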

regards, tom lane



[HACKERS] possible TODOs

2007-02-09 Thread Andrew Dunstan


From recent IRC discussion:

. provide encode() for text values in addition to current bytea
. provide a cast from text to bytea

Thoughts?

cheers

andrew



Re: [HACKERS] HOT for PostgreSQL 8.3

2007-02-09 Thread Simon Riggs
On Fri, 2007-02-09 at 10:17 -0500, Tom Lane wrote:
 Teodor Sigaev [EMAIL PROTECTED] writes:
  Implementing the replace these TIDs operation atomically would be 
  simple, except for the new bitmap index am. It should be possible there 
 
  That isn't simple (may be, even possible) from GIN.
 
 I suspect that those pushing this idea only care about btrees anyway,
 so one possible answer is that HOT is only possible when the table has
 only btree indexes --- or at least, only indexes of AMs that support the
 replace-these-TIDs operation.  (Yet another pg_am flag...)

Well, thats me. Yes, I think b-trees-only is acceptable.

Realistically, very frequent updating and full text indexing are easily
separable use cases, at least into separate tables.

HOT should be of use in Data Warehousing applications also, when summary
tables are maintained alongside detailed data, but that also sounds like
HOT and bitmap indexes would be separable at the table level without
difficulty.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] possible TODOs

2007-02-09 Thread Peter Eisentraut
On Friday, 9 February 2007 16:45, Andrew Dunstan wrote:
 . provide encode() for text values in addition to current bytea
 . proved cast from text to bytea

I'm not sure how exactly you want to define these, but some possible 
definitions will be dependent on the server encoding, which sounds 
troublesome.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] possible TODOs

2007-02-09 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
  From recent IRC discussion:
 . provide encode() for text values in addition to current bytea
 . proved cast from text to bytea

What would these be meant to do?  In particular I'm wondering if the
proposed cast is meant to be a binary-compatible transformation (thereby
exposing encoding issues), or if it's supposed to do the same backslash
conversions as I/O would, or what.

regards, tom lane



Re: [HACKERS] possible TODOs

2007-02-09 Thread Andrew Dunstan

Peter Eisentraut wrote:

On Friday, 9 February 2007 16:45, Andrew Dunstan wrote:
  

. provide encode() for text values in addition to current bytea
. proved cast from text to bytea



I'm not sure how exactly you want to define these, but some possible 
defintions will be dependent on the server encoding, which sounds 
troublesome.


  



Well, the cast wasn't my suggestion, so I'll let that one slide  - 
someone else can pick it up if they like :P .


The encode() suggestion was to have it have the same effect as:
   encode(byteain(varcharout(textvalue)), 'encodingtype')

Would it be such a worry to have that dependent on server encoding?

cheers

andrew



Re: [HACKERS] Re-ordering of OR conditions

2007-02-09 Thread Tom Lane
Jim Nasby [EMAIL PROTECTED] writes:
 IF I run the following with the a  2900 condition first, the more  
 expensive EXISTS only gets executed when needed, but if I change the  
 order of the OR's, the EXISTS is always executed. It would be good if  
 the optimizer could re-order the OR conditions based on estimated  
 cost (granted, this wouldn't work very well if you've got functions  
 in the OR, but it'd still be useful):

I looked at this for a bit.  It's in principle do-able but I'm not
sure it's a good idea.  The problem is that while AND'ed condition
lists are usually fairly short and hence cheap to sort, OR'ed condition
lists are not infrequently very long --- nobody blinks an eye at
hundreds of items in an IN-list for instance.  I'm afraid we'd waste
a lot more cycles sorting than we could hope to regain.

regards, tom lane



Re: [HACKERS] Variable length varlena headers redux

2007-02-09 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes:

 Greg Stark [EMAIL PROTECTED] writes:
 Bruce Momjian [EMAIL PROTECTED] writes:
 I know it is kind of odd to have a data type that is only used on disk,
 and not in memory, but I see this as a baby varlena type, used only to
 store and get varlena values using less disk space.

 I was leaning toward generating the short varlena headers primarily in
 heap_form*tuple and just having the datatype specific code generate 4-byte
 headers much as you describe.

 I thought we had a solution for all this, namely to make the short-form
 headers be essentially a TOAST-compressed representation.  The format
 with 4-byte headers is still legal but just not compressed.  Anyone who
 fails to detoast an input argument is already broken, so there's no code
 compatibility hit taken.

It's not just input arguments though. A function could call
DirectFunctionCall* and rightfully expect the return value not to need
detoasting.

I suppose this leads me to *only* generate short headers at heap_form*tuple
time. Then DirectFunctionCall isn't relevant and most of the user code is
perfectly safe.

There could still be cases where a heaptuple is passed around in pl_exec.c or
somewhere but if it's subsequently deformed whoever looks at it hopefully
wouldn't be too surprised for it to be mandatory that they go through
pg_detoast_datum. It'll happen as long as they use the DatumGetFoo macros
anyways.

It does mean that anyone going through a heap_form*tuple/heap_deform*tuple
cycle may generate more copies and memory allocations than they expected.


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Proposal: TABLE functions

2007-02-09 Thread Pavel Stehule


Pavel Stehule [EMAIL PROTECTED] writes:
 it can by more simple than I though. I need only one flag, and if its 
true
 then I don't create language variables for OUT params. But I need one 
next

 column in pg_proc.

I thought you said this was just syntactic sugar for capabilities we
already had?



My mistake, I am sorry. I have to store a flag somewhere: one bit which 
signals "don't use OUT arguments as the function's parameters". Everything 
else is only a game in the parser.


 Currently a lot of columns in pg_proc is bool. What about one binary 
columns

 for other options? I hope so next versions can support autonomous
 transaction, which need flag too.

I think stored procedures of that sort aren't functions at all, and
probably don't belong in pg_proc.



Why not? Some people use an ugly implementation of it in plperlu and DBI. 
pg_proc and the related infrastructure work well; they only lack a little bit 
of adaptability. I think one general option byte, plus one byte reserved for 
language handlers, could be interesting.


Regards
Pavel Stehule

_
Najdete si svou lasku a nove pratele na Match.com. http://www.msn.cz/




Re: [HACKERS] Proposal: TABLE functions

2007-02-09 Thread Tom Lane
Pavel Stehule [EMAIL PROTECTED] writes:
 I thought you said this was just syntactic sugar for capabilities we
 already had?

 My mistake. I am sorry. I have to store somewhere flag. One bit, which 
 signalise don't use OUT arguments as function's parameters.

Huh?  What exactly is the meaning of the arguments then?

It sounds to me like this might be better thought of as a new
proargmode value, but I'm quite unsure what you're talking about ...

regards, tom lane



Re: [HACKERS] HOT for PostgreSQL 8.3

2007-02-09 Thread Simon Riggs
On Fri, 2007-02-09 at 13:39 +, Heikki Linnakangas wrote:
 Tom Lane wrote:
  ISTM we could fix that by extending the index VACUUM interface to
  include two concepts: aside from remove these TIDs when you find them,
  there could be replace these TIDs with those TIDs when you find them.
  This would allow pointer-swinging to one of the child tuples, after
  which the old root could be removed.  
 
 Implementing the replace these TIDs operation atomically would be 
 simple, except for the new bitmap index am. It should be possible there 
 as well, but if the old and new tid happen to be on a different bitmap 
 page, it requires some care to avoid deadlocks.

Grouped Item Indexes cope with this easily also, yes?

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Hierarchical Queries--Status

2007-02-09 Thread Gregory Stark

Mark Cave-Ayland [EMAIL PROTECTED] writes:

 On Thu, 2007-02-08 at 20:49 -0500, Bruce Momjian wrote:
 Who is working on this item?

 Jonah was trying to complete this for 8.3, but I believe that he has
 handed it onto Gregory Stark - I think
 http://archives.postgresql.org/pgsql-hackers/2007-01/msg01586.php is the
 latest update.

There's also

http://archives.postgresql.org/pgsql-patches/2007-02/msg00086.php


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Proposal: TABLE functions

2007-02-09 Thread Pavel Stehule

 I thought you said this was just syntactic sugar for capabilities we
 already had?

 My mistake. I am sorry. I have to store somewhere flag. One bit, which
 signalise don't use OUT arguments as function's parameters.

Huh?  What exactly is the meaning of the arguments then?

It sounds to me like this might be better thought of as a new
proargmode value, but I'm quite unsure what you're talking about ...



My basic idea was:

CREATE FUNCTION aaa(IN a1, OUT a, OUT b)
RETURNS SETOF RECORD AS $$
..

is similar to
CREATE FUNCTION aaa(IN a1)
RETURNS SETOF RECORD AS $$

From the executor's perspective there isn't any difference, but the PL 
languages have to create only IN variables. It's protection against identifier 
name collisions. With a special flag I don't need any changes in the executor, 
only a small change in the PL compile routines. A special proargmode can be a 
solution too: I don't need a new column in pg_proc, but I have to modify the 
executor and need more changes in the output routines in the PLs.


I'll go the way of the special proargmode. It's a good idea.

Thank You
Pavel Stehule

_
Emotikony a pozadi programu MSN Messenger ozivi vasi konverzaci. 
http://messenger.msn.cz/





Re: [HACKERS] Re-ordering of OR conditions

2007-02-09 Thread Jim Nasby

On Feb 9, 2007, at 10:46 AM, Tom Lane wrote:

Jim Nasby [EMAIL PROTECTED] writes:

IF I run the following with the a  2900 condition first, the more
expensive EXISTS only gets executed when needed, but if I change the
order of the OR's, the EXISTS is always executed. It would be good if
the optimizer could re-order the OR conditions based on estimated
cost (granted, this wouldn't work very well if you've got functions
in the OR, but it'd still be useful):


I looked at this for a bit.  It's in principle do-able but I'm not
sure it's a good idea.  The problem is that while AND'ed condition
lists are usually fairly short and hence cheap to sort, OR'ed  
condition

lists are not infrequently very long --- nobody blinks an eye at
hundreds of items in an IN-list for instance.  I'm afraid we'd waste
a lot more cycles sorting than we could hope to regain.


Do people actually do that with OR lists though? My understanding is  
that now IN lists are converted to arrays, so I'd think that wouldn't  
be an issue there.


Is it easy for the planner to discern between simple OR expressions  
and stuff like EXISTS? If so it might be worth automatically pushing  
EXISTS to the end...

--
Jim Nasby   [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)






Re: [HACKERS] HOT for PostgreSQL 8.3

2007-02-09 Thread Heikki Linnakangas

Simon Riggs wrote:

On Fri, 2007-02-09 at 13:39 +, Heikki Linnakangas wrote:

Tom Lane wrote:

ISTM we could fix that by extending the index VACUUM interface to
include two concepts: aside from remove these TIDs when you find them,
there could be replace these TIDs with those TIDs when you find them.
This would allow pointer-swinging to one of the child tuples, after
which the old root could be removed.  
Implementing the replace these TIDs operation atomically would be 
simple, except for the new bitmap index am. It should be possible there 
as well, but if the old and new tid happen to be on a different bitmap 
page, it requires some care to avoid deadlocks.


Grouped Item Indexes cope with this easily also, yes?


Yes, as long as the old and the new tid point to the same page.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [PATCHES] plpgsql, return can contains any expression

2007-02-09 Thread Pavel Stehule

Pavel Stehule wrote:

 OK, where are we on this patch?

 without changes. This task have to do anybody who better know PostgreSQL
 cache system than me.

How about you submit a version without any caching, but which works
correctly; and we worry about optimizations later?



I can update it and send a simple version.

Regards
Pavel Stehule

_
Najdete si svou lasku a nove pratele na Match.com. http://www.msn.cz/




Re: [HACKERS] Variable length varlena headers redux

2007-02-09 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes:

 Greg Stark [EMAIL PROTECTED] writes:
  Bruce Momjian [EMAIL PROTECTED] writes:
  I know it is kind of odd to have a data type that is only used on disk,
  and not in memory, but I see this as a baby varlena type, used only to
  store and get varlena values using less disk space.
 
  I was leaning toward generating the short varlena headers primarily in
  heap_form*tuple and just having the datatype specific code generate 4-byte
  headers much as you describe.
 
 I thought we had a solution for all this, namely to make the short-form
 headers be essentially a TOAST-compressed representation.  The format
 with 4-byte headers is still legal but just not compressed.  Anyone who
 fails to detoast an input argument is already broken, so there's no code
 compatibility hit taken.

Uh. So I don't see how to make this work on a little-endian machine. If the
leading bits are 0 we don't know if they're toast flags or bits of the least
significant byte of a longer length.
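
A tiny standalone demonstration of the layout problem (nothing
PostgreSQL-specific; it just assumes a scheme in which the first header byte
carries the flag bits):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/*
 * If the flag bits live in the first header byte in memory, a little-endian
 * machine puts the least significant length bits in that same byte, so
 * small lengths and flag values collide.
 */
int
main(void)
{
    uint32_t        len = 0x40;     /* a plausible small datum length */
    unsigned char   first;

    memcpy(&first, &len, 1);        /* first byte as laid out in memory */

    /*
     * Little-endian: first == 0x40, which looks just like a byte whose high
     * bits were meant as flags.  Big-endian: first == 0x00, so the flag
     * bits stay clear for any ordinary length.
     */
    printf("first header byte = 0x%02x\n", first);
    return 0;
}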

If we store all lengths in network byte order that problem goes away but then
user code that does VARATT_SIZEP(datum) = len is incorrect.

If we declare in-memory format to be host byte order and on-disk format to be
network byte order then every single varlena datum needs to be copied when
heap_deform*tuple runs.

If we only do this for a new kind of varlena then only text/varchar/
char/numeric datums would need to be copied but that's still a lot.

-- 
greg




Re: [HACKERS] HOT for PostgreSQL 8.3

2007-02-09 Thread Hannu Krosing
On Fri, 2007-02-09 at 13:39, Heikki Linnakangas wrote:
 Tom Lane wrote:
  ISTM we could fix that by extending the index VACUUM interface to
  include two concepts: aside from remove these TIDs when you find them,
  there could be replace these TIDs with those TIDs when you find them.
  This would allow pointer-swinging to one of the child tuples, after
  which the old root could be removed.  
 
 Implementing the replace these TIDs operation atomically would be 
 simple, except for the new bitmap index am. It should be possible there 
 as well, but if the old and new tid happen to be on a different bitmap 
 page, it requires some care to avoid deadlocks.
 
 Also, we'd need more work mem for vacuum.

Why do we need to muck around with indexes at all ?

What are the problems with just shuffling the last (and only visible)
tuple to replace the HOT-chain root and be done with it?

Can there be some problems with seqscans moving one tuple at a time, or
with revisits of the same (by TID) tuple? If there are, can't these
be fixed by creative use of ctid chains from the original live one to
the new live one?

  This has got the same atomicity
  problem as for CREATE INDEX, because it's the same thing: you're
  de-HOT-ifying the child.
 
 Not exactly. De-HOT-ifying, or chilling, a child means inserting new 
 index entries. 

If we can just move the tuple inside the page we can avoid even that.

 But if we're just replacing the tids from the existing 
 index entries, it's ok if we crash after replacing some but not all of 
 them. The next vacuum would replace the rest of the pointers, and remove 
 the old root tuple.
 
-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] HOT for PostgreSQL 8.3

2007-02-09 Thread Tom Lane
Hannu Krosing [EMAIL PROTECTED] writes:
 What are the problems with just shuffling the last (and only visible)
 tuple to replace the HOT-hain root and be done with it ?

ctid stops being a reliable identifier.

regards, tom lane



Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Richard Troy

On Fri, 9 Feb 2007, Jan Wieck wrote:
  [ I wrote ]
  It'd be great if Jan considers the blending of replication;

 Please elaborate. I would really like to get all you can contribute.

Thanks Jan,

prefaced that I really haven't read everything you've written on this (or
what other people are doing, either), and that I've got a terrible flu
right now (fever, etc), I'll give it a go - hopefully it's actually
helpful. To wit:

In general terms, blending of replication [techniques] means to me that
one can have a single database instance serve as a master and as a slave
(to use only one set of terminology), and as a multi-master, too, all
simultaneously, letting the DBA / Architect choose which portions serve
which roles (purposes). All replication features would respect the
boundaries of such choices automatically, as it's all blended.

In more specific terms, and I'm just brainstorming in public here, perhaps
we can use the power of Schemas within a database to manage such
divisions; commands which pertain to replication can/would include a
schema specifier and elements within the schema can be replicated one way
or another, at the whim of the DBA / Architect. For backwards
compatibility, if a schema isn't specified, it indicates that the command
pertains to the entire database.

At the very least, a schema division strategy for replication leverages
an existing DB-component binding/dividing mechanism that most everyone is
familiar with. While there are/may be database-wide, nay, installation-
wide constructs as in your Commit Timestamp proposal, I don't see that
there's any conflict - at least, from what I understand of existing
systems and proposals to date.

HTH,
Richard

-- 
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
[EMAIL PROTECTED], http://ScienceTools.com/




Re: [HACKERS] HOT for PostgreSQL 8.3

2007-02-09 Thread Simon Riggs
On Fri, 2007-02-09 at 13:47 -0500, Tom Lane wrote:
 Hannu Krosing [EMAIL PROTECTED] writes:
  What are the problems with just shuffling the last (and only visible)
  tuple to replace the HOT-hain root and be done with it ?
 
 ctid stops being a reliable identifier.

Yes, that sums it up.

The issue can be overcome in the internals, but the only means of doing
so that seemed workable required changing the output of the CTID special
column. Some applications use the tid datatype directly to relocate rows
within a transaction using the tidscan. ODBC driver uses that concept to
implement Updateable Cursors from the client. We can change that, but
we'd break all the other ones we don't know about. 

This issue was one of the major contributing factors to the size of the
previous HOT prototype, making it more invasive than is really
desirable.

Pointer-swinging avoids those issues and seems workable, even if it is a
pain to have to visit the index during VACUUM. So changing CTID isn't a
bridge we really need to cross, for which I'm glad.

Just as a matter of record, the tuple identifier would need to include
(block, itemid, xmin, cmin) to make this idea work, with the itemid
being that of the root tuple and the xmin and cmin being the tuple in
the chain that is being referenced. This would've then allowed callers
to relocate a specific tuple, even when the update chain had changed
between block accesses. Anyway, glad we're not going there anymore.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Andrew Hammond
On Feb 7, 8:12 pm, [EMAIL PROTECTED] (Bruce Momjian) wrote:
 Jan Wieck wrote:
  On 2/7/2007 10:35 PM, Bruce Momjian wrote:
   I find the term logical proof of it's correctness too restrictive.  It
   sounds like some formal academic process that really doesn't work well
   for us.

  Thank you.

My intuition is that it might be possible to prove that _nothing_ can
provide guaranteed ordering when there is disconnected operation.
However, I think that the clock based ordering Jan has described could
provide _probable_ ordering under disconnected operation. I can see
three variables in the equation that would determine the probability
of correctness for the ordering.
1) clock drift rate between disconnected clusters
2) disconnection time
3) transaction rate on the tables, or even rows involved
There are probably more. I think that if Jan implements what he's
described then a very interesting follow-up would be to do the
statistical analysis necessary to quantify the risk of incorrect
ordering while disconnected. (I've got x ms/ms relative clock drift,
and y tps. How long can I run disconnected  before falling under
99.999% probability of correctly ordered transactions?)
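
As a purely made-up, back-of-the-envelope version of that calculation,
simplifying the probabilistic question down to "when does accumulated drift
exceed the gap between two conflicting commits" (all numbers below are
invented):

#include <stdio.h>

int
main(void)
{
    double  drift = 50e-6;  /* relative drift: 50 us per second, ~4.3 s/day */
    double  window = 3.0;   /* conflicting commits at least 3 seconds apart */
    double  horizon = window / drift;

    /* misordering first becomes possible once drift * time > window */
    printf("misordering possible after ~%.0f s (%.1f hours) disconnected\n",
           horizon, horizon / 3600.0);
    return 0;
}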

 No, I _now_ understand the use case, but when the patch was posted, the
 use case was missing.  I would like to see a repost with the patch, and
 a description of its use so we can all move forward on that.

An additional use case for an on-commit timestamp is in the analysis
of billing transactions in highly concurrent systems. For example,
imagine your billing period is monthly and you have transactions which
start before and end after the end-of-month. Having the on-commit
timestamp for these transactions may help when attempting to reconcile
between transactions and account activities.

Andrew




Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Andrew Dunstan

Richard Troy wrote:

In more specific terms, and I'm just brainstorming in public here, perhaps
we can use the power of Schemas within a database to manage such
divisions; commands which pertain to replication can/would include a
schema specifier and elements within the schema can be replicated one way
or another, at the whim of the DBA / Architect. For backwards
compatability, if a schema isn't specified, it indicates that command
pertains to the entire database.

  


I understand that you're just thinking aloud, but overloading namespaces 
in this way strikes me as awful. Applications and extensions, which are 
the things that have need of namespaces, should not have to care about 
replication. If we have to design them for replication we'll be on a 
fast track to nowhere IMNSHO.



cheers

andrew



Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Jan Wieck

On 2/9/2007 2:27 PM, Richard Troy wrote:

In general terms, blending of replication [techniques] means to me that
one can have a single database instance serve as a master and as a slave
(to use only one set of terminology), and as a multi-master, too, all
simultaneously, letting the DBA / Architect choose which portions serve
which roles (purposes). All replication features would respect the
boundaries of such choices automatically, as it's all blended.


That is specifically what the changes to pg_trigger and pg_rewrite take 
into account. However much you blend different techniques, a single 
transaction on one server will always fall into one of three categories. 
1) It could be the original operation done by the client application. B) 
It could be the actions performed by the replication engine to replay a 
remote transaction. And iii) it can be an administrative operation that 
requires not to be propagated at all.


No matter how many different models you have in parallel, one single 
transaction will be either a master, a slave or an isolated local thing. 
The proposed changes allow to tell the session which of these three 
roles it is playing and the triggers and rules can be configured to fire 
during master/local role, slave role, always or never. That 
functionality will work for master-slave as well as multi-master.
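
Sketched as plain data structures (the names are hypothetical; this is not
the actual pg_trigger/pg_rewrite change), the idea is roughly:

/* Illustrative only: the three session roles and a per-trigger setting. */
typedef enum SessionRole
{
    SESSION_ROLE_ORIGIN,        /* original client operation */
    SESSION_ROLE_REPLICA,       /* replaying a remote transaction */
    SESSION_ROLE_LOCAL          /* administrative, never propagated */
} SessionRole;

typedef enum TriggerFires
{
    TRIGGER_FIRES_ON_ORIGIN,    /* origin and local roles (the default) */
    TRIGGER_FIRES_ON_REPLICA,   /* only while replaying remote changes */
    TRIGGER_FIRES_ALWAYS,
    TRIGGER_FIRES_NEVER
} TriggerFires;

/* A trigger or rule fires only if its setting matches the session role. */
static int
trigger_should_fire(SessionRole role, TriggerFires fires)
{
    switch (fires)
    {
        case TRIGGER_FIRES_ALWAYS:
            return 1;
        case TRIGGER_FIRES_NEVER:
            return 0;
        case TRIGGER_FIRES_ON_ORIGIN:
            return role != SESSION_ROLE_REPLICA;
        case TRIGGER_FIRES_ON_REPLICA:
            return role == SESSION_ROLE_REPLICA;
    }
    return 0;
}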


Although my current plan isn't creating such a blended system, the 
proposed trigger and rule changes are designed to support exactly that 
in a 100% backward compatible way.



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Richard Troy

On Fri, 9 Feb 2007, Andrew Dunstan wrote:
 Richard Troy wrote:
  In more specific terms, and I'm just brainstorming in public here, perhaps
  we can use the power of Schemas within a database to manage such
  divisions; commands which pertain to replication can/would include a
  schema specifier and elements within the schema can be replicated one way
  or another, at the whim of the DBA / Architect. For backwards
  compatability, if a schema isn't specified, it indicates that command
  pertains to the entire database.

 I understand that you're just thinking aloud, but overloading namespaces
 in this way strikes me as awful. Applications and extensions, which are
 the things that have need of namespaces, should not have to care about
 replication. If we have to design them for replication we'll be on a
 fast track to nowhere IMNSHO.

Well, Andrew, replication _is_ an application. Or, you could think of
replication as an extension to an application. I was under the impression
that _users_ decide to put tables in schema spaces based upon _user_ need,
and that Postgres developers' use of them for other purposes was
encroaching on user choices, not the other way around. Either way,
claiming need like this strikes me as stuck-in-a-rut or dogmatic
thinking. Besides, don't we have schema nesting to help resolve any such
concern? And, what do you mean by design them for replication?

While I'm in no way stuck on blending replication strategies via schemas,
it does strike me as an appropriate concept and I'd prefer to have it
evaluated based on technical merit - possibly citing workarounds or
solutions to technical issues, which is what I gather has been the
tradition of this group: use case first, technical merit second... Other
alternatives, ISTM, will have virtually the same look/feel as a schema
from an external perspective, and the more I think of it the more I think
using schemas is a sound, clean approach. That it offends someone's sense
of aesthetics seems to me a poor rationale for not choosing it. Another question
might be: What's lacking in the implementation of schemas that makes this
a poor choice, and what could be done about it without much effort?

Regards,
Richard

-- 
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
[EMAIL PROTECTED], http://ScienceTools.com/




Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Richard Troy

On Fri, 9 Feb 2007, Jan Wieck wrote:

 No matter how many different models you have in parallel, one single
 transaction will be either a master, a slave or an isolated local thing.
 The proposed changes allow to tell the session which of these three
 roles it is playing and the triggers and rules can be configured to fire
 during master/local role, slave role, always or never. That
 functionality will work for master-slave as well as multi-master.

 Although my current plan isn't creating such a blended system, the
 proposed trigger and rule changes are designed to support exactly that
 in a 100% backward compatible way.

 Jan

Fantastic! ...At some point you'll be thinking of the management end -
turning it on or off, etc. That might be where these other points come
more into play.

Richard

-- 
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
[EMAIL PROTECTED], http://ScienceTools.com/




Re: [HACKERS] Re-ordering of OR conditions

2007-02-09 Thread Simon Riggs
On Fri, 2007-02-09 at 11:46 -0500, Tom Lane wrote:
 Jim Nasby [EMAIL PROTECTED] writes:
  IF I run the following with the a  2900 condition first, the more  
  expensive EXISTS only gets executed when needed, but if I change the  
  order of the OR's, the EXISTS is always executed. It would be good if  
  the optimizer could re-order the OR conditions based on estimated  
  cost (granted, this wouldn't work very well if you've got functions  
  in the OR, but it'd still be useful):
 
 I looked at this for a bit.  It's in principle do-able but I'm not
 sure it's a good idea.  The problem is that while AND'ed condition
 lists are usually fairly short and hence cheap to sort, OR'ed condition
 lists are not infrequently very long --- nobody blinks an eye at
 hundreds of items in an IN-list for instance.  I'm afraid we'd waste
 a lot more cycles sorting than we could hope to regain.

Seems like the planner could decide ahead of time whether sorting the
conditions at execution time was likely to be effective or not. Perhaps
limiting it to at most 5 conditions, where at least one of those was a
function or a join condition? That would be a fairly cheap test at
planning time, but potentially a good win at execution time. 

OR'ed conditions are common when the schema uses complex
sub-classing. Now that we have function costs, it seems more likely this idea
would get used in practice.

Anyway, not necessarily for you to do, but sounds like a useful idea all
the same.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Jan Wieck

On 2/9/2007 2:19 PM, Andrew Hammond wrote:

On Feb 7, 8:12 pm, [EMAIL PROTECTED] (Bruce Momjian) wrote:

Jan Wieck wrote:
 On 2/7/2007 10:35 PM, Bruce Momjian wrote:
  I find the term logical proof of it's correctness too restrictive.  It
  sounds like some formal academic process that really doesn't work well
  for us.

 Thank you.


My intuition is that it might be possible to prove that _nothing_ can
provide guaranteed ordering when there is disconnected operation.


As a matter of physics, for two events happening outside of the event 
horizon of each other, the question which happened first is pointless.



However, I think that the clock based ordering Jan has described could
provide _probable_ ordering under disconnected operation. I can see
three variables in the equation that would determine the probability
of correctness for the ordering.


That precisely is the intended functionality. And I can exactly describe 
when two conflicting actions will result in the wrong row to persist. 
This will happen when the second update to the logically same row will 
be performed on the server with the Lamport timestamp lagging behind by 
more than the time between the two conflicting commits. Example: User 
fills out a form, submits, hits back button, corrects input and submits 
again within 3 seconds. Load balancing sends both requests to different 
servers and the first server is 3.0001 seconds ahead ... the user's typo 
will be the winner.


My Lamport timestamp conflict resolution will not be able to solve this 
problem. However, when this happens, one thing is guaranteed. The update 
from the second server, arriving on the first for replication will be 
ignored because a locally generated row is newer. This fact can be used 
as an indicator that there is a possible conflict that was resolved 
using the wrong data (business process wise). All nodes in the cluster 
will end up using the same wrong row, so at least they are consistently 
wrong. Nevertheless, being able to identify possible problem cases this 
way will allow to initiate further action including but not limited to 
human intervention.
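
A minimal sketch of that default last-update-wins rule, assuming every
replicated row change carries a commit timestamp and an originating node id
(illustrative names only, not the actual patch):

#include <stdint.h>

typedef struct RowStamp
{
    int64_t     commit_ts;      /* Lamport-style commit timestamp */
    int32_t     origin_node;    /* node the change originated on */
} RowStamp;

/* The node id breaks exact timestamp ties so every node picks the same winner. */
static int
rowstamp_cmp(const RowStamp *a, const RowStamp *b)
{
    if (a->commit_ts != b->commit_ts)
        return (a->commit_ts < b->commit_ts) ? -1 : 1;
    if (a->origin_node != b->origin_node)
        return (a->origin_node < b->origin_node) ? -1 : 1;
    return 0;
}

/*
 * Apply an incoming remote change only if it is newer than the local row;
 * a rejected change is the "possible conflict" indicator mentioned above
 * and could be logged for later human follow-up.
 */
static int
apply_remote_change(const RowStamp *local, const RowStamp *incoming)
{
    return rowstamp_cmp(incoming, local) > 0;
}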


If this is not an acceptable risk for the application, other resolution 
methods will be needed. But I think in many cases, this form of default 
resolution will be good enough.



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Andrew Dunstan

Richard Troy wrote:

On Fri, 9 Feb 2007, Andrew Dunstan wrote:
  

Richard Troy wrote:


In more specific terms, and I'm just brainstorming in public here, perhaps
we can use the power of Schemas within a database to manage such
divisions; commands which pertain to replication can/would include a
schema specifier and elements within the schema can be replicated one way
or another, at the whim of the DBA / Architect. For backwards
compatability, if a schema isn't specified, it indicates that command
pertains to the entire database.
  

I understand that you're just thinking aloud, but overloading namespaces
in this way strikes me as awful. Applications and extensions, which are
the things that have need of namespaces, should not have to care about
replication. If we have to design them for replication we'll be on a
fast track to nowhere IMNSHO.



Well, Andrew, replication _is_ an application. Or, you could think of
replication as an extension to an application. 


No, I don't think of it as either. It's a utility, more an extension of 
the DBMS than of the application. You don't replicate for the sake of 
replicating.



I was under the impression
that_users_ decide to put tables in schema spaces based upon _user_ need,
and that Postgres developer's use of them for other purposes was
incroaching on user choices, not the other way around.


That's exactly what you would be doing with this proposal, encroaching 
on what I regard as user space.




 Either way,
claiming need  like this strikes me as stuck-in-a-rut or dogmatic
thinking. Besides, don't we have schema nesting to help resolve any such
care? 


No. We do now have schema nesting, for this or any other purpose. Where 
did you get that idea? If we did I would not be so resistant to using 
them for this purpose, but as it is, if you hijack schemas for 
replication segregation you will detract from their more obvious use in 
name segregation.


cheers

andrew





Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Andrew Dunstan

I wrote:

We do now have schema nesting, for this or any other purpose.


s/now/not/ (of course)

cheers

andrew




Re: [HACKERS] Variable length varlena headers redux

2007-02-09 Thread Bruce Momjian
Greg Stark wrote:
 Tom Lane [EMAIL PROTECTED] writes:
 
  Greg Stark [EMAIL PROTECTED] writes:
   Bruce Momjian [EMAIL PROTECTED] writes:
   I know it is kind of odd to have a data type that is only used on disk,
   and not in memory, but I see this as a baby varlena type, used only to
   store and get varlena values using less disk space.
  
   I was leaning toward generating the short varlena headers primarily in
   heap_form*tuple and just having the datatype specific code generate 4-byte
   headers much as you describe.
  
  I thought we had a solution for all this, namely to make the short-form
  headers be essentially a TOAST-compressed representation.  The format
  with 4-byte headers is still legal but just not compressed.  Anyone who
  fails to detoast an input argument is already broken, so there's no code
  compatibility hit taken.
 
 Uh. So I don't see how to make this work on a little-endian machine. If the
 leading its are 0 we don't know if they're toast flags or bits on the least
 significant byte of a longer length.
 
 If we store all lengths in network byte order that problem goes away but then
 user code that does VARATT_SIZEP(datum) = len is incorrect.
 
 If we declare in-memory format to be host byte order and on-disk format to be
 network byte order then every single varlena datum needs to be copied when
 heap_deform*tuple runs.
 
 If we only do this for a new kind of varlena then only text/varchar/
 char/numeric datums would need to be copied but that's still a lot.

I wonder if we need to reorder the TOAST structure to have the bits we
need at the start of the structure so we can be sure they are first.

For example, what if we split varattrib.va_header, which is int32 now,
into four 'char' fields, and just reassemble it in the toast code.  That
would be pretty localized.

I had forgotten about hooking into the TOAST system, but since we are
going to be expanding the headers of these types when they get into
memory, it does make sense.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Jan Wieck

On 2/9/2007 3:25 PM, Andrew Dunstan wrote:

Richard Troy wrote:

On Fri, 9 Feb 2007, Andrew Dunstan wrote:
  

Richard Troy wrote:


In more specific terms, and I'm just brainstorming in public here, perhaps
we can use the power of Schemas within a database to manage such
divisions; commands which pertain to replication can/would include a
schema specifier and elements within the schema can be replicated one way
or another, at the whim of the DBA / Architect. For backwards
compatability, if a schema isn't specified, it indicates that command
pertains to the entire database.
  

I understand that you're just thinking aloud, but overloading namespaces
in this way strikes me as awful. Applications and extensions, which are
the things that have need of namespaces, should not have to care about
replication. If we have to design them for replication we'll be on a
fast track to nowhere IMNSHO.



Well, Andrew, replication _is_ an application. Or, you could think of
replication as an extension to an application. 


No, I don't think of it as either. It's a utility, more an extension of 
the DBMS than of the application. You don't replicate for the sake of 
replicating.



I was under the impression
that_users_ decide to put tables in schema spaces based upon _user_ need,
and that Postgres developer's use of them for other purposes was
incroaching on user choices, not the other way around.


That's exactly what you would be doing with this proposal, encroaching 
on what I regard as user space.


I'd never use a schema for that. Look at Slony-I. It uses a user 
selectable schema for the stuff it needs to store in the database, to 
keep Slony objects separate from user objects. But it organizes the user 
tables in what is called sets. A set can be any combination of tables 
and sequences from any number of namespaces. If I would do it again, I 
would use names instead of numbers to identify sets, and I would allow 
for summary sets containing groups of simple sets.


On the other issue, replication is part of the overall infrastructure 
and thereby part of the solution to a business problem. Like the 
hardware, database and application it solves a piece of the puzzle, none 
of the other is particularly good at. It will perform better or 
eventually not at all, depending on how much consideration the 
distributed nature of the business model was given when the application 
was designed. Applications that happened by accident rather than being 
designed usually don't allow any kind of distribution.


And yes, there are people who replicate for the sake of it. It is the 
kind of people who ask for sync multi-master no matter what their actual 
problem might be, because it sounds most prestigious.



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] HOT for PostgreSQL 8.3

2007-02-09 Thread Simon Riggs
On Fri, 2007-02-09 at 13:16 +0530, Pavan Deolasee wrote:

 The second problem of concurrent index scans seems a bit more complex.
 We need a mechanism so that no tuples are missed or tuples are 
 not returned twice. Since CHILLing of a tuple adds a new access path
 to the tuple from the index, a concurrent index scan may return a
 tuple twice.
 
 How about grabbing a AccessExclusiveLock during CHILLing
 operation ? This would prevent any concurrent index scans. Since
 CHILLing of a large table can take a long time, the operation can be
 spread across time with periodic acquire/release of the lock. This
 would prevent starvation of other backends. Since CHILLing is required
 only for CREATE INDEX and stub-cleanup, I am assuming that its ok for
 it to be lazy in nature.

We've just spoken about this, so just wanted to add those thoughts here.

A pointer-swing operation will begin when we see a tuple that is both
status of HEAPTUPLE_DEAD and is marked HEAP_UPDATE_ROOT. Perhaps that
requires a new status from HeapTupleSatisfiesVacuum()? We chill all
tuples in the chain, up to the new root. We mark those tuples, so that
HeapTupleSatisfiesVacuum() will describe them as
HEAPTUPLE_RECENTLY_DEAD. So the current Vacuum won't immediately remove
them, but they'll go away in the future as part of an on-demand block
vacuum or Vacuum. That's similar to the way we handle HALF_DEAD index
pages.

The index scans are page-at-a-time, so when we pointer-swing from the
root tuple to one of the HOT tuples we'll be OK. We're switching a
specific index tuple, so there's no multi-page locking on the index to
consider. 

Right now, I wouldn't want to assume that the way tuples are marked
prior to pointer-swinging is exactly the same as the chilling required
by CREATE INDEX: CHILL_IN_PROGRESS. It may well be, but I'm wary that we
assume they are exactly the same and introduce a subtle bug.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] HOT for PostgreSQL 8.3

2007-02-09 Thread Bruce Momjian
Tom Lane wrote:
  Removing the root tuple will require a VACUUM *FULL*.
 
 That seems unacceptable ... it won't take too long for your table to
 fill up with stubs, and we don't want to return to the bad old days
 when periodic VACUUM FULL was unavoidable.
 
 ISTM we could fix that by extending the index VACUUM interface to
 include two concepts: aside from remove these TIDs when you find them,
 there could be replace these TIDs with those TIDs when you find them.
 This would allow pointer-swinging to one of the child tuples, after
 which the old root could be removed.  This has got the same atomicity
 problem as for CREATE INDEX, because it's the same thing: you're
 de-HOT-ifying the child.  So if you can solve the former, I think you
 can make this work too.

I need clarification here.  Is removing a dead heap tuple always going to
require an index scan, or was this just for chilling a row (adding an
index)?

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Variable length varlena headers redux

2007-02-09 Thread Gregory Stark

Bruce Momjian [EMAIL PROTECTED] writes:

 Uh. So I don't see how to make this work on a little-endian machine. If the
 leading its are 0 we don't know if they're toast flags or bits on the least
 significant byte of a longer length.
 ...
 I had forgotten about hooking into the TOAST system, but since we are
 going to be expanding the headers of these types when they get into
 memory, it does make sense.

Ok, I guess this can work if we guarantee that in-memory datums always have
4-byte headers. That means that heap_deform*tuple always copies the datum if
it's this type of datum.

That means we never have pointers to shortvarlena datums inside tuples. I'm
not sure if there are parts of the system that assume that the datums they get
out of heap_deform*tuple are pointers into the tuple or not. I haven't come
across any in my travels thus far.

That seems like an awful lot of copying and pallocs that aren't there
currently though. And it'll make us reluctant to change over frequently used
data types like text -- which are precisely the ones that would gain us the
most.

It seems to me that it might be better to change to storing varlena lengths in
network byte order instead. That way we can dedicate the leading bits to toast
flags and read more bytes as necessary.

I think the way to do this would be to throw out the VARATT_SIZEP macro and
replace it with VARATT_SET_SIZE(datum,size). VARSIZE would just call ntohl (or
ntohs if the leading bits on the first byte indicated...)
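
To make that a bit more concrete, here is a rough sketch of what such macros
could look like. The two-flag-bit layout, the payload-only size, and the exact
helper shapes below are just assumptions for illustration (and only the 4-byte
form is shown), not a worked-out format:

/*
 * Sketch only: a 4-byte varlena length word kept in network byte order,
 * with the two high bits reserved for TOAST flags and the stored size
 * covering just the data payload (no VARHDRSZ offset).
 */
#include <arpa/inet.h>      /* ntohl/htonl */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct varlena
{
    uint32_t    vl_len;     /* big-endian on every platform */
    char        vl_dat[1];  /* payload follows */
} varlena;

/* Store the payload size; flag bits assumed clear for a plain value. */
#define VARATT_SET_SIZE(PTR, SIZE) \
    (((varlena *) (PTR))->vl_len = htonl((uint32_t) (SIZE)))

/* Read the payload size back, masking off the two assumed flag bits. */
#define VARSIZE(PTR) \
    (ntohl(((varlena *) (PTR))->vl_len) & 0x3FFFFFFF)

#define VARDATA(PTR) (((varlena *) (PTR))->vl_dat)

int
main(void)
{
    char       *d = malloc(64);     /* malloc gives suitably aligned storage */

    VARATT_SET_SIZE(d, 5);
    memcpy(VARDATA(d), "hello", 5);
    printf("payload size = %u\n", (unsigned) VARSIZE(d));   /* prints 5 */
    free(d);
    return 0;
}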

That does mean touching every piece of data type code. And invalidating every
piece of user code. :( At least it's fairly mechanical. And it has the
advantage of not being at all fragile -- unfixed code won't even compile.

While we're at it I would suggest taking out the VARHDRSZ offset. Just store
the size of the data payload. The constant VARHDRSZ offset no longer makes
sense since it won't actually be the size of the varlena header anyways.
And predicting the actual size of the varlena header will be annoying and
bug-prone since it depends on the resulting value you calculate.

(Incidentally, this would actually make EnterpriseDB somewhat sad since we
want pg_migrator to work for 8.3. But it wouldn't be out of the realm of
possibility to go through the database and switch varlena headers to network
byte order. There's no need to compress them, just leave the 4-byte format in
place with the bytes swapped around.)

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] HOT for PostgreSQL 8.3

2007-02-09 Thread Simon Riggs
On Fri, 2007-02-09 at 18:10 -0500, Bruce Momjian wrote:
 Tom Lane wrote:
   Removing the root tuple will require a VACUUM *FULL*.
  
  That seems unacceptable ... it won't take too long for your table to
  fill up with stubs, and we don't want to return to the bad old days
  when periodic VACUUM FULL was unavoidable.
  
  ISTM we could fix that by extending the index VACUUM interface to
  include two concepts: aside from "remove these TIDs when you find them",
  there could be "replace these TIDs with those TIDs when you find them".
  This would allow pointer-swinging to one of the child tuples, after
  which the old root could be removed.  This has got the same atomicity
  problem as for CREATE INDEX, because it's the same thing: you're
  de-HOT-ifying the child.  So if you can solve the former, I think you
  can make this work too.
 
 I need clarification here.  Is removing a dead heap tuple always going to
 require an index scan, or was this just for chilling a row (adding an
 index)?

We can remove a tuple marked HEAP_ONLY_TUPLE when its status is
HEAPTUPLE_DEAD. The HEAP_UPDATE_ROOT tuple can be reduced to a
TupleStub, but not removed. Multiple tuples in the chain can be removed,
though the HEAP_UPDATE_ROOT's t_ctid must be modified to point to the
first non-removed tuple in the chain. All of that can be done when we
hold a CleanupLock on the block, without reference to the indexes; this
can be performed on-demand, when we attempt an UPDATE. This is similar
to what already happens prior to a btree split operation. (This could
also be performed by bgwriter, but that isn't proposed at this time
because the buffer transit time through the cache is often not long
enough to allow tuples to die and get benefit from space reuse).
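
To sketch the chain manipulation in toy form (this is not the real heap code;
the struct, flags, and names below are made up purely for illustration):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct ToyTuple
{
    int              id;         /* stand-in for a TID */
    bool             heap_only;  /* HEAP_ONLY_TUPLE: no index entries point here */
    bool             dead;       /* visibility says HEAPTUPLE_DEAD */
    bool             stub;       /* root reduced to a TupleStub */
    struct ToyTuple *next;       /* stand-in for t_ctid chaining */
} ToyTuple;

/* Prune one update chain; assumes the caller holds the page CleanupLock. */
static void
prune_chain(ToyTuple *root)
{
    ToyTuple   *t = root->next;

    /* Unlink dead HEAP_ONLY members so their space can be reused. */
    while (t != NULL && t->heap_only && t->dead)
    {
        ToyTuple   *victim = t;

        t = t->next;
        free(victim);
    }
    root->next = t;              /* swing the root's t_ctid forward */

    if (root->dead)
        root->stub = true;       /* the root itself only shrinks to a stub */
}

int
main(void)
{
    ToyTuple   *v3 = calloc(1, sizeof(ToyTuple));
    ToyTuple   *v2 = calloc(1, sizeof(ToyTuple));
    ToyTuple    root = {1, false, true, false, v2};

    v2->id = 2; v2->heap_only = true; v2->dead = true; v2->next = v3;
    v3->id = 3; v3->heap_only = true;

    prune_chain(&root);
    printf("root %s; now points at tuple %d\n",
           root.stub ? "is a stub" : "is intact",
           root.next ? root.next->id : -1);
    free(root.next);
    return 0;
}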

TupleStubs can be marked for removal by a pointer-swing operation during
normal VACUUM, i.e. it will require touching the indexes.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Variable length varlena headers redux

2007-02-09 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 That seems like an awful lot of copying and pallocs that aren't there
 currently though. And it'll make us reluctant to change over frequently used
 data types like text -- which are precisely the ones that would gain us the
 most.

 It seems to me that it might be better to change to storing varlena lengths in
 network byte order instead. That way we can dedicate the leading bits to toast
 flags and read more bytes as necessary.

This'll add its own overhead ... but probably less than pallocs and
data-copying would.  And I agree we can find (pretty much) all the
places that need changing by the expedient of deliberately renaming
the macros and struct fields.

One problem I foresee is that I think you are about to propose that
VARDATA depend on the length already having been inserted, which it
does not now; and simple renamings won't detect ordering errors for
that.  Also I believe there are places that over-allocate memory,
fill in the data, and only then set the length; something you will
not easily be able to change.  It might work if we assume that
*creation* of a varlena value always produces the 4-byte-header form
and only reading of a value that might be on disk needs to cope with
the short-header forms.  However this seems to require two different
forms of VARDATA depending on whether one is preparing or reading a
value.  Ugh.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] HOT for PostgreSQL 8.3

2007-02-09 Thread Bruce Momjian
Simon Riggs wrote:
  I need clarification here.  Is removing a dead heap tuple always going to
  require an index scan, or was this just for chilling a row (adding an
  index)?
 
 We can remove a tuple marked HEAP_ONLY_TUPLE when its status is
 HEAPTUPLE_DEAD. The HEAP_UPDATE_ROOT tuple can be reduced to a
 TupleStub, but not removed. Multiple tuples in the chain can be removed,
 though the HEAP_UPDATE_ROOT's t_ctid must be modified to point to the
 first non-removed tuple in the chain. All of that can be done when we
 hold a CleanupLock on the block, without reference to the indexes; this
 can be performed on-demand, when we attempt an UPDATE. This is similar
 to what already happens prior to a btree split operation. (This could
 also be performed by bgwriter, but that isn't proposed at this time
 because the buffer transit time through the cache is often not long
 enough to allow tuples to die and get benefit from space reuse).
 
 TupleStubs can be marked for removal by a pointer-swing operation during
 normal VACUUM, i.e. it will require touching the indexes.

OK, that sounds like a good plan.  Thanks.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Variable length varlena headers redux

2007-02-09 Thread Bruce Momjian
Tom Lane wrote:
 Gregory Stark [EMAIL PROTECTED] writes:
  That seems like an awful lot of copying and pallocs that aren't there
  currently though. And it'll make us reluctant to change over frequently used
  data types like text -- which are precisely the ones that would gain us the
  most.
 
  It seems to me that it might be better to change to storing varlena lengths
  in network byte order instead. That way we can dedicate the leading bits to
  toast flags and read more bytes as necessary.
 
 This'll add its own overhead ... but probably less than pallocs and
 data-copying would.  And I agree we can find (pretty much) all the
 places that need changing by the expedient of deliberately renaming
 the macros and struct fields.

I think we should go with the pallocs and see how it performs.  That is
certainly going to be easier to do, and we can test it pretty easily.

One palloc optimization idea would be to split out the representation so
the length is stored separately from the data in memory, and we could use
an int32 for the length, and point to the shared buffer for the data. 
However I don't think our macros can handle that so it might be a
non-starter.
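
Just to illustrate what I mean by splitting the representation (purely a
sketch; the struct and names are made up, and as noted the existing macros
probably couldn't cope with it):

#include <stdint.h>

/*
 * Sketch: keep the length as a plain int32 in local memory and point at
 * the payload bytes still sitting in the shared buffer, so no per-row
 * palloc/copy is needed just to normalize a short header.
 */
typedef struct SplitDatum
{
    int32_t     len;        /* payload length, always full width in memory */
    const char *data;       /* points into the shared buffer */
} SplitDatum;

static inline SplitDatum
make_split_datum(const char *payload, int32_t len)
{
    SplitDatum  d = {len, payload};

    return d;
}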

However, I think we should find out if the palloc is a problem before
avoiding it.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[HACKERS] Foreign keys for non-default datatypes, redux

2007-02-09 Thread Tom Lane
Almost a year ago, we talked about the problem that referential
integrity should be selecting comparison operators on the basis
of b-tree index opclasses, instead of assuming that the appropriate
operator is always named =:
http://archives.postgresql.org/pgsql-hackers/2006-02/msg00960.php
http://archives.postgresql.org/pgsql-hackers/2006-03/msg00161.php

I'm about to go off and implement that at last.  To refresh folks'
memory, what I think we agreed to was that at the time of definition
of a foreign-key constraint, we should identify the specific equality
operator to be used for (each column of) the constraint.  The method
for doing this is to be:

* First, identify the unique index that is relied on to enforce
uniqueness of the PK entries (we do this already of course).

* Look to see if there is an equality operator in this index's
opfamily accepting exactly the PK and FK data types (ie, PK = FK).
If so, use that.

* Else, check to see if there is an implicit promotion from the FK
datatype to the PK datatype.  If so, use the equality operator
PK = PK, which must exist since the opfamily supports an index
on the PK datatype.

* Else fail (this means that the present warning about inefficient
foreign keys will become a hard error).

The good thing about this proposal is that we know that we have
identified an operator whose notion of equality is compatible with
the notion of equality being enforced by the unique index, and thus
a lot of potential gotchas with nondefault opclasses go away.
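
To spell the selection rule out as a toy model (the lookup table, type names,
and promotion rule below are invented for illustration --- this is not catalog
code):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct EqOp
{
    const char *opfamily;
    const char *lefttype;       /* PK side */
    const char *righttype;      /* FK side */
    const char *opname;
} EqOp;

/* Invented equality members of an "integer_ops" btree opfamily. */
static const EqOp eq_ops[] = {
    {"integer_ops", "int4", "int4", "int4eq"},
    {"integer_ops", "int4", "int2", "int42eq"},
    {"integer_ops", "int8", "int8", "int8eq"},
};

/* Invented stand-in for "is there an implicit cast from 'from' to 'to'?" */
static bool
implicitly_promotes(const char *from, const char *to)
{
    /* invented promotion lattice: int2 -> int4 -> int8 */
    if (strcmp(from, "int2") == 0)
        return strcmp(to, "int4") == 0 || strcmp(to, "int8") == 0;
    if (strcmp(from, "int4") == 0)
        return strcmp(to, "int8") == 0;
    return false;
}

static const char *
select_ri_equality_op(const char *opfamily,
                      const char *pktype, const char *fktype)
{
    size_t      i;

    /* 1. Exact PK = FK equality operator within the index's opfamily. */
    for (i = 0; i < sizeof(eq_ops) / sizeof(eq_ops[0]); i++)
        if (strcmp(eq_ops[i].opfamily, opfamily) == 0 &&
            strcmp(eq_ops[i].lefttype, pktype) == 0 &&
            strcmp(eq_ops[i].righttype, fktype) == 0)
            return eq_ops[i].opname;

    /* 2. Else use PK = PK if the FK type implicitly promotes to the PK type. */
    if (strcmp(pktype, fktype) != 0 && implicitly_promotes(fktype, pktype))
        return select_ri_equality_op(opfamily, pktype, pktype);

    /* 3. Else fail (a hard error rather than today's warning). */
    return NULL;
}

int
main(void)
{
    const char *op;

    op = select_ri_equality_op("integer_ops", "int8", "int4");
    printf("int8 PK, int4 FK -> %s\n", op ? op : "error");  /* rule 2: int8eq */

    op = select_ri_equality_op("integer_ops", "int4", "text");
    printf("int4 PK, text FK -> %s\n", op ? op : "error");  /* rule 3: fail */
    return 0;
}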

My intention is that we'd record pg_depend entries making the RI
constraint dependent on not only the index, but the specific operators
to use.  This would not have been too critical a year ago given that
opclasses were effectively immutable; but in the current opfamily design
it's entirely likely that we'd select cross-type equality operators that
are considered loose and potentially droppable from the opfamily.
So we need dependencies to prevent the operators from disappearing out
from under us.  (Come to think of it, we might want to record
dependencies on the casts too, if we're using implicit casts?)

What I'm thinking about right now is that the ri_triggers.c routines
need to be able to find out which operators they're supposed to use,
so that they can construct the RI queries correctly.  We could possibly
have them dredge the information out of pg_depend, but this seems
inefficient, and I'm not entirely sure how one would match up operators
with columns given only the pg_depend entries.  What I'd like to propose
instead is:

* Add an oid[] column to pg_constraint that stores the equality operator
OIDs for a foreign-key constraint, in the same column order as conkey[]
and confkey[].

* Add an OID column to pg_trigger giving the OID of the constraint
owning the trigger (or 0 if none).  Add this information to struct
Trigger as well, so that it gets passed to trigger functions.

Given the pg_constraint OID, the RI triggers could fetch the constraint
row and look at conkey[], confkey[], and the new operator oid[] array
to determine what they need to know.

This would actually mean that they don't need pg_trigger.tgargs at all.
I am pretty strongly tempted to stop storing anything in tgargs for RI
triggers --- it's ugly, and updating the info during RENAME commands
is a pain in the rear.  On the other hand removing it might break
client-side code that expects to look at tgargs to learn about FK
constraints.  I'd personally think that pg_constraint is a lot easier to
work with, but there might be some code out there left over from way
back before pg_constraint existed --- anyone know of any such issue?

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Variable length varlena headers redux

2007-02-09 Thread Bruce Momjian
Bruce Momjian wrote:
 Tom Lane wrote:
  Gregory Stark [EMAIL PROTECTED] writes:
   That seems like an awful lot of copying and pallocs that aren't there
   currently though. And it'll make us reluctant to change over frequently
   used data types like text -- which are precisely the ones that would gain
   us the most.
  
   It seems to me that it might be better to change to storing varlena
   lengths in network byte order instead. That way we can dedicate the
   leading bits to toast flags and read more bytes as necessary.
  
  This'll add its own overhead ... but probably less than pallocs and
  data-copying would.  And I agree we can find (pretty much) all the
  places that need changing by the expedient of deliberately renaming
  the macros and struct fields.
 
 I think we should go with the pallocs and see how it performs.  That is
 certainly going to be easier to do, and we can test it pretty easily.
 
 One palloc optimization idea would be to split out the representation so
 the length is stored seprately from the data in memory, and we could use
 an int32 for the length, and point to the shared buffer for the data. 
 However I don't think our macros can handle that so it might be a
 non-starter.
 
 However, I think we should find out of the palloc is a problem before
 avoiding it.

Another idea about reducing palloc is that we know every short column is
at most 128 + 4 = 132 bytes, so we could allocate a 132-byte buffer for
every short column in the scan, and just re-use the buffer for every
row.
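
As a sketch of what I mean (illustrative only; this assumes the
expand-in-memory approach and a made-up 1-byte short-header layout):

#include <stdint.h>
#include <string.h>

#define SHORT_COLUMN_MAX 132    /* the "128 + 4 = 132" bound from above */

/* One scratch buffer per short column, allocated once per scan. */
typedef struct ScanColumnBuf
{
    char        buf[SHORT_COLUMN_MAX];
} ScanColumnBuf;

/*
 * Expand a short on-disk value into the scratch buffer as a 4-byte-header
 * datum, overwriting whatever the previous row left there; no palloc per
 * row.  The 1-byte header layout assumed here is only illustrative.
 */
static char *
expand_short_datum(ScanColumnBuf *scratch, const unsigned char *disk)
{
    uint32_t    len = disk[0] & 0x7F;   /* payload length from the short header */

    memcpy(scratch->buf, &len, sizeof(len));
    memcpy(scratch->buf + sizeof(len), disk + 1, len);
    return scratch->buf;
}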

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[HACKERS] RFC: Temporal Extensions for PostgreSQL

2007-02-09 Thread Warren Turkal
Temporal Extensions for PostgreSQL
by: Warren Turkal

I would like to see a comprehensive solution for time-varying (temporal) tables
in PostgreSQL. I specifically want to see support for valid-time,
transaction-time, and bitemporal (valid-time and transaction-time) tables. I
will be deferring the descriptions of much of the functionality to Dr. Richard
T. Snodgrass's _Developing Time-Oriented Database Applications in SQL_ at [1].
The mangled pages 30-31 are at [2].


a) Functionality

Dr. Richard T. Snodgrass has worked on defining the semantics of temporal
databases very thoroughly in several writings. He was also involved in an
unsuccessful effort to standardize temporal extensions to SQL. I believe his
book does a good job of presenting the semantics of temporal databases and
describing extensions to SQL that make such data much more natural to work
with.


b) How current solutions fall flat

Current solutions fall flat due to the extreme complexity of implementing
valid-time and transaction-time semantics on tables by adding columns to track
all of the data. Please see chapter 11 of [1] for a more complete description
of this complexity. Chapter 12 of [1] goes on to lay out new SQL syntax that
makes dealing with data of this nature much more natural.


c) Examples

--create normal table
CREATE TABLE products
   ( id SERIAL PRIMARY KEY
   , description TEXT
   );

-- Add valid-time support to the table with granularity of timestamp.
  ALTER TABLE products
ADD VALIDTIME PERIOD(TIMESTAMP WITH TIMEZONE);

-- Insert row valid from 2006-01-01 to just before 2007-01-01
VALIDTIME PERIOD '[2006-01-01 - 2007-01-01)'
 INSERT INTO products
   ( description
   )
  VALUES
   ( 'red ball'
   );

-- Insert row valid from 2007-01-01 to just before 2008-01-01
-- Should be smart enough to realize the id=777 does not conflict in this time
--  of validity.
VALIDTIME PERIOD '[2007-01-01 - 2008-01-01)'
 INSERT INTO products
   ( id
   , description
   )
  VALUES
   ( 777
   , 'blue ball'
   );

-- Select history of products with id=777
VALIDTIME
   SELECT *
 FROM products
WHERE id=777;

 id  | description | valid_period
-----+-------------+---------------------------
 777 | red ball    | [2006-01-01 - 2007-01-01)
 777 | blue ball   | [2007-01-01 - 2008-01-01)

-- Select current products with id=777
-- The date when query was run was 2007-02-10.
SELECT *
  FROM products
 WHERE id=777;

 id  | description 
-----+-------------
 777 | blue ball

There are many more details in chapter 12 of [1].


d) New stuff (dependencies, indices, syntax, libraries)

One of the base-level additions is the PERIOD datatype. I think that
implementing temporal support is reliant on developing such a type. The
description of this datatype is laid out in chapter 4 of [1], and the SQL
syntax is present in chapter 12 of [1]. I see this as the first piece that
needs to be implemented in order to take steps toward a DBMS that supports
full temporal capabilities. I think that PERIOD can largely reuse the datetime
functionality for parsing of literals and for comparisons. The RTREE seems to
cover the indexing needs of the PERIOD type nicely. The syntax of the parser
will have to be extended to handle PERIOD literals and the constructor. I
don't believe any additional libraries will be required.
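
As a purely illustrative sketch (the in-memory layout and names here are my
assumptions, not part of the proposal or of [1]), a PERIOD value and its basic
comparisons could look something like:

#include <stdbool.h>
#include <stdint.h>

typedef int64_t Timestamp;      /* stand-in for a timestamp-with-time-zone value */

/* A half-open interval [start, end), matching the '[a - b)' literals above. */
typedef struct Period
{
    Timestamp   start;          /* inclusive lower bound */
    Timestamp   end;            /* exclusive upper bound */
} Period;

/* Two periods overlap iff each starts before the other ends. */
static bool
period_overlaps(Period a, Period b)
{
    return a.start < b.end && b.start < a.end;
}

/* Is period a entirely contained in period b? */
static bool
period_contains(Period b, Period a)
{
    return b.start <= a.start && a.end <= b.end;
}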

There are also extensions to the syntax of table creation, table altering,
querying, inserting, and updating on temporal tables. These are all discussed
in some detail in chapter 12 of [1]. I don't think that any of these changes
will require new libraries.

The semantics of temporal tables and of querying them could have a dramatic
effect on how things like primary keys and unique constraints work. I would
like to get some comments about this from the community.


e) See Also

Additional resources can be found at Dr. Richard T. Snodgrass's website at [3],
including the SQL valid-time table support spec at [4] and the SQL
transaction-time table support spec at [5].

Thoughts? Questions? Comments?

[1]http://www.cs.arizona.edu/~rts/tdbbook.pdf
[2]http://www.cs.arizona.edu/~rts/pp30-31.pdf
[3]http://www.cs.arizone.edu/~rts/
[4]ftp://ftp.cs.arizona.edu/tsql/tsql2/sql3/mad146.pdf
[5]ftp://ftp.cs.arizona.edu/tsql/tsql2/sql3/mad147.pdf

Thanks,
wt
-- 
Warren Turkal (w00t)

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq