Re: [HACKERS] Hash index todo list item

2007-09-07 Thread Neil Conway
On Sun, 2007-02-09 at 13:04 -0500, Kenneth Marshall wrote:
 2. Evaluate the performance of different hash index implementations
and/or changes to the current implementation. My current plan is
to keep the implementation as simple as possible and still provide
the desired performance. Several hash index suggestions deal with
changing the layout of the keys on a page to improve lookup
performance, including reducing the bucket size to a fraction of
a page or only storing the hash value on the page, instead of
the index value itself.

You might find this patch useful:

http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php

It implements the "just store the hash in the index" idea; it also sorts
the entries in a bucket by the hash value, which allows binary search to
be used to locate candidate matches.
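
A minimal sketch of that bucket-search idea, added here as an editorial aside
(the struct and function names are hypothetical, not the patch's own code):

    /*
     * Sketch only: each bucket holds fixed-size entries containing just the
     * hash code and a heap pointer, kept sorted by hash code, so a lookup
     * can binary-search for the first candidate entry.
     */
    typedef struct SketchHashEntry
    {
        unsigned int hashcode;      /* hash of the indexed value */
        unsigned int heapptr;       /* stand-in for a heap tuple pointer */
    } SketchHashEntry;

    /* Return the index of the first entry with the given hash code, or -1. */
    static int
    bucket_find_first(const SketchHashEntry *entries, int nentries,
                      unsigned int target)
    {
        int low = 0, high = nentries;   /* half-open search range [low, high) */

        while (low < high)
        {
            int mid = low + (high - low) / 2;

            if (entries[mid].hashcode < target)
                low = mid + 1;
            else
                high = mid;
        }
        if (low < nentries && entries[low].hashcode == target)
            return low;     /* a candidate only; the heap value still gets compared */
        return -1;
    }

Any entry found this way is only a candidate match; as discussed later in this
thread, the heap value still has to be compared.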

I was surprised that this didn't result in a performance improvement for
the benchmarks that I ran, but I never got around to investigating
further (either running more benchmarks or checking whether there was a
bug in the implementation).

Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
it up to HEAD if you'd like.

-Neil





[HACKERS] Installation problem and a question

2007-09-07 Thread Phil
I'm running Vista x64 and while I can successfully install PG, I 
can't seem to get PL/JAVA installed. The installer doesn't recognize 
that I have a JRE installed (I do) and the option is grayed out 
during installation.


So a couple of questions:

1) Should PL/JAVA be able to be installed on Vista x64 or is there 
some known issue? What environment variables should be set, and how 
should I set them for a JRE vs. JDK installation?


2) Where is the source code for the Windows installer? I've tried 
fairly hard to find it, including Googling, browsing development 
sites, and looking at the PG 8.2.4 source code directory, but no luck 
so far. Sorry if this is a simple question but I don't know where 
else to ask at this point. Among other things I wanted to look at how 
the installer was detecting the existence of a JRE.






Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread Markus Schiltknecht

Hi,

apoc9009 wrote:

Thadt is Replication NOT Backup


I've now read all of your messages in this thread, but I simply fail to 
understand why you are that much opposed to the term 'replication'. I 
think the only thing which comes any close to what you're looking for is 
replication (in particular eager multi-master replication).


I'd recommend you familiarize yourself with the world of database 
replication. You already know the important chapter from our manual, 
learn that by heart. Then read [2] and [3]. :-)


Regards

Markus

[1]: Postgres advocacy wiki:
http://developer.postgresql.org/index.php/Replication%2C_Clustering%2C_and_Connection_Pooling

[2]: Terms and Definitions of Database Replication
http://www.postgres-r.org/documentation/terms




Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread Filip Rembiałkowski
Backup is not replication, but a replicated database can be treated as a
good source of backups.

please take following remarks:

1) in English you don't capitalize nouns
2) read what other people write to you and try to understand it.
3) this is open source; try to be more cooperative and not just cry for a
ready-made solution


2007/9/7, apoc9009 [EMAIL PROTECTED]:
 Markus Schiltknecht schrieb:
  Hi,
 
  apoc9009 wrote:
  Thadt is Replication NOT Backup
 
  I've now read all of your messages in this thread, but I simply fail
  to understand why you are that much opposed to the term 'replication'.
  I think the only thing which comes any close to what you're looking
  for is replication (in particular eager multi-master replication).
 What is your Problem in understanding the Word Backup?

 Translation for you:
 A Backup is a File or Set of Files thadt contains the Data of your
 Business critical Informations.
 It should not be Archived on the same place, the same House or the same
 Room.

 A Replication Database has nothing to do with a Backup, it works only
 for Failover if the Primary
 Database has a Mailfunction.

 A good Backuptool is needed if you have Databases with sizes over 1
 Terrabyte. The common
 Backup methods wont Work with Online Productiondatabases and without the
 Problem of Datalosses,
 this is only a Way for small and Mediumsize Databases, not for Hugh
 Databases.

 Keep in Mind: Backup is NOT Replication!

 Write it down 100 times and maybe you understand






-- 
Filip Rembiałkowski



Re: [HACKERS] left outer join vs subplan

2007-09-07 Thread Teodor Sigaev




I think we need to build up a library of autogenerated queries, so we
can do things which address multiple use cases. Can you tell us more
about who/what generated it, so we can research?


Sorry, I can't publish a lot of information; it is one of the biggest Russian 
software companies, and it is trying to migrate from MS SQL to PostgreSQL. MS SQL 
can optimize such queries into a form similar to the second query.




The query formulation does seem a fairly common one.


It seems so to me too. Instead of a SUM aggregate it could be MIN/MAX/AVG etc., or a 
more complex subquery. But pgsql usually optimizes non-aggregate subqueries rather well.



The value of sum(b.val) is never used in the query, so the aggregate
itself could be discarded. I suspect there are other conditions you
aren't showing us that would make this impossible?

No - it is select *, i.e. all fields from the a and bagg tables.



The aggregate prevents the condition bagg.id = a.id from being pushed
down so that we know b.id = a.id. If we knew that then we could use b.id
= ? as an index condition to retrieve the rows.

In this case, it's safe to push down the clause b.id = a.id.

BTW, does pgsql understand that the query 'select id, sum() ... group by
id' produces a result with unique id values?


Since we can't use the best technique, we use another. That then hits a
third optimization problem. When an IndexScan is used to enforce order,
we don't estimate how much of the table needs to be scanned before we
start hitting rows. In the example you give we need to scan 65% of the
Why 65%? a.addon has only 100 unique values, and among the first 100 tuples in the 
index (a2idx) there will be about 80 tuples with id > 1.



table using an IndexScan before we hit any rows. So we would probably be
better off doing a Sort-SeqScan to apply the condition.


That's true for the query without a LIMIT clause (the second query is about 10% 
slower compared with the first one - both without LIMIT).

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread Andrew Dunstan



apoc9009 wrote:


Write it down 100 times and maybe you understand




If you are going to be rude nobody will bother to respond to you. 
Acknowledged experts have been very patient with you so far in this 
thread. You should be appreciative, not truculent.


cheers

andrew



Re: [HACKERS] Hash index todo list item

2007-09-07 Thread Kenneth Marshall
On Fri, Sep 07, 2007 at 09:50:07AM -0400, Mark Mielke wrote:
 Kenneth Marshall wrote:
 On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
   
 You might find this patch useful:

 http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
 ...

 Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
 it up to HEAD if you'd like.
 
 This is a great starting point. I would appreciate it if you have the
 time and could make it apply cleanly to HEAD. I remember when you first
 posted it but had forgotten, probably because of the lack-luster results.
 Just a quick glance at the patch and from what I can tell, tagging the
 index as lossy causes a lot more work to be done than should be needed
 in theory. Currently the index-scan machinery will recheck the value
 against the original value for lossy indexes. However, given that we
 are using a good hash function with a low chance of collision, if we
 mark the unique items in the index then they do not actually have to
 be rechecked during the scan. Do you have any suggestions for implementing
 that optimization or is there any option to tell the scan machinery to
 only re-check a certain list of tuples? Thank you again for pointing
 this patch out and please let me know when you have a version against
 HEAD.
   
 What do you mean by "mark the unique items in the index then they do not 
 actually have to be rechecked during the scan"? Even if there is a unique 
 hash value mapping to a unique key, there is no guarantee that a new value 
 won't result in that same hash value. Such is the nature of hashes. A hash 
 key match does not mean a value match. The value must be checked. The 
 opposite, however, may be true. If the hash key is not found, then we know 
 the row for the value does not exist.

 Have you measured the performance of re-checking? I have always assumed the 
 performance of re-checking was near free when compared to the cost of 
 looking up the tuples in the table to determine whether or not they are 
 live. This is why I have not been upset that bitmap index scans often 
 re-check.

 Cheers,
 mark

I understand that a hash value is a many-to-one mapping. That is the
point of the flag in the index. The flag means that there is only one
item in the heap corresponding to that hash value. In this case we
know that the value in the heap is the correct one and a possibly
very expensive string comparison can be skipped. Given that the hash
function is doing its job, almost every string comparison can be skipped.
How long would it take to compare 1-32K of data? How much CPU usage?
With this field in place, you only need to check tuple visibility.

Regards,
Ken



Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread Rainer Bauer
Markus Schiltknecht wrote:

[2]: Terms and Definitions of Database Replication
http://www.postgres-r.org/documentation/terms

Markus, the links in the left side menu are broken on the about and
documentation page. They point to http://www.postgres-r.org/overview
instead of http://www.postgres-r.org/documentation/overview, etc.

Rainer



Re: [HACKERS] Hash index todo list item

2007-09-07 Thread Mark Mielke

Kenneth Marshall wrote:

On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
  

You might find this patch useful:

http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
...

Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
it up to HEAD if you'd like.


This is a great starting point. I would appreciate it if you have the
time and could make it apply cleanly to HEAD. I remember when you first
posted it but had forgotten, probably because of the lack-luster results.
Just a quick glance at the patch and from what I can tell, tagging the
index as lossy causes a lot more work to be done than should be needed
in theory. Currently the index-scan machinery will recheck the value
against the original value for lossy indexes. However, given that we
are using a good hash function with a low chance of collision, if we
mark the unique items in the index then they do not actually have to
be rechecked during the scan. Do you have any suggestions for implementing
that optimization or is there any option to tell the scan machinery to
only re-check a certain list of tuples? Thank you again for pointing
this patch out and please let me know when you have a version against
HEAD.
  
What do you mean by "mark the unique items in the index then they do not 
actually have to be rechecked during the scan"? Even if there is a 
unique hash value mapping to a unique key, there is no guarantee that a 
new value won't result in that same hash value. Such is the nature of 
hashes. A hash key match does not mean a value match. The value must be 
checked. The opposite, however, may be true. If the hash key is not 
found, then we know the row for the value does not exist.


Have you measured the performance of re-checking? I have always assumed 
the performance of re-checking was near free when compared to the cost 
of looking up the tuples in the table to determine whether or not they 
are live. This is why I have not been upset that bitmap index scans 
often re-check.


Cheers,
mark

--
Mark Mielke [EMAIL PROTECTED]



Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread apoc9009

Markus Schiltknecht schrieb:

Hi,

apoc9009 wrote:

Thadt is Replication NOT Backup


I've now read all of your messages in this thread, but I simply fail 
to understand why you are that much opposed to the term 'replication'. 
I think the only thing which comes any close to what you're looking 
for is replication (in particular eager multi-master replication).

What is your Problem in understanding the Word Backup?

Translation for you:
A Backup is a File or Set of Files thadt contains the Data of your 
Business critical Informations.
It should not be Archived on the same place, the same House or the same 
Room.


A Replication Database has nothing to do with a Backup, it works only 
for Failover if the Primary

Database has a Mailfunction.

A good Backuptool is needed if you have Databases with sizes over 1 
Terrabyte. The common
Backup methods wont Work with Online Productiondatabases and without the 
Problem of Datalosses,
this is only a Way for small and Mediumsize Databases, not for Hugh 
Databases.


Keep in Mind: Backup is NOT Replication!

Write it down 100 times and maybe you understand





Re: [HACKERS] Hash index todo list item

2007-09-07 Thread Kenneth Marshall
On Fri, Sep 07, 2007 at 12:55:37PM +0100, Heikki Linnakangas wrote:
 Neil Conway wrote:
  You might find this patch useful:
  
  http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
 
 Oh, I had forgotten about that.
 
  It implements the "just store the hash in the index" idea; it also sorts
  the entries in a bucket by the hash value, which allows binary search to
  be used to locate candidate matches.
  
  I was surprised that this didn't result in a performance improvement for
  the benchmarks that I ran, but I never got around to investigating
  further (either running more benchmarks or checking whether there was a
  bug in the implementation).
 
 You did get a smaller index, which has cache benefits with larger
 tables. You didn't include a size comparison against b-tree, but a
 smaller index is a good thing.
 
 I think you left some money on the table on that front. Since all the
 HashItems are fixed size, you can get rid of the line pointers
 altogether. Given that a sizeof(HashItemData) is 8 bytes, and a line
 pointer is 4 bytes, that should give a further 1/3 reduction in index
 size. If you're willing to give up on the alignment of HashItemData, you
 can get it down to 6 bytes.
 
 Even from a CPU point of view, as the table becomes bigger and the
 b-tree becomes deeper, the difference between O(1) and O(log n) lookup
 starts to become more significant.
 
If you use the size values from my initial tests, the hash index is
down to 3X the btree size instead of 5X. If we can make these additional
changes and add a unique flag to the index entry, we would have a much
smaller index. I had not thought of it at the time, but the addition of
the unique/sole-item flag would allow the use of the hash index to
support unique indexes and be used for primary keys. In some use
cases, the O(1) lookup would be very advantageous.

Regards,
Ken



Re: [HACKERS] Hash index todo list item

2007-09-07 Thread Kenneth Marshall
On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
 On Sun, 2007-02-09 at 13:04 -0500, Kenneth Marshall wrote:
  2. Evaluate the performance of different hash index implementations
 and/or changes to the current implementation. My current plan is
 to keep the implementation as simple as possible and still provide
 the desired performance. Several hash index suggestions deal with
 changing the layout of the keys on a page to improve lookup
 performance, including reducing the bucket size to a fraction of
 a page or only storing the hash value on the page, instead of
 the index value itself.
 
 You might find this patch useful:
 
 http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
 
 It implements the "just store the hash in the index" idea; it also sorts
 the entries in a bucket by the hash value, which allows binary search to
 be used to locate candidate matches.
 
 I was surprised that this didn't result in a performance improvement for
 the benchmarks that I ran, but I never got around to investigating
 further (either running more benchmarks or checking whether there was a
 bug in the implementation).
 
 Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
 it up to HEAD if you'd like.
 
 -Neil
 
This is a great starting point. I would appreciate it if you have the
time and could make it apply cleanly to HEAD. I remember when you first
posted it but had forgotten, probably because of the lack-luster results.
Just a quick glance at the patch and from what I can tell, tagging the
index as lossy causes a lot more work to be done than should be needed
in theory. Currently the index-scan machinery will recheck the value
against the original value for lossy indexes. However, given that we
are using a good hash function with a low chance of collision, if we
mark the unique items in the index then they do not actually have to
be rechecked during the scan. Do you have any suggestions for implementing
that optimization or is there any option to tell the scan machinery to
only re-check a certain list of tuples? Thank you again for pointing
this patch out and please let me know when you have a version against
HEAD.

Regards,
Ken
 
 



Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread Hannu Krosing
On Fri, 2007-09-07 at 12:03, apoc9009 wrote:
 Andrew Sullivan schrieb:
  It seems that what you want is near-real-time online backups with _no
  cost_, which is not a feature that I think anyone will ever work on.  
  A
 100% Correct!
 
 I think anyone commit the Statement, thadt a Databases is a very 
 imported Part of Software
 for a wide range of Peoples. It is absoluty normaly thadt thoose Peoples 
 who using Databases having the Wish to have
 there Data save and up to Date until to the Last Record of a Customer 
 (for Example: A Order form a Onlineshop).
 
 Very Large Databases with sizes more then over 1 Terrabyte cannot be 
 backed up continously using Dumps
 or the common Backupmethods
 
 The Abstract need is:
 
 1.An exact Copy (1:1) of the Productiondatabase should be mirrored at a 
 diffrent Machine, connected via (LAN/WAN) (as Backup)
 2.The Backup Operations should be done normaly while the Database is 
 Online with very small Bandwich, I/O and CPU Trafficcosts.
 3.Per Month or Week (or choosable) at a New Backup-Timeline/Version 
 should be started (depend on the Basebackup)

Probably you don't want your backup as a base backup + a bunch of
logfiles. If you really need to move to the backup, then even a week's worth
of logfiles will take several hours or possibly days to apply.

What you want is a standby machine, which applies all WAL files as they
are copied over.

 4.A New Base and  its Streaming Backup Configuration should be 
 Hot-Addable to an running Productiondatabase (without Shutdown or Lock)

It currently is

 5.A Remote Operationmanager Interface (GUI) should be there for 
 Monitoring and Maintaining the Backups (maybe in PGAdmin)

Huh? I don't think this has anything to do with the postgres-hackers list.

Actually, your other questions should also go to some users/newbies list,
as this is a question of using existing features, and not something that
needs to be added to the backend.

 6.If the Production and Mirror replicated Database is crashed (Hardware 
 Failure or Provider Problems) the Recovery should done verry fast.

Yep. That is what hot standby, as described in the postgres manual, is meant
to do.

 (An Idea for 5. / Clientside)...
 It would be a great Feature, if the PSQL-Client having a Local Deamon, 
 thadt can Download the Backup Catalogues from
 the WAN-Backup FTP-Server continously down to the local Adminmachine. 
 Then the Admin is able to burn the DB-Backups on
 DVD or write it on Tape on its local Machine at his Company (if the 
 Backupreplicationserver fails two or the Internetprovider has
 Problems,  the DB-Admin can apply its local Backups from DVD to the New 
 DB on a New Machine anytime)
  

All this keeping of WAL logs on CDs/tapes has nothing to do with fast
recovery after failure.

This is something that would be useful, but for entirely different purposes,
namely for the case when you need to _GO_BACK_ to an earlier state. For
example, you discover that a careless operator or a software bug has
deleted important data last Wednesday and you need a way to get it back.
In that case you take your last Sunday's base backup and apply WAL up to a
few moments before the error happened. But it will not be fast.


Hannu







Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread Hannu Krosing
On Fri, 2007-09-07 at 12:20, apoc9009 wrote:
 Trevor Talbot schrieb:
  Backup 12/24/2008 Version 2
  /pg/backup/12_24_2008/base/rcvry.rcv--- Basebackup
  /pg/backup/12_24_2008/changes/0001.chg  --- Changed Data
   /changes/0002.chg  --- Changed Data
   /changes/0003.chg  --- Changed Data

   /changes/0010.chg  --- Changed Data
   /changes/0001.rsf  ---  Recovery
 
  Stripeset File (10 MByte) addon of Basebackup
  delete *.chg
 
  if a Stripeset of 10 *.chg Files exist, they should be converted or merged
  to one bigger *.rsf Stripesetfile (to prevent for having Millions *.chg)
 
  
 
  Why?  What does this actually do?
 This is the Structure on FTP-Site of the Backupserver! Not locally where 
 the Postmaster runs on!

This can currently be done with a simple script, either with ftp or
ssh/scp

-
Hannu





Re: [HACKERS] Hash index todo list item

2007-09-07 Thread Heikki Linnakangas
Neil Conway wrote:
 You might find this patch useful:
 
 http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php

Oh, I had forgotten about that.

 It implements the "just store the hash in the index" idea; it also sorts
 the entries in a bucket by the hash value, which allows binary search to
 be used to locate candidate matches.
 
 I was surprised that this didn't result in a performance improvement for
 the benchmarks that I ran, but I never got around to investigating
 further (either running more benchmarks or checking whether there was a
 bug in the implementation).

You did get a smaller index, which has cache benefits with larger
tables. You didn't include a size comparison against b-tree, but a
smaller index is a good thing.

I think you left some money on the table on that front. Since all the
HashItems are fixed size, you can get rid of the line pointers
altogether. Given that a sizeof(HashItemData) is 8 bytes, and a line
pointer is 4 bytes, that should give a further 1/3 reduction in index
size. If you're willing to give up on the alignment of HashItemData, you
can get it down to 6 bytes.
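
As an editorial aside, here is a rough sketch of the line-pointer-free layout
being described; the struct layout and names are illustrative only, not the
actual on-disk format:

    /*
     * Illustrative sketch: with fixed-size index entries there is no need
     * for per-entry line pointers; the n-th entry can be located by plain
     * offset arithmetic from the start of the page's data area.
     */
    #include <stdint.h>

    typedef struct SketchHashItem
    {
        uint32_t hashcode;      /* hash of the indexed value                */
        uint16_t heapblk;       /* stand-ins for a compressed heap pointer, */
        uint16_t heapoff;       /* giving the 8-byte entry discussed above  */
    } SketchHashItem;

    /* Fetch the n-th entry of a bucket page, given its data area. */
    static inline SketchHashItem *
    page_get_item(char *dataarea, int n)
    {
        return (SketchHashItem *) (dataarea + (size_t) n * sizeof(SketchHashItem));
    }

Dropping the 4-byte line pointer that would otherwise sit next to each 8-byte
entry is where the further one-third size reduction mentioned above comes from.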

Even from a CPU point of view, as the table becomes bigger and the
b-tree becomes deeper, the difference between O(1) and O(log n) lookup
starts to become more significant.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hash index todo list item

2007-09-07 Thread Kenneth Marshall
On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
 On Sun, 2007-02-09 at 13:04 -0500, Kenneth Marshall wrote:
  2. Evaluate the performance of different hash index implementations
 and/or changes to the current implementation. My current plan is
 to keep the implementation as simple as possible and still provide
 the desired performance. Several hash index suggestions deal with
 changing the layout of the keys on a page to improve lookup
 performance, including reducing the bucket size to a fraction of
 a page or only storing the hash value on the page, instead of
 the index value itself.
 
 You might find this patch useful:
 
 http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
 
 It implements the "just store the hash in the index" idea; it also sorts
 the entries in a bucket by the hash value, which allows binary search to
 be used to locate candidate matches.
 
 I was surprised that this didn't result in a performance improvement for
 the benchmarks that I ran, but I never got around to investigating
 further (either running more benchmarks or checking whether there was a
 bug in the implementation).
 
 Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
 it up to HEAD if you'd like.
 
 -Neil
 
I have another question. Did the scan code at this time use the
heap-order scanning? Could that have had an impact on the patch
performance?

Ken



Re: [HACKERS] Hash index todo list item

2007-09-07 Thread Mark Mielke

Kenneth Marshall wrote:

I understand that a hash value is a many-to-one mapping. That is the
point of the flag in the index. The flag means that there is only one
item in the heap corresponding to that hash value. In this case we
know that the value in the heap is the correct one and a possibly
very expensive string comparison can be skipped. Given that the hash
function is doing its job, almost every string comparison can be skipped.
How long would it take to compare 1-32K of data? How much CPU usage?
With this field in place, you only need to check tuple visibility
The value comparison cannot be skipped. I do not think you understand 
the many-to-one mapping in its entirety.


Example:

   Table has: a(1), b(2), c(3)
   Index has: 1 = 1 (unique), 2 = 2 (unique), 3 = 3 (unique)

Query:

   select * from table where key = 'z';

If 'z' hashes to '3' (completely possible), then the index record 3 
points to tuple 3, and it exists. Only, it doesn't, because 'c' <> 'z'. 
You MUST check the value.
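
A hedged sketch of the recheck this example implies (the helper names are
hypothetical, not PostgreSQL's actual index-scan API):

    /*
     * Sketch only: even when an index entry's hash code matches the probe,
     * the heap value must still be compared, because different values can
     * share the same hash code.
     */
    #include <stdbool.h>
    #include <string.h>

    static bool
    candidate_matches(unsigned int stored_hash, const char *heap_value,
                      const char *probe_value,
                      unsigned int (*hashfn)(const char *))
    {
        if (stored_hash != hashfn(probe_value))
            return false;   /* hash differs: definitely not a match */
        /* equal hashes only make this a candidate; compare the real values */
        return strcmp(heap_value, probe_value) == 0;
    }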


Cheers,
mark

--
Mark Mielke [EMAIL PROTECTED]



Re: [HACKERS] Hash index todo list item

2007-09-07 Thread Kenneth Marshall
On Fri, Sep 07, 2007 at 10:36:41AM -0400, Brian Hurt wrote:
 Kenneth Marshall wrote:

 I understand that a hash value is a many-to-one mapping. That is the
 point of the flag in the index. The flag means that there is only one
 item in the heap corresponding to that hash value. In this case we
 know that the value in the heap is the correct one and a possibly
 very expensive string comparison can be skipped. Given that the hash
 function is doing its job, almost every string comparison can be skipped.
 How long would it take to compare 1-32K of data? How much CPU usage?
 With this field in place, you only need to check tuple visibility.
  

 How likely is it that you will get a hash collision, two strings that are 
 different that will hash to the same value?  To avoid this requires a very 
 large hash key (128 bits, minimum)- otherwise you get into birthday attack 
 problems.  With a 32-bit hash, the likelihood is greater than 50% that two 
 strings in a collection of 100,000 will hash to the same value.  With a 
 64-bit hash, the likelihood is greater than 50% that two strings in a 
 collection of 10 billion will hash to the same value.  10 billion is a large 
 number, but not an unreasonable number, of strings to want to put into a 
 hash table- and it's exactly this case where the O(1) cost of hashtables 
 starts being a real win.

 Brian

Yes, there is a non-negligible chance of collision (In a DB is there
any chance that is non-negligible? :) ) and the values must be checked
against the actual values. The win is the collapse of the index size and only
needing to check a small fraction of the actual tuples.

Regards,
Ken



Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread Markus Schiltknecht

Hi,

apoc9009 wrote:

Translation for you:
A Backup is a File or Set of Files thadt contains the Data of your 
Business critical Informations.
It should not be Archived on the same place, the same House or the same 
Room.


I disagree, a backup does not necessarily have to be a single file or a 
set of files. Wikipedia has this definition:


  "backup refers to making copies of data so that these additional
   copies may be used to restore the original after a data loss event."

While for replica, it states:

  "replica is a copy that is relatively indistinguishable from the
   original"

Thus a backup can very well be thought of as replica, and vice versa.

A Replication Database has nothing to do with a Backup, it works only 
for Failover if the Primary

Database has a Mailfunction.


That's certainly plain wrong, see multi-master replication where 
failover doesn't make any sense. Wikipedia again (although that's 
unfair, as I've contributed to that definition myself) [1]:


  "Replication is the process of sharing information so as to ensure
   consistency between redundant resources"

...for example a master database and a backup.


Keep in Mind: Backup is NOT Replication!
Write it down 100 times and maybe you understand


A backup IS a replica. A backup IS a replica. A backup IS a replica. A 
backup IS a replica...


Regards

Markus

[1]: http://en.wikipedia.org/wiki/Replication_%28computer_science%29



Re: [HACKERS] Hash index todo list item

2007-09-07 Thread Kenneth Marshall
On Fri, Sep 07, 2007 at 11:08:13AM -0400, Brian Hurt wrote:
 Kenneth Marshall wrote:

  
 How likely is it that you will get a hash collision, two strings that are 
 different that will hash to the same value?  To avoid this requires a 
 very large hash key (128 bits, minimum)- otherwise you get into birthday 
 attack problems.  With a 32-bit hash, the likelihood is greater than 50% 
 that two strings in a collection of 100,000 will hash to the same value.  
 With a 64-bit hash, the likelihood is greater than 50% that two strings 
 in a collection of 10 billion will hash to the same value.  10 billion is a 
 large number, but not an unreasonable number, of strings to want to put 
 into a hash table- and it's exactly this case where the O(1) cost of 
 hashtables starts being a real win.

 Brian


 Yes, there is a non-negligible chance of collision (In a DB is there
 any chance that is non-negligible? :) ) and the values must be checked
 against the actual values. The win is the collapse of the index size and only
 needing to check a small fraction of the actual tuples.


  

 Ah, OK- I misunderstood you.  I thought you were saying that the hash 
 values would need to be unique, and you wouldn't check the original values 
 at all.  My bad.

 Brian

No, you were correct. I misstated originally and you and Mark both pointed
out my mistake.

Regards,
Ken



[HACKERS] terms for database replication: synchronous vs eager

2007-09-07 Thread Markus Schiltknecht

Hi,

I'm asking for advice and hints regarding terms in database replication, 
especially WRT Postgres-R. (Sorry for crossposting, but I fear not 
reaching enough people on the Postgres-R ML alone)


I'm struggling with how to classify the Postgres-R algorithm. Up until 
recently, most people thought of it as synchronous replication, but it's 
not synchronous in the strong (and very common) sense. I.e. after a node 
confirms it has committed a transaction, the other nodes haven't necessarily 
committed it yet. (They only promise that they *will* commit it without 
conflicts.)


This violates the common understanding of synchrony, because you can't 
commit on a node A and then query another node B and expect it to be 
coherent immediately.


Nonetheless, Postgres-R is eager (or pessimistic?) in the sense that 
it replicates *before* committing, so as to avoid divergence. In [1] 
I've tried to make that distinction clear, and I'm currently advocating 
for using 'synchronous' only in the very strong (and commonly used) sense. 
I've chosen the word 'eager' to mean 'replicates before committing'.


According to those definitions, Postgres-R is async but eager.

Do these definitions violate any common meaning? Maybe in other areas 
like distributed storage or lock managers?


Regards

Markus

[1]: Terms and Definitions of Database Replication
http://www.postgres-r.org/documentation/terms



Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread apoc9009

Filip Rembiałkowski schrieb:

please take following remarks:
thx, but if i need some advice form a scandinavian dickhead then i will 
let you know this








Re: [HACKERS] Hash index todo list item

2007-09-07 Thread Kenneth Marshall
On Fri, Sep 07, 2007 at 10:30:30AM -0400, Mark Mielke wrote:
 Kenneth Marshall wrote:
 I understand that a hash value is a many-to-one mapping. That is the
 point of the flag in the index. The flag means that there is only one
 item in the heap corresponding to that hash value. In this case we
 know that the value in the heap is the correct one and a possibly
 very expensive string comparison can be skipped. Given that the hash
 function is doing its job, almost every string comparison can be skipped.
 How long would it take to compare 1-32K of data? How much CPU usage?
 With this field in place, you only need to check tuple visibility
 The value comparison cannot be skipped. I do not think you understand the 
 many-to-one mapping in its entirety.

 Example:

Table has: a(1), b(2), c(3)
Index has: 1 = 1 (unique), 2 = 2 (unique), 3 = 3 (unique)

 Query:

select * from table where key = 'z';

 If 'z' hashes to '3' (completely possible), then the index record 3 points 
 to tuple 3, and it exists. Only, it doesn't, because 'c' <> 'z'. You MUST 
 check the value.

 Cheers,
 mark

Yes, you are completely correct. Thank you for the reminder.

Regards,
Ken



Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread Joshua D. Drake

apoc9009 wrote:
 Filip Rembiałkowski schrieb:
 please take following remarks:
 thx, but if i need some advice form a scandinavian dickhead then i will
 let you know this
 

This is not acceptable on our lists. Do not post in such a way again.

Sincerely,

Joshua D. Drake

 
 
 
 


--

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997  http://www.commandprompt.com/
UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/




Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread Dave Page

apoc9009 wrote:

Filip Rembiałkowski schrieb:

please take following remarks:
thx, but if i need some advice form a scandinavian dickhead then i will 
let you know this


That kind of remark is not acceptable on the PostgreSQL mailing lists. 
Please do not post here again unless you can speak to people with an 
appropriate amount of respect.


Regards, Dave.



Re: [HACKERS] Hash index todo list item

2007-09-07 Thread Brian Hurt

Kenneth Marshall wrote:



 

How likely is it that you will get a hash collision, two strings that are 
different that will hash to the same value?  To avoid this requires a very 
large hash key (128 bits, minimum)- otherwise you get into birthday attack 
problems.  With a 32-bit hash, the likelyhood is greater than 50% that two 
strings in a collection of 100,000 will hash to the same value.  With a 
64-bit hash, the likelyhood is greater than 50% that two strings in a 
collection of 10 billion will has to same value.  10 billion is a large 
number, but not an unreasonable number, of strings to want to put into a 
hash table- and it's exactly this case where the O(1) cost of hashtables 
starts being a real win.


Brian

   


Yes, there is a non-negligible chance of collision (In a DB is there
any chance that is non-negligible? :) ) and the values must be checked
against the actual values. The win is the collapse of the index size and only
needing to check a small fraction of the actual tuples.


 



Ah, OK- I misunderstood you.  I thought you were saying that the hash 
values would need to be unique, and you wouldn't check the original 
values at all.  My bad.


Brian



Re: [HACKERS] GIN readme is out of date

2007-09-07 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Since that interface documentation has been copied to the manual,
 gin.sgml, section 52.2, which is up to date, how about we just remove it
 from the README?

+1 ... README files should not duplicate info that's in the SGML docs.

regards, tom lane



Re: [HACKERS] GIN readme is out of date

2007-09-07 Thread Teodor Sigaev

Since that interface documentation has been copied to the manual,
gin.sgml, section 52.2, which is up to date, how about we just remove it
from the README?


Agreed
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] Hash index todo list item

2007-09-07 Thread Brian Hurt

Kenneth Marshall wrote:


I understand that a hash value is a many-to-one mapping. That is the
point of the flag in the index. The flag means that there is only one
item in the heap corresponding to that hash value. In this case we
know that the value in the heap is the correct one and a possibly
very expensive string comparison can be skipped. Given that the hash
function is doing its job, almost every string comparison can be skipped.
How long would it take to compare 1-32K of data? How much CPU usage?
With this field in place, you only need to check tuple visibility.
 



How likely is it that you will get a hash collision, two strings that 
are different that will hash to the same value?  To avoid this requires 
a very large hash key (128 bits, minimum)- otherwise you get into 
birthday attack problems.  With a 32-bit hash, the likelihood is greater 
than 50% that two strings in a collection of 100,000 will hash to the 
same value.  With a 64-bit hash, the likelihood is greater than 50% that 
two strings in a collection of 10 billion will hash to the same value.  10 
billion is a large number, but not an unreasonable number, of strings to 
want to put into a hash table- and it's exactly this case where the O(1) 
cost of hashtables starts being a real win.
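
The quoted thresholds follow from the standard birthday approximation; as an
editorial cross-check (not part of the original mail), with m = 2^b possible
hash codes:

    % probability of at least one collision among n values, and the 50% point
    \[
      p(n) \approx 1 - e^{-n(n-1)/(2m)},
      \qquad
      p(n) = \tfrac{1}{2} \;\Rightarrow\; n \approx 1.1774\sqrt{m}.
    \]
    % b = 32: n is about 1.1774 * 2^16, roughly 77,000, so 100,000 strings are past the 50% point
    % b = 64: n is about 1.1774 * 2^32, roughly 5.1e9, so 10 billion strings are past the 50% point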


Brian




Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-07 Thread Simon Riggs
On Fri, 2007-09-07 at 11:48 -0400, Greg Smith wrote:
 On Fri, 7 Sep 2007, Simon Riggs wrote:
 
  I think that is what we should be measuring, perhaps in a simple way 
  such as calculating the 90th percentile of the response time 
  distribution.
 
 I do track the 90th percentile numbers, but in these pgbench tests where 
 I'm writing as fast as possible they're actually useless--in many cases 
 they're *smaller* than the average response, because there are enough 
 cases where there is a really, really long wait that they skew the average 
 up really hard.  Take a look at any of the individual test graphs and 
 you'll see what I mean.

I've looked at the graphs now, but I'm not any wiser, I'm very sorry to
say. We need something like a frequency distribution curve, not just the
actual times. Bottom line is we need a good way to visualise the
detailed effects of the patch.

I think we should do some more basic tests to see where those outliers
come from. We need to establish a clear link between number of dirty
writes and response time. If there is one, which we all believe, then it
is worth minimising those with these techniques. We might just be
chasing the wrong thing.

Perhaps output the number of dirty blocks written on the same line as
the output of log_min_duration_statement so that we can correlate
response time to dirty-block-writes on that statement.

For me, we can enter Beta while this is still partially in the air. We
won't be able to get this right without lots of other feedback. So I
think we should concentrate now on making sure we've got the logging in
place so we can check whether your patch works when it's out there. I'd
say let's include what you've done and then see how it works during Beta.
We've been trying to get this right for years now, so we have to allow
some slack to make sure we get this right. We can reduce or strip out
logging once we go RC.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com




Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread Martijn van Oosterhout
On Fri, Sep 07, 2007 at 12:03:31PM +0200, apoc9009 wrote:
 Andrew Sullivan schrieb:
 It seems that what you want is near-real-time online backups with _no
 cost_, which is not a feature that I think anyone will ever work on.  
 A
 100% Correct!
 
 I think anyone commit the Statement, thadt a Databases is a very 
 imported Part of Software
 for a wide range of Peoples. It is absoluty normaly thadt thoose Peoples 
 who using Databases having the Wish to have
 there Data save and up to Date until to the Last Record of a Customer 
 (for Example: A Order form a Onlineshop).

Ah, but that's not what you asked (at least not as I interpreted it).
You see, postgres does what you want if there's a simple power failure.
Everything committed stays committed. You can protect against disk
failure with RAID arrays also.  What it sounds like is that you want to
protect from someone blowing up your storage system. 

I think you need to step back and work out exactly what you are
protecting against. Because it looks like your suggestion won't protect
against a meteor strike on your data centre and the surrounding 100 km.

Tell us what you are protecting against, then you can get some more
useful answers.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-07 Thread Simon Riggs
On Wed, 2007-09-05 at 23:31 -0400, Greg Smith wrote:

 Tom gets credit for naming the attached patch, which is my latest attempt to 
 finalize what has been called the "Automatic adjustment of 
 bgwriter_lru_maxpages" patch for 8.3; that's not what it does anymore but 
 that's where it started.

This is a big undertaking, so well done for going for it.

 I decided to use pgbench for running my tests.  The scripting framework to 
 collect all that data and usefully summarize it is now available as 
 pgbench-tools-0.2 at 
 http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm

For me, the main role of the bgwriter is to avoid dirty writes in
backends. The purpose of doing that is to improve the response time
distribution as perceived by users. I think that is what we should be
measuring, perhaps in a simple way such as calculating the 90th
percentile of the response time distribution. Looking at the headline
numbers, especially tps, makes it notoriously difficult to determine any
meaning from the test results.
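
A tiny sketch of the simple measurement being suggested, added as an editorial
aside (not taken from any benchmark harness in the thread):

    /* Nearest-rank 90th-percentile latency from a sample of response times (ms). */
    #include <stdlib.h>

    static int
    cmp_double(const void *a, const void *b)
    {
        double x = *(const double *) a, y = *(const double *) b;
        return (x > y) - (x < y);
    }

    static double
    percentile_90(double *latencies, int n)
    {
        qsort(latencies, n, sizeof(double), cmp_double);
        int idx = (int) (0.90 * (n - 1) + 0.5);     /* round to the nearest rank */
        return latencies[idx];
    }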

Looking at the tps also tempts us to run a test which maxes out the
server, an area we already know and expect the bgwriter to be unhelpful
in.

If I run a server at or below 70% capacity, what settings of the
bgwriter help maintain my response time distribution?

 Coping with idle periods
 
 
 While I was basically happy with these results, the data Kevin Grittner 
 submitted in response to my last call for commentary left me concerned. While 
 the JIT approach works fine as long as your system is active, it does 
 absolutely nothing if the system is idle.  I noticed that a lot of the writes 
 that were being done by the client backends were after idle periods where the 
 JIT writer just didn't react fast enough during the ramp-up.  For example, if 
 the system went from idle for a while to full-speed just as the 200ms sleep 
 started, by the time the BGW woke up again the backends could have needed to 
 write many buffers already themselves.

You've hit the nail on the head there. I can't see how you can do
anything sensible when the bgwriter keeps going to sleep for long
periods.

The bgwriter's activity curve should ideally be the same shape as a
critically damped harmonic oscillator. It should wake up, do lots of
writing if needed, then trail off over time. The only way to do that
seems to be to vary the sleep automatically, or make short sleeps.

For me, the bgwriter should sleep for at most 10ms at a time. If it has
nothing to do it can go straight back to sleep again. Trying to set that
time is fairly difficult, so it would be better not to have to set it at
all.

If you've changed bgwriter so it doesn't scan if no blocks have been
allocated, I don't see any reason to keep the _delay parameter at all.

 I think I can safely say there is a level of intelligence going into what the 
 LRU background writer does with this patch that has never been applied to 
 this 
 problem before.  There have been a lot of good ideas thrown out in this area, 
 but it took a hybrid approach that included and carefully balanced all of 
 them 
 to actually get results that I felt were usable. What I don't know is whether 
 that will also be true for other testers.

I get the feeling that what we have here is better than what we had
before, but I guess I'm a bit disappointed we still have 3 magic
parameters, or 5 if you count your hard-coded ones also.

There's still no formal way to tune these. As long as we have *any*
magic parameters, we need a way to tune them in the field, or they are
useless. At the very least we need a plan for how people will report results
during Beta. That means we need a log_bgwriter (better name, please...)
parameter that provides information to assist with tuning. At the very
least we need this to be present during Beta, if not beyond.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com




Re: [HACKERS] Sorting the Stop word lists

2007-09-07 Thread Teodor Sigaev

1. Sort the stopword lists in the main distribution
2. Require them to be sorted
3. Remove the sort from readstoplist()
I don't believe that will be a big win in performance - lists are rather small. And 
it would need an added check that the lists are sorted.




--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] Installation problem and a question

2007-09-07 Thread Richard Huxton

Phil wrote:
1) Should PL/JAVA be able to be installed on Vista x64 or is there some 
known issue? What environment variables should be set, and how should I 
set them for a JRE vs. JDK installation?


Sorry - don't know.

2) Where is the source code for the Windows installer? I've tried fairly 
hard to find it, including Googling, browsing development sites, and 
looking at the PG 8.2.4 source code directory, but no luck so far. Sorry 
if this is a simple question but I don't know where else to ask at this 
point. Among other things I wanted to look at how the installer was 
detecting the existence of a JRE.


Can help here:
http://pgfoundry.org/projects/pginstaller/

--
  Richard Huxton
  Archonet Ltd



Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?

2007-09-07 Thread Florian G. Pflug

Simon Riggs wrote:

On Fri, 2007-09-07 at 06:36 +0200, Florian G. Pflug wrote:

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:
- I actually think with just a little bit of more work, we
can go even further, and get rid of the ReadNewTransactionId() call
completely during snapshotting.

[ squint... ]  This goes a bit far for me.  In particular, I think this
will fail in the edge case when there are no live XIDs visible in
ProcArray.  You cannot go back and do ReadNewTransactionId afterward,
at least not without re-scanning the ProcArray a second time, which
makes it at best a questionable win.

Why would it?


I think the additional suggestion goes a bit too far. You may be right,
but I don't want to change the transaction system in advanced ways this
close to the next release. We may have difficulty spotting bugs in that
thinking during beta.


Ok, those were two clear votes against doing this, so I'll stop
arguing ;-). I do think that we should have another look at this when
8.4 opens, though.

greetings, Florian Pflug




Re: [HACKERS] HEAD build troubles, buildfarm misconfigurations

2007-09-07 Thread Robert Treat
On Wednesday 05 September 2007 12:01, Tom Lane wrote:
 Robert Treat [EMAIL PROTECTED] writes:
  On Wednesday 05 September 2007 00:06, Tom Lane wrote:
  BTW, on re-reading that, it seems a tad surprising to get an error right
  there --- if postgres_fe.h or anything it includes were broken, then the
  build should have failed earlier.  Is the failure reproducible in
  repeated tries?
 
  Yes.  Just for giggles I played with the configure flags as well, but it
  always errors in that spot.

 Hmph.  I checked the CVS logs and there haven't been any recent changes
 that seem like they could be related.

 It might be worth getting gcc -E output and looking to see what that
 part of netdb.h looks like after macro expansion.


Sorry, my C-fu is lacking. My makefile claims it uses gcc -E, but I'm not sure 
how to get a look at the results after macro expansion.

However... I looked again at the output of the make command on some Ubuntu 
buildfarm members, and the results were different enough that I got suspicious 
of the flex/bison derived files in the snapshot, so I decided to build from 
CVS and that works fine.  FWIW I am using flex 2.5.31 and bison 2.1; not sure 
what is used to produce the snapshots... 

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-07 Thread Greg Smith

On Fri, 7 Sep 2007, Simon Riggs wrote:

I think that is what we should be measuring, perhaps in a simple way 
such as calculating the 90th percentile of the response time 
distribution.


I do track the 90th percentile numbers, but in these pgbench tests where 
I'm writing as fast as possible they're actually useless--in many cases 
they're *smaller* than the average response, because there are enough 
cases where there is a really, really long wait that they skew the average 
up really hard.  Take a look at any of the individual test graphs and 
you'll see what I mean.



Looking at the tps also tempts us to run a test which maxes out the
server, an area we already know and expect the bgwriter to be unhelpful
in.


I tried to turn that around and make my thinking be that if I built a 
bgwriter that did most of the writes without badly impacting the measure 
we know and expect it to be unhelpful in, that would be more likely to 
yield a robust design.  It kept me out of areas where I might have built 
something that had to be disclaimed with "don't run this when the server 
is maxed out".



For me, the bgwriter should sleep for at most 10ms at a time. If it has
nothing to do it can go straight back to sleep again. Trying to set that
time is fairly difficult, so it would be better not to have to set it at
all.


I wanted to get this patch out there so people could start thinking about 
what I'd done and consider whether this still fit into the 8.3 timeline. 
What I'm doing myself right now is running tests with a much lower setting 
for the delay time--am testing 20ms right now.  I personally would be 
happy saying it's 10ms and that's it.  Is anyone using a time lower than 
that right now?  I seem to recall that 10ms was also the shortest interval 
Heikki used in his tests as well.



I get the feeling that what we have here is better than what we had
before, but I guess I'm a bit disappointed we still have 3 magic
parameters, or 5 if you count your hard-coded ones also.


I may be able to eliminate more of them, but I didn't want to take them 
out before beta.  If it can be demonstrated that some of these parameters 
can be set to specific values and still work across a wider range of 
applications than what I've tested, then there's certainly room to fix 
some of these, which actually makes some things easier.  For example, I'd 
be more confident fixing the weighted average smoothing period to a 
specific number if I knew the delay was fixed, and there's two parameters 
gone.  And the multiplier is begging to be eliminated, just need some more 
data to confirm that's true.
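
To make the discussion of the smoothing period and multiplier concrete, here is
a loose editorial sketch of the kind of just-in-time estimate being talked
about; the names and constants are invented for illustration and do not
reflect the actual patch:

    /*
     * Loose sketch of a just-in-time cleaning target: smooth the recent
     * buffer-allocation rate and clean ahead by a safety multiplier.
     */
    #define SMOOTHING_SAMPLES   16      /* hypothetical smoothing period */
    #define CLEAN_MULTIPLIER    2.0     /* hypothetical look-ahead factor */

    static double smoothed_alloc = 0.0; /* smoothed allocations per bgwriter round */

    static int
    buffers_to_clean(int allocs_since_last_round, int reusable_clean_buffers)
    {
        /* exponential moving average of the allocation rate */
        smoothed_alloc += ((double) allocs_since_last_round - smoothed_alloc)
                          / SMOOTHING_SAMPLES;

        /* aim to keep a multiple of the expected demand already clean */
        int target = (int) (smoothed_alloc * CLEAN_MULTIPLIER);
        int deficit = target - reusable_clean_buffers;

        return (deficit > 0) ? deficit : 0;     /* nothing to do if we're ahead */
    }

Fixing the delay would then leave only the smoothing period and the multiplier
to argue about, which is the point being made above.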



There's still no formal way to tune these. As long as we have *any*
magic parameters, we need a way to tune them in the field, or they are
useless. At very least we need a plan for how people will report results
during Beta. That means we need a log_bgwriter (better name, please...)
parameter that provides information to assist with tuning.


Once I got past the does it work? stage, I've been doing all the tuning 
work using a before/after snapshot of pg_stat_bgwriter data during a 
representative snapshot of activity and looking at the delta.  Been a 
while since I actually looked into the logs for anything.  It's very 
straightforward to put together a formal tuning plan using the data in 
there, particularly compared to the impossibility of creating such a 
plan in the current code.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?

2007-09-07 Thread Simon Riggs
On Fri, 2007-09-07 at 06:36 +0200, Florian G. Pflug wrote:
 Tom Lane wrote:
  Florian G. Pflug [EMAIL PROTECTED] writes:
  So I believe you're right, and we can skip taking the lock in the no
  xid case 

Sounds good.

 - I actually think with just a little bit of more work, we
  can go even further, and get rid of the ReadNewTransactionId() call
  completely during snapshotting.
  
  [ squint... ]  This goes a bit far for me.  In particular, I think this
  will fail in the edge case when there are no live XIDs visible in
  ProcArray.  You cannot go back and do ReadNewTransactionId afterward,
  at least not without re-scanning the ProcArray a second time, which
  makes it at best a questionable win.
 
 Why would it?

I think the additional suggestion goes a bit too far. You may be right,
but I don't want to change the transaction system in advanced ways this
close to the next release. We may have difficulty spotting bugs in that
thinking during beta.

lazy XID assignment will reduce the number of times
GetNewTransactionId() is called, and if we also avoid taking the
ProcArrayLock for CommitTransaction() then we will have significantly
reduced contention for mixed workloads (i.e. most real ones).

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] left outer join vs subplan

2007-09-07 Thread Filip Rembiałkowski
2007/9/6, Simon Riggs [EMAIL PROTECTED]:

 The query formulation does seem a fairly common one.

  First query:
  explain analyze
  select *
  from
   a
   left outer join (
   select b.id, sum(b.val)
   from b
   group by b.id
   ) bagg
   on bagg.id = a.id
  where
   a.id  1
  order by a.addon, a.id
  limit 100;


 The value of sum(b.val) is never used in the query, so the aggregate
 itself could be discarded. I suspect there are other conditions you
 aren't showing us that would make this impossible?

The value of sum(b.val) is being output in the select *, so saying
it's never used is an oversimplification. But it's actually not used
in any join, or filter. That should be enough to optimize...


 The aggregate prevents the condition bagg.id = a.id from being pushed
 down so that we know b.id = a.id. If we knew that then we could use b.id
 = ? as an index condition to retrieve the rows.

That's exactly the point... But if we all can see it, maybe it's
possible to code it?


Cheers,
Filip Rembiałkowski

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


[HACKERS] GIN readme is out of date

2007-09-07 Thread Heikki Linnakangas
access/gin/README describes the Gin interface, but it hasn't been
updated since the change to the extractQuery interface that allows a "no query
can match" return value.

Since that interface documentation has been copied to the manual,
gin.sgml, section 52.2, which is up to date, how about we just remove it
from the README?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-07 Thread Greg Smith

On Fri, 7 Sep 2007, Simon Riggs wrote:

I think we should do some more basic tests to see where those outliers 
come from. We need to establish a clear link between number of dirty 
writes and response time.


With the test I'm running, which is specifically designed to aggravate 
this behavior, the outliers on my system come from how Linux buffers 
writes.  I can adjust them a bit by playing with the parameters as 
described at http://www.westnet.com/~gsmith/content/linux-pdflush.htm but 
on the hardware I've got here (single 7200RPM disk for database, another 
for WAL) they don't move much.  Once /proc/meminfo shows enough Dirty 
memory that pdflush starts blocking writes, game over; you're looking at 
multi-second delays before my plain old IDE disks clear enough debris out 
to start responding to new requests even with the Areca controller I'm 
using.



Perhaps output the number of dirty blocks written on the same line as
the output of log_min_duration_statement so that we can correlate
response time to dirty-block-writes on that statement.


On Linux at least, I'd expect this won't reveal much.  There, the 
interesting correlation is with how much dirty data is in the underlying 
OS buffer cache.  And exactly how that plays into things is a bit strange 
sometimes.  If you go back to Heikki's DBT2 tests with the background 
writer schemes he tested, he got frustrated enough with that disconnect 
that he wrote a little test program just to map out the underlying 
weirdness: 
http://archives.postgresql.org/pgsql-hackers/2007-07/msg00261.php


I've confirmed his results on my system and done some improvements to that 
program myself, but pushed further work on it to the side to finish up the 
main background writer task instead.  I may circle back to that.  I'd 
really like to run all this on another OS as well (I have Solaris 10 on my 
server box but not fully setup yet), but I can only volunteer so much time 
to work on all this right now.
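
To give a flavor of the sort of measurement involved, here is a minimal 
standalone sketch -- not Heikki's program, just an illustration of timing how 
long fsync() stalls once the OS has accumulated a lot of dirty buffered data:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int
main(void)
{
    char    buf[8192];
    int     fd = open("testfile", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    struct timeval start, end;
    long    i;

    memset(buf, 'x', sizeof(buf));
    for (i = 0; i < 100000; i++)        /* dirty roughly 800 MB via the OS cache */
        write(fd, buf, sizeof(buf));

    gettimeofday(&start, NULL);
    fsync(fd);                          /* how long does this stall? */
    gettimeofday(&end, NULL);

    printf("fsync took %.3f s\n",
           (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6);
    close(fd);
    return 0;
}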


If there's anything that needs to be looked at more carefully during tests 
in this area, it's getting more data about just what the underlying OS is 
doing while all this is going on.  Just the output from vmstat/iostat is 
very informative.  Those using DBT2 for their tests get some nice graphs 
of this already.  I've done some pgbench-based tests that included that 
before that were very enlightening but sadly that system isn't available 
to me anymore.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Hash index todo list item

2007-09-07 Thread Martijn van Oosterhout
On Thu, Sep 06, 2007 at 01:08:59PM -0500, Kenneth Marshall wrote:
 Since we already have to check the actual tuple values for any index
 lookup in postgresql, we could only store the full hash value and the
 corresponding TIDs in the bucket. Then when we lookup an item by
 calculating its hash, if the exact hash is not present in the bucket,
 then we know that the item is not in the index.

Sounds like you'd be returning a bitmap for use with a bitmap scan.
That's a different take on other suggestions I've heard and would allow
a hash index to have an almost unlimited key size yet flexible
matching... (combined with other indexes, or even just the same index).

Neat.
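
For illustration, a rough sketch of that bucket layout (not the actual hash 
AM code; it assumes the usual postgres.h / storage/itemptr.h definitions of 
uint32 and ItemPointerData):

/*
 * Entries in a bucket are kept sorted by the full 32-bit hash value, so a
 * lookup can binary-search for the first candidate, walk forward over
 * duplicates, and return heap TIDs to be rechecked against the real key.
 */
typedef struct HashEntry
{
    uint32          hashval;    /* full hash of the indexed value */
    ItemPointerData heapptr;    /* TID of the heap tuple to recheck */
} HashEntry;

static int
first_candidate(const HashEntry *entries, int nentries, uint32 target)
{
    int     lo = 0,
            hi = nentries;

    while (lo < hi)
    {
        int     mid = (lo + hi) / 2;

        if (entries[mid].hashval < target)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* caller checks lo < nentries and entries[lo].hashval == target */
    return lo;
}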

Have a nice day,
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?

2007-09-07 Thread Tom Lane
Florian G. Pflug [EMAIL PROTECTED] writes:
 Why would it? The idea was to remember the largest committed xid, and that
 won't go away just because the proc array is rather empty xid-wise.

I hadn't fully absorbed this idea last night, but now that I have, I'm
starting to think it's a good one.

 (That slight lagging of largestCommittedXid might cause some tuples not to
 be VACUUMed though, so we might want to update largestCommittedXid for
 ABORTs too, and probably rename it to largestNonRunningXid or whatever ;-) ).

LargestCompletedXid, perhaps?

 This would rid us of the rather complicated entanglement of XidGenLock and
 the ProcArrayLock, lessen the lock contention, and reduce the average snapshot
 size a bit.

In view of Simon's earlier comments at
http://archives.postgresql.org/pgsql-hackers/2007-07/msg00948.php
it seems like not having to take the XidGenLock during GetSnapshotData
ought to be a pretty serious win performance-wise, and it might open up
some other possibilities that are now foreclosed by the risk of deadlock
between the two locks.

I've spent the past hour or so trying to consolidate the comments in
GetSnapshotData and related places into a single chunk of text to be
added to src/backend/access/transam/README.  Attached is what I have so
far --- this incorporates the idea of not taking ProcArrayLock to exit
an XID-less transaction, but not yet Florian's idea.  I think it'd get
simpler if we changed to that, but am posting this for comments.

regards, tom lane


Interlocking transaction begin, transaction end, and snapshots
--

We try hard to minimize the amount of overhead and lock contention involved
in the frequent activities of beginning/ending a transaction and taking a
snapshot.  Unfortunately, we must have some interlocking for this, because
it is critical that all backends agree on the commit order of transactions.
For example, suppose an UPDATE in xact A is blocked by xact B's prior
update of the same row, and xact B is doing commit while xact C gets a
snapshot.  Xact A can complete and commit as soon as B releases its locks.
If xact C's GetSnapshotData sees xact B as still running, then it had
better see xact A as still running as well, or it will be able to see two
tuple versions - one deleted by xact B and one inserted by xact A.  That
is, if A thinks it committed after B, C had better think the same.  We
enforce this by not allowing any transaction to exit the set of running
transactions while a snapshot is being taken.  (This rule is probably
stronger than necessary, but see the next problem.)  The implementation
of this is that GetSnapshotData takes the ProcArrayLock in shared mode
(thereby allowing multiple backends to take snapshots in parallel), but
xact.c must take the ProcArrayLock in exclusive mode while clearing
MyProc->xid at transaction end (either commit or abort).

Here is another variant of the risk scenario:

1. Xact A is running (in Read Committed mode).
2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is
   swapped out before it can acquire ProcArrayLock.
3. Xact B gets new XID (>= C's xmax), makes changes and commits.
4. Xact A changes some row R changed by xact B and commits.
5. Xact C finishes getting its snapshot data.  It sees xact A as done,
   but sees xact B as still running (since B >= xmax).

Now C will see R changed by xact B and then xact A, *but* does not see
other changes made by xact B.  If C is supposed to be in Serializable mode,
this is wrong.

To prevent this it is necessary that GetSnapshotData acquire ProcArrayLock
before it calls ReadNewTransactionId.  This prevents xact A from exiting
the set of running transactions seen by xact C.  Therefore both A and B
will be seen as still running => no inconsistency.

In short, then, the rule is that no transactions may exit the set of
currently-running transactions between the time we fetch xmax and the time
we finish building our snapshot.  However, this restriction only applies
to transactions that have an XID --- read-only transactions can end without
acquiring ProcArrayLock, since they don't affect anyone else's snapshot.

We must also require GetNewTransactionId to store the new XID into the
shared ProcArray before releasing XidGenLock.  This ensures that when
GetSnapshotData calls ReadNewTransactionId (which also takes XidGenLock),
all active XIDs before the returned value of nextXid are already present in
the ProcArray and can't be missed by GetSnapshotData.  Unfortunately, we
can't have GetNewTransactionId take ProcArrayLock to do this, else it could
deadlock against GetSnapshotData.  Therefore, we simply let
GetNewTransactionId store into MyProc->xid without any lock.  We are
thereby relying on fetch/store of an XID to be atomic, else other backends
might see a partially-set XID.  (NOTE: for multiprocessors that need
explicit memory access fence instructions, this means that

Re: [HACKERS] HEAD build troubles, buildfarm misconfigurations

2007-09-07 Thread Andrew Dunstan



Robert Treat wrote:

On Wednesday 05 September 2007 12:01, Tom Lane wrote:
  

Robert Treat [EMAIL PROTECTED] writes:


On Wednesday 05 September 2007 00:06, Tom Lane wrote:
  

BTW, on re-reading that, it seems a tad surprising to get an error right
there --- if postgres_fe.h or anything it includes were broken, then the
build should have failed earlier.  Is the failure reproducible in
repeated tries?


Yes.  Just for giggles I played with the configure flags as well, but it
always errors in that spot.
  

Hmph.  I checked the CVS logs and there haven't been any recent changes
that seem like they could be related.

It might be worth getting gcc -E output and looking to see what that
part of netdb.h looks like after macro expansion.




Sorry, my C-fu is lacking. My makefile claims it uses gcc -E, but I'm not sure 
how to get a look at results after macro expansion 

However... I looked again at the output of the make command on some Ubuntu 
buildfarm members, and the results were different enough that I got suspicious 
of the flex/bison derived files in the snapshot, so I decided to build from 
CVS and that works fine.  FWIW I am using flex 2.5.31 and bison 2.1, not sure 
what is used to produce the snapshots... 
  


scan.c in snapshot says:

#define YY_FLEX_MAJOR_VERSION 2
#define YY_FLEX_MINOR_VERSION 5


gram.c in snapshot says:

/* A Bison parser, made by GNU Bison 1.875.  */

cheers

andrew



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread apoc9009

Trevor Talbot wrote:

Backup 12/24/2008 Version 2
/pg/backup/12_24_2008/base/rcvry.rcv        --- Basebackup
/pg/backup/12_24_2008/changes/0001.chg      --- Changed Data
                     /changes/0002.chg      --- Changed Data
                     /changes/0003.chg      --- Changed Data
                     ...
                     /changes/0010.chg      --- Changed Data
                     /changes/0001.rsf      --- Recovery Stripeset File
                                                 (10 MByte) addon of Basebackup;
                                                 delete the *.chg files

If a stripe set of 10 *.chg files exists, they should be converted or merged
into one bigger *.rsf stripe-set file (to prevent having millions of *.chg files).





Why?  What does this actually do?
This is the structure on the FTP site of the backup server, not locally where 
the postmaster runs!


Locally the postmaster can create a base backup (rcvry.rcv); then the 
FTP daemon logs on to the remote FTP backup server and uploads this file 
to keep it safe:

/pg/backup/12_24_2008/
/pg/backup/12_24_2008/base/rcvry.rcv  --- Basebackup

Next:
If a new record is written to the database, then the postmaster locally 
creates a change file named 0001.chg.

After this, the local FTP daemon transfers this file to

/pg/backup/12_24_2008/changes/0001.chg  --- Changed Data

Then

/pg/backup/12_24_2008/changes/0002.chg  --- Changed Data
/pg/backup/12_24_2008/changes/0003.chg  --- Changed Data

/pg/backup/12_24_2008/changes/0010.chg  --- Changed Data


Then the FTP backup daemon merges the last 10 *.chg files into one larger
*.RSF file online, while logged into the FTP backup server, and deletes those
10 *.chg files online (after this, the last 10 *.chg files created by the
postmaster can also be deleted locally where the postmaster runs).

The benefit:
If the machine the database runs on has a malfunction, then the last record
of the database catalogue is already backed up to a *.chg or *.RSF file.

That is my idea.

Apoc

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread apoc9009

Andrew Sullivan wrote:

It seems that what you want is near-real-time online backups with _no
cost_, which is not a feature that I think anyone will ever work on.  
A

100% Correct!

I think anyone would agree that a database is a very important piece of 
software for a wide range of people. It is absolutely normal that people who 
use databases want their data safe and up to date, down to the last customer 
record (for example, an order from an online shop).

Very large databases, with sizes of well over 1 terabyte, cannot be backed up 
continuously using dumps or the other common backup methods.

The Abstract need is:

1. An exact (1:1) copy of the production database should be mirrored on a 
   different machine, connected via LAN/WAN (as a backup).
2. The backup operations should normally be done while the database is 
   online, with very small bandwidth, I/O and CPU costs.
3. Per month or week (or configurable), a new backup timeline/version 
   should be started (based on the base backup).
4. A new base and its streaming backup configuration should be 
   hot-addable to a running production database (without shutdown or lock).
5. A remote operation-manager interface (GUI) should be available for 
   monitoring and maintaining the backups (maybe in pgAdmin).
6. If the production database and its replicated mirror crash (hardware 
   failure or provider problems), recovery should be done very fast.


(An idea for 5, client side)...
It would be a great feature if the psql client had a local daemon that could 
continuously download the backup catalogues from the WAN backup FTP server to 
the local admin machine. Then the admin would be able to burn the DB backups 
to DVD or write them to tape on his local machine at his company (if the 
backup replication server also fails, or the Internet provider has problems, 
the DB admin can apply his local backups from DVD to a new database on a new 
machine at any time).


Apoc

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[HACKERS] Sorting the Stop word lists

2007-09-07 Thread Simon Riggs

I notice we sort the stop word list after we read it into memory.

Wouldn't it be easier to

1. Sort the stopword lists in the main distribution

2. Require them to be sorted

3. Remove the sort from readstoplist()

We should at the very least do (1) to improve the sort speed at startup.
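
For what it's worth, a sketch of what that per-backend work amounts to -- not 
the actual tsearch code, names here are illustrative only:

/*
 * Sketch: sort the stop words once after loading them, so that later
 * membership tests can use bsearch().
 */
#include <stdlib.h>
#include <string.h>

static int
comparestr(const void *a, const void *b)
{
    return strcmp(*(char *const *) a, *(char *const *) b);
}

static void
sort_stopwords(char **wordlist, int nwords)
{
    qsort(wordlist, nwords, sizeof(char *), comparestr);
}

static int
is_stopword(char *const *wordlist, int nwords, char *word)
{
    return bsearch(&word, wordlist, nwords, sizeof(char *), comparestr) != NULL;
}

Pre-sorting the distributed files would let (3) drop the qsort() call, at the 
cost of requiring the files to stay sorted.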

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Sorting the Stop word lists

2007-09-07 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 I notice we sort the stop word list after we read it into memory.

I see nothing wrong with that; it only happens once per backend session,
and it makes maintenance of the files easier.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


[HACKERS] integrated tsearch doesn't work with non utf8 database

2007-09-07 Thread Pavel Stehule
Hello

Last time I checked a utf8 database. Now I have checked a latin2-encoded
database. I used the dictionaries from the last test.

client_encoding | utf8
lc_collate  | cs_CZ.iso-8859-2
lc_ctype| cs_CZ.iso-8859-2
lc_messages | cs_CZ.iso-8859-2
lc_monetary | cs_CZ.iso-8859-2
lc_numeric  | cs_CZ.iso-8859-2
lc_time | cs_CZ.iso-8859-2

postgres=# create Text search dictionary cspell2(template=ispell,
afffile=czech, dictfile=czech);
CREATE TEXT SEARCH DICTIONARY
postgres=# alter text search configuration cs alter mapping for word,
lword  with cspell2, simple;
ALTER TEXT SEARCH CONFIGURATION
postgres=# select ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
ERROR:  character 0xc3a5 of encoding UTF8 has no equivalent in LATIN2
CONTEXT:  SQL function ts_debug statement 1
postgres=#


The database was initialised with:

/usr/local/pgsql/bin/initdb  --encoding=latin2
--locale=cs_CZ.iso-8859-2 -D /usr/local/pgsql/data/

The locale settings are correct:


postgres=# set client_encoding to utf8;
SET
postgres=# select upper('Příliš žluťoučký kůň se napil žluté vody');
  upper
------------------------------------------
 PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ SE NAPIL ŽLUTÉ VODY
(1 row)

Regards
Pavel Stehule

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?

2007-09-07 Thread Florian G. Pflug

Tom Lane wrote:

I've spent the past hour or so trying to consolidate the comments in
GetSnapshotData and related places into a single chunk of text to be
added to src/backend/access/transam/README.  Attached is what I have so
far --- this incorporates the idea of not taking ProcArrayLock to exit
an XID-less transaction, but not yet Florian's idea.  I think it'd get
simpler if we changed to that, but am posting this for comments.



Interlocking transaction begin, transaction end, and snapshots
--

We try hard to minimize the amount of overhead and lock contention involved
in the frequent activities of beginning/ending a transaction and taking a
snapshot.  Unfortunately, we must have some interlocking for this, because
it is critical that all backends agree on the commit order of transactions.

This is actually still a slightly stronger requirement than what we really
need I think.

To simplify the following discussion, A -> B shall mean that transaction A saw
B as committed. Conversely, A !-> B shall mean that A treats B as in-progress.
If A was in read-committed mode, the visibility refers to the latest snapshot
that A used.

Now assume A and B commit at nearly the same time, and for two other
transactions C and D the following holds:
C -> A, C !-> B but
D !-> A, D -> B.

This would violate the requirement that the commit order is globally
agreed upon, yet as long as both A !-> B and B !-> A hold, there is no
conflict. (Note that if A and B are serializable, A !-> B and B !-> A
imply that A and B cannot have touched the same record and have
both committed - one would have been aborted due to a
serialization error.)

I must admit, though, that this is a quite academic case, since the
prerequisite A !-> B and B !-> A is something we have no control over
for read-committed transactions - who knows when they might have taken
their last snapshot...

Still, I wanted to mention this because I believe that the minimal
requirement that we actually *need* to enforce is
A -> B and B -> C imply A -> C.  (T1)

The actual implementation will probably always have to enforce
something slightly stronger, but it's still nice to know the
minimal guarantee needed to be able to judge correctness.


For example, suppose an UPDATE in xact A is blocked by xact B's prior
update of the same row, and xact B is doing commit while xact C gets a
snapshot.  Xact A can complete and commit as soon as B releases its locks.
If xact C's GetSnapshotData sees xact B as still running, then it had
better see xact A as still running as well, or it will be able to see two
tuple versions - one deleted by xact B and one inserted by xact A. 

In my notation this becomes: A -> B and C !-> B implies C !-> A.

This then follows from (T1) - assume that A -> B, C !-> B but C -> A;
then by (T1), C -> B follows from C -> A and A -> B, which contradicts
C !-> B.


We
enforce this by not allowing any transaction to exit the set of running
transactions while a snapshot is being taken.  (This rule is probably
stronger than necessary, but see the next problem.)  The implementation
of this is that GetSnapshotData takes the ProcArrayLock in shared mode
(thereby allowing multiple backends to take snapshots in parallel), but
xact.c must take the ProcArrayLock in exclusive mode while clearing
MyProc->xid at transaction end (either commit or abort).

Agreed. We *do* enforce a strict global ordering of commits and snapshots.
This then guarantees (T1) because if A -> B and B -> C, then A *must*
have taken its snapshot after B committed, and B in turn *must* have
taken its snapshot after C committed, so surely A -> C will hold too.


Here is another variant of the risk scenario:

1. Xact A is running (in Read Committed mode).
2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is
   swapped out before it can acquire ProcArrayLock.
3. Xact B gets new XID (>= C's xmax), makes changes and commits.
4. Xact A changes some row R changed by xact B and commits.
5. Xact C finishes getting its snapshot data.  It sees xact A as done,
   but sees xact B as still running (since B >= xmax).

Now C will see R changed by xact B and then xact A, *but* does not see
other changes made by xact B.  If C is supposed to be in Serializable mode,
this is wrong.

I never really grasped why we need to assume serializable here - this seems
wrong if C is read-committed too. Seeing only half of a transaction's changes
can never be right, can it?


To prevent this it is necessary that GetSnapshotData acquire ProcArrayLock
before it calls ReadNewTransactionId.  This prevents xact A from exiting
the set of running transactions seen by xact C.  Therefore both A and B
will be seen as still running = no inconsistency.

Another point of view is that determining the xmax of a snapshot really *is*
part of taking the snapshot. Since we have already established that we need to
serialize snapshotting and committing, it follows that we must not allow
commits to happen between the 

Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?

2007-09-07 Thread Tom Lane
Florian G. Pflug [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Here is another variant of the risk scenario:
 
 1. Xact A is running (in Read Committed mode).
 2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is
 swapped out before it can acquire ProcArrayLock.
 3. Xact B gets new XID (>= C's xmax), makes changes and commits.
 4. Xact A changes some row R changed by xact B and commits.
 5. Xact C finishes getting its snapshot data.  It sees xact A as done,
 but sees xact B as still running (since B >= xmax).
 
 Now C will see R changed by xact B and then xact A, *but* does not see
 other changes made by xact B.  If C is supposed to be in Serializable mode,
 this is wrong.

 I never really grasped why we need to assume serializable here - this seems
 wrong if C is read-committed too. Seeing only half of a transaction's changes
 can never be right, can it?

You'd be surprised ;-) ... Read Committed is not totally consistent,
even considering a single statement using a single snapshot, because of
the rules about UPDATEs starting from the latest committed version.
But let's not get into that today.

Anyway, one thing I realized while making this text is that the example
cases mentioned in CommitTransaction and GetSnapshotData really were
basically the same case.  The difference is that in this case, the
failure to see B as committed is closely related to the way that xmax is
computed.  We could get rid of this example if we switched over to your
LatestCompletedXid proposal.

 I had a rather hard time understanding these things initially - at least
 for me, the enlightenment came when I realized that most of the locking
 in there actually ensures that (T1) holds - or in other words, that
 the relation A sees B as committed *has* to be a transitive one.
 So I'd like to see this reasoning in that README - though maybe I'm the
 only one to whom this is the clearest wording...

I'll put some of this into the README.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Oddity with psql \d and pg_table_is_visible

2007-09-07 Thread Decibel!
On Wed, Sep 05, 2007 at 03:27:50PM -0400, Tom Lane wrote:
 Decibel! [EMAIL PROTECTED] writes:
  While this is correct on a per-relation level, I'm thinking that it's
  not what we'd really like to have happen in psql. What I'd like \d to do
  is show me everything in any schema that's in my search_path, even if
  there's something higher in the search_path that would over-ride it.
  ISTM that's what most people would expect out of \d.
 
 I don't agree with that reasoning in the least, particularly not if you
 intend to fix it by redefining pg_table_is_visible() ...
 
No, pg_table_is_visible is correct as-is.

 What will happen if we change \d to work that way is that it will show
 you a table, and you'll try to access it, and you'll get the wrong table
 because the access will go to the one that really is visible.

That's why I was suggesting that any table showing up in \d that in fact
wasn't visible be marked somehow, either with a separate field, or by
sticking an * after its name.

This is confusing because when using \d you generally think in terms of
what schemas are in your search path, not if an individual object has
been superseded by something further up the chain.
-- 
Decibel!, aka Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)


pgpSd6sbfMLD2.pgp
Description: PGP signature


[HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-07 Thread Avery Payne
Be forewarned - this is probably a very long post, and I'm just a mere 
mortal (ie. admin) who doesn't write copious amounts of C code.  Take 
the following posts and suggestions with a grain of salt.


So I've been seeing/hearing all of the hoopla over vertical databases 
(column stores), and how they'll not only slice bread but also make 
toast, etc.  I've done some quick searches for past articles on 
C-Store, Vertica, Column Store, and Vertical Database, and have 
seen little discussion on this.  And then a funny thought occurs to me - 
when I look at the directory structure and file layout of a PostgreSQL 
database, I see that each OID corresponds to a table, which corresponds 
to (generally) a single file. Then I have a second funny thought - what 
if there was a low-friction, low-cost-of-implementation way to bring 
similar advantages to PostgreSQL without major alterations, recoding, 
etc?  Finally it occurs to me that PostgreSQL already does something 
similar but it could do it so much better, with only one language change 
and minor changes to the storage layout.  So here's my plum-crazy 
proposal (and I've made some before - see 
http://archives.postgresql.org/pgsql-odbc/2006-10/msg00040.php - and 
they not only made it into production, but they are in active use by me 
on a weekly basis - Thanks Hiroshi!!!), bear with me...


Make one small, very tiny syntactic change to CREATE TABLE that 
includes a new keyword, COLUMN-STORE or something similar.  I don't 
care where it appears as long as it's after the CREATE TABLE.  You 
would not have to break any existing SQL conventions, PostgreSQL would 
continue to be SQL compliant, and given the odd wording, I highly doubt 
that the folks who work on SQL keywords will end up using it at any 
point in time.  If adding COLUMN-STORE is objectionable because it will 
cloud the compliance of the language then simply move the 
functionality into the table space functionality.  In hindsight, it 
might even belong there instead.  So, instead of specifying it by 
table,  create a table space that has an attribute Column Storage set 
as active.  When inactive, it uses the traditional one-file-per-table 
layout.


Make each table column capable of receiving an OID.  This will be 
critical for the following steps...


If a table is created with COLUMN-STORE as an option, then it will 
continue to behave in the same way it always has, but the storage will 
be different.  Each column in the table will be represented by a single 
file, with the file name being (naturally) the OID.  
INSERT/UPDATE/DELETE would function as they always have, etc. Nothing would 
change.  Except how the data is stored.  The existing TOAST mechanisms 
continue to work - because the engine would treat each file as a 
single-column table! 

One additional column would be added to the store, an invisible one 
that not only tracks the OID for the rows in this type of setup, but 
also the state of the row.  Let's call this the Control Column.  Given 
that the metadata size for the row would be fixed/constant, we won't 
have to worry about what is in the other columns and rows, they can 
be any size.  BTW, the Control Column would be just another column 
from the storage engine's point of view.  It just happens to be one that 
no-one can see, other than the database (and maybe the administrator).


When you go to VACUUM a table, you would treat each column as a 
single-row table, so if a row is a candidate for a VACUUM reclamation, 
then it will adjust each column an equal amount.  Under no 
circumstances would you have columns out of sync, so if a record goes, 
it means each adjacent column goes with it.  This sounds disk-intensive 
at first, until you realize that the admin will have made a conscious 
decision to use this format, and understands the advantages/drawbacks to 
this method.  So it takes a little longer to VACUUM, I don't care, 
because as an admin I will have specified this layout for a reason - to 
do OLAP, not OLTP.  Which means, I rarely VACUUM it.  Add to this the 
high efficiency you would gain by packing more records into buffers per 
read, and most of the losses you take in re-reading data would really 
not amount to as big a loss as you might think.


DELETE would simply mark a row off as deleted in the Control Column.  
If the storage engine needed to reclaim a row, it would not have to look 
any further than the control column to find an empty spot where it 
could overwrite data.


INSERT/UPDATE continue to work as they always have.  The storage engine 
would perceive each column as a single-column table, meaning that the 
existing TOAST mechanisms continue to work!   Nothing needs to change 
there.  The real change would be that the table's columns would be 
split up into individual updates, and the Control Column would be 
used to keep all of the records in sync.


Why bother with this?  Because, when all is said and done, you will 
find yourself with a rough 

Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-07 Thread Josh Berkus
Avery,

 Make one small, very tiny syntactic change to CREATE TABLE that
 includes a new keyword, COLUMN-STORE or something similar. 

If someone writes the rest of the code, I doubt the syntax will be the 
holdup.  But writing an efficient C-store table mechanism is much harder 
than I think you think it is; Vertica worked on it for a year and failed, 
and Paraccel took two years to succeed.   FYI, Paraccel is based on 
Postgres.

So, put up a pgfoundry project and start hacking a c-store table; I'm sure 
you'll get interest if you can make something work.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] integrated tsearch doesn't work with non utf8 database

2007-09-07 Thread Heikki Linnakangas
Pavel Stehule wrote:
 postgres=# select ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
 ERROR:  character 0xc3a5 of encoding UTF8 has no equivalent in LATIN2
 CONTEXT:  SQL function ts_debug statement 1

I can reproduce that. In fact, you don't need the custom config or
dictionary at all:

postgres=# CREATE DATABASE latin2 encoding='latin2';
CREATE DATABASE
postgres=# \c latin2
You are now connected to database latin2.
latin2=#  select ts_debug('simple','foo');
ERROR:  character 0xc3a5 of encoding UTF8 has no equivalent in LATIN2
CONTEXT:  SQL function ts_debug statement 1

It fails trying to lexize the string using the danish snowball stemmer,
because the danish stopword file contains character 'å' which doesn't
have an equivalent in LATIN2.

Now what the heck is it doing with the danish stemmer, you might ask.
ts_debug is implemented as a SQL function; EXPLAINing the complex SELECT
behind it, I get this plan:

latin2=# \i foo.sql
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Hash Join  (cost=2.80..1134.45 rows=80 width=100)
   Hash Cond: (parse.tokid = tt.tokid)
   InitPlan
     ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
           Filter: (oid = 3748::oid)
     ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
           Filter: (oid = 3748::oid)
   ->  Function Scan on ts_parse parse  (cost=0.00..12.50 rows=1000 width=36)
   ->  Hash  (cost=0.20..0.20 rows=16 width=68)
         ->  Function Scan on ts_token_type tt  (cost=0.00..0.20 rows=16 width=68)
   SubPlan
     ->  Limit  (cost=7.33..7.36 rows=1 width=36)
           ->  Subquery Scan dl  (cost=7.33..7.36 rows=1 width=36)
                 ->  Sort  (cost=7.33..7.34 rows=1 width=8)
                       Sort Key: m.mapseqno
                       ->  Seq Scan on pg_ts_config_map m  (cost=0.00..7.32 rows=1 width=8)
                             Filter: ((ts_lexize(mapdict, $1) IS NOT NULL) AND (mapcfg = 3765::oid) AND (maptokentype = $0))
     ->  Sort  (cost=6.57..6.57 rows=1 width=8)
           Sort Key: m.mapseqno
           ->  Seq Scan on pg_ts_config_map m  (cost=0.00..6.56 rows=1 width=8)
                 Filter: ((mapcfg = 3765::oid) AND (maptokentype = $0))
(21 rows)

Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict,
$1). That means that it will call ts_lexize on every dictionary, which
will try to load every dictionary. And loading danish_stem dictionary
fails in latin2 encoding, because of the problem with the stopword file.

We could rewrite ts_debug as a C-function, so that it doesn't try to
access any unnecessary dictionaries. It seems wrong to install
dictionaries in databases where they won't work in the first place, but
I don't see an easy fix for that. Any comments or better ideas?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


[HACKERS] equivilant to contrib tsearch trigger function in 8.3

2007-09-07 Thread Robert Treat
Just wondering if it is already in 8.3 with a new name, or if not if there are 
plans to add it?  TIA

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


[HACKERS] Follow-Up to A Silly Idea for Vertically-Oriented Databases

2007-09-07 Thread Avery Payne
In hindsight, I did miss quite a bit in my last post.  Here's a summary 
that might clear it up:


Add a single keyword that specifies that the storage format changes 
slightly.  The keyword should not affect SQL compliancy while still 
extending functionality.  It can be specified as either part of the 
CREATE TABLE statement or part of the tablespace mechanism.


When a table is created with this setting, all columns in a record are 
split vertically into individual, 1-column-wide tables, and each column 
in the table is assigned an OID.  Each OID corresponds to one of our 
1-wide tables.  An additional control column will be created that is 
only visible to the database and the administrator.  This column stores 
a single boolean flag indicating whether the record is allocated or not.  You 
might even be able to create a special bitmap index that is hidden, and 
just use existing bitmap functions in the index code.  In essence, this 
column helps keep all of the other columns in sync when dealing with rows.


When writing data to the table, each individual column will update, but 
the engine invisibly wraps together all of the columns into a single 
transaction.  That is, each row insert is still atomic and behaves like 
it normally would - either the insert succeeds or it doesn't.  Because 
the updates are handled by the engine as many separate tables, no 
special changes are required, and existing storage mechanisms (TOAST) 
continue to function as they always did.  This could be written as a 
super-function of sorts, one that would combine all of the smaller steps 
together and use the existing mechanisms.


Updates are performed in the same manner, with each column being 
rolled up into a single invisible mini-transaction for the given record.


Deletes are performed by marking not only the columns as deleted but 
also the control column as having that row available for overwrite.  I'm 
simplifying quite a bit but I think the general idea is understood.  
Yes, a delete will have significant overhead compared to an insert or 
update but this is a known tradeoff that the administrator is willing to 
make, so they can gain faster read speeds - ie. they want an 
OLAP-oriented store, not an OLTP-oriented store.


The control column would be used to locate records that can be 
overwritten quickly.  When a record is deleted, the control column's 
bitmap was adjusted to indicate that a free space was available.  The 
engine would then co-ordinate as it did above, but it can cheat - 
instead of trying to figure things out for each table, the offset to 
write to is already known, so the update proceeds as listed above, other 
than each part of the little mini-transaction writes to the same 
offset (ie. each column in the record will have the same hole, so 
when you go to write the record out, write it to the same record 
spot).  This is where the control column not only coordinates deletes 
but also inserts that re-use space from deleted records.


Hopefully that makes it a little clearer.




---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread Hannu Krosing
On Fri, 2007-09-07 at 16:41, apoc9009 wrote:
 Filip Rembiałkowski wrote:
  please take following remarks:
 thx, but if i need some advice form a scandinavian dickhead then i will 
 let you know this

Is this apoc9009 guy real ?

For some time I honestly believed (based in part on the English-like
language used in his postings) that he just can't understand what is written
in our documentation. 

My other suspicion was that he has never actually tried to do what he
claims, but has just click-checked some checkboxes in some third-party
backup software and is actually lamenting the lack of a box labeled
"Absolutely Reliable PostgreSQL backup".

But now I suspect that someone is just pulling our collective leg and
trolling under a false name to test how long we can stand it?

---
Hannu

P.S.: any ideas whom he suspects to be Scandinavian in this
thread? Or is "advice form a scandinavian dickhead" an idiom in some
language? 


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?

2007-09-07 Thread Tom Lane
Here's some revised text for the README file, based on using Florian's
idea of a global latestCompletedXid variable.  As I worked through it
I realized that in this design, XidGenLock gates entry of new XIDs into
the ProcArray while ProcArrayLock gates their removal.  Which is an
interesting sort of symmetry property.  It also turns out that the
reason we need to gate entry with XidGenLock is to keep from breaking
GetOldestXmin, rather than to ensure correctness of snapshots per se.

(Note: I refer in the text to ProcArrayEndTransaction(), which is a
function I'm thinking of putting into procarray.c to replace the current
inline-in-xact.c code that clears xid and related fields.)

Comments?

regards, tom lane


Interlocking transaction begin, transaction end, and snapshots
--

We try hard to minimize the amount of overhead and lock contention involved
in the frequent activities of beginning/ending a transaction and taking a
snapshot.  Unfortunately, we must have some interlocking for this, because
we must ensure consistency about the commit order of transactions.
For example, suppose an UPDATE in xact A is blocked by xact B's prior
update of the same row, and xact B is doing commit while xact C gets a
snapshot.  Xact A can complete and commit as soon as B releases its locks.
If xact C's GetSnapshotData sees xact B as still running, then it had
better see xact A as still running as well, or it will be able to see two
tuple versions - one deleted by xact B and one inserted by xact A.  Another
reason why this would be bad is that C would see (in the row inserted by A)
earlier changes by B, and it would be inconsistent for C not to see any
of B's changes elsewhere in the database.

Formally, the correctness requirement is "if a snapshot A considers
transaction X as committed, and any of transaction X's snapshots considered
transaction Y as committed, then snapshot A must consider transaction Y as
committed".

What we actually enforce is strict serialization of commits and rollbacks
with snapshot-taking: we do not allow any transaction to exit the set of
running transactions while a snapshot is being taken.  (This rule is
stronger than necessary for consistency, but is relatively simple to
enforce, and it assists with some other issues as explained below.)  The
implementation of this is that GetSnapshotData takes the ProcArrayLock in
shared mode (so that multiple backends can take snapshots in parallel),
but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
while clearing MyProc->xid at transaction end (either commit or abort).

ProcArrayEndTransaction also holds the lock while advancing the shared
latestCompletedXid variable.  This allows GetSnapshotData to use
latestCompletedXid + 1 as xmax for its snapshot: there can be no
transaction >= this xid value that the snapshot needs to consider as
completed.
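
A minimal sketch of the snapshot-side computation this implies -- not the 
actual patch code, and the exact home of latestCompletedXid (shown here in 
ShmemVariableCache) is only an assumption of the sketch:

    LWLockAcquire(ProcArrayLock, LW_SHARED);

    xmax = ShmemVariableCache->latestCompletedXid;
    TransactionIdAdvance(xmax);         /* i.e. xmax = latestCompletedXid + 1 */

    /* ... scan the ProcArray, collecting still-running XIDs < xmax ... */

    LWLockRelease(ProcArrayLock);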

In short, then, the rule is that no transaction may exit the set of
currently-running transactions between the time we fetch latestCompletedXid
and the time we finish building our snapshot.  However, this restriction
only applies to transactions that have an XID --- read-only transactions
can end without acquiring ProcArrayLock, since they don't affect anyone
else's snapshot nor latestCompletedXid.

Transaction start, per se, doesn't have any interlocking with these
considerations, since we no longer assign an XID immediately at transaction
start.  But when we do decide to allocate an XID, GetNewTransactionId must
store the new XID into the shared ProcArray before releasing XidGenLock.
This ensures that all top-level XIDs <= latestCompletedXid are either
present in the ProcArray, or not running anymore.  (This guarantee doesn't
apply to subtransaction XIDs, because of the possibility that there's not
room for them in the subxid array; instead we guarantee that they are
present or the overflow flag is set.)  If a backend released XidGenLock
before storing its XID into MyProc, then it would be possible for another
backend to allocate and commit a later XID, causing latestCompletedXid to
pass the first backend's XID, before that value became visible in the
ProcArray.  That would break GetOldestXmin, as discussed below.

We allow GetNewTransactionId to store the XID into MyProc->xid (or the
subxid array) without taking ProcArrayLock.  This was once necessary to
avoid deadlock; while that is no longer the case, it's still beneficial for
performance.  We are thereby relying on fetch/store of an XID to be atomic,
else other backends might see a partially-set XID.  This also means that
readers of the ProcArray xid fields must be careful to fetch a value only
once, rather than assume they can read it multiple times and get the same
answer each time.
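
In code form, the expected reader pattern is roughly this (illustrative only, 
not a quote of the actual routines; arrayP and the surrounding loop are 
assumed):

    volatile PGPROC *proc = arrayP->procs[index];
    TransactionId    xid = proc->xid;       /* fetch the shared field exactly once */

    /* skip if no XID assigned, or if it cannot matter to this snapshot (>= xmax) */
    if (!TransactionIdIsValid(xid) || TransactionIdFollowsOrEquals(xid, xmax))
        continue;

    snapshot->xip[count++] = xid;

All later tests use the local copy, never a re-read of the shared field.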

Another important activity that uses the shared ProcArray is GetOldestXmin,
which must determine a lower bound for the oldest xmin of any active MVCC
snapshot, system-wide.  

Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-07 Thread Avery Payne




Avery,

my ramblings snipped

If someone writes the rest of the code, I doubt the syntax will be the 
holdup.  But writing an efficient C-store table mechanism is much harder 
than I think you think it is; Vertica worked on it for a year and failed, 
and Paraccel took two years to succeed.   FYI, Paraccel is based on 
Postgres.

So, put up a pgfoundry project and start hacking a c-store table; I'm sure 
you;ll get interest if you can make something work.

-- 
--Josh

Well, I did say it was a *crazy* idea. :-)

Given that I would be starting from the ground-floor, learning not only
the innards of PostgreSQL but also C coding as well, I would probably
have to overcome near-insurmountable odds to make this project take off.  Still,
if I was crazy enough to think it, maybe I'll be crazy enough to 
try for it. ;-)

Just ignore my 2nd posting, I was trying to clarify some of the 
ramblings I was typing.





Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?

2007-09-07 Thread Florian G. Pflug

Tom Lane wrote:
Here's some revised text for the README file, based on using Florian's idea 
of a global latestCompletedXid variable.  As I worked through it I realized 
that in this design, XidGenLock gates entry of new XIDs into the ProcArray 
while ProcArrayLock gates their removal.  Which is an interesting sort of 
symmetry property.  It also turns out that the reason we need to gate entry 
with XidGenLock is to keep from breaking GetOldestXmin, rather than to ensure

correctness of snapshots per se.

I believe it would break both, no? If an xid <= latestCompletedXid is
not included in the snapshot, but later used for updates, the snapshot
will see those changes as committed when they really are not.

But other than that, it really sounds fine. It certainly explains things much
better than the comments in the existing code.

I noticed two rather cosmetic issues
.) latestCompletedXid sounds as if it might refer to the *last* completed xid,
but it actually refers to the largest / highest completed xid. So maybe we
should call it highestCompletedXid or largestCompletedXid.

.) Since you mention that we assume reading and writing int4s are atomic
operations, maybe we should mention that for safety's sake we mark the
corresponding pointers with volatile?

greetings, Florian Pflug



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?

2007-09-07 Thread Tom Lane
Florian G. Pflug [EMAIL PROTECTED] writes:
 I noticed two rather cosmetic issues
 .) latestCompletedXid sounds as it might refer to the *last* completed xid,
 but it actually refers to the largest / highest completed xid. So maybe we
 should call it highestCompletedXid or largestCompletedXid.

Actually that was an intentional choice: because of the wraparound
behavior of XIDs, the latest value is not necessarily numerically
largest.  I'm not wedded to it though.

 .) Since you mention that we assume reading and writing int4s are atomic
 operations, maybe we should mention that for safety's sake we mark the
 corresponding pointers with volatile?

Couldn't hurt.

I have a draft patch that I'll post shortly.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-07 Thread Josh Tolley
On 9/7/07, Avery Payne [EMAIL PROTECTED] wrote:

  Avery,

 my ramblings snipped

 If someone writes the rest of the code, I doubt the syntax will be the
 holdup. But writing an efficient C-store table mechanism is much harder
 than I think you think it is; Vertica worked on it for a year and failed,
 and Paraccel took two years to succeed. FYI, Paraccel is based on
 Postgres.

 So, put up a pgfoundry project and start hacking a c-store table; I'm sure
 you;ll get interest if you can make something work.

 --
 --Josh

 Well, I did say it was a *crazy* idea. :-)

 Given that I would be starting from the ground-floor, learning not only
 the innards of PostgreSQL but also C coding as well, I would probably
 have to overcome near-insurmountable odds to make this project take off.
 Still,
 if I was crazy enough to think it, maybe I'll be crazy enough to
 try for it. ;-)

 Just ignore my 2nd posting, I was trying to clarify some of the
 ramblings I was typing.

For whatever it's worth, I was reading about the same things today and
came up with the same basic idea, without the same level of
implementation details you've talked about, Avery. And it sounds
really neat. Hard, but neat, and potentially worth it if, say,
Paraccel doesn't open source their stuff first :)

But I don't know the PostgreSQL internals either, though I've been
known to throw together some C code now and again. So in short,
there's interest. Whether there's collective skill/free
time/motivation/insanity/etc. enough to make something useful of the
interest is another question altogether. :)

- Josh/eggyknap

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] equivilant to contrib tsearch trigger function in 8.3

2007-09-07 Thread Tom Lane
Robert Treat [EMAIL PROTECTED] writes:
 Just wondering if it is already in 8.3 with a new name, or if not if there 
 are 
 plans to add it?  TIA

tsvector_update_trigger(), see docs section 9.13.2 (at the moment)

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Follow-Up to A Silly Idea for Vertically-Oriented Databases

2007-09-07 Thread Decibel!
On Fri, Sep 07, 2007 at 02:58:03PM -0700, Avery Payne wrote:
 In hindsight, I did miss quite a bit in my last post.  Here's a summary 
 that might clear it up:
 
 Add a single keyword that specifies that the storage format changes 
 slightly.  The keyword should not affect SQL compliancy while still 
 extending functionality.  It can be specified as either part of the 
 CREATE TABLE statement or part of the tablespace mechanism.
 
 When a table is created with this setting, all columns in a record are 
 split vertically into individual, 1-column-wide tables, and each column 
 in the table is assigned an OIDs.  Each OID corresponds to one of our 
 1-wide tables.  An additional control column will be created that is 
 only visible to the database and the administrator.  This column stores 
 a single logical indicating if the record is allocated or not.  You 
 might even be able to create a special bitmap index that is hidden, and 
 just use existing bitmap functions in the index code.  In essence, this 
 column helps keep all of the other columns in sync when dealing with rows.

OIDs aren't the way to link this stuff together. It would make more
sense for there to be one file that stores all the MVCC and other row
overhead, and for that table to store ctids, because that will be the
fastest way to look up the columns.

 When writing data to the table, each individual column will update, but 
 the engine invisibly wraps together all of the columns into a single 
 transaction.  That is, each row insert is still atomic and behaves like 
 it normally would - either the insert succeeds or it doesn't.  Because 
 the updates are handled by the engine as many separate tables, no 
 special changes are required, and existing storage mechanisms (TOAST) 
 continue to function as they always did.  This could be written as a 
 super-function of sorts, one that would combine all of the smaller steps 
 together and use the existing mechanisms.
 
 Updates are performed in the same manner, with each column being 
 rolled up into a single invisible mini-transaction for the given record.

The problem is that the idea of rows being a string of bytes within a
page is spread pretty widely throughout the code; I'm pretty sure it
extends far beyond just smgr. At some point stuff becomes just datums,
but I don't know that there's a nice, clean line where that happens. I
think this is probably the biggest obstacle that you're facing.

 Deletes are performed by marking not only the columns as deleted but 
 also the control column as having that row available for overwrite.  I'm 
 simplifying quite a bit but I think the general idea is understood.  
 Yes, a delete will have significant overhead compared to an insert or 
 update, but this is a known tradeoff that the administrator is willing to 
 make in order to gain faster read speeds - i.e. they want an 
 OLAP-oriented store, not an OLTP-oriented store.

You do *not* want to try to change how MVCC works at the same time
you're doing this. There *may* be some possibility of changing things
afterwards, but trying to tackle that off the bat is suicide. On top of
that, HOT might well make this kind of optimization pointless.

 The control column would be used to locate records that can be 
 overwritten quickly.  When a record is deleted, the control column's 
 bitmap is adjusted to indicate that free space is available.  The 
 engine would then co-ordinate as it did above, but it can cheat - 
 instead of trying to figure things out for each table, the offset to 
 write to is already known, so the update proceeds as listed above, except 
 that each part of the little mini-transaction writes to the same 
 offset (i.e. each column in the record re-uses the same hole, so 
 the new record is written to the same slot in every column table).  
 This is where the control column not only coordinates deletes 
 but also inserts that re-use space from deleted records.
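
Continuing the hand-rolled sketch from above (again purely illustrative,
not an engine feature), the delete-and-reuse cycle described here would
look roughly like this:

-- "delete": mark the slot free in the control table (a real engine would
-- also mark the per-column entries dead)
BEGIN;
UPDATE sales_ctl SET live = false WHERE row_id = 1;
COMMIT;

-- a later insert re-uses the hole: every column writes to the same slot
BEGIN;
UPDATE sales_ctl  SET live = true              WHERE row_id = 1;
UPDATE sales_date SET sale_date = '2007-09-08' WHERE row_id = 1;
UPDATE sales_amt  SET amount = 99.00           WHERE row_id = 1;
COMMIT;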
-- 
Decibel!, aka Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)


Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-07 Thread Decibel!
On Sat, Sep 08, 2007 at 01:02:04AM +0300, Hannu Krosing wrote:
 On Fri, 2007-09-07 at 16:41, apoc9009 wrote:
  Filip Rembiałkowski wrote:
   please take following remarks:
  thx, but if i need some advice form a scandinavian dickhead then i will 
  let you know this
 
 Is this apoc9009 guy real?

Pretty much as soon as I saw that comment I just nuked the whole thread
and moved on. I suggest everyone else just do the same.
-- 
Decibel!, aka Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)


[HACKERS] apparent tsearch breakage on 64-bit machines

2007-09-07 Thread Tom Lane
On my x86_64 machine, CVS HEAD is throwing the following scary-looking
warnings:

to_tsany.c: In function 'pushval_morph':
to_tsany.c:247: warning: cast from pointer to integer of different size
to_tsany.c: In function 'to_tsquery_byid':
to_tsany.c:306: warning: cast to pointer from integer of different size
to_tsany.c: In function 'plainto_tsquery_byid':
to_tsany.c:344: warning: cast to pointer from integer of different size

Whether the code is actually safe or not, these are not acceptable.

regards, tom lane

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] apparent tsearch breakage on 64-bit machines

2007-09-07 Thread Tom Lane
I wrote:
 Whether the code is actually safe or not, these [warnings] are not acceptable.

On looking closer, it seems the intent is to pass an argument of
unspecified type through parse_tsquery to a PushFunction:

typedef void (*PushFunction)(void *opaque, TSQueryParserState state,
                             char *, int, int2);

extern TSQuery parse_tsquery(char *buf,
                             PushFunction pushval,
                             void *opaque, bool isplain);

That's fine, but not in a way that throws compiler warnings.  There is a
standard solution for this task within PG: the arguments should be
declared as Datum not void*.  Use the DatumGetFoo/FooGetDatum macros to
coerce back and forth.

Also, the declaration of typedef PushFunction really ought to provide
names for all the arguments, for documentation purposes.  (Dare I
suggest a comment block specifying the API?)

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly