Re: [HACKERS] Hash index todo list item
On Sun, 2007-02-09 at 13:04 -0500, Kenneth Marshall wrote:
> 2. Evaluate the performance of different hash index implementations and/or changes to the current implementation. My current plan is to keep the implementation as simple as possible and still provide the desired performance. Several hash index suggestions deal with changing the layout of the keys on a page to improve lookup performance, including reducing the bucket size to a fraction of a page or only storing the hash value on the page, instead of the index value itself.

You might find this patch useful: http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php

It implements the "just store the hash in the index" idea; it also sorts the entries in a bucket by the hash value, which allows binary search to be used to locate candidate matches. I was surprised that this didn't result in a performance improvement for the benchmarks that I ran, but I never got around to investigating further (either running more benchmarks or checking whether there was a bug in the implementation).

Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge it up to HEAD if you'd like.

-Neil

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
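The bucket layout Neil describes (entries sorted by hash value, located with binary search) can be sketched in miniature. This Python model is illustrative only; the (hash, tid) pair layout and the function name are assumptions, not the patch's actual C structures:

```python
from bisect import bisect_left

def candidate_tids(bucket, key_hash):
    """Return heap TIDs whose stored hash equals key_hash.

    bucket is a list of (hash, tid) pairs kept sorted by hash,
    mimicking a hash-index bucket page that stores only the
    hash value instead of the indexed key itself.
    """
    hashes = [h for h, _ in bucket]
    i = bisect_left(hashes, key_hash)   # binary search for first match
    tids = []
    while i < len(bucket) and bucket[i][0] == key_hash:
        tids.append(bucket[i][1])
        i += 1
    return tids

# Every returned TID is only a candidate: the executor must still
# fetch the heap tuple and compare the real key, since different
# keys can share a hash value.
```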
[HACKERS] Installation problem and a question
I'm running Vista x64 and while I can successfully install PG, I can't seem to get PL/Java installed. The installer doesn't recognize that I have a JRE installed (I do) and the option is grayed out during installation. So a couple of questions:

1) Should PL/Java be installable on Vista x64, or is there some known issue? What environment variables should be set, and how should I set them for a JRE vs. a JDK installation?

2) Where is the source code for the Windows installer? I've tried fairly hard to find it, including Googling, browsing development sites, and looking at the PG 8.2.4 source code directory, but no luck so far.

Sorry if this is a simple question, but I don't know where else to ask at this point. Among other things, I wanted to look at how the installer detects the existence of a JRE.
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Hi,

apoc9009 wrote:
> Thadt is Replication NOT Backup

I've now read all of your messages in this thread, but I simply fail to understand why you are that much opposed to the term 'replication'. I think the only thing which comes any close to what you're looking for is replication (in particular eager multi-master replication).

I'd recommend you familiarize yourself with the world of database replication. You already know the important chapter from our manual; learn that by heart. Then read [2] and [3]. :-)

Regards

Markus

[1]: Postgres advocacy wiki: http://developer.postgresql.org/index.php/Replication%2C_Clustering%2C_and_Connection_Pooling
[2]: Terms and Definitions of Database Replication http://www.postgres-r.org/documentation/terms
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Backup is not replication, but a replicated database can be treated as a good source for backups. Please take the following remarks:

1) in English you don't capitalize nouns
2) read what other people write to you and try to understand it
3) this is open source; try to be more cooperative instead of just crying for a ready-made solution

2007/9/7, apoc9009 [EMAIL PROTECTED]:
> Markus Schiltknecht schrieb:
> > Hi, apoc9009 wrote:
> > > Thadt is Replication NOT Backup
> > I've now read all of your messages in this thread, but I simply fail to understand why you are that much opposed to the term 'replication'. I think the only thing which comes any close to what you're looking for is replication (in particular eager multi-master replication).
> What is your Problem in understanding the Word Backup? Translation for you: A Backup is a File or Set of Files thadt contains the Data of your Business critical Informations. It should not be Archived on the same place, the same House or the same Room. A Replication Database has nothing to do with a Backup, it works only for Failover if the Primary Database has a Mailfunction. A good Backuptool is needed if you have Databases with sizes over 1 Terrabyte. The common Backup methods wont Work with Online Productiondatabases and without the Problem of Datalosses, this is only a Way for small and Mediumsize Databases, not for Hugh Databases. Keep in Mind: Backup is NOT Replication! Write it down 100 times and maybe you understand

-- Filip Rembiałkowski
Re: [HACKERS] left outer join vs subplan
> I think we need to build up a library of autogenerated queries, so we can do things which address multiple use cases. Can you tell us more about who/what generated it, so we can research?

Sorry, I can't publish a lot of information: it is one of the biggest Russian software companies, which is trying to migrate from MS SQL to PostgreSQL. MS SQL can optimize such queries into a form similar to the second query.

> The query formulation does seem a fairly common one.

It seems so to me too. Instead of SUM aggregates it can be MIN/MAX/AVG etc., or a more complex subquery. But pgsql usually optimizes non-aggregate subqueries rather well.

> The value of sum(b.val) is never used in the query, so the aggregate itself could be discarded. I suspect there are other conditions you aren't showing us that would make this impossible?

No: select *, i.e. all fields from the a and bagg tables.

> The aggregate prevents the condition bagg.id = a.id from being pushed down so that we know b.id = a.id. If we knew that then we could use b.id = ? as an index condition to retrieve the rows.

In this case, it's safe to push down the clause b.id = a.id. BTW, does pgsql understand that the query 'select id,sum() ... group by id' produces a result with unique id?

> Since we can't use the best technique, we use another. That then hits a third optimization problem. When an IndexScan is used to enforce order, we don't estimate how much of the table needs to be scanned before we start hitting rows. In the example you give we need to scan 65% of the

Why 65%? a.addon has only 100 unique values, and among the first 100 tuples in the index (a2idx) there will be about 80 tuples with id=1.

> table using an IndexScan before we hit any rows. So we would probably be better off doing a Sort-SeqScan to apply the condition.

That's true for the query without a LIMIT clause (the second query is about 10% slower than the first one, both without LIMIT).
-- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
apoc9009 wrote:
> Write it down 100 times and maybe you understand

If you are going to be rude, nobody will bother to respond to you. Acknowledged experts have been very patient with you so far in this thread. You should be appreciative, not truculent.

cheers

andrew
Re: [HACKERS] Hash index todo list item
On Fri, Sep 07, 2007 at 09:50:07AM -0400, Mark Mielke wrote:
> Kenneth Marshall wrote:
> > On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
> > > You might find this patch useful: http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php ... Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge it up to HEAD if you'd like.
> > This is a great starting point. I would appreciate it if you have the time and could make it apply cleanly to HEAD. I remember when you first posted it but had forgotten, probably because of the lack-luster results. Just a quick glance at the patch, and from what I can tell, tagging the index as lossy causes a lot more work to be done than should be needed in theory. Currently the index-scan machinery will recheck the value against the original value for lossy indexes. However, given that we are using a good hash function with a low chance of collision, if we mark the unique items in the index then they do not actually have to be rechecked during the scan. Do you have any suggestions for implementing that optimization, or is there any option to tell the scan machinery to only re-check a certain list of tuples? Thank you again for pointing this patch out, and please let me know when you have a version against HEAD.
> What do you mean by "mark the unique items in the index then they do not actually have to be rechecked during the scan"? Even if there is a unique hash value mapping to a unique key, there is no guarantee that a new value won't result in that same hash value. Such is the nature of hashes. A hash key match does not mean a value match. The value must be checked. The opposite, however, may be true: if the hash key is not found, then we know the row for the value does not exist. Have you measured the performance of re-checking? I have always assumed the performance of re-checking was near free when compared to the cost of looking up the tuples in the table to determine whether or not they are live. This is why I have not been upset that bitmap index scans often re-check.
> Cheers, mark

I understand that a hash value is a many-to-one mapping. That is the point of the flag in the index. The flag means that there is only one item in the heap corresponding to that hash value. In this case we know that the value in the heap is the correct one, and a possibly very expensive string comparison can be skipped. Given that the hash function is doing its job, almost every string comparison can be skipped. How long would it take to compare 1-32K of data? How much CPU usage? With this field in place, you only need to check tuple visibility.

Regards, Ken
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Markus Schiltknecht wrote:
> [2]: Terms and Definitions of Database Replication http://www.postgres-r.org/documentation/terms

Markus, the links in the left side menu are broken on the about and documentation pages. They point to http://www.postgres-r.org/overview instead of http://www.postgres-r.org/documentation/overview, etc.

Rainer
Re: [HACKERS] Hash index todo list item
Kenneth Marshall wrote:
> On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
> > You might find this patch useful: http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php ... Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge it up to HEAD if you'd like.
> This is a great starting point. I would appreciate it if you have the time and could make it apply cleanly to HEAD. I remember when you first posted it but had forgotten, probably because of the lack-luster results. Just a quick glance at the patch, and from what I can tell, tagging the index as lossy causes a lot more work to be done than should be needed in theory. Currently the index-scan machinery will recheck the value against the original value for lossy indexes. However, given that we are using a good hash function with a low chance of collision, if we mark the unique items in the index then they do not actually have to be rechecked during the scan. Do you have any suggestions for implementing that optimization, or is there any option to tell the scan machinery to only re-check a certain list of tuples? Thank you again for pointing this patch out, and please let me know when you have a version against HEAD.

What do you mean by "mark the unique items in the index then they do not actually have to be rechecked during the scan"? Even if there is a unique hash value mapping to a unique key, there is no guarantee that a new value won't result in that same hash value. Such is the nature of hashes. A hash key match does not mean a value match. The value must be checked. The opposite, however, may be true: if the hash key is not found, then we know the row for the value does not exist.

Have you measured the performance of re-checking? I have always assumed the performance of re-checking was near free when compared to the cost of looking up the tuples in the table to determine whether or not they are live. This is why I have not been upset that bitmap index scans often re-check.
Cheers, mark -- Mark Mielke [EMAIL PROTECTED]
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Markus Schiltknecht schrieb:
> Hi, apoc9009 wrote:
> > Thadt is Replication NOT Backup
> I've now read all of your messages in this thread, but I simply fail to understand why you are that much opposed to the term 'replication'. I think the only thing which comes any close to what you're looking for is replication (in particular eager multi-master replication).

What is your Problem in understanding the Word Backup? Translation for you: A Backup is a File or Set of Files thadt contains the Data of your Business critical Informations. It should not be Archived on the same place, the same House or the same Room. A Replication Database has nothing to do with a Backup, it works only for Failover if the Primary Database has a Mailfunction. A good Backuptool is needed if you have Databases with sizes over 1 Terrabyte. The common Backup methods wont Work with Online Productiondatabases and without the Problem of Datalosses, this is only a Way for small and Mediumsize Databases, not for Hugh Databases. Keep in Mind: Backup is NOT Replication! Write it down 100 times and maybe you understand
Re: [HACKERS] Hash index todo list item
On Fri, Sep 07, 2007 at 12:55:37PM +0100, Heikki Linnakangas wrote:
> Neil Conway wrote:
> > You might find this patch useful: http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
> Oh, I had forgotten about that.
> > It implements the "just store the hash in the index" idea; it also sorts the entries in a bucket by the hash value, which allows binary search to be used to locate candidate matches. I was surprised that this didn't result in a performance improvement for the benchmarks that I ran, but I never got around to investigating further (either running more benchmarks or checking whether there was a bug in the implementation).
> You did get a smaller index, which has cache benefits with larger tables. You didn't compare the size against b-tree, but a smaller index is a good thing. I think you left some money on the table on that front. Since all the HashItems are fixed size, you can get rid of the line pointers altogether. Given that sizeof(HashItemData) is 8 bytes, and a line pointer is 4 bytes, that should give a further 1/3 reduction in index size. If you're willing to give up on the alignment of HashItemData, you can get it down to 6 bytes. Even from a CPU point of view, as the table becomes bigger and the b-tree becomes deeper, the difference between O(1) and O(log n) lookup starts to become more significant.

If you use the size values from my initial tests, the hash index is down to 3X the btree size instead of 5X. If we can make these additional changes and add a unique flag to the index entry, we would have a much smaller index. I had not thought of it at the time, but the addition of the unique/sole-item flag would allow the hash index to support unique indexes and be used for primary keys. In some usage cases, the O(1) lookup would be very advantageous.

Regards, Ken
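Heikki's "further 1/3 reduction" follows directly from the sizes he quotes; a quick sanity check of the arithmetic (the byte counts come from his message, the variable names are mine):

```python
ITEM = 8          # sizeof(HashItemData), per Heikki's figures
LINE_POINTER = 4  # size of one line pointer

with_lp = ITEM + LINE_POINTER      # 12 bytes per index entry today
without_lp = ITEM                  # 8 bytes once line pointers go away
saving = 1 - without_lp / with_lp  # fraction of the index saved

print(saving)  # 0.333..., i.e. the "further 1/3 reduction"

# Dropping the alignment of HashItemData would shrink an entry
# to the quoted 6 bytes; the exact field layout behind that
# number is not spelled out in the thread.
```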
Re: [HACKERS] Hash index todo list item
On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
> On Sun, 2007-02-09 at 13:04 -0500, Kenneth Marshall wrote:
> > 2. Evaluate the performance of different hash index implementations and/or changes to the current implementation. My current plan is to keep the implementation as simple as possible and still provide the desired performance. Several hash index suggestions deal with changing the layout of the keys on a page to improve lookup performance, including reducing the bucket size to a fraction of a page or only storing the hash value on the page, instead of the index value itself.
> You might find this patch useful: http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
> It implements the "just store the hash in the index" idea; it also sorts the entries in a bucket by the hash value, which allows binary search to be used to locate candidate matches. I was surprised that this didn't result in a performance improvement for the benchmarks that I ran, but I never got around to investigating further (either running more benchmarks or checking whether there was a bug in the implementation).
> Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge it up to HEAD if you'd like.
> -Neil

This is a great starting point. I would appreciate it if you have the time and could make it apply cleanly to HEAD. I remember when you first posted it but had forgotten, probably because of the lack-luster results. Just a quick glance at the patch, and from what I can tell, tagging the index as lossy causes a lot more work to be done than should be needed in theory. Currently the index-scan machinery will recheck the value against the original value for lossy indexes. However, given that we are using a good hash function with a low chance of collision, if we mark the unique items in the index then they do not actually have to be rechecked during the scan.

Do you have any suggestions for implementing that optimization, or is there any option to tell the scan machinery to only re-check a certain list of tuples? Thank you again for pointing this patch out, and please let me know when you have a version against HEAD.

Regards, Ken
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Ühel kenal päeval, R, 2007-09-07 kell 12:03, kirjutas apoc9009:
> Andrew Sullivan schrieb:
> > It seems that what you want is near-real-time online backups with _no cost_, which is not a feature that I think anyone will ever work on.
> A 100% Correct! I think anyone commit the Statement, thadt a Databases is a very imported Part of Software for a wide range of Peoples. It is absoluty normaly thadt thoose Peoples who using Databases having the Wish to have there Data save and up to Date until to the Last Record of a Customer (for Example: A Order form a Onlineshop). Very Large Databases with sizes more then over 1 Terrabyte cannot be backed up continously using Dumps or the common Backupmethods
> The Abstract need is:
> 1. An exact Copy (1:1) of the Productiondatabase should be mirrored at a diffrent Machine, connected via (LAN/WAN) (as Backup)
> 2. The Backup Operations should be done normaly while the Database is Online with very small Bandwich, I/O and CPU Trafficcosts.
> 3. Per Month or Week (or choosable) at a New Backup-Timeline/Version should be started (depend on the Basebackup)

Probably you don't want your backup as a base backup + a bunch of logfiles. If you need to really move to the backup, then even a week's worth of logfiles will take several hours or possibly days to apply. What you want is a standby machine, which applies all WAL files as they are copied over.

> 4. A New Base and its Streaming Backup Configuration should be Hot-Addable to an running Productiondatabase (without Shutdown or Lock)

It currently is.

> 5. A Remote Operationmanager Interface (GUI) should be there for Monitoring and Maintaining the Backups (maybe in PGAdmin)

Huh? I don't think this has anything to do with the postgres-hackers list. Actually, your other questions should also go to some users/newbies list, as this is a question of using existing features, and not something that needs to be added to the backend.

> 6. If the Production and Mirror replicated Database is crashed (Hardware Failure or Provider Problems) the Recovery should done verry fast.

Yep. That is what hot standby, as described in the postgres manual, is meant to do.

> (An Idea for 5. / Clientside)... It would be a great Feature, if the PSQL-Client having a Local Deamon, thadt can Download the Backup Catalogues from the WAN-Backup FTP-Server continously down to the local Adminmachine. Then the Admin is able to burn the DB-Backups on DVD or write it on Tape on its local Machine at his Company (if the Backupreplicationserver fails two or the Internetprovider has Problems, the DB-Admin can apply its local Backups from DVD to the New DB on a New Machine anytime)

All this keeping the WAL logs on CDs/tapes has nothing to do with fast recovery after failure. This is something that would be useful, but for entirely other purposes, namely for the case when you need to _GO_BACK_ to an earlier state. For example, you discover that a careless operator or a software bug deleted important data last Wednesday, and you need a way to get it back. In that case you take your last Sunday's base backup and apply WAL up to a few moments before the error happened. But it will not be fast.

Hannu
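The standby Hannu describes replays each WAL segment as it arrives from the archive. A minimal sketch of the copy step such a restore script performs; the function name, paths, and retry policy are illustrative assumptions, not PostgreSQL's actual tooling:

```python
import shutil
from pathlib import Path

def restore_segment(archive_dir, segment_name, dest_path):
    """Copy one archived WAL segment into place for replay.

    Returns True once the segment has arrived in the archive and
    was copied; returns False if it is not there yet. A real
    standby script would sleep and retry on False, so the standby
    keeps applying WAL continuously as segments are shipped over.
    """
    src = Path(archive_dir) / segment_name
    if not src.exists():
        return False       # segment not shipped yet; caller retries
    shutil.copy(src, dest_path)
    return True
```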
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Ühel kenal päeval, R, 2007-09-07 kell 12:20, kirjutas apoc9009:
> Trevor Talbot schrieb:
> > > Backup 12/24/2008 Version 2
> > > /pg/backup/12_24_2008/base/rcvry.rcv --- Basebackup
> > > /pg/backup/12_24_2008/changes/0001.chg --- Changed Data
> > > /changes/0002.chg --- Changed Data
> > > /changes/0003.chg --- Changed Data
> > > /changes/0010.chg --- Changed Data
> > > /changes/0001.rsf --- Recovery Stripeset File (10 MByte) addon of Basebackup
> > > delete *.chg if a Stripeset of 10 *.chg Files exist, they should be converted or merged to one bigger *.rsf Stripesetfile (to prevent for having Millions *.chg)
> > Why? What does this actually do?
> This is the Structure on FTP-Site of the Backupserver! Not locally where the Postmaster runs on!

This can currently be done with a simple script, either with ftp or ssh/scp.

- Hannu
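The *.chg to *.rsf merge described in the quoted layout is indeed plain scripting, as Hannu says. A hedged sketch using the file naming from the example above; the merge policy (concatenate ten change files, then delete them) is my reading of the description, not an existing tool:

```python
from pathlib import Path

def merge_stripeset(changes_dir, stripe_no):
    """Merge ten NNNN.chg files into one NNNN.rsf 'stripe set'
    file and remove the originals, so the FTP site never
    accumulates millions of small change files.

    Returns the new .rsf path, or None if fewer than ten
    change files have arrived so far.
    """
    d = Path(changes_dir)
    chg = sorted(d.glob("*.chg"))[:10]
    if len(chg) < 10:
        return None                      # not enough change files yet
    out = d / ("%04d.rsf" % stripe_no)
    with out.open("wb") as f:
        for c in chg:
            f.write(c.read_bytes())      # concatenate in name order
    for c in chg:
        c.unlink()                       # "delete *.chg" once merged
    return out
```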
Re: [HACKERS] Hash index todo list item
Neil Conway wrote:
> You might find this patch useful: http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php

Oh, I had forgotten about that.

> It implements the "just store the hash in the index" idea; it also sorts the entries in a bucket by the hash value, which allows binary search to be used to locate candidate matches. I was surprised that this didn't result in a performance improvement for the benchmarks that I ran, but I never got around to investigating further (either running more benchmarks or checking whether there was a bug in the implementation).

You did get a smaller index, which has cache benefits with larger tables. You didn't compare the size against b-tree, but a smaller index is a good thing.

I think you left some money on the table on that front. Since all the HashItems are fixed size, you can get rid of the line pointers altogether. Given that sizeof(HashItemData) is 8 bytes, and a line pointer is 4 bytes, that should give a further 1/3 reduction in index size. If you're willing to give up on the alignment of HashItemData, you can get it down to 6 bytes. Even from a CPU point of view, as the table becomes bigger and the b-tree becomes deeper, the difference between O(1) and O(log n) lookup starts to become more significant.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Hash index todo list item
On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
> On Sun, 2007-02-09 at 13:04 -0500, Kenneth Marshall wrote:
> > 2. Evaluate the performance of different hash index implementations and/or changes to the current implementation. My current plan is to keep the implementation as simple as possible and still provide the desired performance. Several hash index suggestions deal with changing the layout of the keys on a page to improve lookup performance, including reducing the bucket size to a fraction of a page or only storing the hash value on the page, instead of the index value itself.
> You might find this patch useful: http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
> It implements the "just store the hash in the index" idea; it also sorts the entries in a bucket by the hash value, which allows binary search to be used to locate candidate matches. I was surprised that this didn't result in a performance improvement for the benchmarks that I ran, but I never got around to investigating further (either running more benchmarks or checking whether there was a bug in the implementation).
> Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge it up to HEAD if you'd like.
> -Neil

I have another question. Did the scan code at that time use heap-order scanning? Could that have had an impact on the patch's performance?

Ken
Re: [HACKERS] Hash index todo list item
Kenneth Marshall wrote:
> I understand that a hash value is a many-to-one mapping. That is the point of the flag in the index. The flag means that there is only one item in the heap corresponding to that hash value. In this case we know that the value in the heap is the correct one and a possibly very expensive string comparison can be skipped. Given that the hash function is doing its job, almost every string comparison can be skipped. How long would it take to compare 1-32K of data? How much CPU usage? With this field in place, you only need to check tuple visibility.

The value comparison cannot be skipped. I do not think you understand the many-to-one mapping in its entirety. Example:

Table has: a(1), b(2), c(3)
Index has: 1 -> 1 (unique), 2 -> 2 (unique), 3 -> 3 (unique)
Query: select * from table where key = 'z';

If 'z' hashes to '3' (completely possible), then index record 3 points to tuple 3, and it exists. Only, it doesn't, because 'a' != 'z'. You MUST check the value.

Cheers, mark

-- Mark Mielke [EMAIL PROTECTED]
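Mark's example can be made concrete with a toy model. The "hash" here is deliberately collision-prone (string length) to force the case he describes; all names and the dict-based index layout are illustrative only:

```python
def lookup(index, heap, key, h):
    """index maps hash -> tid; heap maps tid -> (value, row).

    Even when a hash value maps to exactly one heap tuple
    ("unique" in the proposed flag's sense), the heap value must
    still be compared: a different key can produce the same hash.
    """
    tid = index.get(h(key))
    if tid is None:
        return None                          # definite miss: hash absent
    value, row = heap[tid]
    return row if value == key else None     # the mandatory recheck

# A toy 'hash' that collides easily: every key maps to its length.
h = len
heap = {1: ("a", "row-a")}
index = {h("a"): 1}                          # hash 1 -> tuple 1, "unique"

print(lookup(index, heap, "a", h))   # finds row-a
print(lookup(index, heap, "z", h))   # same hash, wrong value: no row
```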
Re: [HACKERS] Hash index todo list item
On Fri, Sep 07, 2007 at 10:36:41AM -0400, Brian Hurt wrote:
> Kenneth Marshall wrote:
> > I understand that a hash value is a many-to-one mapping. That is the point of the flag in the index. The flag means that there is only one item in the heap corresponding to that hash value. In this case we know that the value in the heap is the correct one and a possibly very expensive string comparison can be skipped. Given that the hash function is doing its job, almost every string comparison can be skipped. How long would it take to compare 1-32K of data? How much CPU usage? With this field in place, you only need to check tuple visibility.
> How likely is it that you will get a hash collision, two strings that are different that will hash to the same value? Avoiding this requires a very large hash key (128 bits, minimum); otherwise you get into birthday attack problems. With a 32-bit hash, the likelihood is greater than 50% that two strings in a collection of 100,000 will hash to the same value. With a 64-bit hash, the likelihood is greater than 50% that two strings in a collection of 10 billion will hash to the same value. 10 billion is a large, but not an unreasonable, number of strings to want to put into a hash table, and it's exactly this case where the O(1) cost of hashtables starts being a real win.
> Brian

Yes, there is a non-negligible chance of collision (in a DB, is there any chance that is negligible? :) ), and the values must be checked against the actual data. The win is the collapse of the index size, with only a small fraction of the actual tuples needing to be checked.

Regards, Ken
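Brian's 50% figures match the standard birthday approximation, p ≈ 1 - exp(-n²/2N) for n uniformly hashed keys in a space of N = 2^bits values; a quick check (the helper name is mine):

```python
from math import exp

def collision_prob(n, bits):
    """Approximate probability that n uniformly hashed keys
    contain at least one colliding pair in a 2**bits hash space
    (birthday approximation: 1 - exp(-n^2 / 2N))."""
    N = 2.0 ** bits
    return 1.0 - exp(-(n * n) / (2.0 * N))

# Brian's two cases, both comfortably over 50%:
print(collision_prob(100_000, 32))         # ~0.69 for 100,000 keys, 32-bit hash
print(collision_prob(10_000_000_000, 64))  # ~0.93 for 10 billion keys, 64-bit hash
```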
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Hi,

apoc9009 wrote:
> Translation for you: A Backup is a File or Set of Files thadt contains the Data of your Business critical Informations. It should not be Archived on the same place, the same House or the same Room.

I disagree; a backup does not necessarily have to be a single file or a set of files. Wikipedia has this definition: "backup refers to making copies of data so that these additional copies may be used to restore the original after a data loss event." While for replica, it states: "a replica is a copy that is relatively indistinguishable from the original." Thus a backup can very well be thought of as a replica, and vice versa.

> A Replication Database has nothing to do with a Backup, it works only for Failover if the Primary Database has a Mailfunction.

That's certainly plain wrong; see multi-master replication, where failover doesn't make any sense. Wikipedia again (although that's unfair, as I've contributed to that definition myself) [1]: "Replication is the process of sharing information so as to ensure consistency between redundant resources" ... for example a master database and a backup.

> Keep in Mind: Backup is NOT Replication! Write it down 100 times and maybe you understand

A backup IS a replica. A backup IS a replica. A backup IS a replica. A backup IS a replica...

Regards

Markus

[1]: http://en.wikipedia.org/wiki/Replication_%28computer_science%29
Re: [HACKERS] Hash index todo list item
On Fri, Sep 07, 2007 at 11:08:13AM -0400, Brian Hurt wrote:
> Kenneth Marshall wrote:
> > > How likely is it that you will get a hash collision, two strings that are different that will hash to the same value? Avoiding this requires a very large hash key (128 bits, minimum); otherwise you get into birthday attack problems. With a 32-bit hash, the likelihood is greater than 50% that two strings in a collection of 100,000 will hash to the same value. With a 64-bit hash, the likelihood is greater than 50% that two strings in a collection of 10 billion will hash to the same value. 10 billion is a large, but not an unreasonable, number of strings to want to put into a hash table, and it's exactly this case where the O(1) cost of hashtables starts being a real win.
> > > Brian
> > Yes, there is a non-negligible chance of collision (in a DB, is there any chance that is negligible? :) ), and the values must be checked against the actual data. The win is the collapse of the index size, with only a small fraction of the actual tuples needing to be checked.
> Ah, OK. I misunderstood you. I thought you were saying that the hash values would need to be unique, and you wouldn't check the original values at all. My bad.
> Brian

No, you were correct. I misstated it originally, and you and Mark both pointed out my mistake.

Regards, Ken
[HACKERS] terms for database replication: synchronous vs eager
Hi,

I'm asking for advice and hints regarding terms in database replication, especially WRT Postgres-R. (Sorry for crossposting, but I fear not reaching enough people on the Postgres-R ML alone.)

I'm struggling with how to classify the Postgres-R algorithm. Up until recently, most people thought of it as synchronous replication, but it's not synchronous in the strong (and very common) sense. I.e., after a node confirms to have committed a transaction, other nodes haven't necessarily committed already. (They only promise that they *will* commit without conflicts.) This violates the common understanding of synchrony, because you can't commit on a node A and then query another node B and expect it to be coherent immediately.

Nonetheless, Postgres-R is eager (or pessimistic?) in the sense that it replicates *before* committing, so as to avoid divergence. In [1] I've tried to make that distinction clear, and I'm currently advocating using 'synchronous' only in the very strong (and commonly used) sense. I've chosen the word 'eager' to mean 'replicates before committing'. According to these definitions, Postgres-R is async but eager.

Do these definitions violate any common meaning? Maybe in other areas like distributed storage or lock managers?

Regards

Markus

[1]: Terms and Definitions of Database Replication http://www.postgres-r.org/documentation/terms
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Filip Rembiałkowski schrieb: please take following remarks: thx, but if i need some advice form a scandinavian dickhead then i will let you know this ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Hash index todo list item
On Fri, Sep 07, 2007 at 10:30:30AM -0400, Mark Mielke wrote: Kenneth Marshall wrote: I understand that a hash value is a many-to-one mapping. That is the point of the flag in the index. The flag means that there is only one item in the heap corresponding to that hash value. In this case we know that the value in the heap is the correct one and a possibly very expensive string comparison can be skipped. Given that the hash function is doing its job, almost every string comparison can be skipped. How long would it take to compare 1-32K of data? How much CPU usage? With this field in place, you only need to check tuple visibility. The value comparison cannot be skipped. I do not think you understand the many-to-one mapping in its entirety. Example: Table has: a(1), b(2), c(3) Index has: 1 = 1 (unique), 2 = 2 (unique), 3 = 3 (unique) Query: select * from table where key = 'z'; If 'z' hashes to '3' (completely possible), then the index record 3 points to tuple 3, and it exists. Only, it doesn't, because 'a' <> 'z'. You MUST check the value. Cheers, mark Yes, you are completely correct. Thank you for the reminder. Regards, Ken ---(end of broadcast)--- TIP 6: explain analyze is your friend
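Mark's point can be made concrete with a deliberately tiny hash. Everything below is a toy model (tiny_hash, the heap dict, and the TIDs are all made up for illustration, not the hash index code); with a 4-bit hash a collision is easy to exhibit, but the many-to-one argument is identical at 32 bits:

```python
def tiny_hash(s):
    """A deliberately weak 4-bit hash so collisions are easy to find."""
    return sum(s.encode()) % 16

heap = {1: 'a', 2: 'b', 3: 'c'}                      # tid -> value
index = {tiny_hash(v): tid for tid, v in heap.items()}  # hash -> tid

def lookup(key):
    tid = index.get(tiny_hash(key))
    if tid is None:
        return None      # hash absent: key is definitely not indexed
    # The hash matched, but that only makes the tuple a *candidate*;
    # skipping this comparison returns wrong rows on a collision.
    return tid if heap[tid] == key else None
```

Here 'q' happens to collide with 'a' under tiny_hash, so a lookup that trusted the hash alone would hand back the tuple containing 'a'; the recheck correctly rejects it.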
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
apoc9009 wrote: Filip Rembiałkowski schrieb: please take following remarks: thx, but if i need some advice form a scandinavian dickhead then i will let you know this This is not acceptable on our lists. Do not post in such a way again. Sincerely, Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240 PostgreSQL solutions since 1997 http://www.commandprompt.com/ UNIQUE NOT NULL Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/ ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
apoc9009 wrote: Filip Rembiałkowski schrieb: please take following remarks: thx, but if i need some advice form a scandinavian dickhead then i will let you know this That kind of remark is not acceptable on the PostgreSQL mailing lists. Please do not post here again unless you can speak to people with an appropriate amount of respect. Regards, Dave. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Hash index todo list item
Kenneth Marshall wrote: How likely is it that you will get a hash collision, two strings that are different that will hash to the same value? To avoid this requires a very large hash key (128 bits, minimum)- otherwise you get into birthday attack problems. With a 32-bit hash, the likelihood is greater than 50% that two strings in a collection of 100,000 will hash to the same value. With a 64-bit hash, the likelihood is greater than 50% that two strings in a collection of 10 billion will hash to the same value. 10 billion is a large number, but not an unreasonable number, of strings to want to put into a hash table- and it's exactly this case where the O(1) cost of hashtables starts being a real win. Brian Yes, there is a non-negligible chance of collision (in a DB, is there any chance that is negligible? :) ) and the values must be checked against the actual tuples. The win is the collapse of the index size, while only needing to check a small fraction of the actual tuples. Ah, OK- I misunderstood you. I thought you were saying that the hash values would need to be unique, and you wouldn't check the original values at all. My bad. Brian
Re: [HACKERS] GIN readme is out of date
Heikki Linnakangas [EMAIL PROTECTED] writes: Since that interface documentation has been copied to the manual, gin.sgml, section 52.2, which is up to date, how about we just remove it from the README? +1 ... README files should not duplicate info that's in the SGML docs. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] GIN readme is out of date
Since that interface documentation has been copied to the manual, gin.sgml, section 52.2, which is up to date, how about we just remove it from the README? Agreed -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Hash index todo list item
Kenneth Marshall wrote: I understand that a hash value is a many-to-one mapping. That is the point of the flag in the index. The flag means that there is only one item in the heap corresponding to that hash value. In this case we know that the value in the heap is the correct one and a possibly very expensive string comparison can be skipped. Given that the hash function is doing its job, almost every string comparison can be skipped. How long would it take to compare 1-32K of data? How much CPU usage? With this field in place, you only need to check tuple visibility. How likely is it that you will get a hash collision, two strings that are different that will hash to the same value? To avoid this requires a very large hash key (128 bits, minimum)- otherwise you get into birthday attack problems. With a 32-bit hash, the likelihood is greater than 50% that two strings in a collection of 100,000 will hash to the same value. With a 64-bit hash, the likelihood is greater than 50% that two strings in a collection of 10 billion will hash to the same value. 10 billion is a large number, but not an unreasonable number, of strings to want to put into a hash table- and it's exactly this case where the O(1) cost of hashtables starts being a real win. Brian ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Just-in-time Background Writer Patch+Test Results
On Fri, 2007-09-07 at 11:48 -0400, Greg Smith wrote: On Fri, 7 Sep 2007, Simon Riggs wrote: I think that is what we should be measuring, perhaps in a simple way such as calculating the 90th percentile of the response time distribution. I do track the 90th percentile numbers, but in these pgbench tests where I'm writing as fast as possible they're actually useless--in many cases they're *smaller* than the average response, because there are enough cases where there is a really, really long wait that they skew the average up really hard. Take a look at any of the individual test graphs and you'll see what I mean. I've looked at the graphs now, but I'm not any wiser, I'm very sorry to say. We need something like a frequency distribution curve, not just the actual times. Bottom line is we need a good way to visualise the detailed effects of the patch. I think we should do some more basic tests to see where those outliers come from. We need to establish a clear link between number of dirty writes and response time. If there is one, which we all believe, then it is worth minimising those with these techniques. We might just be chasing the wrong thing. Perhaps output the number of dirty blocks written on the same line as the output of log_min_duration_statement so that we can correlate response time to dirty-block-writes on that statement. For me, we can enter Beta while this is still partially in the air. We won't be able to get this right without lots of other feedback. So I think we should concentrate now on making sure we've got the logging in place so we can check whether your patch works when it's out there. I'd say let's include what you've done and then see how it works during Beta. We've been trying to get this right for years now, so we have to allow some slack to make sure we get this right. We can reduce or strip out logging once we go RC. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? 
http://archives.postgresql.org
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
On Fri, Sep 07, 2007 at 12:03:31PM +0200, apoc9009 wrote: Andrew Sullivan schrieb: It seems that what you want is near-real-time online backups with _no cost_, which is not a feature that I think anyone will ever work on. 100% correct! I think anyone would agree with the statement that a database is a very important piece of software for a wide range of people. It is absolutely normal that those people who use databases wish to have their data safe and up to date, down to the last customer record (for example: an order from an online shop). Ah, but that's not what you asked (at least not as I interpreted it). You see, postgres does what you want if there's a simple power failure. Everything committed stays committed. You can protect against disk failure with RAID arrays also. What it sounds like is that you want to protect from someone blowing up your storage system. I think you need to step back and work out exactly what you are protecting against. Because it looks like your suggestion won't protect against a meteor strike on your data centre and 100km surrounding. Tell us what you are protecting against, then you can get some more useful answers. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate.
Re: [HACKERS] Just-in-time Background Writer Patch+Test Results
On Wed, 2007-09-05 at 23:31 -0400, Greg Smith wrote: Tom gets credit for naming the attached patch, which is my latest attempt to finalize what has been called the Automatic adjustment of bgwriter_lru_maxpages patch for 8.3; that's not what it does anymore but that's where it started. This is a big undertaking, so well done for going for it. I decided to use pgbench for running my tests. The scripting framework to collect all that data and usefully summarize it is now available as pgbench-tools-0.2 at http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm For me, the main role of the bgwriter is to avoid dirty writes in backends. The purpose of doing that is to improve the response time distribution as perceived by users. I think that is what we should be measuring, perhaps in a simple way such as calculating the 90th percentile of the response time distribution. Looking at the headline numbers especially tps is notoriously difficult to determine any meaning from test results. Looking at the tps also tempts us to run a test which maxes out the server, an area we already know and expect the bgwriter to be unhelpful in. If I run a server at or below 70% capacity, what settings of the bgwriter help maintain my response time distribution? Coping with idle periods While I was basically happy with these results, the data Kevin Grittner submitted in response to my last call for commentary left me concerned. While the JIT approach works fine as long as your system is active, it does absolutely nothing if the system is idle. I noticed that a lot of the writes that were being done by the client backends were after idle periods where the JIT writer just didn't react fast enough during the ramp-up. For example, if the system went from idle for a while to full-speed just as the 200ms sleep started, by the time the BGW woke up again the backends could have needed to write many buffers already themselves. You've hit the nail on the head there. 
I can't see how you can do anything sensible when the bgwriter keeps going to sleep for long periods. The bgwriter's activity curve should ideally be the same shape as a critically damped harmonic oscillator. It should wake up, lots of writing if needed, then trail off over time. The only way to do that seems to be to vary the sleep automatically, or make short sleeps. For me, the bgwriter should sleep for at most 10ms at a time. If it has nothing to do it can go straight back to sleep again. Trying to set that time is fairly difficult, so it would be better not to have to set it at all. If you've changed bgwriter so it doesn't scan if no blocks have been allocated, I don't see any reason to keep the _delay parameter at all. I think I can safely say there is a level of intelligence going into what the LRU background writer does with this patch that has never been applied to this problem before. There have been a lot of good ideas thrown out in this area, but it took a hybrid approach that included and carefully balanced all of them to actually get results that I felt were usable. What I don't know is whether that will also be true for other testers. I get the feeling that what we have here is better than what we had before, but I guess I'm a bit disappointed we still have 3 magic parameters, or 5 if you count your hard-coded ones also. There's still no formal way to tune these. As long as we have *any* magic parameters, we need a way to tune them in the field, or they are useless. At very least we need a plan for how people will report results during Beta. That means we need a log_bgwriter (better name, please...) parameter that provides information to assist with tuning. At the very least we need this to be present during Beta, if not beyond. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
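One way to picture the trailing-off behaviour described above is a weighted moving average of recent buffer allocations: a burst ramps the write target up quickly, and an idle stretch lets it decay instead of stopping dead. This is only a hypothetical pacing rule sketched to illustrate the damped-response shape, not the algorithm in the patch:

```python
def bgwriter_steps(allocations, alpha=0.3):
    """Return one write target per tick for a hypothetical self-pacing
    background writer.  Each tick (imagine a fixed 10ms sleep) blends
    the latest buffer-allocation count into an exponentially weighted
    average, so the write rate trails off smoothly after a burst."""
    smoothed, targets = 0.0, []
    for allocated in allocations:
        smoothed = (1 - alpha) * smoothed + alpha * allocated
        targets.append(int(round(smoothed)))
    return targets

# A burst of 100 allocations followed by idle ticks: the write target
# ramps up immediately, then decays geometrically rather than halting.
print(bgwriter_steps([0, 0, 100, 0, 0, 0]))
```

With a fixed short sleep, the decay constant is the only knob left, which is the appeal of dropping the _delay parameter entirely.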
Re: [HACKERS] Sorting the Stop word lists
1. Sort the stopword lists in the main distribution 2. Require them to be sorted 3. Remove the sort from readstoplist() I don't believe that will be a big win in performance - the lists are rather small. And it would need an added check that the lists are actually sorted -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Installation problem and a question
Phil wrote: 1) Should PL/JAVA be able to be installed on Vista x64 or is there some known issue? What environment variables should be set, and how should I set them for a JRE vs. JDK installation? Sorry - don't know. 2) Where is the source code for the Windows installer? I've tried fairly hard to find it, including Googling, browsing development sites, and looking at the PG 8.2.4 source code directory, but no luck so far. Sorry if this is a simple question but I don't know where else to ask at this point. Among other things I wanted to look at how the installer was detecting the existence of a JRE. Can help here: http://pgfoundry.org/projects/pginstaller/ -- Richard Huxton Archonet Ltd ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?
Simon Riggs wrote: On Fri, 2007-09-07 at 06:36 +0200, Florian G. Pflug wrote: Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: - I actually think with just a little bit of more work, we can go even further, and get rid of the ReadNewTransactionId() call completely during snapshotting. [ squint... ] This goes a bit far for me. In particular, I think this will fail in the edge case when there are no live XIDs visible in ProcArray. You cannot go back and do ReadNewTransactionId afterward, at least not without re-scanning the ProcArray a second time, which makes it at best a questionable win. Why would it? I think the additional suggestion goes a bit too far. You may be right, but I don't want to change the transaction system in advanced ways this close to the next release. We may have difficulty spotting bugs in that thinking during beta. Ok, those were two clear votes against doing this, so I'll stop arguing ;-). I do think that we should have another look at this when 8.4 opens, though. greetings, Florian Pflug ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] HEAD build troubles, buildfarm misconfigurations
On Wednesday 05 September 2007 12:01, Tom Lane wrote: Robert Treat [EMAIL PROTECTED] writes: On Wednesday 05 September 2007 00:06, Tom Lane wrote: BTW, on re-reading that, it seems a tad surprising to get an error right there --- if postgres_fe.h or anything it includes were broken, then the build should have failed earlier. Is the failure reproducible in repeated tries? Yes. Just for giggles I played with the configure flags as well, but it always errors in that spot. Hmph. I checked the CVS logs and there haven't been any recent changes that seem like they could be related. It might be worth getting gcc -E output and looking to see what that part of netdb.h looks like after macro expansion. Sorry, my C-fu is lacking. My makefile claims it uses gcc -E, but I'm not sure how to get a look at the results after macro expansion. However... I looked again at the output of the make command on some ubuntu buildfarm members, and the results were different enough that I got suspicious of the flex/bison derived files in the snapshot, so I decided to build from cvs and that works fine. FWIW I am using flex 2.5.31 and bison 2.1, not sure what is used to produce the snapshots... -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Just-in-time Background Writer Patch+Test Results
On Fri, 7 Sep 2007, Simon Riggs wrote: I think that is what we should be measuring, perhaps in a simple way such as calculating the 90th percentile of the response time distribution. I do track the 90th percentile numbers, but in these pgbench tests where I'm writing as fast as possible they're actually useless--in many cases they're *smaller* than the average response, because there are enough cases where there is a really, really long wait that they skew the average up really hard. Take a look at any of the individual test graphs and you'll see what I mean. Looking at the tps also tempts us to run a test which maxes out the server, an area we already know and expect the bgwriter to be unhelpful in. I tried to turn that around and make my thinking be that if I built a bgwriter that did most of the writes without badly impacting the measure we know and expect it to be unhelpful in, that would be more likely to yield a robust design. It kept me out of areas where I might have built something that had to be disclaimed with don't run this when the server is maxed out. For me, the bgwriter should sleep for at most 10ms at a time. If it has nothing to do it can go straight back to sleep again. Trying to set that time is fairly difficult, so it would be better not to have to set it at all. I wanted to get this patch out there so people could start thinking about what I'd done and consider whether this still fit into the 8.3 timeline. What I'm doing myself right now is running tests with a much lower setting for the delay time--am testing 20ms right now. I personally would be happy saying it's 10ms and that's it. Is anyone using a time lower than that right now? I seem to recall that 10ms was also the shortest interval Heikki used in his tests. I get the feeling that what we have here is better than what we had before, but I guess I'm a bit disappointed we still have 3 magic parameters, or 5 if you count your hard-coded ones also. 
I may be able to eliminate more of them, but I didn't want to take them out before beta. If it can be demonstrated that some of these parameters can be set to specific values and still work across a wider range of applications than what I've tested, then there's certainly room to fix some of these, which actually makes some things easier. For example, I'd be more confident fixing the weighted average smoothing period to a specific number if I knew the delay was fixed, and there's two parameters gone. And the multiplier is begging to be eliminated, just need some more data to confirm that's true. There's still no formal way to tune these. As long as we have *any* magic parameters, we need a way to tune them in the field, or they are useless. At very least we need a plan for how people will report results during Beta. That means we need a log_bgwriter (better name, please...) parameter that provides information to assist with tuning. Once I got past the "does it work?" stage, I've been doing all the tuning work using a before/after snapshot of pg_stat_bgwriter data during a representative period of activity and looking at the delta. Been a while since I actually looked into the logs for anything. It's very straightforward to put together a formal tuning plan using the data in there, particularly compared to the impossibility of creating such a plan in the current code. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
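The earlier point about 90th-percentile numbers coming out smaller than the average is easy to reproduce: a handful of very long waits pulls the mean up without touching p90 at all. A made-up latency sample (nothing from the actual pgbench runs):

```python
# 99 fast transactions plus one pathological stall: the single long
# wait drags the mean well above the 90th percentile, so p90 alone
# says nothing about the outliers that matter here.
latencies_ms = [5.0] * 99 + [2000.0]

mean = sum(latencies_ms) / len(latencies_ms)
ordered = sorted(latencies_ms)
p90 = ordered[int(0.9 * len(ordered)) - 1]   # simple nearest-rank percentile

print(f"mean = {mean:.2f} ms, p90 = {p90:.2f} ms")
```

Here the mean is roughly 25 ms while p90 stays at 5 ms, which is why a frequency distribution (or a max / p99 figure) reveals more than either summary number.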
Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?
On Fri, 2007-09-07 at 06:36 +0200, Florian G. Pflug wrote: Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: So I believe you're right, and we can skip taking the lock in the no xid case Sounds good. - I actually think with just a little bit of more work, we can go even further, and get rid of the ReadNewTransactionId() call completely during snapshotting. [ squint... ] This goes a bit far for me. In particular, I think this will fail in the edge case when there are no live XIDs visible in ProcArray. You cannot go back and do ReadNewTransactionId afterward, at least not without re-scanning the ProcArray a second time, which makes it at best a questionable win. Why would it? I think the additional suggestion goes a bit too far. You may be right, but I don't want to change the transaction system in advanced ways this close to the next release. We may have difficulty spotting bugs in that thinking during beta. lazy XID assignment will reduce the number of times GetNextTransactionId() is called and if we also avoid taking the ProcArrayLock for CommitTransaction() then we will have significantly reduced contention for mixed workloads (i.e. most real ones). -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] left outer join vs subplan
2007/9/6, Simon Riggs [EMAIL PROTECTED]: The query formulation does seem a fairly common one. First query: explain analyze select * from a left outer join ( select b.id, sum(b.val) from b group by b.id ) bagg on bagg.id = a.id where a.id > 1 order by a.addon, a.id limit 100; The value of sum(b.val) is never used in the query, so the aggregate itself could be discarded. I suspect there are other conditions you aren't showing us that would make this impossible? The value of sum(b.val) is being output in the select *, so saying it's never used is an oversimplification. But it's actually not used in any join, or filter. That should be enough to optimize... The aggregate prevents the condition bagg.id = a.id from being pushed down so that we know b.id = a.id. If we knew that then we could use b.id = ? as an index condition to retrieve the rows. That's exactly the point... But if we all can see it, maybe it's possible to code it? Cheers, Filip Rembiałkowski ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] GIN readme is out of date
access/gin/README describes the Gin interface, but it hasn't been updated since the change to the extractQuery interface that allows a "no query can match" return value. Since that interface documentation has been copied to the manual, gin.sgml, section 52.2, which is up to date, how about we just remove it from the README? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Just-in-time Background Writer Patch+Test Results
On Fri, 7 Sep 2007, Simon Riggs wrote: I think we should do some more basic tests to see where those outliers come from. We need to establish a clear link between number of dirty writes and response time. With the test I'm running, which is specifically designed to aggravate this behavior, the outliers on my system come from how Linux buffers writes. I can adjust them a bit by playing with the parameters as described at http://www.westnet.com/~gsmith/content/linux-pdflush.htm but on the hardware I've got here (single 7200RPM disk for database, another for WAL) they don't move much. Once /proc/meminfo shows enough Dirty memory that pdflush starts blocking writes, game over; you're looking at multi-second delays before my plain old IDE disks clear enough debris out to start responding to new requests even with the Areca controller I'm using. Perhaps output the number of dirty blocks written on the same line as the output of log_min_duration_statement so that we can correlate response time to dirty-block-writes on that statement. On Linux at least, I'd expect this won't reveal much. There, the interesting correlation is with how much dirty data is in the underlying OS buffer cache. And exactly how that plays into things is a bit strange sometimes. If you go back to Heikki's DBT2 tests with the background writer schemes he tested, he got frustrated enough with that disconnect that he wrote a little test program just to map out the underlying weirdness: http://archives.postgresql.org/pgsql-hackers/2007-07/msg00261.php I've confirmed his results on my system and done some improvements to that program myself, but pushed further work on it to the side to finish up the main background writer task instead. I may circle back to that. I'd really like to run all this on another OS as well (I have Solaris 10 on my server box but not fully setup yet), but I can only volunteer so much time to work on all this right now. 
If there's anything that needs to be looked at more carefully during tests in this area, it's getting more data about just what the underlying OS is doing while all this is going on. Just the output from vmstat/iostat is very informative. Those using DBT2 for their tests get some nice graphs of this already. I've done some pgbench-based tests that included that before that were very enlightening but sadly that system isn't available to me anymore. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Hash index todo list item
On Thu, Sep 06, 2007 at 01:08:59PM -0500, Kenneth Marshall wrote: Since we already have to check the actual tuple values for any index lookup in postgresql, we could only store the full hash value and the corresponding TIDs in the bucket. Then when we lookup an item by calculating its hash, if the exact hash is not present in the bucket, then we know that the item is not in the index. Sounds like you'd be returning a bitmap for use with a bitmap scan. That's a different take on other suggestions I've heard and would allow a hash index to have an almost unlimited key size yet flexible matching... (combined with other index, or even just the same index). Neat. Have a nice day, Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate.
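A sketch of the store-only-the-hash bucket layout described above (the class and its names are mine, and crc32 merely stands in for PostgreSQL's own hash function): absence of the hash proves absence of the key, while presence yields candidate TIDs that still have to be rechecked against the heap, much like a bitmap scan's recheck step.

```python
import zlib

class HashOnlyIndex:
    """Toy model: store only the full 32-bit hash and matching TIDs,
    never the key itself, so key size is effectively unlimited."""
    def __init__(self):
        self.buckets = {}                 # hash value -> [tid, ...]

    @staticmethod
    def _hash(key):
        return zlib.crc32(key.encode())   # stand-in hash function

    def insert(self, key, tid):
        self.buckets.setdefault(self._hash(key), []).append(tid)

    def candidates(self, key):
        """Candidate TIDs only; an empty list is a definitive miss."""
        return self.buckets.get(self._hash(key), [])

heap = {1: 'alpha', 2: 'beta'}            # tid -> actual value
idx = HashOnlyIndex()
for tid, key in heap.items():
    idx.insert(key, tid)

# Recheck step: filter the candidates against the actual tuples.
matches = [tid for tid in idx.candidates('alpha') if heap[tid] == 'alpha']
```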
Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?
Florian G. Pflug [EMAIL PROTECTED] writes: Why would it? The idea was to remember the largest committed xid, and that won't go away just because the proc array is rather empty xid-wise. I hadn't fully absorbed this idea last night, but now that I have, I'm starting to think it's a good one. (That slightly lagging of largestCommittedXid might cause some tuples not to be VACUUMED though, so we might want to update largestCommittedXid for ABORTS too, and probably rename it to largestNonRunningXid or whatever ;-) ). LargestCompletedXid, perhaps? This would rid us of the rather complicated entanglement of XidGenLock and the ProcArrayLock, lessen the lock contention, and reduce the average snapshot size a bit. In view of Simon's earlier comments at http://archives.postgresql.org/pgsql-hackers/2007-07/msg00948.php it seems like not having to take the XidGenLock during GetSnapshotData ought to be a pretty serious win performance-wise, and it might open up some other possibilities that are now foreclosed by the risk of deadlock between the two locks. I've spent the past hour or so trying to consolidate the comments in GetSnapshotData and related places into a single chunk of text to be added to src/backend/access/transam/README. Attached is what I have so far --- this incorporates the idea of not taking ProcArrayLock to exit an XID-less transaction, but not yet Florian's idea. I think it'd get simpler if we changed to that, but am posting this for comments. regards, tom lane Interlocking transaction begin, transaction end, and snapshots -- We try hard to minimize the amount of overhead and lock contention involved in the frequent activities of beginning/ending a transaction and taking a snapshot. Unfortunately, we must have some interlocking for this, because it is critical that all backends agree on the commit order of transactions. 
For example, suppose an UPDATE in xact A is blocked by xact B's prior update of the same row, and xact B is doing commit while xact C gets a snapshot. Xact A can complete and commit as soon as B releases its locks. If xact C's GetSnapshotData sees xact B as still running, then it had better see xact A as still running as well, or it will be able to see two tuple versions - one deleted by xact B and one inserted by xact A. That is, if A thinks it committed after B, C had better think the same. We enforce this by not allowing any transaction to exit the set of running transactions while a snapshot is being taken. (This rule is probably stronger than necessary, but see the next problem.) The implementation of this is that GetSnapshotData takes the ProcArrayLock in shared mode (thereby allowing multiple backends to take snapshots in parallel), but xact.c must take the ProcArrayLock in exclusive mode while clearing MyProc->xid at transaction end (either commit or abort). Here is another variant of the risk scenario: 1. Xact A is running (in Read Committed mode). 2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is swapped out before it can acquire ProcArrayLock. 3. Xact B gets new XID (= C's xmax), makes changes and commits. 4. Xact A changes some row R changed by xact B and commits. 5. Xact C finishes getting its snapshot data. It sees xact A as done, but sees xact B as still running (since B >= xmax). Now C will see R changed by xact B and then xact A, *but* does not see other changes made by xact B. If C is supposed to be in Serializable mode, this is wrong. To prevent this it is necessary that GetSnapshotData acquire ProcArrayLock before it calls ReadNewTransactionId. This prevents xact A from exiting the set of running transactions seen by xact C. Therefore both A and B will be seen as still running => no inconsistency. 
In short, then, the rule is that no transactions may exit the set of currently-running transactions between the time we fetch xmax and the time we finish building our snapshot. However, this restriction only applies to transactions that have an XID --- read-only transactions can end without acquiring ProcArrayLock, since they don't affect anyone else's snapshot. We must also require GetNewTransactionId to store the new XID into the shared ProcArray before releasing XidGenLock. This ensures that when GetSnapshotData calls ReadNewTransactionId (which also takes XidGenLock), all active XIDs before the returned value of nextXid are already present in the ProcArray and can't be missed by GetSnapshotData. Unfortunately, we can't have GetNewTransactionId take ProcArrayLock to do this, else it could deadlock against GetSnapshotData. Therefore, we simply let GetNewTransactionId store into MyProc->xid without any lock. We are thereby relying on fetch/store of an XID to be atomic, else other backends might see a partially-set XID. (NOTE: for multiprocessors that need explicit memory access fence instructions, this means that
Re: [HACKERS] HEAD build troubles, buildfarm misconfigurations
Robert Treat wrote: On Wednesday 05 September 2007 12:01, Tom Lane wrote: Robert Treat [EMAIL PROTECTED] writes: On Wednesday 05 September 2007 00:06, Tom Lane wrote: BTW, on re-reading that, it seems a tad surprising to get an error right there --- if postgres_fe.h or anything it includes were broken, then the build should have failed earlier. Is the failure reproducible in repeated tries? Yes. Just for giggles I played with the configure flags as well, but it always errors in that spot. Hmph. I checked the CVS logs and there haven't been any recent changes that seem like they could be related. It might be worth getting gcc -E output and looking to see what that part of netdb.h looks like after macro expansion. Sorry, my C-fu is lacking. My makefile claims it uses gcc -E, but I'm not sure how to get a look at the results after macro expansion. However... I looked again at the output of the make command on some ubuntu buildfarm members, and the results were different enough that I got suspicious of the flex/bison derived files in the snapshot, so I decided to build from cvs, and that works fine. FWIW I am using flex 2.5.31 and bison 2.1, not sure what is used to produce the snapshots... scan.c in snapshot says: #define YY_FLEX_MAJOR_VERSION 2 #define YY_FLEX_MINOR_VERSION 5 gram.c in snapshot says: /* A Bison parser, made by GNU Bison 1.875. */ cheers andrew ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Trevor Talbot schrieb: Backup 12/24/2008 Version 2 /pg/backup/12_24_2008/base/rcvry.rcv --- Basebackup /pg/backup/12_24_2008/changes/0001.chg --- Changed Data /changes/0002.chg --- Changed Data /changes/0003.chg --- Changed Data /changes/0010.chg --- Changed Data /changes/0001.rsf --- Recovery Stripeset File (10 MByte) addon of Basebackup delete *.chg if a Stripeset of 10 *.chg Files exist, they should be converted or merged to one bigger *.rsf Stripesetfile (to prevent having millions of *.chg files) Why? What does this actually do? This is the structure on the FTP site of the backup server, not on the machine where the postmaster runs! Locally, the postmaster creates a base backup (rcvry.rcv); then the FTP daemon logs on to the remote FTP backup server and uploads this file to keep it safe: /pg/backup/12_24_2008/ /pg/backup/12_24_2008/base/rcvry.rcv --- Basebackup Next: whenever a new record is written to the database, the postmaster locally creates a change file named 0001.chg. After that, the local FTP daemon transfers this file to /pg/backup/12_24_2008/changes/0001.chg --- Changed Data. Then /pg/backup/12_24_2008/changes/0002.chg --- Changed Data /pg/backup/12_24_2008/changes/0003.chg --- Changed Data /pg/backup/12_24_2008/changes/0010.chg --- Changed Data. Then the FTP backup daemon merges the last 10 *.chg files into one larger *.RSF file while logged into the FTP backup server, and deletes those 10 *.chg files there (after this, the last 10 *.chg files created by the postmaster can be deleted locally where the postmaster runs, too). The benefit: if the machine the database runs on has a malfunction, the last record of the database catalogue is still backed up in a *.chg or *.RSF file. That's my idea of this. Apoc ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Andrew Sullivan schrieb: It seems that what you want is near-real-time online backups with _no cost_, which is not a feature that I think anyone will ever work on. 100% correct! I think anyone would agree with the statement that a database is a very important piece of software for a wide range of people. It is absolutely normal that people who use databases wish to have their data safe and up to date, down to the very last customer record (for example: an order from an online shop). Very large databases with sizes of more than a terabyte cannot be backed up continuously using dumps or the common backup methods. The abstract need is:

1. An exact (1:1) copy of the production database should be mirrored on a different machine, connected via LAN/WAN (as backup).
2. The backup operations should normally be done while the database is online, with very small bandwidth, I/O, and CPU costs.
3. Per month or week (or configurable), a new backup timeline/version should be started (based on the base backup).
4. A new base and its streaming backup configuration should be hot-addable to a running production database (without shutdown or lock).
5. A remote operation manager interface (GUI) should exist for monitoring and maintaining the backups (maybe in pgAdmin).
6. If the production database and its replicated mirror crash (hardware failure or provider problems), recovery should be very fast.

(An idea for 5, client-side)... It would be a great feature if the psql client had a local daemon that could continuously download the backup catalogues from the WAN backup FTP server to the local admin machine.
Then the admin would be able to burn the DB backups to DVD or write them to tape on his local machine at his company (if the backup replication server fails too, or the Internet provider has problems, the DB admin can apply his local backups from DVD to a new DB on a new machine at any time). Apoc ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[HACKERS] Sorting the Stop word lists
I notice we sort the stop word list after we read it into memory. Wouldn't it be easier to: 1. Sort the stopword lists in the main distribution 2. Require them to be sorted 3. Remove the sort from readstoplist() We should at the very least do (1) to improve the sort speed at start. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Sorting the Stop word lists
Simon Riggs [EMAIL PROTECTED] writes: I notice we sort the stop word list after we read it into memory. I see nothing wrong with that; it only happens once per backend session, and it makes maintenance of the files easier. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] integrated tsearch doesn't work with non utf8 database
Hello, last time I tested a utf8 database; now I checked a latin2 encoding database. I used the dictionaries from the last test. client_encoding | utf8 lc_collate | cs_CZ.iso-8859-2 lc_ctype | cs_CZ.iso-8859-2 lc_messages | cs_CZ.iso-8859-2 lc_monetary | cs_CZ.iso-8859-2 lc_numeric | cs_CZ.iso-8859-2 lc_time | cs_CZ.iso-8859-2 postgres=# create Text search dictionary cspell2(template=ispell, afffile=czech, dictfile=czech); CREATE TEXT SEARCH DICTIONARY postgres=# alter text search configuration cs alter mapping for word, lword with cspell2, simple; ALTER TEXT SEARCH CONFIGURATION postgres=# select ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody'); ERROR: character 0xc3a5 of encoding UTF8 has no equivalent in LATIN2 CONTEXT: SQL function ts_debug statement 1 The database was initialised with /usr/local/pgsql/bin/initdb --encoding=latin2 --locale=cs_CZ.iso-8859-2 -D /usr/local/pgsql/data/ The locale is correct: postgres=# set client_encoding to utf8; SET postgres=# select upper('Příliš žluťoučký kůň se napil žluté vody'); upper: PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ SE NAPIL ŽLUTÉ VODY (1 row) Regards Pavel Stehule ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?
Tom Lane wrote: I've spent the past hour or so trying to consolidate the comments in GetSnapshotData and related places into a single chunk of text to be added to src/backend/access/transam/README. Attached is what I have so far --- this incorporates the idea of not taking ProcArrayLock to exit an XID-less transaction, but not yet Florian's idea. I think it'd get simpler if we changed to that, but am posting this for comments. Interlocking transaction begin, transaction end, and snapshots -- We try hard to minimize the amount of overhead and lock contention involved in the frequent activities of beginning/ending a transaction and taking a snapshot. Unfortunately, we must have some interlocking for this, because it is critical that all backends agree on the commit order of transactions. This is actually still a slightly stronger requirement than what we really need, I think. To simplify the following discussion, A -> B shall mean that transaction A saw B as committed. Conversely, A !-> B shall mean that A treats B as in-progress. If A was in read-committed mode, the visibility refers to the latest snapshot that A used. Now assume A and B commit at nearly the same time, and for two other transactions C and D the following holds: C -> A, C !-> B but D !-> A, D -> B. This would violate the requirement that the commit order is globally agreed upon, yet as long as both A !-> B and B !-> A hold, there is no conflict. (Note that if A and B are serializable, A !-> B and B !-> A imply that A and B cannot have touched the same record and both committed - one would have been aborted due to a SerializationError.) I must admit, though, that this is a quite academic case, since the prerequisite A !-> B and B !-> A is something we have no control over for read-committed transactions - who knows when they might have taken their last snapshot... Still, I wanted to mention this because I believe that the minimal requirement that we actually *need* to enforce is: A -> B and B -> C imply A -> C.
(T1) The actual implementation will probably always have to enforce something slightly stronger, but it's still nice to know the minimal guarantee needed to be able to judge correctness. For example, suppose an UPDATE in xact A is blocked by xact B's prior update of the same row, and xact B is doing commit while xact C gets a snapshot. Xact A can complete and commit as soon as B releases its locks. If xact C's GetSnapshotData sees xact B as still running, then it had better see xact A as still running as well, or it will be able to see two tuple versions - one deleted by xact B and one inserted by xact A. In my notation this becomes: A -> B and C !-> B imply C !-> A. This then follows from (T1) - assume that A -> B, C !-> B but C -> A; then with (T1), C -> B follows from C -> A and A -> B, which contradicts C !-> B. We enforce this by not allowing any transaction to exit the set of running transactions while a snapshot is being taken. (This rule is probably stronger than necessary, but see the next problem.) The implementation of this is that GetSnapshotData takes the ProcArrayLock in shared mode (thereby allowing multiple backends to take snapshots in parallel), but xact.c must take the ProcArrayLock in exclusive mode while clearing MyProc->xid at transaction end (either commit or abort). Agreed. We *do* enforce a strict global ordering of commits and snapshots. This then guarantees (T1), because if A -> B and B -> C, then A *must* have taken its snapshot after B committed, and B in turn *must* have taken its snapshot after C committed, so surely A -> C will hold too. Here is another variant of the risk scenario: 1. Xact A is running (in Read Committed mode). 2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is swapped out before it can acquire ProcArrayLock. 3. Xact B gets new XID (>= C's xmax), makes changes and commits. 4. Xact A changes some row R changed by xact B and commits. 5. Xact C finishes getting its snapshot data.
It sees xact A as done, but sees xact B as still running (since B >= xmax). Now C will see R changed by xact B and then xact A, *but* does not see other changes made by xact B. If C is supposed to be in Serializable mode, this is wrong. I never really grasped why we need to assume serializable here - this seems wrong if C is read-committed too. Seeing only half of a transaction's changes can never be right, can it? To prevent this it is necessary that GetSnapshotData acquire ProcArrayLock before it calls ReadNewTransactionId. This prevents xact A from exiting the set of running transactions seen by xact C. Therefore both A and B will be seen as still running => no inconsistency. Another point of view is that determining the xmax of a snapshot really *is* part of taking the snapshot. Since we have already established that we need to serialize snapshotting and committing, it follows that we must not allow commits to happen between the
Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?
Florian G. Pflug [EMAIL PROTECTED] writes: Tom Lane wrote: Here is another variant of the risk scenario: 1. Xact A is running (in Read Committed mode). 2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is swapped out before it can acquire ProcArrayLock. 3. Xact B gets new XID (>= C's xmax), makes changes and commits. 4. Xact A changes some row R changed by xact B and commits. 5. Xact C finishes getting its snapshot data. It sees xact A as done, but sees xact B as still running (since B >= xmax). Now C will see R changed by xact B and then xact A, *but* does not see other changes made by xact B. If C is supposed to be in Serializable mode, this is wrong. I never really grasped why we need to assume serializable here - this seems wrong if C is read-committed too. Seeing only half of a transaction's changes can never be right, can it? You'd be surprised ;-) ... Read Committed is not totally consistent, even considering a single statement using a single snapshot, because of the rules about UPDATEs starting from the latest committed version. But let's not get into that today. Anyway, one thing I realized while making this text is that the example cases mentioned in CommitTransaction and GetSnapshotData really were basically the same case. The difference is that in this case, the failure to see B as committed is closely related to the way that xmax is computed. We could get rid of this example if we switched over to your LatestCompletedXid proposal. I had a rather hard time understanding these things initially - at least for me, the enlightenment came when I realized that most of the locking in there actually ensures that (T1) holds - or in other words, that the relation A sees B as committed *has* to be a transitive one. So I'd like to see this reasoning in that README - though maybe I'm the only one to whom this is the clearest wording... I'll put some of this into the README.
regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Oddity with psql \d and pg_table_is_visible
On Wed, Sep 05, 2007 at 03:27:50PM -0400, Tom Lane wrote: Decibel! [EMAIL PROTECTED] writes: While this is correct on a per-relation level, I'm thinking that it's not what we'd really like to have happen in psql. What I'd like \d to do is show me everything in any schema that's in my search_path, even if there's something higher in the search_path that would over-ride it. ISTM that's what most people would expect out of \d. I don't agree with that reasoning in the least, particularly not if you intend to fix it by redefining pg_table_is_visible() ... No, pg_table_is_visible is correct as-is. What will happen if we change \d to work that way is that it will show you a table, and you'll try to access it, and you'll get the wrong table because the access will go to the one that really is visible. That's why I was suggesting that any table showing up in \d that in fact isn't visible be marked somehow, either with a separate field, or by sticking an * after its name. This is confusing because when using \d you generally think in terms of what schemas are in your search path, not whether an individual object has been superseded by something further up the chain. -- Decibel!, aka Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) pgpSd6sbfMLD2.pgp Description: PGP signature
[HACKERS] A Silly Idea for Vertically-Oriented Databases
Be forewarned - this is probably a very long post, and I'm just a mere mortal (i.e. an admin) who doesn't write copious amounts of C code. Take the following posts and suggestions with a grain of salt. So I've been seeing/hearing all of the hoopla over vertical databases (column stores), and how they'll not only slice bread but also make toast, etc. I've done some quick searches for past articles on C-Store, Vertica, Column Store, and Vertical Database, and have seen little discussion on this. And then a funny thought occurs to me - when I look at the directory structure and file layout of a PostgreSQL database, I see that each OID corresponds to a table, which corresponds to (generally) a single file. Then I have a second funny thought - what if there was a low-friction, low-cost-of-implementation way to bring similar advantages to PostgreSQL without major alterations, recoding, etc.? Finally it occurs to me that PostgreSQL already does something similar, but it could do it so much better, with only one language change and minor changes to the storage layout. So here's my plum-crazy proposal (and I've made some before - see http://archives.postgresql.org/pgsql-odbc/2006-10/msg00040.php - and they not only made it into production, but they are in active use by me on a weekly basis - Thanks Hiroshi!!!), bear with me... Make one small, very tiny syntactic change to CREATE TABLE that includes a new keyword, COLUMN-STORE or something similar. I don't care where it appears as long as it's after the CREATE TABLE. You would not have to break any existing SQL conventions, PostgreSQL would continue to be SQL compliant, and given the odd wording, I highly doubt that the folks who work on SQL keywords will end up using it at any point in time. If adding COLUMN-STORE is objectionable because it would cloud the compliance of the language, then simply move the functionality into the tablespace mechanism. In hindsight, it might even belong there instead.
So, instead of specifying it by table, create a table space that has an attribute Column Storage set as active. When inactive, it uses the traditional one-file-per-table layout. Make each table column capable of receiving an OID. This will be critical for the following steps... If a table is created with COLUMN-STORE as an option, then it will continue to behave in the same way it always has, but the storage will be different. Each column in the table will be represented by a single file, with the file name being (naturally) the OID. INSERT/UPDATE/DELETE would function as it always has, etc. Nothing would change. Except how the data is stored. The existing TOAST mechanisms continue to work - because the engine would treat each file as a single-column table! One additional column would be added to the store, an invisible one that not only tracks the OID for the rows in this type of setup, but also the state of the row. Let's call this the Control Column. Given that the metadata size for the row would be fixed/constant, we won't have to worry about what is in the other columns and rows; they can be any size. BTW, the Control Column would be just another column from the storage engine's point of view. It just happens to be one that no one can see, other than the database (and maybe the administrator). When you go to VACUUM a table, you would treat each column as a single-column table, so if a row is a candidate for VACUUM reclamation, then it will adjust each column an equal amount. Under no circumstances would you have columns out of sync, so if a record goes, it means each adjacent column goes with it. This sounds disk-intensive at first, until you realize that the admin will have made a conscious decision to use this format, and understands the advantages/drawbacks of this method. So it takes a little longer to VACUUM; I don't care, because as an admin I will have specified this layout for a reason - to do OLAP, not OLTP. Which means I rarely VACUUM it.
Add to this the high efficiency you would gain by packing more records into buffers per read, and most of the losses you take in re-reading data would really not amount to as big a loss as you might think. DELETE would simply mark a row off as deleted in the Control Column. If the storage engine needed to reclaim a row, it would not have to look any further than the control column to find an empty spot where it could overwrite data. INSERT/UPDATE continue to work as they always have. The storage engine would perceive each column as a single-column table, meaning that the existing TOAST mechanisms continue to work! Nothing needs to change there. The real change would be that the table's columns would be split up into individual updates, and the Control Column would be used to keep all of the records in sync. Why bother with this? Because, when all is said and done, you will find yourself with a rough
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
Avery, Make one small, very tiny syntactic change to CREATE TABLE that includes a new keyword, COLUMN-STORE or something similar. If someone writes the rest of the code, I doubt the syntax will be the holdup. But writing an efficient C-store table mechanism is much harder than I think you think it is; Vertica worked on it for a year and failed, and Paraccel took two years to succeed. FYI, Paraccel is based on Postgres. So, put up a pgfoundry project and start hacking a c-store table; I'm sure you'll get interest if you can make something work. -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] integrated tsearch doesn't work with non utf8 database
Pavel Stehule wrote: postgres=# select ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody'); ERROR: character 0xc3a5 of encoding UTF8 has no equivalent in LATIN2 CONTEXT: SQL function ts_debug statement 1 I can reproduce that. In fact, you don't need the custom config or dictionary at all: postgres=# CREATE DATABASE latin2 encoding='latin2'; CREATE DATABASE postgres=# \c latin2 You are now connected to database latin2. latin2=# select ts_debug('simple','foo'); ERROR: character 0xc3a5 of encoding UTF8 has no equivalent in LATIN2 CONTEXT: SQL function ts_debug statement 1 It fails trying to lexize the string using the danish snowball stemmer, because the danish stopword file contains character 'å' which doesn't have an equivalent in LATIN2. Now what the heck is it doing with the danish stemmer, you might ask. ts_debug is implemented as a SQL function; EXPLAINing the complex SELECT behind it, I get this plan:

latin2=# \i foo.sql
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Hash Join  (cost=2.80..1134.45 rows=80 width=100)
   Hash Cond: (parse.tokid = tt.tokid)
   InitPlan
     ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
           Filter: (oid = 3748::oid)
     ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
           Filter: (oid = 3748::oid)
   ->  Function Scan on ts_parse parse  (cost=0.00..12.50 rows=1000 width=36)
   ->  Hash  (cost=0.20..0.20 rows=16 width=68)
         ->  Function Scan on ts_token_type tt  (cost=0.00..0.20 rows=16 width=68)
   SubPlan
     ->  Limit  (cost=7.33..7.36 rows=1 width=36)
           ->  Subquery Scan dl  (cost=7.33..7.36 rows=1 width=36)
                 ->  Sort  (cost=7.33..7.34 rows=1 width=8)
                       Sort Key: m.mapseqno
                       ->  Seq Scan on pg_ts_config_map m  (cost=0.00..7.32 rows=1 width=8)
                             Filter: ((ts_lexize(mapdict, $1) IS NOT NULL) AND (mapcfg = 3765::oid) AND (maptokentype = $0))
     ->  Sort  (cost=6.57..6.57 rows=1 width=8)
           Sort Key: m.mapseqno
           ->  Seq Scan on pg_ts_config_map m  (cost=0.00..6.56 rows=1 width=8)
                 Filter: ((mapcfg = 3765::oid) AND (maptokentype = $0))
(21 rows)

Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict, $1). That means that it will call ts_lexize on every dictionary, which will try to load every dictionary. And loading the danish_stem dictionary fails in latin2 encoding, because of the problem with the stopword file. We could rewrite ts_debug as a C function, so that it doesn't try to access any unnecessary dictionaries. It seems wrong to install dictionaries in databases where they won't work in the first place, but I don't see an easy fix for that. Any comments or better ideas? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
[HACKERS] equivilant to contrib tsearch trigger function in 8.3
Just wondering if it is already in 8.3 with a new name, or if not if there are plans to add it? TIA -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] Follow-Up to A Silly Idea for Vertically-Oriented Databases
In hindsight, I did miss quite a bit in my last post. Here's a summary that might clear it up: Add a single keyword that specifies that the storage format changes slightly. The keyword should not affect SQL compliance while still extending functionality. It can be specified as either part of the CREATE TABLE statement or part of the tablespace mechanism. When a table is created with this setting, all columns in a record are split vertically into individual, 1-column-wide tables, and each column in the table is assigned an OID. Each OID corresponds to one of our 1-wide tables. An additional control column will be created that is only visible to the database and the administrator. This column stores a single boolean indicating if the record is allocated or not. You might even be able to create a special bitmap index that is hidden, and just use existing bitmap functions in the index code. In essence, this column helps keep all of the other columns in sync when dealing with rows. When writing data to the table, each individual column will update, but the engine invisibly wraps together all of the columns into a single transaction. That is, each row insert is still atomic and behaves like it normally would - either the insert succeeds or it doesn't. Because the updates are handled by the engine as many separate tables, no special changes are required, and existing storage mechanisms (TOAST) continue to function as they always did. This could be written as a super-function of sorts, one that would combine all of the smaller steps together and use the existing mechanisms. Updates are performed in the same manner, with each column being rolled up into a single invisible mini-transaction for the given record. Deletes are performed by marking not only the columns as deleted but also the control column as having that row available for overwrite. I'm simplifying quite a bit but I think the general idea is understood.
Yes, a delete will have significant overhead compared to an insert or update, but this is a known tradeoff that the administrator is willing to make so they can gain faster read speeds - i.e. they want an OLAP-oriented store, not an OLTP-oriented store. The control column would be used to locate records that can be overwritten quickly. When a record is deleted, the control column's bitmap is adjusted to indicate that a free space is available. The engine would then coordinate as it did above, but it can cheat - instead of trying to figure things out for each table, the offset to write to is already known, so the update proceeds as listed above, except that each part of the little mini-transaction writes to the same offset (i.e. each column in the record will have the same hole, so when you go to write the record out, write it to the same record spot). This is where the control column coordinates not only deletes but also inserts that re-use space from deleted records. Hopefully that makes it a little clearer. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Ühel kenal päeval, R, 2007-09-07 kell 16:41, kirjutas apoc9009: Filip Rembiałkowski schrieb: please take following remarks: thx, but if i need some advice form a scandinavian dickhead then i will let you know this Is this apoc9009 guy real? For some time I honestly believed (based in part on the english-like language used in postings) that he just can't understand what is written in our documentation. My other suspicion was that he has never actually tried to do what he claims, but has just click-checked some checkboxes on some third-party backup software and is actually lamenting the lack of a box labeled Absolutely Reliable PostgreSQL Backup. But now I suspect that someone is just pulling our collective leg and trolling under a false name, to test how long we stand it? --- Hannu P.S.: any ideas whom he thinks to be scandinavian in this thread? Or is advice form a scandinavian dickhead an idiom in some language? ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?
Here's some revised text for the README file, based on using Florian's idea of a global latestCompletedXid variable. As I worked through it I realized that in this design, XidGenLock gates entry of new XIDs into the ProcArray while ProcArrayLock gates their removal. Which is an interesting sort of symmetry property. It also turns out that the reason we need to gate entry with XidGenLock is to keep from breaking GetOldestXmin, rather than to ensure correctness of snapshots per se. (Note: I refer in the text to ProcArrayEndTransaction(), which is a function I'm thinking of putting into procarray.c to replace the current inline-in-xact.c code that clears xid and related fields.) Comments? regards, tom lane Interlocking transaction begin, transaction end, and snapshots -- We try hard to minimize the amount of overhead and lock contention involved in the frequent activities of beginning/ending a transaction and taking a snapshot. Unfortunately, we must have some interlocking for this, because we must ensure consistency about the commit order of transactions. For example, suppose an UPDATE in xact A is blocked by xact B's prior update of the same row, and xact B is doing commit while xact C gets a snapshot. Xact A can complete and commit as soon as B releases its locks. If xact C's GetSnapshotData sees xact B as still running, then it had better see xact A as still running as well, or it will be able to see two tuple versions - one deleted by xact B and one inserted by xact A. Another reason why this would be bad is that C would see (in the row inserted by A) earlier changes by B, and it would be inconsistent for C not to see any of B's changes elsewhere in the database. Formally, the correctness requirement is if a snapshot A considers transaction X as committed, and any of transaction X's snapshots considered transaction Y as committed, then snapshot A must consider transaction Y as committed. 
What we actually enforce is strict serialization of commits and rollbacks with snapshot-taking: we do not allow any transaction to exit the set of running transactions while a snapshot is being taken. (This rule is stronger than necessary for consistency, but is relatively simple to enforce, and it assists with some other issues as explained below.) The implementation of this is that GetSnapshotData takes the ProcArrayLock in shared mode (so that multiple backends can take snapshots in parallel), but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode while clearing MyProc->xid at transaction end (either commit or abort). ProcArrayEndTransaction also holds the lock while advancing the shared latestCompletedXid variable. This allows GetSnapshotData to use latestCompletedXid + 1 as xmax for its snapshot: there can be no transaction >= this xid value that the snapshot needs to consider as completed.

In short, then, the rule is that no transaction may exit the set of currently-running transactions between the time we fetch latestCompletedXid and the time we finish building our snapshot. However, this restriction only applies to transactions that have an XID --- read-only transactions can end without acquiring ProcArrayLock, since they don't affect anyone else's snapshot nor latestCompletedXid.

Transaction start, per se, doesn't have any interlocking with these considerations, since we no longer assign an XID immediately at transaction start. But when we do decide to allocate an XID, GetNewTransactionId must store the new XID into the shared ProcArray before releasing XidGenLock. This ensures that all top-level XIDs <= latestCompletedXid are either present in the ProcArray, or not running anymore. (This guarantee doesn't apply to subtransaction XIDs, because of the possibility that there's not room for them in the subxid array; instead we guarantee that they are present or the overflow flag is set.)
If a backend released XidGenLock before storing its XID into MyProc, then it would be possible for another backend to allocate and commit a later XID, causing latestCompletedXid to pass the first backend's XID, before that value became visible in the ProcArray. That would break GetOldestXmin, as discussed below.

We allow GetNewTransactionId to store the XID into MyProc->xid (or the subxid array) without taking ProcArrayLock. This was once necessary to avoid deadlock; while that is no longer the case, it's still beneficial for performance. We are thereby relying on fetch/store of an XID to be atomic, else other backends might see a partially-set XID. This also means that readers of the ProcArray xid fields must be careful to fetch a value only once, rather than assume they can read it multiple times and get the same answer each time.

Another important activity that uses the shared ProcArray is GetOldestXmin, which must determine a lower bound for the oldest xmin of any active MVCC snapshot, system-wide.
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
> Avery, (my ramblings snipped) If someone writes the rest of the code, I doubt the syntax will be the holdup. But writing an efficient C-store table mechanism is much harder than I think you think it is; Vertica worked on it for a year and failed, and Paraccel took two years to succeed. FYI, Paraccel is based on Postgres. So, put up a pgfoundry project and start hacking a c-store table; I'm sure you'll get interest if you can make something work.
> --
> --Josh

Well, I did say it was a *crazy* idea. :-) Given that I would be starting from the ground floor, learning not only the innards of PostgreSQL but also C coding as well, I would probably have to overcome near-insurmountable odds to make this project take off. Still, if I was crazy enough to think it, maybe I'll be crazy enough to try for it. ;-)

Just ignore my 2nd posting; I was trying to clarify some of the ramblings I was typing.
Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?
Tom Lane wrote:
> Here's some revised text for the README file, based on using Florian's idea of a global latestCompletedXid variable. As I worked through it I realized that in this design, XidGenLock gates entry of new XIDs into the ProcArray while ProcArrayLock gates their removal. Which is an interesting sort of symmetry property. It also turns out that the reason we need to gate entry with XidGenLock is to keep from breaking GetOldestXmin, rather than to ensure correctness of snapshots per se.

I believe it would break both, no? If an xid <= latestCompletedXid is not included in the snapshot, but later used for updates, the snapshot will see those changes as committed when they really are not.

But other than that, it really sounds fine. It certainly explains things much better than the comments in the existing code.

I noticed two rather cosmetic issues:
.) latestCompletedXid sounds as if it might refer to the *last* completed xid, but it actually refers to the largest / highest completed xid. So maybe we should call it highestCompletedXid or largestCompletedXid.
.) Since you mention that we assume reading and writing int4s are atomic operations, maybe we should mention that for safety's sake we mark the corresponding pointers with volatile?

greetings, Florian Pflug
Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?
"Florian G. Pflug" [EMAIL PROTECTED] writes:
> I noticed two rather cosmetic issues:
> .) latestCompletedXid sounds as if it might refer to the *last* completed xid, but it actually refers to the largest / highest completed xid. So maybe we should call it highestCompletedXid or largestCompletedXid.

Actually that was an intentional choice: because of the wraparound behavior of XIDs, the latest value is not necessarily numerically largest. I'm not wedded to it though.

> .) Since you mention that we assume reading and writing int4s are atomic operations, maybe we should mention that for safety's sake we mark the corresponding pointers with volatile?

Couldn't hurt. I have a draft patch that I'll post shortly.

			regards, tom lane
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
On 9/7/07, Avery Payne [EMAIL PROTECTED] wrote:
> > Avery, (my ramblings snipped) If someone writes the rest of the code, I doubt the syntax will be the holdup. But writing an efficient C-store table mechanism is much harder than I think you think it is; Vertica worked on it for a year and failed, and Paraccel took two years to succeed. FYI, Paraccel is based on Postgres. So, put up a pgfoundry project and start hacking a c-store table; I'm sure you'll get interest if you can make something work.
> > --
> > --Josh
>
> Well, I did say it was a *crazy* idea. :-) Given that I would be starting from the ground floor, learning not only the innards of PostgreSQL but also C coding as well, I would probably have to overcome near-insurmountable odds to make this project take off. Still, if I was crazy enough to think it, maybe I'll be crazy enough to try for it. ;-)
>
> Just ignore my 2nd posting; I was trying to clarify some of the ramblings I was typing.

For whatever it's worth, I was reading about the same things today and came up with the same basic idea, without the same level of implementation details you've talked about, Avery. And it sounds really neat. Hard, but neat, and potentially worth it if, say, Paraccel doesn't open source their stuff first. :)

But I don't know the PostgreSQL internals either, though I've been known to throw together some C code now and again. So in short, there's interest. Whether there's collective skill/free time/motivation/insanity/etc. enough to make something useful of the interest is another question altogether. :)

- Josh/eggyknap
Re: [HACKERS] equivilant to contrib tsearch trigger function in 8.3
Robert Treat [EMAIL PROTECTED] writes:
> Just wondering if it is already in 8.3 with a new name, or if not if there are plans to add it? TIA

tsvector_update_trigger(), see docs section 9.13.2 (at the moment)

			regards, tom lane
Re: [HACKERS] Follow-Up to A Silly Idea for Vertically-Oriented Databases
On Fri, Sep 07, 2007 at 02:58:03PM -0700, Avery Payne wrote:
> In hindsight, I did miss quite a bit in my last post. Here's a summary that might clear it up:
>
> Add a single keyword that specifies that the storage format changes slightly. The keyword should not affect SQL compliance while still extending functionality. It can be specified either as part of the CREATE TABLE statement or as part of the tablespace mechanism. When a table is created with this setting, all columns in a record are split vertically into individual, 1-column-wide tables, and each column in the table is assigned an OID. Each OID corresponds to one of our 1-wide tables. An additional control column will be created that is only visible to the database and the administrator. This column stores a single logical value indicating whether the record is allocated or not. You might even be able to create a special bitmap index that is hidden, and just use existing bitmap functions in the index code. In essence, this column helps keep all of the other columns in sync when dealing with rows.

OIDs aren't the way to link this stuff together. It would make more sense for there to be one file that stores all the MVCC and other row overhead, and for that table to store ctids, because that will be the fastest way to look up the columns.

> When writing data to the table, each individual column will update, but the engine invisibly wraps together all of the columns into a single transaction. That is, each row insert is still atomic and behaves like it normally would: either the insert succeeds or it doesn't. Because the updates are handled by the engine as many separate tables, no special changes are required, and existing storage mechanisms (TOAST) continue to function as they always did. This could be written as a super-function of sorts, one that would combine all of the smaller steps together and use the existing mechanisms.
> Updates are performed in the same manner, with each column being rolled up into a single invisible mini-transaction for the given record.

The problem is that the idea of rows being a string of bytes within a page is spread pretty widely throughout the code; I'm pretty sure it extends far beyond just smgr. At some point stuff becomes just datums, but I don't know that there's a nice, clean line where that happens. I think this is probably the biggest obstacle that you're facing.

> Deletes are performed by marking not only the columns as deleted but also the control column as having that row available for overwrite. I'm simplifying quite a bit, but I think the general idea is understood. Yes, a delete will have significant overhead compared to an insert or update, but this is a known tradeoff that the administrator is willing to make so they can gain faster read speeds, i.e. they want an OLAP-oriented store, not an OLTP-oriented store.

You do *not* want to try to change how MVCC works at the same time you're doing this. There *may* be some possibility of changing things afterwards, but trying to tackle that off the bat is suicide. On top of that, HOT might well make this kind of optimization pointless.

> The control column would be used to locate records that can be overwritten quickly. When a record is deleted, the control column's bitmap is adjusted to indicate that a free space is available. The engine would then coordinate as it did above, but it can cheat: instead of trying to figure things out for each table, the offset to write to is already known, so the update proceeds as listed above, except that each part of the little mini-transaction writes to the same offset (i.e. each column in the record will have the same hole, so when you go to write the record out, write it to the same record spot). This is where the control column not only coordinates deletes but also inserts that re-use space from deleted records.
-- 
Decibel!, aka Jim Nasby [EMAIL PROTECTED]
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
On Sat, Sep 08, 2007 at 01:02:04AM +0300, Hannu Krosing wrote:
> On Friday, 2007-09-07 at 16:41, apoc9009 wrote:
> > Filip Rembiałkowski wrote:
> > > please take following remarks:
> > thx, but if i need some advice form a scandinavian dickhead then i will let you know this
>
> Is this apoc9009 guy real?

Pretty much as soon as I saw that comment I just nuked the whole thread and moved on. I suggest everyone else just do the same.

-- 
Decibel!, aka Jim Nasby [EMAIL PROTECTED]
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
[HACKERS] apparent tsearch breakage on 64-bit machines
On my x86_64 machine, CVS HEAD is throwing the following scary-looking warnings:

to_tsany.c: In function 'pushval_morph':
to_tsany.c:247: warning: cast from pointer to integer of different size
to_tsany.c: In function 'to_tsquery_byid':
to_tsany.c:306: warning: cast to pointer from integer of different size
to_tsany.c: In function 'plainto_tsquery_byid':
to_tsany.c:344: warning: cast to pointer from integer of different size

Whether the code is actually safe or not, these are not acceptable.

			regards, tom lane
Re: [HACKERS] apparent tsearch breakage on 64-bit machines
I wrote:
> Whether the code is actually safe or not, these [warnings] are not acceptable.

On looking closer, it seems the intent is to pass an argument of unspecified type through parse_tsquery to a PushFunction:

typedef void (*PushFunction)(void *opaque, TSQueryParserState state,
			     char *, int, int2);

extern TSQuery parse_tsquery(char *buf, PushFunction pushval,
			     void *opaque, bool isplain);

That's fine, but not in a way that throws compiler warnings. There is a standard solution for this task within PG: the arguments should be declared as Datum not void *. Use the DatumGetFoo/FooGetDatum macros to coerce back and forth. Also, the declaration of typedef PushFunction really ought to provide names for all the arguments, for documentation purposes. (Dare I suggest a comment block specifying the API?)

			regards, tom lane