Re: [HACKERS] Core team statement on replication in PostgreSQL
Hello Andrew, Andrew Sullivan wrote: Yes. And silent as ever. :-) Are the slides of your PgCon talk available for download somewhere? BTW: up until recently, there was yet another mailing list: [EMAIL PROTECTED] It was less focused on hooks and got at least some traffic. :-) Are those mails still archived somewhere? Regards Markus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning grammar
Hi Jeff, Jeff Cohen wrote: If you don't define a default partition to handle outliers, the insert should fail with an error. IMO, you should always have a default partition, then, so as not to violate the constraints (by rejecting tuples which are correct according to the constraints). With the generic approach, you start with a single table, and start by splitting it into two six-month partitions: ALTER TABLE sales SPLIT WHERE sales_date > date '2007-06-01' INTO ( PARTITION first_half PARTITION second_half ); We could implement this approach using check constraints and table inheritance: the partition second_half is a child table where sales_date > date '2007-06-01', and the partition first_half has the complementary constraint NOT (sales_date > date '2007-06-01'). Next, you split each partition: ALTER TABLE sales SPLIT PARTITION first_half WHERE sales_date > date '2007-03-01' INTO ( PARTITION first_quarter PARTITION second_quarter ); So now the child table for first_half itself has two children. As you continue this process you construct a binary tree of table inheritance using 12 ALTER statements. <nitpicking>There are just 11 splits between 12 months, otherwise correct, yes.</nitpicking> In the long grammar you can create and partition the table in one statement: CREATE TABLE sales ... PARTITION BY sales_date ( start (date '2007-01-01') end (date '2008-01-01') every (interval '1 month') ); To be fair, you should add the 12 partition names here as well. I can certainly see merit in letting the database system handle the binary tree. Thanks for your feedback. Partitioning the table using a series of splits is a clever solution for situations where the partitioning operation cannot be described using simple equality (like list, hash) or ordered comparison (range). But for many common business cases, the long grammar is easier to specify. Easier to specify initially, maybe, yes. But how about managing it afterwards? 
Having seen all the different options for merging, splitting, exchanging, coalescing and adding, all of them with small differences for hash, range and list partitioning - let alone sub-partitioning - with all of that, the proposed grammar doesn't look particularly easy to me. Let's at least drop the differences for list, hash and range partitioning; those are pretty much unneeded, IMO. Regards Markus ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
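For reference, the check-constraints-plus-inheritance implementation sketched above can be written out in plain SQL. This is an illustrative sketch only (table and column names taken from the example; the CHECK constraints are what the hypothetical SPLIT command would maintain automatically):

```sql
-- Parent table holding no rows of its own.
CREATE TABLE sales (sale_id integer, sales_date date);

-- First split: two child tables with complementary CHECK constraints,
-- so constraint exclusion can skip the irrelevant partition.
CREATE TABLE second_half (
    CHECK (sales_date > date '2007-06-01')
) INHERITS (sales);

CREATE TABLE first_half (
    CHECK (NOT (sales_date > date '2007-06-01'))
) INHERITS (sales);

-- Splitting first_half again yields children of a child:
-- repeated splits build the binary tree of inheritance.
CREATE TABLE first_quarter (
    CHECK (NOT (sales_date > date '2007-03-01'))
) INHERITS (first_half);

CREATE TABLE second_quarter (
    CHECK (sales_date > date '2007-03-01')
) INHERITS (first_half);
```

With constraint_exclusion enabled, a query such as `SELECT ... FROM sales WHERE sales_date < date '2007-02-01'` only scans the children whose constraints can match.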
Re: [HACKERS] Declarative partitioning grammar
Hi, Hannu Krosing wrote: I guess it would go to some default partition ? Which doesn't have a name so far, which prevents you from addressing that partition. Nor is it particularly well defined; it's just the rest. sure, but this can become really tedious for 1024 partitions, Well, managing 1024 partitions manually is a tedious job, no matter what grammar you take: you'll have to deal with 1024 different partition names. What do you need so many partitions for? not to mention hard for optimiser. It's pretty much the same for the optimizer: a binary tree. Granted, that binary tree had better be balanced by the RDBMS. Regards Markus ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Declarative partitioning grammar
Hi, Hans-Juergen Schoenig wrote: What do you need so many partitions for? having so many tables is not funny but it can be the only reasonable choice. Well, what do you do with all those partitions? Most of them will end up on the same storage subsystem. So, if you don't partition to spread your data across storage with different characteristics, why do you need partitioning at all? Isn't an index better in most cases? Or are you using it as a form of CLUSTERing, where you expect to reduce time for sequential scans over a range? Simon's Segment Exclusion proposal looks like a much better fit for that purpose, IMO. It would prevent you from having to handle all those partitions manually. Regards Markus ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Declarative partitioning grammar
Hi, Tom Lane wrote: I don't agree with that at all. I can imagine plenty of situations where a tuple falling outside the range of available partitions *should* be treated as an error. For instance, consider timestamped observations --- data in the future is certainly bogus, and data further back than you want to deal with must be an entry error as well. Isn't it better to have these constraints as table constraints, instead of burying them in the partitioning definition? Mixing those two concepts seems very weird to me. Or am I missing any benefit of mixing them? Regards Markus ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Declarative partitioning grammar
Hi, Gregory Stark wrote: In a previous life I had a database which had daily partitions. I assure you it was unquestionably the right decision. Each day's data wasn't just distinguished by the timestamp but actually was entirely separate from the previous day's data. Both the archiving strategy and the many reports which were run all depended specifically on the day the data was collected on. Wouldn't Segment Exclusion (maybe together with a specialized form of CLUSTERing) handle that case much better than partitioning? Without the need to name all those thousands of partitions and manage them manually. What I would want in such a case is exactly not manual management of partitions, but rather a performance optimization for scanning a range of rows, which is something in between indexes (for very few rows) and a seq scan (for almost all rows of a table). I know, this now sounds like I've turned sides to Simon's proposal. And yes, in a way, that's true. I certainly see merit for Segment Exclusion, more and more. OTOH I'm still skeptical about it replacing declarative partitioning entirely. But declarative partitioning only really makes sense if you partition into different storage subsystems, IMO. Everything happening on the same storage subsystem shouldn't need manual partitioning, but should be optimized pretty automatically. As Simon proposed, that's well possible in many cases. Regards Markus ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Declarative partitioning grammar
Hi, Tom Lane wrote: DBAs tend to be belt *and* suspenders guys, no? I rather know those admins with stupid looking faces who are wondering why their transactions fail. Often enough, that can have a lot of different reasons. Extending the set of possible traps doesn't seem like a clever idea for those admins. I'd think a lot of them would want a table constraint, plus a partitioning rule that rejects anything outside the intended partitions. I'm rather a fan of the DRY principle (don't repeat yourself). Because having to maintain redundant constraints smells suspiciously like a maintenance nightmare. And where's the real use of making the database system check twice? Want to protect against memory corruption in between the two checks, eh? :-) Regards Markus ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Declarative partitioning grammar
Hi, Zeugswetter Andreas ADI SD wrote: Yes, but the problem with the timestamp partitioned tables is, that the window is sliding. Thus you would need two alter tables for each new period. One that changes the constraint + one that creates the new partition. So it seems natural to join the two concepts for such a partitioning syntax. If you think in terms of split points, having to alter two tables is not true. It's better Personally I find the automatic partition idea intriguing, where you only have to choose an expression that equates to one value (value group) per partition (and possibly a way to derive a partition name). Then a partition is automatically created when a new row arrives for a new value. That does not however address Tom's concern of rejecting data that is outside the acceptable window, but maybe that is better dealt with in the application anyways. Andreas ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Declarative partitioning grammar
Hi, (sorry for the previous one, if delivered, that went off too early...) Zeugswetter Andreas ADI SD wrote: Yes, but the problem with the timestamp partitioned tables is, that the window is sliding. Thus you would need two alter tables for each new period. One that changes the constraint + one that creates the new partition. So it seems natural to join the two concepts for such a partitioning syntax. If you think in terms of split points, having to alter two partitions isn't true; you just add a split point. Of course, that also alters the constraints of the partitions, but I think we all agree that the system should maintain those constraints automatically, anyway. As such, they don't even have to be visible to the DBA. Personally I find the automatic partition idea intriguing, where you only have to choose an expression that equates to one value (value group) per partition (and possibly a way to derive a partition name). IMO, better go right to a fully automated approach. Or why would you need partition names in such a case? Regards Markus ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Declarative partitioning grammar
Hi, Jeff Cohen wrote: We did look at allowing general functions for partitioning and this was one concern. The other is that we want to enforce that a row only gets inserted into a single partition, so we wanted a declarative syntax where it was relatively easy to check that range and list specifications don't overlap. Why do you need to define a split point so ambiguously at all? Why not just give the DBA exactly *one* place to define the split point? I don't think the separation into list, hash and range partitioning is adequate. What is the system supposed to do, if you try to insert a row which doesn't fit any of the values in your list or doesn't fit any of the ranges you defined? I prefer a partitioning grammar which doesn't interfere with constraints. We all know how to define constraints. Please don't introduce a new, ambiguous way. A partitioning definition should be able to tell the target partition for *every* row which satisfies the constraints (the real ones, not ambiguous ones). IMO, a single DDL command should only touch a single split point, i.e. split a table into two partitions, move the split point or remove the split point (joining the partitions again). Those are the only basic commands you need to be able to handle partitioning. Sorry, but for my taste, the proposed grammar is too long per command, not flexible enough and instead ambiguous for split points as well as for constraints. To me it looks like repeating the mistakes of others. Regards Markus ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
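The split-point-only grammar argued for above might look like this. This is purely hypothetical syntax, sketched for illustration; SPLIT, MOVE SPLIT and UNSPLIT are not actual PostgreSQL commands, and the key point is only that each command touches exactly one split point:

```sql
-- Split the table into two partitions at a single, unambiguous point:
ALTER TABLE sales SPLIT AT (sales_date >= date '2007-06-01');

-- Move an existing split point:
ALTER TABLE sales MOVE SPLIT AT (sales_date >= date '2007-06-01')
                          TO (sales_date >= date '2007-07-01');

-- Remove a split point, joining the two partitions again:
ALTER TABLE sales UNSPLIT AT (sales_date >= date '2007-07-01');
```

Because each predicate partitions the set of rows satisfying the table's (real) constraints into exactly two complementary parts, every row has a well-defined target partition and no default partition or overlap checking is needed.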
Re: [HACKERS] Some ideas about Vacuum
Hi, Gokulakannan Somasundaram wrote: I'm also not sure it really buys us anything over having a second dead-space-map data structure. The WAL is much larger and serves other purposes which would limit what we can do with it. Ok. One obvious advantage is that it saves the contention over DSM for the DML operations and Vacuum process. Do you have evidence of that contention being so worse, that it justifies the additional WAL reading from disk? (Assuming no WAL archiving). IMO we can get about any granularity we want for DSM update locking, depending on how we arrange the DSM bits. Since Vacuum process is going to have much more information on what has happened in the database, Why should that be? IMO, collecting the information at transaction time can give you exactly the same information, if not more or better information. it is possible for some new structures. For example i have been thinking of changing our current index structure in such a way, it won't hold any duplicate tuples for different versions of data. Whenever there is a update, only the indexes relevant to the columns changed will get updated. The Vacuum has to play the role of changing the tid, the index tuple points to, whenever it vacuums a older version. Huh? The index would then point to the old tuple only, until a VACUUM comes by, right. How are following transactions expected to find the new tuple before that VACUUMing? Regards Markus ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Some ideas about Vacuum
Hi, Tom Lane wrote: Well, one of the principal arguments for having VACUUM at all is that it off-loads required maintenance effort from foreground transaction code paths. Off-loading doesn't mean we don't have to do the work, so it's obviously a compromise. AFAICT, having to write some DSM blocks from foreground transaction code paths may well be worth it overall, if it saves VACUUM from doing much more I/O. Especially if the bgwriter can defer the I/O to after commit time (which I'm thinking of as another form of off-loading work from foreground transaction code). Regards Markus ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Some ideas about Vacuum
Hi, Gokulakannan Somasundaram wrote: But i am just thinking of creating the DSM by reading through the WAL Logs, instead of asking the Inserts, updates and deletes to do the DSM creation. What's the advantage of that? What's wrong with collecting the information for DSM at transaction processing time? The overhead is certainly smaller than the overhead for doing it later on. I think what Gregory is coming at is, if we schedule the Vacuum after 20% of table changes, then we essentially say we need 120% of the disk space and hence our select operations might end up doing more I/Os. Well, full sequential scans end up doing more I/O, but not index scans typical for OLTP. So if autovacuum is the only thing doing full sequential scans, you'd better reduce the number of full scans, instead of saving only some percentage per scan, no? Of course, depending on how much of your table fits in ram, you also need to consider the space savings in RAM... However, I'm assuming a reasonably low ratio of RAM size vs table size. Regards Markus ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Named vs Unnamed Partitions
Hi, Simon Riggs wrote: On Wed, 2008-01-09 at 18:04 +0100, Markus Schiltknecht wrote: What do you think about letting the database system know the split point vs it having to find optimal split points automatically? For me, managing the table's files can be separate from the chunking that allows partition exclusion. Agreed. So in your terms, my question is: who does the chunking? What you are proposing with SE is that the database does the chunking automatically (based on spontaneous ordering). That's a novel and interesting idea to me. Managing the table's files must be a manual operation. We can't infer the presence of a new tablespace etc.. Sure. However, letting the database do the chunking (i.e. define split points), but leaving it up to the user to decide where to put those chunks certainly doesn't work. So, there are only two options: either we let the user choose split points manually, or else we tell the database those missing pieces of information and rely on automatisms. If I understand correctly, you are stating, that in the case of read-only vs read-writable, the database has enough information to make good decisions. Those files would need less than 10 zones or chunks, usually just one. Agreed. The WHERE clause approach might easily allow more than 2 chunks and they need not be logically contiguous. So the phrase split point doesn't really fit because it implies a one dimensional viewpoint, but I'm happy for you to give it a name. I consider read-only vs. read-writable to be pretty one dimensional. And the storage is logically organized in contiguous blocks. So there are split points between segments with differing read-only property, according to my definition. Regards Markus ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Some ideas about Vacuum
Hi, Gokulakannan Somasundaram wrote: because of the contention. Am i missing something here? While Vacuum is reading the DSM, operations may not be able to update the bits. We need to put the DSM in shared memory, if all the processes are going to update it, whereas if Vacuum is going to form the DSM, then it might well be in the process local memory. I can think of things like False sharing which might be avoided. But i think the main stuff is contention. Ah, I begin to understand where you are coming from now, yes. However, (ab-)using the WAL and archiver still doesn't look like a good idea to me. Even in indexes, we might end up reading dead tuples. We would mark it with LP_DEAD. So the overhead is less, but it's there. That's a good point, yes. Of course it's natural to think of some background jobs during OLTP, and they will be affected Agreed. Regards Markus ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Some ideas about Vacuum
Hi, Gokulakannan Somasundaram wrote: If we can ask the Vacuum process to scan the WAL log, it can get all the relevant details on where it needs to go. You seem to be assuming that only a few tuples have changed between vacuums, so that WAL could quickly guide the VACUUM processes to the areas where cleaning is necessary. Let's drop that assumption, because by default, autovacuum_vacuum_scale_factor is 20%, so a VACUUM process normally kicks in after 20% of tuples changed (disk space is cheap, I/O isn't). Additionally, there's a default nap time of one minute - and VACUUM is forced to take at least that much of a nap. So it's easily possible to have more dead tuples than live ones. In such cases, scanning the WAL can easily take *longer* than scanning the table, because the amount of WAL to read would be bigger. One main restriction it places on the WAL Logs is that the WAL Log needs to be archived only after all the transactions in it complete. In other words, WAL logs need to be given enough space to survive the longest transaction of the database. It is possible to avoid this situation by asking the Vacuum process to take the necessary information out of the WAL log, store it somewhere and wait for the long running transaction to complete. That would result in even more I/O... The information of interest in WAL is only the table inserts/updates/deletes. So if everyone accepts that this is a good idea, till this point, there is a point in reading further. Well, that's the information of interest; the question is where to store that information. Maintaining a dead space map looks a lot cheaper to me than relying on the WAL to store that information. Ultimately, what has been achieved till now is that we have made the sequential scans made by the Vacuum process on each table into a few random I/Os. Of course there are optimizations possible to group the random I/Os and find some sequential I/O out of it. 
But still we need to do a full index scan for all those indexes out there. HOT might have saved some work over there. But i am pessimistic here and wondering how it could have been improved. So it just strikes me, we can do the same thing which we did just with the tables. Convert a seq scan of the entire table into a random scan of few blocks. We can read the necessary tuple information from the tuples, group them and hit at the index in just those blocks and clean it up. Sorry, I don't quite get what you are talking about here. What do indexes have to do with dead space? Why not just keep acting on the block level? I can already hear people saying that it is not always possible to go back to the index from the table. There is this culprit called unstable function based indexes. No, there's no such thing. Citing [1]: All functions and operators used in an index definition must be immutable, that is, their results must depend only on their arguments and never on any outside influence. Of course, you can mark any function IMMUTABLE and get unstable function based indexes, but that turns into a giant foot gun very quickly. P.S.: Let the objections/opposing views have a subtle reduction in their harshness. I'm just pointing at things that are in conflict with my knowledge, assumptions and beliefs, all of which might be erroneous, plain wrong or completely mad. ;-) Regards Markus [1]: the Very Fine Postgres Manual on CREATE INDEX: http://www.postgresql.org/docs/8.3/static/sql-createindex.html ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
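To illustrate the immutability rule cited above (table and function names here are made up for the example, not from the original thread):

```sql
-- Allowed: lower() is IMMUTABLE, so the indexed expression is well-defined.
CREATE INDEX users_email_lower_idx ON users (lower(email));

-- Rejected by PostgreSQL, because now() depends on outside state:
--   CREATE INDEX orders_age_idx ON orders ((now() - created_at));
-- ERROR: functions in index expression must be marked IMMUTABLE

-- The foot gun: PostgreSQL trusts the declared volatility. This function
-- is wrongly declared IMMUTABLE although its result changes over time,
-- so an index built on it silently becomes inconsistent at year's end.
CREATE FUNCTION sale_year(d date) RETURNS int AS
  $$ SELECT CASE WHEN d > now()::date THEN 0
                 ELSE extract(year FROM d)::int END $$
  LANGUAGE sql IMMUTABLE;  -- a lie: depends on now()
```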
[HACKERS] LD_LIBRARY_PATH not honored on Debian unstable
Hi, I'm trying to run 'make check' on a 64bit Debian unstable. That aborts after 60 seconds due to not being able to connect to the postmaster. I figured that there's nothing wrong with the postmaster; rather, psql can't start up, because it gets linked against an older libpq.so.5. It looks like for some reason it doesn't respect the LD_LIBRARY_PATH env variable, see ldd output:

[EMAIL PROTECTED]:/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress# LD_LIBRARY_PATH=/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress/./tmp_check/install//usr/lib ldd -r -v tmp_check/install/usr/bin/psql
        linux-vdso.so.1 =>  (0x7fffc2bfe000)
        libpq.so.5 => /usr/lib/libpq.so.5 (0x2ac8e81ba000)
        libz.so.1 => /usr/lib/libz.so.1 (0x2ac8e83db000)
        libreadline.so.5 => /lib/libreadline.so.5 (0x2ac8e8606000)
        libcrypt.so.1 => /lib/libcrypt.so.1 (0x2ac8e8846000)
        libdl.so.2 => /lib/libdl.so.2 (0x2ac8e8a7e000)
        libm.so.6 => /lib/libm.so.6 (0x2ac8e8c82000)
        libc.so.6 => /lib/libc.so.6 (0x2ac8e8f04000)
        libssl.so.0.9.8 => /usr/lib/libssl.so.0.9.8 (0x2ac8e9262000)
        libcrypto.so.0.9.8 => /usr/lib/libcrypto.so.0.9.8 (0x2ac8e94ae000)
        libkrb5.so.3 => /usr/lib/libkrb5.so.3 (0x2ac8e983c000)
        libcom_err.so.2 => /lib/libcom_err.so.2 (0x2ac8e9ad7000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x2ac8e9cd9000)
        libncurses.so.5 => /usr/lib/libncurses.so.5 (0x2ac8e9ef5000)
        /lib64/ld-linux-x86-64.so.2 (0x2ac8e7f9c000)
        libk5crypto.so.3 => /usr/lib/libk5crypto.so.3 (0x2ac8ea132000)
        libkrb5support.so.0 => /usr/lib/libkrb5support.so.0 (0x2ac8ea357000)
        libkeyutils.so.1 => /lib/libkeyutils.so.1 (0x2ac8ea55f000)
        libresolv.so.2 => /lib/libresolv.so.2 (0x2ac8ea761000)
undefined symbol: pg_valid_server_encoding_id   (tmp_check/install/usr/bin/psql)
undefined symbol: PQconnectionNeedsPassword     (tmp_check/install/usr/bin/psql)

Giving it an additional LD_PRELOAD for the newish libpq.so.5.1 helps:

[EMAIL PROTECTED]:/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress# 
LD_PRELOAD=/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress/tmp_check/install/usr/lib/libpq.so.5.1 LD_LIBRARY_PATH=/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress/./tmp_check/install//usr/lib ldd -r -v tmp_check/install/usr/bin/psql
        linux-vdso.so.1 =>  (0x7fffe97fe000)
        /home/markus/projects/pgsql/sources/current/pgsql/src/test/regress/tmp_check/install/usr/lib/libpq.so.5.1 (0x2b69c1547000)
        libz.so.1 => /usr/lib/libz.so.1 (0x2b69c1765000)
        libreadline.so.5 => /lib/libreadline.so.5 (0x2b69c199)
        libcrypt.so.1 => /lib/libcrypt.so.1 (0x2b69c1bd)
        libdl.so.2 => /lib/libdl.so.2 (0x2b69c1e08000)
        libm.so.6 => /lib/libm.so.6 (0x2b69c200c000)
        libc.so.6 => /lib/libc.so.6 (0x2b69c228e000)
        libncurses.so.5 => /usr/lib/libncurses.so.5 (0x2b69c25ec000)
        /lib64/ld-linux-x86-64.so.2 (0x2b69c1329000)

Somebody have an idea on what's wrong here? Thanks. Regards Markus ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] LD_LIBRARY_PATH not honored on Debian unstable
Andrew Dunstan wrote: Smells suspiciously like an rpath problem to me. What are your configure settings? Ah, yeah, I see. Using something other than --prefix=/usr helped. Thanks for the hint! Regards Markus ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
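The mechanism behind the hint can be checked directly: a binary configured with --prefix=/usr gets /usr/lib baked into the executable as a DT_RPATH entry, and the dynamic linker searches DT_RPATH *before* LD_LIBRARY_PATH (only DT_RUNPATH is searched after it), which is why the env variable appeared to be ignored. A quick way to inspect this, using /bin/ls as a stand-in for the freshly built psql:

```shell
# Print any RPATH/RUNPATH entry baked into a binary. A DT_RPATH entry is
# searched before LD_LIBRARY_PATH; a DT_RUNPATH entry after it.
out=$(readelf -d /bin/ls | grep -iE 'rpath|runpath' || echo "no rpath entry")
echo "$out"
```

If the build-tree psql shows an RPATH of /usr/lib, the system libpq.so.5 wins no matter what LD_LIBRARY_PATH says; configuring with a non-system prefix avoids the clash.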
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Simon Riggs wrote: Hmmm. I think it fits rather neatly with BitmapIndexScans. It would be easy to apply the index condition and/or filters to see which segments are excluded and then turn off bits in the bitmap appropriately. Yeah, good point. Not fully sure about IndexScans yet. I don't think it would be worth trying to apply SE until we estimated we would return say 100 rows. It needs to be able to work without slowing down the common path. Yup. Or put it another way: SE is an optimization for sequential scans. For tables where it works well, it could possibly replace the index entirely. True Without the index, you would rely on SE to always be able to exclude enough segments, so that the seq scan is less expensive than an index scan with the following table lookups. It would have to be a very fat index scan for so large a table... ..for SE to be faster than an index scan, you mean? Yes. Regards Markus ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Named vs Unnamed Partitions
Simon Riggs wrote: I have to admit I always found it kludgy to have objects named invoices_2000_JAN and invoices_2000_FEB and so on. It's kind of a meta-denormalization. But so is specifying WHERE clauses repeatedly. The idea for using the WHERE clauses was to specifically avoid naming. I understand, and I'm all for avoiding needless, kludgy names. As I pointed out, knowledge of split points might be important for the database system. Maybe we can store the split point without the need for names? Dunno. If you guys really want names, we can have names, but I think I want to see a case where the storage characteristics of the table are so complex we can only make sense of it by naming particular chunks. Well, assuming you only have to deal with one split point, that's certainly true. However, there are people using more than two table spaces, thus obviously needing more split points. Can we name the split points, rather than the partitions? Regards Markus ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Some ideas about Vacuum
Hi, Gregory Stark wrote: That's an interesting thought. I think your caveats are right but with some more work it might be possible to work it out. For example if a background process processed the WAL and accumulated an array of possibly-dead tuples to process in batch. It would wait whenever it sees an xid which isn't yet past globalxmin, and keep accumulating until it has enough to make it worthwhile doing a pass. I don't understand why one would want to go via the WAL, that only creates needless I/O. Better accumulate the data right away, during the inserts, updates and deletes. Spilling the accumulated data to disk, if absolutely required, would presumably still result in less I/O. I think a bigger issue with this approach is that it ties all your tables together. You can't process one table frequently while some other table has some long-lived deleted tuples. Don't use the WAL as the source of that information and that issue's gone. I'm also not sure it really buys us anything over having a second dead-space-map data structure. The WAL is much larger and serves other purposes which would limit what we can do with it. Exactly. You seem to be assuming that only a few tuples have changed between vacuums, so that WAL could quickly guide the VACUUM processes to the areas where cleaning is necessary. Let's drop that assumption, because by default, autovacuum_vacuum_scale_factor is 20%, so a VACUUM process normally kicks in after 20% of tuples changed (disk space is cheap, I/O isn't). Additionally, there's a default nap time of one minute - and VACUUM is forced to take at least that much of a nap. I think this is exactly backwards. The goal should be to improve vacuum, then adjust the autovacuum_scale_factor as low as we can. As vacuum gets cheaper the scale factor can go lower and lower. 
But you can't lower it endlessly; it's still a compromise, because lowering it also means reducing the amount of tuples being cleaned per scan, which is against the goal of minimizing overall I/O cost of vacuuming. We shouldn't allow the existing autovacuum behaviour to control the way vacuum works. That's a point. As a side point, disk is cheap, I/O isn't is a weird statement. The more disk you use the more I/O you'll have to do to work with the data. That's only true as long as you need *all* your data to work with it. I still maintain the default autovacuum_vacuum_scale_factor is *far* too liberal. If I had my druthers it would be 5%. But that's mostly informed by TPCC experience; in real life the actual value will vary depending on the width of your records and the relative length of your transactions versus transaction rate. The TPCC experience is with ~400 byte records and many short transactions. Hm.. 5% vs 20% would mean 4x as many vacuum scans, but only a 15% growth in size (105% vs 120%), right? Granted, those 15% are also taken from memory and caches, resulting in additional I/O... Still these numbers are surprising me. Or am I missing something? Regards Markus ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
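The arithmetic behind those numbers, as a quick sanity check (a simplifying sketch, assuming dead tuples accumulate linearly and each vacuum run reclaims all of them):

```python
# The scale factor determines how much dead space accumulates before a
# vacuum run: the table peaks at (1 + scale_factor) times its live data,
# and halving the factor proportionally multiplies the number of scans.
def vacuum_tradeoff(factor_a, factor_b):
    scan_ratio = factor_b / factor_a      # extra scans with the lower factor
    peak_a = 1 + factor_a                 # peak size relative to live data
    peak_b = 1 + factor_b
    return scan_ratio, peak_a, peak_b

scans, small, big = vacuum_tradeoff(0.05, 0.20)
print(scans, small, big)  # 4.0 1.05 1.2
```

So a 5% factor indeed costs 4x the vacuum scans while only shaving the peak table size from 120% to 105% of the live data, which is what makes the trade-off look surprising.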
Re: [HACKERS] Named vs Unnamed Partitions
Hi, Simon Riggs wrote: With that in mind, can I clarify what you're thinking, please? Sure, I can try to clarify: 2) the things you've been discussing are essential requirements of partitioning and we could never consider it complete until they are also included and we must therefore talk about them now to check that its all possible before we do anything on SE I thought so, but am slowly dropping that point of view. In favor of something like: hey, if you manage to do it all automatically, cool, go for it! 3) doing SE first is right, I'm just thinking ahead Yes, SE certainly has merit. Combine it with some sort of maintained CLUSTERing order and it's worth doing, IMO. I'm not convinced about dynamic partitioning being able to generally replace explicit partitioning anytime soon. Sorry if that seems blunt, I'm just not clear where we're going. Well, implicit or automatic partitioning is still a pretty new concept to me, but I'm slowly beginning to like it. Thank you for pointing me at it. Regards Markus ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Named vs Unnamed Partitions
Hi, Simon Riggs wrote: When I delete all rows WHERE some_date < 'cut-off date' on a segment boundary value, that would delete all segments that met the criteria. The following VACUUM will then return those segments to be read-write, where they can then be refilled with new incoming data. The only command we would have to run is the DELETE, everything else is automatic. Agreed, that would be very nice. So not convinced of the need for named sections of tables yet. It all seems like detail, rather than actually what we want for managing large tables. What do you think about letting the database system know the split point vs. it having to find optimal split points automatically? Read-write vs. read-only is a good start, but can that concept be expanded to automatically choosing hash partitioning between storage systems, for example? Or more generally: can the database system gather enough information about the storage systems to make a decision as good as or better than the DBA? Regards Markus ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Proposal - libpq Type System beta-0.8a (was PGparam)
Hi, Andrew Chernow wrote: It might be something with the attachment, who knows. Most probably that was the case, yes. The -hackers list is limited, please use -patches to send patches. ;-) Regards Markus ---(end of broadcast)--- TIP 6: explain analyze is your friend
[HACKERS] Named vs Unnamed Partitions
Hi, IMO, the lengthy discussion about Segment Exclusion and Segment Visibility Maps has long turned into a discussion about partitioning in general. I'm thankful for all the new insights it has brought me and I want to continue sharing my view on things. What's following is highly theoretical and has brainstorming characteristics. You've been warned. There are two very distinct ways to handle partitioning. For now, I'm calling them named and unnamed partitioning. Let's have a closer look at both options from a user's point of view. I'm using Andrew's pseudo DDL example from the above-mentioned thread: ALTER TABLE foo SET read_only='t' WHERE created_on < '2007-01-01'; Given all tuples were read-writeable before, that implicitly created two partitions. Giving them names could look like this: ALTER TABLE foo SPLIT INTO old_foos AND new_foos AT created_on < '2007-01-01'; ALTER PARTITION old_foos SET READ ONLY; Instead of only setting the read-only property, one could also set an alternative table space for the partition: ALTER TABLE foo SET TABLE SPACE large_but_slow_storage WHERE created_on < '2007-01-01'; vs: ALTER PARTITION old_foos SET TABLE SPACE large_but_slow_storage; Please also note that neither variant is limited to range partitioning. You can theoretically partition by pretty much anything, for example with a WHERE clause like: ... WHERE (id % 5) < 2 The primary difference I see between these two ways to declare partitions is that the former only modifies tuple properties (read-only, storage location), while the latter also tells the database *why* it has to modify them. That has several different effects. First, newly inserted tuples are treated differently. For unnamed partitions, there must be defaults, like read-writable and a default table space. With named partitions, you define split points, so I guess one expects newly inserted tuples to end up in the right partition automatically. 
Unnamed partitioning could be equally automatic when letting a function decide where to insert the new tuple. Second, repartitioning must be treated differently. With unnamed partitioning, the admin must first adjust the defaults (if required) and then move the existing tuple properties accordingly. With named partitions, the admin only needs to adjust the split point and the database system knows what it has to do. And third, but IMO most importantly: to be able to optimize queries, the database system has to know split points, so it can exclude partitions or segments from scanning. Obviously, with named partitions, it always knows them. Otherwise, you'll have to maintain some information about the tuples in your partitions, as Simon does with the min/max tuples. As soon as required, it could also maintain additional min/max values, i.e. for (id % 5) for the above example. I hope to have shown the most relevant aspects. To conclude, I'd say that named partitioning is closer to manually managed partitioning, as already known and often used. While unnamed partitioning is closer to automated partitioning, where the DBA does *not need* to have names for partitions, which is a pretty new and interesting idea to me. Regards Markus ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
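The insert-routing difference described in this message can be made concrete with a toy sketch (purely illustrative Python, not PostgreSQL code; the `old_foos`/`new_foos` names come from the example above, everything else is made up):

```python
# Toy model of the two partitioning styles: with a *named* split point the
# system knows why the split exists and routes new tuples itself; with
# *unnamed* partitioning, defaults apply unless a user-supplied function
# encodes the same decision.
from datetime import date

SPLIT_POINT = date(2007, 1, 1)  # hypothetical split declared by the DBA

def route_named(created_on):
    # Named partitioning: the declared split point drives routing.
    return "old_foos" if created_on < SPLIT_POINT else "new_foos"

def route_unnamed(created_on, placement_fn):
    # Unnamed partitioning: only a function chosen by the DBA can decide.
    return placement_fn(created_on)

assert route_named(date(2006, 6, 1)) == "old_foos"
assert route_named(date(2007, 6, 1)) == "new_foos"
# The unnamed variant matches only if the DBA re-encodes the split point:
assert route_unnamed(date(2006, 6, 1),
                     lambda d: "old_foos" if d < SPLIT_POINT else "new_foos") == "old_foos"
```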
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi Csaba, Csaba Nagy wrote: One additional thought: what about a kind of segment fill factor? Meaning: each segment has some free space reserved for future updates/inserts of records in the same range of its partitioning constraint. And when inserting/updating you put the new record into the corresponding segment... just like a very coarse clustering. Hm.. yeah. That way, a few writes to a read optimized segment could be accepted, without having to drop the optimization immediately. And the other way around: generally prevent having to drop the optimization by forcing tuples to be written to a segment with matching min/max tuples. Although, that's not exactly trivial, I think. However, for tables which don't fit the use case of SE, people certainly don't want such a fill factor to bloat their tables. Regards Markus ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
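The segment fill factor idea could be sketched roughly like this (a toy model in Python; the capacity and fill-factor numbers are invented for illustration and have nothing to do with actual PostgreSQL segment sizes):

```python
# Toy sketch of a "segment fill factor": reserve some free space in each
# segment during bulk load, so a later tuple whose key falls inside a
# segment's min/max range can be stored there without widening any
# other segment's range (and thus without spoiling its exclusion bounds).
SEGMENT_CAPACITY = 4   # tuples per segment (invented, tiny for the demo)
FILL_FACTOR = 0.75     # leave 25% of each segment free for matching inserts

class Segment:
    def __init__(self):
        self.keys = []
    def covers(self, key):
        return bool(self.keys) and min(self.keys) <= key <= max(self.keys)
    def has_room(self, initial_load=False):
        limit = SEGMENT_CAPACITY * FILL_FACTOR if initial_load else SEGMENT_CAPACITY
        return len(self.keys) < limit

def insert(segments, key):
    # Prefer a segment whose existing key range already covers the key.
    for seg in segments:
        if seg.covers(key) and seg.has_room():
            seg.keys.append(key)
            return seg
    # Otherwise bulk-load into the last segment up to the fill factor,
    # or start a new segment.
    if not segments or not segments[-1].has_room(initial_load=True):
        segments.append(Segment())
    segments[-1].keys.append(key)
    return segments[-1]

segments = []
for k in [1, 2, 3, 10, 11, 12]:    # bulk load in key order
    insert(segments, k)
seg = insert(segments, 2)           # late insert lands in the 1..3 segment
assert seg is segments[0]           # thanks to the reserved free space
```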
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Csaba Nagy wrote: Sure, but it could be configurable and should only be enabled if the table is marked as partitioned on some condition... As I'm regarding SE as an optimization, I disagree here.. As all optimizations, SE should conceptually be reasonably close to cost-less when unneeded. I think it would be a bad idea anyway if the DB would start partitioning on some arbitrary criteria based on analyzing the table, so the DBA should be the one to decide on what criteria to partition. I absolutely agree for real partitioning, which targets manageability of table partitions. In particular it could be a bad idea on occasions to partition on the clustering criteria for tables which were clustered once. Why is that? AFAIUI, Segment Exclusion combines perfectly well with clustering. Or even better, with an upcoming feature to maintain clustered ordering. Where do you see disadvantages of such an optimization for sequential scans? Regards Markus ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Andrew Sullivan wrote: On Sat, Jan 05, 2008 at 08:02:41PM +0100, Markus Schiltknecht wrote: Well, management of relations is easy enough, known to the DBA and most importantly: it already exists. Having to set up something which is *not* tied to a relation complicates things just because it's an additional concept. But we're already dealing with some complicated concepts. Possibly, yes, but that's no reason to introduce even more complicated concepts... Does anything speak against letting the DBA handle partitions as relations? There isn't anything that will prevent current-style partitioning strategies from continuing to work in the face of Simon's proposal. Agreed. Nor will Simon's proposal completely replace them. Without even trying, I can think of a dozen examples in the past 5 years where I could have used that sort of functionality. Because the cost of data retrieval was high enough, we had to decide that the question wasn't worth answering. Some of those answers might have been quite valuable indeed to the Internet community, to be frank; but because I had to pay the cost without getting much direct benefit, it just wasn't worth the effort. Sure, there's value in Simon's proposal. But it has pretty strict requirements. IMO, it's pretty hard to say if it would have helped at all for your cases. Any of them still available to check? Remember the requirements: no single tuple in the segment may be significantly out of the average bounds. Otherwise, the min/max gets pretty useless and the segment can never be excluded. As said before, combination with CLUSTERing might help, yes. But if you need to maintain CLUSTERed ordering, aren't there better ways? For example, you could use binary searching on the relation directly, much like with indices, instead of sequentially scanning the CLUSTERed relation. That would even give us some sort of indices with visibility. Agreed. 
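The binary-search idea above — using the physical order of a CLUSTER-maintained table the way an index would be used — can be sketched with a toy model where each "page" holds its keys in globally sorted order (illustrative Python only, not a proposed implementation):

```python
# Toy model of searching a CLUSTERed relation directly: instead of
# seq-scanning every page, binary-search the per-page boundary keys.
import bisect

# A clustered relation modeled as pages of keys in globally sorted order.
pages = [[1, 3, 5], [7, 9, 11], [13, 15, 17]]

def find_page(pages, key):
    # Binary-search the first key of each page; the last page whose
    # first key is <= key is the only one that can contain it.
    firsts = [p[0] for p in pages]
    i = bisect.bisect_right(firsts, key) - 1
    return i if i >= 0 else None

i = find_page(pages, 9)
assert i == 1 and 9 in pages[i]   # one page probe instead of a full scan
assert find_page(pages, 1) == 0
assert find_page(pages, 17) == 2
assert find_page(pages, 0) is None  # smaller than everything stored
```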
I'd say that's why the DBA needs to be able to define the split point between partitions: only he knows the meaning of the data. I think this is only partly true. A casual glance at the -general list will reveal all manner of false assumptions on the parts of administrators about how their data is structured. My experience is that, given that the computer has way more information about the data than I do, it is more likely to make the right choice. To the extent it doesn't do so, that's a problem in the planning (or whatever) algorithms, and it ought to be fixed there. Well, Postgres doesn't automatically create indices, as a counter-example. With regard to partitioning over multiple table spaces, I think the DBA definitely has more information available than the computer. A DBA (hopefully) knows future plans and emergency strategies for the storage system, for example. Lacking such information, the database system will have a pretty hard time making a good decision on how to partition between table spaces, IMO. Regards Markus ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Andrew Sullivan wrote: On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote: Does anything speak against letting the DBA handle partitions as relations? Yes: it doesn't solve the problem I have, which is that I don't want to have to manage a whole bunch of tables. I want one table, and I want to be able to say, "That section is closed." That's a fair requirement. Wanting to manage partitions manually is another fair requirement, IMO. They can well coexist. Remember the requirements: no single tuple in the segment may be significantly out of the average bounds. Otherwise, the min/max gets pretty useless and the segment can never be excluded. The segment can never be excluded in a search on that key, yes. But consider the likely cases we're looking at: Uh, which key are you talking about? AFAIU Simon's proposal, he suggests maintaining min/max values for all columns of the table. WHERE some_date >= '1999-01-01' AND some_date < '2001-01-01'; Yeah, and if only *one* tuple in the 1G segment has: some_date <= '1998-12-31' OR some_date >= '2001-01-01' Segment Exclusion can't exclude that segment. That's all I'm saying. That said, I am not trying to argue it should be adopted without poking at its weaknesses. I just think it unfair to ask the proposal to address problems it's not really aimed at. Huh? I'm certainly not the one asking for it. Quite the opposite, I'm warning against over-estimating the use of SE. In his proposal, Simon was explicitly comparing to declarative partitioning, pointing out lots of advantages and implicitly stating that SE could cover most, if not all needs of what's commonly understood by partitioning. That's where I disagree. But certainly, SE and SVM have some merit for their use case. And I'm looking forward to test patches ;-) Regards Markus ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
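The outlier effect argued about here is easy to demonstrate with a toy model of per-segment min/max exclusion (illustrative Python, not the proposed implementation; dates are reduced to plain years for brevity):

```python
# Toy model of Segment Exclusion: a segment can be skipped when its
# min/max for the scanned column lies entirely outside the predicate.
def build_minmax(segments):
    return [(min(s), max(s)) for s in segments]

def scannable(minmax, lo, hi):
    # Segments that *might* hold rows with lo <= value < hi.
    return [i for i, (mn, mx) in enumerate(minmax) if mx >= lo and mn < hi]

# Cleanly date-ordered data, one segment per chunk:
clean = [[1999, 1999], [2000, 2000], [2001, 2001]]
assert scannable(build_minmax(clean), 1999, 2001) == [0, 1]   # 2001 excluded

# A single stray old tuple in the last segment, and it can no longer
# be excluded -- exactly the "one tuple in the 1G segment" complaint:
dirty = [[1999, 1999], [2000, 2000], [2001, 1999]]
assert scannable(build_minmax(dirty), 1999, 2001) == [0, 1, 2]
```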
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Gokulakannan Somasundaram wrote: On Jan 5, 2008 6:15 PM, [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: One thought I had back then, with partitioned tables was gee -- B-tree index is already doing a partition; why do a manual partition on top of that?. Can you please explain more on what you are trying to say here? I think this has to do with SE not being of much use for index scans. Or put it another way: SE is an optimization for sequential scans. For tables where it works well, it could possibly replace the index entirely. Without the index, you would rely on SE to always be able to exclude enough segments, so that the seq scan is less expensive than an index scan with the following table lookups. With an index, the planner gets a hard time deciding between the index scan and the (possibly SE optimized) seq scan. Regards Markus ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Robert Treat wrote: On Saturday 05 January 2008 14:02, Markus Schiltknecht wrote: To satisfy all the different requirements of partitioning with segments based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way. Why? Uh.. if a table (RELKIND_RELATION) can only span one table space, as it is now, all of its segments are in the same table space. I don't quite call that partitioning. Well, sure, you could call it so, but then, each and every Postgres table is already partitioned into 1G segments. It all depends on the definitions, but in my world, horizontal partitioning for databases involves multiple table spaces (and is quite useless without that). Calling anything else partitioning is confusing, IMO. I'm not following this. If we can work out a scheme, I see no reason not to allow a single table to span multiple tablespaces. Do you see a problem with that? Uhm... well, no. I was just pointing out that it's a requirement. It depends on how you define things, but I'm seeing it that way: table -- 1:n -- partition -- 1:1 -- table space -- 1:n -- segments What I'm advocating is making partitions available to the DBA as some kind of relation she can query separately and move around between table spaces. Why should that not be possible with other schemes? Moving the split point between two partitions involves moving tuples around, no matter if you are going to move them between segments or between relations building the partitions. The difference is that, if I currently have a table split by month, I can re-partition it into weekly segments, and only shuffle one month's data at a time, minimizing impact on the system while I shuffle it. This can even be used to do dynamic management, where data from the current month is archived by day, data from the past year by week, and data beyond that done monthly. This should be possible for both schemes, I see no connection to what we've discussed. 
SE doesn't magically give you this level of control you are requesting here. Quite the opposite: referring to CLUSTERing makes me wonder if that's not going to shuffle way too many tuples around. What I'm saying is that SE doesn't partition the segments into different table spaces. Thus I don't consider it database partitioning in the first place. As I currently understand it, it's: table -- 1:1 -- table space -- 1:n -- partitions -- 1:n -- segments On many other databases, if you change the partition scheme, it requires exclusive locks and a shuffling of all of the data, even data whose partitions aren't being redefined. Even worse are systems like mysql, where you need to rewrite the indexes as well. To me, these requirements always seem like show stoppers; I generally can't afford to lock a table while the database rewrites a billion rows of data. I fully agree here. How do you plan to solve that problem on top of SE? In a more general sense, a global index is an index that spans multiple partitions, as opposed to a local index, which is an index on specific partitions; postgresql currently supports the latter, not the former. In any case, my thinking is if we had the segment exclusion technique, I could convert that partitioned table into a regular table again, ... on a single table space ... use segment exclusion to handle what is currently handled by partitions, ... except that there is no partitioning (!?!) (between table spaces) and create a global index across all the other data for that other, currently killer, query. I thought the table you are referring to is bigger than your fastest table space? That would even make it impossible. See where I'm coming from? And why I'm stating that SE is an optimization (for seq scans), but not partitioning? Regards Markus ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Andrew Sullivan wrote: On Fri, Jan 04, 2008 at 10:26:54PM +0100, Markus Schiltknecht wrote: I'm still puzzled about how a DBA is expected to figure out which segments to mark. I think that part might be hand-wavy still. But once this facility is there, what's to prevent the current active segment (and the rest) from also getting this mark, which would mean the table is read only? Well, sure, marking *all* segments read-only is pretty easy. But that was not quite my point. Regards Markus ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, [EMAIL PROTECTED] wrote: The main proposal deliberately has few, if any, knobs and dials. That's a point of philosophy that I've had views on previously: my normal stance is that we need some knobs to allow the database to be tuned to individual circumstances. One thought I had back then, with partitioned tables was gee -- B-tree index is already doing a partition; why do a manual partition on top of that?. Well, that's why I'm so focused on manageability: I think the primary purpose of partitioning is manageability (of the underlying storage for a table). Regards Markus ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Simon Riggs wrote: On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote: I'm still puzzled about how a DBA is expected to figure out which segments to mark. Simon, are you assuming we are going to pass on segment numbers to the DBA one day? No Way! Ah, I'm glad ;-) Simon Riggs wrote: Skepticism is OK, but I'd like to get your detailed thoughts on this. I've been an advocate of the multi-tables approach now for many years, so I don't expect everybody to switch their beliefs on my say-so overnight. Let me make a few more comments in this area: I've so far always thought about some sort of multi-relations approach for partitioning, yes. Let's see if I can get my mind around single-table partitioning. The main proposal deliberately has few, if any, knobs and dials. That's a point of philosophy that I've had views on previously: my normal stance is that we need some knobs to allow the database to be tuned to individual circumstances. In this case, partitioning is way too complex to administer effectively and requires application changes that make it impossible to use for packaged applications. The latest Oracle TPC-H benchmark uses 10 pages of DDL to set it up and if I can find a way to avoid that, I'd recommend it to all. I do still want some knobs and dials, just not 10 pages' worth, though I'd like yours and others' guidance on what those should be. Oracle have been responding to feedback with their new interval partitioning, but it's still a multi-table approach in essence. I can absolutely support your efforts to minimize knobs and configuration DDL. However, my current feeling is that segments-based partitioning complicates things, because the DBA doesn't have tools and commands to handle segments. To satisfy all the different requirements of partitioning with segments-based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way. However, what I certainly like is the automated split point definition. 
Instead of having to create tables by hand and linking them via inheritance and constraint exclusion, I have something very similar in mind, like what you proposed for marking read-only segments. Something like: SPLIT TABLE customers AT cust_name > 'n'; or: SPLIT TABLE inventory AT inv_id % 4 = 2; In my imagination, this should automatically create the underlying relations, i.e.: NOTICE: relations inventory__l and inventory__r have been created. That way, the DBA could then handle those like normal relations, querying them or moving them to different table spaces like all other normal relations. In a way, that's not so different from possible extensions on top of Segment Exclusion, except that the DBA additionally gets a relation name to be able to address the set of segments which form a partition. Or put it the other way around: go for Segment Exclusion, but add some sort of a sentinel relation for each set of segments, to make them reachable for the DBA. My observation of partitioned databases is that they all work beautifully at the design stage, but problems emerge over time. A time-based range partitioned table can often have different numbers of rows per partition, giving inconsistent response times. A height-balanced approach, where we make the partitions all the same size yet vary the data value boundaries, will give much more consistent query times and can be completely automated much more easily. Uh.. well, consistent query time isn't the first thing I'm expecting from partitioning by time ranges. If I wanted consistent query times I'd rather use hash partitioning or something, no? I'd even state that one *wants* inconsistent response times when using time-based range partitioning, by moving old, seldom used data to slower storage and keeping only a small amount of often used tuples on the faster disks, for example. The SVM concept doesn't cover everything that you can do with partitioning, but my feeling is it covers the main use cases well. 
As I regard manageability to be the main advantage of partitioning, which you've intentionally left out for now, I disagree here. How could SVM or Segment Exclusion potentially cover what hash partitioning does? Maybe together with the ability to store different segments of a table on different table spaces. That could be considered an approach to range partitioning. But then, that would be the partitioning, and not SVM or Segment Exclusion. To me, both SVM and SE look much more like an optimization for certain special cases and don't have much to do with partitioning. Regards Markus ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
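As an aside to the height-balanced idea mentioned in this exchange (partitions with equal row counts but varying value boundaries): computing such boundaries is essentially a quantile computation over the partitioning column. A toy sketch, purely illustrative:

```python
# Sketch of height-balanced partitioning: cut the sorted data into parts
# with equal row counts, letting the value boundaries fall where they may,
# instead of declaring fixed value ranges up front.
def height_balanced_bounds(values, n_parts):
    # Boundary values that split the sorted data into n equal-row parts.
    s = sorted(values)
    step = len(s) / n_parts
    return [s[int(i * step)] for i in range(1, n_parts)]

# A skewed "month" column: most rows fall in the latest months.
data = [1] * 1 + [2] * 1 + [3] * 2 + [4] * 2 + [5] * 2 + [6] * 4 + [7] * 4 + [8] * 8
bounds = height_balanced_bounds(data, 4)   # 6 rows per partition
assert bounds == [5, 7, 8]                 # boundaries crowd toward the skew
# Equal-width month ranges (e.g. 1-2, 3-4, 5-6, 7-8) would instead give
# partitions of 2, 4, 6 and 12 rows -- the inconsistency Simon describes.
```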
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Robert Treat wrote: Personally I can't say it complicates things, because it isn't clear how it will be managed. :-) Well, management of relations is easy enough, known to the DBA and most importantly: it already exists. Having to set up something which is *not* tied to a relation complicates things just because it's an additional concept. But as I've pointed out, maybe what we have in mind isn't that different at all. Just have a sentinel relation mean a set of segments, i.e. all read-only segments of a table. Then again, a table - in a way - is not much else than a set of segments. So where's the real difference? To satisfy all the different requirements of partitioning with segments based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way. Why? Uh.. if a table (RELKIND_RELATION) can only span one table space, as it is now, all of its segments are in the same table space. I don't quite call that partitioning. Well, sure, you could call it so, but then, each and every Postgres table is already partitioned into 1G segments. It all depends on the definitions, but in my world, horizontal partitioning for databases involves multiple table spaces (and is quite useless without that). Calling anything else partitioning is confusing, IMO. So the one thing that always scares me about these define it all and let the database sort it out methods is they seem to lead to cases where the system ends up rewriting the data to fit into some new partition layout. That holds true no matter if you shuffle between segments or relations. To be able to let the DBA define an exact split point, the database *will* have to shuffle tuples around. Why does that scare you? It's a regular database system's maintenance procedure. One thing that is nice about the current partitioning scheme is you can control the impact of this behavior in these scenarios, by moving around small portions of the table at a time. Uh.. 
I'm not quite following. What current partitioning scheme are you referring to? Why should that not be possible with other schemes? Moving the split point between two partitions involves moving tuples around, no matter if you are going to move them between segments or between relations building the partitions. More to the point (I think) is that people define access to the data based on the meaning of the data, not how it is stored on disk. For example, in some tables we only need to be active on 1 month's worth of data... how that is laid out on disk (# partitions, which tablespaces) is a means to the end of working actively on 1 month's worth of data. I can't think of many cases where people would actually say they want to work actively on the most recent GB of data. Agreed. I'd say that's why the DBA needs to be able to define the split point between partitions: only he knows the meaning of the data. To me, both SVM and SE look much more like an optimization for certain special cases and don't have much to do with partitioning. Even if this were true, it might still be a useful optimization. Possibly, yes. To me, the use case seems pretty narrow, though. For example it doesn't affect index scans much. One table I am thinking of in particular in my system has one query we need to run across partitions, which ends up doing a slew of bitmap index scans for all the partitions. If using segment exclusion on it meant that I could get a global index to help that query, I'd be happy. As proposed, Segment Exclusion works only on exactly one table. Thus, if you already have your data partitioned into multiple relations, it most probably won't affect your setup much. It certainly has nothing to do with what I understand by 'global index' (that's an index spanning multiple tables, right?). Regards Markus ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hello Simon, Simon Riggs wrote: I've come up with an alternative concept to allow us to discuss the particular merits of each. ISTM that this new proposal has considerable potential. Hm.. interesting idea. If we were able to keep track of which sections of a table are now read-only then we would be able to record information about the data in that section and use it to help solve queries. This is turning the current thinking on its head: we could derive the constraints from the data, rather than placing the data according to the constraints. That would be much more natural: load data into the table and have the system work out the rest. Yeah, but that's also the most limiting factor of your approach: it covers only horizontal partitioning by time (or to be more precise, by columns which are very likely to increase or decrease with time). All other columns will very likely contain values from the full range of possible values. As you have pointed out, that might be a very frequent use case. I can't argue about that, however, I think it's important to be well aware of that limitation. Other scans types might also use segment exclusion, though this would only be useful for scans retrieving many rows, otherwise the overhead of segment exclusion might not be worthwhile. Uh.. the overhead of checking against min/max values doesn't seem that big to me. I rather think the gain for index scans would be prohibitively small, because (given frequent enough vacuuming) an index scan shouldn't return many pointers to tuples in segments which could be optimized away by segment exclusion. If we collect data for all columns then many of our implicit constraints would be useless. e.g. if a column only has a few values and these are present in all segments. Matching our predicate against all constraints would be expensive, so we must weed out poor constraints. We would do this by removing any constraint that overlapped more than 10% of other segments. 
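The weeding heuristic just quoted (drop a segment's derived constraint for a column when it overlaps too many other segments) might look like this sketch (illustrative Python; the pairwise-overlap metric is one plausible reading of the proposal, not its actual design):

```python
# Sketch of the quoted heuristic: keep a column's implicit min/max
# constraints only when segment ranges rarely overlap each other, since
# heavily-overlapping ranges can never exclude anything.
def overlap_fraction(ranges):
    # Fraction of segment pairs whose [min, max] ranges overlap.
    pairs = [(a, b) for i, a in enumerate(ranges) for b in ranges[i + 1:]]
    overlapping = sum(1 for (a_lo, a_hi), (b_lo, b_hi) in pairs
                      if a_lo <= b_hi and b_lo <= a_hi)
    return overlapping / len(pairs)

def keep_constraint(ranges, threshold=0.10):
    # Mirrors the "more than 10% overlap" cutoff from the proposal.
    return overlap_fraction(ranges) <= threshold

# A date-like column: ranges rise with time, no overlap -> worth keeping.
dates = [(0, 10), (11, 20), (21, 30)]
assert overlap_fraction(dates) == 0.0 and keep_constraint(dates)

# A low-cardinality column whose few values appear in every segment ->
# useless for exclusion, so it gets weeded out.
status = [(1, 3), (1, 3), (1, 3)]
assert overlap_fraction(status) == 1.0 and not keep_constraint(status)
```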
Various heuristics would likely need to be discovered during development to make this work without resorting to manual commands. Uh, well, that's about the limitation I've pointed out above. But is it really worth maintaining statistics about overlapping values and removing min/max checks for certain columns? It would save you the min/max check per segment and scan, but cost maintaining the statistics and checking against the statistics once per scan. AFAICS the block with the min/max tuple per segment will often remain cached anyway... dunno. Noting which segments are read-only --- Everything so far has relied upon our ability to note which segments of a table are read-only. We could do this in two ways: 1) have the system automatically keep track of non-changing data 2) have the DBA issue a command to mark a segment as read-only now Probably a combination of the two is better, so we have three states for segments - READ_WRITE_ALLOWED - EFFECTIVE_READ_ONLY (set by 1 only) - MARKED_READ_ONLY (set by 2 only) Having said that I want to concentrate on (1), though consider the other one also in case requested by reviewers. Hm.. AFAICT, horizontal partitioning often serves manageability, which is quite limited having all data in one table (you can't move a single segment to a different tablespace). Thus I think option 2 is pretty constrained in usability. What would the DBA gain by setting a segment to read only? How does the DBA figure out which segment a tuple is stored in (so she can decide to mark it read only)? The user may also wish to clear down very old data, so allowing DELETEs can ensure the user can still remove old data from the table. By carefully choosing the values to be deleted, a whole segment can be deleted and then returned to the FSM. Oh, yeah, that sounds like a good optimization. Bulk deletes, yummie! This proposal offers many of the advantages of the earlier Visibility Map proposal, yet without major changes to heap structure. 
Since the segment-level visibility map is more granular it would only be 80% as effective as the more detailed block-level map. Having said that, the bookkeeping overheads will also be considerably reduced, so it does seem a good joint approach. It also allows freezing to be handled fully, which was a problem with the original visibility map proposal. WAL logging visibility map changes is also now much easier. I generally agree, although I'm somewhat dubious about the 80% factor. My thoughts have been targeted directly at partitioning, yet I have to admit that this idea overlaps, and in my world view, replaces the Visibility Map proposal. I very much like the name Visibility Map though. I would even say, that partitioning is somewhat of a misleading term here, because it normally allows the DBA to decide on where to split. Given that we are operating on segments here, to which the DBA has very limited information and access, I
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Simon Riggs wrote: - any Fact table where measurements/observations/events are accumulated e.g. Web Hits (any Internet events) Call Detail Records Sales Security Events Scientific Measurements Process Control - any Major Entity where new entities are created from a sequence e.g. Orders, OrderItems Invoices Shipments, Returns most SCM/DCM events ...and only changed very seldom after a while, I would add. Because changing an old tuple would invalidate the optimization for the affected segment. That's why this optimization can't help for inventory tables, where an id might correlate with time and storage location, but writing access doesn't correlate with storage location (segment number) and time. Regards Markus ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Simon Riggs wrote: The smaller the partition size the greater the overhead of managing it. Also I've been looking at read-only tables and compression, as you may know. My idea was that in the future we could mark segments as either - read-only - compressed - able to be shipped off to hierarchical storage Those ideas work best if the partitioning is based around the physical file sizes we use for segments. As much as I'd like this simplification, I'm still thinking of these segments as an implementation detail of Postgres, and not something the user should have to deal with. Allowing the DBA to move segments to a different tablespace and giving him the possibility to check which tuples are in which segment seems awkward from a user's perspective, IMO. Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Simon Riggs wrote: On Fri, 2008-01-04 at 13:06 -0500, Andrew Sullivan wrote: On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote: Agreed. Just a minor note: I find marked read-only too strong, as it implies an impossibility to write. I propose speaking about mostly-read segments, or optimized for reading or similar. Hm.. yeah, after rereading, I realize that I've mixed up states no. 2 and 3 here, sorry. I do want some segments to be _marked_ read-only: I want attempted writes to them to _fail_. Well, I can see use cases for marking tuples or complete relations as read only. But segments? I'm still puzzled about how a DBA is expected to figure out which segments to mark. Simon, are you assuming we are going to pass on segment numbers to the DBA one day? If not, a more user-friendly command like MARK READ ONLY WHERE date = 2006 would have to move tuples around between segments, so as to be able to satisfy the split point exactly, right? I think Markus thought that we would mark them read only automatically, which was not my intention. I believe it's possible to have this in a way that will make you both happy. Some more explanation: There would be three different states for a segment: 1. read write 2. optimized for reading, as Markus says it 3. marked read only by explicit command Thanks for clarification. Regards Markus
Re: [HACKERS] [GENERAL] Slow PITR restore
Hi, Alvaro Herrera wrote: Simon Riggs wrote: ISTM it's just autovacuum launcher + Hot Standby mixed. I don't think you need a launcher at all. Just get the postmaster to start a configurable number of wal-replay processes (currently the number is hardcoded to 1). I also see similarity to what I do for Postgres-R: a manager and helper backends which can be started upon request. Such a scheme is currently used for autovacuum, I'm using it for replication, it could help for parallelizing recovery and it certainly helps for parallelizing queries as discussed in another thread. Maybe it's worth considering a general framework for such a manager or auto launcher, as well as helper backends. It certainly depends on the complexity of that manager, but it should probably better be an external process. What all of the helper backends have in common, AFAICT: - a connection to a database - no client connection - superuser privileges (For parallelized queries, superuser privileges might appear wrong, but I'm arguing that parallelizing the rights checking isn't worth the trouble, so the initiating worker backend should do that and only delegate safe jobs to helper backends. Or is that a serious limitation in a way?) Most code for that already exists, as we already have various helpers. What's missing, IMO, is a communication channel between the worker and helper backends as well as between the backends and the manager. That's needed, e.g., for worker backends being able to request helper backends and feed them with their wishes. Unix pipes can only be set up between the parent and the child of a fork, they eat file descriptors, need to copy data to the kernel and back and IIRC, there were portability issues. That's why I've written the internal message passing (IMessage) stuff, see -patches [1].
I'm all for unifying such a manager process and generalizing the requesting and launching of helpers as well as management of their state (handling helper processes that have died, keeping a pool of idle helpers which are already connected to a database, etc..). Most of that already exists in my Postgres-R code, maybe I can derive a general purpose patch to start contributing code from Postgres-R? Comments? Use cases I'm missing? Regards Markus [1]: last time I published IMessage stuff on -patches, WIP: http://archives.postgresql.org/pgsql-patches/2007-01/msg00578.php
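A rough analogy to the requested worker/helper message channel, using Python threads and queues purely as a stand-in (the real IMessage patch passes messages through shared memory between backend *processes*, in C; helper() and the message shapes here are made up):

```python
import threading
import queue

def helper(inbox, outbox):
    """Toy helper: wait for job messages, process them, report back.
    A stand-in for the IMessage channel sketched above, not the real
    shared-memory implementation."""
    for job in iter(inbox.get, None):        # None = shutdown request
        outbox.put(("done", job * 2))

# The "manager" side: hand the helper a job and collect the result.
inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=helper, args=(inbox, outbox))
t.start()
inbox.put(21)
print(outbox.get())                          # ('done', 42)
inbox.put(None)                              # ask the helper to exit
t.join()
```

The design point being argued for above is exactly this shape: a generic request/response channel plus lifecycle management (idle pool, death handling), independent of what the helper actually does.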
Re: [HACKERS] [GENERAL] Slow PITR restore
Hannu Krosing wrote: until N fubbers used ..whatever a fubber is :-) Nice typo! Markus
Re: [HACKERS] [GENERAL] Slow PITR restore
Hello Hannu, Hannu Krosing wrote: (For parallelized queries, superuser privileges might appear wrong, but I'm arguing that parallelizing the rights checking isn't worth the trouble, so the initiating worker backend should do that and only delegate safe jobs to helper backends. Or is that a serious limitation in a way?) at least functions defined with SECURITY DEFINER; may be a problem Uhm.. what I had in mind was parallelizing sequential scans, index scans, joins and such - database internal stuff. Parallelizing user defined functions (or what did you have in mind?) is more difficult and sometimes impossible, because the planner cannot know ahead what the function's going to do. However, thinking about it, maybe one could and should try to parallelize computationally intensive IMMUTABLE functions. But already with STABLE ones I'm getting suspicious. It would require users to write real thread-safe (well, multi-process-safe) functions, which I doubt somewhat. Granted, they theoretically *should* be safe, but... Anyway, if that's the only show stopper, one could certainly tell helper backends to substitute their superuser privileges with the invoker's privileges. Not sure if that's worth the trouble, though. Regards Markus
Re: [HACKERS] VLDB Features
Hello Gregory, Gregory Stark wrote: Oracle is using Direct I/O so they need the reader and writer threads to avoid blocking on i/o all the time. We count on the OS doing readahead and buffering our writes so we don't have to. Direct I/O and needing some way to do asynchronous writes and reads are directly tied. Yeah, except in cases where we can predict non-sequential reads ahead of time. Which admittedly doesn't come up too frequently and can probably be handled with posix_fadvise - as you are currently testing. Where Parallel query is useful is when you have queries that involve a substantial amount of cpu resources, especially if you have a very fast I/O system which can saturate the bandwidth to a single cpu. Full ACK, the very same applies to parallel querying on shared-nothing clusters. Those can help if the bandwidth to all processing cores together becomes the bottleneck (and the resulting data is relatively small compared to the input data). For example, Sun's UltraSparc T2 features only 8 PCIe lanes for those 8 cores, so you end up with 250 MiB/sec per core or about 32 MiB/sec per thread on average. To be fair: their 10 Gig Ethernet ports don't go via PCIe, so you get an additional 2x 1 GiB/sec for the complete chip. And memory bandwidth looks a lot better: Sun claims 60+ GiB/sec, leaving almost 8 GiB/sec per core or 1 GiB/sec per thread. If my calculations for Intel are correct, a Quad Xeon with a 1.33 GHz FSB has around 21 GiB/sec throughput to main memory, giving 5 GiB/sec per core. (Why are these numbers so hard to find? It looks like Intel deliberately obfuscates them with FSB MHz or giga-transactions per sec and the like.) Regards Markus
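The arithmetic behind those figures can be checked quickly; assumptions hedged in the comments (a PCIe 1.x lane carrying 250 MB/s per direction, an 8-byte-wide FSB, and two front-side buses on the quad Xeon chipset):

```python
# Back-of-the-envelope check of the per-core / per-thread figures above.
# Assumptions: PCIe 1.x lane = 250 MB/s per direction; FSB 8 bytes wide;
# quad Xeon chipset with two 1.33 GT/s front-side buses.

# Sun UltraSPARC T2: 8 cores x 8 threads, 8 PCIe lanes
t2_io_per_core = 8 * 250 / 8            # 250.0 MB/s per core
t2_io_per_thread = t2_io_per_core / 8   # 31.25 MB/s per thread (~32)
t2_mem_per_core = 60 / 8                # 7.5 GiB/s per core (60 GiB/s claimed)
t2_mem_per_thread = t2_mem_per_core / 8 # ~0.94 GiB/s per thread (~1)

# Quad-core Xeon: two buses x 1.33 GT/s x 8 bytes per transfer
xeon_total_gbps = 2 * 1.33 * 8          # ~21.3 GB/s to main memory
xeon_per_core = xeon_total_gbps / 4     # ~5.3 GB/s per core

print(t2_io_per_core, t2_io_per_thread)               # 250.0 31.25
print(t2_mem_per_core, round(t2_mem_per_thread, 2))   # 7.5 0.94
```

Under those assumptions the numbers quoted in the mail (250 MiB/s per core, ~32 MiB/s per thread, ~21 GiB/s and ~5 GiB/s for the Xeon) all check out.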
Re: [HACKERS] VLDB Features
Hi, Josh Berkus wrote: Here's the other VLDB features we're missing: Parallel Query Uh.. this only makes sense in a distributed database, no? I've thought about parallel querying on top of Postgres-R. Does it make sense implementing some form of parallel querying apart from the distribution or replication engine? Windowing Functions Isn't Gavin Sherry working on this? Haven't read anything from him lately... Parallel Index Build (not sure how this works exactly, but it speeds Oracle up considerably) Sounds interesting *turns-away-to-google* On-disk Bitmap Index (anyone game to finish GP patch?) Anybody having an idea of what's missing there (besides good use cases, which some people doubt)? Again: Gavin? Simon, we should start a VLDB-Postgres developer wiki page. Thanks, Simon, wiki page looks good! Regards Markus
Re: [HACKERS] VLDB Features
Hi Josh, Josh Berkus wrote: Sure. Imagine you have a 5TB database on a machine with 8 cores and only one concurrent user. You'd like to have 1 core doing I/O, and say 4-5 cores dividing the scan and join processing into 4-5 chunks. Ah, right, thanks for the enlightenment. Heck, I'm definitely too focused on replication and distributed databases :-) However, there's certainly a great deal of an intersection between parallel processing on different machines and parallel processing on multiple CPUs - especially considering NUMA architecture. *comes-to-think-again*... Isn't Gavin Sherry working on this? Haven't read anything from him lately... Me neither. Swallowed by Greenplum and France. Hm.. good for him, I guess! Regards Markus
Re: [HACKERS] Ordered Append Node
Hello Gregory, Gregory Stark wrote: It is kind of like a merge join but not quite. It's interleaving rows rather than matching them up. It's more like the final merge of a sort which also uses a heap to efficiently find the next value from the source tapes. Well, maybe my point here is: why do you need the heap to sort? The inputs you've got are already sorted, you only need to merge them. To me this sounds very much like the final stage of a merge sort, which would not require much memory. IMO, a merge sort could be implemented more easily by a binary tree of zipper nodes, as I had in mind, leading to a plan like this (well, hey, this is all made up):

Zipper (cost..., sort by public.t.i)
  -> Zipper (cost .., sort by public.t.i)
       -> Zipper (cost .., sort by public.t.i)
            -> Index Scan using ti1 on t1
            -> Index Scan using ti2 on t2
       -> Index Scan using ti3 on t3
  -> Zipper (cost .., sort by public.t.i)
       -> Index Scan using ti4 on t4
       -> Index Scan using ti5 on t5

Every zipper node would simply have to keep the two top tuples from its inputs in memory, compare them and return the best. But maybe that's exactly how Knuth's polyphase merge sort internally also merges its inputs (or runs). And perhaps it makes sense to show the user just one simple append node instead of throwing a tree of Zipper nodes at her. ;-) Not necessarily but it is something Postgres supports and I don't think we want to break it. Actually it's useful for partitioned tables if you build the new partition in a separate table and then add it to the partitioned table. In that case you may have gone through several steps of adding columns and dropping them to get the structure to line up. Agreed, especially because lining up the columns isn't that hard after all. OTOH I think Postgres is way too flexible in how it allows partitioning to be done and thus it often can't optimize it properly. I'd very much like to teach it a stricter and simpler to use partitioning scheme than what we have with constraint exclusion.
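Such a zipper node really only needs the two head tuples; in Python generator form it could look like this (names and plan shape are made up for illustration, this is not the actual executor):

```python
def zipper(left, right, key=lambda t: t):
    """Merge two already-sorted inputs, holding only the two head
    tuples in memory -- a toy model of the 'Zipper' node above."""
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if key(l) <= key(r):
            yield l
            l = next(left, None)
        else:
            yield r
            r = next(right, None)
    while l is not None:          # drain whichever input remains
        yield l
        l = next(left, None)
    while r is not None:
        yield r
        r = next(right, None)

# The made-up plan above, as a tree of zipper nodes over five sorted scans:
t1, t2, t3, t4, t5 = [1, 5], [2, 6], [3, 4], [0, 7], [8]
plan = zipper(zipper(zipper(t1, t2), t3), zipper(t4, t5))
print(list(plan))   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Each node performs one comparison per emitted tuple, so the whole tree costs about log2(n) comparisons per tuple for n inputs, matching what a heap-based merge would do.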
But that's pipe dreaming, and your improvement to the append node is certainly a good step in the right direction, keep up the good work! Regards Markus
Re: [HACKERS] Ordered Append Node
Hi, Florian Weimer wrote: Florian Weimer wrote: I think you need it because there are potentially many input types. Eh, tapes. Aha.. Given the partitioning case, I'd expect all rows to have an equal tuple descriptor. Maybe this is a matter of what to optimize, then? Could you elaborate on what use case you have in mind? You need a priority queue to figure out from which tape (partition) you need to remove the next tuple. And why do you need lots of heap memory to do that? Anything wrong with the zipper approach I've outlined upthread? Am I missing something? Regards Markus
Re: [HACKERS] Ordered Append Node
Hi, Florian Weimer wrote: I think you need it because there are potentially many input types. Given the partitioning case, I'd expect all rows to have an equal tuple descriptor. Maybe this is a matter of what to optimize, then? Could you elaborate on what use case you have in mind? Regards Markus
Re: [HACKERS] Ordered Append Node
Gregory Stark wrote: Not quite the same since the Executor-based implementation would have a static tree structure based on the partitions. Even if the partitions are all empty except for one or two you would still have to push the result records through all the nodes for the empty partitions. A heap only has the next record from each input. If an input has reached eof then the heap doesn't have an entry for that input. So if one of the inputs is empty (often the parent of the inheritance tree) it doesn't require a test anywhere to propagate the record up past it. Ah, so the binary tree (binary heap?) gets adjusted dynamically. Very clever! (OTOH probably a micro optimization, but as code is already there, use it, yeah!) I also did an optimization similar to the bounded-sort case where we check if the next tuple from the same input which last contributed the result record comes before the top element of the heap. That avoids having to do an insert and siftup only to pull out the same record you just inserted. In theory this is not an optimization but in practice I think partitioned tables will often contain contiguous blocks of key values and queries will often be joining against that key and therefore often want to order by it. Hm... that assumes range partitioning, no? If you partition among three partitions by id modulo 3, tuples are most probably coming from one partition after the other, i.e.: 1 2 3 1 2 3 1 2 3 ... And with hash partitioning, you're completely unable to predict the ordering. Ideally we would also be able to do this in the planner. If the planner could determine from the constraints that all the key values from each partition are non-overlapping and order them properly then it could generate a regular append node with a path order without the overhead of the run-time comparisons. Agreed. 
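The dynamically shrinking heap plus the last-winner fast path described above could be sketched roughly like this (a sketch only; the real code lives in the executor and differs in detail):

```python
import heapq

def ordered_append(inputs, key=lambda t: t):
    """Merge sorted inputs via a binary heap.  Exhausted inputs simply
    drop out of the heap, and when the next tuple from the input that
    just won still beats both children of the heap root, it is placed
    on top directly, skipping the siftup (the optimization sketched in
    the mail above)."""
    iters = [iter(it) for it in inputs]
    heap = []
    for idx, it in enumerate(iters):
        t = next(it, None)
        if t is not None:
            heap.append((key(t), idx, t))
    heapq.heapify(heap)
    while heap:
        k, idx, t = heap[0]
        yield t
        nxt = next(iters[idx], None)
        if nxt is None:
            heapq.heappop(heap)                  # input exhausted: drops out
            continue
        nk = key(nxt)
        # In a binary min-heap the runner-up is one of the root's children.
        children = heap[1:3]
        if not children or (nk, idx) <= min(c[:2] for c in children):
            heap[0] = (nk, idx, nxt)             # fast path: no siftup needed
        else:
            heapq.heapreplace(heap, (nk, idx, nxt))

print(list(ordered_append([[1, 4, 7], [2, 5, 8], [3, 6, 9]])))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With contiguous key ranges per partition (the range-partitioned case discussed above), the fast path fires on almost every tuple; with modulo or hash partitioning, as noted, the inputs interleave and the full heapreplace dominates.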
But that requires a) dealing with the problem of the parent table which has no constraints and b) an efficient way to prove that constraints don't overlap and order them properly. The latter looks like an O(n^2) problem to me, though it's a planning problem which might be worth making slow in exchange for even a small speedup at run-time. Well, I think someday, Postgres needs better support for horizontal data partitioning in general. Dealing with constraints and inheritance is way too flexible and prone to error. I'll shortly start a new thread about that, to outline my current thoughts about that topic. Regards Markus
Re: [HACKERS] Ordered Append Node
Hi, Heikki Linnakangas wrote: AFAICT it's roughly the same data structure as the zipper tree you envisioned, but not implemented with separate executor nodes for each level. Aha, that sounds good to me. ;-) As I've already mentioned, I think it's even better to simply show the user a single append node, instead of lots of small nodes from a binary tree merger. Regards Markus
Re: [HACKERS] Ordered Append Node
Hello Gregory, Gregory Stark wrote: I've been hacking on the idea of an Append node which maintains the ordering of its subtables merging their records in order. This is important for partitioned tables since otherwise a lot of plans are not available such as merge joins. Cool! Some time ago, I've been trying something very similar, but didn't get very far. I'd like to share my thoughts anyway. First off, I envisioned that append node to be applicable also to unordered reading from multiple partitions, simply interleaving between the partitions. The logic I've followed is to do as follows: 1) Go through all subrels asking for any interesting pathkey lists. Gather up the union of all of these. I also tried to modify the Append node first, then figured that it might be better to base on the merge join node instead. While this seems farther away, I had the hope that a binary tree of such 'plain merge' nodes would require fewer comparisons in total. Plus, I found it a lot simpler to cope with exactly two input relations instead of n, as with the append node. :-) 2) Go back through all the subrels and for each accumulated pathkey list ask for the best path for that subrel for that pathkey list. 3) Generate two append paths for each of these pathkey lists, one using the best startup_cost subpath and one with the best total_cost subpath available for the pathkey list for each child rel. If there is none available take the best unordered path and put a sort node above it. 4) Additionally advertise the traditional unordered append node which our parent could choose to put a sort node above same as ever. 5) Append plan nodes look a lot like Sort plan nodes glued onto an Append plan node, with sort function oids and so on. 6) Append exec state nodes look a lot like (a few fields from) a tuplesortstate glued onto an Append node. They have the ScanKey array and the fmgr sort functions. They also have an array of TupleTableSlot and a heap of indexes into that array.
8) ExecAppend does something similar to tuplesort's bounded sort (I fear I'm going to get typecasted) or more to the point, similar to the final merge of a merge sort. It directly returns the slot the child rel last returned to it. Uh.. well, yeah. I guess you have a better understanding of the planner and executor that I do. Open questions: 1) I still haven't completely figured out what to do with equivalence classes. My hack of just stuffing all the append subrel vars into there seems to work fine. I need to understand what's going on to see if there's really a problem with it or not. 2) I'm not sure this code will work when the append rel is a target (ie UPDATE and DELETE stmts). 3) It does seem to work when the columns in the subrels don't line up but I didn't do anything special to handle this case. Uh.. is there any use case for that? WRT partitioning certainly not, is there? I'll have a look at your patch. Regards Markus
Re: [HACKERS] Raw device I/O for large objects
Hi, Georgi Chulkov wrote: Please allow me to ask then: 1. In your opinion, would the above scenario indeed benefit from a raw-device interface for large objects? No, because file systems also try to do what you outline above. They certainly don't split sequential data up into blocks and distribute them randomly over the device, at least not without having a pretty good reason to do so (with which you'd also have to fight). The possible gain achievable is pretty minimal, especially in conjunction with a (hopefully battery backed) write cache. 2. How feasible is it to decouple general table storage from large object storage? I think that would be the easiest part. I would go for a pluggable storage implementation, selectable per tablespace. But then again, I wouldn't do it at all. After all, this is what MySQL is doing. And we certainly don't want to repeat their mistakes! Or do you know anybody who goes like: "Yippee, multiple storage engines to choose from for my (un)valuable data, let's put some here and others there"? Let's optimize the *one* storage engine we have and try to make that work well together with the various filesystems it uses. Because filesystems are already very good at what they are used for. (And we are glad we can use a filesystem and don't need to implement one ourselves). Regards Markus
Re: [HACKERS] [COMMITTERS] pgsql: Fix up text concatenation so that it accepts all the reasonable
Hi, Gregory Stark wrote: Perhaps if you're doing some form of replication between different architectures you might want to use binary representation for your transfers. Or if you're doing something in a PL language like compressing or bundling up multiple data in a container format or something. I'm not quite sure I understand where you're coming from, but isn't that what the send and recv functions for most data types are for? Don't those provide a cross-platform compatible binary representation? Regards Markus
Re: [HACKERS] [COMMITTERS] pgsql: Fix up text concatenation so that it accepts all the reasonable
Hello Tom, Tom Lane wrote: I think you'd be nuts to bet your data on the binary representations really being cross-platform compatible. Can you elaborate on this? AFAICT the send/recv functions use network byte ordering. What are the other problems between different architectures? There might be some excuse for doing this within a single architecture, but I can't get very excited about it ... Is the textual representation (i.e. OidOutputFunctionCall) more cross-platform compatible? Gregory Stark wrote: Well they're not very useful for their intended purpose of client-server communication if they're not. Agreed. Up until now, I'd have considered it a bug if send/recv on different platforms did not lead to the very same result. Regards Markus
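To illustrate what network byte ordering buys here, and where the caution about floats comes in, a quick sketch in Python (this demonstrates the general convention, not PostgreSQL's actual send/recv code):

```python
import struct

# int4 in network (big-endian) byte order: identical bytes on any platform.
val = 0x12345678
wire = struct.pack('!i', val)            # '!' selects network byte order
print(wire.hex())                        # 12345678
assert struct.unpack('!i', wire)[0] == val

# Floats are where the caveat bites: '!d' fixes the byte order, but that
# only helps if both ends represent doubles as IEEE 754 to begin with.
wire_f = struct.pack('!d', 1.5)
assert struct.unpack('!d', wire_f)[0] == 1.5
```

Byte order is thus the easy half of the problem; representation (float formats, alignment inside composite values) is the part that can differ between architectures even with consistent ordering.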
Re: [HACKERS] [COMMITTERS] pgsql: Fix up text concatenation so that it accepts all the reasonable
Hi, Tom Lane wrote: Well, you're probably fine with integers and text, but beyond that it gets a bit more dicey. I wouldn't want to assume that floats are compatible across any random pair of architectures, and in the more complex datatypes (such as arrays or geometric types) there might be some impact from alignment rules. (Or not, I'm too lazy to go look at the moment.) Okay, thank you for the hints, I'll go figure out. Regards Markus
Re: [HACKERS] terms for database replication: synchronous vs eager
Hello Jan, thank you for your feedback. Jan Wieck wrote: On 9/7/2007 11:01 AM, Markus Schiltknecht wrote: This violates the common understanding of synchrony, because you can't commit on a node A and then query another node B and expect it to be coherent immediately. That's right. And there is no guarantee about the lag at all. So you can find old data on node B long after you committed a change to node A. I'm in doubt about the "long after". In practice you'll mostly have nodes which perform about equally fast. And as the origin node has to do more processing than a node which solely replays a transaction, it's trivial to balance the load. Additionally, a node which lags behind is unable to commit any (conflicting) local transactions before having caught up (due to the GCS total ordering). So this is even somewhat self regulating. Postgres-R is an asynchronous replication system by all means. It only makes sure that the workset data (that's what Postgres-R calls the replication log for one transaction) It's most often referred to as the writeset. has been received by a group communication system supporting total order and that the group communication system decided it to be the transaction that (logically) happened before any possibly conflicting concurrent transaction. Correct. That's as far as the Postgres-R algorithm goes. I should have been more precise on what I'm talking about, as I'm continuing to develop Postgres-R (the software). That might be another area where a new name should be introduced, to differentiate between Postgres-R, the original algorithm, and my continuing work on the software implementing the algorithm. This is the wonderful idea how Postgres-R will have a failsafe conflict resolution mechanism in an asynchronous system. I don't know what you associate with the word eager. I'm speaking of the property that a transaction is replicated before commit, so as to avoid later conflicts.
IMO, this is the only real requirement people have when requesting synchronous replication: most people don't need synchrony, but they need reliable commit guarantees. I've noticed that you are simply speaking of a failsafe conflict resolution mechanism. I dislike that description, because it does not say anything about *when* the conflict resolution happens WRT commit. And there may well be lazy failsafe conflict resolution mechanisms (i.e. for a counter), which reconcile after commit. I'd like to have a simple term, so that we could say: you probably don't need fully synchronous replication, but eager replication may already serve you well. All I see is that Postgres-R makes sure that some other process, which might still reside on the same hardware as the DB, is now in charge of delivery. ..and Postgres-R waits until that other process confirms the delivery, whatever exactly that means. See below. This delay before commit is important. It is what makes Postgres-R eager, according to my definition of it. I'm open for better terms. Nobody said that the GC implementation cannot have made the decision about the total order of two workset messages and already reported that to the local client application before those messages ever got transmitted over the wire. While this is certainly true in theory, it does not make sense in practice. It would mean letting the GCS decide on a message ordering without having delivered the messages to be ordered. That would be troublesome for the GCS, because it could lose an already ordered message. Most GCSs start their ordering algorithm by sending out the message to be ordered. Anyway, as I've described on -hackers before, I'm intending to decouple replication from log writing. Thus not requiring the GCS to provide any delivery guarantees at all (GCSs are complicated enough already!). That would allow the user to decouple transaction processing nodes from log writing nodes. Those tasks have different I/O requirements anyway.
And what would more than two or three replicas of the transaction logs be good for anyway? Think of them as an efficient backup - you won't need it until your complete cluster goes down. Regards Markus
Re: [HACKERS] terms for database replication: synchronous vs eager
Hi, Chris Browne wrote: The approach that was going to be taken, in Slony-II, to apply locks as early as possible so as to find conflicts as soon as possible, rather than waiting, seems eager to me. Agreed. WRT locking, one might also call it pessimistic, but that sounds so... negative. I find the as soon as possible bit rather weak, instead it's exactly before the origin node confirms commit. Of course only conflicts which could possibly lead to an abort of the transaction in question are taken into account. A possible definition may be: Eager replication systems only confirm the commit of a transaction after they have checked for cross-node conflicts which could require the transaction to abort. (While lazy systems may confirm the commit before.) Note how much less restrictive that definition is than that of a fully synchronous system. But I'm not sure to what extent that notion has been drawn into the Postgres-R work... My current variant of Postgres-R goes the very same path, using MVCC instead of locking wherever possible (with the very same effect, but allowing more concurrency :-) ). Regards Markus
Re: [HACKERS] WIP patch for latestCompletedXid method of computing snapshot xmax
Hello Tom, Tom Lane wrote: So on the strength of that, I'm going to go ahead and commit the patch, but I'd be interested to see benchmarks from people with access to better hardware. I've just completed two dbt2 test runs on a mid-level system, with 4GB RAM and a 7 disk SATA RAID 1+0 w/ BBU. Once with code as of 2007/09/05 18:00 (which is *before* the first lazy xid commit) and once with cvs HEAD (2007/09/08) + latestCompletedXid.patch. Here are the results from the first test run (test run 33, without lazy xid):

$ cat 33/driver/results.out
                       Response Time (s)
 Transaction      %    Average :   90th     Total  Rollbacks      %
 ------------  -----  ----------------  --------  ---------  -----
 Delivery       3.97    3.745 :  7.844     11844          0   0.00
 New Order     45.35    3.844 :  7.692    135192       1352   1.01
 Order Status   3.95    2.728 :  6.371     11764          0   0.00
 Payment       42.74    2.649 :  6.349    127415          0   0.00
 Stock Level    4.00    2.172 :  5.634     11915          0   0.00

1103.45 new-order transactions per minute (NOTPM)
120.1 minute duration
0 total unknown errors
1003 second(s) ramping up

And that's with HEAD + latestCompletedXid.patch (test run 34):

$ cat 34/driver/results.out
                       Response Time (s)
 Transaction      %    Average :   90th     Total  Rollbacks      %
 ------------  -----  ----------------  --------  ---------  -----
 Delivery       3.96    3.843 :  8.223     11760          0   0.00
 New Order     45.28    4.049 :  8.451    134398       1300   0.98
 Order Status   3.97    2.877 :  6.815     11777          0   0.00
 Payment       42.80    2.745 :  6.718    127027          0   0.00
 Stock Level    4.00    2.280 :  6.129     11859          0   0.00

1097.71 new-order transactions per minute (NOTPM)
120.1 minute duration
0 total unknown errors
1003 second(s) ramping up

Both tests ran for two hours, had 100 warehouses and 50 connections. shared_buffers was set to 1024MB, effective_cache_size = 3800MB, all other settings were standard. Regards Markus
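Incidentally, the two headline numbers are within noise of each other; a quick check:

```python
# Relative difference between the two NOTPM results quoted above.
before, after = 1103.45, 1097.71   # test runs 33 and 34
delta_pct = (after - before) / before * 100
print(round(delta_pct, 2))         # -0.52 -- well within run-to-run noise
```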
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Hi, apoc9009 wrote: Thadt is Replication NOT Backup I've now read all of your messages in this thread, but I simply fail to understand why you are that much opposed to the term 'replication'. I think the only thing which comes any close to what you're looking for is replication (in particular eager multi-master replication). I'd recommend you familiarize yourself with the world of database replication. You already know the important chapter from our manual, learn that by heart. Then read [1] and [2]. :-) Regards Markus [1]: Postgres advocacy wiki: http://developer.postgresql.org/index.php/Replication%2C_Clustering%2C_and_Connection_Pooling [2]: Terms and Definitions of Database Replication http://www.postgres-r.org/documentation/terms
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Hi, apoc9009 wrote: Translation for you: A Backup is a File or Set of Files thadt contains the Data of your Business critical Informations. It should not be Archived on the same place, the same House or the same Room. I disagree, a backup does not necessarily have to be a single file or a set of files. Wikipedia has this definition: backup refers to making copies of data so that these additional copies may be used to restore the original after a data loss event. While for replica, it states: a replica is a copy that is relatively indistinguishable from the original. Thus a backup can very well be thought of as a replica, and vice versa. A Replication Database has nothing to do with a Backup, it works only for Failover if the Primary Database has a Mailfunction. That's certainly plain wrong, see multi-master replication, where failover doesn't make any sense. Wikipedia again (although that's unfair, as I've contributed to that definition myself) [1]: Replication is the process of sharing information so as to ensure consistency between redundant resources ... for example a master database and a backup. Keep in Mind: Backup is NOT Replication! Write it down 100 times and maybe you understand A backup IS a replica. A backup IS a replica. A backup IS a replica. A backup IS a replica... Regards Markus [1]: http://en.wikipedia.org/wiki/Replication_%28computer_science%29 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[HACKERS] terms for database replication: synchronous vs eager
Hi, I'm asking for advice and hints regarding terms in database replication, especially WRT Postgres-R. (Sorry for crossposting, but I fear not reaching enough people on the Postgres-R ML alone.) I'm struggling with how to classify the Postgres-R algorithm. Up until recently, most people thought of it as synchronous replication, but it's not synchronous in the strong (and very common) sense. I.e. after a node confirms to have committed a transaction, other nodes haven't necessarily committed already. (They only promise that they *will* commit without conflicts.) This violates the common understanding of synchrony, because you can't commit on a node A and then query another node B and expect it to be coherent immediately. Nonetheless, Postgres-R is eager (or pessimistic?) in the sense that it replicates *before* committing, so as to avoid divergence. In [1] I've tried to make that distinction clear, and I'm currently advocating for using synchronous only in the very strong (and commonly used) sense. I've chosen the word 'eager' to mean 'replicates before committing'. By those definitions, Postgres-R is async but eager. Do these definitions violate any common meaning? Maybe in other areas like distributed storage or lock managers? Regards Markus [1]: Terms and Definitions of Database Replication http://www.postgres-r.org/documentation/terms ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] PG-MQ?
Hi Chris, Chris Browne wrote: I'm seeing some applications where it appears that there would be value in introducing asynchronous messaging, ala message queueing. http://en.wikipedia.org/wiki/Message_queue ISTM that 'message queue' is way too general a term. There are hundreds of different queues at different levels on a standard server. So I'm somewhat unsure about what problem you want to solve. c) There are lesser names, like isectd http://isectd.sf.net and the (infamous?) Spread Toolkit which both implement memory-based messaging systems. If a GCS is what you're looking for, then you also might want to consider these: ensemble, appia or jGroups. There's a Java layer called jGCS, which supports even more, similar systems. Another commonly used term is 'reliable multicast', which guarantees that messages are delivered to a group of recipients. These algorithms often form the basis for a GCS. My bias would be to have something that can basically run as a thin set of stored procedures atop PostgreSQL :-). It would be trivial to extend that to support SOAP/XML-RPC, if desired. Hm.. in Postgres-R I currently have (partial) support for ensemble and spread. Exporting that interface via stored procedures could be done, but you would probably need a manager process, as you certainly want your connections to persist across transactions (or not?). Together with that process, we already have half of what Postgres-R is: an additional process which connects to the GCS. Thus I question whether there's value in exporting the interface. Can you think of other use cases than database replication? Why do you want to do that via the database, then, and not directly with the GCS? It would be nice to achieve 'higher availability' by having queues where you might replicate the contents (probably using the MQ system itself ;-)) to other servers. Uhm.. sorry, but I fail to see the big news here. Which replication solution does *not* work that way? 
Regards Markus ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] COPYable logs status
Hi, Tom Lane wrote: We *have* a log-writing process. The problem is in getting the data to it. Remember the imessages approach I'm using for Postgres-R? It passes messages around using shared memory and signals the receiver on incoming data. It's not perfect, sure, but it's a general solution to a common problem. Maybe it's worth a thought, instead of fiddling with signals, special shmem areas and possible races every time the 'getting data to another process' problem comes up? Regards Markus ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] COPYable logs status
Tom Lane wrote: Markus Schiltknecht [EMAIL PROTECTED] writes: Tom Lane wrote: We *have* a log-writing process. The problem is in getting the data to it. Remember the imessages approach I'm using for Postgres-R? It passes messages around using shared memory and signals the receiver on incoming data. It's not perfect, sure, but it's a general solution to a common problem. Uh-huh. And how will you get libc's dynamic-link code to buy into issuing link error messages this way? Not to mention every other bit of code that might get linked into the backend? I was referring to the 'getting data to another process' problem. If that's the problem (as you said upthread), imessages might be a solution. Trapping what comes out of stderr is simply too useful a behavior to lose. Sure. I've never said anything against that. Regards Markus ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Hacking on PostgreSQL via GIT
Hi, Alvaro Herrera wrote: Which is not always what happens in reality. Consider for example that we borrowed some files from NetBSD, OpenBSD, Tcl, zic and others. It would be nice to know exactly at what point we borrowed the file, so we can go to the upstream repo and check if there's any bug fix that we should also apply to our local copy. And we _also_ modify the file locally of course, so just digesting the file we have to get a SHA1 (or whatever) identifier is not an option. I consider such information (i.e. 'where does this file come from') to be historical information. As such, this information clearly belongs to the VCS sphere and should be tracked and presented by the VCS. Advanced VCSes can import files from other projects and properly track those files or propagate on request. Even subversion can do that to some extent. My point here is: given a decent VCS, you don't need such historical information as often as you do with CVS. You can sit back and let the VCS do the job. (Like looking up when the last 'import' of the file from the external project happened and what changed, and merging those changes back into your (locally modified variant of the) file.) And if you really want to dig in the history of your project, you can ask the VCS, which you are going to need anyway for other historic information. Regards Markus ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Hacking on PostgreSQL via GIT
Hi, Jim C. Nasby wrote: I understand the argument about metadata and all, and largely agree with it. But on the other hand I think a version identifier is a critical piece of information; it's just as critical as the file name when it comes to identifying the information contained in the file. If you really want the files in your releases to carry a version identifier, you should let your release process handle that. But often enough, people can't even tell the exact PostgreSQL version they are running. How do you expect them to be able to tell you what version a single file has? For the developers: they have all the history the VCS offers them. There are tags to associate a release with a revision in your repository. And because a decent VCS can handle all the diff'ing, patching and merging you normally need, you shouldn't ever have to process files outside of your repository. So what exactly is the purpose of a version identifier within the file's contents? Whom would such a thing be good for? Regards Markus ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Auto Partitioning
Simon Riggs wrote: i.e. if we have partitions for each year (2001, 2002, 2003, 2004, 2005, 2006, 2007) AND we have already proved that 2005 is excluded when we have a WHERE clause saying year >= 2006, then we should be able to use the ordering to prove that partitions for 2004 and before are also automatically excluded. Provided you've set up the right constraints, the current constraint_exclusion feature does exactly that, no? I'll think some more about the Merge node, but not right now. I've looked at nodeAppend.c and nodeMergeJoin.c. Probably we can use much of nodeMergeJoin, just without the join, instead returning the tuples as they are, but in the correct order? The nodeMergeJoin code can only handle two inputs (a left and a right node), but it might be beneficial to structure multiple merge nodes into a binary tree layout anyway. (I'm guessing that might reduce the number of comparisons needed.) What do you think? Regards Markus ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
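The ordering argument above can be sketched outside the planner: if range partitions are kept in sorted order, a single binary search excludes every earlier partition at once, instead of testing each partition's constraint separately. A toy model, not PostgreSQL internals; the names here are made up for illustration:

```python
from bisect import bisect_left

# Toy model of ordered partition exclusion: one partition per year, kept sorted.
partition_years = [2001, 2002, 2003, 2004, 2005, 2006, 2007]

def surviving_partitions(min_year):
    """Partitions that can hold rows matching 'year >= min_year': one binary
    search proves that 2004 and before are excluded, without testing each."""
    return partition_years[bisect_left(partition_years, min_year):]

print(surviving_partitions(2006))  # [2006, 2007]
```

The point is the complexity: one O(log n) probe over the ordered partition list, versus evaluating n check constraints as constraint exclusion currently does.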
Re: [HACKERS] Auto Partitioning
Hi, Zeugswetter Andreas ADI SD wrote: CREATE INDEX x ON test(a, b, c); isn't the same as CREATE INDEX x ON test(c, b, a); That is only a problem if you also want to avoid a sort (e.g. for an order by), ...or if you want to use that index for 'WHERE a = 5'. The first one is probably helping you, the second isn't. (an example would be a query where c=5 and b between 0 and 20 and two partitions, one for 0 <= b < 10 and a second for 10 <= b) Hm.. in that case, an index on (a, b, c) wouldn't help. An index on (c, b, a) would be just perfect, agreed? Now, for the partitioning: you simply have to scan two partitions in that case, no matter how you arrange your indexes. And this is where we need some sort of multi-table index scan functionality. (I'm not saying a multi-table index. Such a thing would be too large on disk. That functionality should probably better be realized by using the underlying per-table indexes.) That's why I'd say the first columns of an index would have to be equal to all of the columns used in the partitioning key. I correct my own statement somewhat, here: only in that case can a single-table index satisfy your request. For other cases, you'd have to query more than one partition's indexes and mix them correctly to maintain the right order, if required. No. It may change performance in some situations, but it is not needed for unique constraints. Agreed, for unique constraints. But indexes are used for some more things than just unique constraint checking. ;-) Regards Markus ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Auto Partitioning
Hi, Martijn van Oosterhout wrote: The executor would have to be clever enough to not do a single index scan, but possibly scan through multiple indexes when asking for uniqueness, depending on the partitioning rule set. But it's not the executor that checks uniqueness, it's built into the btree code. Well, it's the executor calling into the btree code. Couldn't the executor choose which (btree) indexes to query for uniqueness? If someone manages to crack uniqueness for GiST indexes, we'll have our answer, since it has exactly the same problem but on a different scale. (Or vice-versa, if someone gets uniqueness for multiple indexes, we can do it for GiST also). Uh.. can you elaborate on that? AFAICS, you would simply have to query multiple btree indexes and make sure none of them is violated. How would that help making unique GiST indexes possible? What's the problem there? Regards Markus ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Auto Partitioning
Hi, NikhilS wrote: The following things are TODOs: iv) Auto generate rules using the checks mentioned for the partitions, to handle INSERTs/DELETEs/UPDATEs to navigate them to the appropriate child. Note that checks specified directly on the master table will get inherited automatically. Am planning to do the above by using the check constraint specified for each partition. This constraint's raw_expr field ends up becoming the whereClause for the rule specific to that partition. I appreciate your efforts, but I'm not sure if this has been discussed enough. There seem to be two ideas floating around: - you are heading for automating the current kludge, which involves creating partitions and constraints by hand. AFAICT, you want to support list and range partitioning. - Simon Riggs has proposed partitioning functions, which could easily handle any type of partitioning (hash, list, range and any mix of those). Both proposals do not have much to do with the missing multi-table indices. It's clear to me that we have to implement those someday, anyway. AFAICT, the first proposal does not ease the task of writing correct constraints, so that we are sure that each row ends up in exactly one partition. The second would. But the second proposal makes it hard for the planner to choose the right partitions, i.e. if you request a range of ids, the planner would have to query the partitioning function for every possible value. The first variant could use constraint exclusion for that. Neither of the two has gone as far as thinking about switching from one partitioning rule set to another. That gets especially hard if you consider database restarts during re-partitioning. Here are some thoughts I have come up with recently. This is all about how to partition and not about how to implement multi-table indices. Sorry if this got somewhat longish. And no, this is certainly not for 8.3 ;-) I don't like partitioning rules which leave open questions, i.e. 
when there are values for which the system does not have an answer (and would have to fall back to a default) or, even worse, where it could give multiple correct answers. Given that premise, I see only two basic partitioning types:

- splits: these can be used for what's commonly known as list and range partitioning. If you want customers A-M to end up on partition 1 and customers N-Z on partition 2, you would split between M and N. (That way, the system would still know what to do with a customer name beginning with an @ sign, for example. The only requirement for a split is that the underlying data type supports comparison operators.)

- modulo: I think this is commonly known as hash partitioning. It requires an integer input, possibly by hashing, and calculates the remainder of a division by n. That should give an equal distribution among n partitions.

Besides the expression to work on, a split always needs one argument, the split point, and divides into two buckets. A modulo splits into two or more buckets and needs the divisor as an argument. Of course, these two types can be combined. I like to think of these combinations as trees. Let me give you a simple example:

        table customers
              |
      split @ name = 'N'
           /    \
       part1    part2

A combination of the two would look like:

          table invoices
                |
          split @ id = 5
            /         \
    hash(id) modulo 3   part4
       /    |    \
    part1 part2 part3

Knowledge of these trees would allow the planner to choose more wisely, i.e. given a comparative condition (WHERE id > 10) it could check the splits in the partitioning tree and only scan the partitions necessary. Likewise with an equality condition (WHERE id = 1234). As it's a better definition of the partitioning rules, the planner would not have to check the constraints of all partitions, as the current constraint exclusion feature does. It might even be likely that querying this partitioning tree and then scanning the single-table index will be faster than an index scan on a multi-table index. 
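The split/modulo trees described above can be sketched as a tiny routing structure. This is only an illustration of the idea, not a proposed implementation; the class names and example values are made up:

```python
class Split:
    """Route rows below the split point to one subtree, the rest to the other."""
    def __init__(self, key, point, below, above):
        self.key, self.point, self.below, self.above = key, point, below, above

    def route(self, row):
        child = self.below if row[self.key] < self.point else self.above
        return child.route(row) if isinstance(child, (Split, Modulo)) else child

class Modulo:
    """Route rows to one of n buckets by the remainder of an integer column."""
    def __init__(self, key, parts):
        self.key, self.parts = key, parts

    def route(self, row):
        child = self.parts[row[self.key] % len(self.parts)]
        return child.route(row) if isinstance(child, (Split, Modulo)) else child

# The 'table invoices' example: split @ id = 5, with hash(id) modulo 3 below it.
tree = Split("id", 5, Modulo("id", ["part1", "part2", "part3"]), "part4")
print(tree.route({"id": 4}))  # part2 (4 < 5, then 4 % 3 == 1)
print(tree.route({"id": 7}))  # part4 (7 >= 5)
```

Note how every row lands in exactly one partition, which is precisely the "no open questions" premise: a split only needs comparison operators, a modulo only an integer input.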
At least, I cannot see why it should be any slower. Such partitioning rule sets would allow us to re-partition by adding a split node on top of the tree. The split point would have to increment together with the progress of moving around the rows among the partitions, so that the database would always be in a consistent state regarding partitioning. Additionally, it's easy to figure out, when no or only few moving
Re: [HACKERS] Auto Partitioning
Hi, NikhilS wrote: Our current partitioning solution is based on inheritance. With that in mind, for 8.3 I thought an implementation based on auto rules creation would be the way to go. That's completely reasonable. And as I've said, it's probably even a step towards what I've outlined (automation of creation of partitions). Regards Markus ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Auto Partitioning
Hi, Simon Riggs wrote: I agree with much of your post, though this particular point caught my eye. If you'll forgive me for jumping on an isolated point in your post: No problem. Multi-table indexes sound like a good solution until you consider how big they would be. The reason we need a multi-table index is because we are using partitioning, which we wouldn't be doing unless the data was fairly large. So the index is going to be (Num partitions * fairly-large) in size, which means it's absolutely enormous. Adding and dropping partitions also becomes a management nightmare, so overall multi-table indexes look unusable to me. Multi-table indexes also remove the possibility of loading data quickly, then building an index on the data, then adding the table as a partition - both the COPY and the CREATE INDEX would be slower with a pre-existing multi-table index. I agree. (And thanks to TOAST, we never have very wide tables with relatively few rows, right? I mean, something like pictures stored in bytea columns or some such.) My hope is to have a mechanism to partition indexes or recognise that they are partitioned, so that a set of provably-distinct unique indexes can provide the exact same functionality as a single large unique index, just without the management nightmare. Uhm... I don't quite get what you mean by provably-distinct unique indexes. As long as the first columns of an index are equal to all of the partitioning columns, there is no problem. You could easily reduce to simple per-table indexes, using the partitioning rule set to decide which index to query. But how to create a (unique) index which is completely different from the partitioning key? Regards Markus ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Auto Partitioning
Hi, Gregory Stark wrote: Put another way, multi-table indexes defeat the whole purpose of having partitioned the table in the first place. If you could have managed a single massive index then you wouldn't have bothered partitioning. That depends very much on the implementation of the multi-table index, as you describe below. I think the major missing part is not *how* such a meta-index should work - it's easily understandable that one could use the per-table indices - but a programming interface, similar to the current index scan or sequential scan facility, which could return a table and tuple pointer, no? However there is a use case that can be handled by a kind of compromise index. Indexes that have leading columns which restrict all subtrees under that point to a single partition can be handled by a kind of meta-index. So you have one index which just points you to the right partition and corresponding index. Yeah. That lets you enforce unique constraints as long as the partition key is part of the unique constraint. Is that already sufficient? That would alter the ordering of the columns in the index, no? I mean: CREATE INDEX x ON test(a, b, c); isn't the same as CREATE INDEX x ON test(c, b, a); That's why I'd say the first columns of an index would have to be equal to all of the columns used in the partitioning key. Regards Markus ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Auto Partitioning
Hi, Simon Riggs wrote: Most high volume tables are Fact tables with potentially more than 1 row per Object/Dimension, so the unique index isn't appropriate in those cases. When partitioning a Major Entity it's much easier to regard the PK as the partitioning key + unique key, which is frequently possible, even if it does break the exhortation against intelligent keys. Okay, so you are saying that a general purpose multi-table index isn't needed, but instead something based on the partitioning rule set and the per-table indexes should be sufficient for the vast majority of cases? Regards Markus ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Auto Partitioning
Hi, Gregory Stark wrote: However there are also cases such as where you have a=0..99 in one partition and a=100..199 in partition two, etc. It could still automatically build indexes on (a,b,c) on each partition and somehow note that the unique constraint is guaranteed across the whole partitioned table. Uhm... yes, because 'a' is the partitioning key. According to my outline for partitioning rule sets, you would have a split @ a = 100. Probably another one @ a = 200, etc... but nonetheless, 'a' is the only column needed to decide what partition a row has to end up in, so 'a' is the only column in the partitioning key. What I'm saying is that, given your example, it's not easily possible to have an index on (b,a), even if 'a' is also in the partitioning key. It's very well possible to emulate a multi-table index on (a,b), though. Brainstorming about this somewhat more: how about having multiple columns in the partitioning key, i.e. 'a' and 'b', and the following rule set (which admittedly is somewhat special):

          table sample
                |
         split @ a = 100
            /        \
    split @ b = 100   part3
         /    \
     part1    part2

An index on (a, b) could easily be 'emulated' by having such an index on all the partitions, but can we have an index on (b, a) like that? Probably not, because at the first split, we would have to duplicate. I.e. for an index scan on 'b = 22', we would have to scan the index on part3 as well as part1. Thus one can say that a multi-table index can only easily be 'emulated' if it has the same columns as the partitioning key, in the same order. For the above example, these ones would be possible: (a), (a,b), (a,b,...). Yet another thought: the emulation of multi-table indexes, in this case, is like concatenating the indexes of the partitions in the right order. 
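The claim that concatenating per-partition indexes only preserves global order when the index's leading columns follow the partitioning key can be checked with a toy example (made-up data; a split on 'a' at 100, rows as (a, b) pairs):

```python
# Two partitions produced by a split @ a = 100.
part_lo = [(95, 7), (99, 2)]    # rows with a < 100
part_hi = [(100, 1), (150, 9)]  # rows with a >= 100

# Concatenating the per-partition (a, b) indexes stays globally sorted ...
ab = sorted(part_lo) + sorted(part_hi)
print(ab == sorted(ab))  # True

# ... but concatenating per-partition (b, a) indexes does not,
# because both partitions cover the full range of b.
key_ba = lambda r: (r[1], r[0])
ba = sorted(part_lo, key=key_ba) + sorted(part_hi, key=key_ba)
print(ba == sorted(ba, key=key_ba))  # False
```

This is exactly the duplication problem mentioned above: a scan on 'b = 22' cannot be confined to one partition, so no concatenation order of the (b, a) indexes yields one sorted stream.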
Asking for an index scan for 'WHERE a >= 95 AND a <= 105' when having a split at a = 100, you would have to start on the index in the left bucket (with a < 100) and return everything until the end of the index, then continue on the index in the right bucket (with a >= 100). So you also have to be able to determine an order, which is easily possible for splits, but not so simple for modulos (hash partitioning). For such a modulo node, the executor would have to start multiple index scans, i.e.:

         table sample
               |
        'id' modulo 4
        /   |    |   \
    part1 part2 part3 part4

When scanning for a range (i.e. 'WHERE id >= 5 AND id <= 17'), the planner would have to request an index scan on each of the partitions, joining the results in the right order. So, why not completely emulate all multi-table index scans? The above restriction would disappear if we could teach the planner and executor how to join multiple index scan results, no? Questioning the other way around: do we need any sort of multi-table indexes at all, or isn't it enough to teach the planner and executor how to intelligently scan through (possibly) multiple indexes to get what is requested? Regards Markus ---(end of broadcast)--- TIP 6: explain analyze is your friend
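Emulating the multi-table index scan for the modulo case then amounts to merging the already-sorted per-partition index scans. A sketch with made-up data, where Python's heapq.merge stands in for the proposed merge step:

```python
import heapq

# Per-partition index scans already return ids in sorted order
# (ids routed by 'id' modulo 4).
part1 = [8, 12, 16]   # id % 4 == 0
part2 = [5, 9, 13]    # id % 4 == 1
part3 = [6, 10, 14]   # id % 4 == 2
part4 = [7, 11, 15]   # id % 4 == 3

# Merge the four sorted streams into one globally ordered result,
# instead of appending them and re-sorting afterwards.
merged = list(heapq.merge(part1, part2, part3, part4))
print(merged)  # [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
```

The merge step does O(n log k) comparisons for k partitions, touching each row once, which is why it can beat an append-then-sort plan.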
Re: [HACKERS] Auto Partitioning
Hi, Joshua D. Drake wrote: If we don't have multi-table indexes how do we enforce a primary key against a partitioned set? The executor would have to be clever enough to not do a single index scan, but possibly scan through multiple indexes when asking for uniqueness, depending on the partitioning rule set. Regards Markus ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
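A minimal sketch of that idea (hypothetical data, not executor code): probe each partition's index and fail on any hit, which enforces a cross-partition unique constraint without any multi-table index:

```python
# Each partition keeps its own index; a uniqueness check probes all of them.
partition_indexes = [{1, 4, 7}, {2, 5, 8}, {3, 6, 9}]

def check_unique(key):
    """Raise if any partition's index already contains the key."""
    for n, idx in enumerate(partition_indexes, start=1):
        if key in idx:
            raise ValueError("duplicate key %r in partition %d" % (key, n))

check_unique(10)  # fine: the key exists nowhere yet
# check_unique(5) would raise: key 5 already lives in partition 2
```

With a partitioning rule set available, only the partitions that could possibly contain the key would need probing, often reducing the check to a single index.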
Re: [HACKERS] Auto Partitioning
Simon Riggs wrote: The planner already uses the Append node to put together multiple plans. The great thing is it will put together IndexScans and SeqScans as applicable. No need for multi-scans as a special node type. Yes... only that mixing 'concurrent' index scans in the right order would probably save us an extra sort step in some cases. Consider this with hash partitioning on (id): SELECT * FROM test WHERE id > 1 AND id < 999 ORDER BY id; Every partition should have an index on (id), so we already have pretty well sorted data, we just need to mix the results of the index scans in the correct order, no? Regards Markus ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Auto Partitioning
Andrew Dunstan wrote: David Fetter wrote: That would be where the provably-distinct part comes in, so yes. That assumes you can provide some provably distinct test. In the general case I have in mind that isn't so. Could you please give a somewhat more concrete example? I'm not following here. Thanks Markus ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Auto Partitioning
Hi, Andrew Dunstan wrote: I guess my point was really that multi-table indexes might have uses beyond partitioning. Aha, now I understand. Thanks for the clarification. Say I have two tables, each with a field FKed to a field in a third table. I'd like to constrain the values to be unique across the referring tables. Now, there are various tricks that can be played either with custom triggers or redundant data to do this, but there's no easy way. However, a multi-table unique index would do it for me quite nicely, if we could create such a thing. Maybe going in a similar direction, it's better to think of it as a multi-table uniqueness constraint, which internally uses multiple single-table indexes? Regards Markus ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] SCMS question
Hi, Andrew Dunstan wrote: O.k. everyone pay attention, I am about to agree with Greg! ;) Greg, are there tools to migrate CVS to monotone or whatever your favorite is? The reason I ask is that I migrate the CVS to SVN every 4 hours I think it is, and it isn't perfect. monotone ships its own cvs_import. I'm about to improve that, using roughly the same algorithm as cvs2svn 2.0. The cvs2svn people are very well aware that their version 1.x is good, but has problems in some areas. There is a generic conversion tool called Tailor that might help you: http://www.darcs.net/DarcsWiki/Tailor Tailor does not have support for branches, but can only convert one branch at a time. And it is not half as clever as cvs2svn regarding the reconstruction of broken or manually tweaked CVS repositories. Regards Markus ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] SCMS question
Hi, Matthew D. Fuller wrote: I would say that a far greater contributor in practice would simply be frequency. If you diverge on your significant feature for 6 months, then try to merge in upstream changes from the main dev, you will be in hell no matter what merge algorithm you use. Do you have experience with automatic merging tools to back up this assertion? If so, I'd be curious to know. I'm maintaining the Postgres-R branch since about late PostgreSQL 7.4, using monotone's cvs_import and propagating from the original PostgreSQL repository to my Postgres-R branch. And even if I propagate quite rarely, automatic merge tools (i.e. monotone) helped me a *great deal(tm)*. (What's still awkward is the lacking cvs_import.) If you merge in upstream changes every few days, however, you will have many fewer and much simpler conflicts to deal with. A VCS as good as monotone gives you the option to merge random revisions in between. Thus, if you didn't merge for six months, you can easily do incremental merges, e.g. six merges of one month's worth of mainline code each. Of course, that still seems like more work than if you do it frequently :-) But the VCS should give you the option and let *you* choose, instead of enforcing whatever it thinks should be good for you. A VCS that makes frequent merges easy results in easier conflict handling, not by some magical auto-resolution, but just by letting you do it in ongoing regular and small bites. I disagree, by experience. And please note that it's not magic, but (in case of monotone) a provably correct (and simple to understand, I might add) algorithm which merges cleanly whenever possible. Regards Markus ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] SCMS question
Hi, Tom Lane wrote: Yah know, the one bit of these pitches that always sounds like pure snake oil is the claim that they offer some kind of mechanical solution to merge conflicts. AFAICS that has nothing to do with the SCMS in use and everything to do with whether your diff command is AI-complete. You should have said merge command. Ever tried that? Or kdiff3? Try it, you will be surprised! Regards Markus ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] [Monotone-devel] Re: SCMS question
Hi, Warren Turkal wrote: Cvs2svn seems to make as much sense of CVS data as possible. The only real problem I have seen is with regard to the malformed files I mentioned earlier. cvs2svn (1.x) still heavily relies on timestamps, which is certainly correct in most cases. But they are switching to a graph based approach for cvs2svn 2.0. I'm basing the reworked cvs_import feature of monotone on that same algorithm. Regards Markus ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] SCMS question
Hi, [EMAIL PROTECTED] wrote: I'll have to try kdiff3 - but the merge command, although it often works, I strongly dislike when it marks up the lines as there was a conflict here and gives you three files in the directory to choose to start from. This is far too manual, which invites mistakes. Agreed that this is somewhat annoying, but hey, it's a command line tool. How else would you solve displaying conflicts? If kdiff3 is more like the ClearCase graphical merge utility, I would far prefer that. Can you say I want change 2 followed by change 3 with checkboxes, a live final version to view, and the ability to manually type or adjust lines in the final version to view? Yup. That's possible. And much much more... ;-) (I don't know the ClearCase tool, so I can't really offer a comparison, sorry.) Others you might want to try: - meld (in python, IMO worse than kdiff3) - xxdiff (I've never really used that one, but other monotone hackers seem to like it as well) Regards Markus
Re: [HACKERS] tsearch in core patch, for inclusion
Hi, Peter Eisentraut wrote: Oleg Bartunov wrote: It's not so big addition to the gram.y, see a list of commands http://mira.sai.msu.su/~megera/pgsql/ftsdoc/sql-commands.html. As we are still discussing the syntax: is there a proposal for what a function-based syntax would look like? CREATE FULLTEXT CONFIGURATION myfts LIKE template_cfg AS DEFAULT; just seems so much more SQL-like than: SELECT add_fulltext_config('myfts', 'template_cfg', True); I admit that's a very simple and not fully thought-through example. But as long as those who prefer not to extend the grammar don't come up with a better alternative syntax, one easily gets the impression that extending the grammar in general is evil. In that proposed syntax, I would drop all =, ,, (, and ). They don't seem necessary and they are atypical for SQL commands. I'd compare with CREATE FUNCTION or CREATE SEQUENCE for SQL commands that do similar things. Yup, I'd second that. Regards Markus
Re: [HACKERS] SCMS question
Hi, Andrew Dunstan wrote: 1. The buildfarm is very heavily dependent on CVS, and any change to anything else will be quite painful. There is no guarantee that all the members even have SVN installed, But you can guarantee they have CVS or even cvsup installed? That seems dubious to me. ...let alone anything else. And someone would have to code and test significant client changes. That said, a lot of the tortuous logic could be removed - change detection would almost just resolve to comparing two tree numbers with SVN, for example. ...and a *real* VCS (as in monotone :-) ) would provide not only that, but give you correctness guarantees, built-in certification of revisions (i.e. each buildfarm member could issue a cert on successful testing) and lightweight branches, so you could test experimental patches from different authors much more easily. Just to name a few additional advantages. 2. Many people (and some buildfarm members) operate against mirrors of the main repo which are created with rsync or CVSup. I am not aware of any way to do the equivalent with SVN - any info would be gratefully received. You might want to have a look at svk. It can do exactly that. And Thomas' blog explains how; see [1]. Of course, SVN is better at disconnected operation than CVS, Really? I've dropped subversion exactly because it sucks big time when disconnected. But again, I'm probably not comparing against CVS... I have no doubt we'll change someday to something better. I don't know what it is and I don't think we need to be in any hurry. This space is still very fluid. Yup. Good to hear you see it that way. As I understand, you have good reasons to still be using CVS, but are open to good suggestions. That's a very good thing, but it easily slips by when reading all the criticism and pro-CVS statements. 
;-) Regards Markus [1]: Remote Backup Of A Subversion Repository http://moelhave.dk/2006/07/remote-mirroring-a-subversion-svn-repository/
Re: [HACKERS] SCMS question
Hi, [ I've CCed the monotone-devel list, as I'm sure those people are interested, too. ] Stefan Kaltenbrunner wrote: Beside that - are all of the currently supported Platforms officially supported by the proposed SCMSes ? I can only speak for monotone. We have (had) buildbots for x86 (linux, netbsd, freebsd, win32), amd64 (linux), ppc (osx) and one sparc (osol). So far all gcc compiled, AFAIK. We are very interested in increasing portability of monotone. If you could give me (or other monotone devels) ssh access to some of the more obscure boxes, that would help a lot. Please contact me privately. most of the issues with CVS in that regard have already been worked around (and are therefore solved). Huh? How do you guarantee the correctness of a local checkout? At best, you can check an md5 sum of a tar archive, but CVS itself does almost no integrity checking. Does the buildfarm code check somehow? Against what? (Note that we've already had quite a few disk failures uncovered by monotone, which does extensive integrity checking. But I'm sure database people know how important that is, don't you?) Or quickly test experimental patches? Is that solved? Or merging between branches, to add another major annoyance of CVS (and subversion, for that matter). I currently fetch the whole PostgreSQL repository via cvsup and then import it into monotone to be able to do serious work. Of course that's possible, and you can work around all the other limitations of CVS somehow, but it's annoying. But I agree that developers, especially those doing large patches over a long period of time, might gain something from another SCMS, but it is not really clear what that SCMS should be or if it warrants the imho enormous switching costs (and the potential disruption in development until that switch is done, which might take days if not weeks). I certainly agree that switching to another VCS is a major undertaking. And I'm working on easing migration to monotone. 
And I'll quite certainly try to convince you again, *at some point in the future*. I would not vote for switching the PostgreSQL repository to monotone, yet. (As if I had a vote... ;-) ) Regards Markus
Re: [HACKERS] SCMS question
Hi, Andrew Dunstan wrote: CVSup is not required, and is absent from most existing clients. I don't use it any more since the Fedora project stopped supporting it. ...which is quite understandable, considering the PITA compiling Modula-3 gives you (or at least has given me, it still hurts). The point you are missing is that, while we know existing buildfarm members all have CVS installed, we don't know that they have SVN or whatever, and requiring them to install it will involve significant distributed pain. Okay, I certainly agree that CVS is much more widespread and available than most (if not all) other VCSes. Let's change that ;-) It will also involve some considerable localised pain (probably on my part) in rewriting the client. Right now I'm thinking it might make some sense to future-proof buildfarm by creating some sort of snapshot server. OTOH, if we avoid use of whatever SCM system that the project uses, we aren't testing that part of the process. Did I mention that monotone is a very good snapshot server? *duck* You probably don't want to reinvent the wheel, as 'snapshot serving' is exactly what a VCS should do (among other things). You're making Tom's point again :-) Yeah, sorry, couldn't resist :-) IIRC you don't need to be connected to the repo to run svn diff, whereas you do to run cvs diff. Yes, in the simplest case of comparing against the working copy's base revision. But certainly not for: svn diff -r${FURTHER_IN_THE_PAST}, as subversion does not have that data available locally (nor does CVS, for that matter). We know the warts. If this were a green fields project there is no doubt we would not use CVS. But many proponents of other systems ignore the downside of changing. Well, I guess many advocates for other VCSes (like myself) simply don't particularly like to talk about the downsides... But they are probably more aware of them than most other people. 
One thing I want to know is that whatever we change to will still be there, maintained and in widespread use, many years down the track. So far I am not sure about that for any possible replacement, with the possible exception of SVN. That's certainly a valid concern, too. Probably *the* one where monotone is weaker compared to git and mercurial. :-( We are working on that issue, though. Regards Markus
Re: [HACKERS] tsearch in core patch, for inclusion
Hi, Andrew Dunstan wrote: If we are worried about the size of the transition table and keeping it in cache (see remarks from Tom upthread) then adding more keywords seems a bad idea, as it will surely expand the table. OTOH, I'd hate to make that a design criterion. Yeah, me too. Especially because it's an implementation issue against ease of use. (Or can somebody convince me that functions would provide a simple interface?) My main worry has been that the grammar would be stable. You mean stability of the grammar for the new additions or for all the grammar? Why are you worried about that? Just to quantify all this, I did a quick check on the grammar using bison -v - we appear to have 473 terminal symbols, and 420 non-terminal symbols in 1749 rules, generating 3142 states. The biggest tables generated are yytable and yycheck, each about 90kb on my machine. That already sounds somewhat better than Tom's 300 kb. And considering that these caches most probably grow faster than our grammar... Regards Markus
Re: [Monotone-devel] Re: [HACKERS] SCMS question
Hello Richard, you should probably have read the thread on the PostgreSQL -hackers mailing list I've linked to... at least you didn't make Tom's point ;-) Richard Levitte - VMS Whacker wrote: 1. Do you want to stay with CVS or do you want to move to something else? Most PostgreSQL developers currently want to stay with CVS. Only some desperate souls including myself are fiddling with other VCSes. 3. What would you want a replacement to be able to do? That's being debated, with many voices saying: CVS (plus our own hackery) provides all we need. (And be warned again: as soon as you point out an advantage of your favourite VCS, you're making Tom's point. ;-) ) So far, I'm getting the sense that there are a lot of opinions on what replacement system to use, a bit carelessly before having answered the above questions thoroughly. How did you get that impression? I'm currently *using* monotone for Postgres-R development, doing cvs_import and propagating to my branch. And I know others did the same already, too. Regards Markus
Re: [HACKERS] tsearch in core patch, for inclusion
Hi, Pavel Stehule wrote: Functions maybe doesn't see efective, but user's cannot learn new syntax. Are you serious? That argument speaks exactly *for* extending the grammar. From other databases, users are used to: CREATE TABLE ... (SQL) CREATE INDEX ... (SQL) CREATE FULLTEXT INDEX ... (Transact-SQL) CREATE TABLE (... FULLTEXT ...) (MySQL) CREATE INDEX ... INDEXTYPE IS ctxsys.context PARAMETERS ... (Oracle Text) And users are constantly complaining that PostgreSQL doesn't have fulltext indexing capabilities (if they don't know about tsearch2) or about how hard it is to use tsearch2. SELECT create_fulltext_mapping(cfgname, ARRAY['lex..','..'], ARRAY['...']) is readable. Hardly. Because it's not like SQL: - it's counter-intuitive to have to SELECT, when you want to CREATE something. - it's confusing to have two actions (select create) - why do I have to write ARRAYs to list parameters? - it's not obvious what you're selecting (return value?) - you have to keep track of the brackets, which can easily get messed up with two levels of them. Especially if the command gets multiple lines long. I agree so enhancing parser oabout not standard construct isn't good. Generally? Wow! This would mean PostgreSQL would always lag behind other RDBMSes regarding ease of use. Please don't do that! Regards Markus
Re: [Monotone-devel] Re: [HACKERS] SCMS question
Hi, Richard Levitte - VMS Whacker wrote: In message [EMAIL PROTECTED] on Thu, 22 Feb 2007 17:38:26 +0100, Markus Schiltknecht [EMAIL PROTECTED] said: markus So far, I'm getting the sense that there are a lot of markus opinions on what replacement system to use, a bit carelessly markus before having answered the above questions thoroughly. markus markus How did you get that impression? You said it yourself: Most PostgreSQL developers currently want to stay with CVS. Uh, yah. But I was referring to the lots of opinions on what replacement system to use. This has not much to do with the want or need (for lack of a better alternative) to stay with CVS, IMO. Unless there's a majority that wants to move on, I doubt there will be a move. In the end, it has to be a group effort, or it will simply not happen. I absolutely agree. And I'm quite sure most PostgreSQL developers also know that, thus I don't see much point in warning them - they are resistant enough ;-) As you might have noticed, I myself don't consider monotone ready for use by the PostgreSQL project, yet. And I've never advocated for switching *now*. I only made Tom's point... Regards Markus
Re: [HACKERS] tsearch in core patch, for inclusion
Hi, Tom Lane wrote: You mean four different object types. I'm not totally clear on bison's scaling behavior relative to the number of productions You really want to trade parser performance (which is *very* implementation specific) for ease of use? Bison generates a LALR [1] parser, which depends quite a bit on the number of productions. But AFAIK the dependency is mostly in memory consumption for the internal symbol sets, not so much in runtime complexity. I didn't find hard facts about the runtime complexity of LALR, though (pointers are very welcome). Are there any ongoing efforts to rewrite the parser (i.e. using another algorithm, like a recursive descent parser)? Regards Markus [1]: Wikipedia on the LALR parsing algorithm: http://en.wikipedia.org/wiki/LALR_parser
Re: [HACKERS] tsearch in core patch, for inclusion
Hi, Florian G. Pflug wrote: According to http://en.wikipedia.org/wiki/LR_parser, processing one token in any LR(1) parser in the worst case needs to: a) do a lookup in the action table with the current (state, token) pair, b) do a lookup in the goto table with a (state, rule) pair, and c) push one state onto the stack, and pop n states, with n being the number of symbols (tokens or other rules) on the right-hand side of a rule. a) and b) should be O(1). Processing one token pushes at most one state onto the stack, so overall no more than N states can be popped off again, making the whole algorithm O(N), with N being the number of tokens in the input stream. Looks correct, thanks. What exactly is Tom worried about, then? Are there any ongoing efforts to rewrite the parser (i.e. using another algorithm, like a recursive descent parser)? Why would you want to do that? I recall having read something about rewriting the parser. Together with Tom being worried about parser performance, and knowing that GCC switched to a hand-written parser some time ago, I suspected bison to be slow. That's why I asked. Regards Markus
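Florian's worst-case argument can be made concrete with a toy table-driven LR parser (a hypothetical sketch, not PostgreSQL's actual bison output): every token costs a constant number of table lookups and stack operations, so total work is linear in the token count.

```python
# Toy SLR(1) parser for the grammar:
#   rule 1: E -> E '+' 'n'
#   rule 2: E -> 'n'
# Hand-built tables; 's' = shift, 'r' = reduce, 'acc' = accept.
ACTION = {
    0: {'n': ('s', 3)},
    1: {'+': ('s', 2), '$': ('acc',)},
    2: {'n': ('s', 4)},
    3: {'+': ('r', 2), '$': ('r', 2)},
    4: {'+': ('r', 1), '$': ('r', 1)},
}
GOTO = {0: {'E': 1}}
RULES = {1: ('E', 3), 2: ('E', 1)}  # rule number -> (lhs, rhs length)

def parse(tokens):
    """Return the number of parser steps taken; raise SyntaxError on bad input."""
    tokens = list(tokens) + ['$']
    stack, i, steps = [0], 0, 0
    while True:
        steps += 1
        act = ACTION[stack[-1]].get(tokens[i])
        if act is None:
            raise SyntaxError(tokens[i])
        if act[0] == 's':            # shift: one table lookup, one push -> O(1)
            stack.append(act[1]); i += 1
        elif act[0] == 'r':          # reduce: pop len(rhs) states, one goto lookup
            lhs, length = RULES[act[1]]
            del stack[-length:]
            stack.append(GOTO[stack[-1]][lhs])
        else:                        # accept
            return steps

# Step count grows linearly with input length (3 steps per operand here).
print(parse(['n']), parse(['n', '+', 'n']), parse(['n', '+', 'n', '+', 'n']))
```

Note that each shift consumes one token and each reduce is paid for by earlier shifts, which is exactly the amortized O(N) bound described above.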
[HACKERS] compilation of pg_config fails
Hi, since str(n?)cat got replaced with strlcat, I fail to build PostgreSQL (current CVS HEAD). HAVE_DECL_STRLCAT is not set, so AFAIK the strlcat() function from src/port should be used. However, I've read the README there, but still don't quite know what's wrong. The linker throws:

gcc -O1 -Wall -pg -Wall -Wmissing-prototypes -Wpointer-arith -Winline -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -g pg_config.o -L../../../src/port -Wl,-rpath,'/usr/local/pgsql/lib' -lpgport -lz -lreadline -lcrypt -ldl -lm -o pg_config
pg_config.o: In function `show_pgxs':
/home/markus/projects/pgsql/sources/trunk/src/bin/pg_config/pg_config.c:216: undefined reference to `strlcat'

Even if objdump confirms that strlcat.o is compiled into ../../../src/port/libpgport.a:

In archive ../../../src/port/libpgport.a:
strlcpy.o: file format elf32-i386
SYMBOL TABLE:
ldf *ABS* strlcpy.c
ld .text .text
ld .data .data
ld .bss .bss
ld .debug_abbrev .debug_abbrev
ld .debug_info .debug_info
ld .debug_line .debug_line
ld .debug_frame .debug_frame
ld .debug_loc .debug_loc
ld .debug_pubnames .debug_pubnames
ld .debug_aranges .debug_aranges
ld .debug_str .debug_str
ld .note.GNU-stack .note.GNU-stack
ld .comment .comment
g F .text 0045 strlcpy
*UND* mcount
...

# uname -a:
Linux grml 2.6.20 #1 SMP PREEMPT Tue Feb 6 14:48:26 PST 2007 i686 GNU/Linux

Any idea? Regards Markus
Re: [HACKERS] Proposal: Commit timestamp
Hi, Jan Wieck wrote: Are we still discussing whether the Postgres backend may provide support for a commit timestamp that follows the rules for Lamport timestamps in a multi-node cluster? No. And I think you know my opinion about that by now. ;-) It seems more like we are drifting into what type of replication system I should design to please most people. Nobody is telling you what you should do. You're free to do whatever you want to. I'm only trying to get a discussion going, because a) I'm interested in how you plan to solve these problems and b) in the past, most people were complaining that all the different replication efforts didn't try to work together. I'm slowly trying to open up and discuss what I'm doing with Postgres-R on the lists. Just yesterday at the SFPUG meeting, I've experienced how confusing it is for users to have such a broad variety of (existing and upcoming) replication solutions. And I'm all for working together and probably even for merging different replication solutions. Regards Markus
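For reference, the Lamport timestamp rules discussed above are simple to state: increment a local counter for every local event (e.g. a commit), and on seeing a remote timestamp set the counter to max(local, remote) + 1, so causally later commits always carry larger stamps. A minimal sketch (class and method names are mine, not a proposed backend API):

```python
class LamportClock:
    """Per-node logical clock following Lamport's two rules."""
    def __init__(self):
        self.time = 0

    def local_event(self):
        """E.g. a local commit: advance the counter and return the stamp."""
        self.time += 1
        return self.time

    def receive(self, remote_time):
        """Merge a timestamp seen in the incoming replication stream."""
        self.time = max(self.time, remote_time) + 1
        return self.time

# Two nodes: an event that causally follows another always gets a larger stamp.
a, b = LamportClock(), LamportClock()
t1 = a.local_event()        # a commits: stamp 1
t2 = b.receive(t1)          # b sees a's commit: stamp max(0, 1) + 1 = 2
t3 = b.local_event()        # b commits afterwards: stamp 3
print(t1, t2, t3)           # 1 2 3
```

Note that Lamport stamps only guarantee this ordering for causally related events; concurrent commits on disconnected nodes can still receive stamps in either order, which is exactly the conflict-resolution debate in this thread.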
Re: [HACKERS] Proposal: Commit timestamp
Hi, Jan Wieck wrote: Then let me give you a little puzzle just for the fun of it. A database containing customer contact information (among other things) is a two node multimaster system. One is serving the customer web portal, the other is used by the company staff including the call center. At 13:45 the two servers lose connectivity to each other, yet the internal staff can access the internal server while the web portal is accessible from the outside. At 13:50 customer A updates their credit card information through the web portal, while customer B does the same through the call center. At 13:55 both customers change their mind to use yet another credit card, now customer A phones the call center while customer B does it via the internet. Phew, a mind twister... one customer would already be enough to trigger that sort of conflict... At 14:00 the two servers reconnect and go through conflict resolution. How do you intend to solve both conflicts without using any clock, because that seems to be a stopword causing instant rejection of whatever you propose? Needless to say, both customers will be dissatisfied if you charge the wrong credit card during your next billing cycle. Correct. But do these cases justify storing timestamps for each and every transaction you do? That's what I doubt, not the usefulness of time-based conflict resolution for certain cases. You can always add time-based conflict resolution by adding a timestamp column and deciding upon that one. I'd guess that the overall costs are lower that way. But you've withdrawn that proposal already, so... Which is a good discussion because one of the reasons why I stopped looking into Postgres-R is the fact that it is based on the idea of pushing all the replication information through a system that generates a global serialized message queue. 
That by itself isn't the problem, but the fact that implementing a global serialized message queue has serious throughput issues that are (among other details) linked to the speed of light. Agreed. Nevertheless, there are use cases for such systems, because they put fewer limitations on the application. One could even argue that your above example is one ;-) I am trying to start with a system that doesn't rely on such a mechanism for everything. I do intend to add an option later that allows to declare a UNIQUE NOT NULL constraint to be synchronous. What that means is that any INSERT, UPDATE, DELETE and SELECT FOR UPDATE will require the node to currently be a member of the (quorum or priority defined) majority of the cluster. Sounds reasonable. An advisory lock system, based on a total order group communication, will grant the lock on the unique key values on a first-come, first-served basis. Every node in the cluster will keep those keys as locked until the asynchronous replication stream reports the locking transaction as ended. If another remote transaction in the meantime requires updating such a key, the incoming stream from that node will be on hold until the lock is cleared. This is to protect against node B replicating a transaction from node A and a later update on node B arriving on C before C got the first event from A. A node that got disconnected from the cluster must rebuild the current advisory lock list upon reconnecting to the cluster. Yeah, this is a convenient way to replicate sequences via a GCS. I think that this will be a way to overcome Postgres-R's communication bottleneck, as well as allowing limited update activity even during a completely disconnected state of a node. Synchronous or group communication messages are reduced to the cases where the application cannot be implemented in a conflict-free way, like allocating a natural primary key. There is absolutely no need to synchronize, for example, creating a sales order. 
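The advisory lock scheme described above works because every node sees lock requests in the same total order and can therefore grant them deterministically, first come, first served, without further coordination. A simplified sketch (names are mine; the real design would tie releases to the replicated transaction stream and rebuild the table on reconnect):

```python
class AdvisoryKeyLocks:
    """First-come, first-served locks on unique key values,
    driven by a totally ordered group-communication delivery."""
    def __init__(self):
        self.holder = {}          # unique key value -> node id

    def deliver(self, key, node):
        """Called on *every* node, in the same (total) delivery order.
        Returns True iff `node` holds the lock after this delivery."""
        current = self.holder.setdefault(key, node)
        return current == node    # granted iff we were first (or already hold it)

    def release(self, key):
        """Called once the locking transaction has replicated everywhere."""
        self.holder.pop(key, None)

# Every node applies the identical request sequence, so all decide identically.
sequence = [('order-1001', 'node_a'), ('order-1001', 'node_b')]
node1, node2 = AdvisoryKeyLocks(), AdvisoryKeyLocks()
grants1 = [node1.deliver(k, n) for k, n in sequence]
grants2 = [node2.deliver(k, n) for k, n in sequence]
print(grants1)                    # [True, False] on both nodes
```

The point is that no reply messages are needed: total-order delivery alone makes the grant decision identical everywhere.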
Agreed, such cases can easily be optimized. But you have to be aware of the limitations these optimizations cause. Postgres-R is much more targeted at very general use cases. An application can use global unique IDs for the order number. And everything possibly referenced by an order (items, customers, ...) is stored in a way that the references are never updated. Deletes to those possibly referenced objects are implemented in a two step process, where they are first marked obsolete, and later on things that have been marked obsolete for X long are deleted. A REPLICA TRIGGER on inserting an order will simply reset the obsolete flag of referenced objects. If a node is disconnected longer than X, you have a problem - hunt down the guy who defined X. Yeah, that's another very nice optimization. Again, as long as you know the limitations, that's all well and fine. Merging certain ideas to come up with an async/sync hybrid? Seems to me we have similar enough ideas to need conflict resolution, because we had them simultaneously but communicate them asynchronously. Huh?
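The mark-obsolete/purge-later scheme can be sketched as follows (hypothetical names; X is the grace period, and `reference` plays the role of the REPLICA TRIGGER that resets the obsolete flag when an incoming order still needs the object):

```python
import time

class Catalog:
    """Referenced objects are never deleted directly: they are first
    marked obsolete, and purged only after a grace period X."""
    def __init__(self, grace_seconds):
        self.grace = grace_seconds
        self.items = {}             # item id -> obsolete-since timestamp, or None

    def add(self, item_id):
        self.items[item_id] = None

    def mark_obsolete(self, item_id, now=None):
        """Step one of the two-step delete."""
        self.items[item_id] = time.time() if now is None else now

    def reference(self, item_id):
        """What the REPLICA TRIGGER on an order insert would do: revive the row."""
        if item_id in self.items:
            self.items[item_id] = None

    def purge(self, now=None):
        """Step two: drop items obsolete for longer than the grace period."""
        now = time.time() if now is None else now
        self.items = {i: t for i, t in self.items.items()
                      if t is None or now - t <= self.grace}

cat = Catalog(grace_seconds=3600)
cat.add('widget'); cat.add('gadget')
cat.mark_obsolete('widget', now=0)
cat.mark_obsolete('gadget', now=0)
cat.reference('gadget')             # a replicated order still references it
cat.purge(now=7200)                 # well past the grace period
print(sorted(cat.items))            # ['gadget']
```

As the thread notes, the scheme only stays conflict-free while no node is disconnected longer than X.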
Re: [HACKERS] Proposal: Commit timestamp
Hi, Zeugswetter Andreas ADI SD wrote: And time based is surely one of the important conflict resolution methods for async MM replication. That's what I'm questioning. Wouldn't any other deterministic, but seemingly random, abort decision be as clever as time-based conflict resolution? It would then be clear to the user that it's random, and not some 'time-based in most cases, but not in others, and only if...' thing. Sure there are others, like rule-based or priority-based, but I think you don't need additional backend functionality for those. Got the point, yes. I'm impatient, sorry. Nevertheless, I'm questioning whether it is worth adding backend functionality for that. And given this probably is the most wanted resolution method, this question might be heretical. You could also see it as sort of a user-educating question: don't favor time-based resolution if that's the one resolution method with the most traps. Regards Markus
Re: [HACKERS] Proposal: Commit timestamp
Hi, Jan Wieck wrote: Whatever strategy one will use, in an async multimaster there are always cases that can be resolved by rules (last update being one of them), and some that I can't even imagine solving so far. I guess some of the cases will simply boil down to the application has to make sure that ... never occurs. Think of a multi-item order, created on one node, while another node is deleting the long unused item (which would have to be backordered). Now while those two nodes figure out what to do to make this consistent again, a third node does a partial shipment of that order. It helps to categorize these conflict types. There basically are: * data conflicts: simple row data, i.e. update - update conflicts. * uniqueness conflicts: two rows conflict because they'd violate a uniqueness constraint, i.e. insert - insert, update - insert or update - update. * visibility conflicts: basically the remaining update - delete and delete - delete cases. But also SELECT FOR UPDATE candidates, etc... Everything having to do with rows not yet or no longer being visible to a transaction. Your example certainly involves a visibility conflict (update - delete). Not even (sync) Postgres-R can guarantee consistency on the visibility level, i.e. a first transaction's SELECT FOR UPDATE might not see a recently committed transaction's newly inserted rows (because that one isn't replayed yet on the node, thus the transaction is working on an 'old' snapshot of the database state). Another, simpler example: Postgres-R doesn't raise a serialization error on delete - delete conflicts; it simply deletes the row once, even if two committed transactions each deleted that row. Luckily, most applications don't need that anyway, though. The solution is simple, reinsert the deleted item ... ...at which point timestamps certainly won't help :-) Sorry, couldn't resist... 
only that there were rather nasty ON DELETE CASCADE's on that item that removed all the consumer reviews, product descriptions, data sheets and what not. It's going to be an awful lot of undo. Huh? Are you planning on aborting *parts* of a transaction? I didn't think about that, but my gut feeling is that you don't want to do that. I haven't really made up my mind about a user-defined rule-based conflict resolution interface yet. I do plan to have a unique and foreign key constraint based, synchronous advisory locking system on top of my system in a later version (advisory key locks would stay in place until the transaction that placed them replicates). You'd have to elaborate on that... I guess you see by now why I wanted to keep the discussion about the individual, rather generic support features in the backend separate from the particular features I plan to implement in the replication system. Sure. I know, discussions about replication can get endless, probably even are so by definition ;-) But hey, they're fun! Everyone has different needs and consequently an async multi-master must do a whole range of mutually exclusive things altogether ... because Postgres can never accept a partial solution. We want the egg-laying milk-wool-pig or nothing. Like the one which would result from a merge of such an async replication with a sync one? Imagine being able to choose between sync and async per transaction... Regards Markus
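The three conflict categories above can be written down as a small classifier (a sketch; the `on_unique_key` flag distinguishes update - update collisions on a uniqueness constraint from plain row-data conflicts):

```python
def classify(op_a, op_b, on_unique_key=False):
    """Map a pair of conflicting operations to one of the three categories
    from the discussion: 'data', 'uniqueness', or 'visibility'."""
    ops = {op_a, op_b}
    # Visibility: a row not yet, or no longer, visible to one transaction.
    if 'delete' in ops or 'select_for_update' in ops:
        return 'visibility'
    # Uniqueness: two rows colliding on a unique constraint.
    if 'insert' in ops or on_unique_key:
        return 'uniqueness'
    # Data: a plain update - update conflict on the same row's data.
    return 'data'

print(classify('update', 'update'))                      # data
print(classify('insert', 'insert'))                      # uniqueness
print(classify('update', 'delete'))                      # visibility
print(classify('update', 'update', on_unique_key=True))  # uniqueness
```

The point of the categorization is that each class calls for a different resolution mechanism, which is why a single last-update-wins rule cannot cover all of them.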
Re: [HACKERS] Proposal: Commit timestamp
Hi, Theo Schlossnagle wrote: On Feb 4, 2007, at 1:36 PM, Jan Wieck wrote: Obviously the counters will immediately drift apart based on the transaction load of the nodes as soon as the network goes down. And in order to avoid this clock confusion and wrong expectation, you'd rather have a system with such a simple, non-clock based counter and accept that it starts behaving totally wonky when the cluster reconnects after a network outage? I rather confuse a few people than having a last update wins conflict resolution that basically rolls dice to determine last. If your cluster partitions and you have hours of independent action, and upon merge you apply a conflict resolution algorithm that has an enormous effect undoing portions of the last several hours of work on the nodes, you wouldn't call that wonky? You are talking about different things. Async replication, as Jan is planning to do, is per se wonky, because you have to cope with conflicts by definition. And you have to resolve them by late-aborting a transaction (i.e. after a commit). Or put it another way: async MM replication means continuing in disconnected mode (w/o quorum or some such) and trying to reconcile later on. It should not matter if the delay is just some milliseconds of network latency or three days (except of course that you probably have more data to reconcile). For sane disconnected (or more generally, partitioned) operation in multi-master environments, a quorum for the dataset must be established. Now, one can consider the database to be the dataset. So, on network partitions those in the quorum are allowed to progress with data modification and others only read. You can do this to *prevent* conflicts, but that clearly belongs to the world of sync replication. I'm doing this in Postgres-R: in case of network partitioning, only a primary partition may continue to process writing transactions. For async replication, it does not make sense to prevent conflicts when disconnected. 
Async is meant to cope with conflicts. So as to be independent of network latency. However, there is no reason why the dataset _must_ be the database and that multiple datasets _must_ share the same quorum algorithm. You could easily classify certain tables or schema or partitions into a specific dataset and apply a suitable quorum algorithm to that and a different quorum algorithm to other disjoint data sets. I call that partitioning (among nodes). And it's applicable to sync as well as async replication, while it makes more sense in sync replication. What I'm more concerned about, with Jan's proposal, is the assumption that you always want to resolve conflicts by time (except for balances, for which we don't have much information, yet). I'd rather say that time does not matter much if your nodes are disconnected. And (especially in async replication) you should prevent your clients from committing to one node and then reading from another, expecting to find your data there. So why resolve by time? It only makes the user think you could guarantee that order, but you certainly cannot. Regards Markus
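The primary-partition rule mentioned above, that on a network split only the partition holding a strict majority of the configured cluster may continue to process writing transactions, can be sketched in a few lines (a simplification; quorum- or priority-weighted variants would replace the plain node count):

```python
def is_primary_partition(visible_nodes, cluster_size):
    """A partition may accept writes only if it holds a strict majority
    of the configured cluster; minority partitions fall back to read-only."""
    return len(set(visible_nodes)) > cluster_size // 2

# A 5-node cluster splits 3 / 2: only the larger side keeps writing,
# so at most one primary partition can ever exist.
print(is_primary_partition({'n1', 'n2', 'n3'}, 5))   # True
print(is_primary_partition({'n4', 'n5'}, 5))         # False
```

Requiring a *strict* majority of the configured size (not of the currently reachable nodes) is what guarantees two disjoint partitions can never both consider themselves primary.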