Re: [HACKERS] Proposal - libpq Type System beta-0.8a (was PGparam)
On Tue, Jan 08, 2008 at 05:33:51PM -0500, Merlin Moncure wrote: Here is a short example which demonstrates some of the major features. There are many other examples and discussions of minutiae in the documentation. I haven't looked at the source but FWIW I think it's an awesome idea. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Those who make peaceful revolution impossible will make violent revolution inevitable. -- John F Kennedy
Re: [HACKERS] [BUGS] BUG #3852: Could not create complex aggregate
Tom Lane wrote: Joe Conway [EMAIL PROTECTED] writes: Did you want me to work on this? I could probably put some time into it this coming weekend.

I'll try to get to it before that --- if no serious bugs come up this week, core is thinking of wrapping 8.3.0 at the end of the week, so it'd be nice to have this dealt with sooner than that.

regards, tom lane

    CREATE AGGREGATE array_concat(anyarray) ( SFUNC = array_cat, STYPE = anyarray );
    CREATE AGGREGATE array_build(anyelement) ( SFUNC = array_append, STYPE = anyarray );
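For context, here is what these aggregates do once created; the table and data are hypothetical, shown only to illustrate the behavior (note that element order inside the result arrays depends on scan order):

    -- Hypothetical usage: collect each group's values into an array.
    CREATE TABLE t (grp int, val int);
    INSERT INTO t VALUES (1, 10), (1, 20), (2, 30);

    SELECT grp, array_build(val) FROM t GROUP BY grp;
    -- grp | array_build
    -- ----+-------------
    --   1 | {10,20}
    --   2 | {30}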
[HACKERS] timestamp refactor effort
So... in the vein of my last mail, I have tried to create another patch for refactoring out some of the HAVE_INT64_TIMESTAMP ifdefs in the code in timestamp.c. I have attached the patch. Please let me know if this patch is acceptable and what I can do to continue this effort. Thanks, wt

From 77db4f84999161c0c7ba8ded78636512cc719878 Mon Sep 17 00:00:00 2001
From: Warren Turkal [EMAIL PROTECTED]
Date: Wed, 9 Jan 2008 00:19:46 -0800
Subject: [PATCH] Add PackedTime typedef.

The PackedTime type is meant to be a type that holds the hour, minute, second, and fractional-seconds parts of the time. The actual primitive type that the PackedTime type uses is determined by the HAVE_INT64_TIMESTAMP define. I have also changed some of the instances of variable declarations for times to use the PackedTime type instead of the primitive types.
---
 src/backend/utils/adt/timestamp.c |   21 ++---
 src/include/utils/timestamp.h     |    9 +++--
 2 files changed, 9 insertions(+), 21 deletions(-)

diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 2883caf..1a4c247 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -44,11 +44,7 @@
 TimestampTz PgStartTime;
 
-#ifdef HAVE_INT64_TIMESTAMP
-static int64 time2t(const int hour, const int min, const int sec, const fsec_t fsec);
-#else
-static double time2t(const int hour, const int min, const int sec, const fsec_t fsec);
-#endif
+static PackedTime time2t(const int hour, const int min, const int sec, const fsec_t fsec);
 static int EncodeSpecialTimestamp(Timestamp dt, char *str);
 static Timestamp dt2local(Timestamp dt, int timezone);
 static void AdjustTimestampForTypmod(Timestamp *time, int32 typmod);
@@ -1539,11 +1535,10 @@ tm2timestamp(struct pg_tm * tm, fsec_t fsec, int *tzp, Timestamp *result)
 {
 #ifdef HAVE_INT64_TIMESTAMP
 	int			date;
-	int64		time;
 #else
-	double		date,
-				time;
+	double		date;
 #endif
+	PackedTime	time;
 
 	/* Julian day routines are not correct for negative Julian days */
 	if (!IS_VALID_JULIAN(tm->tm_year, tm->tm_mon, tm->tm_mday))
@@ -1648,19 +1643,15 @@ tm2interval(struct pg_tm * tm, fsec_t fsec, Interval *span)
 	return 0;
 }
 
-#ifdef HAVE_INT64_TIMESTAMP
-static int64
+static PackedTime
 time2t(const int hour, const int min, const int sec, const fsec_t fsec)
 {
+#ifdef HAVE_INT64_TIMESTAMP
 	return (((((hour * MINS_PER_HOUR) + min) * SECS_PER_MINUTE) + sec) * USECS_PER_SEC) + fsec;
-}	/* time2t() */
 #else
-static double
-time2t(const int hour, const int min, const int sec, const fsec_t fsec)
-{
 	return (((hour * MINS_PER_HOUR) + min) * SECS_PER_MINUTE) + sec + fsec;
-}	/* time2t() */
 #endif
+}	/* time2t() */
 
 static Timestamp
 dt2local(Timestamp dt, int tz)
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index 6eec76d..945b970 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -36,20 +36,17 @@
 #ifdef HAVE_INT64_TIMESTAMP
 typedef int64 Timestamp;
 typedef int64 TimestampTz;
+typedef int64 PackedTime;
 #else
 typedef double Timestamp;
 typedef double TimestampTz;
+typedef double PackedTime;
 #endif
 
 typedef struct
 {
-#ifdef HAVE_INT64_TIMESTAMP
-	int64		time;			/* all time units other than days, months and
-								 * years */
-#else
-	double		time;			/* all time units other than days, months and
+	PackedTime	time;			/* all time units other than days, months and
 								 * years */
-#endif
 	int32		day;			/* days, after time for alignment */
 	int32		month;			/* months and years, after time for alignment */
 } Interval;
-- 
1.5.3.7
Re: [HACKERS] [BUGS] BUG #3852: Could not create complex aggregate
Sorry for my previous message having no comments. Just a remark: these aggregates are created successfully in both 8.2 and 8.3beta4:

    CREATE AGGREGATE array_concat(anyarray) ( SFUNC = array_cat, STYPE = anyarray );
    CREATE AGGREGATE array_build(anyelement) ( SFUNC = array_append, STYPE = anyarray );

But the aggregate from my first letter does not:

    CREATE AGGREGATE build_group(anyelement, int4) ( SFUNC = add_group, STYPE = anyarray );

Excuse the noise and my bad English.
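The add_group transition function itself is not shown in the thread; for reference, a two-argument aggregate with STYPE = anyarray requires a transition function taking the state type first, then the aggregate's arguments. A hypothetical declaration with that shape (the body here is illustrative only):

    -- Hypothetical sketch: state array first, then the two aggregate arguments.
    CREATE FUNCTION add_group(anyarray, anyelement, int4) RETURNS anyarray AS $$
        SELECT CASE WHEN $3 > 0 THEN array_append($1, $2) ELSE $1 END;
    $$ LANGUAGE SQL IMMUTABLE;

It is this combination -- a polymorphic state type alongside a non-polymorphic extra argument -- that BUG #3852 is about.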
[HACKERS] operator suggest interval / interval = numeric
I suggest one more standard date/time operator, to divide one interval by another with a numeric (or float, for example) result, i.e. something like this:

    database=# SELECT '5400 seconds'::interval / '1 hour'::interval;
     ?column?
    ----------
          1.5
    (1 row)

Ilya A. Kovalenko
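No such operator is built in, but one can be defined today on top of epoch extraction. A sketch, assuming the epoch-based length of an interval is an acceptable definition of division (see downthread for the caveat about incommensurate intervals such as months versus days):

    -- Sketch: divide two intervals by comparing their lengths in seconds.
    -- STRICT skips NULLs; a zero-length divisor still raises division by zero.
    CREATE FUNCTION interval_div(interval, interval) RETURNS double precision AS $$
        SELECT extract(epoch FROM $1) / extract(epoch FROM $2);
    $$ LANGUAGE SQL IMMUTABLE STRICT;

    CREATE OPERATOR / (
        LEFTARG = interval,
        RIGHTARG = interval,
        PROCEDURE = interval_div
    );

    -- SELECT '5400 seconds'::interval / '1 hour'::interval;  --> 1.5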
Re: [PATCHES] [HACKERS] Archiver behavior at shutdown
On Sat, 2008-01-05 at 12:09, Simon Riggs wrote: On Fri, 2008-01-04 at 17:28 +0900, Fujii Masao wrote: Simon Riggs wrote: My original one line change described on bug 3843 seems like the best solution for 8.3. +1 Is this change in time for RC1? Patch attached.

Not sure why this hasn't been applied yet for 8.3. We have a small problem, a fix and a user voting for the fix. Can we discuss, briefly?

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
[HACKERS] Some ideas about Vacuum
Hi,

Maybe I am reposting something that has already been discussed to its end in this forum; I searched the archives and couldn't find anything immediately. With my relatively small experience in performance testing and tuning, one of the rules of thumb for getting performance is: don't do it if you don't need to do it.

When we look at clearing the older versions of tuples from our tables in PostgreSQL, we can't stop thinking about how it is done in other databases. When we compare the Oracle undo-log approach with the PostgreSQL vacuum approach, the pattern looks very similar to that of C++ memory de-allocation versus Java garbage collection. As you may all know, the thing which worries us about Vacuum is that it goes to places it need not go to. That's when we start thinking about a dead space map. This dead space map, if implemented correctly, would guide Vacuum to look only at places where there was some activity of delete/update/insert after the last Vacuum. This is accomplished at the cost of some very small overhead on inserts/deletes/updates.

A dead space map is like an undo log, if we think its role is to get rid of the older versions of data. Instead of moving the tuples to a separate location, it guides the Vacuum process in its cleanup task. Maybe we can even think of something like a dead space log, which need not be a bitmap. In this log, transactions would enter their transaction ids and ctids, which can be scanned by the Vacuum process. While this might take more space, it comes with less contention compared to a dead space map. To me, as far as I can think of, the only advantage of a dead space map over a dead space log is the disk space.

It just strikes me that the WAL is already doing just that. I think you can follow my line of thought. If we can ask the Vacuum process to scan the WAL log, it can get all the relevant details on where it needs to go. One optimization that can be placed here is to somehow make the archiver do a double job of helping Vacuum while doing the archiving. For people who have switched off archiving, this might not be a benefit.

One main restriction it places on the WAL Logs is that the WAL Log needs to be archived only after all the transactions in it complete. In other words, WAL logs need to be given enough space to survive the longest transaction of the database. It is possible to avoid this situation by asking the Vacuum process to take the necessary information out of the WAL log, store it somewhere, and wait for the long-running transaction to complete. The information of interest in the WAL is only the table inserts/updates/deletes. So if everyone accepts that this is a good idea up to this point, there is a point in reading further.

Ultimately, what has been achieved so far is that we have turned the sequential scans made by the Vacuum process on each table into a few random I/Os. Of course there are optimizations possible to group the random I/Os and find some sequential I/O out of them. But we still need to do a full index scan for all those indexes out there. HOT might have saved some work over there, but I am pessimistic here and wondering how it could be improved. So it strikes me that we can do the same thing we did with the tables: convert a seq scan of the entire table into a random scan of a few blocks. We can read the necessary tuple information from the tuples, group them, and hit the index in just those blocks to clean it up.

I can already hear people saying that it is not always possible to go back to the index from the table. There is this culprit called unstable function-based indexes. Their structure stops us from going back to the index from the table. So currently we should restrict the above approach to normal indexes only (not the function-based ones). I hope it would still give a good benefit. Of course Vacuum can convert the few random scans into a seq scan, if required, by referring to table statistics.

Thoughts on the idea?

Thanks,
Gokul.

P.S.: Let the objections/opposing views have a subtle reduction in their harshness.
Re: [HACKERS] VACUUM FULL out of memory
Thanks for the explanation on the ulimits; I can see how that could turn out to be a problem in some cases. Following Tom's suggestion, here is the startup script I used:

    #!/bin/sh
    ulimit -a > $PGHOST/server.ulimit
    pg_ctl start -l $PGHOST/server.log

The ulimits seem to be the same, though:

    $ cat server.ulimit
    core file size (blocks, -c) 1
    data seg size (kbytes, -d) unlimited
    max nice (-e) 0
    file size (blocks, -f) unlimited
    pending signals (-i) unlimited
    max locked memory (kbytes, -l) unlimited
    max memory size (kbytes, -m) unlimited
    open files (-n) 1024
    pipe size (512 bytes, -p) 8
    POSIX message queues (bytes, -q) unlimited
    max rt priority (-r) 0
    stack size (kbytes, -s) 8192
    cpu time (seconds, -t) unlimited
    max user processes (-u) unlimited
    virtual memory (kbytes, -v) unlimited
    file locks (-x) unlimited

Regards, Michael A.

Tom Lane wrote: Andrew Sullivan [EMAIL PROTECTED] writes: On Tue, Jan 08, 2008 at 05:27:16PM +0100, Michael Akinde wrote: Those are the ulimits of the db_admin account (i.e., the user that set up and runs the DB processes). Is Postgres limited by other settings? On one system I used many years ago, /bin/sh wasn't what I thought it was, and so the ulimit that I got when logged in was not what the postmaster was starting under. It took me many days to figure out what was up. The only thing I find convincing is to insert "ulimit -a" someplace into the script that starts the postmaster, adjacent to where it does so, and then reboot. There are too many systems on which daemons are launched under settings different from what interactive shells use (a policy that's often a good one, too). regards, tom lane
Re: [HACKERS] Some ideas about Vacuum
Hi,

Gokulakannan Somasundaram wrote: If we can ask the Vacuum process to scan the WAL log, it can get all the relevant details on where it needs to go.

You seem to be assuming that only few tuples have changed between vacuums, so that the WAL could quickly guide the VACUUM processes to the areas where cleaning is necessary. Let's drop that assumption, because by default autovacuum_scale_factor is 20%, so a VACUUM process normally kicks in after 20% of the tuples have changed (disk space is cheap, I/O isn't). Additionally, there's a default nap time of one minute - and VACUUM is forced to take at least that much of a nap. So it's easily possible to have more dead tuples than live ones. In such cases, scanning the WAL can easily take *longer* than scanning the table, because the amount of WAL to read would be bigger.

One main restriction it places on the WAL Logs is that the WAL Log needs to be archived only after all the transactions in it complete. In other words, WAL logs need to be given enough space to survive the longest transaction of the database. It is possible to avoid this situation by asking the Vacuum process to take the necessary information out of the WAL log, store it somewhere, and wait for the long-running transaction to complete.

That would result in even more I/O...

The information of interest in the WAL is only the table inserts/updates/deletes. So if everyone accepts that this is a good idea up to this point, there is a point in reading further.

Well, that's the information of interest; the question is where to store that information. Maintaining a dead space map looks a lot cheaper to me than relying on the WAL to store it.

Ultimately, what has been achieved so far is that we have turned the sequential scans made by the Vacuum process on each table into a few random I/Os. Of course there are optimizations possible to group the random I/Os and find some sequential I/O out of them. But we still need to do a full index scan for all those indexes out there. HOT might have saved some work over there, but I am pessimistic here and wondering how it could be improved. So it strikes me that we can do the same thing we did with the tables: convert a seq scan of the entire table into a random scan of a few blocks. We can read the necessary tuple information from the tuples, group them, and hit the index in just those blocks to clean it up.

Sorry, I don't quite get what you are talking about here. What do indexes have to do with dead space? Why not just keep acting at the block level?

I can already hear people saying that it is not always possible to go back to the index from the table. There is this culprit called unstable function-based indexes.

No, there's no such thing. Citing [1]: "All functions and operators used in an index definition must be immutable, that is, their results must depend only on their arguments and never on any outside influence." Of course, you can mark any function IMMUTABLE and get unstable function-based indexes, but that turns into a giant foot gun very quickly.

P.S.: Let the objections/opposing views have a subtle reduction in their harshness.

I'm just pointing at things that are in conflict with my knowledge, assumptions and beliefs, all of which might be erroneous, plain wrong or completely mad. ;-)

Regards

Markus

[1]: the Very Fine Postgres Manual on CREATE INDEX: http://www.postgresql.org/docs/8.3/static/sql-createindex.html
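The foot gun Markus mentions is easy to build; a hypothetical illustration (all names invented) of how a mislabeled IMMUTABLE function yields an "unstable" functional index:

    -- Hypothetical: a function whose result depends on table contents,
    -- falsely declared IMMUTABLE so it can be used in an index.
    CREATE TABLE rates (rate numeric);
    INSERT INTO rates VALUES (1.5);

    CREATE FUNCTION to_usd(amount numeric) RETURNS numeric AS $$
        SELECT amount * rate FROM rates LIMIT 1;  -- outside influence!
    $$ LANGUAGE SQL IMMUTABLE;                    -- the lie

    CREATE TABLE prices (amount numeric);
    CREATE INDEX prices_usd_idx ON prices (to_usd(amount));

    -- After UPDATE rates SET rate = 1.7, existing index entries no longer
    -- match what to_usd() now computes, so index scans can return wrong
    -- answers: the "unstable function-based index" problem.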
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sat, 2008-01-05 at 16:30 -0500, Robert Treat wrote: I'm not following this. If we can work out a scheme, I see no reason not to allow a single table to span multiple tablespaces.

That seems to be something we might want anyway, so yes.

The difference is that, if I currently have a table split by month, I can re-partition it into weekly segments, and only shuffle one month's data at a time to minimize impact on the system while I shuffle it. This can even be used to do dynamic management, where data from the current month is archived by day, data from the past year by week, and data beyond that done monthly.

Understood.

On many other databases, if you change the partition scheme, it requires exclusive locks and a shuffling of all of the data, even data whose partitions aren't being redefined. Even worse are systems like mysql, where you need to rewrite the indexes as well. To me, these requirements always seem like show stoppers; I generally can't afford to lock a table while the database rewrites a billion rows of data.

Agreed.

In any case, my thinking is if we had the segment exclusion technique, I could convert that partitioned table into a regular table again, use segment exclusion to handle what is currently handled by partitions, and create a global index across all the other data for that other, currently killer, query.

Yes, that's what I have in mind. Can I ask that you produce a gap analysis between what you have now and what you would have in the future, so we can see what omissions or errors there are in the segex proposal?

If we had indexes that spanned partitions, would we find that some of the queries that were producing seq scans will now produce better join and index plans, do you think?

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
[HACKERS] LD_LIBRARY_PATH not honored on Debian unstable
Hi,

I'm trying to run 'make check' on a 64bit Debian unstable. That aborts after 60 seconds due to not being able to connect to the postmaster. I figured that there's nothing wrong with the postmaster; rather, psql can't start up, because it gets linked against an older libpq.so.5. It looks like for some reason it doesn't respect the LD_LIBRARY_PATH env variable, see ldd output:

    [EMAIL PROTECTED]:/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress# LD_LIBRARY_PATH=/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress/./tmp_check/install//usr/lib ldd -r -v tmp_check/install/usr/bin/psql
    linux-vdso.so.1 => (0x7fffc2bfe000)
    libpq.so.5 => /usr/lib/libpq.so.5 (0x2ac8e81ba000)
    libz.so.1 => /usr/lib/libz.so.1 (0x2ac8e83db000)
    libreadline.so.5 => /lib/libreadline.so.5 (0x2ac8e8606000)
    libcrypt.so.1 => /lib/libcrypt.so.1 (0x2ac8e8846000)
    libdl.so.2 => /lib/libdl.so.2 (0x2ac8e8a7e000)
    libm.so.6 => /lib/libm.so.6 (0x2ac8e8c82000)
    libc.so.6 => /lib/libc.so.6 (0x2ac8e8f04000)
    libssl.so.0.9.8 => /usr/lib/libssl.so.0.9.8 (0x2ac8e9262000)
    libcrypto.so.0.9.8 => /usr/lib/libcrypto.so.0.9.8 (0x2ac8e94ae000)
    libkrb5.so.3 => /usr/lib/libkrb5.so.3 (0x2ac8e983c000)
    libcom_err.so.2 => /lib/libcom_err.so.2 (0x2ac8e9ad7000)
    libpthread.so.0 => /lib/libpthread.so.0 (0x2ac8e9cd9000)
    libncurses.so.5 => /usr/lib/libncurses.so.5 (0x2ac8e9ef5000)
    /lib64/ld-linux-x86-64.so.2 (0x2ac8e7f9c000)
    libk5crypto.so.3 => /usr/lib/libk5crypto.so.3 (0x2ac8ea132000)
    libkrb5support.so.0 => /usr/lib/libkrb5support.so.0 (0x2ac8ea357000)
    libkeyutils.so.1 => /lib/libkeyutils.so.1 (0x2ac8ea55f000)
    libresolv.so.2 => /lib/libresolv.so.2 (0x2ac8ea761000)
    undefined symbol: pg_valid_server_encoding_id (tmp_check/install/usr/bin/psql)
    undefined symbol: PQconnectionNeedsPassword (tmp_check/install/usr/bin/psql)

Giving it an additional LD_PRELOAD for the newish libpq.5.1 helps:

    [EMAIL PROTECTED]:/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress# LD_PRELOAD=/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress/tmp_check/install/usr/lib/libpq.so.5.1 LD_LIBRARY_PATH=/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress/./tmp_check/install//usr/lib ldd -r -v tmp_check/install/usr/bin/psql
    linux-vdso.so.1 => (0x7fffe97fe000)
    /home/markus/projects/pgsql/sources/current/pgsql/src/test/regress/tmp_check/install/usr/lib/libpq.so.5.1 (0x2b69c1547000)
    libz.so.1 => /usr/lib/libz.so.1 (0x2b69c1765000)
    libreadline.so.5 => /lib/libreadline.so.5 (0x2b69c199)
    libcrypt.so.1 => /lib/libcrypt.so.1 (0x2b69c1bd)
    libdl.so.2 => /lib/libdl.so.2 (0x2b69c1e08000)
    libm.so.6 => /lib/libm.so.6 (0x2b69c200c000)
    libc.so.6 => /lib/libc.so.6 (0x2b69c228e000)
    libncurses.so.5 => /usr/lib/libncurses.so.5 (0x2b69c25ec000)
    /lib64/ld-linux-x86-64.so.2 (0x2b69c1329000)

Does somebody have an idea what's wrong here? Thanks.

Regards Markus
Re: [HACKERS] LD_LIBRARY_PATH not honored on Debian unstable
Markus Schiltknecht wrote: Hi, I'm trying to run 'make check' on a 64bit Debian unstable. That aborts after 60 seconds due to not being able to connect to the postmaster. I figured that there's nothing wrong with the postmaster, rather psql can't start up, because it gets linked against an older libpq.so.5. It looks like for some reason, it doesn't respect the LD_LIBRARY_PATH env variable, see ldd output: [EMAIL PROTECTED]:/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress# LD_LIBRARY_PATH=/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress/./tmp_check/install//usr/lib ldd -r -v tmp_check/install/usr/bin/psql [snip] undefined symbol: pg_valid_server_encoding_id (tmp_check/install/usr/bin/psql) undefined symbol: PQconnectionNeedsPassword (tmp_check/install/usr/bin/psql) Giving it an additional LD_PRELOAD for the newish libpq.5.1 helps: [EMAIL PROTECTED]:/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress# LD_PRELOAD=/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress/tmp_check/install/usr/lib/libpq.so.5.1 LD_LIBRARY_PATH=/home/markus/projects/pgsql/sources/current/pgsql/src/test/regress/./tmp_check/install//usr/lib ldd -r -v tmp_check/install/usr/bin/psql [snip] Somebody have an idea on what's wrong here? Thanks.

Smells suspiciously like an rpath problem to me. What are your configure settings?

cheers andrew
Re: [HACKERS] operator suggest interval / interval = numeric
On Wed, 09.01.2008, at 17:33:00 +0700, Ilya A. Kovalenko wrote: I suggest one more standard date/time operator, to divide one interval by another with numeric (or float, for example) result. I.e. something like that: database=# SELECT '5400 seconds'::interval / '1 hour'::interval; ?column? 1.5 (1 row)

    test=# SELECT extract(epoch from '5400 seconds'::interval) / extract(epoch from '1 hour'::interval);
     ?column?
    ----------
          1.5

Andreas
-- Andreas Kretschmer Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (more: see header) GnuPG-ID: 0x3FFF606C, private 0x7F4584DA http://wwwkeys.de.pgp.net
Re: [HACKERS] LD_LIBRARY_PATH not honored on Debian unstable
Andrew Dunstan wrote: Smells suspiciously like an rpath problem to me. What are your configure settings?

Ah, yeah, I see. Using something other than --prefix=/usr helped. Thanks for the hint!

Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sun, 2008-01-06 at 11:39 +0100, Markus Schiltknecht wrote: I think this has to do with SE not being of much use for index scans.

Hmmm. I think it fits rather neatly with BitmapIndexScans. It would be easy to apply the index condition and/or filters to see which segments are excluded and then turn off bits in the bitmap appropriately. Not fully sure about IndexScans yet. I don't think it would be worth trying to apply SE until we estimated we would return say 100 rows. It needs to be able to work without slowing down the common path.

Or put it another way: SE is an optimization for sequential scans. For tables where it works well, it could possibly replace the index entirely.

True.

Without the index, you would rely on SE to always be able to exclude enough segments, so that the seq scan is less expensive than an index scan with the following table lookups.

It would have to be a very fat index scan for so large a table...

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Mon, 2008-01-07 at 14:20 +0100, Markus Schiltknecht wrote: AFAIUI, Segment Exclusion combines perfectly well with clustering.

Yes, seems like it would be possible to have a segment-aware CLUSTER, so it was actually usable on large tables. Not planning that initially though.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Some ideas about Vacuum
So it's easily possible to have more dead tuples than live ones. In such cases, scanning the WAL can easily take *longer* than scanning the table, because the amount of WAL to read would be bigger.

Yes... I made a wrong assumption there... so the idea is totally useless.

Thanks, Gokul.
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Simon Riggs wrote: Hmmm. I think it fits rather neatly with BitmapIndexScans. It would be easy to apply the index condition and/or filters to see which segments are excluded and then turn off bits in the bitmap appropriately.

Yeah, good point.

Not fully sure about IndexScans yet. I don't think it would be worth trying to apply SE until we estimated we would return say 100 rows. It needs to be able to work without slowing down the common path.

Yup.

Or put it another way: SE is an optimization for sequential scans. For tables where it works well, it could possibly replace the index entirely.

True.

Without the index, you would rely on SE to always be able to exclude enough segments, so that the seq scan is less expensive than an index scan with the following table lookups.

It would have to be a very fat index scan for so large a table...

...for SE to be faster than an index scan, you mean? Yes.

Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Mon, 2008-01-07 at 12:14 +0100, Csaba Nagy wrote: On Wed, 2008-01-02 at 17:56, Simon Riggs wrote: Like it? Very cool :-)

Thanks. As ever, a distillation of various thoughts, not all mine.

One additional thought: what about a kind of segment fill factor? Meaning: each segment has some free space reserved for future updates/inserts of records in the same range of its partitioning constraint. And when inserting/updating you put the new record into the corresponding segment... just like a very coarse clustering. Then you could vacuum the segments separately to keep the free space from running out. For active segments you would then fix the partitioning constraint range once the fill factor is reached, to allow for keeping its constraint even when heavily updating (heavily vacuuming it too as a response to that), and create a new segment for the unbounded range for new inserts... this would work fine for tables where the constraint is based on ever-increasing keys and accidental inserts in old ranges (which do happen occasionally in real life).

Lots of thoughts there, so I'll try to separate them out and analyse. The way I originally described it is a very simple mechanism and we could tweak that some more. All ideas welcome.

If we had dynamic segment constraints when the segment was not yet read-only, that would lead to changes in the segment constraint for each INSERT when we have increasing keys. It seems better to set the constraints once only, when the segment is full, then prevent further INSERTs. The accidental inserts in old ranges seem like something that can be avoided if it is a real-world problem.

For UPDATEs on a segment with constraints we might choose to apply the constraints to see what to do. You might want to allow the UPDATE and have it stretch the constraints outwards, or you might want to prevent it and throw an error, or you might want to allow the UPDATE yet migrate the new tuple to the appropriate segment. Dynamic partitioning works in multiple dimensions, so there isn't just one single valid location for any row. I.e. if we update a row with OrderId, OrderDate, RequiredByDate, ShipDate and LastModifiedDate on it, we'll probably expand the constraints on at least one of those. If we were lucky enough to have only changed one of those, it might turn out there was *in that case* a single more sensible location for the new tuple, but that probably isn't the common case. So the likely best behaviour for UPDATEs is to try to keep the new row version in the same segment, then change the constraints.

The segment constraints concept and the read-only concept were linked. You're right that we could separate them, though that turns out not to be that desirable. When we do a DELETE or an UPDATE we don't know whether the deleted row version was the last tuple with that particular boundary value. So we don't know whether the DELETE or UPDATE changes the constraints or not, and however we try to avoid it, we'll probably need to recalculate the constraints in some circumstance. So updating the constraints dynamically isn't anywhere near as easy as it sounds and ultimately probably isn't worth the effort. So that's why constraints and read-only go together.

HOT allows us to keep the new row version within the segment, in many cases. What might be worth doing is marking the FSM space for that segment as update-only to exclude inserts from using it, then forcing UPDATEs to stay within the segment if possible by providing the current block number in each call to the FSM.
That would also mean that every UPDATE that wants to do a multi-block update on a currently read-only segment would need to call the FSM. Sounds good, but that would only work for the first UPDATE on a segment after it is marked read only, which isn't much use, or we would do it for *every* block-spanning UPDATE, which would cause contention in other use cases. So although I'm willing to listen and tweak, that hasn't yet resulted in any additional design points, unless I missed something above? When the change rate of old segments get down, the segments could be reorganized to have a smaller fill factor, so that you still allow for accidental updates but keep space usage efficient. This would be some similar action as a clustering, but hopefully not blocking (which might be a hard thing to do)... and later again you could mark some of the really old things as read only and put them in special segments with no wasted space. Yes, hopefully marking read only and compressed segments is a possible future. One problem would be when the segment's free space runs out, so you must put records from the same constraint range in multiple segments - but that could still work, you just would have multiple segments covering the same range, but if the segment fill factor is chosen properly it should not be the case... you could normally maintain a set of non-overlapping segments in terms of the
Re: [HACKERS] Some notes about the index-functions security vulnerability
On Wed, 2008-01-09 at 00:22 -0500, Tom Lane wrote: pgsql-core wasted quite a lot of time

Core's efforts are appreciated by all, so not time wasted.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sat, 2008-01-05 at 16:42 +0100, Markus Schiltknecht wrote: Simon Riggs wrote: On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote: I'm still puzzled about how a DBA is expected to figure out which segments to mark. Simon, are you assuming we are going to pass on segment numbers to the DBA one day? No way! Ah, I'm glad ;-)

But now you want names on partitions?

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Named vs Unnamed Partitions
On Wed, 2008-01-09 at 02:25, Gregory Stark wrote: Markus Schiltknecht [EMAIL PROTECTED] writes: There are two very distinct ways to handle partitioning. For now, I'm calling them named and unnamed partitioning.

The naming is precisely the useful part in that it is how the DBA associates the properties with chunks of data.

Why does giving something a name help partition exclusion?

Without naming the DBA would have to specify the same ranges every time he wants to change the properties. He might do a SET read_only WHERE created_on < '2000-01-01' one day, then another SET tablespace tsslow WHERE created_on < '2000-01-01', and then later again do SET offline WHERE created_on < '2000-01-01'. I have to admit I always found it kludgy to have objects named invoices_2000_JAN and invoices_2000_FEB and so on. It's kind of a meta denormalization. But so is specifying where clauses repeatedly.

The idea for using the WHERE clauses was to specifically avoid naming. In most cases the table is divided into old read-only and newer data, so there is one split point that makes it easy to use a simple WHERE clause. If you guys really want names, we can have names, but I think I want to see a case where the storage characteristics of the table are so complex we can only make sense of it by naming particular chunks.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Gavin and all,

This is quite a long reply, so apologies for that.

On Wed, 2008-01-09 at 07:28 +0100, Gavin Sherry wrote: On Wed, Jan 02, 2008 at 05:56:14PM, Simon Riggs wrote: This technique would be useful for any table with historical data keyed by date or timestamp. It would also be useful for data where a time-of-insert component is implicit, such as many major entity tables where the object ids are assigned by a sequence, e.g. an Orders table with an OrderId as PK. Once all orders placed in a period have been shipped/resolved/closed then the segments will be marked read-only.

A novel approach to the problem. For me, this proposal addresses some of the other problems in postgres (visibility in the heap vs. index) rather than the problems with partitioning itself.

I think you're right that it isn't attempting to address the problems with partitioning directly; it is attempting to address the core use case itself. I think we have an opportunity to bypass the legacy-of-thought that Oracle has left us and implement something more usable. There are some low-level technical issues associated with declarative partitioning that I'll address last, which are in many ways the big reason to want to go to non-declarative partitioning.

It might seem otherwise, but for me partitioning is a tool for managing large volumes of data. It allows the user to encode what they know about the nature of their data, and that's often about management. In this way, your proposal actually makes partitioning too smart: the user wants to tell the system how to organise the data, as opposed to the other way around. At Greenplum, we've been discussing this in depth. Interestingly, we also felt that the storage layer could be much smarter with large tables with predictable layouts and predictable data patterns. But the thing is, people with large tables like partitioning, putting different partitions on different storage; they like the ability to merge and split partitions; they like truncating old partitions. In a word, they seem to like the manageability partitions give them -- as well as the performance that comes with this.

Do people really like running all that DDL? (An illustration of that DDL burden follows below.) There is significant manpower cost in implementing and maintaining a partitioning scheme, plus significant costs in getting it wrong. If people with large tables like partitioning, why is Oracle moving towards automated partitioning in 11g? Automated partitioning was one of the major goals for this next set of Postgres partitioning functionality also, whether or not we have declarative partitioning. My thinking is: let's go for the best ideas and skip over the stuff that will be (is) obsolete before it's left the design stage.

I see many more systems in the Terabyte range now than I did 10 years ago, but I see about the same number of DBAs. We'll always need experts, but I feel we should be using our expertise to simplify the standard cases, not just maintain the same level of difficulty in managing them. One of the big benefits of the dynamic partitioning approach is that it needs no DDL. So it will work out-of-the-box, for anybody.

Deleting older data would be optimised under the proposed scheme, so that's not really a problem. Loading data is actually slightly harder and slower with declarative partitioning (see below). Merging and splitting partitions are tools for fine-tuning a very complex partitioning scheme. They do also allow a non-linear segmentation scheme, which might aid performance in some cases.
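For readers unfamiliar with the DDL burden being discussed, here is a sketch of what today's inheritance-based declarative partitioning requires for every partition (table, index and rule names are illustrative, following the 8.x constraint-exclusion idiom):

    -- One child table per month, each with its own CHECK constraint...
    CREATE TABLE orders (orderid bigint, orderdate date NOT NULL);
    CREATE TABLE orders_2008_01 (
        CHECK (orderdate >= DATE '2008-01-01' AND orderdate < DATE '2008-02-01')
    ) INHERITS (orders);
    -- ...plus per-partition indexes...
    CREATE INDEX orders_2008_01_orderdate ON orders_2008_01 (orderdate);
    -- ...plus a rule (or trigger) to route INSERTs to the right child,
    -- all repeated for orders_2008_02, orders_2008_03, and so on.
    CREATE RULE orders_insert_2008_01 AS
        ON INSERT TO orders
        WHERE orderdate >= DATE '2008-01-01' AND orderdate < DATE '2008-02-01'
        DO INSTEAD INSERT INTO orders_2008_01 VALUES (NEW.*);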
(On a different note, I'm strongly in favour of a declarative approach to other optimizer information to allow the DBA to say what they know about how a table will be used. But that's off-topic for now.)

One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example, if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some sloppiness in the partitioning, e.g. if we fill one partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. When we look at the data in the partition we can look at any number of columns. When we declaratively partition, you get only one connected set of columns, which is one of the reasons you want multi-dimensional partitioning in the first place.

To this end, we (well, Jeff Cohen) looked at the syntax and semantics of partitioning in leading databases (Oracle, Informix, DB2) and came up with a highly expressive grammar which takes the best of each, I think (I'll post details on the grammar in a separate thread). The idea is that range (for
Re: [HACKERS] Some ideas about Vacuum
Markus Schiltknecht [EMAIL PROTECTED] writes: Hi, Gokulakannan Somasundaram wrote: If we can ask the Vacuum process to scan the WAL log, it can get all the relevant details on where it needs to go.

That's an interesting thought. I think your caveats are right, but with some more work it might be possible to work it out. For example, a background process could process the WAL and accumulate an array of possibly-dead tuples to process in batch. It would wait whenever it sees an xid which isn't yet past globalxmin, and keep accumulating until it has enough to make it worthwhile doing a pass.

I think a bigger issue with this approach is that it ties all your tables together. You can't process one table frequently while some other table has some long-lived deleted tuples. I'm also not sure it really buys us anything over having a second dead-space-map data structure. The WAL is much larger and serves other purposes which would limit what we can do with it.

You seem to be assuming that only few tuples have changed between vacuums, so that the WAL could quickly guide the VACUUM processes to the areas where cleaning is necessary. Let's drop that assumption, because by default autovacuum_scale_factor is 20%, so a VACUUM process normally kicks in after 20% of the tuples have changed (disk space is cheap, I/O isn't). Additionally, there's a default nap time of one minute - and VACUUM is forced to take at least that much of a nap.

I think this is exactly backwards. The goal should be to improve vacuum, then adjust the autovacuum_scale_factor as low as we can. As vacuum gets cheaper the scale factor can go lower and lower. We shouldn't allow the existing autovacuum behaviour to control the way vacuum works.

As a side point, "disk space is cheap, I/O isn't" is a weird statement. The more disk you use, the more I/O you'll have to do to work with the data.

I still maintain the default autovacuum_scale_factor is *far* too liberal. If I had my druthers it would be 5%. But that's mostly informed by TPCC experience; in real life the actual value will vary depending on the width of your records and the relative length of your transactions versus transaction rate. The TPCC experience is with ~400-byte records and many short transactions.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
Re: [PATCHES] [HACKERS] Archiver behavior at shutdown
Simon Riggs [EMAIL PROTECTED] writes: Not sure why this hasn't been applied yet for 8.3.

Because it doesn't fix the problem ... which is that the postmaster kills the archiver (and the stats collector too) at what is now the wrong point in the shutdown sequence.

regards, tom lane
Re: [HACKERS] Named vs Unnamed Partitions
Simon Riggs wrote: I have to admit I always found it kludgy to have objects named invoices_2000_JAN and invoices_2000_FEB and so on. It's kind of a meta denormalization. But so is specifying where clauses repeatedly. The idea for using the WHERE clauses was to specifically avoid naming.

I understand, and I'm all for avoiding needless, kludgy names. As I pointed out, knowledge of split points might be important for the database system. Maybe we can store the split point without the need for names? Dunno.

If you guys really want names, we can have names, but I think I want to see a case where the storage characteristics of the table are so complex we can only make sense of it by naming particular chunks.

Well, assuming you only have to deal with one split point, that's certainly true. However, there are people using more than two tablespaces, thus obviously needing more split points. Can we name the split points, rather than the partitions?

Regards Markus
Re: [HACKERS] operator suggest interval / interval = numeric
Ilya A. Kovalenko [EMAIL PROTECTED] writes: I suggest one more standard date/time operator, to divide one interval by another with numeric (or float, for example) result.

You'd have to define exactly what that means, which seems a little tricky for incommensurate intervals. For instance, what is the result of '1 month' / '1 day'?

regards, tom lane
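To make the incommensurability concrete: under the epoch-based definition used in the workaround upthread, PostgreSQL converts months and days to seconds with fixed factors (30-day months, 24-hour days), so the division has an answer, just an arguable one. A small demonstration, not a position on what the right semantics would be:

    SELECT extract(epoch FROM '1 month'::interval);  -- 2592000  (assumes 30 days)
    SELECT extract(epoch FROM '1 day'::interval);    --   86400  (assumes 24 hours)
    SELECT extract(epoch FROM '1 month'::interval)
         / extract(epoch FROM '1 day'::interval);    -- 30, though a real month
                                                     -- has 28 to 31 days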
Re: [PATCHES] [HACKERS] Archiver behavior at shutdown
On Wed, 2008-01-09 at 10:15 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: Not sure why this hasn't been applied yet for 8.3. Because it doesn't fix the problem ... which is that the postmaster kills the archiver (and the stats collector too) at what is now the wrong point in the shutdown sequence.

The original bug report states the problem as being that the archiver stays for a noticeable period after postmaster shutdown. My patch fixes that very safely. It doesn't fix your request for redesign, which I accept is still pending, and I've explained why. I don't see any reason to leave the original problem hanging just because the fix isn't as wide as we might really like.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Named vs Unnamed Partitions
On Wed, 2008-01-09 at 16:20 +0100, Markus Schiltknecht wrote: Simon Riggs wrote: I have to admit I always found it kludgy to have objects named invoices_2000_JAN and invoices_2000_FEB and so on. It's kind of a meta denormalization. But so is specifying where clauses repeatedly. The idea for using the WHERE clauses was to specifically avoid naming. I understand, and I'm all for avoiding needless, kludgy names. As I pointed out, knowledge of split points might be important for the database system. Maybe we can store the split point without the need for names? Dunno. If you guys really want names, we can have names, but I think I want to see a case where the storage characteristics of the table are so complex we can only make sense of it by naming particular chunks. Well, assuming you only have to deal with one split point, that's certainly true. However, there are people using more than two tablespaces, thus obviously needing more split points. Can we name the split points, rather than the partitions?

So far, I've been looking at partition exclusion as the most important feature for the VLDB use case. You seem to have moved straight on to what I've regarded as later features for partitioning. From my side, if I can't make SE work then most of the other features seem moot, even though I personally regard them as important also.

With that in mind, can I clarify what you're thinking, please?

1) the things you've been discussing are so important I should do them first, which would necessarily require named chunks of tables

2) the things you've been discussing are essential requirements of partitioning and we could never consider it complete until they are also included, and we must therefore talk about them now to check that it's all possible before we do anything on SE

3) doing SE first is right, I'm just thinking ahead

4) the topics aren't really linked and I'm suggesting doing development on them in parallel

or...

Sorry if that seems blunt, I'm just not clear where we're going. I have to think about implementability, planning and priorities if I'm to get it done.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Named vs Unnamed Partitions
On Wed, 2008-01-09 at 15:53, Gregory Stark wrote: Simon Riggs [EMAIL PROTECTED] writes: Perhaps a good analogy is indexes. Index names are themselves kind of redundant and people usually use names which encode most of the information of the definition. But the reason you need names for indexes is so that you can refer to them later to drop them, rebuild them, change their properties such as tablespace, fill factor, etc. You could imagine imposing on users that they should restate the index definition every time they want to change the fill factor or tablespace, but I'm sure you could see the downsides with that approach.

Which is exactly what we do with DROP FUNCTION... You'd run these things about as often as you run ALTER TABLE SET tablespace, so it doesn't seem a problem.

When I delete all rows WHERE some_date < 'cut-off date' on a segment boundary value, that would delete all segments that met the criteria. The following VACUUM will then return those segments to read-write, where they can then be refilled with new incoming data. The only command we would have to run is the DELETE; everything else is automatic.

If we have named chunks, then you'd have to specifically reset the boundary conditions on the named chunk after deletion before the chunk could be reused. That all becomes DDL, which means additional code to be written, bugs to be fixed, as well as the table locking required. Seems like a lot just for some occasional convenience.

So I'm not convinced of the need for named sections of tables yet. It all seems like detail, rather than actually what we want for managing large tables.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Some ideas about Vacuum
On Wed, 2008-01-09 at 15:10, Gregory Stark wrote: The goal should be to improve vacuum, then adjust the autovacuum_scale_factor as low as we can. As vacuum gets cheaper the scale factor can go lower and lower. We shouldn't allow the existing autovacuum behaviour to control the way vacuum works.

Very much agreed.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Named vs Unnamed Partitions
Simon Riggs [EMAIL PROTECTED] writes: On Wed, 2008-01-09 at 02:25, Gregory Stark wrote: Without naming the DBA would have to specify the same ranges every time he wants to change the properties. He might do a SET read_only WHERE created_on < '2000-01-01' one day, then another SET tablespace tsslow WHERE created_on < '2000-01-01', and then later again do SET offline WHERE created_on < '2000-01-01'. I have to admit I always found it kludgy to have objects named invoices_2000_JAN and invoices_2000_FEB and so on. It's kind of a meta denormalization. But so is specifying where clauses repeatedly.

The idea for using the WHERE clauses was to specifically avoid naming. In most cases the table is divided into old read-only and newer data, so there is one split point that makes it easy to use a simple WHERE clause. If you guys really want names, we can have names, but I think I want to see a case where the storage characteristics of the table are so complex we can only make sense of it by naming particular chunks.

Perhaps a good analogy is indexes. Index names are themselves kind of redundant and people usually use names which encode most of the information of the definition. But the reason you need names for indexes is so that you can refer to them later to drop them, rebuild them, change their properties such as tablespace, fill factor, etc. You could imagine imposing on users that they should restate the index definition every time they want to change the fill factor or tablespace, but I'm sure you could see the downsides with that approach.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
Re: [HACKERS] Some ideas about Vacuum
Hi,

Gregory Stark wrote: That's an interesting thought. I think your caveats are right, but with some more work it might be possible to work it out. For example, a background process could process the WAL and accumulate an array of possibly-dead tuples to process in batch. It would wait whenever it sees an xid which isn't yet past globalxmin, and keep accumulating until it has enough to make it worthwhile doing a pass.

I don't understand why one would want to go via the WAL; that only creates needless I/O. Better to accumulate the data right away, during the inserts, updates and deletes. Spilling the accumulated data to disk, if absolutely required, would presumably still result in less I/O.

I think a bigger issue with this approach is that it ties all your tables together. You can't process one table frequently while some other table has some long-lived deleted tuples.

Don't use the WAL as the source of that information and that issue's gone.

I'm also not sure it really buys us anything over having a second dead-space-map data structure. The WAL is much larger and serves other purposes which would limit what we can do with it.

Exactly.

I think this is exactly backwards. The goal should be to improve vacuum, then adjust the autovacuum_scale_factor as low as we can. As vacuum gets cheaper the scale factor can go lower and lower.

But you can't lower it endlessly; it's still a compromise, because it also means reducing the amount of tuples being cleaned per scan, which is against the goal of minimizing the overall I/O cost of vacuuming.

We shouldn't allow the existing autovacuum behaviour to control the way vacuum works.

That's a point.

As a side point, "disk space is cheap, I/O isn't" is a weird statement. The more disk you use, the more I/O you'll have to do to work with the data.

That's only true as long as you need *all* your data to work with it.

I still maintain the default autovacuum_scale_factor is *far* too liberal. If I had my druthers it would be 5%. But that's mostly informed by TPCC experience; in real life the actual value will vary depending on the width of your records and the relative length of your transactions versus transaction rate. The TPCC experience is with ~400-byte records and many short transactions.

Hm... 5% vs 20% would mean 4x as many vacuum scans, but only a 15% growth in size (105% vs 120%), right? Granted, those 15% are also taken from memory and caches, resulting in additional I/O... Still, these numbers surprise me. Or am I missing something?

Regards Markus
Re: [HACKERS] OUTER JOIN performance regression remains in 8.3beta4
Alvaro Herrera [EMAIL PROTECTED] writes: Tom Lane wrote: Comparing the behavior of this to my patch for HEAD, I am coming to the conclusion that this is actually a *better* planning method than removing the redundant join conditions, even when they're truly redundant! The reason emerges as soon as you look at cases involving more than a single join. If we strip the join condition from just one of the joins, then we find that the planner insists on doing that join last, whether it's a good idea or not, because clauseful joins are always preferred to clauseless joins in the join search logic.

Would it be a good idea to keep removing redundant clauses and rethink the preference for clauseful joins, going forward?

I don't understand what's going on here. The planner is choosing one join order over another because one join has more join clauses than the other? Even though some of those joins are entirely redundant and have no selectivity? That seems like a fortuitous choice made on entirely meaningless data. Is there some other source of data we could use to make this decision instead of the number of clauses? I would suggest the selectivity, but from the sound of it that's not going to help at all.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
Re: [HACKERS] Named vs Unnamed Partitions
Hi,

Simon Riggs wrote: With that in mind, can I clarify what you're thinking, please?

Sure, I can try to clarify:

2) the things you've been discussing are essential requirements of partitioning and we could never consider it complete until they are also included, and we must therefore talk about them now to check that it's all possible before we do anything on SE

I thought so, but am slowly dropping that point of view, in favor of something like: hey, if you manage to do it all automatically, cool, go for it!

3) doing SE first is right, I'm just thinking ahead

Yes, SE certainly has merit. Combine it with some sort of maintained CLUSTERing order and it's worth doing, IMO. I'm not convinced about dynamic partitioning being able to generally replace explicit partitioning anytime soon.

Sorry if that seems blunt, I'm just not clear where we're going.

Well, implicit or automatic partitioning is still a pretty new concept to me, but I'm slowly beginning to like it. Thank you for pointing me at it.

Regards Markus
Re: [HACKERS] OUTER JOIN performance regression remains in 8.3beta4
Gregory Stark [EMAIL PROTECTED] writes: Alvaro Herrera [EMAIL PROTECTED] writes: Would it be a good idea to keep removing redundant clauses and rethink the preference for clauseful joins, going forward? I don't understand what's going on here. The planner is choosing one join order over another because one join has more join clauses than the other? Not more join clauses, but any join clause at all. We will not explore join paths that don't have any join clause, unless forced to by lack of any other way to form the result. Even though some of those join clauses are entirely redundant and have no selectivity? You're confusing whether we explore a path (ie, cost it out) with whether we choose it. It's a necessary precondition, of course, but we won't pick the path unless it looks cheapest. Not exploring clauseless join paths is a heuristic that's needed to avoid exponential growth of the search space in large join problems. AFAIK every System-R-derived planner has done this. As an example, consider t1 join t2 on (...) join t3 on (...) ... join t8 on (...) and for simplicity suppose that each ON condition relates the new table to the immediately preceding table, and that we can't derive any additional join conditions through transitivity. In this situation there are going to be only seven ways to form a two-base-relation joinrel, as long as we allow only clauseful joins. But there are 8*7/2 = 28 distinct ways to form a join if we consider all possible join pairs whether they have a join clause or not. At the three-base-relation level there will be 12 joinrels if we only consider clauseful pairs, or 56 if we don't. It gets worse as you go up, and most if not all of those additional joinrels represent entirely useless variations on the theme of "let's stupidly compute a cartesian product and then winnow it sometime later". This is not to say that an early cartesian product can never be a useful part of a plan, but rejecting them is a darn good heuristic. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
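(To make the heuristic concrete at the SQL level, an illustrative sketch with invented table names; the actual plans will of course vary:

CREATE TABLE t1 (a int);
CREATE TABLE t2 (a int, b int);
CREATE TABLE t3 (b int);

-- Each adjacent pair is linked by a join clause, so the planner explores
-- clauseful orders such as (t1 t2) then t3, or (t2 t3) then t1:
EXPLAIN SELECT * FROM t1, t2, t3 WHERE t1.a = t2.a AND t2.b = t3.b;

-- t1 and t3 share no join clause; pairing them first would be a cartesian
-- product, which the join search declines to explore unless there is no
-- other way to form the result.

For the chain of eight tables above, the clauseful rule admits 7 two-relation joinrels instead of 8*7/2 = 28, which is the pruning being described.)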
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
[EMAIL PROTECTED] (Simon Riggs) writes: I think we have an opportunity to bypass the legacy-of-thought that Oracle has left us and implement something more usable. This seems like a *very* good thing to me, from a couple of perspectives. 1. I think you're right on in terms of the issue of the cost of running all that DDL in managing partitioning schemes. When I was working as a DBA, I was decidedly *NOT* interested in doing a lot of low-level partition management work, and those that are in that role now would, I'm quite sure, agree that they are not keen on spending a lot of their time trying to figure out what tablespace to shift a particular table into, or what tablespace filesystem to get sysadmins to set up. 2. Blindly following what Oracle does has always been a dangerous sort of thing to do. There are two typical risks: a) There's always the worry that they may have patented some part of how they implement things, and if you follow too closely, There Be Dragons... b) They have enough billion$ of development dollar$ and development re$ource$ that they can follow strategies that are too expensive for us to even try to follow. 3. If, rather than blindly following, we create something at least quasi-new, there is the chance of doing fundamentally better. This very thing happened when it was discovered that IBM had a patent on the ARC caching scheme; the clock system that emerged was a lot better than ARC ever was. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. I think it's worth observing both the advantages and demerits of this together. In effect, with the dynamic approach, Segment Exclusion provides its benefits as an emergent property of the patterns of how INSERTs get drawn into segments. The tendency will correspondingly be that Segment Exclusion will be able to provide useful constraints for those patterns that can naturally emerge from the INSERTs. We can therefore expect useful constraints for attributes that are assigned in some kind of more or less chronological order. Such attributes will include: - Object ID, if set by a sequence - Processing dates There may be a bit of sloppiness, but the constraints may still be useful enough to exclude enough segments to improve efficiency. _On The Other Hand_, there will be attributes that are *NOT* set in a more-or-less chronological order, and Segment Exclusion will be pretty useless for these attributes. In order to do any sort of Exclusion for non-chronological attributes, it will be necessary to use some mechanism other than the patterns that fall out of natural chronological insertions. If you want exclusion on such attributes, then there needs to be some sort of rule system to spread such items across additional partitions. Mind you, if you do such, that will weaken the usefulness of Segment Exclusion. For instance, suppose you have 4 regions, and scatter insertions by region.
In that case, there will be more segments that overlap any given chronological range. When we look at the data in the partition we can look at any number of columns. When we declaratively partition, you get only one connected set of columns, which is one of the reasons you want multi-dimensional partitioning in the first place. Upside: Yes, you get to exclude based on examining any number of columns. Downside: You only get the exclusions that are emergent properties of the data... The more I'm looking at the dynamic approach, the more I'm liking it... -- cbbrowne,@,cbbrowne.com http://linuxfinances.info/info/linuxxian.html Feel free to contribute build files. Or work on your motivational skills, and maybe someone somewhere will write them for you... -- Fredrik Lundh [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
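(A sketch of the multi-dimensional point above, with invented names — plain DDL, nothing here is proposed syntax:

CREATE TABLE orders (
    orderid          bigserial,   -- assigned in insertion order
    orderdate        date,        -- tracks insertion order closely
    requiredbydate   date,        -- clustered within roughly a week of insertion
    lastmodifieddate date         -- mostly insertion order, some later updates
);

-- Under the proposed Segment Exclusion scheme, each segment would carry
-- implicit min/max bounds for *all* of these columns, so a scan such as
SELECT count(*) FROM orders
WHERE requiredbydate BETWEEN '2008-01-01' AND '2008-01-07';
-- could skip every segment whose requiredbydate range doesn't overlap,
-- with no declared partition key at all.
)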
Re: [HACKERS] Named vs Unnamed Partitions
[EMAIL PROTECTED] (Markus Schiltknecht) writes: Simon Riggs wrote: With that in mind, can I clarify what you're thinking, please? Sure, I can try to clarify: 2) the things you've been discussing are essential requirements of partitioning and we could never consider it complete until they are also included and we must therefore talk about them now to check that its all possible before we do anything on SE I thought so, but am slowly dropping that point of view. In favor of something like: hey, if you manage to do it all automatically, cool, go for it! 3) doing SE first is right, I'm just thinking ahead Yes, SE certainly has merit. Combine it with some sort of maintained CLUSTERing order and it's worth doing, IMO. My suspicion is that if this gets added in with maintained CLUSTER order, we *lose* all of the exclusions aside from the ones directly established by the CLUSTER order. That is, the CLUSTER ordering winds up preventing other natural patterns from emerging, with the result that SE winds up being of pretty limited usefulness. I'm not convinced about dynamic partitioning being able to generally replace explicit partitioning anytime soon. It also seems to me that explicit partitioning would make this form of dynamic partitioning less useful. Suppose there are 4 more or less uniformly used partitions; if you're splitting the data evenly across 4x the partitions, then that means that each segment will tend to have ranges ~4x as wide, which makes SE rather less of a win. (Relax the assumption of uniform distributions, and that just changes the weights...) -- output = reverse(ofni.sesabatadxunil @ enworbbc) http://cbbrowne.com/info/advocacy.html It seems that perfection is attained not when nothing is left to add, but when nothing is left to be taken away. -- Antoine de Saint-Exupery. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Named vs Unnamed Partitions
On Wed, 2008-01-09 at 17:30 +0100, Markus Schiltknecht wrote: Simon Riggs wrote: With that in mind, can I clarify what you're thinking, please? Sure, I can try to clarify: 2) the things you've been discussing are essential requirements of partitioning and we could never consider it complete until they are also included and we must therefore talk about them now to check that its all possible before we do anything on SE I thought so, but am slowly dropping that point of view. In favor of something like: hey, if you manage to do it all automatically, cool, go for it! 3) doing SE first is right, I'm just thinking ahead Yes, SE certainly has merit. Combine it with some sort of maintained CLUSTERing order and it's worth doing, IMO. I'm not convinced about dynamic partitioning being able to generally replace explicit partitioning anytime soon. In all cases, no. But do you think it would work well for the specific databases you've used partitioning on? Would it be possible to check? Sorry if that seems blunt, I'm just not clear where we're going. Well, implicit or automatic partitioning is still a pretty new concept to me, but I'm slowly beginning to like it. Thank you for pointing me at it. OK, thanks. I'll write up what I've learned in the last few days into a new version of the proposal and put it on the Wiki. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Named vs Unnamed Partitions
Hi, Simon Riggs wrote: When I delete all rows WHERE some_date < 'cut-off date' on a segment boundary value, that would delete all segments that met the criteria. The following VACUUM will then return those segments to be read-write, where they can then be refilled with new incoming data. The only command we would have to run is the DELETE, everything else is automatic. Agreed, that would be very nice. So not convinced of the need for named sections of tables yet. It all seems like detail, rather than actually what we want for managing large tables. What do you think about letting the database system know the split point vs it having to find optimal split points automatically? Read-write vs. read-only is a good start, but can that concept be expanded to automatically choosing hash partitioning between storage systems, for example? Or more generally: can the database system gather enough information about the storage systems to take a decision as good as or better than the DBA? Regards Markus ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Named vs Unnamed Partitions
On Wed, 2008-01-09 at 18:04 +0100, Markus Schiltknecht wrote: So not convinced of the need for named sections of tables yet. It all seems like detail, rather than actually what we want for managing large tables. What do you think about letting the database system know the split point vs it having to find optimal split points automatically? For me, managing the table's files can be separate from the chunking that allows partition exclusion. Managing the table's files must be a manual operation. We can't infer the presence of a new tablespace etc. Those files would need less than 10 zones or chunks, usually just one. The chunking to allow partition exclusion can still be automatic, allowing a much finer grain of partition. If we restrict the actions allowed to be just
- mark read-only
then for read-only segments (only)
- migrate tablespaces
- compress read-only segments
- mark as off-line (not fully convinced we need this yet)
then it seems straightforward to allow this to occur by a WHERE clause only, since the constraints on a segment will be fixed in place when it is read-only. This also allows those operations to take place without holding locks for any length of time, since they are all just metadata ops or copying read only data to another place. The WHERE clause approach might easily allow more than 2 chunks and they need not be logically contiguous. So the phrase split point doesn't really fit because it implies a one-dimensional viewpoint, but I'm happy for you to give it a name. If we want to perform manipulations on non-read-only chunks then we need named or numbered partitions, locking, DDL etc.. That seems like too much functionality for what we really need. I really am still open on that point, but I would like to see a good description of a use case that really needs it, rather than just saying of course we do. Which is exactly where *I* started, even as recently as 3 weeks ago now. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
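(To make the restricted action set above concrete: a purely hypothetical command sketch — this syntax does not exist and is not part of any proposal, it merely illustrates the WHERE-clause-driven style being discussed; table, column, and tablespace names are invented:

-- mark every segment wholly matching the predicate read-only (hypothetical):
ALTER TABLE orders SET READ ONLY WHERE orderdate < '2007-01-01';

-- then migrate or compress only already-read-only segments (equally hypothetical):
ALTER TABLE orders SET TABLESPACE archive WHERE orderdate < '2006-01-01';
)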
Re: [HACKERS] OUTER JOIN performance regression remains in 8.3beta4
Tom Lane [EMAIL PROTECTED] writes: As an example, consider t1 join t2 on (...) join t3 on (...) ... join t8 on (...) and for simplicity suppose that each ON condition relates the new table to the immediately preceding table, and that we can't derive any additional join conditions through transitivity. So the problem is that if we happen to have some x=const clause for any variable listed in those join clauses then we drop that clause entirely and end up delaying that join until the very end? So is the fact that the user provided a useless clause the only information we have that these tables might be related? So if I write (along with some other joins): t1 join t2 on (t1.x=t2.x) where t1.x=3 I'll get a different result than if I write t1, t2 where t1.x=3 and t2.x=3 ? Perhaps we could be going the other direction and trying to add redundant selectivity 1.0 clauses when we have multiple variables which come out to the same value? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support! ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Problem with CVS HEAD's handling of mergejoins
I wrote: A perhaps less invasive idea is to discard any proposed mergeclauses that are redundant in this sense. This would still require some reshuffling of responsibility between select_mergejoin_clauses and the code in pathkeys.c, since right now select_mergejoin_clauses takes no account of that. However, I'm worried that that might result in planner failure on some FULL JOIN cases that work today, since we require all the join clauses to be mergejoinable for a FULL JOIN. People seem to complain when the planner fails, even for really stupid queries ;-). I think this would only be acceptable if we can prove that ignoring clauses that are redundant in this sense doesn't change the result --- which might be the case, but I'm not sure. On further study, this doesn't seem nearly as bad as it looked. I had hastily misread some of the existing code as assuming one-for-one correspondence of mergeclauses and pathkeys, but actually it just assumes that the pathkeys list contains at least one pathkey matching each mergeclause. Redundancy between two pathkeys in a list is eliminated by allowing adjacent mergeclauses to share the same pathkey. So that part actually all works, and the only problem is when an equivalence class is redundant-for-sorting in itself --- that is, when one of the eclass members is a constant. So that explains why we'd not seen it before; except in very odd corner cases, the old code would have eliminated the mergejoinable clause anyway. If we simply make select_mergejoin_clauses() reject a clause as not-mergejoinable if either side is equated to a constant, then the problems go away. We'd have a new problem if this meant that we fail to do a FULL JOIN, but AFAICS that can't happen: a variable coming from a full join can't be equivalenced to a constant, except perhaps in a below_outer_join eclass, which isn't considered redundant for sorting anyway. Bottom line is that it just takes another half dozen lines of code to fix this; attached is the core of the fix (there are some uninteresting prototype changes and macro-moving as well). I think I can fix this in a day or so, but I now definitely feel that we'll need an RC2 :-( I no longer think that about this problem, but we've still got some nasty-looking issues in xml.c, plus Hannes Dorbath's unsolved reports of GIST/GIN problems ... regards, tom lane

+		/*
+		 * Insist that each side have a non-redundant eclass. This
+		 * restriction is needed because various bits of the planner expect
+		 * that each clause in a merge be associatable with some pathkey in a
+		 * canonical pathkey list, but redundant eclasses can't appear in
+		 * canonical sort orderings. (XXX it might be worth relaxing this,
+		 * but not enough time to address it for 8.3.)
+		 *
+		 * Note: it would be bad if this condition failed for an otherwise
+		 * mergejoinable FULL JOIN clause, since that would result in
+		 * undesirable planner failure. I believe that is not possible
+		 * however; a variable involved in a full join could only appear
+		 * in below_outer_join eclasses, which aren't considered redundant.
+		 *
+		 * This *can* happen for left/right join clauses, however: the
+		 * outer-side variable could be equated to a constant. (Because we
+		 * will propagate that constant across the join clause, the loss of
+		 * ability to do a mergejoin is not really all that big a deal, and
+		 * so it's not clear that improving this is important.)
+		 */
+		cache_mergeclause_eclasses(root, restrictinfo);
+
+		if (EC_MUST_BE_REDUNDANT(restrictinfo->left_ec) ||
+			EC_MUST_BE_REDUNDANT(restrictinfo->right_ec))
+		{
+			have_nonmergeable_joinclause = true;
+			continue;			/* can't handle redundant eclasses */
+		}
+
		result_list = lappend(result_list, restrictinfo);
	}
---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[HACKERS] tzdata issue on cross-compiled postgresql
Hackers; I've noticed a strangeness on our cross-compiled uclibc linked postgresql package that I was hoping to elicit some help with. This is probably best described by showing some queries with commentary, so on with that.

postgres=# select count(*) from pg_timezone_names where utc_offset != '00:00';
 count
-------
     0

postgres=# select count(*) from pg_timezone_names where utc_offset != '00:00';
 count
-------
   504

postgres=# select count(*) from pg_timezone_names where utc_offset != '00:00';
 count
-------
     0

postgres=# select * from pg_timezone_names where utc_offset != '00:00' limit 5;
 name | abbrev | utc_offset | is_dst
------+--------+------------+--------
(0 rows)

postgres=# select * from pg_timezone_names where utc_offset != '00:00' limit 5;
       name        | abbrev | utc_offset | is_dst
-------------------+--------+------------+--------
 Africa/Algiers    | CET    | 01:00:00   | f
 Africa/Luanda     | WAT    | 01:00:00   | f
 Africa/Porto-Novo | WAT    | 01:00:00   | f
 Africa/Gaborone   | CAT    | 02:00:00   | f
 Africa/Bujumbura  | CAT    | 02:00:00   | f
(5 rows)

As seen, the output is erratic, but when it is giving results, they seem to be correct.

# pg_config
BINDIR = /usr/bin
DOCDIR = /usr/doc/postgresql
INCLUDEDIR = /usr/include
PKGINCLUDEDIR = /usr/include/postgresql
INCLUDEDIR-SERVER = /usr/include/postgresql/server
LIBDIR = /usr/lib
PKGLIBDIR = /usr/lib/postgresql
LOCALEDIR =
MANDIR = /usr/share/man
SHAREDIR = /usr/share/postgresql
SYSCONFDIR = /etc/postgresql
PGXS = /usr/lib/postgresql/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--target=i386-linux' '--host=i386-linux' '--build=i486-linux-gnu' '--program-prefix=' '--program-suffix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--datadir=/usr/share' '--includedir=/usr/include' '--infodir=/usr/share/info' '--libdir=/usr/lib' '--libexecdir=/usr/lib' '--localstatedir=/var' '--mandir=/usr/share/man' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--disable-nls' '--enable-shared' '--enable-static' '--disable-rpath' '--without-java' '--without-krb4' '--without-krb5' '--without-openssl' '--without-pam' '--without-perl' '--without-python' '--without-rendezvous' '--without-tcl' '--without-tk' '--with-zlib=yes' '--enable-depend' 'CC=i386-linux-uclibc-gcc' 'CFLAGS=-O2 -pipe -march=i486 -funit-at-a-time' 'CPPFLAGS= -I/mnt/build/kamikaze/staging_dir_i386/usr/include -I/mnt/build/kamikaze/staging_dir_i386/include' 'LDFLAGS=-L/mnt/build/kamikaze/staging_dir_i386/usr/lib -L/mnt/build/kamikaze/staging_dir_i386/lib' 'build_alias=i486-linux-gnu' 'host_alias=i386-linux' 'target_alias=i386-linux'
CC = i386-linux-uclibc-gcc
CPPFLAGS = -I/mnt/build/kamikaze/staging_dir_i386/usr/include -I/mnt/build/kamikaze/staging_dir_i386/include -D_GNU_SOURCE
CFLAGS = -O2 -pipe -march=i486 -funit-at-a-time -Wall -Wmissing-prototypes -Wpointer-arith -Winline -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing
CFLAGS_SL = -fpic
LDFLAGS = -L/mnt/build/kamikaze/staging_dir_i386/usr/lib -L/mnt/build/kamikaze/staging_dir_i386/lib
LDFLAGS_SL =
LIBS = -lpgport -lz -lreadline -lncurses -lcrypt -ldl -lm -lnotimpl
VERSION = PostgreSQL 8.2.5

uclibc version is 0.9.28-10 from our tree (-10 is our versioning). You can see the current public cross-compilation Makefile for postgresql at https://dev.openwrt.org/browser/packages/libs/postgresql/Makefile Note that in the current version of the Makefile that is uncommitted, I have the following changes:
* I've removed --disable-integer-datetimes (per the suggestion of someone in #postgresql on freenode)
* I've added a sed replacement (the posix zone from the tzdata file doesn't seem to exist anymore?)
* $(SED) '[EMAIL PROTECTED]/[EMAIL PROTECTED]/New_York@' $(PKG_BUILD_DIR)/src/timezone/Makefile

None of those have affected it at this point. I'd appreciate any help on this matter that you guys can provide. If you need any more information, please don't hesitate to ask. /tmy ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Chris Browne wrote: _On The Other Hand_, there will be attributes that are *NOT* set in a more-or-less chronological order, and Segment Exclusion will be pretty useless for these attributes. Really? I was hoping that it'd be useful for any data with long runs of the same value repeated - regardless of ordering. My biggest tables are clustered by zip/postal-code -- which means that while the City, State, Country attributes aren't monotonically increasing or decreasing, they are grouped tightly together. I'd expect all queries for San Francisco to be able to come from at most 2 segments; and all queries for Texas to be able to come from only a fraction of the whole. If the segment sizes are configurable - I imagine this would even be useful for other data - like a people table organized by last_name, first_name. Johns may be scattered throughout the table -- but at least the John Smiths would all be on one segment, while the Aaron-through-Jim Smith segments might get excluded. Or am I missing something? ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
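(For reference, the physical ordering Ron describes can be produced today by clustering on an index; names here are invented, and the pre-8.3 syntax is shown since 8.3 is not yet released at this point in the thread:

-- keeps same-zip (and hence same-city/state) rows physically adjacent,
-- so per-segment min/max bounds on those columns stay narrow:
CLUSTER addresses_zip_idx ON addresses;

Whether such runs survive later churn is exactly the question taken up downthread.)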
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 09, 2008 at 11:47:31AM -0500, Chris Browne wrote: [EMAIL PROTECTED] (Simon Riggs) writes: I think we have an opportunity to bypass the legacy-of-thought that Oracle has left us and implement something more usable. This seems like a *very* good thing to me, from a couple of perspectives. [snip] 2. Blindly following what Oracle does has always been a dangerous sort of thing to do. There are two typical risks: a) There's always the worry that they may have patented some part of how they implement things, and if you follow too closely, There Be Dragons... I think that could be equally said of the dynamic partitioning approach. In fact, it might be more likely since declarative partitioning has been around for eons. b) They have enough billion$ of development dollar$ and development re$ource$ that they can follow strategies that are too expensive for us to even try to follow. I don't see that as an argument against the declarative approach. Reading the details on this thread so far, I think Simon's approach is probably more complex from an implementation POV. 3. If, rather than blindly following, we create something at least quasi-new, there is the chance of doing fundamentally better. This very thing happened when it was discovered that IBM had a patent on the ARC caching scheme; the clock system that emerged was a lot better than ARC ever was. Well, I don't think I'm proposing we /blindly follow/ others. I propose we choose a grammar which takes the best of what others have tried to do. Oracle's grammar is hideous, IBM's is too restrictive, for example. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. I think it's worth observing both the advantages and demerits of this together. In effect, with the dynamic approach, Segment Exclusion provides its benefits as an emergent property of the patterns of how INSERTs get drawn into segments. The tendency will correspondingly be that Segment Exclusion will be able to provide useful constraints for those patterns that can naturally emerge from the INSERTs. Many people, in my experience, doing the kind of data processing which benefits from partitioning are regularly loading data, rather than collecting it in an OLTP fashion. Let's take the easily understandable concept of processing web site traffic. If the amount of data is large enough to benefit from partitioning, they probably have multiple web servers and therefore almost certainly multiple log files. If these files are not sorted into a single file, the records will not have a naturally progressing chronology: with every file we go back to the beginning of the period of time the load covers. If you add parallelism to your load, things look even more different. This means you could end up with a bunch of partitions, under the dynamic model, which all cover the same time range.
Then there's the way that really big databases are used (say, up around Simon's upper bound of 16 TB). It is expensive to keep data online so people aren't. They're loading and unloading data all the time, to perform different forms of analysis. A common scenario in the example above might be to unload all but the current month's data and then load the same month from the previous year. The unload step needs to be close to cost-free (i.e., TRUNCATE). Then, there's no guarantee that what they're interested in is the date range at all. They may want to compare user agent (look at bot activity). In this case, the partitioning is across a small list of strings (well, numbers most likely). Here, the partitions would all have the same range. Partitioning would be useless. We can therefore expect useful constraints for attributes that are assigned in some kind of more or less chronological order. Such attributes will include: - Object ID, if set by a sequence - Processing dates There may be a bit of sloppiness, but the constraints may still be useful enough to exclude enough segments to improve efficiency. I really like the idea of the system being able to identify trends and patterns in the data (actually useful at the user level) but it's the uncertainty of how the partitioning will work for different kinds of data that I don't like. _On The Other Hand_, there will be
Re: [HACKERS] OUTER JOIN performance regression remains in 8.3beta4
Gregory Stark [EMAIL PROTECTED] writes: So if I write (along with some other joins): t1 join t2 on (t1.x=t2.x) where t1.x=3 I'll get a different result than if I write t1, t2 where t1.x=3 and t2.x=3 In 8.3 you won't, because those are in fact exactly equivalent (and the new EquivalenceClass machinery can prove it). The cases that are interesting are more like t1 LEFT join t2 on (t1.x=t2.x) where t1.x=3 which is not equivalent to the other construction, because t2.x is only sort-of-equal to 3. Hmm ... now that I look at this, it might be a good idea if have_relevant_eclass_joinclause() didn't skip ec_has_const EquivalenceClasses. That would give us the same behavior for simple inner-join cases that I'm advocating for outer joins, namely that we can consider an early join between two rels that are related in the fashion you show. We don't actually need to invent dummy join clauses to make that happen, because the join search code believes have_relevant_eclass_joinclause() even if it doesn't see a joinclause for itself ... regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
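(A minimal pair of queries illustrating the distinction, with invented tables:

-- Inner join: t2.x = 3 is provably implied, so the join clause really is
-- redundant once the constant has been propagated.
SELECT * FROM t1 JOIN t2 ON t1.x = t2.x WHERE t1.x = 3;

-- Outer join: t2.x is 3 on matched rows but NULL on unmatched ones, so
-- t2.x is only "sort-of" equal to the constant, as described above.
SELECT * FROM t1 LEFT JOIN t2 ON t1.x = t2.x WHERE t1.x = 3;
)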
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, 2008-01-09 at 20:03 +0100, Gavin Sherry wrote: I think Simon's approach is probably more complex from an implementation POV. Much of the implementation is exactly the same, and I'm sure we agree on more than 50% of how this should work already. We just need to close in on the remainder. My current opinion on the SE approach is the opposite of the one above, though it gets us nowhere just to state it. I'm trying to avoid opinion and look at the details, which is the reason my viewpoint recently changed in favour of the dynamic approach as the main thrust for implementation. I've written a detailed email this morning to explain where and how the problems lie, which are nowhere near the syntax level. I haven't ruled out a declarative approach yet, but I need some detailed technical review of the issues. I hope you'll be replying to that? 3. If, rather than blindly following, we create something at least quasi-new, there is the chance of doing fundamentally better. This very thing happened when it was discovered that IBM had a patent on the ARC caching scheme; the clock system that emerged was a lot better than ARC ever was. Well, I don't think I'm proposing we /blindly follow/ others. I propose we choose a grammar which takes the best of what others have tried to do. Oracle's grammar is hideous, IBM's is too restrictive, for example. I assume the new grammar is good and if we do go that way, it sounds like the right starting place. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. I think it's worth observing both the advantages and demerits of this together. In effect, with the dynamic approach, Segment Exclusion provides its benefits as an emergent property of the patterns of how INSERTs get drawn into segments. The tendency will correspondingly be that Segment Exclusion will be able to provide useful constraints for those patterns that can naturally emerge from the INSERTs. Many people, in my experience, doing the kind of data processing which benefits from partitioning are regularly loading data, rather than collecting it in an OLTP fashion. Let's take the easily understandable concept of processing web site traffic. If the amount of data is large enough to benefit from partitioning, they probably have multiple web servers and therefore almost certainly multiple log files. If these files are not sorted into a single file, the records will not have a naturally progressing chronology: with every file we go back to the beginning of the period of time the load covers. If you add parallelism to your load, things look even more different. This means you could end up with a bunch of partitions, under the dynamic model, which all cover the same time range.
Then there's the way that really big databases are used (say, up around Simon's upper bound of 16 TB). It is expensive to keep data online so people aren't. They're loading and unloading data all the time, to perform different forms of analysis. That isn't my experience. That sounds very time consuming. The storage cost issue was the reason Andrew wanted offline segments, and why I have been talking about hierarchical storage. A common scenario in the example above might be to unload all but the current month's data and then load the same month from the previous year. The unload step needs to be costly (i.e., TRUNCATE). Then, there's no guarantee that what they're interested in is the date range at all. They may want to compare user agent (look at bot activity). In this case, the partitioning is across a small list of strings (well, numbers most likely). Here, the partitions would all have the same range. Partitioning would be useless. I take it you mean SE-based partitioning would be useless, but declarative partitioning would be useful? I would agree, assuming they run queries with a few of the small list of strings. Seems like a contrived case. Some of the biggest I've seen are GIS information or network information. Those are good examples of where a declarative approach would be the only way of getting
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Ron Mayer [EMAIL PROTECTED] writes: Chris Browne wrote: _On The Other Hand_, there will be attributes that are *NOT* set in a more-or-less chronological order, and Segment Exclusion will be pretty useless for these attributes. Really? I was hoping that it'd be useful for any data with long runs of the same value repeated - regardless of ordering. My biggest tables are clustered by zip/postal-code -- which means that while the City, State, Country attributes aren't monotonically increasing or decreasing, they are grouped tightly together. I'd expect all queries for San Francisco to be able to come from at most 2 segments; and all queries for Texas to be able to come from only a fraction of the whole. If the segment sizes are configurable - I imagine this would even be useful for other data - like a people table organized by last_name, first_name. Johns may be scattered throughout the table -- but at least the John Smiths would all be on one segment, while the Aaron-through-Jim Smith segments might get excluded. Or am I missing something? Well, this can head in two directions... 1. Suppose we're not using an organize in CLUSTER order approach. If the data is getting added in roughly by order of insertion order, then there's no reason to expect San Francisco data to be clustered together. Ditto for John Smiths; we can expect them to be as scattered as their dates of creation. 2. Suppose we *are* using an organize in CLUSTER order approach. In that case, yes, indeed, you can expect segments to contain specific ranges of the data. However, in that case, the ONLY dimension under which Segment Exclusion may be expected to be useful is that of the first column of the index being used. Smith should be useful to SE, but not John, because, as a secondary criterion, the first name values will be scattered across all segments. -- (reverse (concatenate 'string ofni.sesabatadxunil @ enworbbc)) http://www3.sympatico.ca/cbbrowne/linuxdistributions.html PALINDROME spelled backwards is EMORDNILAP. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Named vs Unnamed Partitions
Hi, On Wednesday 09 January 2008 19:27:41, Simon Riggs wrote: The WHERE clause approach might easily allow more than 2 chunks and they need not be logically contiguous. So the phrase split point doesn't really fit because it implies a one-dimensional viewpoint, but I'm happy for you to give it a name. Maybe that's only me but I'm not yet clear, after reading this thread and the previous one, whether or not Segment Exclusion would allow for multi-level partitioning. I have a use case at the moment, where we load log-like data from several servers to a central one (batch loading), the central table having an extra server column to identify each tuple's origin. Will the SE technique be able to see that this table would be better partitioned by server AND date? That's what I would have done if it was easier to do with constraint exclusion (I did only date partitioning), as the reporting queries will always have some server (stats by services, each service being installed on 1 or more servers) and date restrictions. Please note I'd be happy to have this use case handled by explicitly specifying the partitioning system I want PostgreSQL to use, and more than happy if you tell me that automatic, transparent code is able to optimize the data on disk for my needs without me bothering about partitions, their names and split points... If we want to perform manipulations on non-read-only chunks then we need named or numbered partitions, locking, DDL etc.. That seems like too much functionality for what we really need. I really am still open on that point, but I would like to see a good description of a use case that really needs it, rather than just saying of course we do. Which is exactly where *I* started, even as recently as 3 weeks ago now. I like Markus's idea of having SE at work inside partitions or tables, partitions being another kind of relation holding data. Then the DBA who needs to explicitly manage partitions to save faster tablespace for live data is able to tell that to the system and benefit fully from it. Regards, -- dim
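(For context, the date-only scheme Dimitri describes is what inheritance plus constraint exclusion provides today; an illustrative sketch with invented names:

CREATE TABLE logs (server int, ts timestamptz, msg text);

CREATE TABLE logs_2008_01 (
    CHECK (ts >= '2008-01-01' AND ts < '2008-02-01')
) INHERITS (logs);

-- With constraint_exclusion = on, a query restricted on ts skips every
-- child whose CHECK range cannot match. Partitioning by server AND date
-- this way would mean one child table per (server, month) combination,
-- which is the DDL burden under discussion.
)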
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Tue, 2008-01-08 at 02:12 +, Gregory Stark wrote: I also don't understand how this proposal deals with the more common use case of unloading and loading data. Normally in partitioned tables we build the data in a side table until the data is all correct then load it as a partition. If you treat it as a lower-level object then I don't see that working. The layout of the new table won't often match the layout of the target partitioned table. We optimised for that in 8.2, but I would say that not many people noticed and that it isn't normal. The problem with that approach, and the reason many people don't use it, is that it requires all data for a partition to be available at the time you add the partition. That necessarily implies a time delay in the process of loading data, which is no longer acceptable in the world of straight-through processing, or whatever you call the need for zero processing delay in your industry. So people choose to load data directly into the main table, allowing it to be immediately available, though at the cost of some performance. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
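(The 8.2 optimisation presumably refers to ALTER TABLE ... INHERIT, which lets a separately built and validated table be published as a partition; a sketch with invented names and a hypothetical load file:

CREATE TABLE orders_2008_01 (LIKE orders);
COPY orders_2008_01 FROM '/tmp/orders_2008_01.csv';   -- hypothetical path
ALTER TABLE orders_2008_01
    ADD CHECK (orderdate >= '2008-01-01' AND orderdate < '2008-02-01');
ALTER TABLE orders_2008_01 INHERIT orders;            -- publish as a partition

Until that final step the rows are invisible to queries on orders, which is the availability delay Simon is objecting to.)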
Re: [HACKERS] Named vs Unnamed Partitions
On Wed, 2008-01-09 at 21:29 +0100, Dimitri Fontaine wrote: Le Wednesday 09 January 2008 19:27:41, Simon Riggs wrote: The WHERE clause approach might easily allow more than 2 chunks and they need not be logically contiguous. So the phrase split point doesn't really fit because it implies a one-dimensional viewpoint, but I'm happy for you to give it a name. Maybe that's only me but I'm not yet clear, after reading this thread and the previous one, whether or not Segment Exclusion would allow for multi-level partitioning. I have a use case at the moment, where we load log-like data from several servers to a central one (batch loading), the central table having an extra server column to identify each tuple's origin. Will the SE technique be able to see that this table would be better partitioned by server AND date? No, but it will be able to handle partitioning on other columns that provide a degree of differentiation that you possibly hadn't considered at design time. That's what I would have done if it was easier to do with constraint exclusion (I did only date partitioning), as the reporting queries will always have some server (stats by services, each service being installed on 1 or more servers) and date restrictions. Hmm, well if you found declaring the partitions a problem with constraint exclusion it's not going to be any easier using other declarative approaches. So it will work with what you currently use. You can always use constraint exclusion to separate out the servers and then segment exclusion to handle the date range. Please note I'd be happy to have this use case handled by explicitly specifying the partitioning system I want PostgreSQL to use, and more than happy if you tell me that automatic, transparent code is able to optimize the data on disk for my needs without me bothering about partitions, their names and split points... If we want to perform manipulations on non-read-only chunks then we need named or numbered partitions, locking, DDL etc.. That seems like too much functionality for what we really need. I really am still open on that point, but I would like to see a good description of a use case that really needs it, rather than just saying of course we do. Which is exactly where *I* started, even as recently as 3 weeks ago now. I like Markus's idea of having SE at work inside partitions or tables, partitions being another kind of relation holding data. Then the DBA who needs to explicitly manage partitions to save faster tablespace for live data is able to tell that to the system and benefit fully from it. OK, thanks, -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] tzdata issue on cross-compiled postgresql
Tim Yardley [EMAIL PROTECTED] writes: postgres=# select count(*) from pg_timezone_names where utc_offset != '00:00'; count --- 0 postgres=# select count(*) from pg_timezone_names where utc_offset != '00:00'; count --- 504 postgres=# select count(*) from pg_timezone_names where utc_offset != '00:00'; count --- 0 That's just bizarre ... IIRC the output of that view is computed by scanning the timezone directories, and you wouldn't expect that to change. Can you strace the backend while it's doing this and see if there's a difference in the series of kernel calls issued? regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 09, 2008 at 02:38:21PM -0500, Chris Browne wrote: Ron Mayer [EMAIL PROTECTED] writes: Or am I missing something? Well, this can head in two directions... 1. Suppose we're not using an organize in CLUSTER order approach. If the data is getting added in roughly by order of insertion order, then there's no reason to expect San Francisco data to be clustered together. Ditto for John Smiths; we can expect them to be as scattered as their dates of creation. 2. Suppose we *are* using an organize in CLUSTER order approach. In that case, yes, indeed, you can expect segments to contain specific ranges of the data. However, in that case, the ONLY dimension under which Segment Exclusion may be expected to be useful is that of the first column of the index being used. Smith should be useful to SE, but not John, because, as a secondary criterion, the first name values will be scattered across all segments. One of the reasons for using partitions is that the usefulness of indexes breaks down because of the low cardinality of the indexed column. Also, the random IO can kill your performance. In my opinion, CLUSTER (and vacuum, mentioned elsewhere) are not data management tools. Thanks, Gavin ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 09, 2008 at 08:17:41PM +, Simon Riggs wrote: On Wed, 2008-01-09 at 20:03 +0100, Gavin Sherry wrote: I think Simon's approach is probably more complex from an implementation POV. Much of the implementation is exactly the same, and I'm sure we agree on more than 50% of how this should work already. We just need to close in on the remainder. My current opinion on the SE approach is the opposite of the one above, though it gets us nowhere just to state it. I'm trying to avoid opinion and look at the details, which is the reason my viewpoint recently changed in favour of the dynamic approach as the main thrust for implementation. I've written a detailed email this morning to explain where and how the problems lie, which are nowhere near the syntax level. I haven't ruled out a declarative approach yet, but I need some detailed technical review of the issues. I hope you'll be replying to that? I'm responding to that email separately. Just a matter of addressing all the points in the detail it deserves. 3. If, rather than blindly following, we create something at least quasi-new, there is the chance of doing fundamentally better. This very thing happened when it was discovered that IBM had a patent on the ARC caching scheme; the clock system that emerged was a lot better than ARC ever was. Well, I don't think I'm proposing we /blindly follow/ others. I propose we choose a grammar which takes the best of what others have tried to do. Oracle's grammar is hideous, IBM's is too restrictive, for example. I assume the new grammar is good and if we do go that way, it sounds like the right starting place. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. I think it's worth observing both the advantages and demerits of this together. In effect, with the dynamic approach, Segment Exclusion provides its benefits as an emergent property of the patterns of how INSERTs get drawn into segments. The tendency will correspondingly be that Segment Exclusion will be able to provide useful constraints for those patterns that can naturally emerge from the INSERTs. Many people, in my experience, doing the kind of data processing which benefits from partitioning are regularly loading data, rather than collecting it in an OLTP fashion. Let's take the easily understandable concept of processing web site traffic. If the amount of data is large enough to benefit from partitioning, they probably have multiple web servers and therefore almost certainly multiple log files. If these files are not sorted into a single file, the records will not have a naturally progressing chronology: with every file we go back to the beginning of the period of time the load covers. If you add parallelism to your load, things look even more different. This means you could end up with a bunch of partitions, under the dynamic model, which all cover the same time range.
Depends how big we make the partitions and how sloppy this is as to whether that is a problem or not. We might still expect a 100x gain from using the SE approach depending upon the data volume. This comes back to my problem with the dynamic approach: the user might get a certain experience but with the declarative approach the expectation matches the result. Then there's the way that really big databases are used (say, up around Simon's upper bound of 16 TB). It is expensive to keep data online so people aren't. They're loading and unloading data all the time, to perform different forms of analysis. That isn't my experience. That sounds very time consuming. I cannot offer any evidence but it's what we see companies doing. The storage cost issue was the reason Andrew wanted offline segments, and why I have been talking about hierarchical storage. The thing is, people with lots of data usually have good hierarchical storage systems. Can we really do better? A common scenario in the example above might be to unload all but the current month's data and then load the same month from the previous year. The unload step needs to be close to cost-free (i.e., TRUNCATE). Then, there's no guarantee that what they're interested in is the date range at all. They may want to
Re: [HACKERS] Named vs Unnamed Partitions
On Wed, Jan 09, 2008 at 08:51:30PM +, Simon Riggs wrote: That's what I would have done if it was easier to do with constraint exclusion (did only date partitioning), as the reporting queries will always have some server (stats by services, each service being installed on 1 or more servers) and date restrictions. Hmm, well if you found declaring the partitions a problem with constraint exclusion it's not going to be any easier using other declarative approaches. I disagree (although it is unreasonable for me to do so without posting syntax -- it's coming). Proper grammar for partition support means running a single DDL command. The user does not have to line up table generation with rules (or triggers) and check constraints. As such, I believe it to be much much easier. Thanks, Gavin ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] operator suggest interval / interval = numeric
The year to month and day to second intervals should not overlap. The standard doesn't actually allow it IIRC. wt On Jan 9, 2008 7:17 AM, Tom Lane [EMAIL PROTECTED] wrote: Ilya A. Kovalenko [EMAIL PROTECTED] writes: I suggest one more standard date/time operator, to divide one interval by another with numeric (or float, for example) result. You'd have to define exactly what that means, which seems a little tricky for incommensurate intervals. For instance what is the result of '1 month' / '1 day' ? regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
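(For commensurate units a workaround already exists: reduce both intervals to seconds first. Illustrative only:

SELECT EXTRACT(EPOCH FROM interval '1 day')
     / EXTRACT(EPOCH FROM interval '1 hour') AS ratio;  -- 86400 / 3600 = 24

The problem case remains '1 month' / '1 day': a month has no fixed length in days, so no single numeric answer is correct.)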
Re: [HACKERS] Named vs Unnamed Partitions
On Wed, 9 Jan 2008 23:52:09 +0100 Gavin Sherry [EMAIL PROTECTED] wrote: ...date restrictions. Hmm, well if you found declaring the partitions a problem with constraint exclusion it's not going to be any easier using other declarative approaches. I disagree (although it is unreasonable for me to do so without posting syntax -- it's coming). Proper grammar for partition support means running a single DDL command. The user does not have to line up table generation with rules (or triggers) and check constraints. As such, I believe it to be much, much easier. +1 http://www.databasedesign-resource.com/oracle-partitions.html I am not saying I like Oracle's syntax (I don't) but: http://dev.mysql.com/tech-resources/articles/mysql_5.1_partitioning_with_dates.html and: http://www.databasejournal.com/features/mssql/article.php/3456991 They are worlds above us in usability. Sincerely, Joshua D. Drake -- The PostgreSQL Company: Since 1997, http://www.commandprompt.com/ Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240 Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD' ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] odd convert_from bug
Andrew Dunstan [EMAIL PROTECTED] writes: The case below has just been reported to me. It sure looks odd. I'm looking into it but any ideas would be welcome. The problem only occurs if we are updating more than one row. Pfree'ing something you didn't palloc is bad news... regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] odd convert_from bug
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: The case below has just been reported to me. It sure looks odd. I'm looking into it but any ideas would be welcome. The problem only occurs if we are updating more than one row. Pfree'ing something you didn't palloc is bad news... Ah, yes, thanks, looks like I was a little too eager on the CP. I see you have fixed it. cheers andrew ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] odd convert_from bug
Andrew Dunstan wrote: Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: The case below has just been reported to me. It sure looks odd. I'm looking into it but any ideas would be welcome. The problem only occurs if we are updating more than one row. Pfree'ing something you didn't palloc is bad news... Ah, yes, thanks, looks like I was a little too eager on the CP. I see you have fixed it. BTW, if calling pfree() at all here is actually a bug, then we should probably fix it in the back branches. It looks more to me like the problem was that pg_convert_from was calling pfree() with the wrong argument - src_encoding_name instead of dest_encoding_name. But maybe the pfree in the back branches is unnecessary but harmless. cheers andrew ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi Simon, On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote: Segment Exclusion - After we note that a segment is read-only we can scan the segment and record min/max values for all columns. These are then implicit constraints, which can then be used for segment exclusion in a similar manner as we do with the current constraint exclusion mechanism. What about columns which are varlen? The maximum value could be very long -- 1 GB in fact. Say we decide to not support those. Well, what about numeric? What about user defined types? This would also allow a Merge Join to utilise exclusion for the inner plan at execution time. Instead of scanning the whole inner table plan from the start, we would be able to start the scan from the appropriate segment. This would require us to pass the current value of the outer plan down into the inner plan. The typical executor nodes on the inner plan would be a SortNode and below that a SeqScan. In that case the executor would need to pass the outer value from the MergeJoinNode down thru the SortNode to the SeqScan node. The SeqScan node could then perform partition exclusion, reducing the time for that scan and also reducing the time for the resulting sort. This sounds like it might be worth doing in normal cases also. It might turn out that the potentially applicable cases are already excluded during planning, I haven't thought about that aspect in enough detail yet. I don't think that would work at all. Say you have skew of data across the partitions. You may end up doing a seq scan for every outer tuple. It would look like a really expensive nested loop. If we collect data for all columns then many of our implicit constraints would be useless. e.g. if a column only has a few values and these are present in all segments. Matching our predicate against all constraints would be expensive, so we must weed out poor constraints. We would do this by removing any constraint that overlapped more than 10% of other segments. Various heuristics would likely need to be discovered during development to make this work without resorting to manual commands. Note that all of this exclusion is happening in the executor, not the planner. That allows this form of exclusion to work with stable functions and parameters without problem. So, in that case, you've totally undercut the planner. All the steps up the tree would be based on the stats for scanning the entire table but you've only returned part of it. To solve that, you could put the min/max values in a catalog somewhere. For a 16 TB table, that's 16000 min/max values. That's pretty expensive. And how do we handle making things non-read-only? You'd have to wait for all current queries to finish... that could take a long time. How do we handle crashes during setting of min/max? I presume it needs WAL support. What happens if the free space map gives us a free block in a read-only segment? Then we need to look at concurrency and transactionality of min/max values, don't we? If changing it makes it not read-only (which seems to be the case) it would be trivial to totally degrade your partitioning scheme to a full seq scan. On a 16 TB table. And getting the performance back might involve a day or more of VACUUMing. Impossible. Visibility Map -- [snip] No dynamic shared memory cache is required because any concurrent changes to the table would be ignored by a Scan anyway, so it doesn't matter if an INSERT, UPDATE or DELETE occurs while we are scanning.
Any new scans that start will attempt to lock the table and then perform a rel cache check before continuing. So the visibility will be set correctly for *that* scan at least. Is that true in the presence of READ COMMITTED? In most cases the visibility map can be summarised as a single boolean to show whether any 100% visible segments exist. That makes accessing the map very cheap in the common, unset case. What do we do in failure scenarios? Does this go into WAL? Is it replicated via log shipping techniques? You mention Slony support below. I don't know that Slony's target is Very Big Tables (up to 16 TB) but even if it was being used for a 'small' system of a few hundred GB, when you fail over the system has degraded if you aren't vacuuming it. Moreover, the layout of data may be different and therefore the performance of the segment exclusion. That's a bit of a surprise. Setting the Visibility Map -- VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to all current backends. Marking the map will cause a rel cache invalidation. Argh. Remember, we're talking about 16 TB tables here. With the best of tuning, that takes all day. So, day 1, load data; day 2, VACUUM it? With the existing partitioning approach, there's no such thing. We would never mark the highest numbered
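For readers following along, a rough sketch of the analogy to today's constraint exclusion (all object names invented): declarative partitioning needs an explicit CHECK constraint per child table for the planner to prune, whereas the proposal would derive implicit min/max bounds per 1 GB segment once it is marked read-only.

create table orders (orderid bigint, orderdate date);
create table orders_2008_01 (
    check (orderdate >= date '2008-01-01' and orderdate < date '2008-02-01')
) inherits (orders);
set constraint_exclusion = on;
-- today: pruned at plan time via the explicit CHECK constraint;
-- under the proposal: pruned per segment via recorded min/max values
select * from orders where orderdate = date '2008-01-15';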
Re: [HACKERS] tzdata issue on cross-compiled postgresql
Tim Yardley [EMAIL PROTECTED] writes: Can you strace the backend while it's doing this and see if there's a difference in the series of kernel calls issued? See attached strace. Let me know if you see anything enlightening. Nope :-(. The strace output is *exactly* the same across all four instances, except for the number of bytes sent over to the client, which reflects the difference between 504 and 0. It's possible that something is lurking in the data that strace didn't show, but I think that's unlikely. Somehow the misfeasance is happening within Postgres proper. I think you'll have to get out a debugger and trace through it to find out where the rows are getting discarded ... regards, tom lane
Re: [HACKERS] odd convert_from bug
Andrew Dunstan [EMAIL PROTECTED] writes: BTW, if calling pfree() at all here is actually a bug, then we should probably fix it in the back branches. It looks more to me like the problem was that pg_convert_from was calling pfree() with the wrong argument - src_encoding_name instead of dest_encoding_name. Right. I just took it out because it wasn't really that useful; in general SQL-callable functions can expect that they're called in fairly short-lived contexts, and so retail pfree's aren't very interesting unless you're talking about large chunks. BTW, just for the record, the (void *) casts were poor style too IMHO --- DatumGetPointer() would be better. regards, tom lane
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi Simon, On Wed, Jan 09, 2008 at 03:08:08PM +0000, Simon Riggs wrote: Do people really like running all that DDL? There is significant manpower cost in implementing and maintaining a partitioning scheme, plus significant costs in getting it wrong. Well... that's impossible for me to say. Most of the people who use lots of data like using MDX (a large and quite complex reporting and modelling language from MS, which is more complex than SQL IMHO) so... If people with large tables like partitioning, why is Oracle moving towards automated partitioning in 11g? Automated partitioning was one of Have you used Oracle's partitioning? I'm not surprised they're moving away ;-). More seriously, I didn't know that. Do you have a URL? the major goals for this next set of Postgres partitioning functionality also, whether or not we have declarative partitioning. My thinking is let's go for the best ideas and skip over the stuff that will be (is) obsolete before it's left the design stage. Well, you're assuming that because declarative partitioning has been around a while, it's obsolete. I don't think that's the case. Postgres has been around about as long... I see many more systems in the Terabyte range now than I did 10 years ago, but I see about the same number of DBAs. We'll always need experts, but I feel we should be using our expertise to simplify the standard cases, not just maintain the same level of difficulty in managing them. One of the big benefits of the dynamic partitioning approach is that it needs no DDL. So it will work out-of-the-box, for anybody. It's fine to say that, but people have already started talking about grammar extensions for dynamic partitions. People want the tools to manage it. There's difficulty in learning how and when to use the different VACUUM options that have been discussed. Deleting older data would be optimised under the proposed scheme, so that's not really a problem. Loading data is actually slightly harder and slower with declarative partitioning (see below). Merging and splitting partitions are tools for fine tuning a very complex partitioning scheme. They do also allow a non-linear segmentation scheme, which might aid performance in some cases. What about adding partitions into a set of partitions? Greg's post on that was dismissed and it might not be representative, but we have users doing that all the time. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. When we look at the data in the partition we can look at any number of columns. When we declaratively partition, you get only one connected set of columns, which is one of the reasons you want multi-dimensional partitioning in the first place.
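To make the multi-dimension point concrete, a hedged sketch, assuming an Orders table like the one described with a requiredbydate column: under a declarative scheme this query is only prunable if RequiredByDate happens to be the partition key, while segment exclusion could prune on it simply because its values correlate with load order.

-- with segment exclusion, only the ~7 segments whose recorded
-- min/max ranges overlap this week would need to be scanned
select * from orders
 where requiredbydate between date '2008-01-10' and date '2008-01-17';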
To this end, we (well, Jeff Cohen) looked at the syntax and semantics of partitioning in leading databases (Oracle, Informix, DB2) and came up with a highly expressive grammar which takes the best of each, I think (I'll post details on the grammar in a separate thread). The idea is that range (for example, a date range), list (a list of distinct values) and hash partitioning be supported on multiple columns. Partitions can be added, merged, truncated. Partitions can reside on different tablespaces. The existing issues with the rewriter, COPY support and the like go away by smartening up the backend. To explore the grammar and semantics, Jeff and I (to a small extent) have implemented the parsing of such a grammar in the backend. At the moment, this just results in the generation of a bunch of rules (i.e., it produces the current behaviour, as if someone had written rules themselves) and debugging output. The fully fledged approach will see partitioning rules stored in a new system catalog which can be interrogated by the planner or COPY to determine which partition a given tuple belongs in or which partition(s) a WHERE clause matches. When loading data we would need to examine each incoming tuple and distribute it to the correct place in the partitioned heap. That's a non-trivial overhead on data loading that I would like to avoid, since if we rely on the natural locality of loaded data then that overhead is zero. Yes, but it's a lot faster than what we have at the moment and the dynamic approach is not without its costs. I've pointed it out before, but I will
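Since the grammar itself hadn't been posted yet, the following is a purely hypothetical sketch of what such declarative syntax might look like, loosely modeled on the Oracle/DB2 style mentioned above; none of these keywords should be taken as the actual proposal.

-- hypothetical range partitioning on a date column
create table orders (orderid bigint, orderdate date, region text)
partition by range (orderdate)
( partition p200801 start (date '2008-01-01') end (date '2008-02-01'),
  partition p200802 start (date '2008-02-01') end (date '2008-03-01')
);

-- hypothetical list partitioning on a second table, same invented grammar
create table sales (saleid bigint, region text)
partition by list (region)
( partition apac values ('AU', 'JP'),
  partition emea values ('DE', 'UK')
);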
Re: [HACKERS] operator suggest interval / interval = numeric
Warren Turkal wrote: The year to month and day to second intervals should not overlap. The standard doesn't actually allow it IIRC. They do on Postgres anyway. Otherwise the type is not all that useful, is it? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
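A quick demonstration of the overlap in question; the standard's separate year-month and day-time interval classes would forbid this mixed literal, but Postgres accepts it:

select interval '1 year 1 day';  -- accepted: mixes year-month and day-time fields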
Re: [HACKERS] operator suggest interval / interval = numeric
On Jan 10, 2008 2:17 AM, Tom Lane [EMAIL PROTECTED] wrote: You'd have to define exactly what that means, which seems a little tricky for incommensurate intervals. For instance what is the result of '1 month' / '1 day' ? Postgres has already made such definitions, to allow direct interval-interval comparison. 1 month is deemed equal to 30 days, 1 day is deemed equal to 24 hours (although for some reason we ignore the issue of years vs. days). I argued in a long-dead thread that we should disallow these kinds of comparisons altogether, but I didn't manage to generate much enthusiasm. The overall sentiment seemed to be that the slightly bogus results were more useful than no result at all. That being the case, if we're comfortable making these kinds of arbitrary definitions for comparison purposes, it doesn't seem like much of a stretch to allow multiplication and division of intervals using the same rules. Regards, BJ
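Those definitions are easy to verify directly:

select interval '1 month' = interval '30 days';   -- t
select interval '1 day'   = interval '24 hours';  -- t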
Re: [HACKERS] operator suggest interval / interval = numeric
On Jan 9, 2008 8:33 PM, Brendan Jurd [EMAIL PROTECTED] wrote: I argued in a long-dead thread that we should disallow these kinds of comparisons altogether, but I didn't manage to generate much enthusiasm. The overall sentiment seemed to be that the slightly bogus results were more useful than no result at all. I was wondering why PostgreSQL allowed these types of comparisons. It really shouldn't allow them. wt
[HACKERS] flex/bison output wrongly created in the source directory
I was wondering if there is a reason that the flex and bison and other generated source files end up in the source directory when doing an out-of-tree build. Would a patch that puts those files in the build trees be accepted? wt
Re: [HACKERS] operator suggest interval / interval = numeric
On Jan 10, 2008 3:33 PM, Brendan Jurd [EMAIL PROTECTED] wrote: 1 month is deemed equal to 30 days, 1 day is deemed equal to 24 hours (although for some reason we ignore the issue of years vs. days). Sorry, a correction. The issue of years vs. days isn't ignored. A year is just 12 months, which yields 12 * 30 = 360 days, which is actually a pretty significant error (1.4% on average).

# select interval '1 year' = interval '360 days';
 ?column?
----------
 t
(1 row)
Re: [HACKERS] operator suggest interval / interval = numeric
On Jan 9, 2008 9:29 PM, Brendan Jurd [EMAIL PROTECTED] wrote: Sorry, a correction. The issue of years vs. days isn't ignored. A year is just 12 months, which yields 12 * 30 = 360 days, which is actually a pretty significant error (1.4% on average). YEAR TO MONTH and DAY TO {HOUR,MINUTE,SECOND} intervals should not combine. PostgreSQL correctly doesn't allow {YEAR,MONTH} TO {DAY,HOUR,MINUTE,SECOND} intervals, and it shouldn't allow operating on invalid interval combinations either. One mistake that PG does make is that it allows a FULL RANGE interval. This weirdness is essentially a YEAR TO SECOND interval that isn't allowed when explicitly requested.

wt-time= select INTERVAL '1 year 1 month 1 day 1:1:1';
          interval
-----------------------------
 1 year 1 mon 1 day 01:01:01
(1 row)

wt-time= select INTERVAL '1 year 1 month 1 day 1:1:1' YEAR TO SECOND;
ERROR:  syntax error at or near "SECOND"
LINE 1: select INTERVAL '1 year 1 month 1 day 1:1:1' YEAR TO SECOND;
                                                             ^

This is inconsistent. I would like to ultimately not allow operations on interval combinations that are not allowed by the SQL standard. wt
Re: [HACKERS] flex/bison output wrongly created in the source directory
Warren Turkal [EMAIL PROTECTED] writes: I was wondering if there is a reason that the flex and bison and other generated source files end up in the source directory when doing an out-of-tree build. Would a patch that puts those files in the build trees be accepted? Probably not, since our intention is that those files be distributed as part of source tarballs. Also, since they are (or should be) architecture-independent, what's the point? Out-of-tree builds are intended to support building for multiple architectures in parallel; but there's no reason to force independent reconstructions of these common derived files. regards, tom lane
Re: [HACKERS] operator suggest interval / interval = numeric
Warren Turkal [EMAIL PROTECTED] writes: YEAR TO MONTH and DAY TO {HOUR,MINUTE,SECOND} intervals should not combine. PostgreSQL correctly doesn't allow {YEAR,MONTH} TO {DAY,HOUR,MINUTE,SECOND} intervals, Really? I think you've confused some unimplemented decorative syntax with what the underlying datatype will or won't do. This is inconsistent. I would like to ultimately not allow operations on interval combinations that are not allowed by the SQL standard. The spec's approach to datetime operations in general is almost totally brain-dead, and so you won't find a lot of support around here for hewing to the straight-and-narrow spec-compliance approach. If they have not even heard of daylight-savings time, how can anyone credit them with any meaningful contact with the real world? We'll cite the spec where it suits us, but in this area "the spec says you can't do that" carries very little weight. Or were you planning to lobby for removal of our DST support, too? regards, tom lane
Re: [HACKERS] operator suggest interval / interval = numeric
-Original Message- From: [EMAIL PROTECTED] [mailto:pgsql-hackers- [EMAIL PROTECTED] On Behalf Of Tom Lane Sent: Wednesday, January 09, 2008 10:00 PM To: Warren Turkal Cc: Brendan Jurd; Ilya A. Kovalenko; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] operator suggest interval / interval = numeric [snip] We'll cite the spec where it suits us, but in this area "the spec says you can't do that" carries very little weight. Or were you planning to lobby for removal of our DST support, too? Don't forget indexes. The standard does not breathe a word about them.
Re: [HACKERS] operator suggest interval / interval = numeric
Brendan Jurd [EMAIL PROTECTED] writes: On Jan 10, 2008 2:17 AM, Tom Lane [EMAIL PROTECTED] wrote: You'd have to define exactly what that means, which seems a little tricky for incommensurate intervals. For instance what is the result of '1 month' / '1 day' ? Postgres has already made such definitions, to allow direct interval-interval comparison. Sure. I was just twitting the OP for having not considered these issues. Given that you can get at that behavior by dividing extract(epoch) results, I tend to think we should leave well enough alone. If someone did come up with a brilliant definition of what interval division should do, it would be pretty annoying to have already locked ourselves into a not-so-brilliant definition ... regards, tom lane
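The existing workaround Tom refers to, spelled out; the answer follows the same 1 month = 30 days convention discussed upthread:

select extract(epoch from interval '1 month')
     / extract(epoch from interval '1 day');  -- 30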
Re: [HACKERS] operator suggest interval / interval = numeric
On Jan 10, 2008 5:00 PM, Tom Lane [EMAIL PROTECTED] wrote: The spec's approach to datetime operations in general is almost totally brain-dead, and so you won't find a lot of support around here for hewing to the straight-and-narrow spec-compliance approach. If they have not even heard of daylight-savings time, how can anyone credit them with any meaningful contact with the real world? We'll cite the spec where it suits us, but in this area "the spec says you can't do that" carries very little weight. It's true that the spec fails to consider DST, in that it doesn't partition day and second intervals separately. But is that really a reason to reject the concept of interval partitioning altogether? It seems the spec has the right idea; it just doesn't take it far enough to cover all the bases. Whether the spec is braindead w.r.t. intervals or not, Postgres is clearly giving the wrong answer. A year interval is not 360 day intervals long. A month interval is not shorter than 31 day intervals. And, thanks to the geniuses who came up with DST, a day interval is not the same as 24 hour intervals anymore. None of these comparisons are sane. Regards, BJ
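Each of these can be checked directly; the fixed 30-day/24-hour conversion rules are what produce the counter-intuitive answers:

select interval '1 month' < interval '31 days';   -- t, yet January has 31 days
select interval '1 year'  < interval '365 days';  -- t, yet most years have 365 days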
Re: [HACKERS] operator suggest interval / interval = numeric
On Jan 9, 2008 10:44 PM, Brendan Jurd [EMAIL PROTECTED] wrote: On Jan 10, 2008 5:00 PM, Tom Lane [EMAIL PROTECTED] wrote: The spec's approach to datetime operations in general is almost totally brain-dead, and so you won't find a lot of support around here for hewing to the straight-and-narrow spec-compliance approach. If they have not even heard of daylight-savings time, how can anyone credit them with any meaningful contact with the real world? We'll cite the spec where it suits us, but in this area "the spec says you can't do that" carries very little weight. It's true that the spec fails to consider DST, in that it doesn't partition day and second intervals separately. Should the standard really do that? I mean, a day really is defined as some number of seconds. But is that really a reason to reject the concept of interval partitioning altogether? It seems the spec has the right idea, it just doesn't take it far enough to cover all the bases. I think the standard does a good job with the partitioning. Whether the spec is braindead w.r.t. intervals or not, Postgres is clearly giving the wrong answer. A year interval is not 360 day intervals long. A month interval is not shorter than 31 day intervals. And, thanks to the geniuses who came up with DST, a day interval is not the same as 24 hour intervals anymore. None of these comparisons are sane. DST has no bearing on the fact that a mean solar day is still 86400 seconds. There really is no partition between days and seconds. It just means that for timestamp operations the day where we spring forward is 23 hours long, and the day where we fall back is 1 day 1 hour. Having said all this, neither a month nor a year is a fixed number of days. The partitioning system used by the SQL standard seems to deal with this problem pretty well. wt
Re: [HACKERS] operator suggest interval / interval = numeric
Brendan Jurd [EMAIL PROTECTED] writes: On Jan 10, 2008 5:00 PM, Tom Lane [EMAIL PROTECTED] wrote: The spec's approach to datetime operations in general is almost totally brain-dead, ... It's true that the spec fails to consider DST, in that it doesn't partition day and second intervals separately. That's only one of the ways in which they ignore DST, and not even the most important one --- my vote for the spectacularly bad omission is that SET TIME ZONE only allows constant offsets from UTC. Whether the spec is braindead w.r.t. intervals or not, Postgres is clearly giving the wrong answer. Sure, but it's not clear that there *is* a right answer. As noted upthread, a useful approximate answer can be better than no answer at all. None of these comparisons are sane. You can always refrain from making such comparisons, if you think they are incapable of yielding useful answers. This whole area is pretty messy, and I don't think that there is or can be a simple uniform solution :-(. We need to tread carefully in introducing new behaviors that we might regret later. So I'm not in favor of inventing an interval division operator that just duplicates functionality that's already there in a more-cumbersome notation. We might want that operator back someday. Who even wants to argue that the result datatype should be numeric? Dividing a three-component quantity by another one doesn't sound to me like an operation that naturally yields a scalar result. regards, tom lane
Re: [HACKERS] operator suggest interval / interval = numeric
On Jan 9, 2008 11:06 PM, Tom Lane [EMAIL PROTECTED] wrote: Brendan Jurd [EMAIL PROTECTED] writes: On Jan 10, 2008 5:00 PM, Tom Lane [EMAIL PROTECTED] wrote: The spec's approach to datetime operations in general is almost totally brain-dead, ... It's true that the spec fails to consider DST, in that it doesn't partition day and second intervals separately. That's only one of the ways in which they ignore DST, and not even the most important one --- my vote for the spectacularly bad omission is that SET TIME ZONE only allows constant offsets from UTC. I assume you are advocating the use of timezone names that can indicate what happens over a DST change. I think it would be useful to be able to specify a timezone like PST8PDT. Whether the spec is braindead w.r.t. intervals or not, Postgres is clearly giving the wrong answer. Sure, but it's not clear that there *is* a right answer. As noted upthread, a useful approximate answer can be better than no answer at all. I am not sure that I agree with that. If you need to keep track of the days, you should probably be using intervals with day to second (or narrower) resolution. None of these comparisons are sane. You can always refrain from making such comparisons, if you think they are incapable of yielding useful answers. Maybe a way to enable strict compliance with the standard would be useful. This whole area is pretty messy, and I don't think that there is or can be a simple uniform solution :-(. We need to tread carefully in introducing new behaviors that we might regret later. So I'm not in favor of inventing an interval division operator that just duplicates functionality that's already there in a more-cumbersome notation. We might want that operator back someday. Who even wants to argue that the result datatype should be numeric? Dividing a three-component quantity by another one doesn't sound to me like an operation that naturally yields a scalar result. I think this is reasonable. wt
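Both forms already exist in Postgres, assuming I'm reading the SET documentation correctly; the second is the spec's constant-offset syntax that Tom is criticizing:

SET TIME ZONE 'PST8PDT';                         -- DST-aware zone name
SET TIME ZONE INTERVAL '-08:00' HOUR TO MINUTE;  -- fixed offset, no DST behaviour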
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote: If the exclusion is executor driven, the planner cannot help but create a seq scan plan. The planner will think you're returning 100X rows when really you end up returning X rows. After that, all decisions made by the planner are totally bogus. One of the most important queries on large tables is handling WHERE Eventdate = CURRENT_DATE; We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT_DATE function is STABLE. We can decide that some form of partition exclusion is possible, but the actual partition we access can *only* be decided within the executor. That necessarily affects the selectivity estimates. The planner can see we want some data, but it can't tell which data, so it doesn't know whether we will hit the day with the most data or the day with the least data. You've mentioned this a few times, as if it's a problem with dynamic partitioning, yet it's clearly an issue for any form of partitioning. So it seems clear that we need to make partition exclusion work at executor time, whatever else we do. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
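The query shape at issue, for concreteness (table name invented); CURRENT_DATE is STABLE rather than IMMUTABLE, so its value is unknown at plan time and any exclusion must happen in the executor:

select * from events where eventdate = current_date;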
Re: [HACKERS] operator suggest interval / interval = numeric
On Jan 9, 2008 10:00 PM, Tom Lane [EMAIL PROTECTED] wrote: Really? I think you've confused some unimplemented decorative syntax with what the underlying datatype will or won't do. Fair enough. The underlying type certainly will do it, since it works without the opt_interval. This is inconsistent. I would like to ultimately not allow operations on interval combinations that are not allowed by the SQL standard. The spec's approach to datetime operations in general is almost totally brain-dead, and so you won't find a lot of support around here for hewing to the straight-and-narrow spec-compliance approach. If they have not even heard of daylight-savings time, how can anyone credit them with any meaningful contact with the real world? We'll cite the spec where it suits us, but in this area "the spec says you can't do that" carries very little weight. DST in the sense of doing arithmetic on timestamps? I was not aware that the standard defined the result in such a way that precluded allowing for DST and leap seconds and whatever other time warps you wanted to allow in your database. In fact, looking over the draft of the 2003 standard, it looks like it takes DST into consideration just fine. It just doesn't allow the use of a non-constant timezone identifier, which admittedly would be useful. Or were you planning to lobby for removal of our DST support, too? No. The DST support makes sense. wt
Re: [HACKERS] flex/bison output wrongly created in the source directory
On Jan 9, 2008 9:51 PM, Tom Lane [EMAIL PROTECTED] wrote: Warren Turkal [EMAIL PROTECTED] writes: I was wondering if there is a reason that the flex and bison and other generated source files end up in the source directory when doing an out-of-tree build. Would a patch that puts those files in the build trees be accepted? Probably not, since our intention is that those files be distributed as part of source tarballs. That makes sense. Also, since they are (or should be) architecture-independent, what's the point? Out-of-tree builds are intended to support building for multiple architectures in parallel; but there's no reason to force independent reconstructions of these common derived files. I was not running multiple builds in parallel, as I thought the generated files might get fouled by the different builds. Since they do not appear to change, I think they are just fine. Thanks for the help! wt