Re: [HACKERS] Cost of XLogInsert CRC calculations
On Sun, 6 Mar 2005, Tom Lane wrote:
> I suppose that the bulk of the CPU cycles being attributed to XLogInsert
> are going into the inlined CRC calculations. Maybe we need to think twice
> about the cost/benefit ratio of using 64-bit CRCs to protect xlog records
> that are often only a few dozen bytes.

Isn't the CRC quite important on recovery, to recognize where the last valid log record is?

Are there better implementations of CRC-64? Would using a different polynomial help? Would it help to do the CRC calculation in a more wholesale fashion in XLogWrite? How about switching to CRC-32 or even CRC-16?

I searched the archives for the reason CRC-64 was chosen in the first place. It seems that the difference in computation time was not considered significant, and there were 8 bytes available in the record header anyway.

Just some thoughts...

- Heikki
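[Editor's note] Since the question of moving from CRC-64 to CRC-32 comes up here, a minimal sketch may help picture the per-byte work being paid on every xlog record. This is purely illustrative C, assuming a conventional table-driven CRC-32 (reflected polynomial 0xEDB88320); it is not PostgreSQL's CRC implementation, nor necessarily the variant that would be adopted. Each byte costs one table lookup, one shift and one XOR; a 64-bit CRC runs the same loop with 64-bit arithmetic, which is noticeably more expensive on 32-bit hardware.

    /*
     * Illustrative table-driven CRC-32 (not PostgreSQL's CRC code).
     * crc32_init() must be called once before crc32_calc().
     */
    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    static uint32_t crc32_table[256];

    static void
    crc32_init(void)
    {
        for (uint32_t i = 0; i < 256; i++)
        {
            uint32_t    c = i;

            for (int k = 0; k < 8; k++)
                c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : (c >> 1);
            crc32_table[i] = c;
        }
    }

    static uint32_t
    crc32_calc(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t    crc = 0xFFFFFFFFu;

        /* the per-byte loop that dominates for records of a few dozen bytes */
        while (len-- > 0)
            crc = crc32_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
        return crc ^ 0xFFFFFFFFu;
    }

    int
    main(void)
    {
        const char  rec[] = "example xlog record payload";

        crc32_init();
        printf("crc32 = %08x\n", (unsigned) crc32_calc(rec, sizeof(rec) - 1));
        return 0;
    }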
Re: [HACKERS] Cost of XLogInsert CRC calculations
On Sun, 2005-03-06 at 00:17 -0500, Tom Lane wrote:
> I was profiling a case involving UPDATEs into a table with too many
> indexes (brought to mind by mysql's sql-bench, about which more later)
> and got this rather surprising result for routines costing more than 1%
> of the total runtime:
>
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>  64.03     86.20    86.20   133608     0.00     0.00  XLogInsert
>   3.50     90.91     4.71  2484787     0.00     0.00  _bt_compare
>   2.92     94.84     3.93   839893     0.00     0.00  hash_search
>   2.77     98.57     3.73  1875815     0.00     0.00  LWLockAcquire
>   1.89    101.12     2.55  1887972     0.00     0.00  LWLockRelease
>   1.27    102.83     1.71   125234     0.00     0.00  _bt_getroot
>   1.01    104.19     1.36   403342     0.00     0.00  PinBuffer
>   1.00    105.54     1.35   840002     0.00     0.00  hash_any
>
> I suppose that the bulk of the CPU cycles being attributed to XLogInsert
> are going into the inlined CRC calculations. Maybe we need to think twice
> about the cost/benefit ratio of using 64-bit CRCs to protect xlog records
> that are often only a few dozen bytes.

Yes, this result was also very clear in recent performance tests sponsored by Unisys. In those tests we used Intel VTune to identify the precise lines of code soaking up the cycles... it was the CRC checks. More results should be available from the Unisys testing within a few days.

I had assumed that the majority of the cost of CRC checking was a result of the need to log complete blocks, rather than the rather small xlog records themselves?

Best Regards,

Simon Riggs
Re: [HACKERS] I am in Copenhagen
> I am in Copenhagen and am speaking tomorrow and will return on Sunday.

I would have loved to hear your three-hour speech about PostgreSQL administration, but unfortunately they put my own presentation at the exact same time, those LinuxForum 2005 bastards!

-- 
Kaare Rasmussen         --Linux, spil,--         Tlf:   3816 2582
Kaki Data               tshirts, merchandize     Fax:   3816 2501
Nordre Fasanvej 12      Åben 12.00-18.00         Email: [EMAIL PROTECTED]
2000 Frederiksberg      Lørdag 12.00-16.00       Web:   www.suse.dk
Re: [HACKERS] Missing coalesce
On Sunday 06 March 2005 02:09, Jim C. Nasby wrote:
> ERROR: function coalence(interval, interval) does not exist
>
> I'm guessing this is an oversight, right? Where would I go about fixing it?

If you've copy/pasted that error, then you need to s/coaleNce/coalesce/, so to speak.

-- 
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
Re: [HACKERS] refactoring fork() and EXEC_BACKEND
>> This is a lot like what I was planning to work towards with the
>> refactoring of the forkexec code I promised to do for 8.1.
>
> Cool. BTW, have we accepted that EXEC_BACKEND is the way we're going to
> work around the lack of fork() on Win32 for the foreseeable future? I
> mean, it _works_, but it's slow, ugly, and complicates the code. If it's
> the only workable option for Win32 support, then fair enough -- I just
> don't know enough of the Win32 API to know if there's a better
> alternative out there (short of using threads, which is of course not
> really plausible).

I don't believe there is any other way than using threads. The only objects you can create are processes and threads, and I don't know of any other way to create a process than CreateProcess().

>> I was actually thinking of not passing these on the commandline at all,
>> in order to avoid possible quoting issues etc (recall all the problems
>> with the stupid commandline processing on win32). Instead moving it into
>> a struct that is appended to the end of the backend variable file/shared
>> memory.
>
> Sounds good to me. Finding a cleaner way to pass data to the child
> process than writing it out to a file would also be nice, if possible.
> Again, I'm not sure what options there are on Win32...

Win32 already passes it using shared memory. It was when I asked to get that patch in during beta (or possibly even RC) that I promised to work on the cleanup stuff for 8.1.

For unix/exec_backend it still writes to a file, but since that is never expected to be used in production where performance is an issue, I think that is a fairly clean way of doing it. You could pass it through a pipe or something, but I don't see that it would be a cleaner approach. You're still going to have a single place collecting all the data, which is where most of the ugliness comes from.

>> That was also what I was thinking. Let me know if you want to split the
>> load somewhere :-)
>
> Given that you're planning to work on this, I've scaled back my
> ambitions. I'll send a patch to -patches that just cleans up fork() and
> doesn't change the EXEC_BACKEND case. So fork_process() will:
>
>  - flush stderr/stdout
>  - save and restore the profiling timer if LINUX_PROFILE is defined
>  - handle BeOS
>
> which means it should not be very invasive. Of course, there is plenty of
> room for improvement -- if you're interested in taking a look, please
> do...

Ok. I'll look at it once your stuff is done.

//Magnus
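[Editor's note] As a rough illustration of the fork() cleanup described in the quoted list above, here is a hedged sketch of what a fork_process() wrapper along those lines could look like. It is not the actual patch: it shows only the stdio-flush step, with the LINUX_PROFILE timer handling and the BeOS special case left as comments.

    /*
     * Sketch of a fork_process() wrapper: flush stdio before forking so the
     * child does not re-emit output the parent had buffered but not yet
     * written. Illustration only, not the real patch.
     */
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static pid_t
    fork_process(void)
    {
        pid_t       pid;

        /* flush stderr/stdout before the address space is duplicated */
        fflush(stdout);
        fflush(stderr);

        pid = fork();

        /* The real function would also save the ITIMER_PROF timer here when
         * LINUX_PROFILE is defined and restore it in the child, and would
         * perform its BeOS-specific setup. */
        return pid;
    }

    int
    main(void)
    {
        pid_t       child = fork_process();

        if (child == 0)
            _exit(0);           /* child has nothing to do in this sketch */
        return 0;
    }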
[HACKERS] Speeding up tupledesc copying
I was just doing some profiling of plpgsql record operations, and noticed that an unreasonable amount of runtime was being consumed by palloc/pfree calls coming from CreateTupleDescCopy. The reason for this, of course, is that each attribute's storage is separately palloc'd.

This seems a little silly given that the tupdesc management code doesn't have any provision for changing the number of attributes in an existing tupdesc. We could just as well allocate all the space needed in one palloc call, with a little bit of added address arithmetic to set up the pointers. This would require some localized changes in tupdesc.c and relcache.c, but most users of tupdescs would never know the difference.

Can anyone see a reason not to make this change?

regards, tom lane
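[Editor's note] A minimal sketch of the single-allocation idea, for readers who want the address arithmetic spelled out. The struct and function names (AttrInfo, Desc, DescCopy) are invented for illustration, and malloc stands in for palloc; the real change would operate on TupleDesc and its per-attribute entries in tupdesc.c.

    /*
     * Copy a descriptor and all of its attributes with one allocation:
     * the header, the pointer array and the attribute structs are carved
     * out of a single block, and the pointers are set up by arithmetic.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct AttrInfo
    {
        char        name[64];
        int         typid;
    } AttrInfo;

    typedef struct Desc
    {
        int         natts;
        AttrInfo  **attrs;          /* natts pointers, one per attribute */
    } Desc;

    static Desc *
    DescCopy(const Desc *src)
    {
        size_t      sz = sizeof(Desc)
                       + src->natts * sizeof(AttrInfo *)
                       + src->natts * sizeof(AttrInfo);
        char       *mem = malloc(sz);   /* the backend would use palloc() */
        Desc       *dst;
        AttrInfo  **ptrs;
        AttrInfo   *data;
        int         i;

        if (mem == NULL)
            return NULL;

        dst = (Desc *) mem;
        ptrs = (AttrInfo **) (mem + sizeof(Desc));
        data = (AttrInfo *) (mem + sizeof(Desc)
                                 + src->natts * sizeof(AttrInfo *));

        dst->natts = src->natts;
        dst->attrs = ptrs;
        for (i = 0; i < src->natts; i++)
        {
            ptrs[i] = &data[i];                             /* address arithmetic */
            memcpy(ptrs[i], src->attrs[i], sizeof(AttrInfo));
        }
        return dst;             /* freeing dst releases everything at once */
    }

    int
    main(void)
    {
        AttrInfo    a0 = {"id", 23}, a1 = {"name", 25};
        AttrInfo   *src_attrs[] = {&a0, &a1};
        Desc        src = {2, src_attrs};
        Desc       *copy = DescCopy(&src);

        if (copy != NULL)
        {
            printf("%s %s\n", copy->attrs[0]->name, copy->attrs[1]->name);
            free(copy);
        }
        return 0;
    }

The point of the pattern is simply that one allocate/free pair replaces 1 + natts of them, which is the kind of overhead the profiling was attributing to CreateTupleDescCopy.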
[HACKERS] Implementation of SQLCODE and SQLERRM variables for PL/pgSQL
Hello,

This is my second patch, so please be tolerant :-). For one of my projects I was missing information about the exception when I use EXCEPTION WHEN OTHERS THEN. I found two Oracle variables, SQLCODE and SQLERRM, which carry this information. With the patch you can do:

--
-- Test of built-in variables SQLERRM and SQLCODE
--
create or replace function trap_exceptions() returns void as $_$
begin
  begin
    raise exception 'first exception';
  exception when others then
    raise notice '% %', SQLCODE, SQLERRM;
  end;
  raise notice '% %', SQLCODE, SQLERRM;
  begin
    raise exception 'last exception';
  exception when others then
    raise notice '% %', SQLCODE, SQLERRM;
  end;
  return;
end; $_$ language plpgsql;

select trap_exceptions();
drop function trap_exceptions();

CREATE FUNCTION
NOTICE:  P0001 first exception
NOTICE:  00 Successful completion
NOTICE:  P0001 last exception
 trap_exceptions
-----------------

(1 row)

DROP FUNCTION

Regards,
Pavel Stehule

357a358,360
>       int             sqlcode_varno;
>       int             sqlerrm_varno;
826a827,842
>       /* INICIALIZACE fooi a foot */
>       PLpgSQL_var *var;
>
>       var = (PLpgSQL_var *) (estate->datums[block->sqlcode_varno]);
>       var->isnull = false;
>       var->freeval = false;
>       var->value = DirectFunctionCall1(textin, CStringGetDatum("00"));
>
>       var = (PLpgSQL_var *) (estate->datums[block->sqlerrm_varno]);
>       var->isnull = false;
>       var->freeval = false;
>       var->value = DirectFunctionCall1(textin, CStringGetDatum("Successful completion"));
931a948,966
>       /* unpack MAKE_SQLSTATE code */
>       char            tbuf[12];
>       int             ssval;
>       int             i;
>
>       ssval = edata->sqlerrcode;
>       for (i = 0; i < 5; i++)
>       {
>               tbuf[i] = PGUNSIXBIT(ssval);
>               ssval >>= 6;
>       }
>       tbuf[i] = '\0';
>
>       var = (PLpgSQL_var *) (estate->datums[block->sqlcode_varno]);
>       var->value = DirectFunctionCall1(textin, CStringGetDatum(tbuf));
>
>       var = (PLpgSQL_var *) (estate->datums[block->sqlerrm_varno]);
>       var->value = DirectFunctionCall1(textin, CStringGetDatum(edata->message));
88a89,93
>       struct
>       {
>               int sqlcode_varno;
>               int sqlerrm_varno;
>       } fict_vars;
104a110
> %type <fict_vars> fict_vars_sect
251c257
< pl_block : decl_sect K_BEGIN lno proc_sect exception_sect K_END
---
> pl_block : decl_sect fict_vars_sect K_BEGIN lno proc_sect exception_sect K_END
259c265
<               new->lineno = $3;
---
>               new->lineno = $4;
263,264c269,272
<               new->body = $4;
<               new->exceptions = $5;
---
>               new->body = $5;
>               new->exceptions = $6;
>               new->sqlcode_varno = $2.sqlcode_varno;
>               new->sqlerrm_varno = $2.sqlerrm_varno;
271a280,291
> fict_vars_sect:
>               {
>                       plpgsql_ns_setlocal(false);
>                       PLpgSQL_variable *var;
>
>                       var = plpgsql_build_variable(strdup("sqlcode"), 0,
>                                       plpgsql_build_datatype(TEXTOID, -1), true);
>                       $$.sqlcode_varno = var->dno;
>
>                       var = plpgsql_build_variable(strdup("sqlerrm"), 0,
>                                       plpgsql_build_datatype(TEXTOID, -1), true);
Re: [HACKERS] Implementation of SQLCODE and SQLERRM variables for PL/pgSQL
Pavel Stehule [EMAIL PROTECTED] writes:
> This is my second patch, so please be tolerant :-). For one of my
> projects I was missing information about the exception when I use
> EXCEPTION WHEN OTHERS THEN. I found two Oracle variables, SQLCODE and
> SQLERRM, which carry this information.

I think we discussed this last year and decided that it would be a bad idea to use those names, because Oracle's use of them is not exactly compatible with our error codes and messages. SQLCODE in particular is not compatible at all --- it's an integer in Oracle, isn't it?

IIRC we had put off solving this problem until we decided what to do with RAISE. There really need to be some changes in RAISE to allow it to raise a specific error code rather than always P0001, but exactly what is still undecided.

Some other problems with your patch: no documentation, and not in diff -c format. Plain diff patches are never acceptable because it's too risky to apply them against files that might have changed since you started working with them. Also, it's much easier to deal with one patch than with a separate diff for each file. (diff -c -r between an original and a modified directory is one good way to produce a useful patch.)

regards, tom lane
Re: [HACKERS] Implementation of SQLCODE and SQLERRM variables for
> I think we discussed this last year and decided that it would be a bad
> idea to use those names, because Oracle's use of them is not exactly
> compatible with our error codes and messages. SQLCODE in particular is
> not compatible at all --- it's an integer in Oracle, isn't it?

There are more incompatibilities with Oracle. SQLERRM is a function in Oracle; only when you use it without a parameter does it return the current error message. SQLCODE really is an integer. But these are only names; there is no problem changing them.

> IIRC we had put off solving this problem until we decided what to do with
> RAISE. There really need to be some changes in RAISE to allow it to raise
> a specific error code rather than always P0001, but exactly what is still
> undecided.

I didn't know that. But for my work SQLERRM is the more important one. I have many constraints on tables and I need to detect which constraint raised the exception. The EXCEPTION WHEN OTHERS possibility is nice, but not very useful, because I have no way to get any information about the exception.

> Some other problems with your patch: no documentation, and not in diff -c
> format. Plain diff patches are never acceptable because it's too risky to
> apply them against files that might have changed since you started working
> with them. Also, it's much easier to deal with one patch than with a
> separate diff for each file. (diff -c -r between an original and a
> modified directory is one good way to produce a useful patch.)

I am not sure I am able to create the documentation --- my English is poor. I will change the diff format and send the patch again.

Thank you,
Pavel
[HACKERS] About b-tree usage
Please let me know if there is any option in PostgreSQL to achieve the following usage of a b-tree index.

For a relation R(att0, att1) and a btree index on attribute att0, on each insertion of a tuple into the table:
- look in the index to see whether the value of att0 of the new entry already exists in the index, and
- if no, make the appropriate entry in the b-tree
- if yes, do not make an entry.

In my application my relation is always clustered according to att0. The only information needed for a query with a range condition over att0 in the WHERE clause is the place on disc where the first tuple with a given value of att0 is stored.

The hint is that because the table has so many rows, the index size is too big. But the number of distinct values of att0 is orders of magnitude smaller than the number of tuples. I am trying to investigate whether there is an alternative to the b-tree index that would decrease the number of index blocks fetched into memory.

Thanks.
Re: [HACKERS] About b-tree usage
If I understand your question, you want to reduce the index size by only pointing to the first tuple in the table with a given key in att0, since the rest of the tuples will be right afterward (because you keep the table clustered on that key).

However, from the docs:
http://www.postgresql.org/docs/8.0/static/sql-cluster.html

"When a table is clustered, it is physically reordered based on the index information. Clustering is a one-time operation: when the table is subsequently updated, the changes are not clustered. That is, no attempt is made to store new or updated rows according to their index order. If one wishes, one can periodically recluster by issuing the command again."

So, there is no guarantee that there won't be tuples with the same key in a different physical location between your CLUSTER commands. That means your index idea won't really work.

Perhaps you can alter your table layout/design to better suit your needs. For example:

* If you have a very small number of values in att0, you could make a different table for each one and make a view that's the union of those tables.

* You could make att1 an array. Yes, it is a horrible design from a relational standpoint, but if you need performance, there may not be many other options. You can then use set-returning functions and other functions to make it behave more like a relation. Ideally, PostgreSQL would have relation-valued attributes, but currently arrays seem like the best option available for simulating a relation-valued attribute.

If there are many identical values in att0, are you sure a sequential scan isn't more efficient? Also, are you sure the index isn't working well? It seems to me that since you have the table clustered, it might be fairly efficient as-is (it would get a huge benefit from the spatial locality of the tuples in the table). Index size alone shouldn't destroy your performance, since the idea of an index lookup is that it only has to read O(log n) pages from the disk per lookup.

Regards,
Jeff Davis
[HACKERS] ERROR: unrecognized node type in PostgresMain( )
Hi,

I'm implementing a new join that needs 2 hash tables, one for each relation. I declared a new type of node, DPHashJoin (with id T_DPHashJoin = 121 in nodes.h), which is for now the same as the HashJoin plan node. I created a new plan with two Hash nodes, something that looks like a symmetric hash join plan tree (I haven't created the corresponding plan state tree in the executor yet).

The problem is that while I'm running postgres with gdb and I set a breakpoint at the function create_plan() of createplan.c, after some step commands gdb says that it can't recognize node 121 in the PostgresMain() function. I looked into that function but I can't see any message like "unrecognized node type:", so I can't see where the problem could be. (I also saw that the error didn't come from execProcnode.c in the executor.) Does anyone have any idea what the problem could be? (That node is recognizable in the file where I added code (createplan.c) in the optimizer.)

Please help, I don't know what to do. I'm sending the gdb output; please take a look.

[EMAIL PROTECTED] martha]$ gdb /usr/local/pgsql/bin/postgres
GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...
(gdb) break create_plan
Breakpoint 1 at 0x811c7f1
(gdb) run -D /usr/local/pgsql/test test
Starting program: /usr/local/pgsql/bin/postgres -D /usr/local/pgsql/test test

POSTGRES backend interactive interface
$Revision: 1.375.2.1 $ $Date: 2003/11/24 14:50:02 $

backend> select * from work,people where work.peoplesurname=people.surname
CHEAPEST PATH: #506# #500# #121# #0.30..0.50..#

Breakpoint 1, 0x0811c7f1 in create_plan ()
(gdb) step
Single stepping until exit from function create_plan,
which has no line number information.
martha: it is T_HashJoin or T_MergeJoin or T_NestLoop. Invoked function create_join_plan( )

Breakpoint 1, 0x0811c7f1 in create_plan ()
(gdb) step
Single stepping until exit from function create_plan,
which has no line number information.
martha: it is T_SeqScan. Invoked function create_seqscan_plan( )   // this is a printf that I added to the source code
0x0811caa7 in create_join_plan ()
(gdb) step
Single stepping until exit from function create_join_plan,
which has no line number information.

Breakpoint 1, 0x0811c7f1 in create_plan ()
(gdb) step
Single stepping until exit from function create_plan,
which has no line number information.
martha: it is T_SeqScan. Invoked function create_seqscan_plan( )   // this is a printf that I added to the source code
0x0811cab6 in create_join_plan ()
(gdb) step
Single stepping until exit from function create_join_plan,
which has no line number information.
0x0811c892 in create_plan ()
(gdb) step
Single stepping until exit from function create_plan,
which has no line number information.
0x081209cb in grouping_planner ()
(gdb) step
Single stepping until exit from function grouping_planner,
which has no line number information.
0x08120253 in subquery_planner ()
(gdb) step
Single stepping until exit from function subquery_planner,
which has no line number information.
0x0811ff7c in planner ()
(gdb) step
Single stepping until exit from function planner,
which has no line number information.

ERROR: unrecognized node type: 121
0x0814e09c in PostgresMain ()
(gdb)

Thanks in advance!
-Martha mc
Re: [HACKERS] Solving hash table overrun problems
>> This also brings up a line of thought I had a while ago on a related
>> topic. Something like a HashDistinct might be useful, if it had no
>> startup cost.
>
> We already have that: the planner will use a HashAgg node in this fashion
> in some contexts (I think just as one of the ways to do IN, at the
> moment).

Hmm... I see HashAggregate being used that way in the IN queries, but I have not observed it used in a way that incurs no startup cost. It looked to me that in doing hash aggregation in ExecAgg (nodeAgg.c), agg_fill_hash_table() would have to be called, which iterates through every output of the child plan building the hash table before it returns, thus incurring at least the startup cost of executing the entire subplan of the child node at the aggregation stage. I'm not too familiar with the code, so there is probably something I'm missing somewhere :(

> It's not yet bright enough to consider doing it for SELECT DISTINCT. The
> DISTINCT planning code is old and crufty and pretty tightly intertwined
> with ORDER BY ... it needs work.

Yes, SELECT DISTINCT was my motivating example, though in my specific application latency (i.e. not having to wait for the entire query below the DISTINCT operation to finish) was also an important factor, hence my thoughts on a zero-startup-cost hash aggregation and wondering if it would really be any kind of win in the end.

-Aaron
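[Editor's note] To make the startup-cost distinction concrete, here is a toy C sketch of a streaming hash-distinct. It is not PostgreSQL executor code, and the fixed-size open-addressing table is an assumption made purely for brevity; in particular it ignores the table-overflow problem this thread is actually about. The point is only that each new value can be returned the moment it is first seen, instead of draining the whole child plan into a hash table first, as agg_fill_hash_table() does.

    /*
     * Toy streaming "hash distinct": emit each value the first time it is
     * seen. Fixed-size table, no overflow handling -- illustration only.
     */
    #include <stdio.h>
    #include <stddef.h>

    #define NBUCKETS 1024               /* power of two */

    typedef struct
    {
        int         used;
        int         key;
    } Slot;

    static Slot table[NBUCKETS];

    /* returns 1 the first time a key is seen, 0 for later duplicates */
    static int
    seen_first_time(int key)
    {
        unsigned    h = (unsigned) key * 2654435761u;   /* multiplicative hash */
        unsigned    i = h & (NBUCKETS - 1);

        while (table[i].used)
        {
            if (table[i].key == key)
                return 0;               /* duplicate */
            i = (i + 1) & (NBUCKETS - 1);       /* linear probing */
        }
        table[i].used = 1;
        table[i].key = key;
        return 1;
    }

    int
    main(void)
    {
        int         input[] = {3, 1, 3, 2, 1, 5};
        size_t      n;

        /* rows stream through: distinct values are emitted immediately,
         * so the first result is available before the input is exhausted */
        for (n = 0; n < sizeof(input) / sizeof(input[0]); n++)
            if (seen_first_time(input[n]))
                printf("%d\n", input[n]);
        return 0;
    }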
Re: [HACKERS] ERROR: unrecognized node type in PostgresMain( )
[EMAIL PROTECTED] writes:
> The problem is that while I'm running postgres with gdb and I set a
> breakpoint at the function create_plan() of createplan.c, after some step
> commands gdb says that it can't recognize node 121 in the PostgresMain()
> function.

That's not what gdb said at all; all you know is that the error happened someplace where you didn't have control. (My bet is that control got as far as executor startup, which you already said you hadn't fixed to understand this new node type, so the error is exactly what I'd expect.)

Perhaps you should rebuild the backend with -g (see --enable-debug) so that gdb can actually be somewhat helpful. It's usually a good idea to have --enable-cassert turned on when hacking C code, too.

FWIW, I don't personally ever do debugging in a standalone backend, unless the problem is one that keeps a normal backend from starting. It's much more pleasant to use a regular psql session to issue SQL commands, and attach to the connected backend with gdb in another terminal window.

regards, tom lane
Re: [HACKERS] buildfarm issues
Andrew,

> or 2.6.x ... in fact it's almost impossible to tell what might be
> installed on a Gentoo system, or how it was compiled. So I'm really not
> sure how we should treat such systems.

Distribution, General OS, Kernel Version, Other Info
e.g.  SuSE Linux     2.6.8-7    64-Bit
      MS Windows     XP         SP2

Also, I think you should be recording the compile-time switches used on each machine and indexing them individually. I'd hate to find out that, for example, we'd broken --with-odbc and didn't know it because nobody in the buildfarm was using it.

> Second is the fact that systems change over time. People upgrade their
> machines. I'm considering a facility to allow people to change the
> os-version, compiler-version aspects of their registered personality -
> these will become essentially timestamped pieces of information, so we'll
> still be able to tie a set of values to a history item.

Yes ... actually, a quick (and automated) reminder e-mail once a month wouldn't hurt. And this machine registration info could be kept as a config file in the Buildfarm directory, for ease of update, and sent in with the buildfarm results.

> Fourth, there are some significant gaps in the buildfarm coverage. I am

I've had a few responses from the PWN, such as a YDL machine on PPC4. I'll be passing these people along this week.

> Finally, a volunteer or two to help me with both running and
> administering this would be great. Knowledge of perl and Template Toolkit
> a decided advantage.

I can help with the people end of things, but my web skills suck.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco