[HACKERS] ExecuteTruncate quirk: expects a unique list of relations
Hi, Consider this simple case: postgres=# TRUNCATE foo, foo; ERROR: cannot TRUNCATE "foo" because it is being used by active queries in this session The above occurs because the ExecuteTruncate() function invokes truncate_check_rel() in a loop. Since the same table name appears twice, the rd_refcnt for table foo is bumped up to 2, causing the above failure. We might want to add a step to ExecuteTruncate(), or whatever calls it, to make the list unique. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] plpgsql: penalty due to double evaluation of parameters
Hi, I don't buy the performance argument unless I see some test results demonstrating it; exec_prepare_plan is only called on the first invocation of a statement. What kind of side-effects could exec_eval_datum call have? Note that I have avoided using the performance word for this very reason. But consider, for example, when the datum type is PLPGSQL_DTYPE_REC. I don't think it's justified to have the overhead of heap_copytuple_with_tuple when all we need is just the typeid! Similar arguments apply for other datums like PLPGSQL_DTYPE_ROW and PLPGSQL_DTYPE_TRIGARG, for example. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
[HACKERS] plpgsql: penalty due to double evaluation of parameters
Hi, Within exec_prepare_plan there are calls to exec_eval_datum to identify the argtypes of the involved parameters. However, exec_eval_datum actually fills up the value and isnull entries in these cases, causing unnecessary additional work when all we need is the datum type. Such unnecessary evaluation of values might prove to be very costly later, since this quirk of exec_eval_datum usage is not so visible. Worse still, it could cause bugs if some evaluations have side-effects across multiple evals. It might make sense to introduce a new function exec_eval_datum_type to address this, or exec_eval_datum itself could be modified for cases where we just need the datum type. Should I cook up a patch for this? I am inclined towards introducing a new function (though that means any new datum-related changes would need to be carried out in two functions instead of one). Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [PATCHES] [EMAIL PROTECTED]: Re: [BUGS] Problem identifying constraints which should not be inherited]
Hi, On Sat, May 10, 2008 at 6:11 AM, Alex Hunsaker [EMAIL PROTECTED] wrote: On Fri, May 9, 2008 at 5:37 PM, Tom Lane [EMAIL PROTECTED] wrote: Alex Hunsaker [EMAIL PROTECTED] writes: [ patch to change inherited-check-constraint behavior ] Applied after rather heavy editorializations. You didn't do very well on getting it to work in multiple-inheritance scenarios, such as create table p (f1 int check (f1 > 0)); create table c1 (f2 int) inherits (p); create table c2 (f3 int) inherits (p); create table cc () inherits (c1,c2); Here the same constraint is multiply inherited. The base case as above worked okay, but adding the constraint to an existing inheritance tree via ALTER TABLE, not so much. Ouch. OK, I'll (obviously) review what you committed so I can do a lot better next time. Thanks for muddling through it! Ouchie indeed! I'm not sure if we ought to try to back-patch that --- it'd be a behavioral change with non-obvious implications. In the back branches, ADD CHECK followed by DROP CONSTRAINT will end up not deleting the child-table constraints, which is probably a bug, but I wouldn't be surprised if applications were depending on the behavior. Given the lack of complaints it does not seem worth a back-patch IMHO. Yeah, same IMHO. I do hope we have covered things properly for inherited check constraints by now. One minor thing that Alex and I discussed was the usage of "child tables" in tablecmds.c, especially in error messages. Again, English is not my native language, but shouldn't that be worded as "children tables"? Admittedly even this does not sound any better than "child tables" though :). It is nit-picking really, but I can submit a cleanup patch to reword this if the list thinks so. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Lessons from commit fest
Hi, The idea that we fix stylistic issues on the fly is not sustainable. We should offer help and mentorship to new patch submitters in all areas (including stylistic) but they should do the work. It is the only way we will mold them to submit patches in the proper way. I agree. As a submitter I would much rather get an email saying e.g. "Hey, your patch is nice but the code style sticks out like a sore thumb. Please adopt the surrounding naming convention and fix your indentation per the rules at [link]." than have it fixed silently on its way to being committed. With the former I learn something and get to improve my own work. With the latter, my next patch is probably going to have the exact same problem, which in the long term just makes extra work for the reviewers. I think we patch submitters should be asked to do a run of pgindent on the files that we have modified. That should take care of at least the indentation-related issues. I looked at the README of src/tools/pgindent, and it should be easy enough to run (or is it not?). The only thing that caught my eye was: 1) Build the source tree with _debug_ symbols and all possible configure options Can the above point be elaborated further? Which typical and possible configure options should be used to get a clean and complete pgindent run? And I think adopting the surrounding naming, commenting, and coding conventions should come naturally, as it can aid in copy-pasting too :) Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] COPY Transform support
Hi, On Thu, Apr 3, 2008 at 6:47 PM, Dimitri Fontaine [EMAIL PROTECTED] wrote: Here's a proposal for COPY to support the T part of an ETL, that is, adding the capability for COPY FROM to Transform the data it gets. The idea is quite simple: adding to COPY FROM the option to run a function on the data before calling the datatype_in functions. This needs some syntax addition to be worked out at the COPY side; then the COPY code will have to run the given function on the read data and consider giving its output to the current COPY code (datatype input function). The function could either get the data as text or bytea, and would have to return either text or bytea. bytea seems the more sensible choice, as long as we don't lose encoding information there, which I'm not sure about. The syntax could be something like: COPY mytable FROM '/my/file.txt' WITH COLUMN x CONVERT USING myfunc; I tried to only add keywords already present in [1], while getting something meaningful... and x is intended to be the column number, counting from 1. [1] http://www.postgresql.org/docs/8.3/static/sql-keywords-appendix.html Comments? -- dim +1 Data transformation while doing a data load is a requirement now and then. That users will otherwise have to do mass updates *after* the load completes to mend the data to their liking should be reason enough to do this while the loading is happening. I think to go about it the right way we should support the following: * The ability to provide per-column transformation expressions: COPY mytable (col1 transform to col1 + 10, col2 transform to 'Post' || 'greSQL', col3...) FROM .. * The ability to use any kind of expression while doing the transformation: The transformation expression should be any expression (basically ExecEvalExpr) that can be evaluated to give a resulting value, and obviously a corresponding is_null value too. It could use system built-in functions (e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER, etc.) or user-defined functions too. * The ability for the transformation expression to refer to other columns involved in the load, so that when the current row is extracted from the input file, the current values are used to generate the new resultant values before doing a heap_form_tuple. E.g. (col1 transform col1 + 10, col2 transform col1 * col2, col3 transform UPPER(col1 || col3), ...) I have spent some thought on how to do this and will be happy to share it if the list is interested. Personally, I think data transformation using such expressions is a pretty powerful and important activity while doing the data load itself. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [BUGS] Problem identifying constraints which should not be inherited
Hi Alvaro On Fri, Mar 28, 2008 at 6:05 PM, Alvaro Herrera [EMAIL PROTECTED] wrote: NikhilS escribió: I will take a look at the pg_dump related changes if you want. We will need changes in flagInhAttrs() and in getTableAttrs() to query the backend for these 2 attributes for post 80300 versions. Oh, BTW, I have not added this patch to any Commitfest page on the wiki, since it has obvious things that need more work. If you do work on them, please post the improved patch later and add it to the corresponding Commitfest, as appropriate. I submitted the combined latest patch to the patches list yesterday. How can I add it to the commitfest (presumably to the May one?)? Please let me know. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [PATCHES] [EMAIL PROTECTED]: Re: [BUGS] Problem identifying constraints which should not be inherited]
Hi Alex, On Sun, Mar 30, 2008 at 7:10 AM, Alex Hunsaker [EMAIL PROTECTED] wrote: (trimmed cc's) Find attached inherited_constraint_v2.patch Changes since v1: -rebased against latest HEAD -changed enum { Anum_pg_constraint_... } back into #define Anum_pg_constraint_... -remove whitespace damage I added -fixed regression tests I added to be more robust -fixed create table ac (a int constraint check_a check (a > 0)); create table bc (a int constraint check_a check (a > 0)) inherits (ac); so it properly works (removed crud I put into AddRelationRawConstraints and created a proper fix in DefineRelation) I was taking a look at this patch to add the pg_dump related changes. Just wanted to give you a heads up as this patch crashes if we run make installcheck. Seems there is an issue introduced in the CREATE TABLE REFERENCES code path due to your patch (this is without my pg_dump changes just to be sure). Looks like some memory overwrite issue. The trace is as follows: Core was generated by `postgres: nikhils regression [local] CREATE TABLE '. Program terminated with signal 11, Segmentation fault.
#0 0x08378024 in AllocSetCheck (context=0xa060368) at aset.c:1112 1112 if (dsize > 0 && dsize < chsize && *chdata_end != 0x7E) (gdb) bt #0 0x08378024 in AllocSetCheck (context=0xa060368) at aset.c:1112 #1 0x0837704f in AllocSetDelete (context=0xa060368) at aset.c:487 #2 0x083783c2 in MemoryContextDelete (context=0xa060368) at mcxt.c:196 #3 0x083797fb in PortalDrop (portal=0xa0845bc, isTopCommit=0 '\0') at portalmem.c:448 #4 0x08281939 in exec_simple_query ( query_string=0xa07e564 "CREATE TABLE enumtest_child (parent rainbow REFERENCES enumtest_parent);") at postgres.c:992 #5 0x082857d4 in PostgresMain (argc=4, argv=0x9ffbe28, username=0x9ffbcc4 "nikhils") at postgres.c:3550 #6 0x0824917b in BackendRun (port=0xa003180) at postmaster.c:3204 #7 0x082486a2 in BackendStartup (port=0xa003180) at postmaster.c:2827 #8 0x08245e9c in ServerLoop () at postmaster.c:1271 #9 0x082457fd in PostmasterMain (argc=3, argv=0x9ff9c60) at postmaster.c:1019 #10 0x081e1c03 in main (argc=3, argv=0x9ff9c60) at main.c:188 Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [BUGS] Problem identifying constraints which should not be inherited
Hi Alex, On Fri, Mar 28, 2008 at 4:58 AM, Alex Hunsaker [EMAIL PROTECTED] wrote: On Thu, Mar 27, 2008 at 5:14 AM, NikhilS [EMAIL PROTECTED] wrote: * Add logic to disallow ADD CONSTRAINT ONLY to the parent of an inheritance hierarchy * Add logic to mark inherited constraints in the children: This can be achieved by introducing a new bool coninherited attribute in pg_constraint. This will be set to true on only those check constraints that are added to children via the inheritance mechanism * Add logic to disallow dropping inherited constraints directly on children. Obviously they will get dropped if a DROP CONSTRAINT is fired on the parent with recurse set to true (this is the default behaviour) * Modify the pg_dump logic to use the new pg_constraint based attribute logic for version 80300 (or should it be PG_VERSION_NUM 80400?) onwards. In fact, the current logic to determine if a check constraint is inherited is to compare its name with the parent constraint name, and the comment already mentions that we should make changes in pg_constraint to avoid this rudimentary way of determining the inheritance :). Attached is a WIP patch I have been playing with in my spare time. It should take care of the first 2. It does nothing for pg_dump or set (not) null/set default. Note it has some gross points (see comment in src/backend/catalog/heap.c AddRelationRawConstraints) and the regression tests I added are not quite up to par (and probably a bit redundant). But in the interest of saving work I thought I would post it. I took a quick look and it seems to be on the lines of attislocal and attinhcount, which is a good thing. I am not sure about your syscache related changes though, and also about using enums for pg_constraint attributes, which deviates from other catalog specifications. It's good that you posted your WIP here immediately, or it would have caused duplication of efforts from my side :) I will take a look at the pg_dump related changes if you want.
We will need changes in flagInhAttrs() and in getTableAttrs() to query the backend for these 2 attributes for post 80300 versions. P.S Alvaro, I think this patch did not reach the mailing list and was stalled due to size restrictions or something. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [BUGS] Problem identifying constraints which should not be inherited
Hi, On Thu, Mar 20, 2008 at 7:36 PM, Tom Lane [EMAIL PROTECTED] wrote: More to the point, it takes a capability away from the user without actually solving the problem we need to solve, namely to guarantee consistency between parent and child constraints. You can be sure that there is someone out there who will complain that we've broken his application when we disallow this, and we need to be able to point to some positive benefit we got from it. Agreed. So I think we need to implement the whole enchilada or nothing at all. We need to do the following for this: * Add logic to disallow ADD CONSTRAINT ONLY to the parent of an inheritance hierarchy * Add logic to mark inherited constraints in the children: This can be achieved by introducing a new bool coninherited attribute in pg_constraint. This will be set to true on only those check constraints that are added to children via the inheritance mechanism * Add logic to disallow dropping inherited constraints directly on children. Obviously they will get dropped if a DROP CONSTRAINT is fired on the parent with recurse set to true (this is the default behaviour) * Modify the pg_dump logic to use the new pg_constraint based attribute logic for version 80300 (or should it be PG_VERSION_NUM 80400?) onwards. In fact, the current logic to determine if a check constraint is inherited is to compare its name with the parent constraint name, and the comment already mentions that we should make changes in pg_constraint to avoid this rudimentary way of determining the inheritance :). An important decision here is about adding a new attribute to pg_constraint, as it is the only sane way of determining inherited constraints, but that will require an initdb. Comments? Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] MemoryContextSwitchTo() confusion
src/backend/utils/mmgr/README contains more information about the same too. Regards, Nikhils On Thu, Mar 20, 2008 at 2:41 PM, Pavan Deolasee [EMAIL PROTECTED] wrote: On Thu, Mar 20, 2008 at 12:27 AM, Dan Searle [EMAIL PROTECTED] wrote: I had to fiddle about with switching memory contexts rather a lot to make it work this far, but I'm only guessing as to when it's appropriate to call MemoryContextSwitchTo(), and which context to switch to Here is what I know. Not sure whether it would answer your question though. A memory context is a memory enclosure with a lifetime associated with it. Whenever the context is freed, all objects allocated in that context are also automatically freed. If there are any references to these objects, they will turn into dangling pointers, causing segfaults and/or memory corruption. So you need to be careful while choosing a memory context to allocate memory from. You don't want to allocate something in a long-lived context if you don't need it for that much time, because failure to explicitly free the allocation will result in memory consumption when it's not required, or even a memory leak. OTOH you don't want to allocate something in a very short-lived context if you may require that object outside the scope of that context. Certain memory contexts are well known. For example, TopMemoryContext has the life of the session. Any object allocated in this context would remain valid for the entire session. Of course, failure to free them would result in memory leaks. TopTransactionContext, as the name suggests, is valid in the current top transaction. As soon as the transaction commits/aborts, the context is freed. CurrentTransactionContext, which may be the same as TopTransactionContext, remains valid for the current transaction or subtransaction. Apart from that, there are contexts attached to different objects during execution and their lifespan is usually attached to the lifespan of the object itself.
You may need to choose one of them if you know that what you are allocating cannot or should not outlive that object. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS]COPY issue(gsoc project)
Hi Longlong, I think this is a better idea, from *NikhilS*: http://archives.postgresql.org/pgsql-hackers/2007-12/msg00584.php But instead of using a per-insert or a batch-insert subtransaction, I am thinking that we can start off a subtransaction and continue it till we encounter a failure. The moment an error is encountered, since we have the offending (already in heap) tuple around, we can call a simple_heap_delete on the same and commit (instead of aborting) this subtransaction after doing some minor cleanup. This current input data row can also be logged into a bad file. Recall that we need to only handle those errors in which the simple_heap_insert is successful, but the index insertion or the after row insert trigger causes an error. The rest of the load then can go ahead with the start of a new subtransaction. The simplest things are often the best. I think it may be hard to implement, or have some other deficiency, since you want a subtransaction for every n rows. Yeah, simpler things are often the best, but as folks are mentioning, we need a carefully thought out approach here. The reply from Tom to my posting there raises issues which need to be taken care of. Although I still think that if we carry out *sanity* checks before starting the load about the presence of triggers, constraints, fkey constraints etc., and if others do not have any issues with the approach, the simple_heap_delete idea should work in some cases. Although the term I used, "after doing some minor cleanup", might need some thought too, now that I think more of it.. Also if fkey checks or complex triggers are around, maybe we can fall back to a subtransaction-per-row insert too as a worst-case measure.. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Declarative partitioning grammar
Hi, I've proposed an alternative approach, which we've called declarative partitioning, which is grammar based. This grammar was developed by Jeff Cohen at Greenplum with some assistance from myself. It is to be completely open source. .. FWIW, I had done some very initial work on declarative partitioning (nowhere near as exhaustive as this proposal) and submitted a wip patch here: http://momjian.us/mhonarc/patches_hold/msg6.html Kindly take a look at the patch, to see if it would be useful to you folks in any way. .. Range - Range has the most expressive grammar. I'll introduce it in steps: ...PARTITION BY RANGE (b) ( PARTITION aa start (date '2007-01-01') end (date '2008-01-01'), PARTITION bb start (date '2008-01-01') end (date '2009-01-01') ); It is common that these partitions follow a pattern, such as following every week, month or year. So, we support the following specification: ... PARTITION BY RANGE(order_date) ( START (date '2005-12-01') end (date '2007-12-01') EVERY(interval '2 months') ); .. It will be interesting to see how this start, end, interval usage accommodates data types other than dates. I hope this specification is not influenced overly just by date-like partitions. .. ADD --- For range and list partitioning, it's important to be able to add partitions for data not covered by the existing specification. So, we propose: ... ADD PARTITION q1_2008 end (date '2008-04-01') .. What about data that does not match any existing partition specification? It might make sense to have a dummy partition which handles all these cases. .. DROP For list and range partitions, drop a specified partition from the set of partitions. ... DROP PARTITION minny; This drops a named partition. Often, it will be difficult for users to know partition names, and they might be unnamed. So, we allow this syntax: ... DROP PARTITION FOR(date '2007-01-01'); for range partitions; and: ... DROP PARTITION FOR(VALUES('CA')); for list partitions. We've also discussed something like: ...
DROP PARTITION FOR(POSITION(1)); so that users can easily drop a specific partition in an array of range partitions. It seems to me, though, that the use case is generally to drop the oldest partition, so perhaps we should have a more explicit syntax. Thoughts? .. Surely the partitions will get (default, parent-inferred) names when they get created? Do we expect the users to remember FOR() specifications like the ones mentioned above? It might make sense to have a \d in psql, e.g., to present a parent with all its named partitions along with the partition clauses, to facilitate dropping partitions using partition names. .. EXCHANGE This sub-clause allows us to make a table a partition in a set of partitions, or take a partition out of a set but keep it as a table. IBM uses ATTACH and DETACH, which is explicit, but Oracle uses EXCHANGE. I'll explain the latter: ... EXCHANGE partition identifier WITH TABLE table name partition identifier is one of PARTITION name or PARTITION FOR(...). The partition in the partition set 'becomes' the table table name and vice-versa. Essentially, we'd swap the relfilenodes. This means that we have to first ADD PARTITION then swap the table and the partition. Thoughts? .. Surely this won't be instantaneous? .. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Declarative partitioning grammar
Hi, We did look at allowing general functions for partitioning and this was one concern. The other is that we want to enforce that a row only gets inserted into a single partition, so we wanted a declarative syntax where it was relatively easy to check that range and list specifications don't overlap. Detection of mutually exclusive ranges might not turn out to be so easy after all. I think there is some code in the constraint_exclusion area which might help out with this. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Declarative partitioning grammar
Hi, On Jan 12, 2008 6:29 AM, Gavin Sherry [EMAIL PROTECTED] wrote: The syntax is half the problem, performance is the other. I will bring the performance issues up in another thread. Yes, we are confident that we can address the performance issues that rule out the existing partitioning for many applications. We need it for our own stuff! :P Agreed, syntax is just the sugar. Also, other than performance, how are updates of the partition key that cause the resulting tuple to end up in a new partition handled here? Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] duplicate columns with COPY
Hi, pei=# copy test1 (a, a) to stdout; ERROR: 42701: column "a" specified more than once Or is this just an overly extensive check that is actually intended for COPY FROM STDIN? This seems to be a common check in both the COPY TO and COPY FROM cases, with the source/destination being STDIN or otherwise. While it definitely makes sense for the FROM case, maybe we could relax this for the COPY TO case. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] VLDB Features
Hi, On Dec 15, 2007 1:14 PM, Tom Lane [EMAIL PROTECTED] wrote: NikhilS [EMAIL PROTECTED] writes: Any errors which occur before doing the heap_insert should not require any recovery according to me. A sufficient (though far from all-encompassing) rejoinder to that is triggers and CHECK constraints can do anything. The overhead of having a subtransaction per row is a very valid concern. But instead of using a per-insert or a batch-insert subtransaction, I am thinking that we can start off a subtransaction and continue it till we encounter a failure. The moment an error is encountered, since we have the offending (already in heap) tuple around, we can call a simple_heap_delete on the same and commit (instead of aborting) this subtransaction What of failures that occur only at (sub)transaction commit, such as foreign key checks? What if we identify and define a subset where we could do subtransaction-based COPY? The following could be supported: * A subset of triggers and CHECK constraints which do not move the tuple around. (Identifying this subset might be an issue though?) * Primary/unique key indexes As Hannu mentioned elsewhere in this thread, there should not be very many instances of complex triggers/CHECKs around? And maybe in those instances (and also the foreign key checks case), the behaviour could default to a subtransaction-per-row model or even the existing single-transaction model? Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] VLDB Features
Hi, Another approach would be to distinguish between errors that require a subtransaction to recover to a consistent state, and less serious errors that don't have this requirement (e.g. invalid input to a data type input function). If all the errors that we want to tolerate during a bulk load fall into the latter category, we can do without subtransactions. I think errors which occur after we have done a simple_heap_insert of the tuple generated from the current input row are the ones which would require the subtransaction to recover. Examples could be unique/primary key violation errors or FKey/trigger-related errors. Any errors which occur before doing the heap_insert should not require any recovery according to me. The overhead of having a subtransaction per row is a very valid concern. But instead of using a per-insert or a batch-insert subtransaction, I am thinking that we can start off a subtransaction and continue it till we encounter a failure. The moment an error is encountered, since we have the offending (already in heap) tuple around, we can call a simple_heap_delete on the same and commit (instead of aborting) this subtransaction after doing some minor cleanup. This current input data row can also be logged into a bad file. Recall that we need to only handle those errors in which the simple_heap_insert is successful, but the index insertion or the after row insert trigger causes an error. The rest of the load then can go ahead with the start of a new subtransaction. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [BUGS] BUG #3811: Getting multiple values from a sequence generator
Hi, Right, I want to use it with a bulk operation, say importing a million records with COPY. Calling nextval one million times looks to me like an enormous waste of resources. Suppose you are on an ADSL line: it will cost one million times the ping time of the ADSL line (say 10 milliseconds per call). Well OK, one could write a server function that does this, but then the one million result values must be transported back to the client, because they are not guaranteed to be contiguous. Unneeded complexity compared to a simple nextval increment parameter. The usual way to use nextval() is to use it on the server as an expression in an INSERT or DEFAULT. If you're using COPY and don't have a column default set up then, hm, I guess you're kind of stuck. That would make a good use case for a one-time nextval(increment) or something like that. Coincidentally, I very briefly discussed (offline) supporting expressions while doing loads using COPY FROM with Heikki a while back. From the above mail exchanges, it does appear that adding this kind of functionality will be useful while doing bulk imports into tables using COPY. Heikki's initial suggestion was as follows: COPY table FROM file USING query Where query could be any SELECT query, executed once per row using the values from the input data file. For example: COPY footable (strcol, strcollen, moredata) FROM file USING SELECT $1, length($1), $2; The sql expressions could refer to the columns being read or could be user-defined procedures, built-in functions etc. too. These expressions would need to be executed per row read from the input data file to form a new set of values[], nulls[] entries before forming the corresponding tuple entry. I think the above will be a very useful enhancement to COPY. The syntax and other details mentioned above are of course subject to discussion and approval on the list. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [BUGS] BUG #3774: create table like including index doesn't update pg_constraints with primary key
Hi, The fundamental question though is should we allow primary, unique CONSTRAINTS which use the index mechanism just as an implementation to be created using the INCLUDING INDEXES mechanism. Yeah, this bizarreness was foreseen and agreed to back when we set up LIKE INCLUDING CONSTRAINTS the way it was defined (ie, copying only CHECK constraints and not other things called constraints). I was never very thrilled with that definition myself, but it's a bit too late to revisit it. Yeah this is all confusing. I believe we should remove the following TODO now that the above has been checked in: CREATE - Have WITH CONSTRAINTS also create constraint indexes http://archives.postgresql.org/pgsql-patches/2007-04/msg00149.php Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [pgsql-www] [HACKERS] Time to update list of contributors
Apologies if some of you receive duplicates of this email. I am not subscribed to -www, so I am sending this to -hackers again. Hi, On Wed, 2007-11-28 at 11:44 +0530, Pavan Deolasee wrote: Nikhil S Nikhil is from India, EnterpriseDB. What is his surname? I think we need that for adding to the web page. Yes, agreed. We had someone else who wanted to be listed by alias some time back (a year+ ago, I don't remember who it was) and that was turned down. Agreed :), for the record, my name is Nikhil Sontakke. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [BUGS] BUG #3774: create table like including index doesn't update pg_constraints with primary key
Hi, The following bug has been logged online: Bug reference: 3774 Logged by: guillaume (ioguix) de Rorthais Email address: [EMAIL PROTECTED] PostgreSQL version: 8.3 beta3 Operating system: mac os x 10.4.10 Description: create table like including index doesn't update pg_constraints with primary key Details: When creating a table using the create table ... (like ... including indexes...) syntax, pg_catalog.pg_constraint is not updated with the PK constraints, which actually are set in pg_index. I'm not sure if this issue is actually a bug or if there is a logic behind this, but as the primary key is a constraint, I would expect it to be set in pg_constraint, shouldn't it? This can be handled by setting index->isconstraint appropriately inside generateClonedIndexStmt(). The fundamental question though is: should we allow primary/unique CONSTRAINTS, which use the index mechanism just as an implementation, to be created using the INCLUDING INDEXES mechanism? As per the discussion here: http://www.nabble.com/Re%3A-CREATE-TABLE-LIKE-INCLUDING-INDEXES-support-p10683716.html maybe we should not? In other words, INCLUDING INDEXES should only create those indexes which do not have isconstraint set to TRUE. Comments? Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
[HACKERS] Assertion failure due to ColumnRefStar
Hi, One of our qmg folks reported an assertion failure: create table x(y char(1)); insert into x values (*); The above causes the following assertion to be hit: /* * Target item is a bare '*', expand all tables * * (e.g., SELECT * FROM emp, dept) * * Since the grammar only accepts bare '*' at top level of SELECT, we * need not handle the targetlist==false case here. */ Assert(targetlist); in the ExpandColumnRefStar() function. I was wondering if we should fix this by checking for list_length(cref->fields) being greater than 1 before calling this in transformExpressionList? Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] StringInfo misc. issues
Apologies! As Alvaro guessed it correctly I was working with 8.2 sources. Sorry for the noise. Regards, Nikhils On 8/29/07, Tom Lane [EMAIL PROTECTED] wrote: NikhilS [EMAIL PROTECTED] writes: The attached patch should fix this. And break other things, no doubt. needed = 0 is a perfectly valid edge case and mustn't be rejected here. (In fact, I doubt you'd even get through the regression tests with this patch ... how much did you test it?) The real problem with what you describe is that you should have used makeStringInfo(). I also found the absence of a function like resetStringInfo() a bit puzzling. CVS HEAD is way ahead of you. regards, tom lane -- EnterpriseDB http://www.enterprisedb.com
[HACKERS] StringInfo misc. issues
Hi, I palloc0'ed a variable of type StringInfo and without doing an initStringInfo() (forgot to do it, i.e.) tried to append some stuff to it using appendStringInfo(). It went into a tight loop within the function enlargeStringInfo() at:

	while (needed > newlen)

Must be a common enough case for a palloc0'ed field, right? The attached patch should fix this.

*** 226,232 ****
! 	if (needed < 0 || ((Size) needed) >= (MaxAllocSize - (Size) str->len))
  		elog(ERROR, "invalid string enlargement request size %d", needed);
--- 226,232 ----
! 	if (needed <= 0 || ((Size) needed) >= (MaxAllocSize - (Size) str->len))
  		elog(ERROR, "invalid string enlargement request size %d", needed);

I also found the absence of a function like resetStringInfo() a bit puzzling. I found a lot of places where the code was resetting the len field to 0 and assigning '\0' to the data field to reset the variable. This seems to be the only missing API which will be needed while working with the StringInfo type. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com

Index: src/backend/lib/stringinfo.c
===================================================================
RCS file: /repositories/edbhome/cvs/EDBAS82/edb/edb-postgres/src/backend/lib/stringinfo.c,v
retrieving revision 1.3
diff -c -r1.3 stringinfo.c
*** src/backend/lib/stringinfo.c	9 Nov 2006 11:09:09 -	1.3
--- src/backend/lib/stringinfo.c	29 Aug 2007 14:37:58 -
***************
*** 226,232 ****
  	 * bogus data.  Without this, we can get an overflow or infinite loop in
  	 * the following.
  	 */
! 	if (needed < 0 || ((Size) needed) >= (MaxAllocSize - (Size) str->len))
  		elog(ERROR, "invalid string enlargement request size %d", needed);
--- 226,232 ----
  	 * bogus data.  Without this, we can get an overflow or infinite loop in
  	 * the following.
  	 */
! 	if (needed <= 0 || ((Size) needed) >= (MaxAllocSize - (Size) str->len))
  		elog(ERROR, "invalid string enlargement request size %d", needed);
***************
*** 259,261 ****
--- 259,272 ----
  	str->maxlen = newlen;
  }
+ 
+ /*
+  * resetStringInfo
+  *	Reset the len field and the data field contents for a fresh start
+  */
+ void
+ resetStringInfo(StringInfo str)
+ {
+ 	str->len = 0;
+ 	str->data[0] = '\0';
+ }

Index: src/include/lib/stringinfo.h
===================================================================
RCS file: /repositories/edbhome/cvs/EDBAS82/edb/edb-postgres/src/include/lib/stringinfo.h,v
retrieving revision 1.3
diff -c -r1.3 stringinfo.h
*** src/include/lib/stringinfo.h	9 Nov 2006 11:09:17 -	1.3
--- src/include/lib/stringinfo.h	29 Aug 2007 14:37:58 -
***************
*** 138,141 ****
--- 138,146 ----
   */
  extern void enlargeStringInfo(StringInfo str, int needed);
+ 
+ /*
+  * resetStringInfo
+  *	Reset the len field and the data field contents for a fresh start
+  */
+ extern void resetStringInfo(StringInfo str);
  #endif   /* STRINGINFO_H */
Re: [HACKERS] Memory leak in vac_update_relstats ?
Hi, It's palloc'd in the current memory context, so it's not serious. It'll be freed at the end of the transaction, if not before that. That's the beauty of memory contexts; no need to worry about small allocations like that. That's the beauty of memory contexts for small allocations. But because of the 'convenience' of memory contexts we sometimes tend to not pay attention to doing explicit pfrees. As a general rule I think allocations in TopMemoryContext should be critically examined. I was bitten by this undue bloat recently while developing some code and valgrind is not of much help in such cases because of this very beauty of memory contexts :). Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Memory leak in vac_update_relstats ?
Hi, That's the beauty of memory contexts for small allocations. But because of the 'convenience' of memory contexts we sometimes tend to not pay attention to doing explicit pfrees. As a general rule I think allocations in TopMemoryContext should be critically examined. I was bitten by this undue bloat recently while developing some code, and valgrind is not of much help in such cases because of this very beauty of memory contexts :). One specific case I want to mention here is hash_create(). For local hash tables, if HASH_CONTEXT is not specified, they get created in a context which becomes a direct child of TopMemoryContext. Wouldn't it be a better idea to create the table in CurrentMemoryContext? If hash_destroy() is not explicitly invoked, this can cause a lot of bloat, especially if the intention was to use the hash table only for a while. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] schema creation during initdb
Hi, On 4/18/07, sharath kumar [EMAIL PROTECTED] wrote: Sorry i have put the question wrongly. I wanted to ask if we can create a schema during createdb time so that i have to hide the following from the user. psql -c 'create schema foo' mytemplate psql -c 'create table foo.bar ...' mytemplate Whenever a user runs the createdb command, the above schema and table should be created automatically. Thanks, Sharat. One way that I can think of doing this would be by adding the details about the new schema and the tables that lie therein in the src/backend/catalog/information_schema.sql file (I don't know if this is the recommended way though). These will end up becoming a part of template1, and any subsequent databases that are created will contain them. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] where to write small reusable functions ?
Hi, char * pg_strcat (char *dest,char *src) { /* pg_realloc is a safer function than realloc */ dest=pg_realloc(dest,strlen(dest)+strlen(src)+1); strcat(dest,src); return dest; } Postgres already has something for the above functionality. See makeStringInfo, appendStringInfo. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
[HACKERS] Bug about column references within subqueries used in selects
Hi, Shouldn't the final command below cause a 'column b does not exist' error?

create table update_test (a int, b int);
create table supdate_test(x int, y int);
insert into update_test values (20, 30);
insert into supdate_test values (40, 50);
select a, (select b from supdate_test) from update_test;
 a  | ?column?
----+----------
 20 |       30

Is the problem with the code in colNameToVar, or maybe we should add checks in transformSubLink? Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Bug about column references within subqueries used in selects
Hi, On 4/12/07, Merlin Moncure [EMAIL PROTECTED] wrote: On 4/12/07, NikhilS [EMAIL PROTECTED] wrote: Hi, Shouldn't the final command below cause a 'column b does not exist' error?

create table update_test (a int, b int);
create table supdate_test(x int, y int);
insert into update_test values (20, 30);
insert into supdate_test values (40, 50);
select a, (select b from supdate_test) from update_test;
 a  | ?column?
----+----------
 20 |       30

Is the problem with the code in colNameToVar, or maybe we should add checks in transformSubLink? I don't think so... the columns of update_test are visible to the scalar subquery... that way you can use fields from 'a' to filter the subquery... select a, (select y from supdate_test where x = a) from update_test; Yes this is fine, but in select columnname from tablename, using column references of the other involved table is what I am objecting to. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] UPDATE using sub selects
Hi, Hmm. That sounds like it would be a horrid mess. You need to decouple the execution of the subplan from the use of its outputs, apparently. There is some precedent for this in the way that InitPlans are handled: the result of the subplan is stored into a ParamList array entry that's later referenced by a Param node in the parent's expression tree. That would generalize easily enough to setting more than one Param, but I'm not clear on where you'd want to stick the subplan itself in the plan tree, nor on what controls how often it needs to get evaluated. Ended up using something similar to the above suggestion. I have posted the patch to -patches based on this. An important concern was where to stick the evaluation of the subqueries so that they end up becoming subplans which are used in the execution. For this I have added a new field in the Query structure. This entry gets preprocessed similar to other fields of the Query from within subquery_planner. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Idle idea for a feature
Hi, On 4/11/07, Guillaume Smet [EMAIL PROTECTED] wrote: On 4/10/07, Tom Lane [EMAIL PROTECTED] wrote: ISTM it'd be a good idea if it did, as "are there any incoming foreign keys" seems to be a question we constantly ask when solving update-performance problems, and there isn't any easy way to check for such. On similar lines, maybe \d can also show the list of inheritors when invoked on a parent. e.g.:

postgres=# \d parent
    Table "public.parent"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |
Indexes:
    "parent_a_key" UNIQUE, btree (a)
Inherited by: child

Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto Partitioning
Hi, The only problem I have with this is that the shops I know with big partitioned tables favor triggers over rules, for both performance reasons and a cleaner implementation. Even with automated rule creation this isn't going to change afaics... not to mention we already create our rules and triggers automatically, so really this just isn't exciting to me (though it may make it easier for people getting in on the ground floor). I second this. The trigger route is much more maintainable than the rule route. IMO what really needs to happen is something more low level where there are no DBA visible changes. Triggers also have overhead; it would be nice to get a little more bare metal with this. I had raised this issue about rules/triggers back then and the responses seemed to be evenly split as to which ones to use. I think the broad question really is how well we want to support the current inheritance based partitioning mechanism. If we want to stick to it for a while (and we will stick to it unless something concrete/better/bare metal comes up), IMHO we should try to make things simpler (by automating things if possible) to make it easier for people getting in. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto Partitioning
Hi, I had raised this issue about rules/triggers back then and the responses seemed to be evenly split as to which ones to use. Presumably your implementation already uses Triggers for INSERTs though, so why not use triggers for everything? No, I am using rules for all the 3 cases. I am done with the UPDATE stuff too, on which I was stuck, with some help, so here is what the patch will do:

postgres=# create table test1 (a int unique, b int check (b > 0)) partition by range(a) (partition child_1 check (a < 10));
NOTICE: CREATE TABLE / UNIQUE will create implicit index "test1_a_key" for table "test1"
NOTICE: CREATE TABLE / UNIQUE will create implicit index "child_1_a_key" for table "child_1"
CREATE TABLE

A describe of the parent shows the rules added to it:

postgres=# \d test1
     Table "public.test1"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |
 b      | integer |
Indexes:
    "test1_a_key" UNIQUE, btree (a)
Check constraints:
    "test1_b_check" CHECK (b > 0)
Rules:
    test1_child_1_delete AS
    ON DELETE TO test1
   WHERE old.a < 10 DO INSTEAD  DELETE FROM child_1
  WHERE child_1.a = old.a
    test1_child_1_insert AS
    ON INSERT TO test1
   WHERE new.a < 10 DO INSTEAD  INSERT INTO child_1 (a, b)
  VALUES (new.a, new.b)
    test1_child_1_update AS
    ON UPDATE TO test1
   WHERE old.a < 10 DO INSTEAD  UPDATE child_1 SET a = new.a, b = new.b
  WHERE child_1.a = old.a

Whereas a describe on the child shows the following:

postgres=# \d child_1
    Table "public.child_1"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |
 b      | integer |
Indexes:
    "child_1_a_key" UNIQUE, btree (a)
Check constraints:
    "child_1_a_check" CHECK (a < 10)
    "test1_b_check" CHECK (b > 0)
Inherits: test1

Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto Partitioning
So we are unable to load any of the tables using COPY. Aww, guess I should have stuck to triggers as a first choice. Mea culpa, since I should have investigated some more before deciding on rules, or should have prodded you more earlier :) Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto Partitioning
Hi, I appreciate your efforts, but I'm not sure if this has been discussed Thanks Markus. enough. There seem to be two ideas floating around: - you are heading for automating the current kludge, which involves creating partitions and constraints by hand. AFAICT, you want to support list and range partitioning. - Simon Riggs has proposed partitioning functions, which could easily handle any type of partitioning (hash, list, range and any mix of those). When I submitted the proposal, AFAIR there was no objection to going with the first proposal. Yes, there was a lot of forward looking discussion, but since what I had proposed (at least syntax wise) was similar/closer to MySQL and Oracle, I did not see anyone objecting to it. I think SQL Server provides partitioning functions similar to Simon's proposal. And all along, I had maintained that I wanted to automate, as far as possible, the existing mechanism for partitioning. I do not remember anyone objecting to that either. Our current partitioning solution is based on inheritance. With that in mind, for 8.3 I thought an implementation based on auto rules creation would be the way to go. Having said that, obviously I would want to go with the consensus on this list as to what we think is the *best* way to go forward with partitioning. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto Partitioning Patch - WIP version 1
Hi, The following things are TODOs: iv) Auto generate rules using the checks mentioned for the partitions, to handle INSERTs/DELETEs/UPDATEs and navigate them to the appropriate child. Note that checks specified directly on the master table will get inherited automatically. Am planning to do the above by using the check constraint specified for each partition. This constraint's raw_expr field ends up becoming the whereClause for the rule specific to that partition. One question is whether we should allow auto creation of UPDATE rules, given that updates can end up spanning multiple partitions if the column on which partitioning is specified gets updated? Also if we decide to auto-add rules for UPDATE, the raw_expr will need to be modified to refer to OLD.col, which can be quite a headache. We do not have parsetree walker/mutator functions as far as I could see in the code. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] UPDATE using sub selects
Hi, No. Offhand I think you'd either need to relax EXPR_SUBLINK to allow multiple output columns, or invent a ROW_SUBLINK SubLinkType that is just like EXPR_SUBLINK except for allowing multiple output columns. The latter would probably be less likely to break other things... Yeah, was looking at EXPR_SUBLINK and its single column use case and drove to the same conclusion that inventing a new sublink type would be better too. It is indeed becoming a not so simple and narrow fix as you had mentioned earlier in your first response :) I have invented a ROWEXPR_SUBLINK type that handles multiple output columns. The trouble is that since eventually columns of the parents have to be part of the query's targetList, I am sending the entire subquery as one of the entries in that list and the targetList gets populated with entries dependent on the subquery much later via make_subplan. This breaks code in rewriteTargetList (which expects every list entry to be of type TargetEntry), and expand_targetlist (which expects the targets to be present in attrno order, the entries added because of the subquery will not be in order as compared to normal SET colname = expr targets). Is there a simpler way of doing things? Should I try generating a resjunk TargetEntry in transformUpdateStmt and have its expr point to the subquery and see if that works? Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] UPDATE using sub selects
Hi, On 3/31/07, Tom Lane [EMAIL PROTECTED] wrote: NikhilS [EMAIL PROTECTED] writes: I have invented a ROWEXPR_SUBLINK type that handles multiple output columns. The trouble is that since eventually columns of the parents have to be part of the query's targetList, I am sending the entire subquery as one of the entries in that list and the targetList gets populated with entries dependent on the subquery much later via make_subplan. This breaks code in rewriteTargetList (which expects every list entry to be of type TargetEntry), and expand_targetlist (which expects the targets to be present in attrno order, the entries added because of the subquery will not be in order as compared to normal SET colname = expr targets). Hmm. That sounds like it would be a horrid mess. You need to decouple the execution of the subplan from the use of its outputs, apparently. There is some precedent for this in the way that InitPlans are handled: the result of the subplan is stored into a ParamList array entry that's later referenced by a Param node in the parent's expression tree. That would generalize easily enough to setting more than one Param, but I'm not clear on where you'd want to stick the subplan itself in the plan tree, nor on what controls how often it needs to get evaluated. Yes, I have tried this already. As you suspect, it seems that the subplan does not get evaluated if its not part of the targetList at all. An alternative approach is to put the subplan into the rangetable and use Vars to reference its outputs. Again it's not quite clear what drives re-execution of the subplan. It strikes me though that an approach like this might also serve for SQL2003's LATERAL construct, which'd be a nice thing to support. Ok, I will try this out. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] UPDATE using sub selects
Hi, What's the expected result if the tuple from subselect is more than 1? Error, per SQL99 section 7.14: 1) If the cardinality of a row subquery is greater than 1 (one), then an exception condition is raised: cardinality violation. I expect no update at all in case of void result set, is this the case ? No, you get nulls; it's a subquery not a join. Per SQL99 7.1: c) If the row value constructor is a row subquery, then: i) Let R be the result of the row subquery and let D be the degree of R. ii) If the cardinality of R is 0 (zero), then the result of the row value constructor is D null values. iii) If the cardinality of R is 1 (one), then the result of the row value constructor is R. regards, tom lane To allow both of the above to hold, I think the subselect will have to be treated like a EXPR_SUBLINK subquery. I was wondering if we have a similar mechanism for plain selects/subselects to check and restrict their output to a single row. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] UPDATE using sub selects
Hi, On 3/16/07, Tom Lane [EMAIL PROTECTED] wrote: NikhilS [EMAIL PROTECTED] writes: To allow both of the above to hold, I think the subselect will have to be treated like a EXPR_SUBLINK subquery. I was wondering if we have a similar mechanism for plain selects/subselects to check and restrict their output to a single row. No. Offhand I think you'd either need to relax EXPR_SUBLINK to allow multiple output columns, or invent a ROW_SUBLINK SubLinkType that is just like EXPR_SUBLINK except for allowing multiple output columns. The latter would probably be less likely to break other things... Yeah, was looking at EXPR_SUBLINK and its single column use case and drove to the same conclusion that inventing a new sublink type would be better too. It is indeed becoming a not so simple and narrow fix as you had mentioned earlier in your first response :) Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] where to add/change commands
Hi, psql? There's some code to detect commands that can't be run in a transaction block in src/bin/psql/common.c, maybe that's what you're looking for. Or did you mean something else? How doesn't it pick it up? I think he probably meant that he was getting a syntax error, even after making all the changes. Grzegorz, I would have suggested to make an entry for VERBOSE in parser/keywords.c, but it already seems to contain an entry for VERBOSE. I hope you are using the opt_verbose rule in your gram.y in the CLUSTER [VERBOSE] case. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
[HACKERS] UPDATE using sub selects
Hi, I have coded up a patch which solves the following TODO. I will submit a patch for this soon: - UPDATE - Allow UPDATE tab SET ROW (col, ...) = (SELECT...) http://archives.postgresql.org/pgsql-hackers/2006-07/msg01306.php The question is that since this enhances the UPDATE syntax, what changes and where all they need to be made with respect to the documentation? Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] UPDATE using sub selects
Hi, The question is that since this enhances the UPDATE syntax, what changes and where all they need to be made with respect to the documentation? Documentation is the very least of your worries. What exactly is your implementation plan? If this were a simple or narrow fix it would have been done already. The implementation that I have planned is pretty similar to the way INSERT INTO ... SELECT has been implemented. Along with the grammar changes in gram.y, the changes are localized in the transformUpdateStmt code path. The SELECT clause ends up becoming a subquery to the update query with the target column expressions transformed properly to include the subquery expressions. Does this sound ok? I have tried some update-subselect variations and they seem to work. For example the case in the src/test/regress/sql/update.sql, which used to fail till now, seems to work: UPDATE update_test SET (a,b) = (select a,b FROM update_test where c = 'foo') WHERE a = 10; Will try testing out some other variations too. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi, Why would we support HASH partitions? If you did, the full syntax for hash clusters should be supported. In MySQL and Oracle, the syntax for HASH partitions seems to be similar to the one mentioned. I do not know much about hash clusters though. If we do the CHECK clauses like that then we still don't have a guaranteed non-overlap between partitions. It would be easier to use Oracle syntax and then construct the CHECK clauses from that. Again, Oracle and MySQL use the VALUES LESS THAN (expr) format for RANGE partitions. So you mean that they end up creating ranges like MININT - Range1, Range1+1 - Range2 etc. for each of the partitions? I think Postgres users are used to the CHECK clauses and I still feel that the onus of distinct partitions lies on the partition creator. Also, the syntax needs to be fairly complex to allow for a mixture of modes, e.g. range and list partitioning. That is currently possible today and the syntax for doing that is IMHO much simpler than the Oracle way of specifying it. Subpartitioning is not being targeted right now, but could be put on the TODO list for further enhancements. An alternative is to provide a partitioning function which decides which partition each value goes into. PARTITION FUNCTION which_partition(date_col) The partition function must return an unsigned integer > 0, which would correspond to particular partitions. Partitions would be numbered 1..N, and named tablename_partM where 1 <= M <= N. The input and contents of the partition function would be up to the user. e.g.

CREATE FUNCTION range_partition(date date_col) {
    if (date_col < D1) return 1;
    else if (date_col < D2) return 2;
    else if (date_col < D3) return 3;
    return 4;
}

Doing it this way would allow us to easily join two tables based upon a common partition function. In time, I would suggest we support both ways: declarative and functional. Till now, we are going the declarative way. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi, On 3/10/07, Hannu Krosing [EMAIL PROTECTED] wrote: On Fri, 2007-03-09 at 15:41, Jim Nasby wrote: On Mar 9, 2007, at 3:31 AM, Zeugswetter Andreas ADI SD wrote: Since partition is inheritance-based, a simple DROP or NO INHERIT will do the job to deal with the partition. Do we want to reinvent additional syntax when these are around and are documented? Well, if the syntax for adding a new partition eventually ends up as ALTER TABLE ADD PARTITION, then it would make more sense that you remove a partition via ALTER TABLE DROP PARTITION. But DROP PARTITION usually moves the data from this partition to other partitions, so it is something different. It does? IIRC in every partitioning system I've seen, DROP PARTITION drops the data as well. It's up to you to move it somewhere else if you want to keep it. Will this proposed DROP PARTITION just disassociate the table from the master, or will it actually drop the partition's table from the whole database? That's why I would prefer the existing mechanism: there, a DROP on the child removes it and a NO INHERIT disassociates it. There might be situations where we would want to just disassociate and not drop. Regards, Nikhils -- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi, Given that Simon wants to do away with having the master table APPENDed in the planning phase, this would be better. ISTM you're trading appending the master table for appending the DUMP partition, which afaict would give you no gain. If there are entries in the master table, I think it would get appended for all queries regardless of whether we need to examine its contents or not. Segregating dump data into a partition will avoid that. I have seen examples in some other databases wherein a partition specifies a range of someval - MAXINT for instance, to catch such cases. That again means that the onus is on the partition creator most of the times.. Regards, Nikhils -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi, Maybe the dummy entry could be extended to contain the bounds (max/min) for each of the other involved partitions, and they could be updated each time a DML happens across the partitions. That way, an update to a particular partition needs to lock out the others, examine the dummy entries in its own index and follow it up with dummy entry updates into the other partitions if need be. Of course, as you have mentioned, all of this needs to be done after a careful think on the locking/deadlocking etc. issues. Regards, Nikhils On 3/7/07, Alvaro Herrera [EMAIL PROTECTED] wrote: I am wondering if we can implement unique indexes across several tables (inheritance hierarchy) not by using a single, big index covering all the tables, but rather by inserting a dummy entry into each partition's unique index. This dummy entry would have an expanded CTID which would include the tableoid, so it's possible to check it (albeit there is a problem in that we may require the opening of another heap to do the actual checking). These dummy entries could be removed by bulkcleanup as soon as the inserting transaction is no longer running, to avoid bloating the index too much. All said dummy index entries would be located at either the rightmost or the leftmost leaf, or close to it, so another idea is to have future inserters reuse the entry for a different key. The obvious problem with this is, naturally, the excess I/O that extra index traversing causes. The not so obvious ones are locking, deadlocking and the opening of other heaps and indexes while you do the insertion, which may be too expensive. On the other hand, maybe this idea is easier to implement than full-fledged cross-table indexes, so we could have richer partitioning earlier than when somebody finally bites the bullet and implements cross-table indexes. Or maybe this is just a dumb idea, but I had to let it out anyway :-) -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
-- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi, There are 2 other reasons to favor triggers though: 1) People (Josh Drake comes to mind) have found that if you get over a tiny number of partitions, the performance of rules is abysmal. 2) I believe it should be possible to construct an update trigger that allows you to perform updates that will place the row in question into a new partition. While I can see cases for simply disallowing updates to the partitioning key, I think there are also times when being able to do that would be very useful. The consensus seems to be veering towards triggers. I think it'd be great to make adding and removing partitions as simple as ALTER TABLE. I don't think that DELETE should be the mechanism to drop a partition, though. Again, DML statements shouldn't be performing DDL. Since partition is inheritance-based, a simple DROP or NO INHERIT will do the job to deal with the partition. Do we want to reinvent additional syntax when these are around and are documented? Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi, On 3/9/07, Shane Ambler [EMAIL PROTECTED] wrote: Note to Nikhil: Make sure the new syntax doesn't prevent partitions from being placed upon multiple tablespaces in some manner, at CREATE TABLE time. What if the syntax was something like - CREATE TABLE tabname ( ... ... ) PARTITION BY HASH(expr) | RANGE(expr) | LIST(expr) [PARTITIONS num_partitions] /* will apply to HASH only for now*/ [PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname], PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname] ... ]; And (if we use the ALTER TABLE to add partitions) ALTER TABLE tabname ADD PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname]; We could as well drop the USING part. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi, This follows on from the suggestion I made - taken along the lines of the subject auto creation of partitions where I suggested the syntax of partition check(month of mydatecol) and have a new partition created as data was entered. With this scenario, dropping the partition when it was empty would complement the creation of a new partition as needed. Given that there seems to be no real support for going with auto maintenance, where new partitions are added as needed, the auto dropping of empty partitions would also not apply. That leaves us with only specific add partition / drop partition commands, and having the parent table pick up rows not matching any partition check criteria. I was thinking along the lines of what Jim had suggested earlier regarding an overflow partition. Instead of dumping unmatched rows into the master table, we could put them into a default DUMP/DUMB partition. Given that Simon wants to do away with having the master table APPENDed in the planning phase, this would be better. Regards, Nikhils -- Shane Ambler [EMAIL PROTECTED] Get Sheeky @ http://Sheeky.Biz -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi, If you know that the constraints on each of the tables is distinct, then building a UNIQUE index on each of the partitions is sufficient to prove that all rows in the combined partitioned table are distinct also. The hard part there is checking that the partition constraints are distinct. If the partition constraints are added one at a time, you can use the predicate testing logic to compare the to-be-added partition's constraint against each of the already added constraints. That becomes an O(N) problem. What is really needed is a data structure that allows range partitions to be accessed more efficiently. This could make adding partitions and deciding in which partition a specific value goes an O(logN) operation. If the unique constraint is supposed to be on a column which is NOT being used for the partitioning, then all the above becomes much more difficult. While partitioning, the additional onus on the user is to specify non-conflicting CHECKs for the range/list partitions. Regards, Nikhils -- Simon Riggs EnterpriseDB http://www.enterprisedb.com -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi, On 3/7/07, Zeugswetter Andreas ADI SD [EMAIL PROTECTED] wrote: iv) Based on the PRIMARY, UNIQUE, REFERENCES information specified, pass it on to the children tables. How will you maintain a primary key in such a table, considering that indexes can't span multiple tables? Many partitioning schemes have (or can be made to have) only one possible target partition for the primary key. Thus if you create separate unique indexes on each partition the problem is solved. For a first version, I propose that it is sufficient to disallow creation of a unique index on the master when the constraints (on the columns of this index) do not nail down a specific partition for each row (e.g. a hash or a range on one of the index columns that does not overlap). Later, global indexes, or indexes with separate partitioning rules, can be implemented to cover the other cases. Andreas Yes, I agree. For version 1, UNIQUE/PRIMARY indexes will cascade down to the child tables only if the indexed column is present as part of the partitioning rule. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
[HACKERS] Auto creation of Partitions
Hi, This is to get feedback to meet the following TODO:

- Simplify ability to create partitioned tables

This would allow creation of partitioned tables without requiring creation of rules for INSERT/UPDATE/DELETE, and constraints for rapid partition selection. Options could include range and hash partition selection.

There was some discussion on the pgsql mailing lists, which led to the above TODO: http://archives.postgresql.org/pgsql-hackers/2006-09/msg00189.php http://archives.postgresql.org/pgsql-hackers/2006-08/msg01874.php

We can have the following syntax to support auto creation of partitions in PostgreSQL:

CREATE TABLE tabname (
    ...
    ...
) PARTITION BY HASH(expr) | RANGE(expr) | LIST(expr)
[PARTITIONS num_partitions] /* will apply to HASH only for now */
[PARTITION partition_name CHECK(...),
 PARTITION partition_name CHECK(...)
 ...];

Here expr will be one of the column names as specified for the master table. Once we finalize the syntax, the above statement would end up carrying out the following activities (disclaimer: we might add or remove some activities based on the discussion here):

i) Create the master table.
ii) Create children tables based on the number of partitions specified and make them inherit from the master table.
iii) Auto generate rules (or triggers?) using the checks mentioned for the partitions, to handle INSERTs/DELETEs/UPDATEs and navigate them to the appropriate child. Note that checks specified directly on the master table will get inherited automatically.
iv) Based on the PRIMARY, UNIQUE, REFERENCES information specified, pass it on to the children tables.
v) If possible, add CHECK (false) to the master table to avoid any activity on it.

Some questions remain:

1) Whether we should use triggers/rules for step (iii) above. Maybe rules is the way to go.
2) What other attributes (access permissions e.g.) of the master, along with the ones specified in (iv), should be passed on to the children.
3) Some implementation-specific issues, e.g. whether SPI_execute would be a good way of creating these rules.

Comments appreciated. Regards, Nikhils EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi, On 3/6/07, Peter Eisentraut [EMAIL PROTECTED] wrote: NikhilS wrote: iv) Based on the PRIMARY, UNIQUE, REFERENCES information specified, pass it on to the children tables. How will you maintain a primary key in such a table, considering that indexes can't span multiple tables? We will not (I know it's a hard thing to do :) ); the intention is to use this information from the parent and make it a property of the child table. This will save the user from having to manually run CREATE INDEX and the like on all the children tables one-by-one. 1) Whether we should use triggers/rules for step number (iii) above. Maybe rules is the way to go. Since this would basically be a case of the updatable rules problem, you should review those discussions in the past to check whether the issues mentioned there don't interfere with that plan. The rules mentioned here will specify that all the inserts/updates/deletes should go into the proper children tables instead of the parent. I do not see the updatable rules problem with regard to this, but will check out the archives for discussion on this related to partitioning. 2) What other attributes (access permissions e.g.) of the master along with the ones specified in (iv) should be passed on to the children. Moreover, how are later changes of those attributes propagated? Once created, this will be a normal inheritance relationship between the tables, and all the existing commands will apply to both the parent and the child. The basic idea here is to automate as many things as possible at partition creation time. The user is free to make additional changes to the involved tables later too. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
On 3/6/07, NikhilS [EMAIL PROTECTED] wrote: Hi, On 3/6/07, Gregory Stark [EMAIL PROTECTED] wrote: NikhilS [EMAIL PROTECTED] writes: the intention is to use this information from the parent and make it a property of the child table. This will avoid the step for the user having to manually specify CREATE INDEX and the likes on all the children tables one-by-one. Missed the start of this thread. A while back I had intended to add WITH INDEXES to CREATE TABLE LIKE. That would let you create a table LIKE parent WITH CONSTRAINTS WITH INDEXES and get pretty much a perfect table ready for adding to the inheritance structure. Yeah, this one aims to do pretty much the above as part of the auto creation of the inheritance-based partitions. And to add, if there is consensus/demand for the WITH INDEXES idea mentioned above too, I could work on it as well. Regards, Nikhils -- Gregory Stark EnterpriseDB http://www.enterprisedb.com -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi, On 3/6/07, Gregory Stark [EMAIL PROTECTED] wrote: NikhilS [EMAIL PROTECTED] writes: the intention is to use this information from the parent and make it a property of the child table. This will avoid the step for the user having to manually specify CREATE INDEX and the likes on all the children tables one-by-one. Missed the start of this thread. A while back I had intended to add WITH INDEXES to CREATE TABLE LIKE. That would let you create a table LIKE parent WITH CONSTRAINTS WITH INDEXES and get pretty much a perfect table ready for adding to the inheritance structure. Yeah, this one aims to do pretty much the above as part of the auto creation of the inheritance-based partitions. Regards, Nikhils -- Gregory Stark EnterpriseDB http://www.enterprisedb.com -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi Shane, Maybe I'm looking at auto-maintenance which is beyond any current planning? Many of your suggestions are useful, but auto-maintenance will be beyond the current plan. Regards, Nikhils EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi, On 3/7/07, Tom Lane [EMAIL PROTECTED] wrote: Peter Eisentraut [EMAIL PROTECTED] writes: But when I say CREATE TABLE ( a int PRIMARY KEY, ... ) PARTITION blah ... then I expect that the primary key will be enforced across all partitions. We currently sidestep that issue by not offering seemingly transparent partitioning. But if you are planning to offer that, the unique index issue needs to be solved, and I see nothing in your plan about that. Agreed, it needs to Just Work. I think it'd still be useful though if we only support auto-partitioning on the primary key, and that restriction avoids the indexing problem. regards, tom lane Sure, but as Chris mentioned earlier, wouldn't it be useful to maintain uniqueness on a partition-by-partition basis too? Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] PrivateRefCount (for 8.3)
Hi, What is the opinion of the list as to the best way of measuring whether the following implementation is ok? http://archives.postgresql.org/pgsql-hackers/2007-01/msg00752.php As mentioned in earlier mails, this will reduce the per-backend usage of memory by an amount which will be a fraction (single digit percentage) of (NBuffers * int) size. I have done pgbench/dbt2 runs and I do not see any negative impact because of this. Are there any other suggestions for measuring the backend memory footprint? Regards, Nikhils On 2/21/07, Bruce Momjian [EMAIL PROTECTED] wrote: Added to TODO: * Consider decreasing the amount of memory used by PrivateRefCount http://archives.postgresql.org/pgsql-hackers/2006-11/msg00797.php http://archives.postgresql.org/pgsql-hackers/2007-01/msg00752.php --- Simon Riggs wrote: On Mon, 2006-11-27 at 14:42 -0500, Bruce Momjian wrote: Simon Riggs wrote: int8 still seems like overkill. When will the ref counts go above 2 on a regular basis? Surely refcount=2 is just chance at the best of times. Refcount - 2 bits per value, plus a simple overflow list? That would allow 0,1,2 ref counts, plus 3 means look in a hashtable to find the real refcount. At two bits, would we run into contention for the byte by multiple backends? No contention, it's a private per-backend data structure. That's why we want to reduce the size of it so badly. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] PrivateRefCount (for 8.3)
Hi, Yeah, with 500K shared buffers and multiples of backends, we could achieve noticeable savings with this. And that is why it will be difficult to show the performance gains by running just pgbench/dbt2 on medium scale machines. One way of looking at this could be that memory saved here could lead to more critical usage elsewhere... But agreed, it is hard to show with just some performance runs. Regards, Nikhils On 3/5/07, Stefan Kaltenbrunner [EMAIL PROTECTED] wrote: Tom Lane wrote: NikhilS [EMAIL PROTECTED] writes: What is the opinion of the list as to the best way of measuring if the following implementation is ok? http://archives.postgresql.org/pgsql-hackers/2007-01/msg00752.php As mentioned in earlier mails, this will reduce the per-backend usage of memory by an amount which will be a fraction (single digit percentage) of (NBuffers * int) size. I have done pgbench/dbt2 runs and I do not see any negative impact because of this. I find it extremely telling that you don't claim to have seen any positive impact either. I think that the original argument http://archives.postgresql.org/pgsql-hackers/2006-11/msg00797.php is basically bogus. At 500K buffers (4GB in shared memory) the per-backend space for PrivateRefCount is still only 2MB, which is simply not as significant as Simon claims; a backend needs at least that much for catalog caches etc. There is, furthermore, no evidence that running shared_buffers that high is a good idea in the first place, or that there aren't other performance bottlenecks that will manifest before this one becomes interesting. hmm - we are continually running into people with dedicated servers that have 16GB RAM or even more available, and most tuning docs recommend dedicating some 20-30% of system RAM to shared_buffers. So having some 500K buffers allocated does not sound so unrealistic in practise, and combined with the fact that people often have a few hundred backends, that could add up to some noticeable overhead. Whether that is actually a problem, given that those people tend to have heaps of memory, is another story, but if we can preserve some memory ... Stefan -- EnterpriseDB http://www.enterprisedb.com
[HACKERS] --enable-debug does not work with gcc
Hi, configure with --enable-debug does not seem to add -g to CFLAGS while compiling with gcc. Guess we will need to change configure.in as below:

***************
*** 300,315 ****
  # supply -g if --enable-debug
! if test $enable_debug = yes && test $ac_cv_prog_cc_g = yes; then
    CFLAGS="$CFLAGS -g"
  fi
--- 300,315 ----
  # supply -g if --enable-debug
! if test $enable_debug = yes && (test $ac_cv_prog_cc_g = yes ||
!     test $ac_cv_prog_gcc_g = yes); then
    CFLAGS="$CFLAGS -g"
  fi

Should I submit a patch for this? Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] --enable-debug does not work with gcc
Hi, Indeed it does; apologies for not doing the entire groundwork. But what it also does is add -O2 by default for gcc, even when --enable-debug is specified. gdb is not able to navigate the stack traces properly with this optimization in place. Especially tracing of static functions becomes difficult. Has this issue been faced by anybody else? If so, I can try out a patch to avoid using -O2 with --enable-debug. Regards, Nikhils On 2/2/07, Gavin Sherry [EMAIL PROTECTED] wrote: On Fri, 2 Feb 2007, NikhilS wrote: Hi, configure with --enable-debug does not seem to add -g to CFLAGS while compiling with gcc. Guess we will need to change configure.in as below: Erm... works for me and everyone else... AFAIK. Thanks, Gavin -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] --enable-debug does not work with gcc
Hi, On 2/2/07, Gavin Sherry [EMAIL PROTECTED] wrote: On Fri, 2 Feb 2007, NikhilS wrote: Hi, Indeed it does, apologies for not doing the entire groundwork. But what it also does is that it adds -O2 by default for gcc even when --enable-debug is specified. gdb is not able to navigate the stack traces properly with this optimization in place. Especially tracing of static functions becomes difficult. Has this issue been faced by anybody else? If so can try out a patch to avoid using O2 with enable-debug. Yes, this is known. The thing with gcc is, it only emits some warnings at -O2. I'm not sure that this is why we do not set optimisation to 0, but have long assumed it to be the case. I imagine that it's fairly standard practice for people doing debugging to pass CFLAGS=-O0 as an argument to configure. True, this is how I myself circumvent this problem too. But IMHO, having to explicitly pass CFLAGS when invoking --enable-debug (which does add -g, but leaves an optimization flag around that deters debugging) does not seem correct? Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] PrivateRefCount (for 8.3)
Hi, Most likely a waste of development effort --- have you got any evidence of a real effect here? With 200 max_connections the size of the arrays is still less than 10% of the space occupied by the buffers themselves, ergo there isn't going to be all that much cache-thrashing compared to what happens in the buffers themselves. You're going to be hard pressed to buy back the overhead of the hashing. It might be interesting to see whether we could shrink the refcount entries to int16 or int8. We'd need some scheme to deal with overflow, but given that the counts are now backed by ResourceOwner entries, maybe extra state could be kept in those entries to handle it. I did some instrumentation coupled with pgbench/dbt2/views/join query runs to find out the following: (a) Maximum number of buffers pinned simultaneously by a backend: 6-9 (b) Maximum value of simultaneous pins on a given buffer by a backend: 4-6 (a) indicates that for large shared_buffers values we will end up with space wastage due to a big PrivateRefCount array per backend (current allocation is (int32 * shared_buffers)). (b) indicates that the refcount to be tracked per buffer is a small enough value, and Tom's suggestion of exploring int16 or int8 might be worthwhile. Following is the hash table based proposal based on the above readings:

- Do away with allocating the NBuffers-sized PrivateRefCount array, which is an allocation of (NBuffers * int).
- Define Pvt_RefCnt_Size to be 64 (128?) or some such value, so as to be multiples ahead of the above observed ranges. Define Overflow_Size to be 8 or some similar small value to handle collisions.
- Define the following hash table entry to keep track of reference counts:

struct HashRefCntEnt
{
    int32   BufferId;
    int32   RefCnt;
    int32   NextEnt;    /* To handle collisions */
};

- Define a similar overflow table entry as above to handle collisions.
An array HashRefCntTable of such HashRefCntEnt'ries, of size Pvt_RefCnt_Size, will get initialized in the InitBufferPoolAccess function. An OverflowTable of size Overflow_Size will be allocated. This array will be resized dynamically (2 * current Overflow_Size) to accommodate more entries if it cannot accommodate further collisions in the main table. We do not want the overhead of a costly hashing function, so the index where a buffer goes is simply (bufid % Pvt_RefCnt_Size). In short, our hash function is (bufid modulo Pvt_RefCnt_Size), which should be a cheap enough operation. Considering that only 9-10 buffers will typically be pinned at once, the probability of collisions will be low. Collisions will arise only if buffers with ids (x, x + Pvt_RefCnt_Size, x + 2*Pvt_RefCnt_Size, etc.) get used in the same operation, which should be pretty rare. Functions PinBuffer, PinBuffer_Locked, IncrBufferRefCount, UnpinBuffer etc. will be modified to handle the above mechanism properly. The changes will be localized to the buf_init.c and bufmgr.c files only. Comments please. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Frequent Update Project: Design Overview of HOTUpdates
Hi, This allows the length of a typical tuple chain to be extremely short in practice. For a single connection issuing a stream of UPDATEs the chain length will be no more than 1 at any time. Only if there are no other transactions being held open, which makes this claim a lot weaker. True, but Nikhil has run tests that clearly show HOT outperforming the current situation in the case of long running transactions. The need to optimise HeapTupleSatisfiesVacuum() and avoid long chains does still remain a difficulty for both HOT and the current situation. Yes, I carried out some pgbench runs comparing our current HOT update patch with PG82BETA2 sources for the long running transaction case. For an apples to apples comparison, we got roughly 170% improvement with the HOT update patch over BETA2. In the case of BETA2, since all versions are in the main heap, we end up doing multiple index scans for them. In the case of HOT updates, we have a single index entry, with the chains getting traversed from the overflow relation. So, as Simon has mentioned, the need to avoid long chains remains a difficulty in both situations. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Frequent Update Project: Design Overview of HOT Updates
Hi, I think the vision is that the overflow table would never be very large because it can be vacuumed very aggressively. It has only tuples that are busy and will need vacuuming as soon as a transaction ends, unlike the main table, which is mostly tuples that don't need vacuuming. That's right. Vacuum, if it gets a chance to work on the overflow relation, seems to be doing a decent job in our runs. If autovacuum/vacuum gets to run optimally, the FSM information generated for the overflow relations will be able to serve a lot of new tuple requests, avoiding undue/large bloat in the overflow relations. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Frequent Update Project: Design Overview of HOT Updates
Hi, On 11/10/06, Pavan Deolasee [EMAIL PROTECTED] wrote: On 11/10/06, Tom Lane [EMAIL PROTECTED] wrote: Pavan Deolasee [EMAIL PROTECTED] writes: On 11/10/06, Tom Lane [EMAIL PROTECTED] wrote: (2) Isn't this full of race conditions? I agree, there could be race conditions. But IMO we can handle those. Doubtless you can prevent races by introducing a bunch of additional locking. The question was really directed to how much concurrent performance is left once you get done locking it down. I understand your point, and I can clearly see a chance to improve upon the current locking implementation in the prototype, even though we are seeing a good performance boost for 50 clients and 50 scaling factor with pgbench runs, as mentioned by Nikhil. Regards, Pavan Yes, we have done a number of runs with and without autovacuum, with parameters like 50 clients, 50 scaling factor and 25000 transactions per client. 50 clients should introduce a decent amount of concurrency. The tps values observed with the HOT update patch (850 tps) were approximately 200+% better than PG82 sources (270 tps). Runs with 25 clients, 25 scaling factor and 25000 transactions produce similar percentage increases with the HOT update patch. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Frequent Update Project: Design Overview of HOTUpdates
Hi, On 11/10/06, Zeugswetter Andreas ADI SD [EMAIL PROTECTED] wrote: True, but Nikhil has run tests that clearly show HOT outperforming the current situation in the case of long running transactions. The need to optimise HeapTupleSatisfiesVacuum() and avoid long chains does still remain a difficulty for both HOT and the current situation. Yes, I carried out some pgbench runs comparing our current HOT update patch with PG82BETA2 sources for the long running transaction case. For an apples to apples comparison we got Vacuums every 5 minutes, or no vacuums? We tried with both. Vacuum seems to do little to help in the long running transaction case. Generally, in most of the pgbench runs that we carried out, autovacuum did not seem to be of much help even to PG82. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [SPAM?] Re: Asynchronous I/O Support
Hi, While we are at async i/o, I think direct i/o and concurrent i/o also deserve a look. The archives suggest that Bruce had some misgivings about dio because of no kernel caching, but almost all databases seem to (carefully) use dio (Solaris, Linux, ?) and cio (AIX) extensively nowadays. Since these can be turned on on a per file basis, perf testing them out should be simpler too. Regards, Nikhils On 10/25/06, Martijn van Oosterhout kleptog@svana.org wrote: On Tue, Oct 24, 2006 at 12:53:23PM -0700, Ron Mayer wrote: Anyway, for those who want to see what they do in Linux, http://www.gelato.unsw.edu.au/lxr/source/mm/fadvise.c Pretty scary that Bruce said it could make older linuxes dump core - there isn't a lot of code there. The bug was probably in the glibc interface to the kernel. Google found this: http://sourceware.org/ml/libc-hacker/2004-03/msg0.html i.e. posix_fadvise appears to have been broken on all 64-bit architectures prior to March 2004 due to a silly linking error. And then things like this: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=313219 which suggest that prior to glibc 2.3.5, posix_fadvise crashed on 2.4 kernels. That's a fairly recent version, so the bug would still be fairly widespread. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate. -- All the world's a stage, and most of us are desperately unrehearsed.
[HACKERS] Microseconds granularity SIGALRM interrupt support
Hi, Currently we have enable_sig_alarm(), which provides millisecond-level granularity in specifying the delay. I tried using it by just specifying a nonzero value for the timeval.it_value.tv_usec field before calling setitimer, but it didn't seem to work well. Do we have any function in the postgres codebase which provides a microsecond level of delay coupled with SIGALRM support? Regards, Nikhils EnterpriseDB http://www.enterprisedb.com -- All the world's a stage, and most of us are desperately unrehearsed.
Re: [HACKERS] Additional stats for Relations
Hi Jim, On 10/18/06, Jim C. Nasby [EMAIL PROTECTED] wrote: Also how many times a relation has been vacuumed (which puts all the other numbers in more perspective... good catch Simon). And I think the number of pages that could not be added to the FSM would also be extremely valuable. By the above, do you mean the number of pages that could not be added to the FSM because they had free space which was less than the threshold for this particular relation? Regards, Nikhils EnterpriseDB http://www.enterprisedb.com On Wed, Oct 18, 2006 at 11:27:39AM +0530, NikhilS wrote: Hi, So: heap_blks_reused (with Jim's semantics), heap_blks_extend, heap_blks_truncate are the interesting stats? Will try to work up a patch for this. Regards, Nikhils EnterpriseDB http://www.enterprisedb.com On 10/15/06, Simon Riggs [EMAIL PROTECTED] wrote: On Sat, 2006-10-14 at 11:32 +0530, NikhilS wrote: On 10/13/06, Jim C. Nasby [EMAIL PROTECTED] wrote: I'm also not sure if this metric is what you actually want, since a single page can be returned many times from the FSM even between vacuums. Tracking how many pages for a relation have been put into the FSM might be more useful... Nikhils Pages might be put into the FSM, but by this metric don't we get the actual usage of the pages from the FSM? Agreed a single page can be returned multiple times, but since it serves a new tuple, shouldn't we track it? Nikhils This makes sense for indexes, but only makes sense for heaps when we know that the backend will keep re-accessing the block until it is full - so only of interest in steady-state workloads. IMHO Jim's proposal makes more sense for general use. heap_blks_extend: The number of times file extend was invoked on the relation. Sounds good. heap_blks_truncate: The total number of blocks that have been truncated due to vacuum activity e.g. Sounds good. As an addendum to the truncate stats above, we can also have the following additional stats: heap_blks_maxtruncate: The max block of buffers truncated in one go. heap_blks_ntruncate: The number of times truncate was called on this relation. Those last 2 sound too complex for normal use, and ntruncate is most likely the same as the number of vacuums anyway. Hmmm... Perhaps nvacuums is a more interesting metric? We've got last vacuum date, but no indication of how frequently a vacuum has run. Do you have a use-case for this info? I can see where it might be neat to know, but I'm not sure how you'd actually use it in the real world. Nikhils The use-case according to me is that these stats help prove the effectiveness of autovacuum/vacuum operations. By varying some autovac guc variables, and doing subsequent (pgbench e.g.) runs, one can find out the optimum values for these variables using these stats. Nikhils This should be useful for tuning space allocation/deallocation. If we get this patch in early it should help get feedback on this area. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com -- All the world's a stage, and most of us are desperately unrehearsed. -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) -- All the world's a stage, and most of us are desperately unrehearsed.
Re: [HACKERS] Asynchronous I/O Support
Hi, On 10/18/06, Martijn van Oosterhout kleptog@svana.org wrote: On Wed, Oct 18, 2006 at 08:04:29PM +1300, Mark Kirkwood wrote: bgwriter doing asynchronous I/O for the dirty buffers that it is supposed to sync Another decent use-case? Good idea, but async i/o is generally poorly supported. Async i/o is stably supported on most *nix (apart from Linux 2.6.*) plus Windows. Guess it would be still worth it, since one fine day 2.6.* will start supporting it properly too. Regards, Nikhils Is it worth considering using readv(2) instead? Err, readv allows you to split a single consecutive read into multiple buffers. Doesn't help at all for reads on widely separated areas of a file. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate. -- All the world's a stage, and most of us are desperately unrehearsed.
Re: [HACKERS] Asynchronous I/O Support
Hi, bgwriter doing asynchronous I/O for the dirty buffers that it is supposed to sync. Another decent use-case? Regards, Nikhils EnterpriseDB http://www.enterprisedb.com On 10/15/06, Luke Lonergan [EMAIL PROTECTED] wrote: Martijn, On 10/15/06 10:56 AM, Martijn van Oosterhout kleptog@svana.org wrote: Have enough systems actually got to the point of actually supporting async I/O that it's worth implementing? I think there are enough high end applications / systems that need it at this point. The killer use-case we've identified is for the scattered I/O associated with index + heap scans in Postgres. If we can issue ~5-15 I/Os in advance when the TIDs are widely separated, it has the potential to increase the I/O speed by the number of disks in the tablespace being scanned. At this point, that pattern will only use one disk. - Luke -- All the world's a stage, and most of us are desperately unrehearsed.
Re: [HACKERS] Additional stats for Relations
Hi,

So: heap_blks_reused (with Jim's semantics), heap_blks_extend, heap_blks_truncate are the interesting stats? Will try to work up a patch for this.

Regards,
Nikhils
EnterpriseDB http://www.enterprisedb.com

On 10/15/06, Simon Riggs [EMAIL PROTECTED] wrote:
> On Sat, 2006-10-14 at 11:32 +0530, NikhilS wrote:
> > On 10/13/06, Jim C. Nasby [EMAIL PROTECTED] wrote:
> > > I'm also not sure if this metric is what you actually want, since a single page can be returned many times from the FSM even between vacuums. Tracking how many pages for a relation have been put into the FSM might be more useful...
> > Pages might be put into the FSM, but by this metric don't we get the actual usage of the pages from the FSM? Agreed a single page can be returned multiple times, but since it serves a new tuple, shouldn't we track it?
> This makes sense for indexes, but only makes sense for heaps when we know that the backend will keep re-accessing the block until it is full -- so only of interest in steady-state workloads. IMHO Jim's proposal makes more sense for general use.
> > heap_blks_extend: The number of times file extend was invoked on the relation
> Sounds good
> > heap_blks_truncate: The total number of blocks that have been truncated due to vacuum activity
> Sounds good
> > As an addendum to the truncate stats above, we can also have the following additional stats: heap_blks_maxtruncate: The max number of blocks truncated in one go; heap_blks_ntruncate: The number of times truncate was called on this relation
> Those last 2 sound too complex for normal use, and ntruncate is most likely the same as the number of vacuums anyway. Hmmm... Perhaps nvacuums is a more interesting metric? We've got last vacuum date, but no indication of how frequently a vacuum has run.
> > > Do you have a use-case for this info? I can see where it might be neat to know, but I'm not sure how you'd actually use it in the real world.
> > The use-case according to me is that these stats help prove the effectiveness of autovacuum/vacuum operations. By varying some autovac guc variables, and doing subsequent (pgbench e.g.) runs, one can find out the optimum values for these variables using these stats.
> This should be useful for tuning space allocation/deallocation. If we get this patch in early it should help get feedback on this area.
> --
> Simon Riggs
> EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Additional stats for Relations
Hi Jim,

On 10/13/06, Jim C. Nasby [EMAIL PROTECTED] wrote:
> On Fri, Oct 13, 2006 at 06:17:47PM +0530, NikhilS wrote:
> > Currently a "select * from pg_statio_user_tables;" displays only heap_blks_read, heap_blks_hit stats amongst others for the main relation. It would be good to have the following stats collected too. I think these stats can be used to better statistically analyze/understand the block I/O activity on the relation:
> > heap_blks_reused: The number of buffers returned by the FSM for use to store a new tuple in
> The description on this is misleading... FSM doesn't return buffers, it returns pages that have free space on them.

FSM returns the block number from which we fetch the buffer. This is similar to the way we track buffer_read stats in ReadBuffer.

> I'm also not sure if this metric is what you actually want, since a single page can be returned many times from the FSM even between vacuums. Tracking how many pages for a relation have been put into the FSM might be more useful...

Pages might be put into the FSM, but by this metric don't we get the actual usage of the pages from the FSM? Agreed a single page can be returned multiple times, but since it serves a new tuple, shouldn't we track it?

> > heap_blks_extend: The number of times file extend was invoked on the relation
> > heap_blks_truncate: The total number of blocks that have been truncated due to vacuum activity
> > As an addendum to the truncate stats above, we can also have the following additional stats: heap_blks_maxtruncate: The max number of blocks truncated in one go; heap_blks_ntruncate: The number of times truncate was called on this relation
> > I can come up with a patch (already have one) for the above. Any thoughts/comments?
> Do you have a use-case for this info? I can see where it might be neat to know, but I'm not sure how you'd actually use it in the real world.

The use-case according to me is that these stats help prove the effectiveness of autovacuum/vacuum operations. By varying some autovac guc variables, and doing subsequent (pgbench e.g.) runs, one can find out the optimum values for these variables using these stats.

Regards,
Nikhils
EnterpriseDB http://www.enterprisedb.com
[HACKERS] Additional stats for Relations
Hi,

Currently a "select * from pg_statio_user_tables;" displays only heap_blks_read, heap_blks_hit stats amongst others for the main relation. It would be good to have the following stats collected too. I think these stats can be used to better statistically analyze/understand the block I/O activity on the relation:

heap_blks_reused: The number of buffers returned by the FSM for use to store a new tuple in
heap_blks_extend: The number of times file extend was invoked on the relation
heap_blks_truncate: The total number of blocks that have been truncated due to vacuum activity

As an addendum to the truncate stats above, we can also have the following additional stats:

heap_blks_maxtruncate: The max number of blocks truncated in one go
heap_blks_ntruncate: The number of times truncate was called on this relation

I can come up with a patch (already have one) for the above. Any thoughts/comments?

Regards,
Nikhils (www.enterprisedb.com)