Re: [HACKERS] proposal: array utility functions phase 1
Tom Lane wrote:
> It seems like somehow we need a level of FROM/WHERE producing some base
> rows, and then a set of table function calls to apply to each of the
> base rows, and then another level of WHERE to filter the results of the
> function calls (in particular to provide join conditions to identify
> which rows to match up in the function outputs). I don't see any way to
> do this without inventing new SELECT clauses out of whole cloth ...
> unless SQL99's WITH clause helps, but I don't think it does ...

Well, maybe this is a start. It allows a table function's input
parameter to be declared with setof. The changes involved are primarily:

1) a big loop in ExecMakeTableFunctionResult so that functions with
   set-returning arguments get called for each row of the argument, and
2) always initializing the tuplestore in ExecMakeTableFunctionResult and
   passing that to the function, even when SFRM_Materialize mode is used.

The result looks like:

create table foot(f1 text, f2 text);
insert into foot values('a','b');
insert into foot values('c','d');
insert into foot values('e','f');

create or replace function test2() returns setof foot as
  'select * from foot order by 1 asc' language 'sql';
create or replace function test(setof foot) returns foot as
  'select $1.f1, $1.f2' language 'sql';

regression=# select * from test(test2());
 f1 | f2
----+----
 a  | b
 c  | d
 e  | f
(3 rows)

I know it doesn't solve all the issues discussed, but is it a step
forward? Suggestions?

Joe

Index: contrib/tablefunc/tablefunc.c
===================================================================
RCS file: /opt/src/cvs/pgsql-server/contrib/tablefunc/tablefunc.c,v
retrieving revision 1.11
diff -c -r1.11 tablefunc.c
*** contrib/tablefunc/tablefunc.c	23 Nov 2002 01:54:09 -	1.11
--- contrib/tablefunc/tablefunc.c	11 Dec 2002 22:07:01 -
***************
*** 53,59 ****
  				int max_depth,
  				bool show_branch,
  				MemoryContext per_query_ctx,
! 				AttInMetadata *attinmeta);
  static Tuplestorestate *build_tuplestore_recursively(char *key_fld,
  				char *parent_key_fld,
  				char *relname,
--- 53,60 ----
  				int max_depth,
  				bool show_branch,
  				MemoryContext per_query_ctx,
! 				AttInMetadata *attinmeta,
! 				Tuplestorestate *tupstore);
  static Tuplestorestate *build_tuplestore_recursively(char *key_fld,
  				char *parent_key_fld,
  				char *relname,
***************
*** 641,646 ****
--- 642,648 ----
  	char	   *branch_delim = NULL;
  	bool		show_branch = false;
  	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ 	Tuplestorestate *tupstore;
  	TupleDesc	tupdesc;
  	AttInMetadata *attinmeta;
  	MemoryContext per_query_ctx;
***************
*** 673,678 ****
--- 675,681 ----
  				"allowed in this context");

  	/* OK, go to work */
+ 	tupstore = rsinfo->setResult;
  	rsinfo->returnMode = SFRM_Materialize;
  	rsinfo->setResult = connectby(relname,
  				key_fld,
***************
*** 682,688 ****
  				max_depth,
  				show_branch,
  				per_query_ctx,
! 				attinmeta);
  	rsinfo->setDesc = tupdesc;
  	MemoryContextSwitchTo(oldcontext);
--- 685,692 ----
  				max_depth,
  				show_branch,
  				per_query_ctx,
! 				attinmeta,
! 				tupstore);
  	rsinfo->setDesc = tupdesc;
  	MemoryContextSwitchTo(oldcontext);
***************
*** 709,732 ****
  				int max_depth,
  				bool show_branch,
  				MemoryContext per_query_ctx,
! 				AttInMetadata *attinmeta)
  {
- 	Tuplestorestate *tupstore = NULL;
  	int		ret;
- 	MemoryContext oldcontext;

  	/* Connect to SPI manager */
  	if ((ret = SPI_connect()) < 0)
  		elog(ERROR, "connectby: SPI_connect returned %d", ret);

- 	/* switch to longer term context to create the tuple store */
- 	oldcontext = MemoryContextSwitchTo(per_query_ctx);
-
- 	/* initialize our tuplestore */
- 	tupstore = tuplestore_begin_heap(tru
Re: [HACKERS] proposal: array utility functions phase 1
Tom Lane wrote:
> This seems fairly unworkable to me as-is. By definition, WHERE selects
> from a cross-product of the FROM tables; to make the above do what you
> want, you'd have to break that fundamental semantics. The semantics of
> explicit JOIN cases would be broken too. What we need is some kind of
> explicit multi-level SELECT operation. Perhaps it would help to think
> about the analogy of aggregates of aggregate functions, which are
> impossible to express properly in a single SELECT but work nicely
> given subselect-in-FROM. Subselect-in-FROM doesn't seem to get this
> job done though. Right offhand I don't see any reasonable syntax other
> than function-in-the-SELECT-list, which shoots us right back into the
> messinesses of the Berkeley implementation. However, we do now have
> the precedent of the table-function AS clause. Does it help any to do
> something like
>
>   SELECT grosysid, array_values(grolist) AS (array_index, member_id)
>   FROM pg_group where groname = 'g2';

After further thought, and ignoring the difficulty of implementation,
what seems ideal is to be able to specify a 'setof' row source as an
input to the function, and fire the function once for each row of the
input. Basically, allow anything that now qualifies as a FROM item -- a
table reference, a subselect with AS clause, another table function, or
maybe even a join clause. Some (totally contrived) examples of how it
would look:

create table foo1(f1 int, f2 text);
insert into foo1 values(1,'a');
insert into foo1 values(2,'b');
insert into foo1 values(3,'c');

create table foo2(f1 int, f2 text);
insert into foo2 values(1,'w');
insert into foo2 values(1,'x');
insert into foo2 values(2,'y');
insert into foo2 values(2,'z');

create function funcfoo1(setof foo1) returns setof foo2 as
  'select * from foo2 where foo2.f1 = $1.f1' language 'sql';

select * from funcfoo1(foo1);
 f1 | f2
----+----
  1 | w
  1 | x
  2 | y
  2 | z

select * from funcfoo1((select * from foo1 where f1=1) as t);
 f1 | f2
----+----
  1 | w
  1 | x

What do you think?
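Under the proposed semantics, funcfoo1(foo1) would fire once per foo1 row and concatenate the results -- roughly this Python model (a sketch of the intended behavior, not an implementation; the names come from the contrived example above):

```python
# Model of a "setof"-input table function: the executor feeds it one
# row of the FROM-item argument at a time and unions the result sets.

foo1 = [(1, "a"), (2, "b"), (3, "c")]
foo2 = [(1, "w"), (1, "x"), (2, "y"), (2, "z")]

def funcfoo1(row):
    """select * from foo2 where foo2.f1 = $1.f1"""
    return [r for r in foo2 if r[0] == row[0]]

def call_with_setof(func, rows):
    out = []
    for row in rows:          # fire the function once per input row
        out.extend(func(row))
    return out

print(call_with_setof(funcfoo1, foo1))
# -> [(1, 'w'), (1, 'x'), (2, 'y'), (2, 'z')]
print(call_with_setof(funcfoo1, [r for r in foo1 if r[0] == 1]))
# -> [(1, 'w'), (1, 'x')]
```

The second call models the subselect-with-AS-clause variant: the input row set is filtered before the function ever runs.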
> (Again you could wrap this in an outer SELECT to transform the
> member_ids to member_names.)
>
> The real problem with the Berkeley approach shows up when you consider
> what happens with multiple table functions called in a single SELECT.
> The code we currently have produces the cross-product of the implied
> rows (or at least it tries to, I seem to recall that it doesn't
> necessarily get it right). That's pretty unpleasant, and though you
> can filter the rows in an outer SELECT, there's no way to optimize the
> implementation into a smarter-than-nested-loop join.

What if there were a way to declare that a table function returns
sorted results, and on which column(s)?

> It seems like somehow we need a level of FROM/WHERE producing some
> base rows, and then a set of table function calls to apply to each of
> the base rows, and then another level of WHERE to filter the results
> of the function calls (in particular to provide join conditions to
> identify which rows to match up in the function outputs). I don't see
> any way to do this without inventing new SELECT clauses out of whole
> cloth ... unless SQL99's WITH clause helps, but I don't think it does ...

Is this still needed given my approach above?

>> How ugly/difficult would it be to allow the planner to interrogate
>> the function and let the function report back a tupledesc based on
>> the actual runtime input parameters?
>
> Parse-time, not run-time. It could be done --- IIRC, the auxiliary
> "function info" call we introduced in the V1 fmgr protocol was
> deliberately designed to allow expansion in this sort of direction.
> But it would have to take a tupledesc (or some similar static
> description) and return another one.

Nice! I'll dig in to that a bit.

Thanks,

Joe

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/users-lounge/docs/faq.html
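If a function could declare its output sort order, the executor could pair up two function results with a merge join instead of filtering a cross-product. A toy illustration of that payoff (merge_join and the sample rows are invented for this sketch):

```python
# Two key-sorted inputs can be matched in one coordinated pass,
# instead of producing every combination and filtering afterwards.

def merge_join(left, right, key=lambda r: r[0]):
    """Pair rows with equal keys from two inputs sorted on that key."""
    out, j = [], 0
    for l in left:
        while j < len(right) and key(right[j]) < key(l):
            j += 1                      # skip right rows below this key
        k = j
        while k < len(right) and key(right[k]) == key(l):
            out.append((l, right[k]))   # emit each matching pair
            k += 1
    return out

left = [(1, "a"), (2, "b"), (3, "c")]   # sorted on column 0
right = [(1, "w"), (1, "x"), (2, "y")]  # sorted on column 0
print(merge_join(left, right))
# -> [((1, 'a'), (1, 'w')), ((1, 'a'), (1, 'x')), ((2, 'b'), (2, 'y'))]
```

The planner could only pick such a plan if it knew the functions' sort order in advance, which is exactly what the declaration would provide.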
Re: [HACKERS] proposal: array utility functions phase 1
Joe Conway <[EMAIL PROTECTED]> writes:
> Yes, this is exactly what I was yearning to do. Was there a spec or
> technical reason (or both) for not allowing the following?
>   select * from array_values(g.grolist), pg_group g where g.groname = 'g2';

This seems fairly unworkable to me as-is. By definition, WHERE selects
from a cross-product of the FROM tables; to make the above do what you
want, you'd have to break that fundamental semantics. The semantics of
explicit JOIN cases would be broken too.

What we need is some kind of explicit multi-level SELECT operation.
Perhaps it would help to think about the analogy of aggregates of
aggregate functions, which are impossible to express properly in a
single SELECT but work nicely given subselect-in-FROM.

Subselect-in-FROM doesn't seem to get this job done though. Right
offhand I don't see any reasonable syntax other than
function-in-the-SELECT-list, which shoots us right back into the
messinesses of the Berkeley implementation. However, we do now have the
precedent of the table-function AS clause. Does it help any to do
something like

  SELECT grosysid, array_values(grolist) AS (array_index, member_id)
  FROM pg_group where groname = 'g2';

(Again you could wrap this in an outer SELECT to transform the
member_ids to member_names.)

The real problem with the Berkeley approach shows up when you consider
what happens with multiple table functions called in a single SELECT.
The code we currently have produces the cross-product of the implied
rows (or at least it tries to, I seem to recall that it doesn't
necessarily get it right). That's pretty unpleasant, and though you can
filter the rows in an outer SELECT, there's no way to optimize the
implementation into a smarter-than-nested-loop join.

It seems like somehow we need a level of FROM/WHERE producing some base
rows, and then a set of table function calls to apply to each of the
base rows, and then another level of WHERE to filter the results of the
function calls (in particular to provide join conditions to identify
which rows to match up in the function outputs). I don't see any way to
do this without inventing new SELECT clauses out of whole cloth ...
unless SQL99's WITH clause helps, but I don't think it does ...

> How ugly/difficult would it be to allow the planner to interrogate the
> function and let the function report back a tupledesc based on the
> actual runtime input parameters?

Parse-time, not run-time. It could be done --- IIRC, the auxiliary
"function info" call we introduced in the V1 fmgr protocol was
deliberately designed to allow expansion in this sort of direction. But
it would have to take a tupledesc (or some similar static description)
and return another one.

			regards, tom lane
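The cross-product behavior Tom describes can be made concrete with a small Python sketch (f, g, the base rows, and the join condition are all invented stand-ins for two table functions called in one SELECT):

```python
# Berkeley-style semantics: two table functions in one SELECT yield the
# cross-product of their result rows per base row; only an outer WHERE
# can pair them back up, which is a nested-loop join by construction.

def f(n):                        # first table function
    return [(n, i) for i in range(2)]

def g(n):                        # second table function
    return [(n, i * 10) for i in range(2)]

base = [1, 2]                    # rows produced by the base FROM/WHERE

# cross-product of both functions' outputs, for each base row
cross = [(a, b) for n in base for a in f(n) for b in g(n)]

# outer WHERE acting as the join condition to match rows up
joined = [(a, b) for (a, b) in cross if a[1] * 10 == b[1]]

print(len(cross))                # 8 rows materialized before filtering
print(joined)
```

Even though only half the pairs survive the filter, all eight are produced first -- that is the optimization opportunity a smarter join strategy would remove.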
Re: [HACKERS] proposal: array utility functions phase 1
Tom Lane wrote:
> This crystallizes something that has been bothering me for awhile: the
> table function syntax is severely hobbled (not to say crippled :-() by
> the fact that the function arguments have to be constants. You really
> don't want to have to invent intermediate functions every time you
> want a slightly different query --- yet this technique seems to
> require *two* bespoke functions for every query, one on each end of
> the array_values() function.

It did for me too. I was thinking along these lines while working on
the connectby function, but this work really makes it clear.

> The original Berkeley syntax, messy as it was, at least avoided this
> problem. For example, I believe this same problem could be solved
> (approximately) with
>
>   select array_values(grolist) from pg_group where groname = 'g2'

Yes, this is exactly what I was yearning to do. Was there a spec or
technical reason (or both) for not allowing the following?

  select * from array_values(g.grolist), pg_group g where g.groname = 'g2';

It seems like you could treat it like a one-to-many join between
pg_group and the function. I'm sure this is a bad idea and breaks down
for more complex examples, but I have often found myself wishing I
could do exactly that.

> I think we ought to try to find a way that table functions could be
> used with inputs that are taken from tables. In a narrow sense you can
> do this already, with a sub-SELECT:
>
>   select * from my_table_func((select x from ...));
>
> but (a) the sub-select can only return a single value, and (b) you
> can't get at any of the other columns in the row the sub-select is
> selecting. For instance it won't help me much to do
>
>   select * from array_values((select grolist from pg_group where groname = 'g2'))
>
> if I want to show the group's grosysid as well.

You could do something like:

  select * from array_values('pg_group','grolist') ...

and repeat the rest of pg_group's columns for each row produced from
grolist in the output (this is closer to what Nigel did, IIRC). This
even works in the current table function implementation. It does not
get around the issue of specifying query-time column refs though.

> I know I'm not explaining this very well (I'm only firing on one
> cylinder today :-(), but basically I think we need to step back and
> take another look at the mechanism before we start inventing tons of
> helper functions to make up for the lack of adequate mechanism.

Nope, you're explaining it just fine -- it's what I've been thinking
for a while, but couldn't articulate myself.

> As for array_values() itself, it seems fairly inelegant to rely on the
> user to get the input and output types to match up. Now that we have
> an "anyarray" pseudotype, I think it wouldn't be unreasonable to hack
> up some kluge in the parser to allow reference to the element type of
> such an argument --- that is, you'd say something like
>
>   create function array_values(anyarray) returns setof anyarray_element
>
> and the parser would automatically understand what return type to
> assign to any particular use of array_values. (Since type resolution
> is done bottom-up, I see no logical difficulty here, though the
> implementation might be a bit of a wart...)

That doesn't quite work as written (you'd have to account for the array
index column or lose it -- which loses any ability to get position in
the array), and it has even more problems with the
array_values('pg_group','grolist') approach.

How ugly/difficult would it be to allow the planner to interrogate the
function and let the function report back a tupledesc based on the
actual runtime input parameters? Kind of a special mode of function
call that the function could detect and respond to differently than
during execution (to avoid excessive runtime and/or side effects --
just form a tupledesc and return it). Then the planner could move
forward without requiring a specific declared return composite type or
a return type of record with a runtime query column definition.
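The "special mode of function call" could look something like this Python sketch (purely hypothetical -- the real hook would live in the fmgr protocol; DESCRIBE, the column names, and the type strings are all invented for illustration):

```python
# Sketch of a describe-mode call: the planner invokes the function with
# a static input description and gets back only a tuple descriptor,
# without the function actually executing.

DESCRIBE = object()   # sentinel marking the describe-mode call

def array_values(arg, mode=None):
    if mode is DESCRIBE:
        # arg is a static description: the array input's type name;
        # report the result columns derived from its element type
        elem_type = arg.rstrip("[]")
        return [("array_index", "int"), ("value", elem_type)]
    # normal execution: expand the array into (subscript, element) rows
    return list(enumerate(arg, start=1))

# planner side: obtain the tupledesc from the input type alone
print(array_values("int[]", mode=DESCRIBE))
# -> [('array_index', 'int'), ('value', 'int')]

# executor side: the ordinary call
print(array_values([101, 102, 103]))
# -> [(1, 101), (2, 102), (3, 103)]
```

The point of the sentinel is that the describe path forms a descriptor and returns immediately, so there is no runtime cost or side effect at plan time.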
Joe
Re: [HACKERS] proposal: array utility functions phase 1
Joe Conway <[EMAIL PROTECTED]> writes:
> [ much snipped ]
> The first function borrows from an idea Nigel Andrews had -- i.e.
> expand an array into rows (and possibly columns). It currently works
> like this:
> -- 1D array
> test=# select * from array_values('{101,102,103,104}'::int[]) as (a int, b1 int);

> Now on to the TODO item. Given the array_values() function, here's
> what I was thinking of to implement listing members of a group:
> CREATE OR REPLACE FUNCTION pg_get_grolist(text) RETURNS INT[] AS
>   'SELECT grolist FROM pg_group WHERE groname = $1' LANGUAGE 'sql';
> CREATE TYPE pg_grolist_rec AS (array_index int, member_name text);
> CREATE OR REPLACE FUNCTION group_list(text) RETURNS SETOF pg_grolist_rec AS
>   'SELECT g.id, pg_get_userbyid(g.usesysid)::text AS member_name FROM
>   array_values(pg_get_grolist($1)) AS g(id int, usesysid int)' LANGUAGE 'sql';

This crystallizes something that has been bothering me for awhile: the
table function syntax is severely hobbled (not to say crippled :-() by
the fact that the function arguments have to be constants. You really
don't want to have to invent intermediate functions every time you want
a slightly different query --- yet this technique seems to require
*two* bespoke functions for every query, one on each end of the
array_values() function.

The original Berkeley syntax, messy as it was, at least avoided this
problem. For example, I believe this same problem could be solved
(approximately) with

  select array_values(grolist) from pg_group where groname = 'g2'

--- getting the users shown as names instead of numbers would take an
extra level of SELECT, but I leave the details to the reader.

I think we ought to try to find a way that table functions could be
used with inputs that are taken from tables. In a narrow sense you can
do this already, with a sub-SELECT:

  select * from my_table_func((select x from ...));

but (a) the sub-select can only return a single value, and (b) you
can't get at any of the other columns in the row the sub-select is
selecting. For instance it won't help me much to do

  select * from array_values((select grolist from pg_group where groname = 'g2'))

if I want to show the group's grosysid as well.

I know I'm not explaining this very well (I'm only firing on one
cylinder today :-(), but basically I think we need to step back and
take another look at the mechanism before we start inventing tons of
helper functions to make up for the lack of adequate mechanism.

As for array_values() itself, it seems fairly inelegant to rely on the
user to get the input and output types to match up. Now that we have an
"anyarray" pseudotype, I think it wouldn't be unreasonable to hack up
some kluge in the parser to allow reference to the element type of such
an argument --- that is, you'd say something like

  create function array_values(anyarray) returns setof anyarray_element

and the parser would automatically understand what return type to
assign to any particular use of array_values. (Since type resolution is
done bottom-up, I see no logical difficulty here, though the
implementation might be a bit of a wart...)

			regards, tom lane
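The bottom-up resolution Tom alludes to can be mimicked in a few lines (a toy resolver; the expression shape, "anyarray_element" behavior, and type-name strings are invented for this sketch):

```python
# Toy bottom-up type resolution: resolve the argument's type first,
# then derive the call's result type from it -- the way a parser kluge
# for "returns setof anyarray_element" would have to work.

def resolve(expr, column_types):
    if isinstance(expr, str):            # a column reference
        return column_types[expr]
    func, arg = expr                     # a one-argument function call
    arg_type = resolve(arg, column_types)
    if func == "array_values":           # anyarray -> setof anyarray_element
        if not arg_type.endswith("[]"):
            raise TypeError(f"{arg_type} is not an array type")
        return "setof " + arg_type[:-2]
    raise NotImplementedError(func)

# grolist is int[], so array_values(grolist) resolves to setof int
print(resolve(("array_values", "grolist"), {"grolist": "int[]"}))
# -> setof int
```

Because the argument type is known before the call's type is assigned, there is no circularity -- which is why Tom sees "no logical difficulty" in the scheme.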
[HACKERS] proposal: array utility functions phase 1
I'm working on the TODO item "Allow easy display of usernames in a
group" in the context of a slightly larger effort to improve usability
of arrays. I'm far enough down the road to have a better idea of where
I want to go with this, but I'd like to vet those ideas with the list
so I don't waste too much effort if everyone hates them ;-)

The first function borrows from an idea Nigel Andrews had -- i.e.
expand an array into rows (and possibly columns). It currently works
like this:

-- 1D array
test=# select * from array_values('{101,102,103,104}'::int[]) as (a int, b1 int);
 a | b1
---+-----
 1 | 101
 2 | 102
 3 | 103
 4 | 104
(4 rows)

CREATE TABLE arr_text(f1 int, f2 text[]);
INSERT INTO arr_text VALUES (1, '{"a","b","c"}');
UPDATE arr_text SET f2[-2:0] = '{"x","y","z"}' WHERE f1 = 1;
CREATE OR REPLACE FUNCTION get_arr_text(int) RETURNS text[] AS
  'SELECT f2 FROM arr_text WHERE f1 = $1' LANGUAGE 'sql';

test=# select * from array_values(get_arr_text(1)) as (a int, b1 text);
 a  | b1
----+----
 -2 | x
 -1 | y
  0 | z
  1 | a
  2 | b
  3 | c
(6 rows)

-- 2D array
test=# select * from array_values('{{1,2,3,4},{5,6,7,8}}'::int[]) as (a int, b1 int, b2 int, b3 int, b4 int);
 a | b1 | b2 | b3 | b4
---+----+----+----+----
 1 |  1 |  2 |  3 |  4
 2 |  5 |  6 |  7 |  8
(2 rows)

It accepts type anyarray, and returns record. The first column
preserves the array subscript for the 1st dimension.

One question I have is this: what, if anything, should be done with 3
(and higher) dimension arrays? I was considering returning 2 columns --
the 1st dimension array subscript, and a 2nd column containing the
sub-array left over. E.g.:

array_values('{{{111,112},{121,122}},{{211,212},{221,222}}}'::int[])

would become:

 a | b1
---+-----------------------
 1 | {{111,112},{121,122}}
 2 | {{211,212},{221,222}}

Does this make sense, or is something else better, or would it be
better not to support 3 dim arrays and up?

Now on to the TODO item.
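The expansion rules demonstrated earlier -- the first column carries the first-dimension subscript, which need not start at 1, and 2-D arrays spread across columns -- can be modeled in Python (a behavioral sketch only; the real function is C code working on PostgreSQL anyarray values, and the lower_bound parameter is an invented stand-in for the array's stored lower bound):

```python
# Behavioral model of array_values(): emit one row per first-dimension
# subscript, preserving the array's (possibly non-default) lower bound.

def array_values(arr, lower_bound=1):
    rows = []
    for i, elem in enumerate(arr, start=lower_bound):
        if isinstance(elem, list):      # 2-D: spread sub-array across columns
            rows.append((i, *elem))
        else:                           # 1-D: one subscript, one value
            rows.append((i, elem))
    return rows

print(array_values([101, 102, 103, 104]))
# -> [(1, 101), (2, 102), (3, 103), (4, 104)]
print(array_values(["x", "y", "z", "a", "b", "c"], lower_bound=-2))
# -> [(-2, 'x'), (-1, 'y'), (0, 'z'), (1, 'a'), (2, 'b'), (3, 'c')]
print(array_values([[1, 2, 3, 4], [5, 6, 7, 8]]))
# -> [(1, 1, 2, 3, 4), (2, 5, 6, 7, 8)]
```

The second call mirrors the f2[-2:0] example, where the lower bound of the stored array is -2 rather than 1.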
Given the array_values() function, here's what I was thinking of to
implement listing members of a group:

CREATE OR REPLACE FUNCTION pg_get_grolist(text) RETURNS INT[] AS
  'SELECT grolist FROM pg_group WHERE groname = $1' LANGUAGE 'sql';

CREATE TYPE pg_grolist_rec AS (array_index int, member_name text);

CREATE OR REPLACE FUNCTION group_list(text) RETURNS SETOF pg_grolist_rec AS
  'SELECT g.id, pg_get_userbyid(g.usesysid)::text AS member_name
   FROM array_values(pg_get_grolist($1)) AS g(id int, usesysid int)'
  LANGUAGE 'sql';

test=# select * from pg_group;
 groname | grosysid |    grolist
---------+----------+---------------
 g1      |      100 | {100,101}
 g2      |      101 | {100,101,102}
(2 rows)

test=# select * from group_list('g2');
 array_index | member_name
-------------+-------------
           1 | user1
           2 | user2
           3 | user3

pg_get_grolist(text) is intended for internal use, as is the
pg_grolist_rec composite type. group_list() is intended as the
user-facing table function. I would implement this by running the three
statements above during initdb. Any comments or objections WRT object
names or the method of implementation?

I don't think this is a very speed-critical application, but even using
the sql functions it is very fast:

test=# explain analyze select * from group_list('g2');
                          QUERY PLAN
---------------------------------------------------------------------
 Function Scan on group_list  (cost=0.00..12.50 rows=1000 width=36)
   (actual time=1.49..1.50 rows=3 loops=1)
 Total runtime: 1.55 msec
(2 rows)

I have more planned beyond the above, as outlined in an earlier post
(see http://archives.postgresql.org/pgsql-hackers/2002-11/msg01213.php).
Next on my list will be a split() function (as discussed in early
September) that creates an array from an input string by splitting on a
given delimiter. This is similar to functions in perl, php, and
undoubtedly other languages. It should work nicely in conjunction with
array_values().

Sorry for the long mail and thanks for any feedback!

Joe
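The planned split() would behave much like Python's str.split -- a sketch of the intended pairing with array_values() (the SQL function itself would take text and a delimiter and return an array; the names here just mirror that):

```python
# Model of the proposed split(): cut an input string on a delimiter,
# producing an array that array_values() could then expand into rows.

def split(s: str, delim: str):
    """split('100,101,102', ',') -> ['100', '101', '102']"""
    return s.split(delim)

parts = split("100,101,102", ",")
print(parts)
# -> ['100', '101', '102']

# feeding the result through an array_values-style expansion gives
# subscript/value rows
print(list(enumerate(parts, 1)))
# -> [(1, '100'), (2, '101'), (3, '102')]
```

Chaining the two functions turns a delimited string directly into a row set, which is the "works nicely in conjunction" case described above.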