Greetings, The array_accum example aggregate in the user documentation works reasonably on small data sets but doesn't work too hot on large ones. http://www.postgresql.org/docs/8.1/static/xaggr.html
Normally I wouldn't care particularly much but it turns out that PL/R uses arrays for quite a bit (eg: histograms and other statistics functions). I've also heard other complaints about the performance of arrays, though I'm not sure if those were due to array_accum or something else. Long story short, I set out to build a faster array_accum. Much to my suprise and delight, we already *had* one. accumArrayResult() and makeArrayResult()/construct_md_array() appear to do a fantastic job. I've created a couple of 'glue' functions to expose these functions so they can be used in an aggregate. I'm sure they could be improved upon and possibly made even smaller than they already are (90 lines total for both) but I'd like to throw out the idea of including them in core. The aggregate created with them could also be considered for inclusion though I'm less concerned with that. I don't expect general PostgreSQL users would have trouble creating the aggregate- I don't know that the average user would be able or willing to write the C functions. For comparison, the new functions run with: time psql -c "select aaccum(generate_series) from generate_series(1,1000000);" > /dev/null 4.24s real 0.34s user 0.06s system Compared to: time psql -c "select array_accum(generate_series) from generate_series(1,1000000);" > /dev/null ... Well, it's still running and it's been over an hour. The main differences, as I see it, are: accumArrayResult() works in chunks of 64 elements, and uses repalloc(). array_accum uses array_set() which works on individual elements and uses palloc()/memcpy(). I appriciate that this is done because for most cases of array_set() it's not acceptable to modify the input and am not suggesting that be changed. An alternative might be to modify array_set() to check if it is in an aggregate and change its behavior but adding the seperate functions seemed cleaner and much less intrusive to me. Please find the functions attached. Thanks, Stephen
#include "postgres.h" #include "fmgr.h" #include "utils/array.h" #include "nodes/execnodes.h" PG_MODULE_MAGIC; PG_FUNCTION_INFO_V1(aaccum_sfunc); Datum aaccum_sfunc(PG_FUNCTION_ARGS) { int32 totlen; bytea *storage; Datum element; ArrayBuildState *astate = NULL; AggState *aggstate; /* Make sure we are in an aggregate. */ if (!fcinfo->context || !IsA(fcinfo->context, AggState)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("Can not call aaccum_sfunc as a non-aggregate"))); aggstate = (AggState*) fcinfo->context; /* Initial call just passes in NULLs, so just allocate memory * and get set up. */ if (PG_ARGISNULL(0)) { storage = (bytea*) palloc(VARHDRSZ+sizeof(ArrayBuildState*)); storage->vl_len = VARHDRSZ+sizeof(ArrayBuildState*); astate = NULL; memcpy(storage->vl_dat,&astate,sizeof(astate)); } else { storage = PG_GETARG_BYTEA_P(0); } memcpy(&astate,storage->vl_dat,sizeof(astate)); element = PG_GETARG_DATUM(1); astate = accumArrayResult(astate, element, PG_ARGISNULL(1), get_fn_expr_argtype(fcinfo->flinfo, 1), aggstate->aggcontext); memcpy(storage->vl_dat,&astate,sizeof(astate)); PG_RETURN_BYTEA_P(storage); } PG_FUNCTION_INFO_V1(aaccum_ffunc); Datum aaccum_ffunc(PG_FUNCTION_ARGS) { int dims[1]; int lbs[1]; bytea *storage; ArrayBuildState *astate; AggState *aggstate; ArrayType *result; /* Make sure we are in an aggregate. */ if (!fcinfo->context || !IsA(fcinfo->context, AggState)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("Can not call aaccum_sfunc as a non-aggregate"))); aggstate = (AggState*) fcinfo->context; if (PG_ARGISNULL(0)) PG_RETURN_ARRAYTYPE_P(NULL); storage = (bytea*) PG_GETARG_BYTEA_P(0); memcpy(&astate,storage->vl_dat,sizeof(astate)); dims[0] = astate->nelems; lbs[0] = 1; result = construct_md_array(astate->dvalues, astate->dnulls, 1, dims, lbs, astate->element_type, astate->typlen, astate->typbyval, astate->typalign); PG_RETURN_ARRAYTYPE_P(PointerGetDatum(result)); }
---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings