Hi

Following issue exists with PL/Python: when your function takes array as
input parameters, processing arrays of fixed-size elements containing null
values is many times slower than processing same array without nulls. Here
is an example:

-- Function

create or replace function test(a int8[]) returns int8 as $BODY$
return sum([x for x in a if x is not None])
$BODY$ language plpythonu volatile;

pl_regression=# select test(array_agg(a)::int8[])
pl_regression-#     from (
pl_regression(#         select generate_series(1,100000) as a
pl_regression(#         ) as q;
    test
------------
 5000050000
(1 row)

Time: 22.248 ms
pl_regression=# select test(array_agg(a)::int8[])
pl_regression-#     from (
pl_regression(#         select generate_series(1,100000) as a
pl_regression(#         union all
pl_regression(#         select null::int8 as a
pl_regression(#         ) as q;
    test
------------
 5000050000
(1 row)

Time: 7179.921 ms


As you can see, single null in array introduces 320x slowdown. The reason
for this is following:
Original implementation uses array_ref for each element of the array. Each
call to array_ref causes subsequent call to array_seek. Function array_seek
in turn has a shortcut for fixed-size arrays with no nulls. But if your
array is not of fixed-size elements, or if it contains nulls, each call to
array_seek would cause calculation of the Kth element offset starting from
the first element. This is O(N^2) algorithm, resulting in high processing
time for arrays of non-fixed-size elements and arrays with nulls.

The fix I propose applies same logic used at array_out function for
efficient array traversal, keeping the pointer to the last fetched
element's offset, which results in dramatical performance improvement for
affected cases. With this implementation, both arrays of fixed-size
elements without nulls, fixed-size elements with nulls and variable-size
elements are processed with the same speed. Here is the test after this fix
is applied:

pl_regression=# select test(array_agg(a)::int8[])
pl_regression-#     from (
pl_regression(#         select generate_series(1,100000) as a
pl_regression(#         ) as q;
    test
------------
 5000050000
(1 row)

Time: 21.056 ms
pl_regression=# select test(array_agg(a)::int8[])
pl_regression-#     from (
pl_regression(#         select generate_series(1,100000) as a
pl_regression(#         union all
pl_regression(#         select null::int8 as a
pl_regression(#         ) as q;
    test
------------
 5000050000
(1 row)

Time: 22.839 ms


-- 
Best regards,
Alexey Grishchenko

Attachment: 0001-Fix-for-PL-Python-slow-input-arrays-traversal-issue.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to