Hello,
I’ve written an SQL function that takes a single JSON parameter, in this case
an array of objects, each with eight properties (example below). The function
uses json_to_recordset() in a CTE to insert three rows across two tables for
each record. It takes nearly 7 minutes to insert my dataset of 8935 records
(as JSON) via a small Python script (the script itself parses the JSON in
< 1s), on PostgreSQL 13 on my MacBook Air.
As an experiment I wrote a second SQL function with eight parameters, one per
property of the JSON objects, requiring a discrete function call per record.
A Python script inserting all 8935 records that way (making as many calls to
the SQL function) executed in around 2 minutes.
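For reference, the driver side of that per-record experiment looks roughly like
this. This is a minimal sketch, not my actual script: the function name
`journal.create_one` and the exact property names are assumptions, and the
actual database call is left as a comment since it needs a live connection.

```python
import json

# The eight properties the per-record function takes as discrete parameters
# (names assumed to match the keys of the JSON input objects).
COLUMNS = (
    "bank_account_id", "transaction_id", "transaction_date", "posted_date",
    "amount", "description", "parent_account_id", "child_account_id",
)

def records_to_params(raw_json):
    """Parse the JSON array and yield one parameter tuple per record.

    Each tuple is ready to pass to a driver call such as (psycopg2 style):
        cur.execute(
            "SELECT journal.create_one(%s, %s, %s, %s, %s, %s, %s, %s)",
            params,
        )
    i.e. one round trip to the database per record.
    """
    for record in json.loads(raw_json):
        yield tuple(record.get(col) for col in COLUMNS)

# A one-record sample in the same shape as the INPUT example below.
sample = json.dumps([
    {"bank_account_id": 1324, "transaction_id": "abc123",
     "transaction_date": "2020-10-20", "posted_date": "2020-10-21",
     "amount": "12.34", "description": "coffee",
     "parent_account_id": 10, "child_account_id": None},
])
params = list(records_to_params(sample))
```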
I’m very much a novice at interpreting EXPLAIN (ANALYZE) and hoping someone can
help me better optimize my original function. Both the function and the
EXPLAIN (ANALYZE) output are provided below. Is it perhaps a limitation of CTEs
or json_to_recordset(), so that an entirely different approach is necessary
(like the second function, with one call per record)? I was really hoping to
make this work: I’ve written a small API to my database using SQL and PL/pgSQL
functions, each taking a single JSON parameter, and my web backend acts almost
as an HTTP proxy to the database. It’s a different way of doing things as far
as webdev goes, but (for me) an interesting experiment. I like PostgreSQL and
really want to take advantage of it.
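To illustrate the pattern (a minimal sketch, not my actual backend: `run_query`
stands in for whatever driver call the backend makes, and the `$1` placeholder
syntax is driver-dependent, e.g. asyncpg uses `$1` while psycopg2 uses `%s`):

```python
import json

def make_handler(run_query):
    """Return a tiny 'HTTP proxy' endpoint: the backend does no real work,
    it just forwards the request body as the single JSON parameter of a
    database function and returns whatever JSON the function produced."""
    def handle(function_name, body):
        # One placeholder, one parameter: the whole request body as JSON.
        sql = f"SELECT {function_name}($1::json)"
        return run_query(sql, json.dumps(body))
    return handle

# Exercise the handler with a fake run_query that records what it was given.
calls = []
def fake_run_query(sql, param):
    calls.append((sql, param))
    return '{"status": "ok"}'

handle = make_handler(fake_run_query)
result = handle("journal.create_with_categories", {"data": []})
```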
I’ve tried a few “optimizations”. Removing the final SELECT and returning VOID
saves around 1.5 minutes. Removing some extra JOINs saves a little time, but
nothing substantial (the joins against account.t_account in some places check
“ownership” of the record by a given client_id).
Using a subquery seems like it could be much faster than a CTE, but I don’t
know how to insert three rows across two tables using a subquery.
Any advice is appreciated, thank you in advance.
Matt
INPUT:
(The CTE illustrates all the properties and types)
[
  {
    "bank_account_id": 1324,
    "transaction_id": "abc123",
    "transaction_date": "2020-10-20",
    …
  },
  …
]
OUTPUT:
(I’m not sure what I’ve done to create the nested arrays, but it’s unnecessary;
I can shave off ~1.5 minutes by returning VOID.)
[
[
{
"id": 250185
},
{
"id": 250186
},
...
]
]
FUNCTION:
CREATE OR REPLACE FUNCTION journal.create_with_categories(in_json JSON)
RETURNS JSON
AS $$
    WITH data AS (
        SELECT
            (in_json#>>'{context, client_id}')::BIGINT AS client_id,
            nextval(pg_get_serial_sequence('journal', 'id')) AS journal_id,
            bank_transaction_type,
            x.*
        FROM json_to_recordset(in_json->'data')
            AS x (
                bank_account_id   BIGINT,
                transaction_id    TEXT,
                transaction_date  DATE,
                posted_date       DATE,
                amount            finance.monetary,
                description       TEXT,
                parent_account_id BIGINT,
                child_account_id  BIGINT
            ),
            LATERAL bank_account.get_transaction_type_by_id(x.bank_account_id, x.amount)
                AS bank_transaction_type
    ),
    insert_journal_entry AS (
        INSERT INTO journal.journal (
            client_id, id, bank_account_id,
            transaction_id, transaction_date, posted_date,
            description
        )
        SELECT
            client_id, journal_id, bank_account_id,
            transaction_id, transaction_date, posted_date,
            description
        FROM data
    ),
    insert_account_entries AS (
        INSERT INTO journal.account2journal (
            account_id, journal_id, amount, type
        )
        -- T account
        SELECT
            t.id,
            d.journal_id,
            @ d.amount,
            CASE WHEN d.bank_transaction_type = 'debit'::transaction_type
                THEN 'credit'::transaction_type
                ELSE 'debit'::transaction_type
            END
        FROM data d
        LEFT JOIN account.t_account t
            ON (t.id = COALESCE(d.child_account_id, d.parent_account_id))
        WHERE t.client_id = d.client_id OR t.id IS NULL

        UNION ALL

        -- bank account
        SELECT
            t.id, d.journal_id, @ d.amount, d.bank_transaction_type
        FROM data d
        JOIN bank_account.bank_account b
            ON (b.id = d.bank_account_id)
        JOIN account.t_account t
            ON (t.id = b.t_account_id)
        WHERE t.client_id = d.client_id
    )
    SELECT json_agg(d) FROM (SELECT d.journal_id AS id FROM data AS d) AS d;
$$ LANGUAGE sql;
EXPLAIN ANALYZE:
(From logs)
Aggregate (cost=24.24..24.25 rows=1 width=32) (actual
time=388926.249..388926.371 rows=1 loops=1)
Buffers: shared hit=53877 dirtied=2
CTE data
-> Nested Loop (cost=0.26..4.76 rows=100 width=148) (actual
time=183.906..388716.550 rows=8935 loops=1)
Buffers: shared hit=53877 dirtied=2
-> Function Scan on json_to_recordset x (cost=0.01..1.00
rows=100 width=128) (actual time=130.645..142.316 rows=8935 loops=1)
-> Function Scan on get_transaction_type_by_id
bank_transaction_type (cost=0.25..0.26 rows=1 width=4) (actual
time=0.154..0.156 rows=1 loops=8935)
Buffers: shared hit=18054
CTE insert_journal_entry
-> Insert on journal (cost=0.00..2.00 rows=100 width=96) (actual
time=453.563..453.563 rows=0 loops=1)
Buffers: shared hit=79242 dirtied=295
-> CTE Scan on data (cost=0.00..2.00 rows=100 width=96)
(actual time=0.006..10.001 rows=8935 loops=1)
CTE insert_account_entries
-> Insert on account2journal (cost=4.86..15.23 rows=2 width=52)
(actual time=816.381..816.381 rows=0 loops=1)
Buffers: shared hit=159273 dirtied=335 written=17
-> Result (cost=4.86..15.23 rows=2 width=52) (actual
time=0.206..109.222 rows=17870 loops=1)
Buffers: shared hit=5
-> Append (cost=4.86..15.20 rows=2 width=52) (actual
time=0.197..95.060 rows=17870 loops=1)
Buffers: shared hit=5
-> Hash Left Join (cost=4.86..7.14 rows=1
width=52) (actual time=0.195..35.512 rows=8935 loops=1)
Hash Cond: (COALESCE(d_1.child_account_id,
d_1.parent_account_id) = t.id)
Filter: ((t.client_id = d_1.client_id) OR
(t.id IS NULL))
Buffers: shared hit=2
-> CTE Scan on data d_1 (cost=0.00..2.00
rows=100 width=68) (actual time=0.004..6.544 rows=8935 loops=1)
-> Hash (cost=3.27..3.27 rows=127
width=16) (actual time=0.137..0.137 rows=127 loops=1)
Buffers: shared hit=2
-> Seq Scan on t_account t
(cost=0.00..3.27 rows=127 width=16) (actual time=0.026..0.073 rows=127 loops=1)
Buffers: shared hit=2
-> Hash Join (cost=3.80..8.03 rows=1 width=52)
(actual time=40.182..53.796 rows=8935 loops=1)
Hash Cond: ((t_1.id = b.t_account_id) AND
(t_1.client_id = d_2.client_id))
Buffers: shared hit=3
-> Seq Scan on t_account t_1
(cost=0.00..3.27 rows=127 width=16) (actual time=0.022..0.079 rows=127 loops=1)
Buffers: shared hit=2
-> Hash (cost=3.59..3.59 rows=14
width=60) (actual time=40.118..40.118 rows=8935 loops=1)
Buffers: shared hit=1
-> Hash Join (cost=1.32..3.59
rows=14 width=60) (actual time=0.071..17.863 rows=8935 loops=1)
Hash Cond: (d_2.bank_account_id
= b.id)
Buffers: shared hit=1
-> CTE Scan on data d_2
(cost=0.00..2.00 rows=100 width=60) (actual time=0.005..3.740 rows=8935 loops=1)
-> Hash (cost=1.14..1.14
rows=14 width=16) (actual time=0.030..0.030 rows=14 loops=1)
Buffers: shared hit=1
-> Seq Scan on
bank_account b (cost=0.00..1.14 rows=14 width=16) (actual time=0.012..0.016
rows=14 loops=1)
Buffers: shared
hit=1
-> CTE Scan on data d (cost=0.00..2.00 rows=100 width=8) (actual
time=183.918..388812.950 rows=8935 loops=1)
Buffers: shared hit=53877 dirtied=2
Trigger for constraint journal_client_id_fkey on journal: time=194.194
calls=8935
Trigger for constraint journal_bank_account_id_fkey on journal:
time=204.014 calls=8935
Trigger trigger_journal_import_sequence on journal: time=373.344 calls=1
Trigger for constraint account2journal_account_id_fkey on
account2journal: time=580.482 calls=17870
Trigger trigger_debits_equal_credits_on_insert on account2journal:
time=116.653 calls=1