Re: JSONB index not in use, but is TOAST the real cause of slow query?

2022-05-30 Thread Jeff Janes
On Sat, May 28, 2022 at 1:54 PM Shaheed Haque wrote:


> And how can I understand the dreadful amount of
> time (of course, this is just on my dev machine, but still...)? Is
> there a way to see/tweak what TOAST costs or indeed to confirm if it
> is even in use?
>

Turn on track_io_timing, and then do explain (analyse, buffers).  This
won't distinguish TOAST from main table IO, but will at least confirm if a
large amount of IO is happening and how long it is taking.

You can look at pg_statio_user_tables before and after the query to
distinguish TOAST, but this won't give you the timing, just the block
counts.  And the counts might also get incremented by concurrent users, so
on a busy system it is hard to interpret.
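
For example (a rough sketch; the relevant counters live in the
pg_statio_user_tables view):

-- Snapshot the counters, run the query, then run this again and diff.
SELECT heap_blks_read, heap_blks_hit,
       toast_blks_read, toast_blks_hit
  FROM pg_statio_user_tables
 WHERE relname = 'paiyroll_payrun';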

Are these queries fast upon repeat execution?  A few meg times less than
1000 rows is just not that much with modern hardware; you should be able to
just cache it all in RAM.

Cheers,

Jeff


Re: JSONB index not in use, but is TOAST the real cause of slow query?

2022-05-29 Thread Shaheed Haque
On Sun, 29 May 2022, 15:58 Tom Lane wrote:

> Shaheed Haque  writes:
> > Unfortunately, the real query which I think should behave very
> > similarly is still at the several-seconds level despite using the
> > index. ...
>
> > -> Bitmap Heap Scan on paiyroll_payrun (cost=26.88..30.91 rows=1
> > width=4) (actual time=32.488..2258.891 rows=62 loops=1)
> >   Recheck Cond: ((company_id = 173) AND ((snapshot ->
> > 'employee'::text) ? '16376'::text))
> >   Filter: (((snapshot #>
> > '{employee,16376,last_run_of_employment}'::text[]) <> 'true'::jsonb)
> > OR ((snapshot #> '{employee,16376,pay_graph}'::text[]) <> '0'::jsonb)
> > OR ((snapshot #> '{employee,16376,state,employment,-1,2}'::text[]) <=
> > '0'::jsonb))
> >   Heap Blocks: exact=5
> > -> BitmapAnd (cost=26.88..26.88 rows=1 width=0) (actual
> > time=0.038..0.039 rows=0 loops=1)
> >   -> Bitmap Index Scan on paiyroll_payrun_company_id_ce341888
> > (cost=0.00..6.56 rows=304 width=0) (actual time=0.016..0.016 rows=304
> > loops=1)
> > Index Cond: (company_id = 173)
> >   -> Bitmap Index Scan on idx1 (cost=0.00..20.07 rows=9
> > width=0) (actual time=0.021..0.021 rows=62 loops=1)
> > Index Cond: ((snapshot -> 'employee'::text) ?
> '16376'::text)
>
> > IIUC, at the bottom, the indices are doing their thing, but a couple
> > of layers up, the "Bitmap Heap Scan" jumps from ~30ms to 2200ms. But I
> > cannot quite see why.
>
> I suppose it's the execution of that "Filter" condition, which will
> require perhaps as many as three fetches of the "snapshot" column.
>

Thanks, that's clearly in the frame.

> You really need to rethink that data structure.  Sure, you can store tons
> of unorganized data in a jsonb column, but you pay for that convenience
> with slow access.  Normalizing the bits you need frequently into a more
> traditional relational schema is the route to better-performing queries.
>

Ack. Indeed, the current design works very well for all of the access
patterns other than this one, which only recently came into view as a
problem.

Ahead of contemplating a design change, I have been looking at how to
optimise this bit. I'm currently mired in a crash course on SQL syntax as it
pertains to JSONB, jsonpath et al., and the equally mysterious side effects
of "?" and "@>" and so on in terms of the amount of data being fetched
(and all wrapped in a dose of ORM for good measure).
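
For the record, the two flavours I'm poking at look roughly like this
(hand-sketched, so the details may not match the real query exactly):

-- Key-existence test; supported by the default jsonb_ops GIN opclass
-- used by idx1.
SELECT actual_t FROM paiyroll_payrun
 WHERE (snapshot -> 'employee') ? '16376';

-- Containment test; supported by both jsonb_ops and jsonb_path_ops.
SELECT actual_t FROM paiyroll_payrun
 WHERE (snapshot -> 'employee') @> '{"16376": {}}';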

I'll write separately with more specific questions if needed on those
details.

Thanks again for the kind help.

Shaheed


> regards, tom lane
>


Re: JSONB index not in use, but is TOAST the real cause of slow query?

2022-05-29 Thread Tom Lane
Shaheed Haque  writes:
> Unfortunately, the real query which I think should behave very
> similarly is still at the several-seconds level despite using the
> index. ...

> -> Bitmap Heap Scan on paiyroll_payrun (cost=26.88..30.91 rows=1
> width=4) (actual time=32.488..2258.891 rows=62 loops=1)
>   Recheck Cond: ((company_id = 173) AND ((snapshot ->
> 'employee'::text) ? '16376'::text))
>   Filter: (((snapshot #>
> '{employee,16376,last_run_of_employment}'::text[]) <> 'true'::jsonb)
> OR ((snapshot #> '{employee,16376,pay_graph}'::text[]) <> '0'::jsonb)
> OR ((snapshot #> '{employee,16376,state,employment,-1,2}'::text[]) <=
> '0'::jsonb))
>   Heap Blocks: exact=5
> -> BitmapAnd (cost=26.88..26.88 rows=1 width=0) (actual
> time=0.038..0.039 rows=0 loops=1)
>   -> Bitmap Index Scan on paiyroll_payrun_company_id_ce341888
> (cost=0.00..6.56 rows=304 width=0) (actual time=0.016..0.016 rows=304
> loops=1)
> Index Cond: (company_id = 173)
>   -> Bitmap Index Scan on idx1 (cost=0.00..20.07 rows=9
> width=0) (actual time=0.021..0.021 rows=62 loops=1)
> Index Cond: ((snapshot -> 'employee'::text) ? '16376'::text)

> IIUC, at the bottom, the indices are doing their thing, but a couple
> of layers up, the "Bitmap Heap Scan" jumps from ~30ms to 2200ms. But I
> cannot quite see why.

I suppose it's the execution of that "Filter" condition, which will
require perhaps as many as three fetches of the "snapshot" column.

You really need to rethink that data structure.  Sure, you can store tons
of unorganized data in a jsonb column, but you pay for that convenience
with slow access.  Normalizing the bits you need frequently into a more
traditional relational schema is the route to better-performing queries.
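
For illustration only (invented names, guessed types), the bits that the
slow filter touches could be pulled out into a side table along these lines:

-- Hypothetical normalized table; one row per (payrun, employee).
CREATE TABLE paiyroll_payrun_employee (
    payrun_id   integer NOT NULL
        REFERENCES paiyroll_payrun (process_ptr_id),
    employee_id integer NOT NULL,
    last_run_of_employment boolean,
    pay_graph   integer,
    PRIMARY KEY (payrun_id, employee_id)
);

-- The filter then becomes plain column comparisons the planner can
-- estimate, and that never touch the big JSONB value.
CREATE INDEX ON paiyroll_payrun_employee (employee_id);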

regards, tom lane




Re: JSONB index not in use, but is TOAST the real cause of slow query?

2022-05-28 Thread Shaheed Haque
Adrian, Tom, thanks for the input. Based on that, it occurred to me to
create some "dummy" rows and - almost magically - the index kicks in!
Before is 2500ms:


Seq Scan on paiyroll_payrun  (cost=0.00..52.43 rows=17 width=32)
(actual time=53.127..2567.024 rows=104 loops=1)
  Filter: ((snapshot -> 'employee'::text) ? '2209'::text)
  Rows Removed by Filter: 1835
Planning Time: 0.060 ms
Execution Time: 2567.044 ms
(5 rows)


After is 300ms:


Bitmap Heap Scan on paiyroll_payrun  (cost=36.11..64.67 rows=14
width=32) (actual time=4.189..311.932 rows=104 loops=1)
  Recheck Cond: ((snapshot -> 'employee'::text) ? '2209'::text)
  Rows Removed by Index Recheck: 1
  Heap Blocks: exact=8
  ->  Bitmap Index Scan on idx1  (cost=0.00..36.10 rows=14 width=0)
(actual time=0.087..0.087 rows=105 loops=1)
Index Cond: ((snapshot -> 'employee'::text) ? '2209'::text)
Planning Time: 0.167 ms
Execution Time: 311.962 ms
(8 rows)


Woot!

Unfortunately, the real query which I think should behave very
similarly is still at the several-seconds level despite using the
index. Before 3600ms:


SELECT "paiyroll_payrun"."actual_t"
  FROM "paiyroll_payrun"
 WHERE ("paiyroll_payrun"."company_id" = 173 AND
("paiyroll_payrun"."snapshot" -> 'employee') ? '16376'
   AND NOT (("paiyroll_payrun"."snapshot" #>
ARRAY['employee','16376','last_run_of_employment']) = 'true'
   AND ("paiyroll_payrun"."snapshot" #>
ARRAY['employee','16376','pay_graph']) = '0'
   AND ("paiyroll_payrun"."snapshot" #>
ARRAY['employee','16376','state','employment','-1','2']) > '0'))
 ORDER BY "paiyroll_payrun"."actual_t" DESC
 LIMIT 1



Limit (cost=31.33..31.33 rows=1 width=4) (actual
time=3595.174..3595.176 rows=1 loops=1)
  -> Sort (cost=31.33..31.33 rows=3 width=4) (actual
time=3595.174..3595.174 rows=1 loops=1)
Sort Key: actual_t DESC
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on paiyroll_payrun (cost=6.43..31.31 rows=3
width=4) (actual time=44.575..3595.082 rows=62 loops=1)
   Recheck Cond: (company_id = 173)
   Filter: (((snapshot -> 'employee'::text) ? '16376'::text) AND
(((snapshot #> '{employee,16376,last_run_of_employment}'::text[]) <>
'true'::jsonb) OR ((snapshot #> '{employee,16376,pay_graph}'::text[])
<> '0'::jsonb) OR ((snapshot #>
'{employee,16376,state,employment,-1,2}'::text[]) <= '0'::jsonb)))
Rows Removed by Filter: 242
   Heap Blocks: exact=9
 -> Bitmap Index Scan on paiyroll_payrun_company_id_ce341888
(cost=0.00..6.43 rows=304 width=0) (actual time=0.013..0.013 rows=304
loops=1)
 Index Cond: (company_id = 173)
Planning Time: 0.258 ms
Execution Time: 3595.195 ms


After, with the index in use, 2200ms:


Limit (cost=30.92..30.93 rows=1 width=4) (actual
time=2258.989..2258.990 rows=1 loops=1)
  -> Sort (cost=30.92..30.93 rows=1 width=4) (actual
time=2258.988..2258.989 rows=1 loops=1)
Sort Key: actual_t DESC
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on paiyroll_payrun (cost=26.88..30.91 rows=1
width=4) (actual time=32.488..2258.891 rows=62 loops=1)
  Recheck Cond: ((company_id = 173) AND ((snapshot ->
'employee'::text) ? '16376'::text))
  Filter: (((snapshot #>
'{employee,16376,last_run_of_employment}'::text[]) <> 'true'::jsonb)
OR ((snapshot #> '{employee,16376,pay_graph}'::text[]) <> '0'::jsonb)
OR ((snapshot #> '{employee,16376,state,employment,-1,2}'::text[]) <=
'0'::jsonb))
  Heap Blocks: exact=5
-> BitmapAnd (cost=26.88..26.88 rows=1 width=0) (actual
time=0.038..0.039 rows=0 loops=1)
  -> Bitmap Index Scan on paiyroll_payrun_company_id_ce341888
(cost=0.00..6.56 rows=304 width=0) (actual time=0.016..0.016 rows=304
loops=1)
Index Cond: (company_id = 173)
  -> Bitmap Index Scan on idx1 (cost=0.00..20.07 rows=9
width=0) (actual time=0.021..0.021 rows=62 loops=1)
Index Cond: ((snapshot -> 'employee'::text) ? '16376'::text)
Planning Time: 0.245 ms
Execution Time: 2259.019 ms
===

IIUC, at the bottom, the indices are doing their thing, but a couple
of layers up, the "Bitmap Heap Scan" jumps from ~30ms to 2200ms. But I
cannot quite see why. Have I missed a needed index or what? I'm pretty
much a novice at SQL, so it is quite possible I've missed something
very basic.

Thanks, Shaheed

On Sat, 28 May 2022 at 20:59, Tom Lane  wrote:
>
> Shaheed Haque  writes:
> > One last thought about TOAST. If the cost of the -> operator retrieving the
> > data cannot be obviated, is there any way to tweak how that works?
>
> The only thing that's readily tweakable is to disable data compression
> for the out-of-line values (see ALTER TABLE ... SET STORAGE, and note
> that you have to rebuild the table for already-stored values to be
> changed).  It seems unlikely that that will help you much though,
> since doing that would save CPU at the cost of more disk I/O, and
> it seems that the I/O side is your bottleneck already.

Re: JSONB index not in use, but is TOAST the real cause of slow query?

2022-05-28 Thread Tom Lane
Shaheed Haque  writes:
> One last thought about TOAST. If the cost of the -> operator retrieving the
> data cannot be obviated, is there any way to tweak how that works?

The only thing that's readily tweakable is to disable data compression
for the out-of-line values (see ALTER TABLE ... SET STORAGE, and note
that you have to rebuild the table for already-stored values to be
changed).  It seems unlikely that that will help you much though,
since doing that would save CPU at the cost of more disk I/O, and
it seems that the I/O side is your bottleneck already.  It *would*
help if jsonb had logic for partial fetches, because that'd require
uncompressed storage to work.  But AFAICS that's not there yet.
I distinctly recall that that was planned for when we were designing
jsonb's on-disk layout, but I see no code actually doing anything
of that sort ... the -> operator certainly doesn't.
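
Concretely, the knob looks like this (a sketch; EXTERNAL means out-of-line
but uncompressed):

-- Applies to newly stored values only; existing rows keep their current
-- storage until the table is rebuilt, as noted above.
ALTER TABLE paiyroll_payrun
  ALTER COLUMN snapshot SET STORAGE EXTERNAL;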

regards, tom lane




Re: JSONB index not in use, but is TOAST the real cause of slow query?

2022-05-28 Thread Adrian Klaver

On 5/28/22 12:38, Shaheed Haque wrote:

Tom,

Thanks for the considered advice and insights. My takeaway is that
based on what I've said, you are mostly unsurprised by the results I
see. In the longer term, the number of rows will increase but I will
have to ponder options for the immediate future. I'll have a play with
the knobs you suggested and will report back with anything of note.

One last thought about TOAST. If the cost of the -> operator retrieving the
data cannot be obviated, is there any way to tweak how that works?


See:

https://www.postgresql.org/docs/current/storage-toast.html

and from there:

https://www.postgresql.org/docs/current/sql-altertable.html

Search for SET STORAGE.

It might help with (from Tom Lane):

"Unfortunately the planner doesn't account
for detoasting costs when making such estimates, ..."

but probably not much with:

" ...because anything at all that you do with
that big JSONB column is going to be expensive.  (Another thing
that's been on the to-do list for awhile is enabling partial
retrieval of large JSONB values, but AFAIK that hasn't happened
yet either.)"



Thanks, Shaheed




--
Adrian Klaver
adrian.kla...@aklaver.com




Re: JSONB index not in use, but is TOAST the real cause of slow query?

2022-05-28 Thread Shaheed Haque
Tom,

Thanks for the considered advice and insights. My takeaway is that
based on what I've said, you are mostly unsurprised by the results I
see. In the longer term, the number of rows will increase but I will
have to ponder options for the immediate future. I'll have a play with
the knobs you suggested and will report back with anything of note.

One last thought about TOAST. If the cost of the -> operator retrieving the
data cannot be obviated, is there any way to tweak how that works?

Thanks, Shaheed

On Sat, 28 May 2022 at 19:41, Tom Lane  wrote:
>
> Shaheed Haque  writes:
> > ==
> > foo =# explain analyse SELECT snapshot ->'company'->'legal_name' FROM
> > paiyroll_payrun WHERE snapshot ->'employee' ? '2209';
> >   QUERY PLAN
> > -
> > Seq Scan on paiyroll_payrun  (cost=0.00..29.13 rows=9 width=32)
> > (actual time=50.185..2520.983 rows=104 loops=1)
> >   Filter: ((snapshot -> 'employee'::text) ? '2209'::text)
> >   Rows Removed by Filter: 835
> > Planning Time: 0.075 ms
> > Execution Time: 2521.004 ms
> > (5 rows)
> > ==
>
> > So, over 2.5 seconds to sequentially scan ~850 rows. Am I right to
> > presume the INDEX is not used because of the number of rows? Is there
> > a way to verify that?
>
> You could do "set enable_seqscan = off" and see if the EXPLAIN
> results change.  My guess is that you'll find that the indexscan
> alternative is costed at a bit more than 29.13 units and thus
> the planner thinks seqscan is cheaper.
>
> > And how can I understand the dreadful amount of
> > time (of course, this is just on my dev machine, but still...)?
>
> In the seqscan case, the -> operator is going to retrieve the whole
> JSONB value from each row, which of course is pretty darn expensive
> if it's a few megabytes.  Unfortunately the planner doesn't account
> for detoasting costs when making such estimates, so it doesn't
> realize that the seqscan case is going to be expensive.  (Fixing
> that has been on the to-do list for a long time, but we seldom
> see cases where it matters this much, so it hasn't gotten done.)
>
> The problem would likely go away by itself if your table had more
> than a few hundred rows, but if you don't anticipate that happening
> then you need some sort of band-aid.  I don't recommend turning
> enable_seqscan off as a production fix; it'd likely have negative
> effects on other queries.  Personally I'd experiment with reducing
> random_page_cost a bit to see if I could encourage use of the index
> that way.  The default value of 4.0 is tuned for spinning-rust
> storage and is not too appropriate for a lot of modern hardware,
> so there's probably room to fix it that way without detuning your
> setup for other queries.
>
> You should probably also rethink whether you really want to store
> your data in this format, because anything at all that you do with
> that big JSONB column is going to be expensive.  (Another thing
> that's been on the to-do list for awhile is enabling partial
> retrieval of large JSONB values, but AFAIK that hasn't happened
> yet either.)
>
> regards, tom lane




Re: JSONB index not in use, but is TOAST the real cause of slow query?

2022-05-28 Thread Tom Lane
Shaheed Haque  writes:
> ==
> foo =# explain analyse SELECT snapshot ->'company'->'legal_name' FROM
> paiyroll_payrun WHERE snapshot ->'employee' ? '2209';
>   QUERY PLAN
> -
> Seq Scan on paiyroll_payrun  (cost=0.00..29.13 rows=9 width=32)
> (actual time=50.185..2520.983 rows=104 loops=1)
>   Filter: ((snapshot -> 'employee'::text) ? '2209'::text)
>   Rows Removed by Filter: 835
> Planning Time: 0.075 ms
> Execution Time: 2521.004 ms
> (5 rows)
> ==

> So, over 2.5 seconds to sequentially scan ~850 rows. Am I right to
> presume the INDEX is not used because of the number of rows? Is there
> a way to verify that?

You could do "set enable_seqscan = off" and see if the EXPLAIN
results change.  My guess is that you'll find that the indexscan
alternative is costed at a bit more than 29.13 units and thus
the planner thinks seqscan is cheaper.
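
For instance, just for the experiment and only in the current session:

-- Session-local; don't leave this off in production.
SET enable_seqscan = off;

EXPLAIN ANALYSE
SELECT snapshot -> 'company' -> 'legal_name'
  FROM paiyroll_payrun
 WHERE snapshot -> 'employee' ? '2209';

RESET enable_seqscan;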

> And how can I understand the dreadful amount of
> time (of course, this is just on my dev machine, but still...)?

In the seqscan case, the -> operator is going to retrieve the whole
JSONB value from each row, which of course is pretty darn expensive
if it's a few megabytes.  Unfortunately the planner doesn't account
for detoasting costs when making such estimates, so it doesn't
realize that the seqscan case is going to be expensive.  (Fixing
that has been on the to-do list for a long time, but we seldom
see cases where it matters this much, so it hasn't gotten done.)

The problem would likely go away by itself if your table had more
than a few hundred rows, but if you don't anticipate that happening
then you need some sort of band-aid.  I don't recommend turning
enable_seqscan off as a production fix; it'd likely have negative
effects on other queries.  Personally I'd experiment with reducing
random_page_cost a bit to see if I could encourage use of the index
that way.  The default value of 4.0 is tuned for spinning-rust
storage and is not too appropriate for a lot of modern hardware,
so there's probably room to fix it that way without detuning your
setup for other queries.
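
For example, to try it out in a single session before touching
postgresql.conf (1.1 is a common starting point for SSDs, but the right
value is hardware-specific):

SET random_page_cost = 1.1;

EXPLAIN ANALYSE
SELECT snapshot -> 'company' -> 'legal_name'
  FROM paiyroll_payrun
 WHERE snapshot -> 'employee' ? '2209';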

You should probably also rethink whether you really want to store
your data in this format, because anything at all that you do with
that big JSONB column is going to be expensive.  (Another thing
that's been on the to-do list for awhile is enabling partial
retrieval of large JSONB values, but AFAIK that hasn't happened
yet either.)

regards, tom lane




JSONB index not in use, but is TOAST the real cause of slow query?

2022-05-28 Thread Shaheed Haque
Hi,

I have a database table with a modest number of rows (<1000) but where
one column in the table is a JSONB "snapshot" which can be a few MB in
size. Generally, this is a great fit for the read-write access
patterns involved, but there is a read-query which is VERY slow.

I've searched via Google and in this mailing list archive and found
lots of material, but nothing that explains what is going on...or how
to fix it. The table columns look like this:

...a primary key...
...various other keys including a couple of FKs...
snapshot JSONB

and before I did anything, the indices looked like this:

   "paiyroll_payrun_pkey" PRIMARY KEY, btree (process_ptr_id)
   "paiyroll_payrun_company_id_ce341888" btree (company_id)
   "paiyroll_payrun_schedule_id_1593f55f" btree (schedule_id)

The data in one row's "snapshot" looks a bit like this:

{
  "stuff": {},
  "more other stuff": {},
  "employee": {
    "1234": {},
    "56789": {}
  }
}

The query that is slow can be approximated like this:

   SELECT snapshot ->'stuff'->'item' FROM paiyroll_payrun WHERE
snapshot ->'employee' ? '2209';

When I add this index:

CREATE INDEX idx1 ON paiyroll_payrun USING gin ((snapshot ->'employee'));

the analyser says this:

==
foo =# explain analyse SELECT snapshot ->'company'->'legal_name' FROM
paiyroll_payrun WHERE snapshot ->'employee' ? '2209';
  QUERY PLAN
-
Seq Scan on paiyroll_payrun  (cost=0.00..29.13 rows=9 width=32)
(actual time=50.185..2520.983 rows=104 loops=1)
  Filter: ((snapshot -> 'employee'::text) ? '2209'::text)
  Rows Removed by Filter: 835
Planning Time: 0.075 ms
Execution Time: 2521.004 ms
(5 rows)
==

So, over 2.5 seconds to sequentially scan ~850 rows. Am I right to
presume the INDEX is not used because of the number of rows? Is there
a way to verify that? And how can I understand the dreadful amount of
time (of course, this is just on my dev machine, but still...)? Is
there a way to see/tweak what TOAST costs or indeed to confirm if it
is even in use?

Any help appreciated.

Thanks, Shaheed