Re: [GENERAL] Drastic select count performance hit when jsonb GIN indices are present

2014-12-26 Thread Maxim Boguk
>
>
>
> Getting back to my original point - you pointed out that for queries that
> need a decent % of the table it will be cheaper to do a scan, which is
> exactly what the query planner does for the relational version. If it only
> needs a small % of the values it looks at the index and for a large % it
> goes for a scan (it also puts everything in shared buffers and is
> lightning quick!). Is this just a lack of maturity in the jsonb planner or
> am I missing something?
>

Hi Anton,

Good selectivity estimators exist only for scalar data types.
For complex data types such as json/jsonb, building a reasonable
selectivity estimator is a very complicated task, so the database can only
guess in these cases.
In your case the database guessed the number of returned rows with a
three-orders-of-magnitude error (estimated 3716 rows, actual 1417152 rows).
Personally, I don't expect serious progress on json/jsonb selectivity
estimators in the near future, so it is better to avoid non-selective
queries (ones that match a large fraction of the rows) against indexed
json/jsonb fields.
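One common workaround (a sketch, assuming the queries filter on a known top-level key such as "event"; the index name here is made up): index the extracted scalar with a btree expression index. ANALYZE collects real statistics for indexed expressions, so the planner gets a realistic row estimate for that predicate and can choose a sequential scan when the match fraction is large:

```sql
-- btree expression index on the extracted scalar key
-- (idx_myevents_event_type is a hypothetical name)
CREATE INDEX idx_myevents_event_type ON myevents ((event ->> 'event'));

-- ANALYZE gathers statistics on the indexed expression itself
ANALYZE myevents;

-- the planner now has a realistic row estimate for this predicate
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM myevents WHERE event ->> 'event' = 'delivered';
```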


-- 
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.ru/ 

Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
Jabber: maxim.bo...@gmail.com
МойКруг: http://mboguk.moikrug.ru/

"People problems are solved with people.
If people cannot solve the problem, try technology.
People will then wish they'd listened at the first stage."


Re: [GENERAL] Drastic select count performance hit when jsonb GIN indices are present

2014-12-26 Thread Anton Melser
Hi Maxim,

Thanks for the suggestions. Unfortunately, it does appear to be an IO issue,
but not one I can get around (on my laptop at least!). Postgres seems to
refuse to put any of the index or table in shared buffers at all. For some
reason, there is a large increase in the space required when each event
(item) gets its own row, as opposed to keeping it in the original items
array within far fewer database rows:

         relation         |  size
--------------------------+---------
 public.myevents          | 3502 MB
...
 pg_toast.pg_toast_57487  | 800 MB
...
 public.events            | 2232 kB
...

myevents is the table where each event has a row to itself; events holds 100
events in an array within each JSON row (the toast table above is where its
data seem to be stored). The events table actually contains more data, since
it still includes the extra paging objects that were removed from myevents.
Performing VACUUM ANALYZE seems to have no effect either.
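For reference, the sizes above can be pulled with the catalog size functions (a sketch; pg_total_relation_size counts heap plus TOAST plus indexes, which is where much of the per-row blow-up hides):

```sql
-- sizes of the two tables from this thread, TOAST and indexes included
SELECT c.relname AS relation,
       pg_size_pretty(pg_relation_size(c.oid))       AS heap,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total
FROM pg_class c
WHERE c.relname IN ('myevents', 'events')
ORDER BY pg_total_relation_size(c.oid) DESC;
```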

Getting back to my original point - you pointed out that for queries that
need a decent % of the table it will be cheaper to do a scan, which is
exactly what the query planner does for the relational version. If it only
needs a small % of the values it looks at the index and for a large % it
goes for a scan (it also puts everything in shared buffers and is
lightning quick!). Is this just a lack of maturity in the jsonb planner or
am I missing something?

Thanks again,
Anton

On 26 December 2014 at 14:19, Maxim Boguk  wrote:

>    ->  Bitmap Heap Scan on myevents  (cost=35.80..3615.09 rows=3716
> width=0) (actual time=351.510..77669.907 rows=1417152 loops=1)
>
>>  Recheck Cond: (event @> '{"event": "delivered"}'::jsonb)
>>  Heap Blocks: exact=298362
>>  Buffers: shared hit=1 read=298589
>>
> ...
>
>>  Execution time: 80986.340 ms
>>
>
>
>>    ->  Bitmap Heap Scan on myevents  (cost=42.80..3622.09 rows=3716
>> width=0) (actual time=534.816..78526.944 rows=1417152 loops=1)
>>  Recheck Cond: (event @> '{"event": "delivered"}'::jsonb)
>>  Heap Blocks: exact=298362
>>  Buffers: shared hit=383 read=299133
>>
> ...
>>  Execution time: 81898.578 ms
>>
>
> Hi Anton,
>
> What you see there, I think, is the performance hit of random disk reads
> on a non-cached database.
> Try increasing shared_buffers to a value where the table and index fit,
> and re-run the queries a few times until you see something like
> Buffers: shared hit=bigvalue read=0, then compare performance; it may
> change the timing quite a lot.
>
> Also, I recommend setting track_io_timing=on in postgresql.conf and then
> using EXPLAIN (ANALYZE, BUFFERS, TIMING) to check how much time the
> database spends on IO operations.
> Also try running VACUUM ANALYZE myevents; before testing, because it
> seems that you don't have an up-to-date visibility map on the table.
>
> However, even in the fully cached case, selecting 40% of the table rows
> will almost always be faster via a sequential scan, so I don't expect
> miracles.
>



-- 
echo '16i[q]sa[ln0=aln100%Pln100/snlbx]sbA0D4D465452snlbxq' | dc
This will help you for 99.9% of your problems ...


Re: [GENERAL] Drastic select count performance hit when jsonb GIN indices are present

2014-12-26 Thread Maxim Boguk
   ->  Bitmap Heap Scan on myevents  (cost=35.80..3615.09 rows=3716
width=0) (actual time=351.510..77669.907 rows=1417152 loops=1)

>  Recheck Cond: (event @> '{"event": "delivered"}'::jsonb)
>  Heap Blocks: exact=298362
>  Buffers: shared hit=1 read=298589
>
...

>  Execution time: 80986.340 ms
>


>    ->  Bitmap Heap Scan on myevents  (cost=42.80..3622.09 rows=3716
> width=0) (actual time=534.816..78526.944 rows=1417152 loops=1)
>  Recheck Cond: (event @> '{"event": "delivered"}'::jsonb)
>  Heap Blocks: exact=298362
>  Buffers: shared hit=383 read=299133
>
...
>  Execution time: 81898.578 ms
>

Hi Anton,

What you see there, I think, is the performance hit of random disk reads
on a non-cached database.
Try increasing shared_buffers to a value where the table and index fit,
and re-run the queries a few times until you see something like
Buffers: shared hit=bigvalue read=0, then compare performance; it may
change the timing quite a lot.

Also, I recommend setting track_io_timing=on in postgresql.conf and then
using EXPLAIN (ANALYZE, BUFFERS, TIMING) to check how much time the
database spends on IO operations.
Also try running VACUUM ANALYZE myevents; before testing, because it
seems that you don't have an up-to-date visibility map on the table.
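Put together, the testing setup above might look like this (a sketch; the shared_buffers value is only illustrative, and changing it requires a restart):

```sql
-- postgresql.conf (illustrative values, not recommendations):
--   shared_buffers  = 4GB   -- large enough to fit table + index
--   track_io_timing = on    -- adds I/O timings to EXPLAIN output

-- refresh statistics and the visibility map before re-testing
VACUUM ANALYZE myevents;

EXPLAIN (ANALYZE, BUFFERS, TIMING)
SELECT count(*) FROM myevents WHERE event @> '{"event": "delivered"}';
-- aim for "Buffers: shared hit=... read=0" on repeated runs
```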


However, even in the fully cached case, selecting 40% of the table rows
will almost always be faster via a sequential scan, so I don't expect
miracles.
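One way to confirm that on a warmed cache is to force each strategy with the planner toggles (a testing-only trick, never for production):

```sql
-- force the sequential scan and compare with the bitmap index scan
SET enable_bitmapscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM myevents WHERE event @> '{"event": "delivered"}';
RESET enable_bitmapscan;
```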



[GENERAL] Drastic select count performance hit when jsonb GIN indices are present

2014-12-26 Thread Anton Melser
Hi,

I've been playing with jsonb for storing and querying data from the Mailgun
Events API (http://documentation.mailgun.com/api-events.html#examples). I
already have a system that parses the JSON to csv and loads into standard
tables but what better way to spend the holidays than nerding out on new
postgres tech :-)? I am using the official postgres-supplied 9.4 Ubuntu
repo on 14.04.

I have been testing both inserting the raw API call JSON
({"items": [{"item1": ...}, ...], "paging": ...}) and extracting the actual
events from the items array ({"item1": ...}, {"item2": ...}) and inserting
per event rather than per page of 100 events in an "items" array. I did
this to try and benefit from GIN indices, but much to my surprise, adding
indices when there is a row per event actually drastically *reduces*
performance!

mgevents=# create table myevents (event jsonb);
mgevents=# create index idx_myevents on myevents using gin (event
jsonb_path_ops);
mgevents=# create index idx_myevents_no_path on myevents using gin (event);

mgevents=# select count(*) from myevents;
  count
---------
 3715600
(1 row)
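For context on the two indexes above (as I understand the GIN operator classes), they support different operators, which matters for reading the plans below:

```sql
-- served by both GIN opclasses (default jsonb_ops and jsonb_path_ops):
SELECT count(*) FROM myevents WHERE event @> '{"event": "delivered"}';

-- served only by the default jsonb_ops opclass (key existence ?, ?|, ?&):
SELECT count(*) FROM myevents WHERE event ? 'event';

-- served by neither GIN index; needs a btree expression index instead:
SELECT count(*) FROM myevents WHERE event ->> 'event' = 'delivered';
```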

mgevents=# explain (analyze, buffers) select count(*) from myevents where
event ->> 'event' = 'delivered';
QUERY PLAN

---
 Aggregate  (cost=279884.45..279884.46 rows=1 width=0) (actual
time=14626.213..14626.216 rows=1 loops=1)
   Buffers: shared read=448208
   ->  Seq Scan on myevents  (cost=0.00..279838.00 rows=18578 width=0)
(actual time=0.719..11432.283 rows=1417152 loops=1)
 Filter: ((event ->> 'event'::text) = 'delivered'::text)
 Rows Removed by Filter: 2298448
 Buffers: shared read=448208
 Planning time: 0.074 ms
 Execution time: 14626.955 ms
(8 rows)

mgevents=# explain (analyze, buffers) select count(*) from myevents where
event @> '{"event": "delivered"}';
   QUERY PLAN

-
 Aggregate  (cost=3624.38..3624.39 rows=1 width=0) (actual
time=80984.095..80984.098 rows=1 loops=1)
   Buffers: shared hit=1 read=298589
   ->  Bitmap Heap Scan on myevents  (cost=35.80..3615.09 rows=3716
width=0) (actual time=351.510..77669.907 rows=1417152 loops=1)
 Recheck Cond: (event @> '{"event": "delivered"}'::jsonb)
 Heap Blocks: exact=298362
 Buffers: shared hit=1 read=298589
 ->  Bitmap Index Scan on idx_myevents  (cost=0.00..34.87 rows=3716
width=0) (actual time=257.219..257.219 rows=1417152 loops=1)
   Index Cond: (event @> '{"event": "delivered"}'::jsonb)
   Buffers: shared hit=1 read=227
 Planning time: 3.197 ms
 Execution time: 80986.340 ms
(11 rows)

mgevents=# drop index idx_myevents;

mgevents=# explain (analyze, buffers) select count(*) from myevents where
event @> '{"event": "delivered"}';
   QUERY
PLAN
-
 Aggregate  (cost=3631.38..3631.39 rows=1 width=0) (actual
time=81898.250..81898.252 rows=1 loops=1)
   Buffers: shared hit=383 read=299133
   ->  Bitmap Heap Scan on myevents  (cost=42.80..3622.09 rows=3716
width=0) (actual time=534.816..78526.944 rows=1417152 loops=1)
 Recheck Cond: (event @> '{"event": "delivered"}'::jsonb)
 Heap Blocks: exact=298362
 Buffers: shared hit=383 read=299133
 ->  Bitmap Index Scan on idx_myevents_no_path  (cost=0.00..41.87
rows=3716 width=0) (actual time=453.412..453.412 rows=1417152 loops=1)
   Index Cond: (event @> '{"event": "delivered"}'::jsonb)
   Buffers: shared hit=383 read=771
 Planning time: 2.322 ms
 Execution time: 81898.578 ms
(11 rows)

mgevents=# drop index idx_myevents_no_path;

mgevents=# explain (analyze, buffers) select count(*) from myevents where
event @> '{"event": "delivered"}';
QUERY PLAN

--
 Aggregate  (cost=270558.29..270558.30 rows=1 width=0) (actual
time=19834.530..19834.532 rows=1 loops=1)
   Buffers: shared hit=130935 read=317273
   ->  Seq Scan on myevents  (cost=0.00..270549.00 rows=3716 width=0)
(actual time=4.650..16764.726 rows=1417152 loops=1)
 Filter: (event @> '{"event": "delivered"}'::jsonb)
 Rows Removed by Filter: 2298448
 Buffers: shared hit=130935 read=317273
 Planning time: 0.238 ms
 Execution time: 19834.605 ms
(8 rows)

So it doesn't even appear to be a problem with the operator (->> vs @>) but
rather that the planner is using the index rather tha