Re: [PERFORM] query of partitioned object doesnt use index in qa

2017-09-15 Thread Tomas Vondra
On 09/16/2017 12:05 AM, Tom Lane wrote:
> Andres Freund  writes:
>> I'm wondering if we should extend the new CREATE STATISTICS
>> framework to be able to do that without requiring an index.
> 
> I think that's already on the roadmap --- it's one of the reasons we 
> ended up with a SELECT-like syntax for CREATE STATISTICS. But it 
> didn't get done for v10.
> 

Right. It's one of the things I'd like to be working on after getting in
the more complex statistics types (MCV & histograms).

> If we do look at that as a substitute for "make an expression index
> just so you get some stats", it would be good to have a way to
> specify that you only want the standard ANALYZE stats on that value
> and not the extended ones.
> 

Not sure I understand what you mean by "extended" - the statistics we
collect for expression indexes, or the CREATE STATISTICS stuff? I assume
the former, because if you don't want the latter then just don't create
the statistics. Or am I missing something?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] query of partitioned object doesnt use index in qa

2017-09-15 Thread Tom Lane
Andres Freund  writes:
> I'm wondering if we should extend the new CREATE STATISTICS framework to
> be able to do that without requiring an index.

I think that's already on the roadmap --- it's one of the reasons we
ended up with a SELECT-like syntax for CREATE STATISTICS.  But it
didn't get done for v10.

If we do look at that as a substitute for "make an expression index just
so you get some stats", it would be good to have a way to specify that you
only want the standard ANALYZE stats on that value and not the extended
ones.

regards, tom lane


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] query of partitioned object doesnt use index in qa

2017-09-15 Thread Andres Freund


On September 15, 2017 1:42:23 PM PDT, Tom Lane  wrote:
>One thing you could consider doing about this is creating an index
>on (body ->> 'SID'::text), which would prompt ANALYZE to gather
>statistics
>about that expression.  Even if the index weren't actually used in the
>plan, this might improve the estimates and the resulting planning
>choices
>enough to make it worth maintaining such an index.

I'm wondering if we should extend the new CREATE STATISTICS framework to be 
able to do that without requiring an index. I.e. allow expressions and add a 
new type of stats that just correspond to what normal columns have.  Could even 
create that implicitly for expression indexes, but allow to drop it, if the 
overtrading isn't worth it.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] query of partitioned object doesnt use index in qa

2017-09-15 Thread Mike Broers
That makes a lot of sense, thanks for taking a look.  An index like you
suggest would probably further improve the query.   Is that suggestion
sidestepping the original problem that production is evaluating the
landing_id bit with the partition index and qa is sequence scanning instead?

AND exists (select 1 from t_sap where e.landing_id = t_sap.landing_id)) as
rankings;

Based on the difference in row estimate I am attempting an analyze with a
higher default_statistic_target (currently 100) to see if that helps.




On Fri, Sep 15, 2017 at 3:42 PM, Tom Lane  wrote:

> Mike Broers  writes:
> > If Im reading this correctly postgres thinks the partition will return
> 6.5
> > million matching rows but actually comes back with 162k.  Is this a case
> > where something is wrong with the analyze job?
>
> You've got a lot of scans there that're using conditions like
>
> > │   ->  Seq Scan on event__ e_1
> (cost=0.00..2527828.05 rows=11383021 width=778) (actual
> time=25522.389..747238.885 rows=42 loops=1)
> > │ Filter: (((body ->> 'SID'::text) IS
> NOT NULL) AND (validation_status_code = 'P'::bpchar))
> > │ Rows Removed by Filter: 12172186
>
> While I'd expect the planner to be pretty solid on estimating the
> validation_status_code condition, it's not going to have any idea about
> that JSON field test.  That's apparently very selective, but you're just
> getting a default estimate, which is not going to think that a NOT NULL
> test will exclude lots of rows.
>
> One thing you could consider doing about this is creating an index
> on (body ->> 'SID'::text), which would prompt ANALYZE to gather statistics
> about that expression.  Even if the index weren't actually used in the
> plan, this might improve the estimates and the resulting planning choices
> enough to make it worth maintaining such an index.
>
> Or you could think about pulling that field out and storing it on its own.
> JSON columns are great for storing random unstructured data, but they are
> less great when you want to do relational-ish things on subfields.
>
> regards, tom lane
>


[PERFORM] Indexing an array of two separate columns

2017-09-15 Thread Jeremy Finzel
I have a user who is trying to match overlapping duplicate phone info but
for different customer_ids.

The intended conditional could be expressed:
IF the intersection of the sets
{c.main_phone, c.secondary_phone}
and
{c1.main_phone, c1.secondary_phone}
is not empty
THEN join
EXCEPT where the intersection of the sets =
{'00'}

He wants a join like this:

FROM customers c
INNER JOIN customers c1 on (array[c.main_phone, c.secondary_phone] &&
 array[nullif(c1.main_phone, '00') , nullif(c1.secondary_phone,
'00')])
(array[c.main_phone, c.secondary_phone] && array[nullif(c1.main_phone,
'00') , nullif(c1.secondary_phone, '00')])
WHERE c.customer_id = 1;

I want to index this part:
array[nullif(c1.main_phone, '00') , nullif(c1.secondary_phone,
'00')]

First of all I see I can't create a btree index on an array.  And with
btree_gin, this index is not being used:

CREATE INDEX ON customers USING gin ((NULLIF(main_phone,
'00'::text)), (NULLIF(secondary_phone, '00'::text)));

What am I missing here?  Is there a way to support a condition like this?

Thank you!


Re: [PERFORM] query of partitioned object doesnt use index in qa

2017-09-15 Thread Tom Lane
Mike Broers  writes:
> If Im reading this correctly postgres thinks the partition will return 6.5
> million matching rows but actually comes back with 162k.  Is this a case
> where something is wrong with the analyze job?

You've got a lot of scans there that're using conditions like

> │   ->  Seq Scan on event__ e_1 
> (cost=0.00..2527828.05 rows=11383021 width=778) (actual 
> time=25522.389..747238.885 rows=42 loops=1)
> │ Filter: (((body ->> 'SID'::text) IS NOT 
> NULL) AND (validation_status_code = 'P'::bpchar))
> │ Rows Removed by Filter: 12172186

While I'd expect the planner to be pretty solid on estimating the
validation_status_code condition, it's not going to have any idea about
that JSON field test.  That's apparently very selective, but you're just
getting a default estimate, which is not going to think that a NOT NULL
test will exclude lots of rows.

One thing you could consider doing about this is creating an index
on (body ->> 'SID'::text), which would prompt ANALYZE to gather statistics
about that expression.  Even if the index weren't actually used in the
plan, this might improve the estimates and the resulting planning choices
enough to make it worth maintaining such an index.

Or you could think about pulling that field out and storing it on its own.
JSON columns are great for storing random unstructured data, but they are
less great when you want to do relational-ish things on subfields.

regards, tom lane


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] query of partitioned object doesnt use index in qa

2017-09-15 Thread Mike Broers
Query finally came back with an explain analyze :)

If Im reading this correctly postgres thinks the partition will return 6.5
million matching rows but actually comes back with 162k.  Is this a case
where something is wrong with the analyze job?

Seq Scan on event__00071000 e_4  (cost=0.00..2204374.94 rows=6523419
width=785) (actual time=7020.509..448368.247 rows=162912 loops=1)




┌───┐

│
QUERY PLAN
  │

├───┤

│ Subquery Scan on rankings  (cost=45357272.27..47351629.37 rows=39887142
width=24) (actual time=6117566.189..6117619.805 rows=25190 loops=1)
  │

│   ->  WindowAgg  (cost=45357272.27..46952757.95 rows=39887142 width=772)
(actual time=6117566.101..6117611.266 rows=25190 loops=1)
  │

│ ->  Sort  (cost=45357272.27..45456990.12 rows=39887142 width=772)
(actual time=6117566.054..6117572.121 rows=25190 loops=1)
│

│   Sort Key: ((e.body ->> 'SID'::text)), ((e.body ->>
'Timestamp'::text)) DESC
  │

│   Sort Method: quicksort  Memory: 13757kB

│

│   ->  Hash Join  (cost=46.38..24740720.18 rows=39887142
width=772) (actual time=1511499.761..6117335.382 rows=25190 loops=1)
  │

│ Hash Cond: (e.landing_id = t_sap.landing_id)

│

│ ->  Append  (cost=0.00..24387085.38 rows=79774283
width=776) (actual time=25522.442..6116672.504 rows=2481659 loops=1)
│

│   ->  Seq Scan on event e  (cost=0.00..1.36
rows=1 width=97) (actual time=0.049..0.049 rows=0 loops=1)
  │

│ Filter: (((body ->> 'SID'::text) IS NOT
NULL) AND (validation_status_code = 'P'::bpchar))
  │

│ Rows Removed by Filter: 24

│

│   ->  Seq Scan on event__ e_1
(cost=0.00..2527828.05
rows=11383021 width=778) (actual time=25522.389..747238.885 rows=42 loops=1)
  │

│ Filter: (((body ->> 'SID'::text) IS NOT
NULL) AND (validation_status_code = 'P'::bpchar))
  │

│ Rows Removed by Filter: 12172186

│

│   ->  Seq Scan on event__00069000 e_2
(cost=0.00..1462613.93
rows=5957018 width=771) (actual time=4486.295..370098.760 rows=183696
loops=1)  │

│ Filter: (((body ->> 'SID'::text) IS NOT
NULL) AND (validation_status_code = 'P'::bpchar))
  │

│ Rows Removed by Filter: 6956029

│

│   ->  Seq Scan on event__0007 e_3
(cost=0.00..1534702.41
rows=5991507 width=787) (actual time=3415.907..361606.800 rows=199081
loops=1)  │

│ Filter: (((body ->> 'SID'::text) IS NOT
NULL) AND (validation_status_code = 'P'::bpchar))
  │

│ Rows Removed by Filter: 7177444

│

│   ->  Seq Scan on event__00071000 e_4
(cost=0.00..2204374.94
rows=6523419 width=785) (actual time=7020.509..448368.247 rows=162912
loops=1)  │

│ Filter: (((body ->> 'SID'::text) IS NOT
NULL) AND (validation_status_code = 'P'::bpchar))
  │

│ Rows Removed by Filter: 8091470

│

│   ->  Seq Scan on event__00072000 e_5
(cost=0.00..1531430.89
rows=5814704 width=792) (actual time=25.304..343612.826 rows=214891 loops=1)
  │

│ Filter: (((body ->> 'SID'::text) IS NOT
NULL) AND (validation_status_code = 'P'::bpchar))
  │

│ Rows Removed by Filter: 7301151

│

│   ->  Seq Scan on event__00073000 e_6
(cost=0.00..1384865.48
rows=5876959 width=767) (actual time=1631.133..424827.603 rows=163959
loops=1)  │

│ Filter: (((body ->> 'SID'::text) IS NOT
NULL) AND (validation_status_code = 'P'::bpchar))
  │

│ Rows Removed by Filter: 6523673

│

│   ->  Seq Scan on event__00074000 e_7
(cost=0.00..1289048.37
rows=4747343 width=801) (actual time=3287.286..280317.057 rows=204394
loops=1)  │

│ Filter: (((body ->> 'SID'::text) IS NOT
NULL) AND (validation_status_code = 'P'::bpchar))
  │

│ Rows Removed by