On Fri, Jan 04, 2019 at 08:58:57AM +0000, Abadie Lana wrote:
> SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, tablename, 
> attname, null_frac, n_distinct, array_length(most_common_vals,1) n_mcv, 
> array_length(histogram_bounds,1) n_hist FROM pg_stats WHERE attname='...' AND 
> tablename='...' ORDER BY 1 DESC; 
> 
> Hmm. Is it normal that the couple (tablename,attname ) is not unique? I'm 
> surprised to see sample_{ctrl,util,buil} quoted twice

One of the rows is for "inherited stats" (including child tables) stats and one
is "noninherited stats".

The unique index on the table behind that view is:
    "pg_statistic_relid_att_inh_index" UNIQUE, btree (starelid, staattnum, 
stainherit)

On the wiki, I added inherited and correlation columns.  Would you rerun that 
query ?
https://wiki.postgresql.org/wiki/Slow_Query_Questions#Statistics:_n_distinct.2C_MCV.2C_histogram

I'm also interested to see \d and channel_id statistics for the channel table.

> explain (analyze, buffers) select 
> 'BUIL-B36-VA-RT-RT1:CL0001-2-ABW',c.smpl_time,c.nanosecs,c.float_val,c.num_val,c.str_val,c.datatype,c.array_val
>  from sample c WHERE c.channel_id = (SELECT channel_id FROM channel WHERE 
> name='BUIL-B36-VA-RT-RT1:CL0001-2-ABW') order by c.smpl_time desc limit 5;

You originally wrote this as a implicit comma join.  Does the original query
still have an issue ?  The =(subselect query) doesn't allow the planner to
optimize for the given channel, which seems to be a fundamental problem.

On Fri, Jan 04, 2019 at 08:58:57AM +0000, Abadie Lana wrote:
> Based on your feedback...i rerun analyse directly on the two table 
> sample_ctrl_year and sample_buil_year
> [...] Now when running the query again, only for sample_buil_year table the 
> wrong index is picked up...

It looks like statistics on your tables were completely wrong; not just
sample_ctrl_year and sample_buil_year.  Right ?

Autoanalyze would normally handle this on nonempty tables (children or
otherwise) and you should manually run ANALZYE on the parents (both levels of
them) whenever statistics change, like after running a big DELETE or DROP or
after a significant interval of time has passed relative to the range of time
in the table's timestamp columns.

Do you know why autoanalze didn't handle the nonempty tables on its own ?

> Now, the channel name I gave has no entries in sample_buil_year...(and when I 
> run the query directly against sample_buil_year the right index is picked 
> up).... So maybe something related with the partitioning?

>                ->  Index Scan Backward using smpl_time_bx2_idx on 
> sample_buil_year c_5  (cost=0.56..2023054.76 rows=665761 width=75) (actual 
> time=13216.589..13216.589 rows=0 loops=1)
>                      Filter: (channel_id = $0)
>                      Rows Removed by Filter: 50597834
>                      Buffers: shared hit=26626368

So it scanned the entire index expecting to find 5 matching channel IDs "pretty
soon", based on the generic distribution of channel IDs, without the benefit of
knowing that this channel ID doesn't exist at all (due to =(subquery)).

26e6 buffers is 200GB, apparently accessing some pages many
times (even if cached).

    table_name    |     index_name      | table_size | index_size               
                                                                                
                                                                   
 sample_buil_year | smpl_time_bx2_idx   | 4492 MB    | 1084 MB                  
                                                                                
                                                                   

General comments:

On Wed, Jan 02, 2019 at 04:28:41PM +0000, Abadie Lana wrote:
>    "sample_time_bm_idx" btree (channel_id, smpl_time)
>    "sample_time_mb1_idx" btree (smpl_time, channel_id)
>    "smpl_time_bx1_idx" btree (smpl_time)

The smpl_time index is loosely redundant with index on (smpl_time,channel_id).
You might consider dropping it, or otherwise dropping the smpl_time,channel_id
index and making two separate indices on smpl_time and channel.  That would
allow bitmap ANDing them together.

Or possibly (depending on detail of your data loading) leaving the composite
index and changing smpl_time to a BRIN index - it's nice to be able to CLUSTER
on the btree index to maximize the efficiency of the brin index.

>Check constraints:
>    "sample_buil_month_smpl_time_check" CHECK (smpl_time >= (now() - '32 
> days'::interval)::timestamp without time zone AND smpl_time <= now())

I'm surprised that works, and not really sure what it's doing..but in any case
it's maybe not doing what you wanted(??).  I'm guessing you never get
constraint exclusion (which is irrelevant for this query but still).

Justin

Reply via email to