Re: [GENERAL] Performance with high correlation in group by on PK

2017-09-15 Thread Alban Hertroys
On 8 September 2017 at 00:23, Jeff Janes  wrote:
> On Tue, Aug 29, 2017 at 1:20 AM, Alban Hertroys  wrote:
>>
>> On 28 August 2017 at 21:32, Jeff Janes  wrote:
>> > On Mon, Aug 28, 2017 at 5:22 AM, Alban Hertroys 
>> > wrote:
>> >>
>> >> Hi all,
>> >>
>> >> It's been a while since I actually got to use PG for anything serious,
>> >> but we're finally doing some experimentation @work now to see if it is
>> >> suitable for our data warehouse. So far it's been doing well, but there
>> >> is a particular type of query I run into, one I expect we will use
>> >> frequently, that is choosing a sequential scan - and I can't fathom why.
>> >>
>> >> This is on:
>> >>
>> >>
>> >> The query in question is:
>> >> select "VBAK_MANDT", max("VBAK_VBELN")
>> >>   from staging.etl1_vbak
>> >>  group by "VBAK_MANDT";
>> >>
>> >> This is the header-table for another detail table, and in this case
>> >> we're already seeing a seqscan. The thing is, there are 15M rows in
>> >> the table (disk usage is 15GB), while the PK is on ("VBAK_MANDT",
>> >> "VBAK_VBELN") with very few distinct values for "VBAK_MANDT" (in fact,
>> >> we only have 1 at the moment!).
>> >
>> >
>> > You need an "index skip-scan" or "loose index scan".  PostgreSQL doesn't
>> > currently detect and implement them automatically, but you can use a
>> > recursive CTE to get it to work.  There are some examples at
>> > https://wiki.postgresql.org/wiki/Loose_indexscan
>>
>> Thanks Jeff, that's an interesting approach. It looks very similar to
>> correlated subqueries.
>>
>> Unfortunately, it doesn't seem to help with my issue. The CTE is
>> indeed fast, but when querying the results from the 2nd level of the
>> PK with the CTE results, I'm back at a seqscan on pdw2_vbak again.
>
>
> Something like this works:
>
> create table foo as select trunc(random()*5) as col1, random() as col2 from
> generate_series(1,1);
> create index on foo (col1, col2);
> vacuum analyze foo;
>
>
> with recursive t as (
>select * from (select col1, col2 from foo order by col1 desc, col2 desc
> limit 1) asdfsaf
> union all
>   select
>  (select col1 from foo where foo.col1 < t.col1 order by col1 desc, col2
> desc limit 1) as col1,
>  (select col2 from foo where foo.col1 < t.col1 order by col1 desc, col2
> desc limit 1) as col2
>from t where t.col1 is not null
> )
> select * from t where t is not null;
>
> It is pretty ugly that you need one subquery in the select list for each
> column to be returned.  Maybe someone can find a way to avoid that part.  I
> tried using lateral joins to get around it, but couldn't make that work.
>
> Cheers,
>
> Jeff

Thanks Jeff. That does indeed look ugly.

Since we're dealing with a 4GL language (FOCUS) that translates to
SQL, I don't think we'll attempt your workaround, even though we can
use SQL directly in our reports if we want to.

But I just remembered giving someone else in a similar situation some
advice on this very list: obviously, when my first primary key field is
not very selective, I should change the order of the fields in the PK!
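
Something like this, I suppose - an untested sketch, and I'm assuming the
clone kept the default constraint name:

-- untested; assumes the PK constraint is named pdw2_vbak_pkey
alter table staging.pdw2_vbak
    drop constraint pdw2_vbak_pkey;
alter table staging.pdw2_vbak
    add constraint pdw2_vbak_pkey
        primary key ("VBAK_VBELN", "VBAK_MANDT");

Whether the planner then does anything smarter with the group by remains
to be seen, of course.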

But let's first enjoy the weekend.

Alban.
-- 
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.




Re: [GENERAL] Performance with high correlation in group by on PK

2017-09-07 Thread Jeff Janes
On Tue, Aug 29, 2017 at 1:20 AM, Alban Hertroys  wrote:

> On 28 August 2017 at 21:32, Jeff Janes  wrote:
> > On Mon, Aug 28, 2017 at 5:22 AM, Alban Hertroys 
> wrote:
> >>
> >> Hi all,
> >>
> >> It's been a while since I actually got to use PG for anything serious,
> >> but we're finally doing some experimentation @work now to see if it is
> >> suitable for our data warehouse. So far it's been doing well, but there
> >> is a particular type of query I run into, one I expect we will use
> >> frequently, that is choosing a sequential scan - and I can't fathom why.
> >>
> >> This is on:
> >>
> >>
> >> The query in question is:
> >> select "VBAK_MANDT", max("VBAK_VBELN")
> >>   from staging.etl1_vbak
> >>  group by "VBAK_MANDT";
> >>
> >> This is the header-table for another detail table, and in this case
> >> we're already seeing a seqscan. The thing is, there are 15M rows in
> >> the table (disk usage is 15GB), while the PK is on ("VBAK_MANDT",
> >> "VBAK_VBELN") with very few distinct values for "VBAK_MANDT" (in fact,
> >> we only have 1 at the moment!).
> >
> >
> > You need an "index skip-scan" or "loose index scan".  PostgreSQL doesn't
> > currently detect and implement them automatically, but you can use a
> > recursive CTE to get it to work.  There are some examples at
> > https://wiki.postgresql.org/wiki/Loose_indexscan
>
> Thanks Jeff, that's an interesting approach. It looks very similar to
> correlated subqueries.
>
> Unfortunately, it doesn't seem to help with my issue. The CTE is
> indeed fast, but when querying the results from the 2nd level of the
> PK with the CTE results, I'm back at a seqscan on pdw2_vbak again.
>

Something like this works:

create table foo as
    select trunc(random()*5) as col1, random() as col2
      from generate_series(1,1);
create index on foo (col1, col2);
vacuum analyze foo;

with recursive t as (
    select * from (select col1, col2 from foo
                    order by col1 desc, col2 desc limit 1) asdfsaf
  union all
    select
        (select col1 from foo where foo.col1 < t.col1
          order by col1 desc, col2 desc limit 1) as col1,
        (select col2 from foo where foo.col1 < t.col1
          order by col1 desc, col2 desc limit 1) as col2
      from t where t.col1 is not null
)
select * from t where t is not null;

It is pretty ugly that you need one subquery in the select list for each
column to be returned.  Maybe someone can find a way to avoid that part.  I
tried using lateral joins to get around it, but couldn't make that work.
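
One untested idea for avoiding that: carry the whole row as a single
composite column and unpack it at the end, along these lines (just a
sketch; I haven't checked whether the inner subquery still gets the
cheap ordered index scan once it has to return the whole row):

-- sketch: same loose scan, but one whole-row subquery instead of one
-- scalar subquery per column
with recursive t(r) as (
    (select f from foo f order by col1 desc, col2 desc limit 1)
  union all
    select (select f from foo f
             where f.col1 < (t.r).col1
             order by f.col1 desc, f.col2 desc limit 1)
      from t
     where (t.r).col1 is not null
)
select (r).col1, (r).col2 from t where r is not null;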

Cheers,

Jeff


Re: [GENERAL] Performance with high correlation in group by on PK

2017-08-29 Thread Alban Hertroys
On 28 August 2017 at 21:32, Jeff Janes  wrote:
> On Mon, Aug 28, 2017 at 5:22 AM, Alban Hertroys  wrote:
>>
>> Hi all,
>>
>> It's been a while since I actually got to use PG for anything serious,
>> but we're finally doing some experimentation @work now to see if it is
>> suitable for our data warehouse. So far it's been doing well, but there
>> is a particular type of query I run into, one I expect we will use
>> frequently, that is choosing a sequential scan - and I can't fathom why.
>>
>> This is on:
>>
>>
>> The query in question is:
>> select "VBAK_MANDT", max("VBAK_VBELN")
>>   from staging.etl1_vbak
>>  group by "VBAK_MANDT";
>>
>> This is the header-table for another detail table, and in this case
>> we're already seeing a seqscan. The thing is, there are 15M rows in
>> the table (disk usage is 15GB), while the PK is on ("VBAK_MANDT",
>> "VBAK_VBELN") with very few distinct values for "VBAK_MANDT" (in fact,
>> we only have 1 at the moment!).
>
>
> You need an "index skip-scan" or "loose index scan".  PostgreSQL doesn't
> currently detect and implement them automatically, but you can use a
> recursive CTE to get it to work.  There are some examples at
> https://wiki.postgresql.org/wiki/Loose_indexscan

Thanks Jeff, that's an interesting approach. It looks very similar to
correlated subqueries.

Unfortunately, it doesn't seem to help with my issue. The CTE is
indeed fast, but when querying the results from the 2nd level of the
PK with the CTE results, I'm back at a seqscan on pdw2_vbak again.

Just the CTE plan is in skipScan-top.sql; the complete plan is in
skipScan-full.sql.

Note: I cloned the original etl1_vbak table into a new
pdw2_vbak table that has varchars instead of chars, which reduced
the table size to just over half the original's size. Hence the
different table names, but the behaviour for this particular issue is
the same between them.
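
One direction I still mean to try is keeping the second level as a
correlated "order by ... desc limit 1" lookup per "VBAK_MANDT" value,
instead of joining the CTE back and grouping. An untested sketch, but
both levels should then be able to stay on the primary key index:

-- untested sketch: loose scan over the distinct "VBAK_MANDT" values,
-- plus one ordered index lookup per value instead of a join + group by
with recursive t as (
    select min("VBAK_MANDT") as "VBAK_MANDT" from staging.pdw2_vbak
  union all
    select (select min("VBAK_MANDT") from staging.pdw2_vbak
             where "VBAK_MANDT" > t."VBAK_MANDT")
      from t
     where t."VBAK_MANDT" is not null
)
select t."VBAK_MANDT",
       (select v."VBAK_VBELN"
          from staging.pdw2_vbak v
         where v."VBAK_MANDT" = t."VBAK_MANDT"
         order by v."VBAK_VBELN" desc
         limit 1) as "VBAK_VBELN"
  from t
 where t."VBAK_MANDT" is not null;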

-- 
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

warehouse=# explain (analyze, buffers)
with recursive t as (
select min("VBAK_MANDT") as "VBAK_MANDT" from staging.pdw2_vbak
union all
select (select min("VBAK_MANDT") as "VBAK_MANDT" from staging.pdw2_vbak 
where "VBAK_MANDT" > t."VBAK_MANDT") from t where t."VBAK_MANDT" is not null
) select "VBAK_MANDT" from t;

                                                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on t  (cost=98.31..100.33 rows=101 width=32) (actual time=0.031..0.054 rows=2 loops=1)
   Buffers: shared hit=9
   CTE t
     ->  Recursive Union  (cost=0.73..98.31 rows=101 width=32) (actual time=0.029..0.052 rows=2 loops=1)
           Buffers: shared hit=9
           ->  Result  (cost=0.73..0.74 rows=1 width=32) (actual time=0.029..0.029 rows=1 loops=1)
                 Buffers: shared hit=5
                 InitPlan 3 (returns $1)
                   ->  Limit  (cost=0.56..0.73 rows=1 width=32) (actual time=0.026..0.027 rows=1 loops=1)
                         Buffers: shared hit=5
                         ->  Index Only Scan using pdw2_vbak_pkey on pdw2_vbak pdw2_vbak_1  (cost=0.56..2375293.75 rows=14214332 width=32) (actual time=0.024..0.024 rows=1 loops=1)
                               Index Cond: ("VBAK_MANDT" IS NOT NULL)
                               Heap Fetches: 1
                               Buffers: shared hit=5
           ->  WorkTable Scan on t t_1  (cost=0.00..9.56 rows=10 width=32) (actual time=0.009..0.010 rows=0 loops=2)
                 Filter: ("VBAK_MANDT" IS NOT NULL)
                 Rows Removed by Filter: 0
                 Buffers: shared hit=4
                 SubPlan 2
                   ->  Result  (cost=0.93..0.94 rows=1 width=32) (actual time=0.015..0.015 rows=1 loops=1)
                         Buffers: shared hit=4
                         InitPlan 1 (returns $3)
                           ->  Limit  (cost=0.56..0.93 rows=1 width=32) (actual time=0.013..0.013 rows=0 loops=1)
                                 Buffers: shared hit=4
                                 ->  Index Only Scan using pdw2_vbak_pkey on pdw2_vbak  (cost=0.56..1732075.91 rows=4738111 width=32) (actual time=0.012..0.012 rows=0 loops=1)
                                       Index Cond: (("VBAK_MANDT" IS NOT NULL) AND ("VBAK_MANDT" > t_1."VBAK_MANDT"))
                                       Heap Fetches: 0
                                       Buffers: shared hit=4
 Planning time: 0.346 ms
 Execution time: 0.100 ms
(30 rows)
warehouse=# explain (analyze, buffers)
with recursive t as (
select min("VBAK_MANDT") as "VBAK_MANDT" from staging.pdw2_vbak
union all
select (select min("VBAK_MANDT") as 

Re: [GENERAL] Performance with high correlation in group by on PK

2017-08-28 Thread Jeff Janes
On Mon, Aug 28, 2017 at 5:22 AM, Alban Hertroys  wrote:

> Hi all,
>
> It's been a while since I actually got to use PG for anything serious,
> but we're finally doing some experimentation @work now to see if it is
> suitable for our data warehouse. So far it's been doing well, but there
> is a particular type of query I run into, one I expect we will use
> frequently, that is choosing a sequential scan - and I can't fathom why.
>
> This is on:
>
>
> The query in question is:
> select "VBAK_MANDT", max("VBAK_VBELN")
>   from staging.etl1_vbak
>  group by "VBAK_MANDT";
>
> This is the header-table for another detail table, and in this case
> we're already seeing a seqscan. The thing is, there are 15M rows in
> the table (disk usage is 15GB), while the PK is on ("VBAK_MANDT",
> "VBAK_VBELN") with very few distinct values for "VBAK_MANDT" (in fact,
> we only have 1 at the moment!).
>

You need an "index skip-scan" or "loose index scan".  PostgreSQL doesn't
currently detect and implement them automatically, but you can use a
recursive CTE to get it to work.  There are some examples at
https://wiki.postgresql.org/wiki/Loose_indexscan

Cheers,

Jeff


Re: [GENERAL] Performance with high correlation in group by on PK

2017-08-28 Thread Alban Hertroys
On 28 August 2017 at 14:22, Alban Hertroys  wrote:

> This is on:
Just noticed I forgot to paste this in:
warehouse=# select version();
                                                  version
------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.6.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit
(1 row)


--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.




[GENERAL] Performance with high correlation in group by on PK

2017-08-28 Thread Alban Hertroys
Hi all,

It's been a while since I actually got to use PG for anything serious,
but we're finally doing some experimentation @work now to see if it is
suitable for our datawarehouse. So far it's been doing well, but there
is a particular type of query I run into that I expect we will
frequently use and that's choosing a sequential scan - and I can't
fathom why.

This is on:


The query in question is:
select "VBAK_MANDT", max("VBAK_VBELN")
  from staging.etl1_vbak
 group by "VBAK_MANDT";

This is the header-table for another detail table, and in this case
we're already seeing a seqscan. The thing is, there are 15M rows in
the table (disk usage is 15GB), while the PK is on ("VBAK_MANDT",
"VBAK_VBELN") with very few distinct values for "VBAK_MANDT" (in fact,
we only have 1 at the moment!).

Explain analyze says the following about this query:
warehouse=# explain (analyze, buffers) select "VBAK_MANDT",
max("VBAK_VBELN") from staging.etl1_vbak group by "VBAK_MANDT";
                                                             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1990054.08..1990054.09 rows=1 width=36) (actual time=38723.602..38723.602 rows=1 loops=1)
   Group Key: "VBAK_MANDT"
   Buffers: shared hit=367490 read=1409344
   ->  Seq Scan on etl1_vbak  (cost=0.00..1918980.72 rows=14214672 width=15) (actual time=8.886..31317.283 rows=14214672 loops=1)
         Buffers: shared hit=367490 read=1409344
 Planning time: 0.126 ms
 Execution time: 38723.674 ms
(7 rows)

As you can see, a sequential scan. The statistics seem quite correct.

If I change the query to filter on a single value of "VBAK_MANDT", we get:
warehouse=# explain (analyze, buffers) select max("VBAK_VBELN") from
staging.etl1_vbak where "VBAK_MANDT" = '300';

                                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=1.37..1.38 rows=1 width=32) (actual time=14.911..14.911 rows=1 loops=1)
   Buffers: shared hit=2 read=3
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.56..1.37 rows=1 width=11) (actual time=14.907..14.908 rows=1 loops=1)
           Buffers: shared hit=2 read=3
           ->  Index Only Scan Backward using etl1_vbak_pkey on etl1_vbak  (cost=0.56..11498362.31 rows=14214672 width=11) (actual time=14.906..14.906 rows=1 loops=1)
                 Index Cond: (("VBAK_MANDT" = '300'::bpchar) AND ("VBAK_VBELN" IS NOT NULL))
                 Heap Fetches: 1
                 Buffers: shared hit=2 read=3
 Planning time: 0.248 ms
 Execution time: 14.945 ms
(11 rows)

That is more in line with my expectations.

Oddly enough, when I add "VBAK_MANDT" and the group by back into that last
query, the result is a seqscan again.

For "VBAK_MANDT" we see these statistics:
Null fraction: 0
Average width: 4
Distinct values: 1
Most common values: {300}
Most common frequencies: {1}
Histogram bounds:
Correlation: 1
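
For reference, the same numbers can be read straight from pg_stats with
something like:

-- the statistics above, as stored in pg_stats for this column
select null_frac, avg_width, n_distinct, most_common_vals,
       most_common_freqs, histogram_bounds, correlation
  from pg_stats
 where schemaname = 'staging'
   and tablename = 'etl1_vbak'
   and attname = 'VBAK_MANDT';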

The table definition is:
  Table "staging.etl1_vbak"
   Column| Type  | Modifiers
-+---+---
 VBAK_MANDT  | character(3)  | not null
 VBAK_VBELN  | character(10) | not null
 VBAK_ERDAT  | date  | not null
 VBAK_ERZET  | character(6)  | not null
 VBAK_ERNAM  | character(12) | not null
 VBAK_ANGDT  | date  | not null
 VBAK_BNDDT  | date  | not null
 VBAK_AUDAT  | date  | not null
...
VBAK_MULTI  | character(1)  | not null
 VBAK_SPPAYM | character(2)  | not null
Indexes:
"etl1_vbak_pkey" PRIMARY KEY, btree ("VBAK_MANDT", "VBAK_VBELN")
"idx_etl1_vbak_erdat" btree ("VBAK_ERDAT")

A final remark: The table definition was generated by our
reporting/ETL software, hence the case-sensitive column names and the
use of the character type instead of varchar (or text).

What can be done to help the planner choose a smarter plan?
-- 
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

