On Tue, Aug 29, 2017 at 1:20 AM, Alban Hertroys <haram...@gmail.com> wrote:
> On 28 August 2017 at 21:32, Jeff Janes <jeff.ja...@gmail.com> wrote: > > On Mon, Aug 28, 2017 at 5:22 AM, Alban Hertroys <haram...@gmail.com> > wrote: > >> > >> Hi all, > >> > >> It's been a while since I actually got to use PG for anything serious, > >> but we're finally doing some experimentation @work now to see if it is > >> suitable for our datawarehouse. So far it's been doing well, but there > >> is a particular type of query I run into that I expect we will > >> frequently use and that's choosing a sequential scan - and I can't > >> fathom why. > >> > >> This is on: > >> > >> > >> The query in question is: > >> select "VBAK_MANDT", max("VBAK_VBELN") > >> from staging.etl00001_vbak > >> group by "VBAK_MANDT"; > >> > >> This is the header-table for another detail table, and in this case > >> we're already seeing a seqscan. The thing is, there are 15M rows in > >> the table (disk usage is 15GB), while the PK is on ("VBAK_MANDT", > >> "VBAK_VBELN") with very few distinct values for "VBAK_MANDT" (in fact, > >> we only have 1 at the moment!). > > > > > > You need an "index skip-scan" or "loose index scan". PostgreSQL doesn't > > currently detect and implement them automatically, but you can use a > > recursive CTE to get it to work. There are some examples at > > https://wiki.postgresql.org/wiki/Loose_indexscan > > Thanks Jeff, that's an interesting approach. It looks very similar to > correlated subqueries. > > Unfortunately, it doesn't seem to help with my issue. The CTE is > indeed fast, but when querying the results from the 2nd level ov the > PK with the CTE results, I'm back at a seqscan on pdw00002_vbak again. > Something like this works: create table foo as select trunc(random()*5) as col1, random() as col2 from generate_series(1,100000000); create index on foo (col1, col2); vacuum analyze foo; with recursive t as ( select * from (select col1, col2 from foo order by col1 desc, col2 desc limit 1) asdfsaf union all select (select col1 from foo where foo.col1 < t.col1 order by col1 desc, col2 desc limit 1) as col1, (select col2 from foo where foo.col1 < t.col1 order by col1 desc, col2 desc limit 1) as col2 from t where t.col1 is not null ) select * from t where t is not null; It is pretty ugly that you need one subquery in the select list for each column to be returned. Maybe someone can find a way to avoid that part. I tried using lateral joins to get around it, but couldn't make that work. Cheers, Jeff