>> Reading around, seems many people are still using this 2005 snippet to
>> obtain the row count estimate from Explain:
>
> I recommend using FORMAT JSON and extracting the top row count from that.
> It is simpler and less error-prone.

Good tip, thanks Laurenze!
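For the archives, something along these lines seems to do it (my own sketch of the FORMAT JSON idea, not the old wiki snippet; the top-level estimate sits at the first array element's Plan -> 'Plan Rows'):

-- Sketch: wrap Explain (Format Json) and pull out the planner's top row estimate.
-- The function name and the bigint return type are my own choices.
Create Function count_estimate(query text) Returns bigint As $$
Declare
    plan jsonb;  -- Explain's output gets coerced to jsonb on assignment
Begin
    Execute 'Explain (Format Json) ' || query Into plan;
    Return (plan -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
End;
$$ Language plpgsql;

-- usage:
-- Select count_estimate('Select * from pricescope_tenders where country = ''Colombia''');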
>> Is this still the current best practice? Any tips to increase precision?
>> Currently it can miss the actual number of rows by over or under a million,
>> as seen on the sample queries (1,955,297 instead of 1,001,200; 162,080
>> instead of 1,292,010).
>
> Looking at the samples you provided, I get the impression that the
> statistics for the table are quite outdated. That will affect the
> estimates. Try running ANALYZE and see if that improves the estimates.

No major changes after doing Analyze, and also Vacuum Analyze. Seems something is seriously off.

I pimped my config thanks to Alvaro's prompting: set default_statistics_target = 500 (suggested for warehouse dbs) but raised it on pertinent columns from 2,000 to 5,000 (commands in the P.S.; will play with disabling JIT or raising its cost threshold later):

shared_buffers = 2GB                # ~0.25 * RAM, dedicated cache, hard allocation (requires restart)
effective_cache_size = 6GB          # 0.5-0.75 * RAM (free -h: free + cache + shared_buffers)
work_mem = 128MB                    # RAM * 0.25 / max_connections
maintenance_work_mem = 512MB
default_statistics_target = 500     # def 100; higher to make the planner use indexes in big warehouse tables
random_page_cost = 1.1              # random reads on SSD cost almost as little as sequential ones

Analyzed again (1.5M rows sampled instead of 600k):

"tenders": scanned 216632 of 216632 pages, containing 3815567 live rows and 0 dead rows;
  1500000 rows in sample, 3815567 estimated total rows
"items": scanned 995023 of 995023 pages, containing 7865043 live rows and 0 dead rows;
  1500000 rows in sample, 7865043 estimated total rows

but same deal:

-- After config pimp: 1,959,657 estimated vs 1,001,200 actual; 45,341.654 ms
Gather  (cost=247031.70..1479393.82 rows=1959657 width=824) (actual time=8464.691..45257.435 rows=1001200 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Hash Join  (cost=246031.70..1282428.12 rows=816524 width=824) (actual time=8413.057..44614.153 rows=333733 loops=3)
        Hash Cond: (pricescope_items.tender_transaction_id = pricescope_tenders.transaction_id)
        ->  Parallel Seq Scan on pricescope_items  (cost=0.00..1027794.01 rows=3277101 width=522) (actual time=0.753..41654.507 rows=2621681 loops=3)
        ->  Parallel Hash  (cost=241080.20..241080.20 rows=396120 width=302) (actual time=995.247..995.250 rows=333733 loops=3)
              Buckets: 1048576  Batches: 1  Memory Usage: 219904kB
              ->  Parallel Bitmap Heap Scan on pricescope_tenders  (cost=17516.10..241080.20 rows=396120 width=302) (actual time=162.898..321.472 rows=333733 loops=3)
                    Recheck Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
                    Heap Blocks: exact=34722
                    ->  Bitmap Index Scan on pricescope_tenders_country_and_date_index  (cost=0.00..17278.43 rows=950688 width=0) (actual time=186.536..186.537 rows=1001200 loops=1)
                          Index Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
Planning Time: 11.310 ms
JIT:
  Functions: 33
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 8.608 ms, Inlining 213.375 ms, Optimization 557.351 ms, Emission 417.568 ms, Total 1196.902 ms
Execution Time: 45341.654 ms
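Side note: to sanity-check what those 1.5M-row samples actually stored for the filter columns, the standard pg_stats view can be queried directly (read-only; 'Colombia' should show up in most_common_vals):

-- Peek at what Analyze captured for the columns the planner is estimating on
Select attname, n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 'pricescope_tenders'
  and attname in ('country', 'date');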
BUT if I force the planner to ignore 'country' statistics:

-- Subselect country to hide the constant from the planner, so it doesn't use statistics
Explain Analyze
Select *
from pricescope_tenders
inner join pricescope_items on transaction_id = tender_transaction_id
where country = (select 'Colombia')
  and "date" >= '2023-01-01' and "date" < '2024-01-01';

then I get the same plan as when I filter for Mexico, with a similar run time:

-- Colombia in subselect: 428,623 estimated vs 1,001,200 actual; 6,674.860 ms
Gather  (cost=1001.00..570980.73 rows=428623 width=824) (actual time=166.785..6600.673 rows=1001200 loops=1)
  Workers Planned: 2
  Params Evaluated: $0
  Workers Launched: 2
  InitPlan 1 (returns $0)
    ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=166.031..166.033 rows=1 loops=1)
  ->  Nested Loop  (cost=0.99..527118.42 rows=178593 width=824) (actual time=200.511..5921.585 rows=333733 loops=3)
        ->  Parallel Index Scan using pricescope_tenders_country_and_date_index on pricescope_tenders  (cost=0.43..104391.64 rows=86641 width=302) (actual time=200.388..400.882 rows=333733 loops=3)
              Index Cond: ((country = $0) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
        ->  Index Scan using pricescope_items_tender_transaction_id_index on pricescope_items  (cost=0.56..4.83 rows=5 width=522) (actual time=0.016..0.016 rows=1 loops=1001200)
              Index Cond: (tender_transaction_id = pricescope_tenders.transaction_id)
Planning Time: 7.372 ms
JIT:
  Functions: 31
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 6.981 ms, Inlining 209.470 ms, Optimization 308.123 ms, Emission 248.176 ms, Total 772.750 ms
Execution Time: 6674.860 ms

So runtime is now decent; the estimate is still way off (428,623 vs 1,001,200, roughly -570k), though I guess that's better than +1M.

1. Any tips to fix the stats?
2. Or a better way of making the planner go for index scans for country?

Thanks again!
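P.S. For anyone replicating: a per-column bump like the one mentioned above is done along these lines (country and date picked for illustration as the "pertinent" columns; the new target only takes effect on the next Analyze):

-- Raise the per-column sample target, then re-gather stats
Alter Table pricescope_tenders Alter Column country Set Statistics 5000;
Alter Table pricescope_tenders Alter Column "date" Set Statistics 5000;
Analyze Verbose pricescope_tenders;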