[SQL] Curious problem of using BETWEEN with start and end being the same versus EQUALS '='
On 9.1, I am running into a curious issue. The good run, with a single day in an '=' clause:

Output: a.date_id, a.page_group, a.page, a.int_alloc_type, a.componentset, a.adc_visit, (upper((a.adc_visit)::text))
Sort Key: (upper((a.adc_visit)::text))
Sort Method: external merge  Disk: 49480kB
->  Index Scan using alps_agg_date_id on bi2003.alps_agg a  (cost=0.00..17870.00 rows=26292 width=1350) (actual time=0.047..142.383 rows=36132 loops=1)
      Output: a.date_id, a.page_group, a.page, a.int_alloc_type, a.componentset, a.adc_visit, upper((a.adc_visit)::text)
      Index Cond: (a.date_id = 20120228)
      Filter: ((a.page)::text = 'ddi_671'::text)
->  Materialize  (cost=40395.08..40560.22 rows=33028 width=1694) (actual time=4606.461..4809.163 rows=45428 loops=1)
      Output: b.vcset, b.eventvalueset, b.eventcountset, b.date_id, (upper(bicommon.get_value('adc_visit'::text, b.vcset)))
      ->  Sort  (cost=40395.08..40477.65 rows=33028 width=1694) (actual time=4606.455..4778.491 rows=45428 loops=1)
            Output: b.vcset, b.eventvalueset, b.eventcountset, b.date_id, (upper(bicommon.get_value('adc_visit'::text, b.vcset)))
            Sort Key: (upper(bicommon.get_value('adc_visit'::text, b.vcset)))
            Sort Method: external merge  Disk: 75440kB
            ->  Index Scan using event_agg_date_id on bi2003.event_agg b  (cost=0.00..13643.60 rows=33028 width=1694) (actual time=0.201..3865.567 rows=45428 loops=1)
                  Output: b.vcset, b.eventvalueset, b.eventcountset, b.date_id, upper(bicommon.get_value('adc_visit'::text, b.vcset))
                  Index Cond: (b.date_id = 20120228)
Total runtime: 37797.536 ms

Now, the problem run: if I change the above query to use one day in a BETWEEN clause where start and end are the same, the query never completes even after 6+ hours.

select a.date_id Date, a.page_group page_group, a.page page, a.int_alloc_type int_alloc_type,
       ...   -- {snipped for brevity}
from bi2003.alps_agg a
left outer join bi2003.event_agg b
  on (a.date_id = b.date_id
      and upper(a.adc_visit) = upper(bicommon.get_value('adc_visit', b.vcset)))
WHERE a.date_id BETWEEN 20120228 AND 20120228   -- CHANGE TO STATEMENT IS HERE
  and a.page = 'ddi_671'
group by 1,2,3,4,5,6,7,8,9;

HashAggregate  (cost=23.23..24.49 rows=1 width=3040)
  Output: a.date_id, a.page_group, a.page, a.int_alloc_type, (bicommon.get_value('browserfamily'::text, b.vcset)), (bicommon.get_value('trafficsource'::text, b.vcset)), (bicommon.get_value('e671_hl_p1'::text, a.componentset)), (bicommon.get_value('e671_img_p1'::text, a.componentset)), (bicommon.get_value('e671_formLabels'::text, a.componentset)), count(a.adc_visit), sum((bicommon.get_value('revenue'::text, b.eventvalueset))::numeric(9,0)), sum((bicommon.get_value('Impression'::text, b.eventcountset))::numeric(9,0)), sum((bicommon.get_value('page1submit'::text, b.eventcountset))::numeric(9,0)), sum((bicommon.get_value('conversion'::text, b.eventcountset))::numeric(9,0)), sum((bicommon.get_value('lead'::text, b.eventcountset))::numeric(9,0))
  ->  Nested Loop  (cost=0.00..21.91 rows=1 width=3040)
        Output: a.date_id, a.page_group, a.page, a.int_alloc_type, b.vcset, a.componentset, a.adc_visit, b.eventvalueset, b.eventcountset, bicommon.get_value('browserfamily'::text, b.vcset), bicommon.get_value('trafficsource'::text, b.vcset), bicommon.get_value('e671_hl_p1'::text, a.componentset), bicommon.get_value('e671_img_p1'::text, a.componentset), bicommon.get_value('e671_formLabels'::text, a.componentset)
        Join Filter: ((a.date_id = b.date_id) AND (upper((a.adc_visit)::text) = upper(bicommon.get_value('adc_visit'::text, b.vcset))))
        ->  Index Scan using alps_agg_date_id on bi2003.alps_agg a  (cost=0.00..10.12 rows=1 width=1350)
              Output: a.date_id, a.adc_visit, a.page_group, a.page, a.int_alloc_type, a.componentset, a.variation_tagset, a.page_instance
              Index Cond: ((a.date_id >= 20120228) AND (a.date_id <= 20120228))
              Filter: ((a.page)::text = 'ddi_671'::text)
        ->  Index Scan using event_agg_date_id on bi2003.event_agg b  (cost=0.00..10.27 rows=1 width=1694)
              Output: b.date_id, b.vcset, b.eventcountset, b.eventvalueset
              Index Cond: ((b.date_id >= 20120228) AND (b.date_id <= 20120228))

Has anyone experienced a similar issue? We have checked the tables and indexes for consistency, and VACUUMed, ANALYZEd, and REINDEXed them many times. But the BETWEEN clause with start and end the same runs for hours and never completes, while a range where start and end are not the same comes back in a few minutes, and a single day in an '=' clause comes back in 30+ seconds. What could be the problem here? Does anyone have any insights?

Venky Kandaswamy
Principal Engineer, Adchemy Inc.
925-200-7124
Re: [SQL] Curious problem of using BETWEEN with start and end being the same versus EQUALS '='
Thanks for the quick and detailed response, Tom. Yes, I did add a redundant WHERE clause with a restriction on b.date_id to the range queries. This appears to speed things up, since it does an index scan on the b table before the merge join. We will get more intelligent about query generation (our system generates queries on the fly) to work around this problem.

Venky Kandaswamy
Principal Engineer, Adchemy Inc.
925-200-7124

From: Tom Lane [t...@sss.pgh.pa.us]
Sent: Tuesday, January 15, 2013 2:30 PM
To: Venky Kandaswamy
Cc: pgsql-gene...@postgresql.org; pgsql-sql@postgresql.org
Subject: Re: [SQL] Curious problem of using BETWEEN with start and end being the same versus EQUALS '='

Venky Kandaswamy writes:
> On 9.1, I am running into a curious issue.

It's not very curious at all, or at least people on pgsql-performance (the right list for this sort of question) would have figured it out quickly. You're getting a crummy plan because of a crummy row estimate. When you do this:

> WHERE a.date_id = 20120228

you get this:

> ->  Index Scan using alps_agg_date_id on bi2003.alps_agg a  (cost=0.00..17870.00 rows=26292 width=1350) (actual time=0.047..142.383 rows=36132 loops=1)
>       Output: a.date_id, a.page_group, a.page, a.int_alloc_type, a.componentset, a.adc_visit, upper((a.adc_visit)::text)
>       Index Cond: (a.date_id = 20120228)
>       Filter: ((a.page)::text = 'ddi_671'::text)

26K estimated rows versus 36K actual isn't the greatest estimate in the world, but it's plenty good enough. But when you do this:

> WHERE a.date_id BETWEEN 20120228 AND 20120228

you get this:

> ->  Index Scan using alps_agg_date_id on bi2003.alps_agg a  (cost=0.00..10.12 rows=1 width=1350)
>       Output: a.date_id, a.adc_visit, a.page_group, a.page, a.int_alloc_type, a.componentset, a.variation_tagset, a.page_instance
>       Index Cond: ((a.date_id >= 20120228) AND (a.date_id <= 20120228))
>       Filter: ((a.page)::text = 'ddi_671'::text)

so the bogus estimate of only one row causes the planner to pick an entirely different plan, which would probably be a great choice if there were indeed only one such row, but with 36000 of them it's horrid.

The reason the row estimate is so crummy is that a zero-width interval is an edge case for range estimates. We've seen this before, although usually it's not quite this bad. There's been some talk of making the estimate for "x >= a AND x <= b" always be at least as much as the estimate for "x = a", but this would increase the cost of making the estimate by quite a bit, and make things actually worse in some cases (in particular, if a > b then a nil estimate is indeed the right thing).

You might look into whether queries formed like "date_id >= 20120228 AND date_id < 20120229" give you more robust estimates at the edge cases.

BTW, I notice in your EXPLAIN results that the same range restriction has been propagated to b.date_id:

> ->  Index Scan using event_agg_date_id on bi2003.event_agg b  (cost=0.00..10.27 rows=1 width=1694)
>       Output: b.date_id, b.vcset, b.eventcountset, b.eventvalueset
>       Index Cond: ((b.date_id >= 20120228) AND (b.date_id <= 20120228))

I'd expect that to happen automatically for a simple equality constraint, but not for a range constraint. Did you do that manually and not tell us about it?

                        regards, tom lane
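To make Tom's two suggestions concrete, here is a sketch of the rewritten query, not a tested fix: the column list is abbreviated to a few of the columns from the thread, and the redundant b.date_id range is placed in the join's ON clause rather than the WHERE clause Venky describes, since a WHERE restriction on b would filter out the outer join's unmatched rows.

select a.date_id, a.page_group, a.page, count(a.adc_visit)
from bi2003.alps_agg a
left outer join bi2003.event_agg b
  on  a.date_id = b.date_id
  and upper(a.adc_visit) = upper(bicommon.get_value('adc_visit', b.vcset))
  -- redundant copy of the range for b, so the planner can index-scan event_agg
  and b.date_id >= 20120228 and b.date_id < 20120229
where a.date_id >= 20120228   -- half-open interval: never a zero-width range,
  and a.date_id <  20120229   -- so the planner's row estimate stays sane
  and a.page = 'ddi_671'
group by 1, 2, 3;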
Re: [SQL] Aggregate over a linked list
Did you try:

select substring(contract from 1 for 1), min(entry_date), max(entry_date), sum(profit)
from contract_table
group by 1;

Venky Kandaswamy
Principal Engineer, Adchemy Inc.
925-200-7124

From: pgsql-sql-ow...@postgresql.org [pgsql-sql-ow...@postgresql.org] on behalf of M Lubratt [mplubr...@gmail.com]
Sent: Thursday, January 17, 2013 3:19 PM
To: pgsql-sql@postgresql.org
Subject: [SQL] Aggregate over a linked list

Hello!

I trade futures contracts and I have a PostgreSQL 9.2 database that keeps track of all of my trading activity. The table includes columns for the futures contract, the entry and exit dates, and the profit for that particular trade.

Now, futures contracts expire, so within a trade indicated by my rules, I may need to "roll" contracts to the next contract to avoid contract expiration. Therefore I can end up with multiple actual trades in my table that are linked by the dates. E.g., if the contract is SH12 (S = soybeans and H12 = March 2012 expiration):

contract   entry_date   exit_date    profit
-------------------------------------------
SH12       2012-1-10    2012-2-27    500.00
SM12       2012-2-27    2012-3-30    200.00

While these are the actual exchange-executed trades, I'd like to reduce this into a single row like (linked by the "S" and then exit_date = entry_date):

contract   entry_date   exit_date    profit
-------------------------------------------
S          2012-1-10    2012-3-30    700.00

I've gone round and round in my head, Google, SQL Cookbook, etc. trying to figure out how to do this. Can anyone provide any pointers on how to do this?

Thanks and best regards!
Mark
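Note that the GROUP BY above collapses every trade sharing a commodity letter into one row, which only matches Mark's ask if the table holds a single chain per commodity. To actually follow the exit_date = entry_date links, a recursive CTE is one option. This is a sketch using the contract_table name from the reply above, and it assumes distinct chains of the same commodity never share a date:

with recursive chain as (
    -- chain heads: trades whose entry_date is not some other trade's exit_date
    select c.contract, c.entry_date, c.exit_date, c.profit,
           c.contract as head, c.entry_date as head_entry
    from contract_table c
    where not exists (
        select 1 from contract_table p
        where substring(p.contract from 1 for 1) = substring(c.contract from 1 for 1)
          and p.exit_date = c.entry_date)
    union all
    -- follow the roll: the next trade starts the day this one ends
    select n.contract, n.entry_date, n.exit_date, n.profit,
           chain.head, chain.head_entry
    from chain
    join contract_table n
      on substring(n.contract from 1 for 1) = substring(chain.contract from 1 for 1)
     and n.entry_date = chain.exit_date
)
select substring(head from 1 for 1) as contract,
       min(entry_date) as entry_date,
       max(exit_date)  as exit_date,
       sum(profit)     as profit
from chain
group by head, head_entry;

On the two sample rows this yields a single row (S, 2012-1-10, 2012-3-30, 700.00), and it keeps separate chains separate if, say, two unrelated soybean trades exist in the table.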
Re: [SQL] How to split an array-column?
You can try:

select id, unnest(array_col) from table;

Venky Kandaswamy
Principal Engineer, Adchemy Inc.
925-200-7124

From: pgsql-sql-ow...@postgresql.org [pgsql-sql-ow...@postgresql.org] on behalf of Andreas [maps...@gmx.net]
Sent: Monday, March 18, 2013 12:13 PM
To: pgsql-sql@postgresql.org
Subject: [SQL] How to split an array-column?

Hi,

I've got a table to import from csv that has an array-column like:

import ( id, array_col, ... )

Those arrays look like:

( 42, ";4941;4931;4932", ... )

They can have 0 or any number of elements separated by ';'. So I'd need a result like this:

42, 4941
42, 4931
42, 4932

How would I get this?
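One caveat: the sample value ";4941;4931;4932" looks like semicolon-delimited text imported from CSV rather than a true PostgreSQL array, and unnest() only takes real arrays. If array_col is text, splitting it first should work. A sketch, using the import table and column names from the question:

-- trim the leading ';' so no empty element is produced, then split into rows;
-- string_to_array('') yields an empty array, so zero-element values produce no rows
select id,
       unnest(string_to_array(trim(leading ';' from array_col), ';'))::int as elem
from import;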
Re: [SQL] value from max row in group by
You can use Postgres WINDOW functions for this in several different ways. For example, one way of doing it (note that last_value() with the default window frame only sees rows up to the current one, so first_value() over a descending ordering is the reliable way to grab the row with the largest offset):

select distinct stts_id,
       first_value(stts_offset + stts_duration)
         over (partition by stts_id order by stts_offset desc) as total_duration
from standard_trip_sections;

Venky Kandaswamy
Principal Engineer, Adchemy Inc.
925-200-7124

From: pgsql-sql-ow...@postgresql.org on behalf of Gary Stainburn
Sent: Thursday, July 25, 2013 10:57 AM
To: pgsql-sql@postgresql.org
Subject: Re: [SQL] value from max row in group by

As usual, once I've asked the question, I find the answer myself. However, it *feels* like there should be a more efficient way. Can anyone comment or suggest a better method?

timetable=> select stts_id, stts_offset+stts_duration as total_duration
timetable-> from standard_trip_sections
timetable-> where (stts_id, stts_offset) in
timetable->     (select stts_id, max(stts_offset) from standard_trip_sections group by stts_id);
 stts_id | total_duration
---------+----------------
       1 | 01:35:00
       2 | 01:35:00
       3 | 01:08:00
       4 | 01:38:00
       5 | 01:03:00
       6 | 01:06:00
(6 rows)

timetable=>
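Another option, sketched against the same table from Gary's query: PostgreSQL's DISTINCT ON keeps exactly one row per group, here the one with the largest stts_offset, and avoids the second scan that the IN-subquery approach needs.

-- one row per stts_id: the ORDER BY puts the max-offset row first within each group
select distinct on (stts_id)
       stts_id,
       stts_offset + stts_duration as total_duration
from standard_trip_sections
order by stts_id, stts_offset desc;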