[SQL] Curious problem of using BETWEEN with start and end being the same versus EQUALS '='

2013-01-15 Thread Venky Kandaswamy
Output: a.date_id, a.page_group, a.page, a.int_alloc_type, 
a.componentset, a.adc_visit, (upper((a.adc_visit)::text))"
"Sort Key: (upper((a.adc_visit)::text))"
"Sort Method: external merge  Disk: 49480kB"
"->  Index Scan using alps_agg_date_id on bi2003.alps_agg a 
 (cost=0.00..17870.00 rows=26292 width=1350) (actual time=0.047..142.383 
rows=36132 loops=1)"
"  Output: a.date_id, a.page_group, a.page, 
a.int_alloc_type, a.componentset, a.adc_visit, upper((a.adc_visit)::text)"
"  Index Cond: (a.date_id = 20120228)"
"  Filter: ((a.page)::text = 'ddi_671'::text)"
"  ->  Materialize  (cost=40395.08..40560.22 rows=33028 width=1694) 
(actual time=4606.461..4809.163 rows=45428 loops=1)"
"Output: b.vcset, b.eventvalueset, b.eventcountset, 
b.date_id, (upper(bicommon.get_value('adc_visit'::text, b.vcset)))"
"->  Sort  (cost=40395.08..40477.65 rows=33028 width=1694) 
(actual time=4606.455..4778.491 rows=45428 loops=1)"
"  Output: b.vcset, b.eventvalueset, b.eventcountset, 
b.date_id, (upper(bicommon.get_value('adc_visit'::text, b.vcset)))"
"  Sort Key: 
(upper(bicommon.get_value('adc_visit'::text, b.vcset)))"
"  Sort Method: external merge  Disk: 75440kB"
"  ->  Index Scan using event_agg_date_id on 
bi2003.event_agg b  (cost=0.00..13643.60 rows=33028 width=1694) (actual 
time=0.201..3865.567 rows=45428 loops=1)"
"Output: b.vcset, b.eventvalueset, 
b.eventcountset, b.date_id, upper(bicommon.get_value('adc_visit'::text, 
b.vcset))"
"Index Cond: (b.date_id = 20120228)"
"Total runtime: 37797.536 ms"

Now, the problem run:
If I change the above query to express the single day as a BETWEEN clause whose
start and end are the same, the query never completes even after 6+ hours.

  select a.date_id Date,
a.page_group page_group,
a.page page,
a.int_alloc_type int_alloc_type,
... --{snipped for brevity}
 from bi2003.alps_agg a left outer join bi2003.event_agg b
on (a.date_id = b.date_id and upper(a.adc_visit) = 
upper(bicommon.get_value('adc_visit', b.vcset)))
 WHERE a.date_id BETWEEN 20120228 AND 20120228  -- CHANGED STATEMENT IS HERE --
   and a.page = 'ddi_671'
 group by 1,2,3,4,5,6,7,8,9;

"HashAggregate  (cost=23.23..24.49 rows=1 width=3040)"
"  Output: a.date_id, a.page_group, a.page, a.int_alloc_type, 
(bicommon.get_value('browserfamily'::text, b.vcset)), 
(bicommon.get_value('trafficsource'::text, b.vcset)), 
(bicommon.get_value('e671_hl_p1'::text, a.componentset)), 
(bicommon.get_value('e671_img_p1'::text, a.componentset)), 
(bicommon.get_value('e671_formLabels'::text, a.componentset)), 
count(a.adc_visit), sum((bicommon.get_value('revenue'::text, 
b.eventvalueset))::numeric(9,0)), sum((bicommon.get_value('Impression'::text, 
b.eventcountset))::numeric(9,0)), sum((bicommon.get_value('page1submit'::text, 
b.eventcountset))::numeric(9,0)), sum((bicommon.get_value('conversion'::text, 
b.eventcountset))::numeric(9,0)), sum((bicommon.get_value('lead'::text, 
b.eventcountset))::numeric(9,0))"
"  ->  Nested Loop  (cost=0.00..21.91 rows=1 width=3040)"
"Output: a.date_id, a.page_group, a.page, a.int_alloc_type, b.vcset, 
a.componentset, a.adc_visit, b.eventvalueset, b.eventcountset, 
bicommon.get_value('browserfamily'::text, b.vcset), 
bicommon.get_value('trafficsource'::text, b.vcset), 
bicommon.get_value('e671_hl_p1'::text, a.componentset), 
bicommon.get_value('e671_img_p1'::text, a.componentset), 
bicommon.get_value('e671_formLabels'::text, a.componentset)"
"Join Filter: ((a.date_id = b.date_id) AND (upper((a.adc_visit)::text) 
= upper(bicommon.get_value('adc_visit'::text, b.vcset"
"->  Index Scan using alps_agg_date_id on bi2003.alps_agg a  
(cost=0.00..10.12 rows=1 width=1350)"
"  Output: a.date_id, a.adc_visit, a.page_group, a.page, 
a.int_alloc_type, a.componentset, a.variation_tagset, a.page_instance"
"  Index Cond: ((a.date_id >= 20120228) AND (a.date_id <= 
20120228))"
"  Filter: ((a.page)::text = 'ddi_671'::text)"
"->  Index Scan using event_agg_date_id on bi2003.event_agg b  
(cost=0.00..10.27 rows=1 width=1694)"
"  Output: b.date_id, b.vcset, b.eventcountset, b.eventvalueset"
"  Index Cond: ((b.date_id >= 20120228) AND (b.date_id <= 
20120228))"

Has anyone experienced a similar issue? We have checked the tables and indexes 
for consistency, and VACUUMed, ANALYZEd, and REINDEXed them many times. The 
BETWEEN clause with start and end the same runs for hours and never completes; 
a range whose start and end differ comes back in a few minutes, and a single 
day written as an '=' clause comes back in 30+ secs.

What could be the problem here? Does anyone have any insights?




Venky Kandaswamy

Principal Engineer, Adchemy Inc.

925-200-7124


Re: [SQL] Curious problem of using BETWEEN with start and end being the same versus EQUALS '='

2013-01-15 Thread Venky Kandaswamy
Thanks for the quick and detailed response, Tom. 

Yes, I did add a redundant WHERE restriction on b.date_id to the range queries. 
That appears to speed things up, since the planner then does an index scan on 
the b table before the merge join. 

We will get more intelligent on query generation (our system generates queries 
on the fly) to work around this problem. 
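
For reference, the workaround can be sketched like this (a sketch only -- the 
exact form our generator emits isn't shown here, and the extra b.date_id test 
is placed in the ON clause so the outer-join semantics are preserved):

```sql
select a.date_id, a.page_group, a.page, a.int_alloc_type
       -- ... remaining columns and aggregates as in the original query
  from bi2003.alps_agg a
  left outer join bi2003.event_agg b
    on (a.date_id = b.date_id
        and b.date_id between 20120228 and 20120301  -- redundant restriction;
            -- in the ON clause (not WHERE) so the left join stays a left join
        and upper(a.adc_visit) = upper(bicommon.get_value('adc_visit', b.vcset)))
 where a.date_id between 20120228 and 20120301
   and a.page = 'ddi_671'
 group by 1, 2, 3, 4;
```

Putting the redundant b.date_id test in the WHERE clause instead would also 
work for planning purposes, but it would silently turn the outer join into an 
inner join by filtering out the NULL-extended rows.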




From: Tom Lane [t...@sss.pgh.pa.us]
Sent: Tuesday, January 15, 2013 2:30 PM
To: Venky Kandaswamy
Cc: pgsql-gene...@postgresql.org; pgsql-sql@postgresql.org
Subject: Re: [SQL] Curious problem of using BETWEEN with start and end being 
the same versus EQUALS '='

Venky Kandaswamy  writes:
>On 9.1, I am running into a curious issue.

It's not very curious at all, or at least people on pgsql-performance
(the right list for this sort of question) would have figured it out
quickly.  You're getting a crummy plan because of a crummy row estimate.
When you do this:

>  WHERE a.date_id = 20120228

you get this:

> "->  Index Scan using alps_agg_date_id on bi2003.alps_agg 
> a  (cost=0.00..17870.00 rows=26292 width=1350) (actual time=0.047..142.383 
> rows=36132 loops=1)"
> "  Output: a.date_id, a.page_group, a.page, 
> a.int_alloc_type, a.componentset, a.adc_visit, upper((a.adc_visit)::text)"
> "  Index Cond: (a.date_id = 20120228)"
> "  Filter: ((a.page)::text = 'ddi_671'::text)"

26K estimated rows versus 36K actual isn't the greatest estimate in the
world, but it's plenty good enough.  But when you do this:

>  WHERE a.date_id BETWEEN 20120228 AND 20120228

you get this:

> "->  Index Scan using alps_agg_date_id on bi2003.alps_agg a  
> (cost=0.00..10.12 rows=1 width=1350)"
> "  Output: a.date_id, a.adc_visit, a.page_group, a.page, 
> a.int_alloc_type, a.componentset, a.variation_tagset, a.page_instance"
> "  Index Cond: ((a.date_id >= 20120228) AND (a.date_id <= 
> 20120228))"
> "  Filter: ((a.page)::text = 'ddi_671'::text)"

so the bogus estimate of only one row causes the planner to pick an
entirely different plan, which would probably be a great choice if there
were indeed only one such row, but with 36000 of them it's horrid.

The reason the row estimate is so crummy is that a zero-width interval
is an edge case for range estimates.  We've seen this before, although
usually it's not quite this bad.

There's been some talk of making the estimate for "x >= a AND x <= b"
always be at least as much as the estimate for "x = a", but this would
increase the cost of making the estimate by quite a bit, and make things
actually worse in some cases (in particular, if a > b then a nil
estimate is indeed the right thing).

You might look into whether queries formed like "date_id >= 20120228 AND
date_id < 20120229" give you more robust estimates at the edge cases.
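
Concretely, the two ways of writing the one-day restriction are:

```sql
-- Closed interval: when start = end, the planner's range estimate
-- collapses to ~1 row and triggers the bad nested-loop plan
WHERE a.date_id BETWEEN 20120228 AND 20120228

-- Half-open interval: selects the same rows, but the endpoints differ,
-- so the estimate avoids the zero-width edge case
WHERE a.date_id >= 20120228 AND a.date_id < 20120229
```

Since date_id is an integer encoding of YYYYMMDD, the upper bound need not be 
a valid calendar date: ">= 20120131 AND < 20120132" still selects exactly 
20120131.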

BTW, I notice in your EXPLAIN results that the same range restriction
has been propagated to b.date_id:

> "->  Index Scan using event_agg_date_id on bi2003.event_agg b  
> (cost=0.00..10.27 rows=1 width=1694)"
> "  Output: b.date_id, b.vcset, b.eventcountset, b.eventvalueset"
> "  Index Cond: ((b.date_id >= 20120228) AND (b.date_id <= 
> 20120228))"

I'd expect that to happen automatically for a simple equality
constraint, but not for a range constraint.  Did you do that manually
and not tell us about it?

regards, tom lane




-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] Aggregate over a linked list

2013-01-17 Thread Venky Kandaswamy
Did you try:

select substring(contract from 1 for 1), min(entry_date), max(entry_date), 
sum(profit)
from contract_table
group by 1;
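
If the rows must be chained strictly through exit_date = entry_date (rather 
than just grouped by the contract prefix), a recursive CTE is one option. A 
sketch, assuming the contract_table layout from the question below and that 
the roll chains are acyclic:

```sql
WITH RECURSIVE chain AS (
    -- roots: trades that no earlier trade in the same product rolls into
    SELECT left(contract, 1) AS root, entry_date, exit_date, profit
    FROM contract_table c
    WHERE NOT EXISTS (
        SELECT 1 FROM contract_table p
        WHERE left(p.contract, 1) = left(c.contract, 1)
          AND p.exit_date = c.entry_date)
    UNION ALL
    -- follow each roll: the next trade enters the day the previous one exits
    SELECT ch.root, ch.entry_date, n.exit_date, ch.profit + n.profit
    FROM chain ch
    JOIN contract_table n
      ON left(n.contract, 1) = ch.root
     AND n.entry_date = ch.exit_date
)
-- keep only the fully extended row of each chain
SELECT DISTINCT ON (root, entry_date)
       root AS contract, entry_date, exit_date, profit
FROM chain
ORDER BY root, entry_date, exit_date DESC;
```

For the sample data this collapses SH12 and SM12 into a single row 
(S, 2012-1-10, 2012-3-30, 700.00), and unlike the simple GROUP BY it keeps 
separate trades in the same product distinct when their dates do not chain.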






From: pgsql-sql-ow...@postgresql.org [pgsql-sql-ow...@postgresql.org] on behalf 
of M Lubratt [mplubr...@gmail.com]
Sent: Thursday, January 17, 2013 3:19 PM
To: pgsql-sql@postgresql.org
Subject: [SQL] Aggregate over a linked list

Hello!

I trade futures contracts and I have a PostgreSQL 9.2 database that keeps track 
of all of my trading activity.  The table includes columns for the futures 
contract, the entry and exit dates and the profit for that particular trade.  
Now, futures contracts expire, so within a single trade indicated by my rules 
I may need to "roll" to the next contract to avoid expiration. Therefore I can 
end up with multiple actual trades in my table that are linked by the dates.

e.g.

If the contract is SH12 (S = soybeans and H12 = March 2012 expiration)

contract  entry_date   exit_date   profit
-----------------------------------------
SH12      2012-1-10    2012-2-27   500.00
SM12      2012-2-27    2012-3-30   200.00

While these are the actual exchange executed trades, I'd like to reduce this 
into a single row like (linked by the "S" and then exit_date = entry_date):

contract  entry_date   exit_date   profit
-----------------------------------------
S         2012-1-10    2012-3-30   700.00

I've gone round and round in my head, google, SQL Cookbook, etc. trying to 
figure out how to do this.  Can anyone provide any pointers on how to do this?

Thanks and best regards!
Mark



Re: [SQL] How to split an array-column?

2013-03-18 Thread Venky Kandaswamy
You can try 

select id, unnest(array_col) from table
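
That works if array_col is a real Postgres array. In the message below the 
column appears to be semicolon-separated text, in which case one sketch 
(assuming the column is text and the elements are numeric) is to split it 
first:

```sql
-- strip the leading ';', split on ';', then unnest the resulting array
select id,
       unnest(string_to_array(trim(leading ';' from array_col), ';'))::int as value
from import;
```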






From: pgsql-sql-ow...@postgresql.org [pgsql-sql-ow...@postgresql.org] on behalf 
of Andreas [maps...@gmx.net]
Sent: Monday, March 18, 2013 12:13 PM
To: pgsql-sql@postgresql.org
Subject: [SQL] How to split an array-column?

Hi,

I've got a table to import from csv that has an array-column like:

import ( id, array_col, ... )

Those arrays look like ( 42,  ";4941;4931;4932", ... )
They can have 0 or any number of elements separated by ;

So I'd need a result like this:
42, 4941
42, 4931
42, 4932

How would I get this?








Re: [SQL] value from max row in group by

2013-07-25 Thread Venky Kandaswamy
You can use Postgres WINDOW functions for this in several different ways. For 
example, one way of doing it:

select distinct stts_id,
       first_value(stts_offset + stts_duration)
         over (partition by stts_id order by stts_offset desc) as total_duration
from standard_trip_sections;
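
Another way, using DISTINCT ON rather than window functions (same assumed 
table and columns as in Gary's message below):

```sql
-- one row per stts_id: the row with the largest stts_offset
select distinct on (stts_id)
       stts_id, stts_offset + stts_duration as total_duration
from standard_trip_sections
order by stts_id, stts_offset desc;
```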





From: pgsql-sql-ow...@postgresql.org  on behalf 
of Gary Stainburn 
Sent: Thursday, July 25, 2013 10:57 AM
To: pgsql-sql@postgresql.org
Subject: Re: [SQL] value from max row in group by

As usual, once I've asked the question, I find the answer myself.

However, it *feels* like there should be a more efficient way. Can anyone
comment or suggest a better method?

timetable=> select stts_id, stts_offset+stts_duration as total_duration
timetable-> from standard_trip_sections
timetable-> where (stts_id, stts_offset) in
timetable-> (select stts_id, max(stts_offset) from standard_trip_sections
group by stts_id);
 stts_id | total_duration
---------+----------------
       1 | 01:35:00
       2 | 01:35:00
       3 | 01:08:00
       4 | 01:38:00
       5 | 01:03:00
       6 | 01:06:00
(6 rows)

timetable=>



