Re: [GENERAL] Query to find contiguous ranges on a column
On Tue, Oct 13, 2009 at 5:12 PM, Tim Landscheidt t...@tim-landscheidt.de wrote: Peter Hunsberger peter.hunsber...@gmail.com wrote: You can either use a PL/pgSQL function (SETOF TEXT just for the convenience of the example): That works well, takes about 20 seconds to do the 6M+ rows or a recursive query (which I always find very hard to com- prehend): | WITH RECURSIVE RecCols (LeftBoundary, Value) AS | (SELECT col, col FROM t WHERE (col - 1) NOT IN (SELECT col FROM t) | UNION ALL SELECT p.LeftBoundary, c.col FROM RecCols AS p, t AS c WHERE c.col = p.Value + 1) | SELECT LeftBoundary, MAX(Value) AS RightBoundary FROM RecCols | GROUP BY LeftBoundary | ORDER BY LeftBoundary; Could you run both against your data set and find out which one is faster for your six million rows? Turns out the server is v 8.3, looks like I need to get them to upgrade it so I get recursive and windowing :-(. If this happens any time soon I'll let you know the results. Many thanks. -- Peter Hunsberger -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Query to find contiguous ranges on a column
Peter Hunsberger peter.hunsber...@gmail.com wrote: [...] or a recursive query (which I always find very hard to com- prehend): | WITH RECURSIVE RecCols (LeftBoundary, Value) AS | (SELECT col, col FROM t WHERE (col - 1) NOT IN (SELECT col FROM t) | UNION ALL SELECT p.LeftBoundary, c.col FROM RecCols AS p, t AS c WHERE c.col = p.Value + 1) | SELECT LeftBoundary, MAX(Value) AS RightBoundary FROM RecCols | GROUP BY LeftBoundary | ORDER BY LeftBoundary; Could you run both against your data set and find out which one is faster for your six million rows? Turns out the server is v 8.3, looks like I need to get them to upgrade it so I get recursive and windowing :-(. If this happens any time soon I'll let you know the results. Many thanks. After some tests with a data set of 7983 rows (and 1638 ran- ges): Don't! :-) The recursive solution seems to be more than double as slow as the iterative. I'll take it to -per- formance. Tim -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Query to find contiguous ranges on a column
On Wed, Oct 14, 2009 at 4:50 PM, Tim Landscheidt t...@tim-landscheidt.de wrote: Peter Hunsberger peter.hunsber...@gmail.com wrote: After some tests with a data set of 7983 rows (and 1638 ran- ges): Don't! :-) The recursive solution seems to be more than double as slow as the iterative. I'll take it to -per- formance. Interesting, I've never liked recursive on Oracle but performance is usually reasonable... Thanks for the heads up... -- Peter Hunsberger -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
[GENERAL] Query to find contiguous ranges on a column
Given a column of data resembling the following: col 2 3 4 5 11 12 13 14 15 16 17 18 19 23 32 33 34 35 36 37 I need a query to find the contiguous ranges within this column, in this case returning the result set: start, end 2, 5 11, 19 23, 23 32, 37 I have one solution that joins the table against itself and does (among other things) a subselect looking not exists col +1 and not exists col -1 on the two instances of the table to find the start and end. This is, as you might guess, is not very efficient (my actual data is some 6 million+ rows) and I'm guessing there has to be something more efficient with windowing or possibly grouping on min and max (though I can't see how to make sure they are part of a contiguous set). Anyone have any ideas? -- Peter Hunsberger -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Query to find contiguous ranges on a column
Peter Hunsberger peter.hunsber...@gmail.com wrote: [...] I have one solution that joins the table against itself and does (among other things) a subselect looking not exists col +1 and not exists col -1 on the two instances of the table to find the start and end. This is, as you might guess, is not very efficient (my actual data is some 6 million+ rows) and I'm guessing there has to be something more efficient with windowing or possibly grouping on min and max (though I can't see how to make sure they are part of a contiguous set). Anyone have any ideas? You can either use a PL/pgSQL function (SETOF TEXT just for the convenience of the example): | CREATE FUNCTION SummarizeRanges () RETURNS SETOF TEXT AS $$ | DECLARE | CurrentFirst INT; | CurrentLast INT; | CurrentRecord RECORD; | BEGIN | FOR CurrentRecord IN SELECT col FROM t ORDER BY col LOOP | IF CurrentFirst IS NULL THEN | CurrentFirst := CurrentRecord.col; | CurrentLast := CurrentRecord.col; | ELSIF CurrentRecord.col = CurrentLast + 1 THEN | CurrentLast := CurrentRecord.col; | ELSE | RETURN NEXT CurrentFirst || ', ' || CurrentLast; | CurrentFirst := CurrentRecord.col; | CurrentLast := CurrentRecord.col; | END IF; | END LOOP; | IF CurrentFirst IS NOT NULL THEN | RETURN NEXT CurrentFirst || ', ' || CurrentLast; | END IF; | RETURN; | END; | $$ LANGUAGE plpgsql; or a recursive query (which I always find very hard to com- prehend): | WITH RECURSIVE RecCols (LeftBoundary, Value) AS | (SELECT col, col FROM t WHERE (col - 1) NOT IN (SELECT col FROM t) |UNION ALL SELECT p.LeftBoundary, c.col FROM RecCols AS p, t AS c WHERE c.col = p.Value + 1) | SELECT LeftBoundary, MAX(Value) AS RightBoundary FROM RecCols | GROUP BY LeftBoundary | ORDER BY LeftBoundary; Could you run both against your data set and find out which one is faster for your six million rows? Tim -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Query to find contiguous ranges on a column
Peter Hunsberger wrote on 13.10.2009 23:23: I need a query to find the contiguous ranges within this column, in this case returning the result set: start, end 2, 5 11, 19 23, 23 32, 37 I have one solution that joins the table against itself and does (among other things) a subselect looking not exists col +1 and not exists col -1 on the two instances of the table to find the start and end. This is, as you might guess, is not very efficient (my actual data is some 6 million+ rows) and I'm guessing there has to be something more efficient with windowing or possibly grouping on min and max (though I can't see how to make sure they are part of a contiguous set). Anyone have any ideas? This is the best I can put together right now. Not very nice, but currently I can't think of a better solution: select * from ( select soi as start_of_interval, case when soi is not null and eoi is null then lead(eoi) over() when soi is not null and eoi is not null then eoi else null end as end_of_interval from ( select case when col - (lag(col,1,col + 1) over (order by col)) - 1 0 then col else null end as soi, case when col - (lead(col,1,col + 1) over (order by col)) + 1 0 then col else null end as eoi from numbers ) t1 where t1.soi is not null or t1.eoi is not null ) t2 where t2.start_of_interval is not null and t2.end_of_interval is not null; The outer-most select is needed to get rid of the empty rows. I couldn't find a way to push that into one of the sub-queries. The execution plan doesn't look too bad (probably better than your plan with a self join and a subselect), but it does sort the whole table which might be a problem. Regards Thomas -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general