Re: Given a set of daterange, finding the continuous range that includes a particular date (aggregates)

2018-02-28 Thread mariusz

hi,

On Tue, 2018-02-27 at 16:15 -0800, Ken Tanzer wrote:
> Hi.  Thanks so much for your assistance.  This is definitely getting
> the results I was looking for.  It is still syntacticallly more
> cumbersome than I might have hoped, but I can work with it.  So I've
> got two follow questions/issues:
> 
> 
> 1)  I can see there are many, more complex, options for aggregates,
> which I am trying to wrap my mind around.  I'm wondering if any of
> these (esp. partial aggregates/combine functions, final functions or
> moving aggregates) could be used to streamline this into a single
> function call, or if that is barking up a dead tree.
> 
i'm not an expert and i don't use fancy aggregates too often, just got a
simple solution, so i may be wrong, but i don't see simple solution as
one function. so unless someone comes with a better idea, you have to
stay with this, which isn't complex as the idea and solution is really
very simple (simplicity is somewhat visually hidden in that case by
converting dates to ranges and to boundary dates again (and again to
ranges for visualizing result)).
but it requires additional sort for each window as you can see in
explain which is the cost to pay
> 
> 2)  I'm sure at this point I must be being dense, but after re-reading
> docs multiple times, I am still confused about one piece of this:
> 
> 
> > first window (within lower() func) extends its initial
> (earliest) range
> > to right for every next row continuous with current range
> (and jumps to
> > new start on discontinuity), thus over(order by d ASC)
> > second window (within upper() func) extends its initial
> (latest) range
> > to left for every next row continuous with current range
> (and jumps to
> > new end on discontinuity), thus over(order by d DESC)
> > partition by client_id within each over() is to not mix
> client_ids'
> > dateranges
> >
> 
> 
> which is what is it exactly that is triggering Postgresql to know
> there is a discontinuity and to start with a new range?  And is it
> based on the input or the output values?  Based on PARTITION BY
> client_id ORDER BY d, I might have thought it was d.  But that doesn't
> seem to be right.  So is it something about what agg_daterange is
> returning, and if so what?  Again, sorry for being dense about this
> one.
> 
posgresql itself doesn't know nor care about those discontinuities, all
it cares about is partition by client_id to not mix client_ids, and
applying our agg function for rows in order of our choice.

here again i remind you, you don't really want to sort by d::daterange
for second descending window, but end_date because desc order of range
will not be the same as desc order of end_date in general case,
and start_date for first window (but as i already said, for that first
ascending window it's cosmetics as order of d::daterange and
start_date::date will be the same.
that doesn't really matter in your case of exclusive ranges but matters
in more general case.

and back to your question, we find discontinuity in our function. see
the query

SELECT CASE WHEN d1 && d2 OR d1 -|- d2 THEN d1 + d2 ELSE d2 END

where d1 is internal state value of aggregate and d2 is current row's
daterange value.
we check for continuity by d1 && d2 OR d1 -|- d2, and if it is
continuous than we just extend result (return value for current row and
new state value of agg) by d1+d2,
otherwise, when we find discontinuity, we forget d1 (agg state value
until now) and simply return d2 being current row's range, thus starting
with new range.

hope that explains enough. as i already said, i'm not an expert, i'm
just coincidentally working currently on my semi-toy project which
utilizes dateranges quite heavily.
anyway, feel free to ask if you have any further questions. for now i'm
glad i could help somehow.

regards,
mariusz jadczak

> 
> Thanks!
> 
> 
> Ken
> 
> 
> 
> -- 
> 
> AGENCY Software  
> A Free Software data system
> By and for non-profits
> http://agency-software.org/
> https://demo.agency-software.org/client
> 
> ken.tan...@agency-software.org
> (253) 245-3801
> 
> 
> Subscribe to the mailing list to
> learn more about AGENCY or
> follow the discussion.





Re: Given a set of daterange, finding the continuous range that includes a particular date (aggregates)

2018-02-27 Thread Ken Tanzer
Hi.  Thanks so much for your assistance.  This is definitely getting the
results I was looking for.  It is still syntacticallly more cumbersome than
I might have hoped, but I can work with it.  So I've got two follow
questions/issues:

1)  I can see there are many, more complex, options for aggregates, which I
am trying to wrap my mind around.  I'm wondering if any of these (esp.
partial aggregates/combine functions, final functions or moving aggregates)
could be used to streamline this into a single function call, or if that is
barking up a dead tree.

2)  I'm sure at this point I must be being dense, but after re-reading docs
multiple times, I am still confused about one piece of this:

> first window (within lower() func) extends its initial (earliest) range
> > to right for every next row continuous with current range (and jumps to
> > new start on discontinuity), thus over(order by d ASC)
> > second window (within upper() func) extends its initial (latest) range
> > to left for every next row continuous with current range (and jumps to
> > new end on discontinuity), thus over(order by d DESC)
> > partition by client_id within each over() is to not mix client_ids'
> > dateranges
> >
>
> which is what is it exactly that is triggering Postgresql to know there is
a discontinuity and to start with a new range?  And is it based on the
input or the output values?  Based on PARTITION BY client_id ORDER BY d, I
might have thought it was d.  But that doesn't seem to be right.  So is it
something about what agg_daterange is returning, and if so what?  Again,
sorry for being dense about this one.

Thanks!

Ken


-- 
AGENCY Software
A Free Software data system
By and for non-profits
*http://agency-software.org/ *
*https://demo.agency-software.org/client
*
ken.tan...@agency-software.org
(253) 245-3801

Subscribe to the mailing list
 to
learn more about AGENCY or
follow the discussion.


Re: Given a set of daterange, finding the continuous range that includes a particular date (aggregates)

2018-02-27 Thread mariusz

hello,

one more fix, to not let someone get incorrect/incomplete ideas, see
below


On Tue, 2018-02-27 at 10:03 +0100, mariusz wrote:
> On Fri, 2018-02-23 at 18:11 -0800, Ken Tanzer wrote:
> > On Fri, Feb 23, 2018 at 6:10 AM,
> > mariusz  wrote: 
> > 
> > 
> > i guess, you can easily get max continuous range for each row
> > with
> > something like this:
> > 
> > CREATE OR REPLACE FUNCTION
> > append_daterange(d1 daterange, d2 daterange)
> > RETURNS daterange
> > LANGUAGE sql
> > AS
> > $$
> > SELECT CASE WHEN d1 && d2 OR d1 -|- d2 THEN d1 + d2
> > ELSE d2 END;
> > $$;
> > 
> > CREATE AGGREGATE agg_daterange (daterange) (
> > sfunc = append_daterange,
> > stype = daterange
> > );
> > 
> > SELECT dr,
> >lower(agg_daterange(dr) OVER (ORDER BY dr ASC)),
> >upper(agg_daterange(dr) OVER (ORDER BY dr DESC))
> >   FROM ...
> > 
> > above example is simplified to selecting only daterange column
> > "dr" for
> > readability, which in your case should be something like
> > 
> > daterange(staff_assign_date,staff_assign_date_end,'[)')
> > 
> > please note that daterange would be normalized to [) format so
> > upper()
> > above returns exactly your max "staff_assign_date_end" for
> > each
> > continuous range when dateranges are created with '[)' format.
> > 
> > the key point is ... ELSE d2 in append_daterange() which
> > starts with new
> > value each time that new value is discontinuous with agg's
> > state value
> > and order in which rows are processed (ASC for lower of
> > daterange, DESC
> > for upper of daterange).
> > 
> > unfortunately this involves reading all rows for "client_id"
> > and
> > additional sorting for each window.
> > i recall reading that you already pay the price of reading all
> > rows for
> > client_id anyway, so the only question is the cost of two
> > additional
> > sorts (maybe extracting dateranges to subset on which to do
> > windows and
> > rejoining result of continuous ranges to original set would
> > help to
> > lower the cost).
> > 
> > 
> > Thank you, and I wanted to follow up on this.  I couldn't quite get
> > your example working as described, but I also ended up trying
> > something very similar that got me very close but not quite there.
> > Basically, I can see that it is correctly calculating the ranges (in
> > the notices), but it is only returning the last range for each client.
> > (Because I said PARTITION BY client_id).
> > 
> > 
> sorry for late replay, i was offline from sat to mon inclusive.
> i may have previously added some confusion, so i'll try to explain what
> i had in my mind. see below what my suggested query is.
> 
> 
> > So I'm not sure if I should be calling this differently, or if the
> > function needs to work differently, or if this just isn't possible.
> > Do I need to partition by something else, and if so what?  I don't see
> > what I could specify that would indicate a new episode.
> > 
> definitely you want to partition by client_id if you are calculating
> this for multiple client_ids, but what matters here for each client_id
> is order of dateranges in over()
> 
> > 
> > Also, it's not clear to me how an aggregate might define/return
> > different values within a partition.  Although this must be possible,
> > since functions like rank() and row_number() seem to do it.
> > 
> that is because function is defined like agg, but what happens is that
> its current state value (return value of agg function) is returned for
> every row within given window without grouping the resultset
> > 
> > Hoping there is something easy that can be tweaked here.  See below
> > for copy/pastable test stuff.  It includes output from both functions.
> > Both look to be returning the same results, which makes me wonder if
> > my passing in a start date was a waste of time, though it seems to me
> > it would be necessary.
> > 
> > Cheers,
> > Ken
> > 
> > 
> > BEGIN;
> > CREATE TEMP TABLE sample_data (
> > client_id INTEGER,
> > start_date DATE,
> > end_date DATE,
> > episode INTEGER -- Just a label, for clarity
> > );
> > 
> > INSERT INTO sample_data VALUES
> > (1,'1990-01-01','1990-12-31',1),
> > (1,'1991-01-01','1991-12-31',1),
> > 
> > (1,'2000-01-01','2000-12-31',2),
> > (1,'2001-01-01','2001-12-31',2),
> > (1,'2002-01-01','2002-12-31',2),
> > (1,'2003-01-01','2003-12-31',2),
> > (1,'2004-01-01','2004-12-31',2),
> > (1,'2005-01-01','2005-12-31',2),
> > (1,'2006-01-01','2006-12-31',2),
> > 
> > 

Re: Given a set of daterange, finding the continuous range that includes a particular date (aggregates)

2018-02-27 Thread mariusz


On Tue, 2018-02-27 at 10:03 +0100, mariusz wrote:
> On Fri, 2018-02-23 at 18:11 -0800, Ken Tanzer wrote:
> > On Fri, Feb 23, 2018 at 6:10 AM,
> > mariusz  wrote: 
> > 
> > 
> > i guess, you can easily get max continuous range for each row
> > with
> > something like this:
> > 
> > CREATE OR REPLACE FUNCTION
> > append_daterange(d1 daterange, d2 daterange)
> > RETURNS daterange
> > LANGUAGE sql
> > AS
> > $$
> > SELECT CASE WHEN d1 && d2 OR d1 -|- d2 THEN d1 + d2
> > ELSE d2 END;
> > $$;
> > 
> > CREATE AGGREGATE agg_daterange (daterange) (
> > sfunc = append_daterange,
> > stype = daterange
> > );
> > 
> > SELECT dr,
> >lower(agg_daterange(dr) OVER (ORDER BY dr ASC)),
> >upper(agg_daterange(dr) OVER (ORDER BY dr DESC))
> >   FROM ...
> > 
> > above example is simplified to selecting only daterange column
> > "dr" for
> > readability, which in your case should be something like
> > 
> > daterange(staff_assign_date,staff_assign_date_end,'[)')
> > 
> > please note that daterange would be normalized to [) format so
> > upper()
> > above returns exactly your max "staff_assign_date_end" for
> > each
> > continuous range when dateranges are created with '[)' format.
> > 
> > the key point is ... ELSE d2 in append_daterange() which
> > starts with new
> > value each time that new value is discontinuous with agg's
> > state value
> > and order in which rows are processed (ASC for lower of
> > daterange, DESC
> > for upper of daterange).
> > 
> > unfortunately this involves reading all rows for "client_id"
> > and
> > additional sorting for each window.
> > i recall reading that you already pay the price of reading all
> > rows for
> > client_id anyway, so the only question is the cost of two
> > additional
> > sorts (maybe extracting dateranges to subset on which to do
> > windows and
> > rejoining result of continuous ranges to original set would
> > help to
> > lower the cost).
> > 
> > 
> > Thank you, and I wanted to follow up on this.  I couldn't quite get
> > your example working as described, but I also ended up trying
> > something very similar that got me very close but not quite there.
> > Basically, I can see that it is correctly calculating the ranges (in
> > the notices), but it is only returning the last range for each client.
> > (Because I said PARTITION BY client_id).
> > 
> > 
> sorry for late replay, i was offline from sat to mon inclusive.
> i may have previously added some confusion, so i'll try to explain what
> i had in my mind. see below what my suggested query is.
> 
> 
> > So I'm not sure if I should be calling this differently, or if the
> > function needs to work differently, or if this just isn't possible.
> > Do I need to partition by something else, and if so what?  I don't see
> > what I could specify that would indicate a new episode.
> > 
> definitely you want to partition by client_id if you are calculating
> this for multiple client_ids, but what matters here for each client_id
> is order of dateranges in over()
> 
> > 
> > Also, it's not clear to me how an aggregate might define/return
> > different values within a partition.  Although this must be possible,
> > since functions like rank() and row_number() seem to do it.
> > 
> that is because function is defined like agg, but what happens is that
> its current state value (return value of agg function) is returned for
> every row within given window without grouping the resultset
> > 
> > Hoping there is something easy that can be tweaked here.  See below
> > for copy/pastable test stuff.  It includes output from both functions.
> > Both look to be returning the same results, which makes me wonder if
> > my passing in a start date was a waste of time, though it seems to me
> > it would be necessary.
> > 
> > Cheers,
> > Ken
> > 
> > 
> > BEGIN;
> > CREATE TEMP TABLE sample_data (
> > client_id INTEGER,
> > start_date DATE,
> > end_date DATE,
> > episode INTEGER -- Just a label, for clarity
> > );
> > 
> > INSERT INTO sample_data VALUES
> > (1,'1990-01-01','1990-12-31',1),
> > (1,'1991-01-01','1991-12-31',1),
> > 
> > (1,'2000-01-01','2000-12-31',2),
> > (1,'2001-01-01','2001-12-31',2),
> > (1,'2002-01-01','2002-12-31',2),
> > (1,'2003-01-01','2003-12-31',2),
> > (1,'2004-01-01','2004-12-31',2),
> > (1,'2005-01-01','2005-12-31',2),
> > (1,'2006-01-01','2006-12-31',2),
> > 
> > (1,'2014-01-01','2014-12-31',3),
> > (1,'2015-01-01','2015-12-31',3),
> > 

Re: Given a set of daterange, finding the continuous range that includes a particular date (aggregates)

2018-02-27 Thread mariusz
On Fri, 2018-02-23 at 18:11 -0800, Ken Tanzer wrote:
> On Fri, Feb 23, 2018 at 6:10 AM,
> mariusz  wrote: 
> 
> 
> i guess, you can easily get max continuous range for each row
> with
> something like this:
> 
> CREATE OR REPLACE FUNCTION
> append_daterange(d1 daterange, d2 daterange)
> RETURNS daterange
> LANGUAGE sql
> AS
> $$
> SELECT CASE WHEN d1 && d2 OR d1 -|- d2 THEN d1 + d2
> ELSE d2 END;
> $$;
> 
> CREATE AGGREGATE agg_daterange (daterange) (
> sfunc = append_daterange,
> stype = daterange
> );
> 
> SELECT dr,
>lower(agg_daterange(dr) OVER (ORDER BY dr ASC)),
>upper(agg_daterange(dr) OVER (ORDER BY dr DESC))
>   FROM ...
> 
> above example is simplified to selecting only daterange column
> "dr" for
> readability, which in your case should be something like
> 
> daterange(staff_assign_date,staff_assign_date_end,'[)')
> 
> please note that daterange would be normalized to [) format so
> upper()
> above returns exactly your max "staff_assign_date_end" for
> each
> continuous range when dateranges are created with '[)' format.
> 
> the key point is ... ELSE d2 in append_daterange() which
> starts with new
> value each time that new value is discontinuous with agg's
> state value
> and order in which rows are processed (ASC for lower of
> daterange, DESC
> for upper of daterange).
> 
> unfortunately this involves reading all rows for "client_id"
> and
> additional sorting for each window.
> i recall reading that you already pay the price of reading all
> rows for
> client_id anyway, so the only question is the cost of two
> additional
> sorts (maybe extracting dateranges to subset on which to do
> windows and
> rejoining result of continuous ranges to original set would
> help to
> lower the cost).
> 
> 
> Thank you, and I wanted to follow up on this.  I couldn't quite get
> your example working as described, but I also ended up trying
> something very similar that got me very close but not quite there.
> Basically, I can see that it is correctly calculating the ranges (in
> the notices), but it is only returning the last range for each client.
> (Because I said PARTITION BY client_id).
> 
> 
sorry for late replay, i was offline from sat to mon inclusive.
i may have previously added some confusion, so i'll try to explain what
i had in my mind. see below what my suggested query is.


> So I'm not sure if I should be calling this differently, or if the
> function needs to work differently, or if this just isn't possible.
> Do I need to partition by something else, and if so what?  I don't see
> what I could specify that would indicate a new episode.
> 
definitely you want to partition by client_id if you are calculating
this for multiple client_ids, but what matters here for each client_id
is order of dateranges in over()

> 
> Also, it's not clear to me how an aggregate might define/return
> different values within a partition.  Although this must be possible,
> since functions like rank() and row_number() seem to do it.
> 
that is because function is defined like agg, but what happens is that
its current state value (return value of agg function) is returned for
every row within given window without grouping the resultset
> 
> Hoping there is something easy that can be tweaked here.  See below
> for copy/pastable test stuff.  It includes output from both functions.
> Both look to be returning the same results, which makes me wonder if
> my passing in a start date was a waste of time, though it seems to me
> it would be necessary.
> 
> Cheers,
> Ken
> 
> 
> BEGIN;
> CREATE TEMP TABLE sample_data (
> client_id INTEGER,
> start_date DATE,
> end_date DATE,
> episode INTEGER -- Just a label, for clarity
> );
> 
> INSERT INTO sample_data VALUES
> (1,'1990-01-01','1990-12-31',1),
> (1,'1991-01-01','1991-12-31',1),
> 
> (1,'2000-01-01','2000-12-31',2),
> (1,'2001-01-01','2001-12-31',2),
> (1,'2002-01-01','2002-12-31',2),
> (1,'2003-01-01','2003-12-31',2),
> (1,'2004-01-01','2004-12-31',2),
> (1,'2005-01-01','2005-12-31',2),
> (1,'2006-01-01','2006-12-31',2),
> 
> (1,'2014-01-01','2014-12-31',3),
> (1,'2015-01-01','2015-12-31',3),
> (1,'2017-06-30','2017-12-31',4),
> (1,'2018-01-01',NULL,4),
> 
> (2,'2014-02-01','2015-01-31',1),
> (2,'2015-02-01','2015-12-31',1),
> (2,'2017-09-30','2018-01-31',2),
> (2,'2018-02-01','2018-02-14',2)
> ;
> 
> CREATE OR REPLACE FUNCTION
> append_daterange(d1 daterange, d2 daterange)
> RETURNS 

Re: Given a set of daterange, finding the continuous range that includes a particular date (aggregates)

2018-02-23 Thread Ken Tanzer
>
> On Fri, Feb 23, 2018 at 6:10 AM, mariusz  wrote:


> i guess, you can easily get max continuous range for each row with
> something like this:
>
> CREATE OR REPLACE FUNCTION
> append_daterange(d1 daterange, d2 daterange)
> RETURNS daterange
> LANGUAGE sql
> AS
> $$
> SELECT CASE WHEN d1 && d2 OR d1 -|- d2 THEN d1 + d2 ELSE d2 END;
> $$;
>
> CREATE AGGREGATE agg_daterange (daterange) (
> sfunc = append_daterange,
> stype = daterange
> );
>
> SELECT dr,
>lower(agg_daterange(dr) OVER (ORDER BY dr ASC)),
>upper(agg_daterange(dr) OVER (ORDER BY dr DESC))
>   FROM ...
>
> above example is simplified to selecting only daterange column "dr" for
> readability, which in your case should be something like
>
> daterange(staff_assign_date,staff_assign_date_end,'[)')
>
> please note that daterange would be normalized to [) format so upper()
> above returns exactly your max "staff_assign_date_end" for each
> continuous range when dateranges are created with '[)' format.
>
> the key point is ... ELSE d2 in append_daterange() which starts with new
> value each time that new value is discontinuous with agg's state value
> and order in which rows are processed (ASC for lower of daterange, DESC
> for upper of daterange).
>
> unfortunately this involves reading all rows for "client_id" and
> additional sorting for each window.
> i recall reading that you already pay the price of reading all rows for
> client_id anyway, so the only question is the cost of two additional
> sorts (maybe extracting dateranges to subset on which to do windows and
> rejoining result of continuous ranges to original set would help to
> lower the cost).
>
>
Thank you, and I wanted to follow up on this.  I couldn't quite get your
example working as described, but I also ended up trying something very
similar that got me very close but not quite there.  Basically, I can see
that it is correctly calculating the ranges (in the notices), but it is
only returning the last range for each client.  (Because I said PARTITION
BY client_id).

So I'm not sure if I should be calling this differently, or if the function
needs to work differently, or if this just isn't possible.  Do I need to
partition by something else, and if so what?  I don't see what I could
specify that would indicate a new episode.

Also, it's not clear to me how an aggregate might define/return different
values within a partition.  Although this must be possible, since functions
like rank() and row_number() seem to do it.

Hoping there is something easy that can be tweaked here.  See below for
copy/pastable test stuff.  It includes output from both functions.  Both
look to be returning the same results, which makes me wonder if my passing
in a start date was a waste of time, though it seems to me it would be
necessary.

Cheers,
Ken

BEGIN;
CREATE TEMP TABLE sample_data (
client_id INTEGER,
start_date DATE,
end_date DATE,
episode INTEGER -- Just a label, for clarity
);

INSERT INTO sample_data VALUES
(1,'1990-01-01','1990-12-31',1),
(1,'1991-01-01','1991-12-31',1),

(1,'2000-01-01','2000-12-31',2),
(1,'2001-01-01','2001-12-31',2),
(1,'2002-01-01','2002-12-31',2),
(1,'2003-01-01','2003-12-31',2),
(1,'2004-01-01','2004-12-31',2),
(1,'2005-01-01','2005-12-31',2),
(1,'2006-01-01','2006-12-31',2),

(1,'2014-01-01','2014-12-31',3),
(1,'2015-01-01','2015-12-31',3),
(1,'2017-06-30','2017-12-31',4),
(1,'2018-01-01',NULL,4),

(2,'2014-02-01','2015-01-31',1),
(2,'2015-02-01','2015-12-31',1),
(2,'2017-09-30','2018-01-31',2),
(2,'2018-02-01','2018-02-14',2)
;

CREATE OR REPLACE FUNCTION
append_daterange(d1 daterange, d2 daterange)
RETURNS daterange
LANGUAGE sql
AS
$$
SELECT CASE WHEN d1 && d2 OR d1 -|- d2 THEN d1 + d2 ELSE d2 END;
$$;

CREATE AGGREGATE agg_daterange (daterange) (
sfunc = append_daterange,
stype = daterange
);

CREATE OR REPLACE FUNCTION range_continuous_merge( daterange, daterange,
date ) RETURNS daterange AS $$

DECLARE
res daterange;

BEGIN
res:=  CASE
WHEN $1 IS NULL AND NOT $2 @> $3 THEN NULL
WHEN $1 IS NULL AND $2 @> $3 THEN $2
WHEN ($1 && $2) OR ($1 -|- $2) THEN $1 + $2
WHEN NOT $1 @> $3 THEN $2
ELSE $1
END;
RAISE NOTICE 'Inputs: %,%,%.  Returning %',$1::text,$2::text,$3::text,res;
RETURN res;
END;
$$ LANGUAGE plpgsql STABLE;

CREATE AGGREGATE range_continuous( daterange, date ) (

sfunc =  range_continuous_merge,
stype = daterange
--  initcond = '{0,0,0}'

);

SELECT
client_id,episode,start_date,end_date,range_continuous(daterange(start_date,end_date,'[]'),start_date)
OVER (PARTITION by client_id) FROM sample_data ;

SELECT
client_id,episode,start_date,end_date,agg_daterange(daterange(start_date,end_date,'[]'))
OVER (PARTITION by client_id) FROM sample_data ;



-- RANGE_CONTINUOUS_MERGE

NOTICE:  Inputs: ,[1990-01-01,1991-01-01),1990-01-01.  Returning
[1990-01-01,1991-01-01)
NOTICE:  Inputs: