nickva commented on code in PR #5603:
URL: https://github.com/apache/couchdb/pull/5603#discussion_r2323536825
##########
src/couch/src/couch_time_seq.erl:
##########
@@ -0,0 +1,301 @@
+% Licensed under the Apache License, Version 2.0 (the "License"); you may not
+% use this file except in compliance with the License. You may obtain a copy of
+% the License at
+%
+% http://www.apache.org/licenses/LICENSE-2.0
+%
+% Unless required by applicable law or agreed to in writing, software
+% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+% License for the specific language governing permissions and limitations under
+% the License.
+
+% This module implements exponentially decaying time intervals which map to
+% database update sequences. The idea is to be able to quickly determine the
+% set of changes which occurred in a rough time interval. The closer to the
+% last update time, the higher the accuracy. For instance, going back a few
+% days, it's only possible to target individual days. Then weeks, then months,
+% then going back years can target only years.
+%
+% An example of the shape of the data structure might be:
+%
+%   +---------+     +---------+     +--------+
+%   | seq=986 |     | seq=891 |     | seq=19 |
+%   +---------+     +---------+     +--------+
+%        ^               ^               ^
+%        |               |               |
+%       t=42            t=40            t=37
+%
+% The head, on the left, points to the newest (most recently updated) time bin.
+% In this example it started at t=42. The last bin, t=37, is the oldest time bin.
+%
+% If we're looking for sequences starting before t=41, we'd pick seq=891 and
+% run the changes feed with since=891. If we're looking for a sequence starting
+% before t=37, we'd start with since=0. The main idea here is that we'd rather
+% err on the side of returning too many rows than not enough. This is how our
+% clustered changes feeds already behave during rewinds, so it's not really new
+% behavior.
+%
+% The number of bins is always less than or equal to ?MAX_BIN_COUNT. During
+% updates, when we're forced to move to a new bin, the bins are rolled up to
+% make some room before adding the latest update.
+%
+% During the roll-up, multiple old bins might end up merging into a single bin
+% with a larger width. For example, the two oldest bins above might merge, and
+% the bins might then look like:
+%
+%   +---------+     +---------+
+%   | seq=986 |     | seq=891 |
+%   +---------+     +---------+
+%        ^               ^
+%        |               |
+%       t=42            t=37
+%
+% If we're now looking for sequences starting before t=40, we'd pick seq=891.
+%
+% The roll-up algorithm can be summarized as:
+%
+%  - As a preparatory step: tag all the bins with their size, and reverse them
+%    to put them in chronological order (oldest first).
+%
+%  - For each interval in ?INTERVALS:
+%
+%    * For each bin in the list of bins before the (now - next interval) time:
+%
+%      + If the durations of two adjacent bins are =< interval, merge the
+%        newer one into the older one.
+%
+%    * If we merged any bins for the given interval, return.
+%
+%    * If we didn't merge any bins, continue with a longer interval.
+%
+%    * Once we get to just one interval remaining, and we didn't succeed in
+%      merging anything using the interval schedule in ?INTERVALS, simply merge
+%      half the bins in the oldest part of the interval. This is guaranteed to
+%      make more room.
+%
+
+-module(couch_time_seq).
+
+-export([
+    new/0,
+    new/1,
+    since/2,
+    update/3,
+    histogram/2,
+    timestamp/0
+]).
+
+-export_type([
+    time_seq/0
+]).
+
+% How bin count = 60 was picked:
+%
+% - With the ?INTERVALS schedule defined below, ran 1 update per hour for 1M
+%   updates, starting at year 3000 and ending at year 3114, and obtained:
+%   * Less than 10 years of spacing between years at the start: 3000, 3009, 3018 ...
+%   * Ten individual latest days: 3114-01-20 -> 3114-01-30
+%   * Seven individual latest months: 3113-07 -> 3114-01
+% - Uncompressed term_to_binary(TSeq) = 920B
+% - RAM flat size erts_debug:flat_size(TSeq) * erlang:system_info(wordsize) = 2KB
+%
+-define(MAX_BIN_COUNT, 60).
+
+-define(H, 3600).
+-define(D, ?H * 24).
+-define(M, ?D * 30).
+-define(Y, ?D * 365).
+%% erlfmt-ignore
+-define(INTERVALS, [

Review Comment:
> sure, I get that, but for example we can't do "3 months" and the ability to do "16 years" is a bit outlandish to me.

Yes, agree that per-db state is tricky (and the security object has been a bad precedent since clustering).

"3 months" already works; I ran a simulation with 1 update per hour for 1 million hours:

* It can resolve the latest 11 days individually: 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20; then it can do another 10 or so, but skipping one or two days in between.
* It can resolve the latest 7 months individually: 01, 12, 11, 10, 09, 08, 07; then it can do 3 or 4 more months, skipping one or two in between.
* That leaves less resolution for the years: about 3 individual years, then 4 more, skipping 2 or 3 in between. After that it switches to decades and such, which probably doesn't matter as much.

```
3000-01-01T00:00:00Z -> 82176
3009-05-18T00:00:00Z -> 83712
3018-12-05T00:00:00Z -> 85584
3028-09-09T00:00:00Z -> 82416
3038-02-03T00:00:00Z -> 85488
3047-11-05T00:00:00Z -> 82704
3057-04-12T00:00:00Z -> 82944
3066-09-28T00:00:00Z -> 85872
3076-07-15T00:00:00Z -> 83520
3086-01-24T00:00:00Z -> 41472
3090-10-18T00:00:00Z -> 41472
3095-07-12T00:00:00Z -> 41760
3100-04-17T00:00:00Z -> 41472
3105-01-09T00:00:00Z -> 20736
3107-05-23T00:00:00Z -> 20736
3109-10-03T00:00:00Z -> 10368
3110-12-09T00:00:00Z -> 10368
3112-02-14T00:00:00Z -> 5184
3112-09-17T00:00:00Z -> 3456
3113-02-08T00:00:00Z -> 864
3113-03-16T00:00:00Z -> 864
3113-04-21T00:00:00Z -> 864
3113-05-27T00:00:00Z -> 864
3113-07-02T00:00:00Z -> 864
3113-08-07T00:00:00Z -> 288
3113-08-19T00:00:00Z -> 288
3113-08-31T00:00:00Z -> 288
3113-09-12T00:00:00Z -> 288
3113-09-24T00:00:00Z -> 288
3113-10-06T00:00:00Z -> 288
3113-10-18T00:00:00Z -> 288
3113-10-30T00:00:00Z -> 288
3113-11-11T00:00:00Z -> 288
3113-11-23T00:00:00Z -> 288
3113-12-05T00:00:00Z -> 288
3113-12-17T00:00:00Z -> 288
3113-12-29T00:00:00Z -> 96
3114-01-02T00:00:00Z -> 48
3114-01-04T00:00:00Z -> 48
3114-01-06T00:00:00Z -> 48
3114-01-08T00:00:00Z -> 48
3114-01-10T00:00:00Z -> 48
3114-01-12T00:00:00Z -> 48
3114-01-14T00:00:00Z -> 48
3114-01-16T00:00:00Z -> 48
3114-01-18T00:00:00Z -> 48
3114-01-20T00:00:00Z -> 24
3114-01-21T00:00:00Z -> 24
3114-01-22T00:00:00Z -> 24
3114-01-23T00:00:00Z -> 24
3114-01-24T00:00:00Z -> 24
3114-01-25T00:00:00Z -> 24
3114-01-26T00:00:00Z -> 24
3114-01-27T00:00:00Z -> 24
3114-01-28T00:00:00Z -> 24
3114-01-29T00:00:00Z -> 24
3114-01-30T00:00:00Z -> 6
3114-01-30T06:00:00Z -> 6
3114-01-30T12:00:00Z -> 3
3114-01-30T15:00:00Z -> 1
```
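To make the module comment above a bit more concrete, here is a minimal sketch of the lookup idea behind `since/2`. It is not the actual implementation: the `[{Time, Seq}]` bin representation (newest bin first, as in the diagram) and the argument order are assumptions for illustration only.

```erlang
%% Sketch only: bins are assumed to be [{Time, Seq}] pairs, newest first,
%% e.g. [{42, 986}, {40, 891}, {37, 19}] for the diagram in the module comment.
-module(time_seq_lookup_sketch).
-export([since/2]).

%% Return a safe update sequence to start a changes feed from, for changes
%% that happened at or after time T. Errs on the side of returning too many
%% rows rather than too few.
since(T, [{Time, _Seq} | Rest]) when Time >= T ->
    % This bin started at or after T, so changes just before its start may
    % still be wanted; keep walking toward older bins.
    since(T, Rest);
since(_T, [{_Time, Seq} | _Rest]) ->
    % First bin that started strictly before T: its sequence is a safe,
    % possibly over-inclusive, starting point (e.g. T=41 picks seq=891).
    Seq;
since(_T, []) ->
    % T predates the oldest bin (e.g. T=37 in the diagram): rewind to 0.
    0.
```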
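Similarly, a rough sketch of how a simulation like the one above could be driven with the exported API shown in the diff (`new/0`, `update/3`, `histogram/2`); the argument order and meanings assumed for `update/3` and `histogram/2` are guesses, not the confirmed signatures:

```erlang
%% Sketch only: assumes update(TSeq, NowSec, Seq) folds one update at Unix
%% time NowSec with sequence Seq into the structure, and histogram(TSeq, NowSec)
%% renders a time -> update-count table like the one above.
-module(time_seq_sim_sketch).
-export([run/0]).

run() ->
    Start = calendar:rfc3339_to_system_time("3000-01-01T00:00:00Z"),
    Hours = 1000000,  % 1M updates, one per hour
    Final = lists:foldl(
        fun(H, TSeq) ->
            couch_time_seq:update(TSeq, Start + H * 3600, H)
        end,
        couch_time_seq:new(),
        lists:seq(1, Hours)
    ),
    couch_time_seq:histogram(Final, Start + Hours * 3600).
```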
