nickva opened a new pull request, #5603:
URL: https://github.com/apache/couchdb/pull/5603
This started as an experiment, wondering if we could have a simple data
structure to track rough time intervals to db sequence mappings. This is to let
users get an idea of what changes happened in a rough timeframe. It should be
something on the order of days, months, or years. Nothing too exact. For example, it
would be nice to be able to say `_changes?since=$lastweek`.
The data structure itself, called `time-seq` below, is a fixed-size
list of up to 50 integer key-value pairs, mapping time bins to db sequences.
The structure is small enough (~500B) when serialized that it can fit well
under the 4KB header size, yet it can represent exponentially decaying time
intervals over two decades. This is a trade-off of having a small, fixed size:
the further back in time we go, the lower the accuracy. However, this decaying
behavior is often how most people look at time: when we talk about yesterday,
we may refer to individual hours; when we talk about last month, we may only
talk about individual days; when talking about years, we may care about months
or quarters only.
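To make the decaying-resolution idea concrete, here is a minimal Python sketch of a fixed-size time-to-sequence list. It is not the algorithm from this PR (the first commit message describes the real one); the names `MAX_BINS`, `update` and `since`, and the "thin the older half" compaction rule, are assumptions chosen only for illustration.

```python
HOUR = 3600      # bins are whole hours, as described in the PR
MAX_BINS = 50    # fixed upper bound on the number of entries


def round_to_hour(ts):
    """Round a Unix timestamp (seconds) down to the start of its hour."""
    return ts - ts % HOUR


def update(bins, ts, seq):
    """Return a new list of (hour_start, seq) pairs, oldest first.

    Hypothetical rule: when the list grows past MAX_BINS, keep every other
    entry in the older half, so older history gets coarser over time.
    """
    t = round_to_hour(ts)
    if bins and t <= bins[-1][0]:
        # Same hour bin, or time appears to go backwards: leave as is.
        return bins
    bins = bins + [(t, seq)]
    if len(bins) > MAX_BINS:
        half = len(bins) // 2
        bins = bins[:half:2] + bins[half:]
    return bins


def since(bins, ts):
    """Return the seq of the newest bin at or before ts, or 0 if none."""
    t = round_to_hour(ts)
    result = 0
    for bin_time, seq in bins:
        if bin_time > t:
            break
        result = seq
    return result
```

Feeding this one update per hour for 200 hours keeps the list at 50 entries or fewer, with hour-wide gaps between the newest bins and wider gaps between the oldest ones.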
The implementation itself and the tests, including property tests written by
@iilyak (thank you!) are in the first commit. The commit comment has more
implementation details.
Another unexpected benefit of using a small data structure that fits inside the
header, along with a bit of luck, is that we can implement this feature so it's
downgrade safe. This can be accomplished by reusing a long-unused db header
field. This way, if the user upgrades and then downgrades, the older code doesn't
look at or use that field, so any new data structures there will be ignored. With
this trick we can avoid having to issue a new intermediate downgrade target
release. The addition of the time-seq data structure to the header is in the second
commit. That second commit also implements how the structure is updated: that
happens in couch_db_updater only on commit.
Since we're dealing with OS-defined time values, this is not a perfect
solution. On some systems time could jump backward after a boot, or it may
misbehave in other ways. There are a few ways to mitigate that:
* Do not accept changes that appear to happen back in time. Those can be
safely ignored and we'll start updating the time bins when the time finally
catches up.
* Do not accept time values lower than some minimum configurable value.
Users who know what their embedded system may do after boot (for example, jump back
to 1970 until NTP kicks in) may set this minimum threshold to, say,
1971. By default, we simply set it to a recent time when this feature was
enabled.
* Always allow a user to inspect and reset any time-seq structure without
having to recreate dbs or lose data. This is accomplished by adding two
helper db APIs: `GET $db/_time_seq` and `DELETE $db/_time_seq`. Users can call
them if they detect that something
unexpected happened with the time sync (say the year jumps to 2100 for a while).
* Time bins are rounded to whole hours. We do not need second-level
or even minute-level accuracy there. Even if the accuracy is off by days and
the user knows that (say, by inspecting their couch logs, which also emit
timestamps based on the same OS timer), they may choose to only use since values
that are longer than whole days.
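The timestamp checks from the list above can be sketched roughly as follows in Python. This is only an illustration, not the code from this PR: the function name `accept_time`, its arguments, and the `MIN_TIME` value are all hypothetical.

```python
from datetime import datetime, timezone

HOUR = 3600
# Hypothetical minimum threshold; the real default and its config key
# are not stated in this description.
MIN_TIME = int(datetime(2025, 1, 1, tzinfo=timezone.utc).timestamp())


def accept_time(last_bin, now, min_time=MIN_TIME):
    """Return the hour bin for `now`, or None if it should be ignored.

    last_bin: start of the newest bin recorded so far, or None.
    now:      current OS time as a Unix timestamp in seconds.
    """
    if now < min_time:
        # Clock is implausibly early, e.g. stuck at 1970 before NTP sync.
        return None
    hour_bin = now - now % HOUR  # round down to a whole hour
    if last_bin is not None and hour_bin < last_bin:
        # Time appears to have jumped backward: skip until it catches up.
        return None
    return hour_bin
```

With these checks, a clock that reads 1970 after boot, or one that briefly jumps back a few hours, simply produces no new bin until the time is plausible again.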
The 3rd commit implements the new `$db/_time_seq` calls and the general
fabric-level integration of the new feature.
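As a rough client-side illustration, the two calls could be prepared from Python like this. The base URL and database name are placeholders, the helper `time_seq_requests` is hypothetical, and the response bodies and any required authentication are not described here, so the sketch only builds the requests.

```python
from urllib.request import Request


def time_seq_requests(base_url, db):
    """Build (GET, DELETE) requests for a db's _time_seq endpoint."""
    url = f"{base_url}/{db}/_time_seq"
    inspect_req = Request(url, method="GET")    # inspect the structure
    reset_req = Request(url, method="DELETE")   # reset the structure
    return inspect_req, reset_req
```

Either request can then be sent with `urllib.request.urlopen(req)` against a running node.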
Finally, after all that, the `_changes?since=YYYY-MM-DDTHH:MM:SSZ` streaming
is implemented in the last commit. Due to all the preparatory steps, the last
commit is pretty simple. We essentially handle it like the `now` value
for descending changes feeds.
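For example, a client wanting "changes since last week" might build the request URL along these lines. This is a sketch only: the helper name `changes_since_url`, the base URL and the database name are placeholders, and `when` is assumed to be a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def changes_since_url(base_url, db, when):
    """Build a _changes URL with since=YYYY-MM-DDTHH:MM:SSZ (UTC)."""
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # safe=":" keeps the colons readable instead of percent-encoding them.
    query = urlencode({"since": stamp}, safe=":")
    return f"{base_url}/{db}/_changes?{query}"


# Usage: one week before the current time.
last_week = datetime.now(timezone.utc) - timedelta(days=7)
url = changes_since_url("http://localhost:5984", "mydb", last_week)
```

Since bins are rounded to whole hours and get coarser further back, the feed may start somewhat earlier than the exact timestamp requested.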