Yes. I have seen things like this before.

Typically, if you have short time-to-visibility requirements, some kind of
database is required. If you have large data and long retention
requirements, it can be advantageous to roll the data out to a compressed
columnar format like parquet.

The design that I have suggested in the past is to use parquet for deep
history and a MapR DB table for short-term history. This data is accessed
via a view that selects from the deep history if the date is before a
cutoff and from the table if it is on or after. The cutoff can be injected
into your query or you can have a reference table of cutoffs.

The view looks something like this (don't expect zero errors here, btw):

  with
    cutoff as (
      select min(t) as t from `cutoff.db` where tag = 'production'),
    old_data as (
      select archive.* from `archive.parquet` archive, cutoff
      where archive.t < cutoff.t),
    new_data as (
      select cur.* from `current.db` cur, cutoff
      where cur.t >= cutoff.t)
  select * from old_data
  union all
  select * from new_data
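
To make that easier to consume from the reporting side, the same query can
be wrapped in a Drill view. A minimal sketch, assuming the stock writable
dfs.tmp workspace and a view name of `scores` (both just placeholders;
point it at whatever writable workspace you actually use):

  create or replace view dfs.tmp.`scores` as
    with cutoff as (
      select min(t) as t from `cutoff.db` where tag = 'production')
    select a.* from `archive.parquet` a, cutoff where a.t < cutoff.t
    union all
    select c.* from `current.db` c, cutoff where c.t >= cutoff.t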


When enough data gets collected to make a new parquet file, you would
insert a new row into cutoff with a tag like "pending" and use something
like this to create your new parquet file:

    create table `new_archive` as  -- placeholder name; use whatever fits your layout
        select cur.*
        from `current.db` cur,
             `cutoff.db` c1,
             `cutoff.db` c2
        where c1.tag = 'production'
              and c2.tag = 'pending'
              and cur.t >= c1.t and cur.t < c2.t
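
Two session settings matter when you run that CTAS in Drill: where the new
table lands and the format it is written in. A rough sketch (dfs.tmp is just
the stock writable workspace, used here as a stand-in for wherever you
actually keep your archive):

    use dfs.tmp;
    alter session set `store.format` = 'parquet';
    -- now run the create table ... as select above; the parquet files end
    -- up in a directory named after the table under the workspace location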

When this is done, inspect the data using a view that lets you use the new
cutoff tag. When you are happy, update the cutoff table to set the tag on
the new cutoff to "production" and change the old production row to have a
tag that reflects an archived state (so that you can look back at old
cutoffs if you need to). The view will atomically reflect the change when
the old row gets a new tag (because of the min operation in the view). When
you are certain you don't need to roll back, you can delete the old cutoff
and all the old rows from the current table that are before the old cutoff.
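
As a cheap sanity check during that inspection step, you can compare row
counts between the new parquet data and the slice of the live table it is
meant to replace. A sketch, reusing the hypothetical dfs.tmp.`new_archive`
name from the CTAS above:

  select 'parquet' as src, count(*) as n from dfs.tmp.`new_archive`
  union all
  select 'live' as src, count(*) as n
  from `current.db` cur, `cutoff.db` c1, `cutoff.db` c2
  where c1.tag = 'production' and c2.tag = 'pending'
        and cur.t >= c1.t and cur.t < c2.t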

This is great if you have billions and billions of rows.

But if you only have thousands or millions, I would recommend simply using
a single database table.




On Thu, Feb 27, 2020 at 11:37 AM Dobes Vandermeer <[email protected]> wrote:

> Hi,
>
> I am trying to figure out a system that can offer low latency in
> generating reports and low latency between data being collected and being
> available in the reporting system, while avoiding glitches and errors.
>
> In our system users are collecting responses from their students.  We want
> to generate reports showing student scores over time, and the scores should
> typically be available within a minute of being collected.
>
> I am looking to store the data tables in parquet on S3 and query them
> using drill.
>
> However, updating parquet files can be a bit troublesome.  The files
> cannot easily be appended to.  So some process has to periodically re-write
> the parquet files.  Also, we don't want to have hundreds or thousands of
> separate files, as this can slow down query execution.  So we don't want to
> end up with a new file every 10 seconds.
>
> What I have been thinking is to have a process that runs which writes
> changes fairly frequently to small new files and another process that rolls
> up those small files into progressively larger ones as they get older.
>
> When querying the data I will have to de-duplicate and keep only the most
> recent version of each record, which I think is possible using window
> functions.  Thus the file aggregation process might not have to worry about
> having the exact same row in two files temporarily.
>
> I'm wondering if anyone has gone down this road before and has insights to
> share about it.
>
>
