On Sat, Dec 9, 2023 at 2:13 PM veem v <veema0...@gmail.com> wrote:
>
> Ron Johnson <ronljohnso...@gmail.com> wrote:
>
>> "OK" is relative, but it's what we did in a similar situation: two
>> years of data on-line and 5 years of data in compressed files in S3.
>> (We're required to keep data for 7 years, but they *never* ask for
>> records more than 2 years old.  If they ever do, we'll manually load
>> the relevant data back into PG.)
>>
>> (I can't imagine that querying billions of unindexed flat-file records
>> via S3 would be fast.)
>>
>> How often do end users look for data more than 90 days old?  Two years
>> old?
>>
>> How quickly does the old data need to be available?
>>
>> Maybe a four-tiered system of PG-Snowflake-S3-S3_Glacier would balance
>> speed and cost (or maybe Snowflake would just drive up costs).
>
> Thank you so much, Ron.
>
> When you say "two years of data on-line and 5 years of data in
> compressed files in S3", do you mean two years of data in Aurora
> Postgres (the OLTP database) and the rest dumped as-is into an S3
> bucket, and whenever we need to query that older data, we simply read
> the S3 files and load them back into Aurora Postgres?

(RDS PostgreSQL, not Aurora, but that's beside the point.)  Yes.  But keep reading...
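The reload itself is mostly mechanical on RDS.  A rough, untested sketch
(table and bucket names are invented; it assumes the aws_s3 extension
that ships with RDS for PostgreSQL and a plain-CSV archive):

    -- One-time: enable the S3 import/export helpers.
    CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

    -- Stage the archived month into its own table so the reload never
    -- touches the live transaction tables.
    CREATE TABLE txn_archive_2021_06 (LIKE transactions INCLUDING DEFAULTS);

    SELECT aws_s3.table_import_from_s3(
        'txn_archive_2021_06',             -- target table
        '',                                -- all columns, in file order
        '(FORMAT csv, HEADER true)',       -- ordinary COPY options
        aws_commons.create_s3_uri('my-archive-bucket',
                                  'transactions/2021/06.csv',
                                  'us-east-1'));

How long that takes obviously depends on how big the slice is, which is
one more reason to keep the archived files split by day or month.
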
> In the currently running Oracle Exadata system, we hold SIX months of
> data (~200 TB), and the transaction tables are all range partitioned on
> a daily basis.  Of that, roughly two months of data is queried
> frequently and the other four months much less often.  In the target
> architecture we want to build in the cloud, however, the analytics/data
> science team needs to query about three years of history; beyond three
> years we would need the data only rarely.
>
> We initially thought of having a single database serve both the OLTP
> and OLAP use cases (somewhat like Oracle Exadata does for us today),
> but it seems we don't have that option on AWS.  Postgres would serve
> the OLTP use case and Snowflake the OLAP use case.
>
> So do you suggest using both databases, i.e. streaming the last ~3
> months of transaction data to Aurora Postgres, keeping months 3 through
> year 3 in Snowflake for the OLAP/analytics use case, and keeping years
> 3 through 10 in S3 (as Parquet or Iceberg) so that Snowflake can query
> those files directly when needed?
>
> OR
>
> Do you suggest keeping only the last ~3 months in Aurora Postgres and
> everything else in Snowflake, given that it stores data compressed and
> its storage is cheap ($23 per TB per month)?
>
> A few colleagues are pointing to Databricks for the analytics use case.
> Is that a good option here?

I can't answer that without knowing what the end users actually need
(details, or just summaries of historical data, in different tiers).  You
all will have to do the cost:benefit analysis of the different
architectures.
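Whichever tier layout you land on, though, the aging mechanics on the
Postgres side stay simple if you keep the daily range partitioning you
already have.  Another rough, untested sketch (invented names; assumes
the aws_s3 extension available on RDS/Aurora PostgreSQL, and it writes
CSV -- converting to Parquet/Iceberg would be a separate step, e.g. a
Glue job, which I haven't shown):

    -- Parent table, range partitioned by day on the transaction timestamp.
    CREATE TABLE transactions (
        txn_id     bigint        NOT NULL,
        txn_ts     timestamptz   NOT NULL,
        account_id bigint        NOT NULL,
        amount     numeric(18,2) NOT NULL,
        PRIMARY KEY (txn_id, txn_ts)
    ) PARTITION BY RANGE (txn_ts);

    CREATE TABLE transactions_2023_12_09 PARTITION OF transactions
        FOR VALUES FROM ('2023-12-09') TO ('2023-12-10');

    -- Nightly job: copy the ~90-day-old partition to S3, then detach
    -- and drop it so the OLTP tier never grows past the window.
    SELECT * FROM aws_s3.query_export_to_s3(
        'SELECT * FROM transactions_2023_09_10',
        aws_commons.create_s3_uri('my-archive-bucket',
                                  'transactions/2023/09/10.csv',
                                  'us-east-1'),
        options := 'format csv');

    ALTER TABLE transactions DETACH PARTITION transactions_2023_09_10;
    DROP TABLE transactions_2023_09_10;

Dropping a detached partition is cheap, which is the whole point of
partitioning the hot tier by day.
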
> On Sat, 9 Dec 2023 at 16:43, veem v <veema0...@gmail.com> wrote:
>
>> Hello All,
>> Although this isn't strictly about open-source Postgres, I want to ask
>> it here to get the views of colleagues with decades of experience in
>> the database world.  We would like some guidance on whether the design
>> below looks okay for our customer use case.
>>
>> We currently have financial transaction data streaming to Oracle
>> Exadata (X9) on-premise.  This database supports processing of 400
>> million transactions per day.  A single transaction for us is a
>> combination of 7-8 inserts into different transaction tables, with
>> indexes, unique constraints, etc. defined on them.  Transactions are
>> processed/committed in batches (~1000 per batch), and the system
>> persists data for ~6 months.  We run all sorts of OLAP applications
>> (daily/monthly batch reports) on the same database, along with some
>> user-facing UI applications showing customer transactions.  So it
>> currently serves a hybrid workload and is a one-stop solution for all
>> use cases.
>>
>> Many of our applications are moving from on-premise to AWS as part of
>> a modernization journey, with AWS as the chosen cloud partner.  The
>> product is also expected to expand to more regions, so the system must
>> handle an increase in transaction volume.  We additionally have a
>> requirement to persist transaction data for ~10 years so it is
>> available for analytics/data science use cases.
>>
>> So the team is thinking of splitting the system into two parts:
>>
>> 1) An OLTP-type use case, in which we persist/write the transaction
>> data quickly and show it to the UI-related apps in near real time.
>> This database would store at most 60-90 days of transaction data.  Not
>> sure if there is an Oracle Exadata equivalent on AWS, so the team is
>> planning to use/experiment with Aurora Postgres.  Please correct me if
>> there are other options we should consider.
>>
>> 2) Data older than ~90 days would move to another database or to S3
>> object storage, be kept there for ~10 years, and remain queryable
>> through the necessary APIs.  That tier would serve the
>> OLAP/analytics/data science use cases.
>>
>> Is the above design okay?  Also, regarding the second point, i.e.
>> persisting the historical data in a queryable state: should we go for
>> a database like Snowflake, or just keep it in S3 as-is and make it
>> queryable through APIs?  Please advise.
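
One more note on the "keep it in S3 and query it there" option: if the
archive lands in S3 as Parquet, Snowflake can read it in place through an
external table over a stage, so the cold tier never has to be loaded at
all.  A sketch from memory (the integration, stage, and column names are
invented, and the syntax is worth checking against the Snowflake docs):

    -- Stage pointing at the archived Parquet files.
    CREATE OR REPLACE STAGE txn_archive_stage
        URL = 's3://my-archive-bucket/transactions/'
        STORAGE_INTEGRATION = my_s3_integration;

    -- External table: the data stays in S3 and is scanned at query time.
    CREATE OR REPLACE EXTERNAL TABLE txn_archive (
        txn_ts TIMESTAMP    AS (value:"txn_ts"::timestamp),
        amount NUMBER(18,2) AS (value:"amount"::number(18,2))
    )
    LOCATION = @txn_archive_stage
    FILE_FORMAT = (TYPE = PARQUET)
    AUTO_REFRESH = FALSE;

    SELECT date_trunc('month', txn_ts), sum(amount)
    FROM txn_archive
    GROUP BY 1;

Whether that is fast enough for the analytics team is exactly the kind of
question the cost:benefit exercise has to answer.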