On Sat, Dec 9, 2023 at 2:13 PM veem v <veema0...@gmail.com> wrote:
>
> Ron Johnson <ronljohnso...@gmail.com> wrote:
>
>> "OK" is relative, but it's what we did in a similar situation: two
>> years of data on-line and 5 years of data in compressed files in S3.
>> (We're required to keep data for 7 years, but they *never* ask for
>> records more than 2 years old.  If they ever do, we'll manually load
>> the relevant data back into PG.)
>>
>> (I can't imagine that querying billions of unindexed flat-file records
>> via S3 would be fast.)
>>
>> How often do end users look for data more than 90 days old?  Two years
>> old?
>>
>> How quickly does the old data need to be available?
>>
>> Maybe a four-tiered system of PG-Snowflake-S3-S3_Glacier would balance
>> speed and cost (or maybe Snowflake would just drive up costs).
>
> Thank you so much, Ron.
>
> When you say "two years of data on-line and 5 years of data in
> compressed files in S3", do you mean two years of data in Aurora
> Postgres (the OLTP database) and the rest dumped as-is into an S3
> bucket, and whenever we need to query that older data, we simply read
> the S3 files and load them back into Aurora Postgres?

(RDS PostgreSQL, not Aurora, but that's beside the point.)  Yes.  But keep reading...
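The reload itself is mostly mechanical on RDS.  A rough, untested sketch
(table and bucket names are invented; it assumes the aws_s3 extension
that ships with RDS for PostgreSQL and a plain-CSV archive):

    -- One-time: enable the S3 import/export helpers.
    CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

    -- Stage the archived month into its own table so the reload never
    -- touches the live transaction tables.
    CREATE TABLE txn_archive_2021_06 (LIKE transactions INCLUDING DEFAULTS);

    SELECT aws_s3.table_import_from_s3(
        'txn_archive_2021_06',             -- target table
        '',                                -- all columns, in file order
        '(FORMAT csv, HEADER true)',       -- ordinary COPY options
        aws_commons.create_s3_uri('my-archive-bucket',
                                  'transactions/2021/06.csv',
                                  'us-east-1'));

How long that takes obviously depends on how big the slice is, which is
one more reason to keep the archived files split by day or month.
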
> In the currently running Oracle Exadata system, we hold SIX months of
> data (~200 TB), and the transaction tables are all range partitioned on
> a daily basis.  Of that, roughly two months of data is queried
> frequently and the other four months much less often.  In the target
> architecture we want to build in the cloud, however, the analytics/data
> science team needs to query about three years of history; beyond three
> years we would need the data only rarely.
>
> We initially thought of having a single database serve both the OLTP
> and OLAP use cases (somewhat like Oracle Exadata does for us today),
> but it seems we don't have that option on AWS.  Postgres would serve
> the OLTP use case and Snowflake the OLAP use case.
>
> So do you suggest using both databases, i.e. streaming the last ~3
> months of transaction data to Aurora Postgres, keeping months 3 through
> year 3 in Snowflake for the OLAP/analytics use case, and keeping years
> 3 through 10 in S3 (as Parquet or Iceberg) so that Snowflake can query
> those files directly when needed?
>
> OR
>
> Do you suggest keeping only the last ~3 months in Aurora Postgres and
> everything else in Snowflake, given that it stores data compressed and
> its storage is cheap ($23 per TB per month)?
>
> A few colleagues are pointing to Databricks for the analytics use case.
> Is that a good option here?

I can't answer that without knowing what the end users actually need
(details, or just summaries of historical data, in different tiers).  You
all will have to do the cost:benefit analysis of the different
architectures.
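Whichever tier layout you land on, though, the aging mechanics on the
Postgres side stay simple if you keep the daily range partitioning you
already have.  Another rough, untested sketch (invented names; assumes
the aws_s3 extension available on RDS/Aurora PostgreSQL, and it writes
CSV -- converting to Parquet/Iceberg would be a separate step, e.g. a
Glue job, which I haven't shown):

    -- Parent table, range partitioned by day on the transaction timestamp.
    CREATE TABLE transactions (
        txn_id     bigint        NOT NULL,
        txn_ts     timestamptz   NOT NULL,
        account_id bigint        NOT NULL,
        amount     numeric(18,2) NOT NULL,
        PRIMARY KEY (txn_id, txn_ts)
    ) PARTITION BY RANGE (txn_ts);

    CREATE TABLE transactions_2023_12_09 PARTITION OF transactions
        FOR VALUES FROM ('2023-12-09') TO ('2023-12-10');

    -- Nightly job: copy the ~90-day-old partition to S3, then detach
    -- and drop it so the OLTP tier never grows past the window.
    SELECT * FROM aws_s3.query_export_to_s3(
        'SELECT * FROM transactions_2023_09_10',
        aws_commons.create_s3_uri('my-archive-bucket',
                                  'transactions/2023/09/10.csv',
                                  'us-east-1'),
        options := 'format csv');

    ALTER TABLE transactions DETACH PARTITION transactions_2023_09_10;
    DROP TABLE transactions_2023_09_10;

Dropping a detached partition is cheap, which is the whole point of
partitioning the hot tier by day.
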
> On Sat, 9 Dec 2023 at 16:43, veem v <veema0...@gmail.com> wrote:
>
>> Hello All,
>> Although this isn't strictly about open-source Postgres, I want to ask
>> it here to get the views of colleagues with decades of experience in
>> the database world.  We would like some guidance on whether the design
>> below looks okay for our customer use case.
>>
>> We currently have financial transaction data streaming to Oracle
>> Exadata (X9) on-premise.  This database supports processing of 400
>> million transactions per day.  A single transaction for us is a
>> combination of 7-8 inserts into different transaction tables, with
>> indexes, unique constraints, etc. defined on them.  Transactions are
>> processed/committed in batches (~1000 per batch), and the system
>> persists data for ~6 months.  We run all sorts of OLAP applications
>> (daily/monthly batch reports) on the same database, along with some
>> user-facing UI applications showing customer transactions.  So it
>> currently serves a hybrid workload and is a one-stop solution for all
>> use cases.
>>
>> Many of our applications are moving from on-premise to AWS as part of
>> a modernization journey, with AWS as the chosen cloud partner.  The
>> product is also expected to expand to more regions, so the system must
>> handle an increase in transaction volume.  We additionally have a
>> requirement to persist transaction data for ~10 years so it is
>> available for analytics/data science use cases.
>>
>> So the team is thinking of splitting the system into two parts:
>>
>> 1) An OLTP-type use case, in which we persist/write the transaction
>> data quickly and show it to the UI-related apps in near real time.
>> This database would store at most 60-90 days of transaction data.  Not
>> sure if there is an Oracle Exadata equivalent on AWS, so the team is
>> planning to use/experiment with Aurora Postgres.  Please correct me if
>> there are other options we should consider.
>>
>> 2) Data older than ~90 days would move to another database or to S3
>> object storage, be kept there for ~10 years, and remain queryable
>> through the necessary APIs.  That tier would serve the
>> OLAP/analytics/data science use cases.
>>
>> Is the above design okay?  Also, regarding the second point, i.e.
>> persisting the historical data in a queryable state: should we go for
>> a database like Snowflake, or just keep it in S3 as-is and make it
>> queryable through APIs?  Please advise.
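
One more note on the "keep it in S3 and query it there" option: if the
archive lands in S3 as Parquet, Snowflake can read it in place through an
external table over a stage, so the cold tier never has to be loaded at
all.  A sketch from memory (the integration, stage, and column names are
invented, and the syntax is worth checking against the Snowflake docs):

    -- Stage pointing at the archived Parquet files.
    CREATE OR REPLACE STAGE txn_archive_stage
        URL = 's3://my-archive-bucket/transactions/'
        STORAGE_INTEGRATION = my_s3_integration;

    -- External table: the data stays in S3 and is scanned at query time.
    CREATE OR REPLACE EXTERNAL TABLE txn_archive (
        txn_ts TIMESTAMP    AS (value:"txn_ts"::timestamp),
        amount NUMBER(18,2) AS (value:"amount"::number(18,2))
    )
    LOCATION = @txn_archive_stage
    FILE_FORMAT = (TYPE = PARQUET)
    AUTO_REFRESH = FALSE;

    SELECT date_trunc('month', txn_ts), sum(amount)
    FROM txn_archive
    GROUP BY 1;

Whether that is fast enough for the analytics team is exactly the kind of
question the cost:benefit exercise has to answer.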