Going further: if "Flink provides no guarantees about the order of the elements within a window", then with mini-batch, which I assume uses a window under the hood, any aggregate that expects rows to arrive in order will fail to stay consistent. Is this correct?
On Tue, Jan 26, 2021 at 5:36 PM Rex Fenley <r...@remind101.com> wrote:

> Hello,
>
> We have a job from CDC to a large unbounded Flink plan to Elasticsearch.
>
> Currently, we have been relentlessly trying to reduce our record
> amplification which, when our Elasticsearch index is near fully populated,
> completely bottlenecks our write performance. We decided recently to try a
> new job using mini-batch. At first this seemed promising but at some point
> we began getting huge record amplification in a join operator. It appears
> that minibatch may only batch on aggregate operators?
>
> So we're now thinking that we should have a window before our ES sink
> which only takes the last record for any unique document id in the window,
> since that's all we really want to send anyway. However, when investigating
> turning a table, to a keyed window stream for deduping, and then back into
> a table I read the following:
>
> >Attention Flink provides no guarantees about the order of the elements
> within a window. This implies that although an evictor may remove elements
> from the beginning of the window, these are not necessarily the ones that
> arrive first or last. [1]
>
> which has put a damper on our investigation.
>
> I then found the deduplication SQL doc [2], but I have a hard time parsing
> what the SQL does and we've never used TemporaryViews or proctime before.
> Is this essentially what we want?
> Will just using this SQL be safe for a job that is unbounded and just
> wants to deduplicate a document write to whatever the most current one is
> (i.e. will restoring from a checkpoint maintain our unbounded consistency
> and will deletes work)?
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html
> [2]
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/queries.html#deduplication
>
> Thanks!
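
To make the "only the last record per document id" idea concrete, here is a minimal plain-Python sketch of the semantics we are after. It is not Flink API and says nothing about how Flink implements deduplication. The `Change` type, its `seq` field (assumed to be a monotonically increasing version carried on each CDC record), and `latest_per_doc` are all hypothetical names for illustration. The point is that picking the winner by an explicit ordering field, rather than by arrival position in the buffer, makes the result independent of the order in which elements appear within a window.

```python
from typing import Any, Dict, Iterable, List, NamedTuple


class Change(NamedTuple):
    doc_id: str  # hypothetical unique document id
    seq: int     # hypothetical increasing version number from the CDC source
    op: str      # "upsert" or "delete"
    body: Any    # document payload; None for deletes


def latest_per_doc(batch: Iterable[Change]) -> List[Change]:
    """Keep one Change per doc_id: the one with the highest seq."""
    latest: Dict[str, Change] = {}
    for change in batch:
        current = latest.get(change.doc_id)
        # Compare on seq, not on arrival position, so a late-arriving
        # older record can never overwrite a newer one.
        if current is None or change.seq > current.seq:
            latest[change.doc_id] = change
    # Sort by doc_id only to make the output deterministic.
    return sorted(latest.values(), key=lambda c: c.doc_id)


batch = [
    Change("a", 3, "upsert", {"v": 3}),
    Change("b", 1, "upsert", {"v": 1}),
    Change("a", 1, "upsert", {"v": 1}),  # older record arriving late
    Change("b", 2, "delete", None),      # a delete is just the latest change
]
result = latest_per_doc(batch)
```

With this input, one change per document survives: the `seq=3` upsert for `a` and the `seq=2` delete for `b`. Whether the deduplication SQL in [2] gives the same guarantee when ordered by proctime is exactly the open question above.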

-- 

Rex Fenley | Software Engineer - Mobile and Backend

Remind.com <https://www.remind.com/> | BLOG <http://blog.remind.com/> | FOLLOW US <https://twitter.com/remindhq> | LIKE US <https://www.facebook.com/remindhq>