Going further, if "Flink provides no guarantees about the order of the
elements within a window" then with minibatch, which I assume uses a window
under the hood, any aggregates that expect rows to arrive in order will
fail to keep their consistency. Is this correct?

On Tue, Jan 26, 2021 at 5:36 PM Rex Fenley <r...@remind101.com> wrote:

> Hello,
>
> We have a job from CDC to a large unbounded Flink plan to Elasticsearch.
>
> Currently, we have been relentlessly trying to reduce our record
> amplification which, when our Elasticsearch index is near fully populated,
> completely bottlenecks our write performance. We decided recently to try a
> new job using mini-batch. At first this seemed promising but at some point
> we began getting huge record amplification in a join operator. It appears
> that minibatch may only batch on aggregate operators?
>
> So we're now thinking that we should have a window before our ES sink
> which only takes the last record for any unique document id in the window,
> since that's all we really want to send anyway. However, when investigating
> turning a table, to a keyed window stream for deduping, and then back into
> a table I read the following:
>
> >Attention Flink provides no guarantees about the order of the elements
> within a window. This implies that although an evictor may remove elements
> from the beginning of the window, these are not necessarily the ones that
> arrive first or last. [1]
>
> which has put a damper on our investigation.
>
> I then found the deduplication SQL doc [2], but I have a hard time parsing
> what the SQL does and we've never used TemporaryViews or proctime before.
> Is this essentially what we want?
> Will just using this SQL be safe for a job that is unbounded and just
> wants to deduplicate a document write to whatever the most current one is
> (i.e. will restoring from a checkpoint maintain our unbounded consistency
> and will deletes work)?
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html
> [2]
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/queries.html#deduplication
>
> Thanks!
>
>
> --
>
> Rex Fenley  |  Software Engineer - Mobile and Backend
>
>
> Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>  |
>  FOLLOW US <https://twitter.com/remindhq>  |  LIKE US
> <https://www.facebook.com/remindhq>
>


-- 

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>
 |  FOLLOW
US <https://twitter.com/remindhq>  |  LIKE US
<https://www.facebook.com/remindhq>

Reply via email to