Deduplicating record amplification

Rex Fenley Tue, 26 Jan 2021 17:37:05 -0800

Hello,

We have a job from CDC to a large unbounded Flink plan to Elasticsearch.


Currently, we have been relentlessly trying to reduce our record
amplification which, when our Elasticsearch index is near fully populated,
completely bottlenecks our write performance. We decided recently to try a
new job using mini-batch. At first this seemed promising but at some point
we began getting huge record amplification in a join operator. It appears
that minibatch may only batch on aggregate operators?

So we're now thinking that we should have a window before our ES sink which
only takes the last record for any unique document id in the window, since
that's all we really want to send anyway. However, when investigating
turning a table, to a keyed window stream for deduping, and then back into
a table I read the following:

>Attention Flink provides no guarantees about the order of the elements
within a window. This implies that although an evictor may remove elements
from the beginning of the window, these are not necessarily the ones that
arrive first or last. [1]

which has put a damper on our investigation.

I then found the deduplication SQL doc [2], but I have a hard time parsing
what the SQL does and we've never used TemporaryViews or proctime before.
Is this essentially what we want?
Will just using this SQL be safe for a job that is unbounded and just wants
to deduplicate a document write to whatever the most current one is (i.e.
will restoring from a checkpoint maintain our unbounded consistency and
will deletes work)?

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/queries.html#deduplication

Thanks!


-- 

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>
 |  FOLLOW
US <https://twitter.com/remindhq>  |  LIKE US
<https://www.facebook.com/remindhq>

Deduplicating record amplification

Reply via email to