Your use case seems fine; success will likely depend more on the quality of
the IO connectors you use and/or how much time you're willing to invest in
filling in any gaps they may have.

You'll want to make sure that the connector has good initial splitting
logic so that large tenants are split more finely than small tenants.
Using a runner that supports dynamic splitting will help correct any
imbalances left over from initial splitting.
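
If you end up filling gaps yourself, that splitting logic typically lives
on a splittable DoFn. Below is a minimal sketch of where the hooks go;
TenantDescriptor, documentCount, and readDocument are hypothetical
placeholders (a real Elasticsearch read would track slice or scroll state
rather than a plain document offset range):

import java.io.Serializable;
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;

// Hypothetical descriptor naming one tenant's index.
class TenantDescriptor implements Serializable {
  String clusterUrl;
  String indexName;
  long documentCount; // assumed known or cheaply queryable
}

class ReadTenantFn extends DoFn<TenantDescriptor, String> {

  // Initial restriction: the whole index as a document offset range.
  @GetInitialRestriction
  public OffsetRange getInitialRestriction(@Element TenantDescriptor tenant) {
    return new OffsetRange(0, tenant.documentCount);
  }

  // Initial splitting: large tenants yield many sub-ranges, small tenants few.
  @SplitRestriction
  public void splitRestriction(
      @Restriction OffsetRange range, OutputReceiver<OffsetRange> out) {
    for (OffsetRange split : range.split(10_000, 1_000)) {
      out.output(split);
    }
  }

  // Dynamic splitting happens through the tracker while this runs.
  @ProcessElement
  public void processElement(
      @Element TenantDescriptor tenant,
      RestrictionTracker<OffsetRange, Long> tracker,
      OutputReceiver<String> out) {
    for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); ++i) {
      out.output(readDocument(tenant, i)); // hypothetical per-document read
    }
  }

  private String readDocument(TenantDescriptor tenant, long offset) {
    throw new UnsupportedOperationException("placeholder");
  }
}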

If each tenant is represented by one or more transforms, you may run into
issues where the pipeline graph is too large for some runners. In that case
you would want the connector to follow the readAll pattern seen in various
IO connectors: a single common transform whose input is a PCollection of
"source descriptors". This pattern scales to the data limits of the runner
being used (orders of magnitude larger than the graph size limit). The
readAll transform works best when powered internally by a splittable DoFn
so that initial splitting and dynamic splitting still happen. With the
readAll pattern, the "read all the locations of various tenant indices"
step moves inside a DoFn and happens when the pipeline executes.
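
For illustration, a readAll-style pipeline could look something like the
sketch below (reusing TenantDescriptor and ReadTenantFn from above).
FetchTenantDescriptorsFn and the registry location are hypothetical, and
coder setup is elided:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

public class MultiTenantReadPipeline {

  // Hypothetical: expands one registry location into per-tenant descriptors,
  // at pipeline execution time rather than at graph construction time.
  static class FetchTenantDescriptorsFn extends DoFn<String, TenantDescriptor> {
    @ProcessElement
    public void processElement(
        @Element String registryUrl, OutputReceiver<TenantDescriptor> out) {
      // e.g. query the centralized config store, out.output(...) per tenant
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    PCollection<TenantDescriptor> descriptors =
        p.apply(Create.of("http://config-service/tenants")) // hypothetical
            .apply(ParDo.of(new FetchTenantDescriptorsFn()));

    // One common transform no matter how many tenants, so the graph stays small.
    descriptors
        .apply(Reshuffle.viaRandomKey()) // break fusion so reads spread out
        .apply(ParDo.of(new ReadTenantFn()));

    p.run();
  }
}

Whether a tenant maps to one descriptor or many (e.g. per index or per
slice) is then just a question of what FetchTenantDescriptorsFn emits.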

On Mon, Jun 1, 2020 at 5:17 PM Rion Williams <[email protected]> wrote:

> Hi folks,
>
> I was talking with a colleague about a scenario he was facing and we were
> exploring the idea of Beam as a possible solution. I thought I’d reach out
> to the audience here to get their opinions.
>
> Basically, we have a series of single tenant Elasticsearch indices that we
> are attempting to read from, transform, and ultimately send to a Kafka
> topic to be consumed by some downstream multi-tenant Beam pipeline.
>
> The current working thoughts are something to the effect of:
> - Read all of the locations of the various tenant indices from a
> centralized location.
> - Construct the appropriate transforms (perhaps just via a
> `configurations.forEach()` or some other pattern)
> - Apply the transforms against the incoming data from Elastic (should be
> uniformly constrained in terms of schema)
> - Send to a Kafka topic with the tenant identifier as part of the
> transform process
>
> Does this seem like something Beam would be suitable for? These indices
> could be quite large and updated frequently (depending on the tenant), so I
> don’t know if watermarking should be a concern: specifically, watermarking
> per tenant to avoid ingesting the same data and, in case the services go
> down, being able to resume without losing data. I’m not aware
> of Beam or the Elastic connector having this notion, but I haven’t explored
> it in-depth.
>
> Additionally, there are scaling concerns (e.g. if one particular tenant
> has a large amount of volume and others have very little, are there
> mechanisms for handling that?). What if there were thousands of tenants?
> Could a single pipeline effectively handle that kind of volume?
>
> Any thoughts or advice would be appreciated. I’d love to have the
> confidence to use Beam for this, but if it’s not the right tool I don’t
> want to fight it more than necessary.
>
> Thanks,
>
> Rion
