We have been promoting the use of DoFn to write IO connectors for many reasons
including better composability. A common pattern that arrives in such IOs is
that a preceding transform prepares the specification element on split that a
subsequent DoFn uses to read the data. You can see an example of this on FileIO
[1] or in RedisIO [2]

The issue is that if we process that spec in the `@ProcessElement` method we
lose the DoFn lifecycle because we cannot establish a connection on `@Setup` and
close it in `@Teardown` because the spec is per element, so we end up re
creating connections which is a quite costly operation in some systems like
Cassandra/HBase/etc and that it could end up saturating the data store because
of the massive creation of connections (something that already happened in the
past with JdbcIO in the streaming case).

In the ongoing PR that transforms Cassandra to be DoFn based [3] this subject
appeared again, and we were discussing how to eventually reuse connections,
maybe by a pretty naive approach of saving a previous connection (or set of
identified connections) statically so it can be reused by multiple DoFns
instances. We already had some issues in the past because of creating many
connections on other IOs (JdbcIO) with streaming pipelines where databases were
swamped by massive amounts of connections, so reusing connections seems to be
something that matters, but at the moment we do not have a clear way to do this
better.

Anyone have better ideas or recommendations for this scenario?
Thanks in advance.

Ismaël

[1] 
https://github.com/apache/beam/blob/14085a5a3c0e146fcc13ca77515bd24abc255eda/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L260
[2] 
https://github.com/apache/beam/blob/14085a5a3c0e146fcc13ca77515bd24abc255eda/sdks/java/io/solr/src/main/java/org/apache/beam/sdk/io/solr/SolrIO.java#L471
[3] https://github.com/apache/beam/pull/10546

Reply via email to