From the Spark runner's point of view, implementing the KafkaIO (for example) means defining the "settings" required to read from Kafka, and from a quick look at the SDK's KafkaIO, it looks like it could be used instead of the runner's implementation (if not now, then probably once Spark supports the Kafka 0.9 connector API).
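For illustration, here is a minimal sketch of what such a runner-agnostic read through the SDK's KafkaIO might look like (the method names follow the example quoted below and are assumptions, not a confirmed API):

    // Sketch only: reading via the SDK's KafkaIO instead of the
    // spark-streaming Kafka connector. Method names follow the example
    // quoted below; treat them as assumptions.
    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline pipeline = Pipeline.create(options);
    pipeline.apply(KafkaIO.read()
        .withBootstrapServers("broker1:9092,broker2:9092")
        .withTopics(Collections.singletonList("events")));
    pipeline.run();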
As for the bigger picture here, as far as I can see, IO *translation* will always be runner-specific, because an IO either creates whatever PCollections represent from an external source, or outputs whatever PCollections represent to an external sink. Back to the IOs themselves, the SDK should allow the runner to extend its implementation of the IO if and where needed, so if the KafkaIO is missing Kafka Encoder/Decoder serializer settings, the runner could just add those. Does this make sense?

On Thu, Apr 28, 2016 at 3:45 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi all,
>
> Regarding the recent threads on the mailing list, I would like to start
> a formal discussion around the IOs.
> As we can expect the first contributions in this area (I already have
> some work in progress around this ;)), I think it's a fair discussion
> to have.
>
> Now, we have two kinds of IO: the ones "generic" to Beam, and the ones
> "local" to the runners.
>
> For example, let's take Kafka: we have the KafkaIO (in IO), and, for
> instance, we have the spark-streaming Kafka connector (in the Spark
> Runner).
>
> Right now, we have two approaches for the user:
> 1. In the pipeline, use KafkaIO from Beam: it's the preferred approach
> for sure. However, the user may want to use the runner-specific IO for
> two reasons:
>    * Beam doesn't provide the IO yet (for instance, the Spark Cassandra
>      connector is available whereas we don't yet have any CassandraIO
>      (I'm working on it anyway ;))
>    * The runner's native IO is optimized or contains more features than
>      the Beam native IO
> 2. So, for the previous reasons, the user could want to use the native
> runner IO. The drawback of this approach is that the pipeline will be
> tied to a specific runner, which is completely against the Beam design.
>
> I wonder if it wouldn't make sense to add a flag on the IO API (and a
> related one on the Runner API) like .useNative().
>
> For instance, the user would be able to do:
>
> pipeline.apply(KafkaIO.read().withBootstrapServers("...").withTopics("...").useNative(true));
>
> Then, if the runner has a "native" IO, it will use it; with
> useNative(false) (the default), it won't use any runner-native IO.
>
> The point there is the configuration: assuming the Beam IO and the
> runner IO can differ, it means that the Beam IO would have to populate
> all the runner-specific IO configuration.
>
> Of course, it's always possible to use a PTransform to wrap the runner
> native IO, but we are back to the same concern: the pipeline will be
> coupled to a specific runner.
>
> The purpose of the useNative() flag is to "automatically" inform the
> runner to use a specific IO if it has one: the pipeline stays decoupled
> from the runners.
>
> Thoughts?
>
> Thanks
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
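To make the useNative() idea above concrete, here is a rough sketch of how such a flag could sit on a source transform and be consulted during runner translation. All class and method names here are hypothetical; none of this exists in Beam today:

    // Hypothetical sketch of the proposed useNative() flag; the class
    // and method names are invented for illustration only.
    public abstract class NativeAwareSource<T>
        extends PTransform<PBegin, PCollection<T>> {

      private boolean useNative = false;

      // Opt in to a runner-native replacement, if the runner has one.
      public NativeAwareSource<T> useNative(boolean flag) {
        this.useNative = flag;
        return this;
      }

      public boolean isNativeRequested() {
        return useNative;
      }
    }

    // In a runner's translator, again purely illustrative:
    // if (source.isNativeRequested() && hasNativeEquivalent(source)) {
    //   translateNative(source, context);   // e.g. spark-streaming-kafka
    // } else {
    //   translateDefault(source, context);  // generic Beam expansion
    // }

The configuration concern from the mail above still applies either way: the settings expressed on the Beam IO would have to be mapped onto whatever configuration the native connector expects.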