Flink transforms
>> >>>
>> >>> It wouldn't be hard to add a way to run arbitrary Flink operators
>> >>> through the Beam API. Like you said, once you go down that road, you
>> >>> lose the ability to run the pipeline on a different Runner. And
>> >>> that's precisely one of the selling points of Beam. I'm afraid once
>> >>> you even allow 1% non-portable pipelines, you have lost it all.
>> >> Absolutely true, but - the question here is "how much effort do I
>> >> have to invest in order to port pipeline to different runner?". If
>> >> this effort is low, I'd say the pipeline remains "nearly portable".
>> >> Typical example could be a machine learning task, where you might
>> >> have a lot of data cleansing and simple transformations, followed by
>> >> some ML algorithm (e.g. SVD). One might want to use Spark MLlib for
>> >> the ML task, but Beam for all the transformations around. Then,
>> >> porting to a different runner would mean only providing a different
>> >> implementation of the SVD, while everything else would remain the
>> >> same.
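[Editor's note] To make the "nearly portable" idea above concrete, here is a minimal plain-Java sketch (the interface and class names are invented for illustration; this is not the actual Beam API) of isolating the runner-specific SVD step behind a single seam, so porting to a different runner touches only one class:

```java
import java.util.List;
import java.util.stream.Collectors;

// The single runner-specific seam: swap this per runner (e.g. Spark MLlib).
interface SvdStep {
    List<Double> apply(List<Double> values);
}

// The portable part of the "pipeline": cleansing plus simple transformations,
// written once and unchanged regardless of which SvdStep is plugged in.
final class NearlyPortablePipeline {
    static List<Double> run(List<Double> raw, SvdStep svd) {
        List<Double> cleansed = raw.stream()
                .filter(v -> v != null && !v.isNaN())   // data cleansing
                .map(v -> v * 2.0)                      // simple transformation
                .collect(Collectors.toList());
        return svd.apply(cleansed);                     // runner-specific step
    }
}
```

Under this assumption, "porting" the pipeline means providing a new `SvdStep` implementation and nothing else.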
>> >>>
>> >>> Now, it would be a different story if we had a runner-agnostic way
>> >>> of running Flink operators on top of Beam. For a subset of the Flink
>> >>> transformations that might actually be possible. I'm not sure if
>> >>> it's feasible for Beam to depend on the Flink API.
>> >>>
>> >>> * Pipeline Tuning
>> >>>
>> >>> There are fewer bells and whistles in the Beam API than there are in
>> >>> Flink's. I'd consider that a feature. As Robert pointed out, the
>> >>> Runner can make any optimizations that it wants to do. If you have
>> >>> an idea for an optimization, we could build it into the FlinkRunner.
>> >>
>> >> Generally, there are optimizations that can be really dependent on
>> >> the pipeline. Only then might you have enough information to enable
>> >> some very specific optimization.
>> >>
>> >> Jan
>> >>
>> >>
>> >>> On 02.05.19 13:44, Robert Bradshaw wrote:
>> >>>> Correct, there's no out of the box way to do this. As mentioned, this
>> >>>> would also result in non-portable pipelines. However, even the
>> >>>> portability framework is set up such that runners can recognize
>> >>>> particular transforms and provide their own implementations thereof
>> >>>> (which is how translations are done for ParDo, GroupByKey, etc.) and
>> >>>> it is encouraged that runners do this for composite operations they
>> >>>> can do better on (e.g. I know Flink maps Reshard directly to
>> >>>> Redistribute rather than using the generic pair-with-random-key
>> >>>> implementation).
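[Editor's note] The recognize-and-substitute pattern described above can be sketched as follows (plain Java with invented names; this simplifies the real runner internals, where recognition happens on transform URNs in the portable pipeline representation): the runner keeps a map from transform identifiers to native overrides and falls back to the generic composite expansion when it does not recognize a transform.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model: a "transform" is just a function on strings here.
final class SubstitutingRunner {
    // urn -> native implementation this runner prefers over the generic expansion
    private final Map<String, Function<String, String>> overrides = new HashMap<>();

    void registerOverride(String urn, Function<String, String> nativeImpl) {
        overrides.put(urn, nativeImpl);
    }

    // If the runner recognizes the composite's urn, run its native version;
    // otherwise fall back to the generic (portable) expansion.
    String apply(String urn, Function<String, String> genericExpansion, String input) {
        return overrides.getOrDefault(urn, genericExpansion).apply(input);
    }
}
```

This is why the pipeline stays portable: a runner without the override still executes the generic expansion correctly.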
>> >>>>
>> >>>> If you really want to do this for MyFancyFlinkOperator, the current
>> >>>> solution is to adapt/extend FlinkRunner (possibly forking code) to
>> >>>> understand this operation and its substitution.
>> >>>>
>> >>>>
>> >>>> On Thu, May 2, 2019 at 11:09 AM Jan Lukavský wrote:
>> >>>>>
>> >>>>> Just to clarify - the code I posted is just a proposal, it is not
>> >>>>> actually possible currently.
>> >>>>>
>> >>>>> On 5/2/19 11:05 AM, Jan Lukavský wrote:
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> I'd say that what Pankaj meant could be rephrased as "What if I want
>> >>>>>> to manually tune or tweak my Pipeline for a specific runner? Do I have
>> >>>>>> any options for that?". As I understand it, currently the answer is
>> >>>>>> no: PTransforms are somewhat hardwired into runners, and the way they
>> >>>>>> expand cannot be controlled or tuned. But that could be changed;
>> >>>>>> maybe something like this would make it possible:
>> >>>>>>
>> >>>>>> PCollection<...> in = ...;
>> >>>>>> in.apply(new MyFancyFlinkOperator());
>> >>>>>>
>> >>>>>> // ...
>> >>>
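[Editor's note] One way the proposed `MyFancyFlinkOperator` could coexist with portability (plain-Java sketch with invented names, not a real or proposed Beam API) is for such an operator to always carry a portable default expansion alongside a runner hint, so the pipeline still runs on runners that do not recognize the hint:

```java
import java.util.function.UnaryOperator;

// Hypothetical operator pairing a runner hint with a portable fallback.
// A recognizing runner substitutes its native implementation; any other
// runner simply executes the portable fallback.
final class HintedOperator {
    final String runnerHint;                       // e.g. "flink:my-fancy-operator"
    final UnaryOperator<String> portableFallback;  // generic expansion

    HintedOperator(String runnerHint, UnaryOperator<String> portableFallback) {
        this.runnerHint = runnerHint;
        this.portableFallback = portableFallback;
    }

    // nativeImpl is supplied only by a runner that recognizes runnerHint.
    String run(UnaryOperator<String> nativeImpl, String input) {
        return (nativeImpl != null ? nativeImpl : portableFallback).apply(input);
    }
}
```

Under this assumption the pipeline never becomes fully non-portable; the hint only enables an optimization where available.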