Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-04 Thread Pankaj Chand
>> >>> * Flink transforms
>> >>>
>> >>> It wouldn't be hard to add a way to run arbitrary Flink operators
>> >>> through the Beam API. Like you said, once you go down that road, you
>> >>> lose the ability to run the pipeline on a different Runner. And
>> >>> that's precisely one of the selling points of Beam. I'm afraid once
>> >>> you even allow 1% non-portable pipelines, you have lost it all.
>> >> Absolutely true, but - the question here is "how much effort do I
>> >> have to invest in order to port a pipeline to a different runner?".
>> >> If this effort is low, I'd say the pipeline remains "nearly
>> >> portable". A typical example could be a machine learning task, where
>> >> you might have a lot of data cleansing and simple transformations,
>> >> followed by some ML algorithm (e.g. SVD). One might want to use
>> >> Spark MLlib for the ML task, but Beam for all the transformations
>> >> around. Then, porting to a different runner would mean providing
>> >> only a different implementation of the SVD, while everything else
>> >> would remain the same.
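
To make the seam described above concrete, here is a minimal Java sketch
(the class PortableMlPipeline and the identity stand-in for the SVD step
are hypothetical, not Beam or MLlib APIs): the cleansing stays in
portable Beam transforms, and the ML step hides behind a single
PTransform that is the only thing replaced when moving to another runner.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PortableMlPipeline {

  // The single runner-specific seam. The identity transform is a
  // placeholder: a Spark deployment might return an MLlib-backed SVD
  // here, another runner something else. Nothing else changes.
  static PTransform<PCollection<Double>, PCollection<Double>> svd() {
    return MapElements.into(TypeDescriptors.doubles()).via((Double x) -> x);
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(Create.of(1.0, -2.0, 3.0))                      // toy input
        .apply("Cleansing", Filter.by((Double x) -> x >= 0)) // portable Beam transforms
        .apply("SVD", svd());                                // the only swapped piece
    p.run().waitUntilFinish();
  }
}

Swapping svd() for an MLlib-backed expansion touches one method; the
cleansing stages and the pipeline wiring stay exactly the same.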
>> >>>
>> >>> Now, it would be a different story if we had a runner-agnostic way
>> >>> of running Flink operators on top of Beam. For a subset of the Flink
>> >>> transformations that might actually be possible. I'm not sure if
>> >>> it's feasible for Beam to depend on the Flink API.
>> >>>
>> >>> * Pipeline Tuning
>> >>>
>> >>> There are fewer bells and whistles in the Beam API than there are
>> >>> in Flink's. I'd consider that a feature. As Robert pointed out, the
>> >>> Runner can make any optimizations that it wants to do. If you have
>> >>> an idea for an optimization, we could build it into the FlinkRunner.
>> >>
>> >> Generally, there are optimizations that can depend heavily on the
>> >> concrete pipeline. Only then might you have enough information to
>> >> apply some very specific optimization.
>> >>
>> >> Jan
>> >>
>> >>
>> >>> On 02.05.19 13:44, Robert Bradshaw wrote:
>> >>>> Correct, there's no out-of-the-box way to do this. As mentioned,
>> >>>> this would also result in non-portable pipelines. However, even the
>> >>>> portability framework is set up such that runners can recognize
>> >>>> particular transforms and provide their own implementations thereof
>> >>>> (which is how translations are done for ParDo, GroupByKey, etc.),
>> >>>> and it is encouraged that runners do this for composite operations
>> >>>> they can do better on (e.g. I know Flink maps Reshard directly to
>> >>>> Redistribute rather than using the generic pair-with-random-key
>> >>>> implementation).
>> >>>>
>> >>>> If you really want to do this for MyFancyFlinkOperator, the current
>> >>>> solution is to adapt/extend FlinkRunner (possibly forking code) to
>> >>>> understand this operation and its substitution.
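
As a concrete illustration of that pattern, a rough Java sketch follows
(MyReshard is a hypothetical composite; coder and windowing details are
elided). Its default expansion is the generic pair-with-random-key
redistribution, which any runner can execute; a runner that recognizes
the composite is free to replace the whole expansion with a native
operation, just as Flink does for Reshard.

import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class MyReshard extends PTransform<PCollection<String>, PCollection<String>> {

  @Override
  public PCollection<String> expand(PCollection<String> input) {
    return input
        // Portable fallback: pair each element with a random key...
        .apply("PairWithRandomKey", ParDo.of(new DoFn<String, KV<Integer, String>>() {
          @ProcessElement
          public void process(@Element String element,
              OutputReceiver<KV<Integer, String>> out) {
            out.output(KV.of(ThreadLocalRandom.current().nextInt(), element));
          }
        }))
        // ...redistribute via a group-by-key...
        .apply(GroupByKey.create())
        // ...and drop the keys again.
        .apply("DropKeys", ParDo.of(new DoFn<KV<Integer, Iterable<String>>, String>() {
          @ProcessElement
          public void process(@Element KV<Integer, Iterable<String>> kv,
              OutputReceiver<String> out) {
            for (String value : kv.getValue()) {
              out.output(value);
            }
          }
        }));
  }
}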
>> >>>>
>> >>>>
>> >>>> On Thu, May 2, 2019 at 11:09 AM Jan Lukavský wrote:
>> >>>>>
>> >>>>> Just to clarify - the code I posted is just a proposal; it is not
>> >>>>> actually possible currently.
>> >>>>>
>> >>>>> On 5/2/19 11:05 AM, Jan Lukavský wrote:
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> I'd say that what Pankaj meant could be rephrased as "What if I
>> >>>>>> want to manually tune or tweak my Pipeline for a specific runner?
>> >>>>>> Do I have any options for that?". As I understand it, currently
>> >>>>>> the answer is no; PTransforms are somewhat hardwired into
>> >>>>>> runners, and the way they expand cannot be controlled or tuned.
>> >>>>>> But that could be changed; maybe something like this would make
>> >>>>>> it possible:
>> >>>>>>
>> >>>>>> PCollection<...> in = ...;
>> >>>>>> in.apply(new MyFancyFlinkOperator());
>> >>>>>>
>> >>>>>> // ...
>> >>>
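
If something along the lines of Jan's proposal were built with today's
Java SDK, the operator could look roughly like the sketch below
(MyFancyFlinkOperator is hypothetical, not an existing Beam or Flink
API): a marker composite whose default expansion is a trivial
pass-through, and which a forked or extended FlinkRunner would match and
replace with the native Flink operator - the substitution mechanism
Robert describes above.

import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MyFancyFlinkOperator extends PTransform<PCollection<String>, PCollection<String>> {

  @Override
  public PCollection<String> expand(PCollection<String> input) {
    // Placeholder default expansion. A modified FlinkRunner would
    // intercept this composite and translate it to the native Flink
    // operator; one could also throw here instead, so that running on
    // any other runner fails fast at the non-portable seam.
    return input.apply(MapElements.into(TypeDescriptors.strings()).via((String x) -> x));
  }
}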

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-01 Thread Pankaj Chand
>>> > Structured Streaming is spark-specific, but it borrows heavily
>>> > (among other things) from ideas that Beam itself pioneered long
>>> > before Spark 2.0, specifically the unification of batch and
>>> > streaming processing into a single API, and the event-time based
>>> > windowing (triggering) model for consistently and correctly handling
>>> > distributed, out-of-order data streams.
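
For readers unfamiliar with that model, a small Java sketch of Beam's
event-time windowing and triggering (the method is a toy stand-in; the
input PCollection is assumed to carry event-time timestamps):

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class EventTimeExample {

  // Counts events per one-minute event-time window. The watermark drives
  // the on-time firing; elements arriving up to 10 minutes late trigger
  // additional, accumulating panes. The same code runs batch or streaming.
  static PCollection<KV<String, Long>> countPerMinute(PCollection<String> events) {
    return events
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardMinutes(10))
            .accumulatingFiredPanes())
        .apply(Count.perElement());
  }
}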
>>> >
>>> > Of course there are also operational differences. Spark, for example,
>>> > is very tied to the micro-batch style of execution whereas Flink is
>>> > fundamentally very continuous, and Beam delegates to the underlying
>>> > runner.
>>> >
>>> > It is certainly Beam's goal to keep overhead minimal, and one of the
>>> > primary selling points is the flexibility of portability (of both the
>>> > execution runtime and the SDK) as your needs change.
>>> >
>>> > - Robert
>>> >
>>> >
>>> > On Tue, Apr 30, 2019 at 5:29 AM  wrote:
>>> >>
>>> >> Of course! I suspect Beam will always be one or two steps behind
>>> >> the new functionality that is available or yet to come.
>>> >>
>>> >> For example: Spark Structured Streaming is still not available,
>>> >> there are no CEP APIs yet, and much more.
>>> >>
>>> >> Sent from my iPhone
>>> >>
>>> >> On Apr 30, 2019, at 12:11 AM, Pankaj Chand wrote:
>>> >>
>>> >> Will Beam add any overhead or lack certain API/functions available in
>>> Spark/Flink?
>>>
>>


Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-04-29 Thread Pankaj Chand
Will Beam add any overhead or lack certain API/functions available in
Spark/Flink?