Flink?

Jan Lukavský Thu, 02 May 2019 02:09:53 -0700

Just to clarify - the code I posted is just a proposal, it is not actually possible currently.


On 5/2/19 11:05 AM, Jan Lukavský wrote:

Hi,
I'd say that what Pankaj meant could be rephrased as "What if I want to manually tune or tweak my Pipeline for a specific runner? Do I have any options for that?". As I understand it, currently the answer is no: PTransforms are somewhat hardwired into runners, and the way they expand cannot be controlled or tuned. But that could be changed; maybe something like this would make it possible:
PCollection<...> in = ...;
in.apply(new MyFancyFlinkOperator());

// ...
in.getPipeline().getOptions().as(FlinkPipelineOptions.class)
    .setTransformOverride(MyFancyFlinkOperator.class, new MyFancyFlinkOperatorExpander());
The `expander` could have access to the full Flink API (through the runner), and that way any transform could be overridden or customized for specific runtime conditions. Of course, this has the downside that you end up with a non-portable pipeline (also, this is probably in conflict with the general portability framework). But, as I see it, usually you would need to override only a very small part of your pipeline. So, say, 90% of the pipeline would be portable, and in order to port it to a different runner, you would need to implement only a small specific part.
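To make the idea of a per-runner override a bit more concrete, here is a minimal, self-contained sketch in plain Java. It does not use Beam or Flink at all: `Transform`, `OverrideRegistry`, `PlainTransform`, and the string "expansions" are hypothetical stand-ins invented for illustration, and `setTransformOverride` only mirrors the name used in the proposal (it is not an existing `FlinkPipelineOptions` method). The sketch shows only the lookup rule: use a runner-specific expander when one is registered for the transform's class, otherwise fall back to the portable expansion.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class TransformOverrideSketch {

    /** Hypothetical stand-in for a transform; this is NOT Beam's PTransform. */
    public interface Transform {
        String defaultExpansion();
    }

    /** The transform the user wants to special-case on one runner. */
    public static final class MyFancyFlinkOperator implements Transform {
        @Override
        public String defaultExpansion() {
            return "portable-expansion";
        }
    }

    /** Any other transform, which keeps its portable expansion. */
    public static final class PlainTransform implements Transform {
        @Override
        public String defaultExpansion() {
            return "portable-plain";
        }
    }

    /** Hypothetical per-runner registry of expanders, keyed by transform class. */
    public static final class OverrideRegistry {
        private final Map<Class<? extends Transform>, Function<Transform, String>> overrides =
                new HashMap<>();

        public void setTransformOverride(
                Class<? extends Transform> type, Function<Transform, String> expander) {
            overrides.put(type, expander);
        }

        /** Uses the registered expander if there is one, else the portable default. */
        public String expand(Transform transform) {
            Function<Transform, String> expander = overrides.get(transform.getClass());
            return expander != null ? expander.apply(transform) : transform.defaultExpansion();
        }
    }

    public static void main(String[] args) {
        OverrideRegistry flinkOptions = new OverrideRegistry();
        flinkOptions.setTransformOverride(
                MyFancyFlinkOperator.class, t -> "flink-native-expansion");

        // Overridden transform: prints "flink-native-expansion"
        System.out.println(flinkOptions.expand(new MyFancyFlinkOperator()));
        // Untouched transform: prints "portable-plain"
        System.out.println(flinkOptions.expand(new PlainTransform()));
    }
}
```

Only the overridden transform becomes runner-specific in this sketch; everything else keeps its default expansion, which is the "90% portable" property described above.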
Jan

On 5/2/19 9:45 AM, Robert Bradshaw wrote:
On Thu, May 2, 2019 at 12:13 AM Pankaj Chand <pankajchanda...@gmail.com> wrote:
I thought by choosing Beam, I would get the benefits of both (Spark and Flink), one at a time. Now, I'm understanding that I might not get the full potential from either of the two.
You get the benefit of being able to choose, without rewriting your
pipeline, whether to run on Spark or Flink. Or the next new runner
that comes around. As well as the Beam model, API, etc.
Example: If I use Beam with Flink, and then a new feature is added to Flink but I cannot access it via Beam, and that feature is not important to the Beam community, then what is the suggested workaround? If I really need that feature, I would not want to re-write my pipeline in Flink from scratch.
How is this different from "If I used Spark, and a new feature is
added to Flink, I cannot access it from Spark. If that feature is not
important to the Spark community, then what is the suggested
workaround? If I really need that feature, I would not want to
re-write my pipeline in Flink from scratch." Or vice-versa. Or when a
new feature is added to Beam itself (that may not be present in any
of the underlying systems). Beam's feature set is neither the
intersection nor union of the feature sets of those runners it has
available as execution engines. (Even the notion of what is meant by
"feature" is nebulous enough that it's hard to make this discussion
concrete.)
Is it possible that in the near future, most of Beam's capabilities would favor Google's Dataflow API? That way, Beam could be used to lure developers and organizations who would typically use Spark/Flink, with the promise of portability. After they get dependent on Beam and cannot afford to re-write their pipelines in Spark/Flink from scratch, they would realize that Beam does not give access to some of the capabilities of the free engines that they may require. Then, they would be told that if they want all possible capabilities and would want to use their code in Beam, they could pay for the Dataflow engine instead.
Google is very upfront about the fact that they are selling a service
to run Beam pipelines in a completely managed way. But Google has
*also* invested very heavily in making sure that the portability story
is not just talk, for those who need or want to run their jobs on
premise or elsewhere (now or in the future). It is our goal that all
runners be able to run all pipelines, and this is a community effort.
Pankaj
On Tue, Apr 30, 2019 at 6:15 PM Kenneth Knowles <k...@apache.org> wrote:
It is worth noting that Beam isn't solely a portability layer that exposes underlying API features, but a feature-rich layer in its own right, with carefully coherent abstractions. For example, quite early on the SparkRunner supported streaming aspects of the Beam model - watermarks, windowing, triggers - that were not really available any other way. Beam's various features sometimes require just a pass-through API and sometimes require a clever new implementation. And everything is moving constantly. I don't see Beam as following the features of any engine, but rather coming up with new needed data processing abstractions and figuring out how to efficiently implement them on top of various architectures.
Kenn
On Tue, Apr 30, 2019 at 8:37 AM kant kodali <kanth...@gmail.com> wrote:
Staying behind doesn't imply one is better than the other, and I didn't mean that in any way, but I fail to see how an abstraction framework like Beam can stay ahead of the underlying execution engines.
For example, if a new feature is added to the underlying execution engine that doesn't fit the interface of Beam, or breaks it, then I would think the interface would need to be changed. Another example: say the underlying execution engines take different kinds of parameters for the same feature. Then it isn't so straightforward to come up with an interface, since there might be very little in common in the first place. So, in that sense, I fail to see how Beam can stay ahead.
"Of course the API itself is Spark-specific, but it borrows heavily (among other things) on ideas that Beam itself pioneered long before Spark 2.0" Good to know.
"one of the things Beam has focused on was a language portability framework" Sure, but how important is this for a typical user? Do people stop using a particular tool because it is in an X language? I personally would put features first over language portability, and it's completely fine that this may not be in line with Beam's priorities. All said, I can agree Beam's focus on language portability is great.
On Tue, Apr 30, 2019 at 2:48 AM Maximilian Michels <m...@apache.org> wrote:
I wouldn't say one is, or will always be, in front of or behind another.
That's a great way to phrase it. I think it is very common to jump to the conclusion that one system is better than the other. In reality it's
often much more complicated.

For example, one of the things Beam has focused on was a language
portability framework. Do I get this with Flink? No. Does that mean Beam is better than Flink? No. Maybe a better question would be, do I want to
be able to run Python pipelines?

This is just an example, there are many more factors to consider.

Cheers,
Max

On 30.04.19 10:59, Robert Bradshaw wrote:
Though we all certainly have our biases, I think it's fair to say that all of these systems are constantly innovating, borrowing ideas from one another, and have their strengths and weaknesses. I wouldn't say
one is, or will always be, in front of or behind another.
Take, as the given example, Spark Structured Streaming. Of course the
API itself is spark-specific, but it borrows heavily (among other
things) on ideas that Beam itself pioneered long before Spark 2.0,
specifically the unification of batch and streaming processing into a single API, and the event-time based windowing (triggering) model for
consistently and correctly handling distributed, out-of-order data
streams.
Of course there are also operational differences. Spark, for example,
is very tied to the micro-batch style of execution whereas Flink is
fundamentally very continuous, and Beam delegates to the underlying
runner.
It is certainly Beam's goal to keep overhead minimal, and one of the primary selling points is the flexibility of portability (of both the
execution runtime and the SDK) as your needs change.

- Robert


On Tue, Apr 30, 2019 at 5:29 AM <kanth...@gmail.com> wrote:
Of course! I suspect Beam will always be one or two steps behind the new functionality that is available or yet to come.
For example: Spark Structured Streaming is still not available, no CEP APIs yet, and much more.
Sent from my iPhone
On Apr 30, 2019, at 12:11 AM, Pankaj Chand <pankajchanda...@gmail.com> wrote:
Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?
