Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-06 Thread Juan Carlos Garcia
Hi,

I don't want to hijack the thread regarding the why, but to keep it short:
we experienced a lot of problems with Spark (streaming pipelines) and
checkpoints, to the point that restarting a pipeline was a gamble: Spark
would go nuts while restoring from a checkpoint, resulting in data loss
(Spark bugs). There were pipelines under heavy development that required
redeploying them multiple times a day.

We found in Flink the stability and features we needed while we planned a
migration to a managed environment (luckily Dataflow, which at that time
was not yet approved), and in our case, as you mentioned, we were lucky to
be able to switch across runners without major problems.

Thanks

kant kodali  wrote on Mon, 6 May 2019, 21:34:

> 1) It would be good to know the reasons why you guys moved from one
> execution to another?
> 2) You are lucky to have your problem fit into all three execution engines
> and supported by the Beam at the same time. This is certainly not the case
> for me since some runners that Beam supports are still a Work in progress
> while the execution engine had the support since 2 years at very least.
>
>
>
> On Mon, May 6, 2019 at 12:24 PM kant kodali  wrote:
>
>>
>>
>> On Mon, May 6, 2019 at 12:09 PM Juan Carlos Garcia 
>> wrote:
>>
>>> As everyone has pointed out there will be a small overhead added by the
>>> abstraction but in my own experience its totally worth it.
>>>
>>> Almost two years ago we decided to jump into the beam wagon, by first
>>> deploying into an on-premises hadoop cluster with the Spark engine (just
>>> because spark was already available and we didn't want to introduce a new
>>> stack in our hadoop cluster), then we moved to a Flink cluster (due to
>>> others reason) and few months later we moved 90% of our streaming
>>> processing to Dataflow (in order to migrate the on-premises cluster to the
>>> cloud), all that wouldn't have been possible without the beam abstraction.
>>>
>>> In conclusion beam abstraction rocks, it's not perfect, but it's really
>>> good.
>>>
>>> Just my 2 cents.
>>>
>>> Matt Casters  schrieb am Mo., 6. Mai 2019, 15:33:
>>>
 I've dealt with responses like this for a number of decades.  With
 Kettle Beam I could say: "here, in 20 minutes of visual programming you
 have your pipeline up and running".  It's easy to set up, maintain, debug,
 unit test, version control... the whole thing. And then someone would say:
 Naaah, if I don't code it myself I don't trust it.  Usually it's worded
 differently but that's what it comes down to.
 Some people think in terms of impossibilities instead of possibilities
 and will always find some reason why they fall in that 0.1% of the cases.

 > Lets say Beam came up with the abstractions long before other runners
 but to map things to runners it is going to take time (that's where things
 are today). so its always a moving target.

 Any scaleable data processing problem you might have that can't be
 solved by Spark, Flink or DataFlow is pretty obscure don't you think?

 Great discussion :-)

 Cheers,
 Matt
 ---
 Matt Casters attcast...@gmail.com>
 Senior Solution Architect, Kettle Project Founder



 Op zo 5 mei 2019 om 00:18 schreef kant kodali :

> I believe this comes down to more of abstractions vs execution engines
> and I am sure people can take on both sides. I think both are important
> however It is worth noting that the execution framework themselves have a
> lot of abstractions but sure more generic ones can be built on top. Are
> abstractions always good?! I will just point to this book
> 
>
> I tend to lean more on the execution engines side because I can build
> something on top. I am also not sure if Beam is the first one to come up
> with these ideas since Frameworks like Cascading had existed long before.
>
> Lets say Beam came up with the abstractions long before other runners
> but to map things to runners it is going to take time (that's where things
> are today). so its always a moving target.
>
>
>
>
>
>
> On Tue, Apr 30, 2019 at 3:15 PM Kenneth Knowles 
> wrote:
>
>> It is worth noting that Beam isn't solely a portability layer that
>> exposes underlying API features, but a feature-rich layer in its own 
>> right,
>> with carefully coherent abstractions. For example, quite early on the
>> SparkRunner supported streaming aspects of the Beam model - watermarks,
>> windowing, triggers - that were not really available any other way. 
>> Beam's
>> various features sometimes requires just a pass-through API and sometimes
>> requires clever new implementation. And everything is moving constantl

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-06 Thread kant kodali
1) It would be good to know the reasons why you moved from one execution
engine to another.
2) You are lucky that your problem fits all three execution engines and is
supported by Beam at the same time. This is certainly not the case for me,
since some runners that Beam supports are still a work in progress, while
the execution engines themselves have had the support for at least two
years.



On Mon, May 6, 2019 at 12:24 PM kant kodali  wrote:

>
>
> On Mon, May 6, 2019 at 12:09 PM Juan Carlos Garcia 
> wrote:
>
>> As everyone has pointed out there will be a small overhead added by the
>> abstraction but in my own experience its totally worth it.
>>
>> Almost two years ago we decided to jump into the beam wagon, by first
>> deploying into an on-premises hadoop cluster with the Spark engine (just
>> because spark was already available and we didn't want to introduce a new
>> stack in our hadoop cluster), then we moved to a Flink cluster (due to
>> others reason) and few months later we moved 90% of our streaming
>> processing to Dataflow (in order to migrate the on-premises cluster to the
>> cloud), all that wouldn't have been possible without the beam abstraction.
>>
>> In conclusion beam abstraction rocks, it's not perfect, but it's really
>> good.
>>
>> Just my 2 cents.
>>
>> Matt Casters  schrieb am Mo., 6. Mai 2019, 15:33:
>>
>>> I've dealt with responses like this for a number of decades.  With
>>> Kettle Beam I could say: "here, in 20 minutes of visual programming you
>>> have your pipeline up and running".  It's easy to set up, maintain, debug,
>>> unit test, version control... the whole thing. And then someone would say:
>>> Naaah, if I don't code it myself I don't trust it.  Usually it's worded
>>> differently but that's what it comes down to.
>>> Some people think in terms of impossibilities instead of possibilities
>>> and will always find some reason why they fall in that 0.1% of the cases.
>>>
>>> > Lets say Beam came up with the abstractions long before other runners
>>> but to map things to runners it is going to take time (that's where things
>>> are today). so its always a moving target.
>>>
>>> Any scaleable data processing problem you might have that can't be
>>> solved by Spark, Flink or DataFlow is pretty obscure don't you think?
>>>
>>> Great discussion :-)
>>>
>>> Cheers,
>>> Matt
>>> ---
>>> Matt Casters attcast...@gmail.com>
>>> Senior Solution Architect, Kettle Project Founder
>>>
>>>
>>>
>>> Op zo 5 mei 2019 om 00:18 schreef kant kodali :
>>>
 I believe this comes down to more of abstractions vs execution engines
 and I am sure people can take on both sides. I think both are important
 however It is worth noting that the execution framework themselves have a
 lot of abstractions but sure more generic ones can be built on top. Are
 abstractions always good?! I will just point to this book
 

 I tend to lean more on the execution engines side because I can build
 something on top. I am also not sure if Beam is the first one to come up
 with these ideas since Frameworks like Cascading had existed long before.

 Lets say Beam came up with the abstractions long before other runners
 but to map things to runners it is going to take time (that's where things
 are today). so its always a moving target.






 On Tue, Apr 30, 2019 at 3:15 PM Kenneth Knowles 
 wrote:

> It is worth noting that Beam isn't solely a portability layer that
> exposes underlying API features, but a feature-rich layer in its own 
> right,
> with carefully coherent abstractions. For example, quite early on the
> SparkRunner supported streaming aspects of the Beam model - watermarks,
> windowing, triggers - that were not really available any other way. Beam's
> various features sometimes requires just a pass-through API and sometimes
> requires clever new implementation. And everything is moving constantly. I
> don't see Beam as following the features of any engine, but rather coming
> up with new needed data processing abstractions and figuring out how to
> efficiently implement them on top of various architectures.
>
> Kenn
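
For illustration, a minimal sketch of the watermark, windowing and
triggering features Kenn mentions, using the Beam Java SDK; the window
size, lateness value and the assumed PCollection<String> named "events"
are illustrative, not from the thread:

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Event-time windowing: elements are grouped by when they happened, the
// watermark tracks event-time progress, and the trigger decides when to
// emit results (here: at the watermark, plus late firings for late data).
PCollection<KV<String, Long>> counts =
    events
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
            .withAllowedLateness(Duration.standardMinutes(30))
            .accumulatingFiredPanes())
        .apply(Count.perElement());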
>
> On Tue, Apr 30, 2019 at 8:37 AM kant kodali 
> wrote:
>
>> Staying behind doesn't imply one is better than the other and I
>> didn't mean that in any way but I fail to see how an abstraction 
>> framework
>> like Beam can stay ahead of the underlying execution engines?
>>
>> For example, If a new feature is added into the underlying execution
>> engine that doesn't fit the interface of Beam or breaks then I would 
>> think
>> the interface would need to be changed. Another example would say the
>> underlying execution engines take different kind's o

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-06 Thread kant kodali
On Mon, May 6, 2019 at 12:09 PM Juan Carlos Garcia 
wrote:

> As everyone has pointed out there will be a small overhead added by the
> abstraction but in my own experience its totally worth it.
>
> Almost two years ago we decided to jump into the beam wagon, by first
> deploying into an on-premises hadoop cluster with the Spark engine (just
> because spark was already available and we didn't want to introduce a new
> stack in our hadoop cluster), then we moved to a Flink cluster (due to
> others reason) and few months later we moved 90% of our streaming
> processing to Dataflow (in order to migrate the on-premises cluster to the
> cloud), all that wouldn't have been possible without the beam abstraction.
>
> In conclusion beam abstraction rocks, it's not perfect, but it's really
> good.
>
> Just my 2 cents.
>
> Matt Casters  schrieb am Mo., 6. Mai 2019, 15:33:
>
>> I've dealt with responses like this for a number of decades.  With Kettle
>> Beam I could say: "here, in 20 minutes of visual programming you have your
>> pipeline up and running".  It's easy to set up, maintain, debug, unit test,
>> version control... the whole thing. And then someone would say: Naaah, if I
>> don't code it myself I don't trust it.  Usually it's worded differently but
>> that's what it comes down to.
>> Some people think in terms of impossibilities instead of possibilities
>> and will always find some reason why they fall in that 0.1% of the cases.
>>
>> > Lets say Beam came up with the abstractions long before other runners
>> but to map things to runners it is going to take time (that's where things
>> are today). so its always a moving target.
>>
>> Any scaleable data processing problem you might have that can't be solved
>> by Spark, Flink or DataFlow is pretty obscure don't you think?
>>
>> Great discussion :-)
>>
>> Cheers,
>> Matt
>> ---
>> Matt Casters attcast...@gmail.com>
>> Senior Solution Architect, Kettle Project Founder
>>
>>
>>
>> Op zo 5 mei 2019 om 00:18 schreef kant kodali :
>>
>>> I believe this comes down to more of abstractions vs execution engines
>>> and I am sure people can take on both sides. I think both are important
>>> however It is worth noting that the execution framework themselves have a
>>> lot of abstractions but sure more generic ones can be built on top. Are
>>> abstractions always good?! I will just point to this book
>>> 
>>>
>>> I tend to lean more on the execution engines side because I can build
>>> something on top. I am also not sure if Beam is the first one to come up
>>> with these ideas since Frameworks like Cascading had existed long before.
>>>
>>> Lets say Beam came up with the abstractions long before other runners
>>> but to map things to runners it is going to take time (that's where things
>>> are today). so its always a moving target.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Apr 30, 2019 at 3:15 PM Kenneth Knowles  wrote:
>>>
 It is worth noting that Beam isn't solely a portability layer that
 exposes underlying API features, but a feature-rich layer in its own right,
 with carefully coherent abstractions. For example, quite early on the
 SparkRunner supported streaming aspects of the Beam model - watermarks,
 windowing, triggers - that were not really available any other way. Beam's
 various features sometimes requires just a pass-through API and sometimes
 requires clever new implementation. And everything is moving constantly. I
 don't see Beam as following the features of any engine, but rather coming
 up with new needed data processing abstractions and figuring out how to
 efficiently implement them on top of various architectures.

 Kenn

 On Tue, Apr 30, 2019 at 8:37 AM kant kodali  wrote:

> Staying behind doesn't imply one is better than the other and I didn't
> mean that in any way but I fail to see how an abstraction framework like
> Beam can stay ahead of the underlying execution engines?
>
> For example, If a new feature is added into the underlying execution
> engine that doesn't fit the interface of Beam or breaks then I would think
> the interface would need to be changed. Another example would say the
> underlying execution engines take different kind's of parameters for the
> same feature then it isn't so straight forward to come up with an 
> interface
> since there might be very little in common in the first place so, in that
> sense, I fail to see how Beam can stay ahead.
>
> "Of course the API itself is Spark-specific, but it borrows heavily
> (among other things) on ideas that Beam itself pioneered long before Spark
> 2.0" Good to know.
>
> "one of the things Beam has focused on was a language portability
> framework"  Sure but how important is this for a typical user? D

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-06 Thread Juan Carlos Garcia
As everyone has pointed out, there will be a small overhead added by the
abstraction, but in my own experience it's totally worth it.

Almost two years ago we decided to jump on the Beam wagon, first deploying
into an on-premises Hadoop cluster with the Spark engine (just because
Spark was already available and we didn't want to introduce a new stack in
our Hadoop cluster). Then we moved to a Flink cluster (for other reasons),
and a few months later we moved 90% of our streaming processing to Dataflow
(in order to migrate the on-premises cluster to the cloud). None of that
would have been possible without the Beam abstraction.
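
For illustration, switching runners like this is mostly a matter of
pipeline options rather than pipeline code; a minimal sketch with the Beam
Java SDK (the class name and exact option values are illustrative, and
each runner needs its artifact on the classpath):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PortablePipeline {
  public static void main(String[] args) {
    // The same pipeline runs on different engines depending on the flag:
    //   --runner=SparkRunner
    //   --runner=FlinkRunner --flinkMaster=<host:port>
    //   --runner=DataflowRunner --project=<gcp-project> --region=<region>
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);
    // ... apply the same transforms regardless of the chosen runner ...
    p.run().waitUntilFinish();
  }
}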

In conclusion, the Beam abstraction rocks. It's not perfect, but it's
really good.

Just my 2 cents.

Matt Casters  wrote on Mon, 6 May 2019, 15:33:

> I've dealt with responses like this for a number of decades.  With Kettle
> Beam I could say: "here, in 20 minutes of visual programming you have your
> pipeline up and running".  It's easy to set up, maintain, debug, unit test,
> version control... the whole thing. And then someone would say: Naaah, if I
> don't code it myself I don't trust it.  Usually it's worded differently but
> that's what it comes down to.
> Some people think in terms of impossibilities instead of possibilities and
> will always find some reason why they fall in that 0.1% of the cases.
>
> > Lets say Beam came up with the abstractions long before other runners
> but to map things to runners it is going to take time (that's where things
> are today). so its always a moving target.
>
> Any scaleable data processing problem you might have that can't be solved
> by Spark, Flink or DataFlow is pretty obscure don't you think?
>
> Great discussion :-)
>
> Cheers,
> Matt
> ---
> Matt Casters attcast...@gmail.com>
> Senior Solution Architect, Kettle Project Founder
>
>
>
> Op zo 5 mei 2019 om 00:18 schreef kant kodali :
>
>> I believe this comes down to more of abstractions vs execution engines
>> and I am sure people can take on both sides. I think both are important
>> however It is worth noting that the execution framework themselves have a
>> lot of abstractions but sure more generic ones can be built on top. Are
>> abstractions always good?! I will just point to this book
>> 
>>
>> I tend to lean more on the execution engines side because I can build
>> something on top. I am also not sure if Beam is the first one to come up
>> with these ideas since Frameworks like Cascading had existed long before.
>>
>> Lets say Beam came up with the abstractions long before other runners but
>> to map things to runners it is going to take time (that's where things are
>> today). so its always a moving target.
>>
>>
>>
>>
>>
>>
>> On Tue, Apr 30, 2019 at 3:15 PM Kenneth Knowles  wrote:
>>
>>> It is worth noting that Beam isn't solely a portability layer that
>>> exposes underlying API features, but a feature-rich layer in its own right,
>>> with carefully coherent abstractions. For example, quite early on the
>>> SparkRunner supported streaming aspects of the Beam model - watermarks,
>>> windowing, triggers - that were not really available any other way. Beam's
>>> various features sometimes requires just a pass-through API and sometimes
>>> requires clever new implementation. And everything is moving constantly. I
>>> don't see Beam as following the features of any engine, but rather coming
>>> up with new needed data processing abstractions and figuring out how to
>>> efficiently implement them on top of various architectures.
>>>
>>> Kenn
>>>
>>> On Tue, Apr 30, 2019 at 8:37 AM kant kodali  wrote:
>>>
 Staying behind doesn't imply one is better than the other and I didn't
 mean that in any way but I fail to see how an abstraction framework like
 Beam can stay ahead of the underlying execution engines?

 For example, If a new feature is added into the underlying execution
 engine that doesn't fit the interface of Beam or breaks then I would think
 the interface would need to be changed. Another example would say the
 underlying execution engines take different kind's of parameters for the
 same feature then it isn't so straight forward to come up with an interface
 since there might be very little in common in the first place so, in that
 sense, I fail to see how Beam can stay ahead.

 "Of course the API itself is Spark-specific, but it borrows heavily
 (among other things) on ideas that Beam itself pioneered long before Spark
 2.0" Good to know.

 "one of the things Beam has focused on was a language portability
 framework"  Sure but how important is this for a typical user? Do people
 stop using a particular tool because it is in an X language? I personally
 would put features first over language portability and it's completely fine
 that ma

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-06 Thread Matt Casters
I've dealt with responses like this for a number of decades. With Kettle
Beam I could say: "here, in 20 minutes of visual programming you have your
pipeline up and running". It's easy to set up, maintain, debug, unit test,
version control... the whole thing. And then someone would say: "Naaah, if I
don't code it myself I don't trust it." Usually it's worded differently, but
that's what it comes down to.
Some people think in terms of impossibilities instead of possibilities and
will always find some reason why they fall into that 0.1% of cases.

> Lets say Beam came up with the abstractions long before other runners but
to map things to runners it is going to take time (that's where things are
today). so its always a moving target.

Any scalable data processing problem you might have that can't be solved
by Spark, Flink or Dataflow is pretty obscure, don't you think?

Great discussion :-)

Cheers,
Matt
---
Matt Casters <mattcast...@gmail.com>
Senior Solution Architect, Kettle Project Founder



On Sun, 5 May 2019 at 00:18, kant kodali  wrote:

> I believe this comes down to more of abstractions vs execution engines and
> I am sure people can take on both sides. I think both are important however
> It is worth noting that the execution framework themselves have a lot of
> abstractions but sure more generic ones can be built on top. Are
> abstractions always good?! I will just point to this book
> 
>
> I tend to lean more on the execution engines side because I can build
> something on top. I am also not sure if Beam is the first one to come up
> with these ideas since Frameworks like Cascading had existed long before.
>
> Lets say Beam came up with the abstractions long before other runners but
> to map things to runners it is going to take time (that's where things are
> today). so its always a moving target.
>
>
>
>
>
>
> On Tue, Apr 30, 2019 at 3:15 PM Kenneth Knowles  wrote:
>
>> It is worth noting that Beam isn't solely a portability layer that
>> exposes underlying API features, but a feature-rich layer in its own right,
>> with carefully coherent abstractions. For example, quite early on the
>> SparkRunner supported streaming aspects of the Beam model - watermarks,
>> windowing, triggers - that were not really available any other way. Beam's
>> various features sometimes requires just a pass-through API and sometimes
>> requires clever new implementation. And everything is moving constantly. I
>> don't see Beam as following the features of any engine, but rather coming
>> up with new needed data processing abstractions and figuring out how to
>> efficiently implement them on top of various architectures.
>>
>> Kenn
>>
>> On Tue, Apr 30, 2019 at 8:37 AM kant kodali  wrote:
>>
>>> Staying behind doesn't imply one is better than the other and I didn't
>>> mean that in any way but I fail to see how an abstraction framework like
>>> Beam can stay ahead of the underlying execution engines?
>>>
>>> For example, If a new feature is added into the underlying execution
>>> engine that doesn't fit the interface of Beam or breaks then I would think
>>> the interface would need to be changed. Another example would say the
>>> underlying execution engines take different kind's of parameters for the
>>> same feature then it isn't so straight forward to come up with an interface
>>> since there might be very little in common in the first place so, in that
>>> sense, I fail to see how Beam can stay ahead.
>>>
>>> "Of course the API itself is Spark-specific, but it borrows heavily
>>> (among other things) on ideas that Beam itself pioneered long before Spark
>>> 2.0" Good to know.
>>>
>>> "one of the things Beam has focused on was a language portability
>>> framework"  Sure but how important is this for a typical user? Do people
>>> stop using a particular tool because it is in an X language? I personally
>>> would put features first over language portability and it's completely fine
>>> that may not be in line with Beam's priorities. All said I can agree Beam
>>> focus on language portability is great.
>>>
>>> On Tue, Apr 30, 2019 at 2:48 AM Maximilian Michels 
>>> wrote:
>>>
 > I wouldn't say one is, or will always be, in front of or behind
 another.

 That's a great way to phrase it. I think it is very common to jump to
 the conclusion that one system is better than the other. In reality
 it's
 often much more complicated.

 For example, one of the things Beam has focused on was a language
 portability framework. Do I get this with Flink? No. Does that mean
 Beam
 is better than Flink? No. Maybe a better question would be, do I want
 to
 be able to run Python pipelines?

 This is just an example, there are many more factors to consider.

 Cheers,
 Max

 On 30.04.19 10:59, Robert Brad

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-04 Thread kant kodali
I believe this comes down to abstractions vs. execution engines, and I am
sure people can take either side. I think both are important; however, it
is worth noting that the execution frameworks themselves have a lot of
abstractions, though sure, more generic ones can be built on top. Are
abstractions always good?! I will just point to this book.


I tend to lean more toward the execution-engine side because I can build
something on top of it. I am also not sure that Beam is the first one to
come up with these ideas, since frameworks like Cascading existed long
before.

Let's say Beam came up with the abstractions long before other runners;
still, mapping things onto runners is going to take time (that's where
things are today), so it's always a moving target.






On Tue, Apr 30, 2019 at 3:15 PM Kenneth Knowles  wrote:

> It is worth noting that Beam isn't solely a portability layer that exposes
> underlying API features, but a feature-rich layer in its own right, with
> carefully coherent abstractions. For example, quite early on the
> SparkRunner supported streaming aspects of the Beam model - watermarks,
> windowing, triggers - that were not really available any other way. Beam's
> various features sometimes requires just a pass-through API and sometimes
> requires clever new implementation. And everything is moving constantly. I
> don't see Beam as following the features of any engine, but rather coming
> up with new needed data processing abstractions and figuring out how to
> efficiently implement them on top of various architectures.
>
> Kenn
>
> On Tue, Apr 30, 2019 at 8:37 AM kant kodali  wrote:
>
>> Staying behind doesn't imply one is better than the other and I didn't
>> mean that in any way but I fail to see how an abstraction framework like
>> Beam can stay ahead of the underlying execution engines?
>>
>> For example, If a new feature is added into the underlying execution
>> engine that doesn't fit the interface of Beam or breaks then I would think
>> the interface would need to be changed. Another example would say the
>> underlying execution engines take different kind's of parameters for the
>> same feature then it isn't so straight forward to come up with an interface
>> since there might be very little in common in the first place so, in that
>> sense, I fail to see how Beam can stay ahead.
>>
>> "Of course the API itself is Spark-specific, but it borrows heavily
>> (among other things) on ideas that Beam itself pioneered long before Spark
>> 2.0" Good to know.
>>
>> "one of the things Beam has focused on was a language portability
>> framework"  Sure but how important is this for a typical user? Do people
>> stop using a particular tool because it is in an X language? I personally
>> would put features first over language portability and it's completely fine
>> that may not be in line with Beam's priorities. All said I can agree Beam
>> focus on language portability is great.
>>
>> On Tue, Apr 30, 2019 at 2:48 AM Maximilian Michels 
>> wrote:
>>
>>> > I wouldn't say one is, or will always be, in front of or behind
>>> another.
>>>
>>> That's a great way to phrase it. I think it is very common to jump to
>>> the conclusion that one system is better than the other. In reality it's
>>> often much more complicated.
>>>
>>> For example, one of the things Beam has focused on was a language
>>> portability framework. Do I get this with Flink? No. Does that mean Beam
>>> is better than Flink? No. Maybe a better question would be, do I want to
>>> be able to run Python pipelines?
>>>
>>> This is just an example, there are many more factors to consider.
>>>
>>> Cheers,
>>> Max
>>>
>>> On 30.04.19 10:59, Robert Bradshaw wrote:
>>> > Though we all certainly have our biases, I think it's fair to say that
>>> > all of these systems are constantly innovating, borrowing ideas from
>>> > one another, and have their strengths and weaknesses. I wouldn't say
>>> > one is, or will always be, in front of or behind another.
>>> >
>>> > Take, as the given example Spark Structured Streaming. Of course the
>>> > API itself is spark-specific, but it borrows heavily (among other
>>> > things) on ideas that Beam itself pioneered long before Spark 2.0,
>>> > specifically the unification of batch and streaming processing into a
>>> > single API, and the event-time based windowing (triggering) model for
>>> > consistently and correctly handling distributed, out-of-order data
>>> > streams.
>>> >
>>> > Of course there are also operational differences. Spark, for example,
>>> > is very tied to the micro-batch style of execution whereas Flink is
>>> > fundamentally very continuous, and Beam delegates to the underlying
>>> > runner.
>>> >
>>> > It is certainly Beam's goal to keep overhead minimal, and one of the
>>> > primary selling points is the flexibility of portability (of both t

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-04 Thread Pankaj Chand
Hi Matt,

My project is for my PhD. So, I am interested in those 0.1% of use cases.

--Pankaj

On Sat, May 4, 2019, 10:48 AM Matt Casters  wrote:

> Anything can be coded in any form or language on any platform.
> However, doing so takes time and effort.  Maintaining the code takes time
> as well as protecting the investments you made from changes in the
> ecosystem.
> This is obviously where APIs like Beam come into play quite heavily.  New
> technology seems to come around like fads these days and that innovation is
> obviously not a bad thing.  We would still be using Map/Reduce if it was.
> But for people trying to build solutions changing platforms is a painful
> process incurring massive costs.
> So with that in mind I would bounce this question back: why on Earth would
> you *want* to write for a specific platform?  Are you *really* interested
> in those 0.1% use cases and is it really helping your business move
> forward?  It's possible but if not, I would strongly advice against it.
>
> Just my 2 cents.
>
> Cheers,
> Matt
> ---
> Matt Casters attcast...@gmail.com>
> Senior Solution Architect, Kettle Project Founder
>
>
>
>
> Op vr 3 mei 2019 om 22:42 schreef Jan Lukavský :
>
>> Hi,
>>
>> On 5/3/19 12:20 PM, Maximilian Michels wrote:
>> > Hi Jan,
>> >
>> >> Typical example could be a machine learning task, where you might
>> >> have a lot of data cleansing and simple transformations, followed by
>> >> some ML algorithm (e.g. SVD). One might want to use Spark MLlib for
>> >> the ML task, but Beam for all the transformations around. Then,
>> >> porting to different runner would mean only provide different
>> >> implementation of the SVD, but everything else would remaining the
>> same.
>> >
>> > This is a fair point. Of course you could always split up the pipeline
>> > into two jobs, e.g. have a native Spark job and a Beam job running on
>> > Spark.
>> >
>> > Something that came to my mind is "unsafe" in Rust which allows you to
>> > leave the safe abstractions of Rust and use raw C code. If Beam had
>> > something like that which really emphasized the non-portable aspect of
>> > a transform, that could change things:
>> >
>> >   Pipeline p = ..
>> >   p.getOptions().setAllowNonPortable(true);
>> >   p.apply(
>> >   NonPortable.of(new MyFlinkOperator(), FlinkRunner.class));
>> >
>> > Again, I'm not sure we want to go down that road, but if there are
>> > really specific use cases, we could think about it.
>> Yes, this is exactly what I meant. I think that this doesn't threat any
>> of Beam's selling points, because this way, you declare you *want* your
>> pipeline being non-portable, so if you don't do it on purpose, your
>> pipeline will still be portable. The key point here is that the
>> underlying systems are likely to evolve quicker than Beam (in some
>> directions or some ways - Beam might on the other hand bring features to
>> these systems, that's for sure). Examples might be Spark's MLlib or
>> Flink's iterative streams.
>> >
>> >> Generally, there are optimizations that could be really dependent on
>> >> the pipeline. Only then you might have enough information that can
>> >> result in some very specific optimization.
>> >
>> > If these pattern can be detected in DAGs, then we can built
>> > optimizations into the FlinkRunner. If that is not feasible, then
>> > you're out luck. Could you describe an optimization that you miss in
>> > Beam?
>>
>> I think that sometimes you cannot infer all possible optimizations from
>> the DAG itself. If you read from a source (e.g. Kafka), information
>> about how do you partition data when writing to Kafka might help you
>> avoid additional shuffling in some cases. That's probably something you
>> could be in theory able to do via some annotations of sources, but the
>> fundamental question here is - do you really want to do that? Or just
>> let the user perform some hard coding when he knows that it might help
>> in his particular case (possible even corner case)?
>>
>> Jan
>>
>> >
>> > Cheers,
>> > Max
>> >
>> > On 02.05.19 22:44, Jan Lukavský wrote:
>> >> Hi Max,
>> >>
>> >> comments inline.
>> >>
>> >> On 5/2/19 3:29 PM, Maximilian Michels wrote:
>> >>> Couple of comments:
>> >>>
>> >>> * Flink transforms
>> >>>
>> >>> It wouldn't be hard to add a way to run arbitrary Flink operators
>> >>> through the Beam API. Like you said, once you go down that road, you
>> >>> loose the ability to run the pipeline on a different Runner. And
>> >>> that's precisely one of the selling points of Beam. I'm afraid once
>> >>> you even allow 1% non-portable pipelines, you have lost it all.
>> >> Absolutely true, but - the question here is "how much effort do I
>> >> have to invest in order to port pipeline to different runner?". If
>> >> this effort is low, I'd say the pipeline remains "nearly portable".
>> >> Typical example could be a machine learning task, where you might
>> >> have a lot of data cleansing and simple transformations, followed by
>> >> some ML a

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-04 Thread Matt Casters
Anything can be coded in any form or language on any platform.
However, doing so takes time and effort. Maintaining the code takes time,
as does protecting the investments you have made from changes in the
ecosystem.
This is obviously where APIs like Beam come into play quite heavily. New
technologies seem to come around like fads these days, and that innovation
is obviously not a bad thing; we would still be using Map/Reduce otherwise.
But for people trying to build solutions, changing platforms is a painful
process that incurs massive costs.
So with that in mind I would bounce the question back: why on Earth would
you *want* to write for a specific platform? Are you *really* interested
in those 0.1% of use cases, and is it really helping your business move
forward? It's possible, but if not, I would strongly advise against it.

Just my 2 cents.

Cheers,
Matt
---
Matt Casters <mattcast...@gmail.com>
Senior Solution Architect, Kettle Project Founder




On Fri, 3 May 2019 at 22:42, Jan Lukavský  wrote:

> Hi,
>
> On 5/3/19 12:20 PM, Maximilian Michels wrote:
> > Hi Jan,
> >
> >> Typical example could be a machine learning task, where you might
> >> have a lot of data cleansing and simple transformations, followed by
> >> some ML algorithm (e.g. SVD). One might want to use Spark MLlib for
> >> the ML task, but Beam for all the transformations around. Then,
> >> porting to different runner would mean only provide different
> >> implementation of the SVD, but everything else would remaining the
> same.
> >
> > This is a fair point. Of course you could always split up the pipeline
> > into two jobs, e.g. have a native Spark job and a Beam job running on
> > Spark.
> >
> > Something that came to my mind is "unsafe" in Rust which allows you to
> > leave the safe abstractions of Rust and use raw C code. If Beam had
> > something like that which really emphasized the non-portable aspect of
> > a transform, that could change things:
> >
> >   Pipeline p = ..
> >   p.getOptions().setAllowNonPortable(true);
> >   p.apply(
> >   NonPortable.of(new MyFlinkOperator(), FlinkRunner.class));
> >
> > Again, I'm not sure we want to go down that road, but if there are
> > really specific use cases, we could think about it.
> Yes, this is exactly what I meant. I think that this doesn't threat any
> of Beam's selling points, because this way, you declare you *want* your
> pipeline being non-portable, so if you don't do it on purpose, your
> pipeline will still be portable. The key point here is that the
> underlying systems are likely to evolve quicker than Beam (in some
> directions or some ways - Beam might on the other hand bring features to
> these systems, that's for sure). Examples might be Spark's MLlib or
> Flink's iterative streams.
> >
> >> Generally, there are optimizations that could be really dependent on
> >> the pipeline. Only then you might have enough information that can
> >> result in some very specific optimization.
> >
> > If these pattern can be detected in DAGs, then we can built
> > optimizations into the FlinkRunner. If that is not feasible, then
> > you're out luck. Could you describe an optimization that you miss in
> > Beam?
>
> I think that sometimes you cannot infer all possible optimizations from
> the DAG itself. If you read from a source (e.g. Kafka), information
> about how do you partition data when writing to Kafka might help you
> avoid additional shuffling in some cases. That's probably something you
> could be in theory able to do via some annotations of sources, but the
> fundamental question here is - do you really want to do that? Or just
> let the user perform some hard coding when he knows that it might help
> in his particular case (possible even corner case)?
>
> Jan
>
> >
> > Cheers,
> > Max
> >
> > On 02.05.19 22:44, Jan Lukavský wrote:
> >> Hi Max,
> >>
> >> comments inline.
> >>
> >> On 5/2/19 3:29 PM, Maximilian Michels wrote:
> >>> Couple of comments:
> >>>
> >>> * Flink transforms
> >>>
> >>> It wouldn't be hard to add a way to run arbitrary Flink operators
> >>> through the Beam API. Like you said, once you go down that road, you
> >>> loose the ability to run the pipeline on a different Runner. And
> >>> that's precisely one of the selling points of Beam. I'm afraid once
> >>> you even allow 1% non-portable pipelines, you have lost it all.
> >> Absolutely true, but - the question here is "how much effort do I
> >> have to invest in order to port pipeline to different runner?". If
> >> this effort is low, I'd say the pipeline remains "nearly portable".
> >> Typical example could be a machine learning task, where you might
> >> have a lot of data cleansing and simple transformations, followed by
> >> some ML algorithm (e.g. SVD). One might want to use Spark MLlib for
> >> the ML task, but Beam for all the transformations around. Then,
> >> porting to different runner would mean only provide different
> >> implementation of the SVD, but everything else would remaining the same.
> >>

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-03 Thread Jan Lukavský

Hi,

On 5/3/19 12:20 PM, Maximilian Michels wrote:

Hi Jan,

Typical example could be a machine learning task, where you might 
have a lot of data cleansing and simple transformations, followed by 
some ML algorithm (e.g. SVD). One might want to use Spark MLlib for 
the ML task, but Beam for all the transformations around. Then, 
porting to different runner would mean only provide different 
implementation of the SVD, but everything else would remaining the same. 


This is a fair point. Of course you could always split up the pipeline 
into two jobs, e.g. have a native Spark job and a Beam job running on 
Spark.


Something that came to my mind is "unsafe" in Rust which allows you to 
leave the safe abstractions of Rust and use raw C code. If Beam had 
something like that which really emphasized the non-portable aspect of 
a transform, that could change things:


  Pipeline p = ..
  p.getOptions().setAllowNonPortable(true);
  p.apply(
  NonPortable.of(new MyFlinkOperator(), FlinkRunner.class));

Again, I'm not sure we want to go down that road, but if there are 
really specific use cases, we could think about it.
Yes, this is exactly what I meant. I think this doesn't threaten any
of Beam's selling points, because this way you declare that you *want* your
pipeline to be non-portable; if you don't do it on purpose, your
pipeline will still be portable. The key point here is that the
underlying systems are likely to evolve quicker than Beam (in some
directions or in some ways - Beam might on the other hand bring features to
these systems, that's for sure). Examples might be Spark's MLlib or
Flink's iterative streams.


Generally, there are optimizations that could be really dependent on 
the pipeline. Only then you might have enough information that can 
result in some very specific optimization. 


If these pattern can be detected in DAGs, then we can built 
optimizations into the FlinkRunner. If that is not feasible, then 
you're out luck. Could you describe an optimization that you miss in 
Beam?


I think that sometimes you cannot infer all possible optimizations from
the DAG itself. If you read from a source (e.g. Kafka), information
about how the data was partitioned when it was written to Kafka might help
you avoid additional shuffling in some cases. That's probably something you
could in theory do via some annotations on sources, but the
fundamental question here is - do you really want to do that? Or just
let the user do some hard coding when he knows that it might help
in his particular case (possibly even a corner case)?


Jan
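
For illustration, a minimal sketch of the situation Jan describes, using
KafkaIO from the Beam Java SDK (broker, topic and element types are
illustrative, and "p" is an assumed Pipeline); the GroupByKey below is
translated to a shuffle even if the topic is already partitioned by the
same key, because Beam has no portable way to declare that:

import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;

// Read key/value records from Kafka.
PCollection<KV<String, String>> records =
    p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("broker:9092")
        .withTopic("events")
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withoutMetadata());

// Even if the Kafka topic is already partitioned by this key, the model
// cannot express that, so the runner performs a full shuffle here.
PCollection<KV<String, Iterable<String>>> grouped =
    records.apply(GroupByKey.<String, String>create());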



Cheers,
Max

On 02.05.19 22:44, Jan Lukavský wrote:

Hi Max,

comments inline.

On 5/2/19 3:29 PM, Maximilian Michels wrote:

Couple of comments:

* Flink transforms

It wouldn't be hard to add a way to run arbitrary Flink operators 
through the Beam API. Like you said, once you go down that road, you 
loose the ability to run the pipeline on a different Runner. And 
that's precisely one of the selling points of Beam. I'm afraid once 
you even allow 1% non-portable pipelines, you have lost it all.
Absolutely true, but - the question here is "how much effort do I 
have to invest in order to port pipeline to different runner?". If 
this effort is low, I'd say the pipeline remains "nearly portable". 
Typical example could be a machine learning task, where you might 
have a lot of data cleansing and simple transformations, followed by 
some ML algorithm (e.g. SVD). One might want to use Spark MLlib for 
the ML task, but Beam for all the transformations around. Then, 
porting to different runner would mean only provide different 
implementation of the SVD, but everything else would remaining the same.


Now, it would be a different story if we had a runner-agnostic way 
of running Flink operators on top of Beam. For a subset of the Flink 
transformations that might actually be possible. I'm not sure if 
it's feasible for Beam to depend on the Flink API.


* Pipeline Tuning

There are less bells and whistlers in the Beam API then there are in 
Flink's. I'd consider that a feature. As Robert pointed out, the 
Runner can make any optimizations that it wants to do. If you have 
an idea for an optimizations we could built it into the FlinkRunner.


Generally, there are optimizations that could be really dependent on 
the pipeline. Only then you might have enough information that can 
result in some very specific optimization.


Jan



On 02.05.19 13:44, Robert Bradshaw wrote:

Correct, there's no out of the box way to do this. As mentioned, this
would also result in non-portable pipelines. However, even the
portability framework is set up such that runners can recognize
particular transforms and provide their own implementations thereof
(which is how translations are done for ParDo, GroupByKey, etc.) and
it is encouraged that runners do this for composite operations they
have can do better on (e.g. I know Flink maps Reshard directly to
Redistribute rather tha

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-03 Thread Maximilian Michels

Hi Jan,

Typical example could be a machine learning task, where you might have a lot of data cleansing and simple transformations, followed by some ML algorithm (e.g. SVD). One might want to use Spark MLlib for the ML task, but Beam for all the transformations around. Then, porting to different runner would mean only provide different implementation of the SVD, but everything else would remaining the same. 


This is a fair point. Of course you could always split up the pipeline 
into two jobs, e.g. have a native Spark job and a Beam job running on Spark.


Something that came to my mind is "unsafe" in Rust, which allows you to
leave the safe abstractions of Rust and use raw C code. If Beam had
something like that which really emphasized the non-portable aspect of a
transform, that could change things:


  // Hypothetical API sketch - setAllowNonPortable and NonPortable do not
  // exist in Beam today; this only illustrates the proposed escape hatch.
  Pipeline p = ...;
  p.getOptions().setAllowNonPortable(true);
  p.apply(
      NonPortable.of(new MyFlinkOperator(), FlinkRunner.class));

Again, I'm not sure we want to go down that road, but if there are 
really specific use cases, we could think about it.


Generally, there are optimizations that could be really dependent on the pipeline. Only then you might have enough information that can result in some very specific optimization. 


If these patterns can be detected in DAGs, then we can build
optimizations into the FlinkRunner. If that is not feasible, then you're
out of luck. Could you describe an optimization that you miss in Beam?


Cheers,
Max

On 02.05.19 22:44, Jan Lukavský wrote:

Hi Max,

comments inline.

On 5/2/19 3:29 PM, Maximilian Michels wrote:

Couple of comments:

* Flink transforms

It wouldn't be hard to add a way to run arbitrary Flink operators 
through the Beam API. Like you said, once you go down that road, you 
loose the ability to run the pipeline on a different Runner. And 
that's precisely one of the selling points of Beam. I'm afraid once 
you even allow 1% non-portable pipelines, you have lost it all.
Absolutely true, but - the question here is "how much effort do I have 
to invest in order to port pipeline to different runner?". If this 
effort is low, I'd say the pipeline remains "nearly portable". Typical 
example could be a machine learning task, where you might have a lot of 
data cleansing and simple transformations, followed by some ML algorithm 
(e.g. SVD). One might want to use Spark MLlib for the ML task, but Beam 
for all the transformations around. Then, porting to different runner 
would mean only provide different implementation of the SVD, but 
everything else would remaining the same.


Now, it would be a different story if we had a runner-agnostic way of 
running Flink operators on top of Beam. For a subset of the Flink 
transformations that might actually be possible. I'm not sure if it's 
feasible for Beam to depend on the Flink API.


* Pipeline Tuning

There are less bells and whistlers in the Beam API then there are in 
Flink's. I'd consider that a feature. As Robert pointed out, the 
Runner can make any optimizations that it wants to do. If you have an 
idea for an optimizations we could built it into the FlinkRunner.


Generally, there are optimizations that could be really dependent on the 
pipeline. Only then you might have enough information that can result in 
some very specific optimization.


Jan



On 02.05.19 13:44, Robert Bradshaw wrote:

Correct, there's no out of the box way to do this. As mentioned, this
would also result in non-portable pipelines. However, even the
portability framework is set up such that runners can recognize
particular transforms and provide their own implementations thereof
(which is how translations are done for ParDo, GroupByKey, etc.) and
it is encouraged that runners do this for composite operations they
have can do better on (e.g. I know Flink maps Reshard directly to
Redistribute rather than using the generic pair-with-random-key
implementation).

If you really want to do this for MyFancyFlinkOperator, the current
solution is to adapt/extend FlinkRunner (possibly forking code) to
understand this operation and its substitution.


On Thu, May 2, 2019 at 11:09 AM Jan Lukavský  wrote:


Just to clarify - the code I posted is just a proposal, it is not
actually possible currently.

On 5/2/19 11:05 AM, Jan Lukavský wrote:

Hi,

I'd say that what Pankaj meant could be rephrased as "What if I want
to manually tune or tweak my Pipeline for specific runner? Do I have
any options for that?". As I understand it, currently the answer is,
no, PTransforms are somewhat hardwired into runners and the way they
expand cannot be controlled or tuned. But, that could be changed,
maybe something like this would make it possible:

PCollection<...> in = ...;
in.apply(new MyFancyFlinkOperator());

// ...

in.getPipeline().getOptions().as(FlinkPipelineOptions.class)
    .setTransformOverride(MyFancyFlinkOperator.class,
        new MyFancyFlinkOperatorExpander());


The `expander` could have access to full Flink API (through the
runner) and that

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-02 Thread Jan Lukavský

Hi Max,

comments inline.

On 5/2/19 3:29 PM, Maximilian Michels wrote:

Couple of comments:

* Flink transforms

It wouldn't be hard to add a way to run arbitrary Flink operators 
through the Beam API. Like you said, once you go down that road, you 
loose the ability to run the pipeline on a different Runner. And 
that's precisely one of the selling points of Beam. I'm afraid once 
you even allow 1% non-portable pipelines, you have lost it all.
Absolutely true, but - the question here is "how much effort do I have
to invest in order to port the pipeline to a different runner?". If this
effort is low, I'd say the pipeline remains "nearly portable". A typical
example could be a machine learning task, where you might have a lot of
data cleansing and simple transformations, followed by some ML algorithm
(e.g. SVD). One might want to use Spark MLlib for the ML task, but Beam
for all the transformations around. Then, porting to a different runner
would mean only providing a different implementation of the SVD, while
everything else would remain the same.
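
For illustration, one way to keep such a pipeline "nearly portable" is to
hide the engine-specific step behind its own composite transform, so that
only this one transform has to be swapped per runner; a minimal sketch,
where ComputeSvd, the double[] element type and the two implementations
are purely illustrative placeholders:

import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;

// Runner-agnostic contract: the rest of the pipeline depends only on this.
abstract class ComputeSvd
    extends PTransform<PCollection<double[]>, PCollection<double[]>> {}

// Engine-specific variant, e.g. delegating to Spark MLlib when deployed
// on the Spark runner (the actual delegation is left out here).
class SparkMllibSvd extends ComputeSvd {
  @Override
  public PCollection<double[]> expand(PCollection<double[]> rows) {
    // ... engine-specific implementation would go here ...
    return rows; // placeholder
  }
}

// Portable fallback built purely from Beam transforms for other runners.
class PortableSvd extends ComputeSvd {
  @Override
  public PCollection<double[]> expand(PCollection<double[]> rows) {
    // ... generic Beam implementation would go here ...
    return rows; // placeholder
  }
}

// The surrounding pipeline stays identical; only this line changes:
// cleansed.apply(useSpark ? new SparkMllibSvd() : new PortableSvd());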


Now, it would be a different story if we had a runner-agnostic way of 
running Flink operators on top of Beam. For a subset of the Flink 
transformations that might actually be possible. I'm not sure if it's 
feasible for Beam to depend on the Flink API.


* Pipeline Tuning

There are less bells and whistlers in the Beam API then there are in 
Flink's. I'd consider that a feature. As Robert pointed out, the 
Runner can make any optimizations that it wants to do. If you have an 
idea for an optimizations we could built it into the FlinkRunner.


Generally, there are optimizations that can be highly dependent on the
concrete pipeline. Only then might you have enough information to apply
some very specific optimization.


Jan



On 02.05.19 13:44, Robert Bradshaw wrote:

Correct, there's no out of the box way to do this. As mentioned, this
would also result in non-portable pipelines. However, even the
portability framework is set up such that runners can recognize
particular transforms and provide their own implementations thereof
(which is how translations are done for ParDo, GroupByKey, etc.) and
it is encouraged that runners do this for composite operations they
have can do better on (e.g. I know Flink maps Reshard directly to
Redistribute rather than using the generic pair-with-random-key
implementation).

If you really want to do this for MyFancyFlinkOperator, the current
solution is to adapt/extend FlinkRunner (possibly forking code) to
understand this operation and its substitution.
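
For illustration, a minimal sketch of the pattern described above: a
composite transform ships a generic expansion (pair with a random key,
group, drop the key), and a runner that recognizes the composite can
substitute a native redistribute for the whole expansion. The class name
and number of key buckets are illustrative, the sketch assumes bounded or
already-windowed input, and Beam's own Reshuffle transform plays a similar
role in practice:

import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Generic "redistribute": a runner may recognize this composite and
// replace its whole expansion with a native repartition operator.
class Redistribute<T> extends PTransform<PCollection<T>, PCollection<T>> {
  @Override
  public PCollection<T> expand(PCollection<T> input) {
    PCollection<KV<Integer, T>> keyed =
        input
            .apply("PairWithRandomKey", ParDo.of(new DoFn<T, KV<Integer, T>>() {
              @ProcessElement
              public void process(@Element T element, OutputReceiver<KV<Integer, T>> out) {
                out.output(KV.of(ThreadLocalRandom.current().nextInt(100), element));
              }
            }))
            // Type variables defeat coder inference, so set it explicitly.
            .setCoder(KvCoder.of(VarIntCoder.of(), input.getCoder()));
    return keyed
        .apply(GroupByKey.<Integer, T>create())
        .apply("DropKey", ParDo.of(new DoFn<KV<Integer, Iterable<T>>, T>() {
          @ProcessElement
          public void process(@Element KV<Integer, Iterable<T>> kv, OutputReceiver<T> out) {
            for (T value : kv.getValue()) {
              out.output(value);
            }
          }
        }))
        .setCoder(input.getCoder());
  }
}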


On Thu, May 2, 2019 at 11:09 AM Jan Lukavský  wrote:


Just to clarify - the code I posted is just a proposal, it is not
actually possible currently.

On 5/2/19 11:05 AM, Jan Lukavský wrote:

Hi,

I'd say that what Pankaj meant could be rephrased as "What if I want
to manually tune or tweak my Pipeline for specific runner? Do I have
any options for that?". As I understand it, currently the answer is,
no, PTransforms are somewhat hardwired into runners and the way they
expand cannot be controlled or tuned. But, that could be changed,
maybe something like this would make it possible:

PCollection<...> in = ...;
in.apply(new MyFancyFlinkOperator());

// ...

in.getPipeline().getOptions().as(FlinkPipelineOptions.class).setTransformOverride(MyFancyFlinkOperator.class, 


new MyFancyFlinkOperatorExpander());


The `expander` could have access to full Flink API (through the
runner) and that way any transform could be overridden or customized
for specific runtime conditions. Of course, this has downside, that
you end up with non portable pipeline (also, this is probably in
conflict with general portability framework). But, as I see it,
usually you would need to override only very small part of you
pipeline. So, say 90% of pipeline would be portable and in order to
port it to different runner, you would need to implement only small
specific part.

Jan

On 5/2/19 9:45 AM, Robert Bradshaw wrote:

On Thu, May 2, 2019 at 12:13 AM Pankaj Chand
 wrote:

I thought by choosing Beam, I would get the benefits of both (Spark
and Flink), one at a time. Now, I'm understanding that I might not
get the full potential from either of the two.

You get the benefit of being able to choose, without rewriting your
pipeline, whether to run on Spark or Flink. Or the next new runner
that comes around. As well as the Beam model, API, etc.


Example: If I use Beam with Flink, and then a new feature is added
to Flink but I cannot access it via Beam, and that feature is not
important to the Beam community, then what is the suggested
workaround? If I really need that feature, I would not want to
re-write my pipeline in Flink from scratch.

How is this different from "If I used Spark, and a new feature is
added to Flink, I cannot access it from Spark. If that feature is not
important to the Spark community, then what is the suggested
workaround? If I really need that feature, I would not want to
re-write my pipe

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-02 Thread Pablo Estrada
One example of a feature that Beam can provide to other runners is SQL.
Beam SQL expands into Beam transforms, so it can run on any runner. Flink
and Spark have their own SQL support because they've invested in it, but
think of smaller runners, e.g. Nemo.

Of course, not all of Beam's features or abstractions work the same way,
but this is one case.
Best
-P.
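
For illustration, a minimal sketch of what that looks like with
SqlTransform from the Beam Java SDK's SQL extension (schema, field names
and query are illustrative, and "events" is an assumed PCollection<Row>
carrying that schema); because the SQL expands into ordinary Beam
transforms, any Beam runner can execute it:

import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

Schema schema =
    Schema.builder()
        .addStringField("user_id")
        .addInt64Field("clicks")
        .build();

// 'events' is a PCollection<Row> with the schema above; the query is
// planned by Beam SQL and expanded into regular Beam transforms.
PCollection<Row> totals =
    events.apply(
        SqlTransform.query(
            "SELECT user_id, SUM(clicks) AS total_clicks "
                + "FROM PCOLLECTION GROUP BY user_id"));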

On Thu, May 2, 2019 at 10:39 AM kant kodali  wrote:

> If people don't want to use it because crucial libraries are written in
> only some language but not available in others, that makes some sense
> otherwise I would think it is biased(which is what happens most of the
> time). A lot of the Language arguments are biased anyways since most of
> them just talk about syntactic sugar all day.
>
> "For many use cases, the cost of retraining data analysts, software
> engineers, data scientists, ... to use a language they are unfamiliar it is
> a much greater cost (not just salaries but delays in project completion)
> then the cost of the hardware that the jobs run on. Once the cost of the
> jobs are significant, paying to optimize it via a different implementation,
> performance tuning, ... becomes worthwhile."
>
> I agree with this and By different implementation, I am assuming you meant
> optimizing the language that "data analysts, software engineers, data
> scientists" are familiar with. On the contrary, I don't understand why
> Google pays a particular group to come up with new languages when there are
> so many languages already available!
>
>
> On Thu, May 2, 2019 at 10:00 AM Lukasz Cwik  wrote:
>
>>
>>
>> On Thu, May 2, 2019 at 6:29 AM Maximilian Michels  wrote:
>>
>>> Couple of comments:
>>>
>>> * Flink transforms
>>>
>>> It wouldn't be hard to add a way to run arbitrary Flink operators
>>> through the Beam API. Like you said, once you go down that road, you
>>> loose the ability to run the pipeline on a different Runner. And that's
>>> precisely one of the selling points of Beam. I'm afraid once you even
>>> allow 1% non-portable pipelines, you have lost it all.
>>>
>>> Now, it would be a different story if we had a runner-agnostic way of
>>> running Flink operators on top of Beam. For a subset of the Flink
>>> transformations that might actually be possible. I'm not sure if it's
>>> feasible for Beam to depend on the Flink API.
>>>
>>> * Pipeline Tuning
>>>
>>> There are fewer bells and whistles in the Beam API than there are in
>>> Flink's. I'd consider that a feature. As Robert pointed out, the Runner
>>> can make any optimizations that it wants to do. If you have an idea for
>>> an optimization we could build it into the FlinkRunner.
>>>
>>> I'd also consider if we could add an easier way for the user to apply
>>> custom optimization code, apart from forking the FlinkRunner.
>>>
>>> * Lock-In
>>>
>>> > Is it possible that in the near future, most of Beam's capabilities
>>> would favor Google's Dataflow API?
>>>
>>> I think that was true for the predecessor of Beam which was built for
>>> Google Cloud Dataflow, although even then there were different runtimes
>>> within Google, i.e. FlumeJava (batch) and MillWheel (streaming).
>>>
>>> The idea of Beam is to build a framework that works in the open-source
>>> as well as in proprietary Runners. As with any Apache project, there are
>>> different interests within the project. A healthy community will ensure
>>> that the interests are well-balanced. The Apache development model also
>>> has the advantage that small parties cannot be simply overruled.
>>>
>>
>> +1
>>
>>
>>> * Language portability
>>>
>>> > "one of the things Beam has focused on was a language portability
>>> framework"  Sure but how important is this for a typical user? Do people
>>> stop using a particular tool because it is in an X language? I
>>>
>>> It is very important to some people. So important that they wouldn't use
>>> a system which does not offer it. Possible reasons: crucial libraries
>>> only available in Python, users that refuse to use Java.
>>>
>>
>> For many use cases, the cost of retraining data analysts, software
>> engineers, data scientists, ... to use a language they are unfamiliar with is
>> a much greater cost (not just salaries but delays in project completion)
>> than the cost of the hardware that the jobs run on. Once the cost of the
>> jobs are significant, paying to optimize it via a different implementation,
>> performance tuning, ... becomes worthwhile.
>>
>>
>>>
>>> Cheers,
>>> Max
>>>
>>> On 02.05.19 13:44, Robert Bradshaw wrote:
>>> > Correct, there's no out of the box way to do this. As mentioned, this
>>> > would also result in non-portable pipelines. However, even the
>>> > portability framework is set up such that runners can recognize
>>> > particular transforms and provide their own implementations thereof
>>> > (which is how translations are done for ParDo, GroupByKey, etc.) and
>>> > it is encouraged that runners do this for composite operations they
>>> > have can do be

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-02 Thread kant kodali
If people don't want to use it because crucial libraries are written in
only some languages but not available in others, that makes some sense;
otherwise I would think it is biased (which is what happens most of the
time). A lot of the language arguments are biased anyway, since most of
them just talk about syntactic sugar all day.

"For many use cases, the cost of retraining data analysts, software
engineers, data scientists, ... to use a language they are unfamiliar with is
a much greater cost (not just salaries but delays in project completion)
than the cost of the hardware that the jobs run on. Once the cost of the
jobs are significant, paying to optimize it via a different implementation,
performance tuning, ... becomes worthwhile."

I agree with this, and by different implementation, I am assuming you meant
optimizing the language that "data analysts, software engineers, data
scientists" are familiar with. On the contrary, I don't understand why
Google pays a particular group to come up with new languages when there are
so many languages already available!


On Thu, May 2, 2019 at 10:00 AM Lukasz Cwik  wrote:

>
>
> On Thu, May 2, 2019 at 6:29 AM Maximilian Michels  wrote:
>
>> Couple of comments:
>>
>> * Flink transforms
>>
>> It wouldn't be hard to add a way to run arbitrary Flink operators
>> through the Beam API. Like you said, once you go down that road, you
>> lose the ability to run the pipeline on a different Runner. And that's
>> precisely one of the selling points of Beam. I'm afraid once you even
>> allow 1% non-portable pipelines, you have lost it all.
>>
>> Now, it would be a different story if we had a runner-agnostic way of
>> running Flink operators on top of Beam. For a subset of the Flink
>> transformations that might actually be possible. I'm not sure if it's
>> feasible for Beam to depend on the Flink API.
>>
>> * Pipeline Tuning
>>
>> There are fewer bells and whistles in the Beam API than there are in
>> Flink's. I'd consider that a feature. As Robert pointed out, the Runner
>> can make any optimizations that it wants to do. If you have an idea for
>> an optimization we could build it into the FlinkRunner.
>>
>> I'd also consider if we could add an easier way for the user to apply
>> custom optimization code, apart from forking the FlinkRunner.
>>
>> * Lock-In
>>
>> > Is it possible that in the near future, most of Beam's capabilities
>> would favor Google's Dataflow API?
>>
>> I think that was true for the predecessor of Beam which was built for
>> Google Cloud Dataflow, although even then there were different runtimes
>> within Google, i.e. FlumeJava (batch) and MillWheel (streaming).
>>
>> The idea of Beam is to build a framework that works in the open-source
>> as well as in proprietary Runners. As with any Apache project, there are
>> different interests within the project. A healthy community will ensure
>> that the interests are well-balanced. The Apache development model also
>> has the advantage that small parties cannot be simply overruled.
>>
>
> +1
>
>
>> * Language portability
>>
>> > "one of the things Beam has focused on was a language portability
>> framework"  Sure but how important is this for a typical user? Do people
>> stop using a particular tool because it is in an X language? I
>>
>> It is very important to some people. So important that they wouldn't use
>> a system which does not offer it. Possible reasons: crucial libraries
>> only available in Python, users that refuse to use Java.
>>
>
> For many use cases, the cost of retraining data analysts, software
> engineers, data scientists, ... to use a language they are unfamiliar with is
> a much greater cost (not just salaries but delays in project completion)
> than the cost of the hardware that the jobs run on. Once the cost of the
> jobs are significant, paying to optimize it via a different implementation,
> performance tuning, ... becomes worthwhile.
>
>
>>
>> Cheers,
>> Max
>>
>> On 02.05.19 13:44, Robert Bradshaw wrote:
>> > Correct, there's no out of the box way to do this. As mentioned, this
>> > would also result in non-portable pipelines. However, even the
>> > portability framework is set up such that runners can recognize
>> > particular transforms and provide their own implementations thereof
>> > (which is how translations are done for ParDo, GroupByKey, etc.) and
>> > it is encouraged that runners do this for composite operations they
>> > can do better on (e.g. I know Flink maps Reshard directly to
>> > Redistribute rather than using the generic pair-with-random-key
>> > implementation).
>> >
>> > If you really want to do this for MyFancyFlinkOperator, the current
>> > solution is to adapt/extend FlinkRunner (possibly forking code) to
>> > understand this operation and its substitution.
>> >
>> >
>> > On Thu, May 2, 2019 at 11:09 AM Jan Lukavský  wrote:
>> >>
>> >> Just to clarify - the code I posted is just a proposal, it is not
>> >> actually possible currently.
>> >>
>> >> On 5/2/19 1

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-02 Thread Lukasz Cwik
On Thu, May 2, 2019 at 6:29 AM Maximilian Michels  wrote:

> Couple of comments:
>
> * Flink transforms
>
> It wouldn't be hard to add a way to run arbitrary Flink operators
> through the Beam API. Like you said, once you go down that road, you
> lose the ability to run the pipeline on a different Runner. And that's
> precisely one of the selling points of Beam. I'm afraid once you even
> allow 1% non-portable pipelines, you have lost it all.
>
> Now, it would be a different story if we had a runner-agnostic way of
> running Flink operators on top of Beam. For a subset of the Flink
> transformations that might actually be possible. I'm not sure if it's
> feasible for Beam to depend on the Flink API.
>
> * Pipeline Tuning
>
> There are fewer bells and whistles in the Beam API than there are in
> Flink's. I'd consider that a feature. As Robert pointed out, the Runner
> can make any optimizations that it wants to do. If you have an idea for
> an optimization we could build it into the FlinkRunner.
>
> I'd also consider if we could add an easier way for the user to apply
> custom optimization code, apart from forking the FlinkRunner.
>
> * Lock-In
>
> > Is it possible that in the near future, most of Beam's capabilities
> would favor Google's Dataflow API?
>
> I think that was true for the predecessor of Beam which was built for
> Google Cloud Dataflow, although even then there were different runtimes
> within Google, i.e. FlumeJava (batch) and MillWheel (streaming).
>
> The idea of Beam is to build a framework that works in the open-source
> as well as in proprietary Runners. As with any Apache project, there are
> different interests within the project. A healthy community will ensure
> that the interests are well-balanced. The Apache development model also
> has the advantage that small parties cannot be simply overruled.
>

+1


> * Language portability
>
> > "one of the things Beam has focused on was a language portability
> framework"  Sure but how important is this for a typical user? Do people
> stop using a particular tool because it is in an X language? I
>
> It is very important to some people. So important that they wouldn't use
> a system which does not offer it. Possible reasons: crucial libraries
> only available in Python, users that refuse to use Java.
>

For many use cases, the cost of retraining data analysts, software
engineers, data scientists, ... to use a language they are unfamiliar with is
a much greater cost (not just salaries but delays in project completion)
than the cost of the hardware that the jobs run on. Once the cost of the
jobs are significant, paying to optimize it via a different implementation,
performance tuning, ... becomes worthwhile.


>
> Cheers,
> Max
>
> On 02.05.19 13:44, Robert Bradshaw wrote:
> > Correct, there's no out of the box way to do this. As mentioned, this
> > would also result in non-portable pipelines. However, even the
> > portability framework is set up such that runners can recognize
> > particular transforms and provide their own implementations thereof
> > (which is how translations are done for ParDo, GroupByKey, etc.) and
> > it is encouraged that runners do this for composite operations they
> > can do better on (e.g. I know Flink maps Reshard directly to
> > Redistribute rather than using the generic pair-with-random-key
> > implementation).
> >
> > If you really want to do this for MyFancyFlinkOperator, the current
> > solution is to adapt/extend FlinkRunner (possibly forking code) to
> > understand this operation and its substitution.
> >
> >
> > On Thu, May 2, 2019 at 11:09 AM Jan Lukavský  wrote:
> >>
> >> Just to clarify - the code I posted is just a proposal, it is not
> >> actually possible currently.
> >>
> >> On 5/2/19 11:05 AM, Jan Lukavský wrote:
> >>> Hi,
> >>>
> >>> I'd say that what Pankaj meant could be rephrased as "What if I want
> >>> to manually tune or tweak my Pipeline for specific runner? Do I have
> >>> any options for that?". As I understand it, currently the answer is,
> >>> no, PTransforms are somewhat hardwired into runners and the way they
> >>> expand cannot be controlled or tuned. But, that could be changed,
> >>> maybe something like this would make it possible:
> >>>
> >>> PCollection<...> in = ...;
> >>> in.apply(new MyFancyFlinkOperator());
> >>>
> >>> // ...
> >>>
> >>>
> in.getPipeline().getOptions().as(FlinkPipelineOptions.class).setTransformOverride(MyFancyFlinkOperator.class,
> >>> new MyFancyFlinkOperatorExpander());
> >>>
> >>>
> >>> The `expander` could have access to full Flink API (through the
> >>> runner) and that way any transform could be overridden or customized
> >>> for specific runtime conditions. Of course, this has a downside: you
> >>> end up with a non-portable pipeline (also, this is probably in
> >>> conflict with the general portability framework). But, as I see it,
> >>> usually you would need to override only a very small part of your
> >>> pipeline. So, say, 90% of the pipeline would be

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-02 Thread Maximilian Michels

Couple of comments:

* Flink transforms

It wouldn't be hard to add a way to run arbitrary Flink operators 
through the Beam API. Like you said, once you go down that road, you 
lose the ability to run the pipeline on a different Runner. And that's
precisely one of the selling points of Beam. I'm afraid once you even 
allow 1% non-portable pipelines, you have lost it all.


Now, it would be a different story if we had a runner-agnostic way of 
running Flink operators on top of Beam. For a subset of the Flink 
transformations that might actually be possible. I'm not sure if it's 
feasible for Beam to depend on the Flink API.


* Pipeline Tuning

There are fewer bells and whistles in the Beam API than there are in
Flink's. I'd consider that a feature. As Robert pointed out, the Runner
can make any optimizations that it wants to do. If you have an idea for
an optimization we could build it into the FlinkRunner.


I'd also consider if we could add an easier way for the user to apply
custom optimization code, apart from forking the FlinkRunner.


* Lock-In

Is it possible that in the near future, most of Beam's capabilities would favor Google's Dataflow API? 


I think that was true for the predecessor of Beam which was built for 
Google Cloud Dataflow, although even then there were different runtimes 
within Google, i.e. FlumeJava (batch) and MillWheel (streaming).


The idea of Beam is to build a framework that works in the open-source 
as well as in proprietary Runners. As with any Apache project, there are 
different interests within the project. A healthy community will ensure 
that the interests are well-balanced. The Apache development model also 
has the advantage that small parties cannot be simply overruled.


* Language portability


"one of the things Beam has focused on was a language portability framework"  
Sure but how important is this for a typical user? Do people stop using a particular tool 
because it is in an X language? I


It is very important to some people. So important that they wouldn't use 
a system which does not offer it. Possible reasons: crucial libraries 
only available in Python, users that refuse to use Java.



Cheers,
Max

On 02.05.19 13:44, Robert Bradshaw wrote:

Correct, there's no out of the box way to do this. As mentioned, this
would also result in non-portable pipelines. However, even the
portability framework is set up such that runners can recognize
particular transforms and provide their own implementations thereof
(which is how translations are done for ParDo, GroupByKey, etc.) and
it is encouraged that runners do this for composite operations they
can do better on (e.g. I know Flink maps Reshard directly to
Redistribute rather than using the generic pair-with-random-key
implementation).

If you really want to do this for MyFancyFlinkOperator, the current
solution is to adapt/extend FlinkRunner (possibly forking code) to
understand this operation and its substitution.


On Thu, May 2, 2019 at 11:09 AM Jan Lukavský  wrote:


Just to clarify - the code I posted is just a proposal, it is not
actually possible currently.

On 5/2/19 11:05 AM, Jan Lukavský wrote:

Hi,

I'd say that what Pankaj meant could be rephrased as "What if I want
to manually tune or tweak my Pipeline for specific runner? Do I have
any options for that?". As I understand it, currently the answer is,
no, PTransforms are somewhat hardwired into runners and the way they
expand cannot be controlled or tuned. But, that could be changed,
maybe something like this would make it possible:

PCollection<...> in = ...;
in.apply(new MyFancyFlinkOperator());

// ...

in.getPipeline().getOptions().as(FlinkPipelineOptions.class).setTransformOverride(MyFancyFlinkOperator.class,
new MyFancyFlinkOperatorExpander());


The `expander` could have access to full Flink API (through the
runner) and that way any transform could be overridden or customized
for specific runtime conditions. Of course, this has a downside: you
end up with a non-portable pipeline (also, this is probably in
conflict with the general portability framework). But, as I see it,
usually you would need to override only a very small part of your
pipeline. So, say, 90% of the pipeline would be portable, and in order to
port it to a different runner, you would need to implement only a small
runner-specific part.

Jan

On 5/2/19 9:45 AM, Robert Bradshaw wrote:

On Thu, May 2, 2019 at 12:13 AM Pankaj Chand
 wrote:

I thought by choosing Beam, I would get the benefits of both (Spark
and Flink), one at a time. Now, I'm understanding that I might not
get the full potential from either of the two.

You get the benefit of being able to choose, without rewriting your
pipeline, whether to run on Spark or Flink. Or the next new runner
that comes around. As well as the Beam model, API, etc.


Example: If I use Beam with Flink, and then a new feature is added
to Flink but I cannot access it via Beam, and that feature is not
important to the Beam community, then what is th

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-02 Thread Jan Lukavský

Hi Robert,

yes, that is correct. What I'm suggesting is a discussion of whether Beam
should or should not support this type of pipeline customization without
hard-forking code. But maybe this discussion should proceed in dev@.


Jan

On 5/2/19 1:44 PM, Robert Bradshaw wrote:

Correct, there's no out of the box way to do this. As mentioned, this
would also result in non-portable pipelines. However, even the
portability framework is set up such that runners can recognize
particular transforms and provide their own implementations thereof
(which is how translations are done for ParDo, GroupByKey, etc.) and
it is encouraged that runners do this for composite operations they
can do better on (e.g. I know Flink maps Reshard directly to
Redistribute rather than using the generic pair-with-random-key
implementation).

If you really want to do this for MyFancyFlinkOperator, the current
solution is to adapt/extend FlinkRunner (possibly forking code) to
understand this operation and its substitution.


On Thu, May 2, 2019 at 11:09 AM Jan Lukavský  wrote:

Just to clarify - the code I posted is just a proposal, it is not
actually possible currently.

On 5/2/19 11:05 AM, Jan Lukavský wrote:

Hi,

I'd say that what Pankaj meant could be rephrased as "What if I want
to manually tune or tweak my Pipeline for specific runner? Do I have
any options for that?". As I understand it, currently the answer is,
no, PTransforms are somewhat hardwired into runners and the way they
expand cannot be controlled or tuned. But, that could be changed,
maybe something like this would make it possible:

PCollection<...> in = ...;
in.apply(new MyFancyFlinkOperator());

// ...

in.getPipeline().getOptions().as(FlinkPipelineOptions.class).setTransformOverride(MyFancyFlinkOperator.class,
new MyFancyFlinkOperatorExpander());


The `expander` could have access to full Flink API (through the
runner) and that way any transform could be overridden or customized
for specific runtime conditions. Of course, this has a downside: you
end up with a non-portable pipeline (also, this is probably in
conflict with the general portability framework). But, as I see it,
usually you would need to override only a very small part of your
pipeline. So, say, 90% of the pipeline would be portable, and in order to
port it to a different runner, you would need to implement only a small
runner-specific part.

Jan

On 5/2/19 9:45 AM, Robert Bradshaw wrote:

On Thu, May 2, 2019 at 12:13 AM Pankaj Chand
 wrote:

I thought by choosing Beam, I would get the benefits of both (Spark
and Flink), one at a time. Now, I'm understanding that I might not
get the full potential from either of the two.

You get the benefit of being able to choose, without rewriting your
pipeline, whether to run on Spark or Flink. Or the next new runner
that comes around. As well as the Beam model, API, etc.


Example: If I use Beam with Flink, and then a new feature is added
to Flink but I cannot access it via Beam, and that feature is not
important to the Beam community, then what is the suggested
workaround? If I really need that feature, I would not want to
re-write my pipeline in Flink from scratch.

How is this different from "If I used Spark, and a new feature is
added to Flink, I cannot access it from Spark. If that feature is not
important to the Spark community, then what is the suggested
workaround? If I really need that feature, I would not want to
re-write my pipeline in Flink from scratch." Or vice-versa. Or when a
new feature is added to Beam itself (that may not be present in any
of the underlying systems). Beam's feature set is neither the
intersection nor union of the feature sets of those runners it has
available as execution engines. (Even the notion of what is meant by
"feature" is nebulous enough that it's hard to make this discussion
concrete.)


Is it possible that in the near future, most of Beam's capabilities
would favor Google's Dataflow API? That way, Beam could be used to
lure developers and organizations who would typically use
Spark/Flink, with the promise of portability. After they get
dependent on Beam and cannot afford to re-write their pipelines in
Spark/Flink from scratch, they would realize that Beam does not give
access to some of the capabilities of the free engines that they may
require. Then, they would be told that if they want all possible
capabilities and would want to use their code in Beam, they could
pay for the Dataflow engine instead.

Google is very upfront about the fact that they are selling a service
to run Beam pipelines in a completely managed way. But Google has
*also* invested very heavily in making sure that the portability story
is not just talk, for those who need or want to run their jobs on
premise or elsewhere (now or in the future). It is our goal that all
runners be able to run all pipelines, and this is a community effort.


Pankaj

On Tue, Apr 30, 2019 at 6:15 PM Kenneth Knowles 
wrote:

It is worth noting that Beam isn't solely a portability layer tha

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-02 Thread Robert Bradshaw
Correct, there's no out of the box way to do this. As mentioned, this
would also result in non-portable pipelines. However, even the
portability framework is set up such that runners can recognize
particular transforms and provide their own implementations thereof
(which is how translations are done for ParDo, GroupByKey, etc.) and
it is encouraged that runners do this for composite operations they
can do better on (e.g. I know Flink maps Reshard directly to
Redistribute rather than using the generic pair-with-random-key
implementation).

If you really want to do this for MyFancyFlinkOperator, the current
solution is to adapt/extend FlinkRunner (possibly forking code) to
understand this operation and its substitution.
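To make the "generic pair-with-random-key implementation" concrete, here is a
rough, illustrative sketch of how such a redistribute can be expressed purely
in terms of portable Beam primitives. This is not Beam's actual Reshuffle
code, and it glosses over windowing and trigger details; a runner that
recognizes the composite is free to swap in a native redistribute instead.

import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

/** Illustrative only: redistribute elements by pairing them with random keys. */
class NaiveRedistribute extends PTransform<PCollection<String>, PCollection<String>> {
  @Override
  public PCollection<String> expand(PCollection<String> input) {
    return input
        // 1. Pair each element with a random key to spread elements across workers.
        .apply("PairWithRandomKey", ParDo.of(new DoFn<String, KV<Integer, String>>() {
          @ProcessElement
          public void process(@Element String element, OutputReceiver<KV<Integer, String>> out) {
            out.output(KV.of(ThreadLocalRandom.current().nextInt(1024), element));
          }
        }))
        // 2. Group by that key, which forces a shuffle / redistribution.
        .apply(GroupByKey.create())
        // 3. Drop the keys and re-emit the elements.
        .apply("DropKeys", ParDo.of(new DoFn<KV<Integer, Iterable<String>>, String>() {
          @ProcessElement
          public void process(@Element KV<Integer, Iterable<String>> kv, OutputReceiver<String> out) {
            for (String element : kv.getValue()) {
              out.output(element);
            }
          }
        }));
  }
}

A runner that recognizes this composite (or Beam's real Reshuffle) can
translate it straight to a native redistribute and skip the key juggling
entirely.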


On Thu, May 2, 2019 at 11:09 AM Jan Lukavský  wrote:
>
> Just to clarify - the code I posted is just a proposal, it is not
> actually possible currently.
>
> On 5/2/19 11:05 AM, Jan Lukavský wrote:
> > Hi,
> >
> > I'd say that what Pankaj meant could be rephrased as "What if I want
> > to manually tune or tweak my Pipeline for specific runner? Do I have
> > any options for that?". As I understand it, currently the answer is,
> > no, PTransforms are somewhat hardwired into runners and the way they
> > expand cannot be controlled or tuned. But, that could be changed,
> > maybe something like this would make it possible:
> >
> > PCollection<...> in = ...;
> > in.apply(new MyFancyFlinkOperator());
> >
> > // ...
> >
> > in.getPipeline().getOptions().as(FlinkPipelineOptions.class).setTransformOverride(MyFancyFlinkOperator.class,
> > new MyFancyFlinkOperatorExpander());
> >
> >
> > The `expander` could have access to full Flink API (through the
> > runner) and that way any transform could be overridden or customized
> > for specific runtime conditions. Of course, this has a downside: you
> > end up with a non-portable pipeline (also, this is probably in
> > conflict with the general portability framework). But, as I see it,
> > usually you would need to override only a very small part of your
> > pipeline. So, say, 90% of the pipeline would be portable, and in order to
> > port it to a different runner, you would need to implement only a small
> > runner-specific part.
> >
> > Jan
> >
> > On 5/2/19 9:45 AM, Robert Bradshaw wrote:
> >> On Thu, May 2, 2019 at 12:13 AM Pankaj Chand
> >>  wrote:
> >>> I thought by choosing Beam, I would get the benefits of both (Spark
> >>> and Flink), one at a time. Now, I'm understanding that I might not
> >>> get the full potential from either of the two.
> >> You get the benefit of being able to choose, without rewriting your
> >> pipeline, whether to run on Spark or Flink. Or the next new runner
> >> that comes around. As well as the Beam model, API, etc.
> >>
> >>> Example: If I use Beam with Flink, and then a new feature is added
> >>> to Flink but I cannot access it via Beam, and that feature is not
> >>> important to the Beam community, then what is the suggested
> >>> workaround? If I really need that feature, I would not want to
> >>> re-write my pipeline in Flink from scratch.
> >> How is this different from "If I used Spark, and a new feature is
> >> added to Flink, I cannot access it from Spark. If that feature is not
> >> important to the Spark community, then what is the suggested
> >> workaround? If I really need that feature, I would not want to
> >> re-write my pipeline in Flink from scratch." Or vice-versa. Or when a
> >> new feature is added to Beam itself (that may not be present in any
> >> of the underlying systems). Beam's feature set is neither the
> >> intersection nor union of the feature sets of those runners it has
> >> available as execution engines. (Even the notion of what is meant by
> >> "feature" is nebulous enough that it's hard to make this discussion
> >> concrete.)
> >>
> >>> Is it possible that in the near future, most of Beam's capabilities
> >>> would favor Google's Dataflow API? That way, Beam could be used to
> >>> lure developers and organizations who would typically use
> >>> Spark/Flink, with the promise of portability. After they get
> >>> dependent on Beam and cannot afford to re-write their pipelines in
> >>> Spark/Flink from scratch, they would realize that Beam does not give
> >>> access to some of the capabilities of the free engines that they may
> >>> require. Then, they would be told that if they want all possible
> >>> capabilities and would want to use their code in Beam, they could
> >>> pay for the Dataflow engine instead.
> >> Google is very upfront about the fact that they are selling a service
> >> to run Beam pipelines in a completely managed way. But Google has
> >> *also* invested very heavily in making sure that the portability story
> >> is not just talk, for those who need or want to run their jobs on
> >> premise or elsewhere (now or in the future). It is our goal that all
> >> runners be able to run all pipelines, and this is a community effort.
> >>
> >>> Pankaj
> >>>
> >>> On Tue, Apr 30, 2019 at 6:15 PM Kennet

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-02 Thread Robert Bradshaw
Sorry if I wasn't clear; I meant in the sense that A is neither the
intersection nor the union of B and C in the Venn diagram below. As Max says,
what really matters is the set of features you care about.



[image: venn.png]
https://www.benfrederickson.com/images/venn.png


On Thu, May 2, 2019 at 11:04 AM kant kodali  wrote:
>
> I don't understand the following statement.
>
> "Beam's feature set is neither the intersection nor union of the feature
sets of those runners it has available as execution engines."
>
> If Beam is neither intersection nor union how can Beam abstract anything?
Generally speaking, if there is nothing in common what is there to
abstract? Sounds like the 101 principles I had learned are broken or I must
have completely misunderstood.
>
>
>
>
>
> On Thu, May 2, 2019 at 12:46 AM Robert Bradshaw 
wrote:
>>
>> On Thu, May 2, 2019 at 12:13 AM Pankaj Chand 
wrote:
>> >
>> > I thought by choosing Beam, I would get the benefits of both (Spark
and Flink), one at a time. Now, I'm understanding that I might not get the
full potential from either of the two.
>>
>> You get the benefit of being able to choose, without rewriting your
>> pipeline, whether to run on Spark or Flink. Or the next new runner
>> that comes around. As well as the Beam model, API, etc.
>>
>> > Example: If I use Beam with Flink, and then a new feature is added to
Flink but I cannot access it via Beam, and that feature is not important to
the Beam community, then what is the suggested workaround? If I really need
that feature, I would not want to re-write my pipeline in Flink from
scratch.
>>
>> How is this different from "If I used Spark, and a new feature is
>> added to Flink, I cannot access it from Spark. If that feature is not
>> important to the Spark community, then what is the suggested
>> workaround? If I really need that feature, I would not want to
>> re-write my pipeline in Flink from scratch." Or vice-versa. Or when a
>> new feature is added to Beam itself (that may not be present in any
>> of the underlying systems). Beam's feature set is neither the
>> intersection nor union of the feature sets of those runners it has
>> available as execution engines. (Even the notion of what is meant by
>> "feature" is nebulous enough that it's hard to make this discussion
>> concrete.)
>>
>> > Is it possible that in the near future, most of Beam's capabilities
would favor Google's Dataflow API? That way, Beam could be used to lure
developers and organizations who would typically use Spark/Flink, with the
promise of portability. After they get dependent on Beam and cannot afford
to re-write their pipelines in Spark/Flink from scratch, they would realize
that Beam does not give access to some of the capabilities of the free
engines that they may require. Then, they would be told that if they want
all possible capabilities and would want to use their code in Beam, they
could pay for the Dataflow engine instead.
>>
>> Google is very upfront about the fact that they are selling a service
>> to run Beam pipelines in a completely managed way. But Google has
>> *also* invested very heavily in making sure that the portability story
>> is not just talk, for those who need or want to run their jobs on
>> premise or elsewhere (now or in the future). It is our goal that all
>> runners be able to run all pipelines, and this is a community effort.
>>
>> > Pankaj
>> >
>> > On Tue, Apr 30, 2019 at 6:15 PM Kenneth Knowles 
wrote:
>> >>
>> >> It is worth noting that Beam isn't solely a portability layer that
exposes underlying API features, but a feature-rich layer in its own right,
with carefully coherent abstractions. For example, quite early on the
SparkRunner supported streaming aspects of the Beam model - watermarks,
windowing, triggers - that were not really available any other way. Beam's
various features sometimes require just a pass-through API and sometimes
require a clever new implementation. And everything is moving constantly. I
don't see Beam as following the features of any engine, but rather coming
up with new needed data processing abstractions and figuring out how to
efficiently implement them on top of various architectures.
>> >>
>> >> Kenn
>> >>
>> >> On Tue, Apr 30, 2019 at 8:37 AM kant kodali 
wrote:
>> >>>
>> >>> Staying behind doesn't imply one is better than the other and I
didn't mean that in any way but I fail to see how an abstraction framework
like Beam can stay ahead of the underlying execution engines?
>> >>>
>> >>> For example, If a new feature is added into the underlying execution
engine that doesn't fit the interface of Beam or breaks then I would think
the interface would need to be changed. Another example would say the
underlying execution engines take different kinds of parameters for the
same feature, then it isn't so straightforward to come up with an interface
since there might be very little in common in the first place so, in that
sense, I fail to see how Beam can stay ahead.
>> >>>
>> >>> "Of course t

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-02 Thread Jan Lukavský
Just to clarify - the code I posted is just a proposal, it is not 
actually possible currently.


On 5/2/19 11:05 AM, Jan Lukavský wrote:

Hi,

I'd say that what Pankaj meant could be rephrased as "What if I want 
to manually tune or tweak my Pipeline for specific runner? Do I have 
any options for that?". As I understand it, currently the answer is, 
no, PTransforms are somewhat hardwired into runners and the way they 
expand cannot be controlled or tuned. But, that could be changed, 
maybe something like this would make it possible:


PCollection<...> in = ...;
in.apply(new MyFancyFlinkOperator());

// ...

in.getPipeline().getOptions().as(FlinkPipelineOptions.class).setTransformOverride(MyFancyFlinkOperator.class, 
new MyFancyFlinkOperatorExpander());



The `expander` could have access to full Flink API (through the 
runner) and that way any transform could be overridden or customized 
for specific runtime conditions. Of course, this has a downside: you
end up with a non-portable pipeline (also, this is probably in
conflict with the general portability framework). But, as I see it,
usually you would need to override only a very small part of your
pipeline. So, say, 90% of the pipeline would be portable, and in order to
port it to a different runner, you would need to implement only a small
runner-specific part.


Jan

On 5/2/19 9:45 AM, Robert Bradshaw wrote:
On Thu, May 2, 2019 at 12:13 AM Pankaj Chand 
 wrote:
I thought by choosing Beam, I would get the benefits of both (Spark 
and Flink), one at a time. Now, I'm understanding that I might not 
get the full potential from either of the two.

You get the benefit of being able to choose, without rewriting your
pipeline, whether to run on Spark or Flink. Or the next new runner
that comes around. As well as the Beam model, API, etc.

Example: If I use Beam with Flink, and then a new feature is added 
to Flink but I cannot access it via Beam, and that feature is not 
important to the Beam community, then what is the suggested 
workaround? If I really need that feature, I would not want to 
re-write my pipeline in Flink from scratch.

How is this different from "If I used Spark, and a new feature is
added to Flink, I cannot access it from Spark. If that feature is not
important to the Spark community, then what is the suggested
workaround? If I really need that feature, I would not want to
re-write my pipeline in Flink from scratch." Or vice-versa. Or when a
new feature is added to Beam itself (that may not be present in any
of the underlying systems). Beam's feature set is neither the
intersection nor union of the feature sets of those runners it has
available as execution engines. (Even the notion of what is meant by
"feature" is nebulous enough that it's hard to make this discussion
concrete.)

Is it possible that in the near future, most of Beam's capabilities 
would favor Google's Dataflow API? That way, Beam could be used to 
lure developers and organizations who would typically use 
Spark/Flink, with the promise of portability. After they get 
dependent on Beam and cannot afford to re-write their pipelines in 
Spark/Flink from scratch, they would realize that Beam does not give 
access to some of the capabilities of the free engines that they may 
require. Then, they would be told that if they want all possible 
capabilities and would want to use their code in Beam, they could 
pay for the Dataflow engine instead.

Google is very upfront about the fact that they are selling a service
to run Beam pipelines in a completely managed way. But Google has
*also* invested very heavily in making sure that the portability story
is not just talk, for those who need or want to run their jobs on
premise or elsewhere (now or in the future). It is our goal that all
runners be able to run all pipelines, and this is a community effort.


Pankaj

On Tue, Apr 30, 2019 at 6:15 PM Kenneth Knowles  
wrote:
It is worth noting that Beam isn't solely a portability layer that 
exposes underlying API features, but a feature-rich layer in its 
own right, with carefully coherent abstractions. For example, quite 
early on the SparkRunner supported streaming aspects of the Beam 
model - watermarks, windowing, triggers - that were not really 
available any other way. Beam's various features sometimes require
just a pass-through API and sometimes require a clever new
implementation. And everything is moving constantly. I don't see 
Beam as following the features of any engine, but rather coming up 
with new needed data processing abstractions and figuring out how 
to efficiently implement them on top of various architectures.


Kenn

On Tue, Apr 30, 2019 at 8:37 AM kant kodali  
wrote:
Staying behind doesn't imply one is better than the other and I 
didn't mean that in any way but I fail to see how an abstraction 
framework like Beam can stay ahead of the underlying execution 
engines?


For example, If a new feature is added into the underlying 
execution engine that doesn't fit the interface o

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-02 Thread kant kodali
I don't understand the following statement.

"Beam's feature set is neither the intersection nor union of the feature
sets of those runners it has available as execution engines."

If Beam is neither intersection nor union how can Beam abstract anything?
Generally speaking, if there is nothing in common what is there to
abstract? Sounds like the 101 principles I had learned are broken or I must
have completely misunderstood.





On Thu, May 2, 2019 at 12:46 AM Robert Bradshaw  wrote:

> On Thu, May 2, 2019 at 12:13 AM Pankaj Chand 
> wrote:
> >
> > I thought by choosing Beam, I would get the benefits of both (Spark and
> Flink), one at a time. Now, I'm understanding that I might not get the full
> potential from either of the two.
>
> You get the benefit of being able to choose, without rewriting your
> pipeline, whether to run on Spark or Flink. Or the next new runner
> that comes around. As well as the Beam model, API, etc.
>
> > Example: If I use Beam with Flink, and then a new feature is added to
> Flink but I cannot access it via Beam, and that feature is not important to
> the Beam community, then what is the suggested workaround? If I really need
> that feature, I would not want to re-write my pipeline in Flink from
> scratch.
>
> How is this different from "If I used Spark, and a new feature is
> added to Flink, I cannot access it from Spark. If that feature is not
> important to the Spark community, then what is the suggested
> workaround? If I really need that feature, I would not want to
> re-write my pipeline in Flink from scratch." Or vice-versa. Or when a
> new feature is added to Beam itself (that may not be present in any
> of the underlying systems). Beam's feature set is neither the
> intersection nor union of the feature sets of those runners it has
> available as execution engines. (Even the notion of what is meant by
> "feature" is nebulous enough that it's hard to make this discussion
> concrete.)
>
> > Is it possible that in the near future, most of Beam's capabilities
> would favor Google's Dataflow API? That way, Beam could be used to lure
> developers and organizations who would typically use Spark/Flink, with the
> promise of portability. After they get dependent on Beam and cannot afford
> to re-write their pipelines in Spark/Flink from scratch, they would realize
> that Beam does not give access to some of the capabilities of the free
> engines that they may require. Then, they would be told that if they want
> all possible capabilities and would want to use their code in Beam, they
> could pay for the Dataflow engine instead.
>
> Google is very upfront about the fact that they are selling a service
> to run Beam pipelines in a completely managed way. But Google has
> *also* invested very heavily in making sure that the portability story
> is not just talk, for those who need or want to run their jobs on
> premise or elsewhere (now or in the future). It is our goal that all
> runners be able to run all pipelines, and this is a community effort.
>
> > Pankaj
> >
> > On Tue, Apr 30, 2019 at 6:15 PM Kenneth Knowles  wrote:
> >>
> >> It is worth noting that Beam isn't solely a portability layer that
> exposes underlying API features, but a feature-rich layer in its own right,
> with carefully coherent abstractions. For example, quite early on the
> SparkRunner supported streaming aspects of the Beam model - watermarks,
> windowing, triggers - that were not really available any other way. Beam's
> various features sometimes require just a pass-through API and sometimes
> require a clever new implementation. And everything is moving constantly. I
> don't see Beam as following the features of any engine, but rather coming
> up with new needed data processing abstractions and figuring out how to
> efficiently implement them on top of various architectures.
> >>
> >> Kenn
> >>
> >> On Tue, Apr 30, 2019 at 8:37 AM kant kodali  wrote:
> >>>
> >>> Staying behind doesn't imply one is better than the other and I didn't
> mean that in any way but I fail to see how an abstraction framework like
> Beam can stay ahead of the underlying execution engines?
> >>>
> >>> For example, If a new feature is added into the underlying execution
> engine that doesn't fit the interface of Beam or breaks then I would think
> the interface would need to be changed. Another example would say the
> underlying execution engines take different kinds of parameters for the
> same feature, then it isn't so straightforward to come up with an interface
> since there might be very little in common in the first place so, in that
> sense, I fail to see how Beam can stay ahead.
> >>>
> >>> "Of course the API itself is Spark-specific, but it borrows heavily
> (among other things) on ideas that Beam itself pioneered long before Spark
> 2.0" Good to know.
> >>>
> >>> "one of the things Beam has focused on was a language portability
> framework"  Sure but how important is this for a typical user? Do people
> stop using a pa

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-02 Thread Jan Lukavský

Hi,

I'd say that what Pankaj meant could be rephrased as "What if I want to 
manually tune or tweak my Pipeline for specific runner? Do I have any 
options for that?". As I understand it, currently the answer is, no, 
PTransforms are somewhat hardwired into runners and the way they expand 
cannot be controlled or tuned. But, that could be changed, maybe 
something like this would make it possible:


PCollection<...> in = ...;
in.apply(new MyFancyFlinkOperator());

// ...

in.getPipeline().getOptions().as(FlinkPipelineOptions.class).setTransformOverride(MyFancyFlinkOperator.class, 
new MyFancyFlinkOperatorExpander());



The `expander` could have access to full Flink API (through the runner) 
and that way any transform could be overridden or customized for 
specific runtime conditions. Of course, this has a downside: you end
up with a non-portable pipeline (also, this is probably in conflict with
the general portability framework). But, as I see it, you would usually need
to override only a very small part of your pipeline. So, say, 90% of the
pipeline would be portable, and in order to port it to a different runner,
you would need to implement only a small runner-specific part.


Jan
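A footnote to Jan's proposal: without the setTransformOverride hook above
(which, again, does not exist today), the nearest thing one can do is keep the
runner-specific piece behind its own composite PTransform with a portable
default expansion, and let a customized or forked runner recognize that
composite and substitute a native operator, as Robert describes elsewhere in
this thread. A purely illustrative sketch; MyFancyOperation and its expansion
are made up for the example.

import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

/**
 * Hypothetical composite transform. The expand() below is the portable
 * default; a forked or customized FlinkRunner could recognize this class
 * and substitute a native Flink operator instead.
 */
class MyFancyOperation extends PTransform<PCollection<String>, PCollection<String>> {
  @Override
  public PCollection<String> expand(PCollection<String> input) {
    // Portable fallback implementation using ordinary Beam primitives.
    return input.apply(
        MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()));
  }
}

That way the 90% of the pipeline that only ever uses the portable default
stays runner-agnostic; only the substitution logic lives in the runner fork.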

On 5/2/19 9:45 AM, Robert Bradshaw wrote:

On Thu, May 2, 2019 at 12:13 AM Pankaj Chand  wrote:

I thought by choosing Beam, I would get the benefits of both (Spark and Flink), 
one at a time. Now, I'm understanding that I might not get the full potential 
from either of the two.

You get the benefit of being able to choose, without rewriting your
pipeline, whether to run on Spark or Flink. Or the next new runner
that comes around. As well as the Beam model, API, etc.


Example: If I use Beam with Flink, and then a new feature is added to Flink but 
I cannot access it via Beam, and that feature is not important to the Beam 
community, then what is the suggested workaround? If I really need that 
feature, I would not want to re-write my pipeline in Flink from scratch.

How is this different from "If I used Spark, and a new feature is
added to Flink, I cannot access it from Spark. If that feature is not
important to the Spark community, then what is the suggested
workaround? If I really need that feature, I would not want to
re-write my pipeline in Flink from scratch." Or vice-versa. Or when a
new feature is added to Beam itself (that may not be present in any
of the underlying systems). Beam's feature set is neither the
intersection nor union of the feature sets of those runners it has
available as execution engines. (Even the notion of what is meant by
"feature" is nebulous enough that it's hard to make this discussion
concrete.)


Is it possible that in the near future, most of Beam's capabilities would favor 
Google's Dataflow API? That way, Beam could be used to lure developers and 
organizations who would typically use Spark/Flink, with the promise of 
portability. After they get dependent on Beam and cannot afford to re-write 
their pipelines in Spark/Flink from scratch, they would realize that Beam does 
not give access to some of the capabilities of the free engines that they may 
require. Then, they would be told that if they want all possible capabilities 
and would want to use their code in Beam, they could pay for the Dataflow 
engine instead.

Google is very upfront about the fact that they are selling a service
to run Beam pipelines in a completely managed way. But Google has
*also* invested very heavily in making sure that the portability story
is not just talk, for those who need or want to run their jobs on
premise or elsewhere (now or in the future). It is our goal that all
runners be able to run all pipelines, and this is a community effort.


Pankaj

On Tue, Apr 30, 2019 at 6:15 PM Kenneth Knowles  wrote:

It is worth noting that Beam isn't solely a portability layer that exposes 
underlying API features, but a feature-rich layer in its own right, with 
carefully coherent abstractions. For example, quite early on the SparkRunner 
supported streaming aspects of the Beam model - watermarks, windowing, triggers 
- that were not really available any other way. Beam's various features 
sometimes require just a pass-through API and sometimes require a clever new
implementation. And everything is moving constantly. I don't see Beam as 
following the features of any engine, but rather coming up with new needed data 
processing abstractions and figuring out how to efficiently implement them on 
top of various architectures.

Kenn

On Tue, Apr 30, 2019 at 8:37 AM kant kodali  wrote:

Staying behind doesn't imply one is better than the other and I didn't mean 
that in any way but I fail to see how an abstraction framework like Beam can 
stay ahead of the underlying execution engines?

For example, If a new feature is added into the underlying execution engine 
that doesn't fit the interface of Beam or breaks then I would think the 
interface would need to be changed. Another example would say the underlying 
execution engines take 

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-02 Thread Robert Bradshaw
On Thu, May 2, 2019 at 12:13 AM Pankaj Chand  wrote:
>
> I thought by choosing Beam, I would get the benefits of both (Spark and 
> Flink), one at a time. Now, I'm understanding that I might not get the full 
> potential from either of the two.

You get the benefit of being able to choose, without rewriting your
pipeline, whether to run on Spark or Flink. Or the next new runner
that comes around. As well as the Beam model, API, etc.
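For what it's worth, "without rewriting your pipeline" is quite literal: the
runner is just a pipeline option. A minimal, illustrative sketch follows; the
file paths are placeholders, and while the --runner flag and
PipelineOptionsFactory are standard Beam, everything else here is invented for
the example.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunAnywhere {
  public static void main(String[] args) {
    // Launch the same binary with --runner=SparkRunner, --runner=FlinkRunner,
    // or --runner=DataflowRunner (plus that runner's own options); the
    // pipeline code below does not change.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("/tmp/input/*.txt"))      // placeholder path
     .apply("WriteLines", TextIO.write().to("/tmp/output/result"));   // placeholder path

    p.run().waitUntilFinish();
  }
}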

> Example: If I use Beam with Flink, and then a new feature is added to Flink 
> but I cannot access it via Beam, and that feature is not important to the 
> Beam community, then what is the suggested workaround? If I really need that 
> feature, I would not want to re-write my pipeline in Flink from scratch.

How is this different from "If I used Spark, and a new feature is
added to Flink, I cannot access it from Spark. If that feature is not
important to the Spark community, then what is the suggested
workaround? If I really need that feature, I would not want to
re-write my pipeline in Flink from scratch." Or vice-versa. Or when a
> new feature is added to Beam itself (that may not be present in any
of the underlying systems). Beam's feature set is neither the
intersection nor union of the feature sets of those runners it has
available as execution engines. (Even the notion of what is meant by
"feature" is nebulous enough that it's hard to make this discussion
concrete.)

> Is it possible that in the near future, most of Beam's capabilities would 
> favor Google's Dataflow API? That way, Beam could be used to lure developers 
> and organizations who would typically use Spark/Flink, with the promise of 
> portability. After they get dependent on Beam and cannot afford to re-write 
> their pipelines in Spark/Flink from scratch, they would realize that Beam 
> does not give access to some of the capabilities of the free engines that 
> they may require. Then, they would be told that if they want all possible 
> capabilities and would want to use their code in Beam, they could pay for the 
> Dataflow engine instead.

Google is very upfront about the fact that they are selling a service
to run Beam pipelines in a completely managed way. But Google has
*also* invested very heavily in making sure that the portability story
is not just talk, for those who need or want to run their jobs on
premise or elsewhere (now or in the future). It is our goal that all
runners be able to run all pipelines, and this is a community effort.

> Pankaj
>
> On Tue, Apr 30, 2019 at 6:15 PM Kenneth Knowles  wrote:
>>
>> It is worth noting that Beam isn't solely a portability layer that exposes 
>> underlying API features, but a feature-rich layer in its own right, with 
>> carefully coherent abstractions. For example, quite early on the SparkRunner 
>> supported streaming aspects of the Beam model - watermarks, windowing, 
>> triggers - that were not really available any other way. Beam's various 
>> features sometimes require just a pass-through API and sometimes require a
>> clever new implementation. And everything is moving constantly. I don't see 
>> Beam as following the features of any engine, but rather coming up with new 
>> needed data processing abstractions and figuring out how to efficiently 
>> implement them on top of various architectures.
>>
>> Kenn
>>
>> On Tue, Apr 30, 2019 at 8:37 AM kant kodali  wrote:
>>>
>>> Staying behind doesn't imply one is better than the other and I didn't mean 
>>> that in any way but I fail to see how an abstraction framework like Beam 
>>> can stay ahead of the underlying execution engines?
>>>
>>> For example, If a new feature is added into the underlying execution engine 
>>> that doesn't fit the interface of Beam or breaks then I would think the 
>>> interface would need to be changed. Another example would say the 
>>> underlying execution engines take different kinds of parameters for the
>>> same feature, then it isn't so straightforward to come up with an interface
>>> since there might be very little in common in the first place so, in that 
>>> sense, I fail to see how Beam can stay ahead.
>>>
>>> "Of course the API itself is Spark-specific, but it borrows heavily (among 
>>> other things) on ideas that Beam itself pioneered long before Spark 2.0" 
>>> Good to know.
>>>
>>> "one of the things Beam has focused on was a language portability 
>>> framework"  Sure but how important is this for a typical user? Do people 
>>> stop using a particular tool because it is in an X language? I personally 
>>> would put features first over language portability and it's completely fine 
>>> that may not be in line with Beam's priorities. All said I can agree Beam 
>>> focus on language portability is great.
>>>
>>> On Tue, Apr 30, 2019 at 2:48 AM Maximilian Michels  wrote:

 > I wouldn't say one is, or will always be, in front of or behind another.

 That's a great way to phrase it. I think it is very common to jump to
 the conclusion th

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-01 Thread Pankaj Chand
I thought by choosing Beam, I would get the benefits of both (Spark and
Flink), one at a time. Now, I'm understanding that I might not get the full
potential from either of the two.

Example: If I use Beam with Flink, and then a new feature is added to Flink
but I cannot access it via Beam, and that feature is not important to the
Beam community, then what is the suggested workaround? If I really need
that feature, I would not want to re-write my pipeline in Flink from
scratch.

Is it possible that in the near future, most of Beam's capabilities would
favor Google's Dataflow API? That way, Beam could be used to lure
developers and organizations who would typically use Spark/Flink, with the
promise of portability. After they get dependent on Beam and cannot afford
to re-write their pipelines in Spark/Flink from scratch, they would realize
that Beam does not give access to some of the capabilities of the free
engines that they may require. Then, they would be told that if they want
all possible capabilities and would want to use their code in Beam, they
could pay for the Dataflow engine instead.

Pankaj

On Tue, Apr 30, 2019 at 6:15 PM Kenneth Knowles  wrote:

> It is worth noting that Beam isn't solely a portability layer that exposes
> underlying API features, but a feature-rich layer in its own right, with
> carefully coherent abstractions. For example, quite early on the
> SparkRunner supported streaming aspects of the Beam model - watermarks,
> windowing, triggers - that were not really available any other way. Beam's
> various features sometimes require just a pass-through API and sometimes
> require a clever new implementation. And everything is moving constantly. I
> don't see Beam as following the features of any engine, but rather coming
> up with new needed data processing abstractions and figuring out how to
> efficiently implement them on top of various architectures.
>
> Kenn
>
> On Tue, Apr 30, 2019 at 8:37 AM kant kodali  wrote:
>
>> Staying behind doesn't imply one is better than the other and I didn't
>> mean that in any way but I fail to see how an abstraction framework like
>> Beam can stay ahead of the underlying execution engines?
>>
>> For example, If a new feature is added into the underlying execution
>> engine that doesn't fit the interface of Beam or breaks then I would think
>> the interface would need to be changed. Another example would say the
>> underlying execution engines take different kinds of parameters for the
>> same feature, then it isn't so straightforward to come up with an interface
>> since there might be very little in common in the first place so, in that
>> sense, I fail to see how Beam can stay ahead.
>>
>> "Of course the API itself is Spark-specific, but it borrows heavily
>> (among other things) on ideas that Beam itself pioneered long before Spark
>> 2.0" Good to know.
>>
>> "one of the things Beam has focused on was a language portability
>> framework"  Sure but how important is this for a typical user? Do people
>> stop using a particular tool because it is in an X language? I personally
>> would put features first over language portability and it's completely fine
>> that may not be in line with Beam's priorities. All said I can agree Beam
>> focus on language portability is great.
>>
>> On Tue, Apr 30, 2019 at 2:48 AM Maximilian Michels 
>> wrote:
>>
>>> > I wouldn't say one is, or will always be, in front of or behind
>>> another.
>>>
>>> That's a great way to phrase it. I think it is very common to jump to
>>> the conclusion that one system is better than the other. In reality it's
>>> often much more complicated.
>>>
>>> For example, one of the things Beam has focused on was a language
>>> portability framework. Do I get this with Flink? No. Does that mean Beam
>>> is better than Flink? No. Maybe a better question would be, do I want to
>>> be able to run Python pipelines?
>>>
>>> This is just an example; there are many more factors to consider.
>>>
>>> Cheers,
>>> Max
>>>
>>> On 30.04.19 10:59, Robert Bradshaw wrote:
>>> > Though we all certainly have our biases, I think it's fair to say that
>>> > all of these systems are constantly innovating, borrowing ideas from
>>> > one another, and have their strengths and weaknesses. I wouldn't say
>>> > one is, or will always be, in front of or behind another.
>>> >
>>> > Take, as the given example Spark Structured Streaming. Of course the
>>> > API itself is spark-specific, but it borrows heavily (among other
>>> > things) on ideas that Beam itself pioneered long before Spark 2.0,
>>> > specifically the unification of batch and streaming processing into a
>>> > single API, and the event-time based windowing (triggering) model for
>>> > consistently and correctly handling distributed, out-of-order data
>>> > streams.
>>> >
>>> > Of course there are also operational differences. Spark, for example,
>>> > is very tied to the micro-batch style of execution whereas Flink is
>>> > fundamentally very continuous, and Beam delegates to the underlying runner.

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-04-30 Thread Kenneth Knowles
It is worth noting that Beam isn't solely a portability layer that exposes
underlying API features, but a feature-rich layer in its own right, with
carefully coherent abstractions. For example, quite early on the
SparkRunner supported streaming aspects of the Beam model - watermarks,
windowing, triggers - that were not really available any other way. Beam's
various features sometimes require just a pass-through API and sometimes
require a clever new implementation. And everything is moving constantly. I
don't see Beam as following the features of any engine, but rather coming
up with new needed data processing abstractions and figuring out how to
efficiently implement them on top of various architectures.
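
To make that concrete, here is a minimal sketch in the Python SDK (the "events"
collection and its event timestamps are assumed to exist) of the model features
mentioned above: event-time windows, a watermark-driven trigger that also fires
for late data, and an explicit lateness and accumulation policy.

import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterProcessingTime, AfterWatermark)


def windowed_sums(events):
    # "events" is assumed to be a timestamped PCollection of (key, value) pairs.
    return (events
            | "Window" >> beam.WindowInto(
                window.FixedWindows(60),                               # 1-minute event-time windows
                trigger=AfterWatermark(late=AfterProcessingTime(30)),  # fire at the watermark, then again for late data
                allowed_lateness=300,                                  # keep window state around 5 more minutes
                accumulation_mode=AccumulationMode.ACCUMULATING)
            | "SumPerKey" >> beam.CombinePerKey(sum))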

Kenn

On Tue, Apr 30, 2019 at 8:37 AM kant kodali  wrote:

> Staying behind doesn't imply one is better than the other, and I didn't
> mean that in any way, but I fail to see how an abstraction framework like
> Beam can stay ahead of the underlying execution engines.
>
> For example, if a new feature is added to the underlying execution
> engine that doesn't fit Beam's interface, or breaks it, then I would think
> the interface would need to be changed. Another example: if the
> underlying execution engines take different kinds of parameters for the
> same feature, then it isn't so straightforward to come up with an interface,
> since there might be very little in common in the first place; so, in that
> sense, I fail to see how Beam can stay ahead.
>
> "Of course the API itself is Spark-specific, but it borrows heavily (among
> other things) on ideas that Beam itself pioneered long before Spark 2.0"
> Good to know.
>
> "one of the things Beam has focused on was a language portability
> framework"  Sure but how important is this for a typical user? Do people
> stop using a particular tool because it is in an X language? I personally
> would put features first over language portability and it's completely fine
> that may not be in line with Beam's priorities. All said I can agree Beam
> focus on language portability is great.
>
> On Tue, Apr 30, 2019 at 2:48 AM Maximilian Michels  wrote:
>
>> > I wouldn't say one is, or will always be, in front of or behind another.
>>
>> That's a great way to phrase it. I think it is very common to jump to
>> the conclusion that one system is better than the other. In reality it's
>> often much more complicated.
>>
>> For example, one of the things Beam has focused on was a language
>> portability framework. Do I get this with Flink? No. Does that mean Beam
>> is better than Flink? No. Maybe a better question would be, do I want to
>> be able to run Python pipelines?
>>
>> This is just an example; there are many more factors to consider.
>>
>> Cheers,
>> Max
>>
>> On 30.04.19 10:59, Robert Bradshaw wrote:
>> > Though we all certainly have our biases, I think it's fair to say that
>> > all of these systems are constantly innovating, borrowing ideas from
>> > one another, and have their strengths and weaknesses. I wouldn't say
>> > one is, or will always be, in front of or behind another.
>> >
>> > Take, as the given example Spark Structured Streaming. Of course the
>> > API itself is spark-specific, but it borrows heavily (among other
>> > things) on ideas that Beam itself pioneered long before Spark 2.0,
>> > specifically the unification of batch and streaming processing into a
>> > single API, and the event-time based windowing (triggering) model for
>> > consistently and correctly handling distributed, out-of-order data
>> > streams.
>> >
>> > Of course there are also operational differences. Spark, for example,
>> > is very tied to the micro-batch style of execution whereas Flink is
>> > fundamentally very continuous, and Beam delegates to the underlying
>> > runner.
>> >
>> > It is certainly Beam's goal to keep overhead minimal, and one of the
>> > primary selling points is the flexibility of portability (of both the
>> > execution runtime and the SDK) as your needs change.
>> >
>> > - Robert
>> >
>> >
>> > On Tue, Apr 30, 2019 at 5:29 AM  wrote:
>> >>
>> >> Of course! I suspect Beam will always be one or two steps behind the
>> new functionality that is available or yet to come.
>> >>
>> >> For example: Spark Structured Streaming is still not available, there are
>> no CEP APIs yet, and much more.
>> >>
>> >> Sent from my iPhone
>> >>
>> >> On Apr 30, 2019, at 12:11 AM, Pankaj Chand 
>> wrote:
>> >>
>> >> Will Beam add any overhead or lack certain API/functions available in
>> Spark/Flink?
>>
>


Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-04-30 Thread kant kodali
Staying behind doesn't imply one is better than the other, and I didn't mean
that in any way, but I fail to see how an abstraction framework like Beam
can stay ahead of the underlying execution engines.

For example, if a new feature is added to the underlying execution engine
that doesn't fit Beam's interface, or breaks it, then I would think the
interface would need to be changed. Another example: if the
underlying execution engines take different kinds of parameters for the
same feature, then it isn't so straightforward to come up with an interface,
since there might be very little in common in the first place; so, in that
sense, I fail to see how Beam can stay ahead.

"Of course the API itself is Spark-specific, but it borrows heavily (among
other things) on ideas that Beam itself pioneered long before Spark 2.0"
Good to know.

"one of the things Beam has focused on was a language portability
framework"  Sure but how important is this for a typical user? Do people
stop using a particular tool because it is in an X language? I personally
would put features first over language portability and it's completely fine
that may not be in line with Beam's priorities. All said I can agree Beam
focus on language portability is great.

On Tue, Apr 30, 2019 at 2:48 AM Maximilian Michels  wrote:

> > I wouldn't say one is, or will always be, in front of or behind another.
>
> That's a great way to phrase it. I think it is very common to jump to
> the conclusion that one system is better than the other. In reality it's
> often much more complicated.
>
> For example, one of the things Beam has focused on was a language
> portability framework. Do I get this with Flink? No. Does that mean Beam
> is better than Flink? No. Maybe a better question would be, do I want to
> be able to run Python pipelines?
>
> This is just an example; there are many more factors to consider.
>
> Cheers,
> Max
>
> On 30.04.19 10:59, Robert Bradshaw wrote:
> > Though we all certainly have our biases, I think it's fair to say that
> > all of these systems are constantly innovating, borrowing ideas from
> > one another, and have their strengths and weaknesses. I wouldn't say
> > one is, or will always be, in front of or behind another.
> >
> > Take, as the given example Spark Structured Streaming. Of course the
> > API itself is spark-specific, but it borrows heavily (among other
> > things) on ideas that Beam itself pioneered long before Spark 2.0,
> > specifically the unification of batch and streaming processing into a
> > single API, and the event-time based windowing (triggering) model for
> > consistently and correctly handling distributed, out-of-order data
> > streams.
> >
> > Of course there are also operational differences. Spark, for example,
> > is very tied to the micro-batch style of execution whereas Flink is
> > fundamentally very continuous, and Beam delegates to the underlying
> > runner.
> >
> > It is certainly Beam's goal to keep overhead minimal, and one of the
> > primary selling points is the flexibility of portability (of both the
> > execution runtime and the SDK) as your needs change.
> >
> > - Robert
> >
> >
> > On Tue, Apr 30, 2019 at 5:29 AM  wrote:
> >>
> >> Of course! I suspect Beam will always be one or two steps behind the
> new functionality that is available or yet to come.
> >>
> >> For example: Spark Structured Streaming is still not available, there are
> no CEP APIs yet, and much more.
> >>
> >> Sent from my iPhone
> >>
> >> On Apr 30, 2019, at 12:11 AM, Pankaj Chand 
> wrote:
> >>
> >> Will Beam add any overhead or lack certain API/functions available in
> Spark/Flink?
>


Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-04-30 Thread Maximilian Michels

I wouldn't say one is, or will always be, in front of or behind another.


That's a great way to phrase it. I think it is very common to jump to 
the conclusion that one system is better than the other. In reality it's 
often much more complicated.


For example, one of the things Beam has focused on was a language 
portability framework. Do I get this with Flink? No. Does that mean Beam 
is better than Flink? No. Maybe a better question would be, do I want to 
be able to run Python pipelines?
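
For example, here is a minimal sketch of what that looks like (the Flink
address and the LOOPBACK environment are assumptions for a local development
setup): a pipeline written in Python is handed to a Flink cluster purely
through pipeline options.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=localhost:8081",  # assumed: a locally reachable Flink session cluster
    "--environment_type=LOOPBACK",    # assumed: run the Python workers in-process (dev setup)
])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(range(10))
     | beam.Map(lambda x: x * x)
     | beam.Map(print))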


This is just an example; there are many more factors to consider.

Cheers,
Max

On 30.04.19 10:59, Robert Bradshaw wrote:

Though we all certainly have our biases, I think it's fair to say that
all of these systems are constantly innovating, borrowing ideas from
one another, and have their strengths and weaknesses. I wouldn't say
one is, or will always be, in front of or behind another.

Take, as the given example Spark Structured Streaming. Of course the
API itself is spark-specific, but it borrows heavily (among other
things) on ideas that Beam itself pioneered long before Spark 2.0,
specifically the unification of batch and streaming processing into a
single API, and the event-time based windowing (triggering) model for
consistently and correctly handling distributed, out-of-order data
streams.

Of course there are also operational differences. Spark, for example,
is very tied to the micro-batch style of execution whereas Flink is
fundamentally very continuous, and Beam delegates to the underlying
runner.

It is certainly Beam's goal to keep overhead minimal, and one of the
primary selling points is the flexibility of portability (of both the
execution runtime and the SDK) as your needs change.

- Robert


On Tue, Apr 30, 2019 at 5:29 AM  wrote:


Of course! I suspect Beam will always be one or two steps behind the new
functionality that is available or yet to come.

For example: Spark Structured Streaming is still not available, there are no CEP
APIs yet, and much more.

Sent from my iPhone

On Apr 30, 2019, at 12:11 AM, Pankaj Chand  wrote:

Will Beam add any overhead or lack certain API/functions available in 
Spark/Flink?


Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-04-30 Thread Robert Bradshaw
Though we all certainly have our biases, I think it's fair to say that
all of these systems are constantly innovating, borrowing ideas from
one another, and have their strengths and weaknesses. I wouldn't say
one is, or will always be, in front of or behind another.

Take, as the given example Spark Structured Streaming. Of course the
API itself is spark-specific, but it borrows heavily (among other
things) on ideas that Beam itself pioneered long before Spark 2.0,
specifically the unification of batch and streaming processing into a
single API, and the event-time based windowing (triggering) model for
consistently and correctly handling distributed, out-of-order data
streams.
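
As a minimal sketch of that unification (the file path and the Pub/Sub topic
are placeholders, not real resources), the same composite transform can be
applied to a bounded or an unbounded source without changes:

import apache_beam as beam
from apache_beam import window
from apache_beam.io import ReadFromText


class WindowedCounts(beam.PTransform):
    # One transform, written once, meant to be applied to bounded and
    # unbounded inputs alike.
    def expand(self, lines):
        return (lines
                | beam.WindowInto(window.FixedWindows(60))
                | beam.combiners.Count.PerElement())


with beam.Pipeline() as p:
    batch = p | "Batch" >> ReadFromText("gs://some-bucket/input*.txt") | WindowedCounts()
    # An unbounded source would reuse the exact same transform, e.g. (with the
    # gcp extra installed and --streaming set):
    #   stream = p | "Stream" >> beam.io.ReadFromPubSub(topic=...) | WindowedCounts()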

Of course there are also operational differences. Spark, for example,
is very tied to the micro-batch style of execution whereas Flink is
fundamentally very continuous, and Beam delegates to the underlying
runner.

It is certainly Beam's goal to keep overhead minimal, and one of the
primary selling points is the flexibility of portability (of both the
execution runtime and the SDK) as your needs change.

- Robert


On Tue, Apr 30, 2019 at 5:29 AM  wrote:
>
> Of course! I suspect Beam will always be one or two steps behind the new
> functionality that is available or yet to come.
>
> For example: Spark Structured Streaming is still not available, there are no
> CEP APIs yet, and much more.
>
> Sent from my iPhone
>
> On Apr 30, 2019, at 12:11 AM, Pankaj Chand  wrote:
>
> Will Beam add any overhead or lack certain API/functions available in 
> Spark/Flink?


Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-04-29 Thread kanth909
Of course! I suspect Beam will always be one or two steps behind the new
functionality that is available or yet to come.

For example: Spark Structured Streaming is still not available, there are no CEP
APIs yet, and much more.

Sent from my iPhone

> On Apr 30, 2019, at 12:11 AM, Pankaj Chand  wrote:
> 
> Will Beam add any overhead or lack certain API/functions available in 
> Spark/Flink?