Hi!

I got back from Beam Summit Europe 2019, which happened last week in
Berlin, and I had lots of interesting conversations and got lots of
feedback from the people I met there. I thought I would share some of it
with the dev list. By the way, you can check out the talk on YouTube
<https://youtu.be/DKxYE8YWF_o>!

First of all, a lot of people were *very* interested in Apache Nemo! Many
people from the Beam community were excited to hear about a new runner
with primary support for their language. One reason for their interest is
that Beam itself does not get involved in the runtime layer, where the
actual scheduling, communication, and distributed computation happen, so
they were curious about the optimizations that can be done in that layer.

Second, with all the support from the TFX team, as well as the Beam SQL
team, supporting the *portability layer* of Beam would bring loads of new
possibilities for Nemo, since it supports applications written in any of
Java, Python, and Go (and more languages in the future!). The portability
layer is getting more and more mature, and I think it's about time for
Nemo to support it as well; not many runners support it so far, so it
would give Nemo a head start.
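
Just to make it concrete, here's a minimal sketch of what targeting a
portable runner looks like from the Python SDK side (not tested against
Nemo; the job endpoint address is just a placeholder). A portable Nemo
runner would expose a job service that users target in the same way:

    # Minimal Beam Python pipeline submitted through the portability layer.
    # The runner behind the job endpoint could be any portable runner;
    # the address below is only a placeholder for illustration.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",  # placeholder job service address
        "--environment_type=LOOPBACK",    # run the SDK worker locally, for testing
    ])

    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create(["hello", "beam", "portability"])
         | "Uppercase" >> beam.Map(str.upper)
         | "Print" >> beam.Map(print))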

Another thing that I've noticed is that a lot of people are still very much
interested in *batch* processing rather than stream processing. From the
people I've talked to, I've learned that many found stream processing
quite pricey and not worth the cost they were paying (for example, Spotify
runs all of their data processing workloads as batch). I think Nemo could
be a good candidate for batch processing, as Spark often suffers from
problems such as large-scale shuffle and data skew when not provided with
machines with enough memory, whereas Nemo is able to provide optimizations
for such problems. I've also found that people were interested in whether
Nemo supports Kubernetes, which is a topic that we should definitely look
into.

I've also had many questions from engineers at *Seznam.cz* and
*shopify.com*, who (I think) run their own datacenters to process their
data. They have been facing exactly the same problems as illustrated above
(large-scale shuffle, data skew, frequent data reloading for broadcast
data, utilizing transient resources, etc.), and had questions about
running their data processing workloads on the large amounts of data they
face every day (up to 40TB/day). I should definitely follow up with them
to see how they are doing and whether they are trying to use Nemo in
production, to provide help if needed and to see Nemo's performance with
real workloads.

Lastly, I have been talking with Pablo (from Beam) about the trip to
*Seattle* and Renton, Washington next week for the USENIX ATC '19
conference, and we had a chat about organizing a lunch and maybe a small
talk with the Googlers there as well! I've also heard that Davor is based
in Seattle, so I have been thinking that it would be a great opportunity
for us to meet in person. 😀 The date would probably be the *15th of
July*, so please keep the date in mind if you would be interested!

Cheers,
Wonook
