Thanks Till for summarizing!

Another alternative is to stick to one distribution, but remove one of the very heavy filesystem connectors and add all the mentioned SQL connectors/formats, which would keep the size of the distribution the same, or make it a bit smaller.

Best,
Aljoscha

On 04.05.20 18:59, Till Rohrmann wrote:
Thanks everyone for this lively discussion and all your thoughts.

Let me try to summarise the current state of the discussion and then let's
see how we can move it forward.

To begin with, I think everyone agrees that we want to improve Flink's user
experience. In particular, we want to improve the experience of first time
users who want to try out Flink's SQL functionality.

The problem which stands in the way of a good user experience is that the
current Flink distribution contains too few dependencies for a smooth first
time SQL experience and too many dependencies for a lean production setup.
Hence, Aljoscha proposed to create a "fat" and "slim" Flink distribution
addressing these two differing needs.

As far as the discussion goes there are two remaining discussion points.

1. How do we serve the different types of distributions?

a) Create a "fat" and "slim" distribution which is served from the Flink
web site.
b) Create a "slim" distribution which is served from the Flink web site and
have a tool (e.g. script) which can turn a slim distribution into a fat
distribution by downloading additional dependencies.

In favour of a) is that it is simpler and does not require the user to
execute an additional step. The downside is that we will add another
dimension to the release matrix which will complicate the release
process (see Chesnay's last comment for more details).

In favour of b) is that it is potentially the more general solution, as
we can provide different options for different distributions (e.g.
choosing a connector version, required filesystems, metric reporters,
etc.). The downside is the additional step for the user and that we
need such a tool (which in itself could be quite complex).
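
To make option b) a bit more concrete, such a tool could start as
simple as the following sketch (the artifact list, the version
variables, and the choice of opt/ as the target directory are all just
assumptions for illustration, not a worked-out design):

#!/usr/bin/env bash
# fatten.sh -- sketch: turn a slim Flink dist into a fat one by
# downloading additional artifacts from Maven Central.
# Run from the root of an unpacked Flink distribution.
set -euo pipefail

FLINK_VERSION="1.10.0"
SCALA_VERSION="2.11"
REPO="https://repo1.maven.org/maven2/org/apache/flink"

# Hypothetical list; the real contents of the "fat" dist are TBD.
ARTIFACTS=(
  "flink-csv/${FLINK_VERSION}/flink-csv-${FLINK_VERSION}.jar"
  "flink-json/${FLINK_VERSION}/flink-json-${FLINK_VERSION}.jar"
  "flink-sql-connector-kafka_${SCALA_VERSION}/${FLINK_VERSION}/flink-sql-connector-kafka_${SCALA_VERSION}-${FLINK_VERSION}.jar"
)

for artifact in "${ARTIFACTS[@]}"; do
  echo "Downloading ${artifact##*/} ..."
  curl -fsSL -o "opt/${artifact##*/}" "${REPO}/${artifact}"
done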

2. What is contained in the "fat" distribution?

The current proposal is to move everything which can be moved from opt
to the plugins directory (metric reporters and filesystems). That way
the user will be able to use all of these implementations without
running into dependency conflicts.
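
As an illustration, moving e.g. the Prometheus reporter into plugins
is just this (a sketch; the sub-directory name is arbitrary, what
matters is the one-directory-per-plugin layout):

# Each plugin lives in its own sub-directory under plugins/ and is
# loaded with its own class loader, so the jars cannot conflict.
mkdir -p plugins/metrics-prometheus
cp opt/flink-metrics-prometheus-1.10.0.jar plugins/metrics-prometheus/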

For the SQL support, Aljoscha proposed to add:

flink-avro-1.10.0.jar
flink-csv-1.10.0.jar
flink-hbase_2.11-1.10.0.jar
flink-jdbc_2.11-1.10.0.jar
flink-json-1.10.0.jar
flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
flink-sql-connector-kafka_2.11-1.10.0.jar

How to move forward from here?

Given that the time until the feature freeze is limited, I would
actually propose to follow the simplest approach, which is the creation
of two distributions ("fat" & "slim"). We can still rethink this
decision at a later point and introduce a tool which allows downloading
a custom-built Flink distribution. At that point we could then remove
the "fat" distribution from the web site. Of course, this comes at the
cost of increased release complexity, but I believe that the user
experience will make up for it.

Regarding what to include, I think we could take Aljoscha's proposal
and then see what other dependencies the most common SQL use cases
require. I guess that the SQL folks know quite precisely where users
run into problems.

I know that this solution might not be perfect (in particular wrt releases)
but I hope that everyone could live with this solution for the time being.

Feel free to add anything I might have forgotten to mention here.

Cheers,
Till

On Tue, Apr 28, 2020 at 11:43 AM Chesnay Schepler <ches...@apache.org>
wrote:

It would be good if we could nail down what a slim/fat distribution
would look like, as there are various ideas floating around in this thread.

Like, what is a "slim" distribution? Are we just emptying /opt? Removing
everything larger than 1mb? Are we throwing out the Table API from /lib
for a minimal streaming distribution?
Are we going ham and removing the YARN integration from the flink-dist jar?

While I can see how a fat distribution can certainly help for the
out-of-the-box experience, I'm not so sold on the slim variant.
If someone is capable of assembling a distribution matching their
use-case, do they even need a slim distribution in the first place?

I really want us to stick to 1 distribution type, as I'm worried about
the implications of 2 or FWIW any number of additional distribution types:

- you need separate assemblies, including a new profile
      - adjusting opt/plugins and making sure the examples match the
bundled contents (e.g., no gelly/python, maybe some SQL examples if
there are any that use a connector)
- another 300mb uploaded to dist.apache.org + whatever the fat
distribution grows by x3 (scala 2.11/2.12 + python)
      - the latter naturally being susceptible to additional growth in
the future
      - this is also a pain for release managers since SVN likes to throw
up if the upload is too large + it increases upload time
- another 2 distributions to test during a release
- another distribution type we need to test via CI
- more content downloaded into the docker images by default
      - unless of course we release separate slim/fat images (where we
would then circle back to the above 2 points, just docker-flavored)
- any further addition to the release matrix implies an additional 4
distributions => long-term ramifications
      - e.g., another scala version

On 24/04/2020 15:15, Kurt Young wrote:
+1 for "slim" and "fat" solution. One comment about the fat one, I think
we
need to
put all needed jars into /lib (or /plugins). Put jars into /opt and
relying
on users moving
them from /opt to /lib doesn't really improve the out-of-box experience.

Best,
Kurt


On Fri, Apr 24, 2020 at 8:28 PM Aljoscha Krettek <aljos...@apache.org>
wrote:

re (1): I don't know about that, probably the people that did the
metrics reporter plugin support had some thoughts about that.

re (2): I agree, that's why I initially suggested to split it into
"slim" and "fat": our current "medium fat" selection of jars in Flink
dist does not serve anyone too well. It's too fat for people that want
to build lean application images. It's too lean for people that want a
good first out-of-box experience.

Aljoscha

On 17.04.20 16:38, Stephan Ewen wrote:
@Aljoscha I think that is an interesting line of thinking. The swift-fs
may be rarely used enough to move it to an optional download.

I would still drop two more thoughts:

(1) Now that we have plugins support, is there a reason to have a
metrics reporter or file system in /opt instead of /plugins? They don't
spoil the class path any more.

(2) I can imagine there still being a desire to have a "minimal" docker
file, for users that want to keep the container images as small as
possible, to speed up deployment. It is fine if that would not be the
default, though.


On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <aljos...@apache.org

wrote:

I think having such tools and/or tailor-made distributions can be nice,
but I also think the discussion is missing the main point: The initial
observation/motivation is that apparently a lot of users (Kurt and I
talked about this) on the Chinese DingTalk support groups and other
support channels have problems when first using the SQL client because
of these missing connectors/formats. For these users, having additional
tools would not solve anything because they would also not take that
extra step. I think that even tiny friction should be avoided, because
the annoyance from it accumulates across the (hopefully) many users
that we want to have.

Maybe we should take a step back from discussing the "fat"/"slim" idea
and instead think about the composition of the current dist. As
mentioned we have these jars in opt/:

  17M flink-azure-fs-hadoop-1.10.0.jar
  52K flink-cep-scala_2.11-1.10.0.jar
 180K flink-cep_2.11-1.10.0.jar
 746K flink-gelly-scala_2.11-1.10.0.jar
 626K flink-gelly_2.11-1.10.0.jar
 512K flink-metrics-datadog-1.10.0.jar
 159K flink-metrics-graphite-1.10.0.jar
 1.0M flink-metrics-influxdb-1.10.0.jar
 102K flink-metrics-prometheus-1.10.0.jar
  10K flink-metrics-slf4j-1.10.0.jar
  12K flink-metrics-statsd-1.10.0.jar
  36M flink-oss-fs-hadoop-1.10.0.jar
  28M flink-python_2.11-1.10.0.jar
  22K flink-queryable-state-runtime_2.11-1.10.0.jar
  18M flink-s3-fs-hadoop-1.10.0.jar
  31M flink-s3-fs-presto-1.10.0.jar
 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
 518K flink-sql-client_2.11-1.10.0.jar
  99K flink-state-processor-api_2.11-1.10.0.jar
  25M flink-swift-fs-hadoop-1.10.0.jar
 160M opt

The "filesystem" connectors ar ethe heavy hitters, there.

I downloaded most of the SQL connectors/formats and this is what I
got:

  73K flink-avro-1.10.0.jar
  36K flink-csv-1.10.0.jar
  55K flink-hbase_2.11-1.10.0.jar
  88K flink-jdbc_2.11-1.10.0.jar
  42K flink-json-1.10.0.jar
  20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
  24M sql-connectors-formats

We could just add these to the Flink distribution without blowing it up
by much. We could drop any of the existing "filesystem" connectors from
opt, add the SQL connectors/formats, and not change the size of Flink
dist. So maybe we should do that instead?

We would need some tooling for the sql-client shell script to pick the
connectors/formats up from opt/, because we don't want to add them to
lib/. We're already doing that for finding the flink-sql-client jar,
which is also not in lib/.
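
For illustration, the lookup could be a small wrapper along these
lines (just a sketch: it assumes the SQL client's -j/--jar option and
a hard-coded set of opt/ jar name patterns, both of which would need
checking against the actual script):

# Sketch: pass SQL connectors/formats from opt/ to the SQL client
# instead of requiring users to copy them into lib/.
JAR_ARGS=()
for jar in "$FLINK_HOME"/opt/flink-sql-connector-*.jar \
           "$FLINK_HOME"/opt/flink-csv-*.jar \
           "$FLINK_HOME"/opt/flink-json-*.jar; do
  [ -f "$jar" ] && JAR_ARGS+=(--jar "$jar")
done
exec "$FLINK_HOME"/bin/sql-client.sh embedded "${JAR_ARGS[@]}" "$@"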

What do you think?

Best,
Aljoscha

On 17.04.20 05:22, Jark Wu wrote:
Hi,

I like the idea of a web tool to assemble a fat distribution, and
https://code.quarkus.io/ looks very nice.
All users need to do is select what they need (I think this step can't
be omitted anyway).
We could also provide a default fat distribution on the web which
pre-selects some popular connectors.

Best,
Jark

On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <rafi.ar...@gmail.com>
wrote:

As a reference for a nice first experience I had, take a look at
https://code.quarkus.io/. You reach this page after you click "Start
Coding" on the project homepage.

Rafi


On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <ykt...@gmail.com>
wrote:

I'm not saying pre-bundling some jars will make this problem go away,
and you're right that it only hides the problem for some users. But
what if this solution can hide the problem for 90% of users? Wouldn't
that be good enough for us to try?

Regarding whether users following instructions would really be such a
big problem: I'm afraid yes. Otherwise I wouldn't have answered such
questions at least a dozen times, and I wouldn't see such questions
coming up from time to time. During some periods, I even saw such
questions every day.

Best,
Kurt


On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <ches...@apache.org> wrote:

The problem with having a distribution with "popular" stuff is that it
doesn't really *solve* a problem, it just hides it for users who fall
into these particular use-cases. Move outside of them and you once
again run into the exact same problems outlined above.
This is exactly why I like the tooling approach; you have to deal with
it from the start, and transitioning to a custom use-case is easier.

Would users following instructions really be such a big problem?
I would expect that users generally know *what* they need, just not
necessarily how it is assembled correctly (where to get which jar,
which directory to put it in).
It seems like these are exactly the problems this would solve?
I just don't see how moving a jar corresponding to some feature from
opt to some directory (lib/plugins) is less error-prone than just
selecting the feature and having the tool handle the rest.

As for re-distributions, it depends on the form that the tool would
take. It could be an application that runs locally and works against
Maven Central (note: not necessarily *using* Maven); this would work in
China, no?

A web tool would of course be fancy, but I don't know how feasible this
is with the ASF infrastructure.
You wouldn't be able to mirror the distribution, so the load can't be
distributed. I doubt INFRA would like this.

Note that third parties could also start distributing use-case-oriented
distributions, which would be perfectly fine as far as I'm concerned.
On 16/04/2020 16:57, Kurt Young wrote:

I'm not so sure about the web tool solution though. The concern I have
with this approach is that the final generated distribution is kind of
non-deterministic. We might generate too many different combinations
when users try to package different types of connectors, formats, and
maybe even Hadoop releases. As far as I can tell, most open source
projects and Apache projects only release some pre-defined
distributions, which most users are already familiar with, and thus
this is hard to change IMO. And I have also seen cases where users try
to re-distribute the release package because of the unstable network
access to the Apache website from China. With a web tool solution, I
don't think this kind of re-distribution would be possible anymore.

In the meantime, I also have a concern that we will fall back into our
trap again if we try to offer this smart & flexible solution, because
it needs users to cooperate with such a mechanism. It's exactly the
situation we currently fell into:
1. We offered a smart solution.
2. We hope users will follow the correct instructions.
3. Everything will work as expected if users followed the right
instructions.

In reality, I suspect not all users will do the second step correctly.
And for new users who are only trying to have a quick experience with
Flink, I would bet most will do it wrong.

So, my proposal would be one of the following 2 options:
1. Provide a slim distribution for advanced production users plus a
distribution which has some popular built-in jars.
2. Only provide a distribution which has some popular built-in jars.
If we are trying to reduce the number of distributions we release, I
would prefer 2 over 1.

Best,
Kurt


On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <trohrm...@apache.org> wrote:

I think what Chesnay and Dawid proposed would be the ideal solution.
Ideally, we would also have a nice web tool for the website which
generates the corresponding distribution for download.

To get things started we could begin with only supporting
downloading/creating the "fat" version with the script. The fat version
would then consist of the slim distribution plus whatever we deem
important for new users to get started.

Cheers,
Till

On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:


Hi all,

A few points from my side:

1. I like the idea of simplifying the experience for first-time users.
As for production use cases, I share Jark's opinion that in this case I
would expect users to combine their distribution manually. I think in
such scenarios it is important to understand the interconnections.
Personally, I'd expect the slimmest possible distribution that I can
extend further with what I need in my production scenario.

2. I think there is also the problem that the matrix of possible
combinations that can be useful is already big. Do we want to have a
distribution for:

       SQL users: which connectors should we include? Should we include
Hive? Which other catalogs?

       DataStream users: which connectors should we include?

       For both of the above, should we include yarn/kubernetes?

I would opt for providing only the "slim" distribution as a release
artifact.

3. However, as I said, I think it's worth investigating how we can
improve the user experience. What do you think of providing a tool,
e.g. a shell script, that constructs a distribution based on the user's
choice? I think that is also what Chesnay mentioned as "tooling to
assemble custom distributions". In the end, the way I see the
difference between a slim and fat distribution is which jars we put
into lib, right? Such a tool could have a few "screens".

1. Which API are you interested in:
a. SQL API
b. DataStream API


2. [SQL] Which connectors do you want to use? [multichoice]:
a. Kafka
b. Elasticsearch
...

3. [SQL] Which catalog do you want to use?

...

Such a tool would download all the dependencies from Maven and put them
into the correct folder. In the future we can extend it with additional
rules, e.g. kafka-0.9 cannot be chosen together with kafka-universal,
etc. (see the sketch below).
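
A minimal sketch of what the first "screen" could look like as a shell
script (all names, prompts, and the hard-coded version/Scala suffix
are illustrative only):

#!/usr/bin/env bash
# assemble-dist.sh -- sketch of an interactive distribution assembler.
set -euo pipefail

FLINK_VERSION="1.10.0"
SCALA_VERSION="2.11"
REPO="https://repo1.maven.org/maven2/org/apache/flink"

read -rp "Which API are you interested in? [sql/datastream] " api

if [ "$api" = "sql" ]; then
  read -rp "Which connectors? (e.g. kafka elasticsearch6) " -a connectors
  for c in "${connectors[@]}"; do
    jar="flink-sql-connector-${c}_${SCALA_VERSION}-${FLINK_VERSION}.jar"
    # Download the chosen connector into lib/ so that it is on the
    # classpath out of the box.
    curl -fsSL -o "lib/${jar}" \
      "${REPO}/flink-sql-connector-${c}_${SCALA_VERSION}/${FLINK_VERSION}/${jar}"
  done
fi

Compatibility rules (e.g. kafka-0.9 vs. kafka-universal) could then be
a simple lookup table that the script consults before downloading.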

The benefit of it would be that the distribution that we release could
remain "slim", or we could even make it slimmer. I might be missing
something here though.

Best,

Dawid

On 16/04/2020 11:02, Aljoscha Krettek wrote:

I want to reinforce my opinion from earlier: This is about improving
the situation both for first-time users and for experienced users that
want to use a Flink dist in production. The current Flink dist is too
"thin" for first-time SQL users and too "fat" for production users;
that is, we are serving no-one properly with the current middle ground.
That's why I think introducing those specialized "spins" of Flink dist
would be good.

By the way, at some point in the future production users might not
even need to get a Flink dist anymore. They should be able to have
Flink as a dependency of their project (including the runtime) and
then build an image from this for Kubernetes or a fat jar for
YARN.

Aljoscha

On 15.04.20 18:14, wenlong.lwl wrote:

Hi all,

Regarding slim and fat distributions, I think different kinds of jobs
may prefer different types of distribution:

For DataStream jobs, I think we may not like a fat distribution
containing connectors, because users always need to depend on the
connector in their user code anyway, so it is easy to include the
connector jar in the user lib. Fewer jars in lib means fewer class
conflicts and problems.

For SQL jobs, I think we are trying to encourage users to use pure SQL
(DDL + DML) to construct their jobs. In order to improve the user
experience, it may be important for Flink not only to provide as many
connector jars in the distribution as possible, especially the
connectors and formats we have well documented, but also to provide a
mechanism to load connectors according to the DDLs.

So I think it could be good to place connector/format jars in some dir
like opt/connector, which would not affect jobs by default, and to
introduce a mechanism of dynamic discovery for SQL, as sketched below.
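
To illustrate the idea, a very rough shell-level sketch (purely
hypothetical; a real implementation would hook into the SQL client's
factory discovery rather than grep over the SQL text, and both the
opt/connector layout and the property names -- 'connector.type' in
1.10, a plain 'connector' key as proposed for later versions -- are
assumptions):

# Sketch: derive the needed connector jars from the DDLs of a SQL
# script and collect the matching jars from opt/connector.
connectors=$(grep -oE "'connector(\.type)?'[[:space:]]*=[[:space:]]*'[^']+'" job.sql \
  | sed -E "s/.*'([^']+)'\$/\1/" | sort -u)
EXTRA_JARS=""
for c in $connectors; do
  for jar in "$FLINK_HOME"/opt/connector/*"$c"*.jar; do
    [ -f "$jar" ] && EXTRA_JARS="$EXTRA_JARS:$jar"
  done
done
echo "Jars to add to the classpath: $EXTRA_JARS"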

Best,
Wenlong

On Wed, 15 Apr 2020 at 22:46, Jingsong Li <jingsongl...@gmail.com> wrote:


Hi,

I am thinking about both "improve first experience" and "improve
production experience".

I'm thinking about what the common usage modes of Flink are:
Streaming jobs use Kafka? Batch jobs use Hive?

Hive 1.2.1 dependencies can be compatible with most Hive server
versions, so Spark and Presto have a built-in Hive 1.2.1 dependency.
Flink is currently mainly used for streaming, so let's not talk
about Hive.

For streaming jobs, first of all, the jobs in my mind are (related to
connectors):
- ETL jobs: Kafka -> Kafka
- Join jobs: Kafka -> DimJDBC -> Kafka
- Aggregation jobs: Kafka -> JDBCSink
So Kafka and JDBC are probably the most commonly used. Of course, this
also includes the CSV and JSON formats.
So when we provide such a fat distribution:
- With CSV, JSON.
- With flink-kafka-universal and Kafka dependencies.
- With flink-jdbc.
Using this fat distribution, most users can run their jobs well. (A
JDBC driver jar is required, but that is very natural to add.)
Can these dependencies lead to conflicts? Only Kafka may have
conflicts, but if our goal is to use kafka-universal to support all
Kafka versions, we can hope to cover the vast majority of users.

We don't want to put every jar into the fat distribution, only those
that are common and unlikely to conflict. Of course, which jars to put
into the fat distribution is a matter of consideration.
We have the opportunity to help the majority of users, while also
leaving room for customization.

Best,
Jingsong Lee

On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imj...@gmail.com> wrote:

Hi,

I think we should first reach a consensus on "what problem do we want
to solve?":
(1) improve the first experience? or (2) improve the production
experience?

As far as I can see from the above discussion, I think what we want to
solve is the "first experience". And I think the slim jar is still the
best distribution for production, because it's easier to assemble jars
than to exclude jars, and it can avoid potential class conflicts.

If we want to improve the "first experience", I think it makes sense to
have a fat distribution to give users a smoother first experience. But
I would like to call it a "playground distribution" or something like
that, to explicitly differentiate it from the "slim production-purpose
distribution".

The "playground distribution" can contain some widely used jars, like
universal-kafka-sql-connector, elasticsearch7-sql-connector, avro,
json, csv, etc. We could even provide a playground Docker image which
contains the fat distribution, python3, and Hive.

Best,
Jark


On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ches...@apache.org> wrote:

I don't see a lot of value in having multiple distributions.

The simple reality is that no fat distribution we could provide would
satisfy all use-cases, so why even try.
If users commonly run into issues for certain jars, then maybe those
should be added to the current distribution.

Personally though, I still believe we should only distribute a slim
version. I'd rather have users always add required jars to the
distribution than only when they go outside our "expected" use-cases.
Then we might finally address this issue properly, i.e., tooling to
assemble custom distributions and/or better error messages if
Flink-provided extensions cannot be found.

On 15/04/2020 15:23, Kurt Young wrote:

Regarding the specific solution, I'm not sure about the "fat" and
"slim" solution though. I get the idea that we can make the slim one
even more lightweight than the current distribution, but what about the
"fat" one? Do you mean that we would package all connectors and formats
into it? I'm not sure if this is feasible. For example, we can't put
all versions of the Kafka and Hive connector jars into the lib
directory, and we also might need Hadoop jars when using the filesystem
connector to access data from HDFS.

So my guess would be that we hand-pick some of the most frequently used
connectors and formats for our "lib" directory, like the kafka, csv,
and json ones mentioned above, and still leave some other connectors
out of it. If this is the case, then why don't we just provide this
distribution to users? I'm not sure I get the benefit of providing
another super "slim" distribution (we have to pay some costs to provide
another suite of distributions).

What do you think?

Best,
Kurt


On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <jingsongl...@gmail.com> wrote:

Big +1.

I like "fat" and "slim".

For csv and json, like Jark said, they are quite small and don't have
other dependencies. They are important to the Kafka connector, and
important to the upcoming file system connector too.
So can we include them in both "fat" and "slim"? They're so important,
and they're so lightweight.

Best,
Jingsong Lee

On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfre...@gmail.com> wrote:

Big +1.
This will improve the user experience (especially for new Flink users).
We have answered so many questions about "class not found".

Best,
Godfrey

On Wed, Apr 15, 2020 at 4:30 PM, Dian Fu <dian0511...@gmail.com> wrote:

+1 to this proposal.

Missing connector jars is also a big problem for PyFlink users.
Currently, after a Python user has installed PyFlink using `pip`, they
have to manually copy the connector fat jars to the PyFlink
installation directory for the connectors to be usable when running
jobs locally. This process is very confusing for users and affects the
experience a lot.

Regards,
Dian


On Apr 15, 2020, at 3:51 PM, Jark Wu <imj...@gmail.com> wrote:

+1 to the proposal. I also found the "download additional jar" step is
really verbose when I prepare webinars.

At least, I think flink-csv and flink-json should be in the
distribution; they are quite small and don't have other dependencies.
they are quite small and don't have other dependencies.

Best,
Jark

On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjf...@gmail.com> wrote:

Hi Aljoscha,

Big +1 for the fat Flink distribution. Where do you plan to put these
connectors? opt or lib?

On Wed, Apr 15, 2020 at 3:30 PM, Aljoscha Krettek <aljos...@apache.org> wrote:


Hi Everyone,

I'd like to discuss releasing a more full-featured Flink distribution.
The motivation is that there is friction for SQL/Table API users that
want to use Table connectors which are not in the current Flink
distribution. For these users the workflow is currently roughly:

       - download Flink dist
       - configure csv/Kafka/json connectors per configuration
       - run SQL client or program
       - decrypt error message and research the solution
       - download additional connector jars
       - program works correctly

I realize that this can be made to work, but if every SQL user has this
as their first experience, that doesn't seem good to me.

My proposal is to provide two versions of the Flink distribution in the
future: "fat" and "slim" (names to be discussed):

       - slim would be even trimmer than today's distribution
       - fat would contain a lot of convenience connectors (yet to be
determined which ones)

And yes, I realize that there are already more dimensions of Flink
releases (Scala version and Java version).

For background, our current Flink dist has these in the opt directory:

       - flink-azure-fs-hadoop-1.10.0.jar
       - flink-cep-scala_2.12-1.10.0.jar
       - flink-cep_2.12-1.10.0.jar
       - flink-gelly-scala_2.12-1.10.0.jar
       - flink-gelly_2.12-1.10.0.jar
       - flink-metrics-datadog-1.10.0.jar
       - flink-metrics-graphite-1.10.0.jar
       - flink-metrics-influxdb-1.10.0.jar
       - flink-metrics-prometheus-1.10.0.jar
       - flink-metrics-slf4j-1.10.0.jar
       - flink-metrics-statsd-1.10.0.jar
       - flink-oss-fs-hadoop-1.10.0.jar
       - flink-python_2.12-1.10.0.jar
       - flink-queryable-state-runtime_2.12-1.10.0.jar
       - flink-s3-fs-hadoop-1.10.0.jar
       - flink-s3-fs-presto-1.10.0.jar
       - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar

       - flink-sql-client_2.12-1.10.0.jar
       - flink-state-processor-api_2.12-1.10.0.jar
       - flink-swift-fs-hadoop-1.10.0.jar

Current Flink dist is 267M. If we removed everything from opt we would
go down to 126M. I would recommend this, because the large majority of
the files in opt are probably unused.

What do you think?

Best,
Aljoscha



--
Best Regards

Jeff Zhang


--
Best, Jingsong Lee


--
Best, Jingsong Lee