+1 to adding these 3 formats into dist, under the lib/ directory. This is a step worth trying toward better usability for SQL users. They don't have *any* third-party dependencies and are very small, so I think it's safe to add them.
Best, Jark On Fri, 5 Jun 2020 at 11:14, Jingsong Li <jingsongl...@gmail.com> wrote: > Hi all, > > Considering that 1.11 will be released soon, what about my previous > proposal? Put flink-csv, flink-json and flink-avro under lib. > These three formats are very small and no third party dependence, and they > are widely used by table users. > > Best, > Jingsong Lee > > On Tue, May 12, 2020 at 4:19 PM Jingsong Li <jingsongl...@gmail.com> > wrote: > > > Thanks for your discussion. > > > > Sorry to start discussing another thing: > > > > The biggest problem I see is the variety of problems caused by users' > lack > > of format dependency. > > As Aljoscha said, these three formats are very small and no third party > > dependence, and they are widely used by table users. > > Actually, we don't have any other built-in table formats now... In total > > 151K... > > > > 73K flink-avro-1.10.0.jar > > 36K flink-csv-1.10.0.jar > > 42K flink-json-1.10.0.jar > > > > So, Can we just put them into "lib/" or flink-table-uber? > > It not solve all problems and maybe it is independent of "fat" and > "slim". > > But also improve usability. > > What do you think? Any objections? > > > > Best, > > Jingsong Lee > > > > On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <ches...@apache.org> > > wrote: > > > >> One downside would be that we're shipping more stuff when running on > >> YARN for example, since the entire plugins directory is shiped by > default. > >> > >> On 17/04/2020 16:38, Stephan Ewen wrote: > >> > @Aljoscha I think that is an interesting line of thinking. the > swift-fs > >> may > >> > be rarely enough used to move it to an optional download. > >> > > >> > I would still drop two more thoughts: > >> > > >> > (1) Now that we have plugins support, is there a reason to have a > >> metrics > >> > reporter or file system in /opt instead of /plugins? They don't spoil > >> the > >> > class path any more. 
> >> > > >> > (2) I can imagine there still being a desire to have a "minimal" > docker > >> > file, for users that want to keep the container images as small as > >> > possible, to speed up deployment. It is fine if that would not be the > >> > default, though. > >> > > >> > > >> > On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek < > aljos...@apache.org> > >> > wrote: > >> > > >> >> I think having such tools and/or tailor-made distributions can be > nice > >> >> but I also think the discussion is missing the main point: The > initial > >> >> observation/motivation is that apparently a lot of users (Kurt and I > >> >> talked about this) on the chinese DingTalk support groups, and other > >> >> support channels have problems when first using the SQL client > because > >> >> of these missing connectors/formats. For these, having additional > tools > >> >> would not solve anything because they would also not take that extra > >> >> step. I think that even tiny friction should be avoided because the > >> >> annoyance from it accumulates of the (hopefully) many users that we > >> want > >> >> to have. > >> >> > >> >> Maybe we should take a step back from discussing the "fat"/"slim" > idea > >> >> and instead think about the composition of the current dist. 
As > >> >> mentioned we have these jars in opt/: > >> >> > >> >> 17M flink-azure-fs-hadoop-1.10.0.jar > >> >> 52K flink-cep-scala_2.11-1.10.0.jar > >> >> 180K flink-cep_2.11-1.10.0.jar > >> >> 746K flink-gelly-scala_2.11-1.10.0.jar > >> >> 626K flink-gelly_2.11-1.10.0.jar > >> >> 512K flink-metrics-datadog-1.10.0.jar > >> >> 159K flink-metrics-graphite-1.10.0.jar > >> >> 1.0M flink-metrics-influxdb-1.10.0.jar > >> >> 102K flink-metrics-prometheus-1.10.0.jar > >> >> 10K flink-metrics-slf4j-1.10.0.jar > >> >> 12K flink-metrics-statsd-1.10.0.jar > >> >> 36M flink-oss-fs-hadoop-1.10.0.jar > >> >> 28M flink-python_2.11-1.10.0.jar > >> >> 22K flink-queryable-state-runtime_2.11-1.10.0.jar > >> >> 18M flink-s3-fs-hadoop-1.10.0.jar > >> >> 31M flink-s3-fs-presto-1.10.0.jar > >> >> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar > >> >> 518K flink-sql-client_2.11-1.10.0.jar > >> >> 99K flink-state-processor-api_2.11-1.10.0.jar > >> >> 25M flink-swift-fs-hadoop-1.10.0.jar > >> >> 160M opt > >> >> > >> >> The "filesystem" connectors ar ethe heavy hitters, there. > >> >> > >> >> I downloaded most of the SQL connectors/formats and this is what I > got: > >> >> > >> >> 73K flink-avro-1.10.0.jar > >> >> 36K flink-csv-1.10.0.jar > >> >> 55K flink-hbase_2.11-1.10.0.jar > >> >> 88K flink-jdbc_2.11-1.10.0.jar > >> >> 42K flink-json-1.10.0.jar > >> >> 20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar > >> >> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar > >> >> 24M sql-connectors-formats > >> >> > >> >> We could just add these to the Flink distribution without blowing it > up > >> >> by much. We could drop any of the existing "filesystem" connectors > from > >> >> opt and add the SQL connectors/formats and not change the size of > Flink > >> >> dist. So maybe we should do that instead? 
> >> >> > >> >> We would need some tooling for the sql-client shell script to pick-up > >> >> the connectors/formats up from opt/ because we don't want to add them > >> to > >> >> lib/. We're already doing that for finding the flink-sql-client jar, > >> >> which is also not in lib/. > >> >> > >> >> What do you think? > >> >> > >> >> Best, > >> >> Aljoscha > >> >> > >> >> On 17.04.20 05:22, Jark Wu wrote: > >> >>> Hi, > >> >>> > >> >>> I like the idea of web tool to assemble fat distribution. And the > >> >>> https://code.quarkus.io/ looks very nice. > >> >>> All the users need to do is just select what he/she need (I think > this > >> >> step > >> >>> can't be omitted anyway). > >> >>> We can also provide a default fat distribution on the web which > >> default > >> >>> selects some popular connectors. > >> >>> > >> >>> Best, > >> >>> Jark > >> >>> > >> >>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <rafi.ar...@gmail.com> > >> wrote: > >> >>> > >> >>>> As a reference for a nice first-experience I had, take a look at > >> >>>> https://code.quarkus.io/ > >> >>>> You reach this page after you click "Start Coding" at the project > >> >> homepage. > >> >>>> Rafi > >> >>>> > >> >>>> > >> >>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <ykt...@gmail.com> > wrote: > >> >>>> > >> >>>>> I'm not saying pre-bundle some jars will make this problem go > away, > >> and > >> >>>>> you're right that only hides the problem for > >> >>>>> some users. But what if this solution can hide the problem for 90% > >> >> users? > >> >>>>> Would't that be good enough for us to try? > >> >>>>> > >> >>>>> Regarding to would users following instructions really be such a > big > >> >>>>> problem? > >> >>>>> I'm afraid yes. Otherwise I won't answer such questions for at > >> least a > >> >>>>> dozen times and I won't see such questions coming > >> >>>>> up from time to time. During some periods, I even saw such > questions > >> >>>> every > >> >>>>> day. 
> >> >>>>> > >> >>>>> Best, > >> >>>>> Kurt > >> >>>>> > >> >>>>> > >> >>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler < > >> ches...@apache.org> > >> >>>>> wrote: > >> >>>>> > >> >>>>>> The problem with having a distribution with "popular" stuff is > >> that it > >> >>>>>> doesn't really *solve* a problem, it just hides it for users who > >> fall > >> >>>>>> into these particular use-cases. > >> >>>>>> Move out of it and you once again run into exact same problems > >> >>>> out-lined. > >> >>>>>> This is exactly why I like the tooling approach; you have to deal > >> with > >> >>>> it > >> >>>>>> from the start and transitioning to a custom use-case is easier. > >> >>>>>> > >> >>>>>> Would users following instructions really be such a big problem? > >> >>>>>> I would expect that users generally know *what *they need, just > not > >> >>>>>> necessarily how it is assembled correctly (where do get which > jar, > >> >>>> which > >> >>>>>> directory to put it in). > >> >>>>>> It seems like these are exactly the problem this would solve? > >> >>>>>> I just don't see how moving a jar corresponding to some feature > >> from > >> >>>> opt > >> >>>>>> to some directory (lib/plugins) is less error-prone than just > >> >> selecting > >> >>>>> the > >> >>>>>> feature and having the tool handle the rest. > >> >>>>>> > >> >>>>>> As for re-distributions, it depends on the form that the tool > would > >> >>>> take. > >> >>>>>> It could be an application that runs locally and works against > >> maven > >> >>>>>> central (note: not necessarily *using* maven); this should would > >> work > >> >>>> in > >> >>>>>> China, no? > >> >>>>>> > >> >>>>>> A web tool would of course be fancy, but I don't know how > feasible > >> >> this > >> >>>>> is > >> >>>>>> with the ASF infrastructure. > >> >>>>>> You wouldn't be able to mirror the distribution, so the load > can't > >> be > >> >>>>>> distributed. I doubt INFRA would like this. 
> >> >>>>>> > >> >>>>>> Note that third-parties could also start distributing use-case > >> >> oriented > >> >>>>>> distributions, which would be perfectly fine as far as I'm > >> concerned. > >> >>>>>> > >> >>>>>> On 16/04/2020 16:57, Kurt Young wrote: > >> >>>>>> > >> >>>>>> I'm not so sure about the web tool solution though. The concern I > >> have > >> >>>>> for > >> >>>>>> this approach is the final generated > >> >>>>>> distribution is kind of non-deterministic. We might generate too > >> many > >> >>>>>> different combinations when user trying to > >> >>>>>> package different types of connector, format, and even maybe > hadoop > >> >>>>>> releases. As far as I can tell, most open > >> >>>>>> source projects and apache projects will only release some > >> >>>>>> pre-defined distributions, which most users are already > >> >>>>>> familiar with, thus hard to change IMO. And I also have went > >> through > >> >> in > >> >>>>>> some cases, users will try to re-distribute > >> >>>>>> the release package, because of the unstable network of apache > >> website > >> >>>>> from > >> >>>>>> China. In web tool solution, I don't > >> >>>>>> think this kind of re-distribution would be possible anymore. > >> >>>>>> > >> >>>>>> In the meantime, I also have a concern that we will fall back > into > >> our > >> >>>>> trap > >> >>>>>> again if we try to offer this smart & flexible > >> >>>>>> solution. Because it needs users to cooperate with such > mechanism. > >> >> It's > >> >>>>>> exactly the situation what we currently fell > >> >>>>>> into: > >> >>>>>> 1. We offered a smart solution. > >> >>>>>> 2. We hope users will follow the correct instructions. > >> >>>>>> 3. Everything will work as expected if users followed the right > >> >>>>>> instructions. > >> >>>>>> > >> >>>>>> In reality, I suspect not all users will do the second step > >> correctly. 
> >> >>>>> And > >> >>>>>> for new users who only trying to have a quick > >> >>>>>> experience with Flink, I would bet most users will do it wrong. > >> >>>>>> > >> >>>>>> So, my proposal would be one of the following 2 options: > >> >>>>>> 1. Provide a slim distribution for advanced product users and > >> provide > >> >> a > >> >>>>>> distribution which will have some popular builtin jars. > >> >>>>>> 2. Only provide a distribution which will have some popular > builtin > >> >>>> jars. > >> >>>>>> If we are trying to reduce the distributions we released, I would > >> >>>> prefer > >> >>>>> 2 > >> >>>>>> 1. > >> >>>>>> > >> >>>>>> Best, > >> >>>>>> Kurt > >> >>>>>> > >> >>>>>> > >> >>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann < > >> trohrm...@apache.org> > >> >> < > >> >>>>> trohrm...@apache.org> wrote: > >> >>>>>> > >> >>>>>> I think what Chesnay and Dawid proposed would be the ideal > >> solution. > >> >>>>>> Ideally, we would also have a nice web tool for the website which > >> >>>>> generates > >> >>>>>> the corresponding distribution for download. > >> >>>>>> > >> >>>>>> To get things started we could start with only supporting to > >> >>>>>> download/creating the "fat" version with the script. The fat > >> version > >> >>>>> would > >> >>>>>> then consist of the slim distribution and whatever we deem > >> important > >> >>>> for > >> >>>>>> new users to get started. > >> >>>>>> > >> >>>>>> Cheers, > >> >>>>>> Till > >> >>>>>> > >> >>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz < > >> >>>>> dwysakow...@apache.org> <dwysakow...@apache.org> > >> >>>>>> wrote: > >> >>>>>> > >> >>>>>> > >> >>>>>> Hi all, > >> >>>>>> > >> >>>>>> Few points from my side: > >> >>>>>> > >> >>>>>> 1. I like the idea of simplifying the experience for first time > >> users. > >> >>>>>> As for production use cases I share Jark's opinion that in this > >> case I > >> >>>>>> would expect users to combine their distribution manually. 
I > think > >> in > >> >>>>>> such scenarios it is important to understand interconnections. > >> >>>>>> Personally I'd expect the slimmest possible distribution that I > can > >> >>>>>> extend further with what I need in my production scenario. > >> >>>>>> > >> >>>>>> 2. I think there is also the problem that the matrix of possible > >> >>>>>> combinations that can be useful is already big. Do we want to > have > >> a > >> >>>>>> distribution for: > >> >>>>>> > >> >>>>>> SQL users: which connectors should we include? should we > >> include > >> >>>>>> hive? which other catalog? > >> >>>>>> > >> >>>>>> DataStream users: which connectors should we include? > >> >>>>>> > >> >>>>>> For both of the above should we include yarn/kubernetes? > >> >>>>>> > >> >>>>>> I would opt for providing only the "slim" distribution as a > release > >> >>>>>> artifact. > >> >>>>>> > >> >>>>>> 3. However, as I said I think its worth investigating how we can > >> >>>> improve > >> >>>>>> users experience. What do you think of providing a tool, could be > >> e.g. > >> >>>> a > >> >>>>>> shell script that constructs a distribution based on users > choice. > >> I > >> >>>>>> think that was also what Chesnay mentioned as "tooling to > >> >>>>>> assemble custom distributions" In the end how I see the > difference > >> >>>>>> between a slim and fat distribution is which jars do we put into > >> the > >> >>>>>> lib, right? It could have a few "screens". > >> >>>>>> > >> >>>>>> 1. Which API are you interested in: > >> >>>>>> a. SQL API > >> >>>>>> b. DataStream API > >> >>>>>> > >> >>>>>> > >> >>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]: > >> >>>>>> a. Kafka > >> >>>>>> b. Elasticsearch > >> >>>>>> ... > >> >>>>>> > >> >>>>>> 3. [SQL] Which catalog you want to use? > >> >>>>>> > >> >>>>>> ... > >> >>>>>> > >> >>>>>> Such a tool would download all the dependencies from maven and > put > >> >> them > >> >>>>>> into the correct folder. 
In the future we can extend it with > >> >> additional > >> >>>>>> rules e.g. kafka-0.9 cannot be chosen at the same time with > >> >>>>>> kafka-universal etc. > >> >>>>>> > >> >>>>>> The benefit of it would be that the distribution that we release > >> could > >> >>>>>> remain "slim" or we could even make it slimmer. I might be > missing > >> >>>>>> something here though. > >> >>>>>> > >> >>>>>> Best, > >> >>>>>> > >> >>>>>> Dawdi > >> >>>>>> > >> >>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote: > >> >>>>>> > >> >>>>>> I want to reinforce my opinion from earlier: This is about > >> improving > >> >>>>>> the situation both for first-time users and for experienced users > >> that > >> >>>>>> want to use a Flink dist in production. The current Flink dist is > >> too > >> >>>>>> "thin" for first-time SQL users and it is too "fat" for > production > >> >>>>>> users, that is where serving no-one properly with the current > >> >>>>>> middle-ground. That's why I think introducing those specialized > >> >>>>>> "spins" of Flink dist would be good. > >> >>>>>> > >> >>>>>> By the way, at some point in the future production users might > not > >> >>>>>> even need to get a Flink dist anymore. They should be able to > have > >> >>>>>> Flink as a dependency of their project (including the runtime) > and > >> >>>>>> then build an image from this for Kubernetes or a fat jar for > YARN. 
> >> >>>>>> > >> >>>>>> Aljoscha > >> >>>>>> > >> >>>>>> On 15.04.20 18:14, wenlong.lwl wrote: > >> >>>>>> > >> >>>>>> Hi all, > >> >>>>>> > >> >>>>>> Regarding slim and fat distributions, I think different kinds of > >> jobs > >> >>>>>> may > >> >>>>>> prefer different type of distribution: > >> >>>>>> > >> >>>>>> For DataStream job, I think we may not like fat distribution > >> >>>>>> > >> >>>>>> containing > >> >>>>>> > >> >>>>>> connectors because user would always need to depend on the > >> connector > >> >>>>>> > >> >>>>>> in > >> >>>>>> > >> >>>>>> user code, it is easy to include the connector jar in the user > lib. > >> >>>>>> > >> >>>>>> Less > >> >>>>>> > >> >>>>>> jar in lib means less class conflicts and problems. > >> >>>>>> > >> >>>>>> For SQL job, I think we are trying to encourage user to user pure > >> >>>>>> sql(DDL + > >> >>>>>> DML) to construct their job, In order to improve user experience, > >> It > >> >>>>>> may be > >> >>>>>> important for flink, not only providing as many connector jar in > >> >>>>>> distribution as possible especially the connector and format we > >> have > >> >>>>>> well > >> >>>>>> documented, but also providing an mechanism to load connectors > >> >>>>>> according > >> >>>>>> to the DDLs, > >> >>>>>> > >> >>>>>> So I think it could be good to place connector/format jars in > some > >> >>>>>> dir like > >> >>>>>> opt/connector which would not affect jobs by default, and > >> introduce a > >> >>>>>> mechanism of dynamic discovery for SQL. > >> >>>>>> > >> >>>>>> Best, > >> >>>>>> Wenlong > >> >>>>>> > >> >>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li < > jingsongl...@gmail.com> > >> < > >> >>>>> jingsongl...@gmail.com> > >> >>>>>> wrote: > >> >>>>>> > >> >>>>>> > >> >>>>>> Hi, > >> >>>>>> > >> >>>>>> I am thinking both "improve first experience" and "improve > >> production > >> >>>>>> experience". > >> >>>>>> > >> >>>>>> I'm thinking about what's the common mode of Flink? > >> >>>>>> Streaming job use Kafka? 
Batch job use Hive? > >> >>>>>> > >> >>>>>> Hive 1.2.1 dependencies can be compatible with most of Hive > server > >> >>>>>> versions. So Spark and Presto have built-in Hive 1.2.1 > dependency. > >> >>>>>> Flink is currently mainly used for streaming, so let's not talk > >> >>>>>> about hive. > >> >>>>>> > >> >>>>>> For streaming jobs, first of all, the jobs in my mind is (related > >> to > >> >>>>>> connectors): > >> >>>>>> - ETL jobs: Kafka -> Kafka > >> >>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka > >> >>>>>> - Aggregation jobs: Kafka -> JDBCSink > >> >>>>>> So Kafka and JDBC are probably the most commonly used. Of course, > >> >>>>>> > >> >>>>>> also > >> >>>>>> > >> >>>>>> includes CSV, JSON's formats. > >> >>>>>> So when we provide such a fat distribution: > >> >>>>>> - With CSV, JSON. > >> >>>>>> - With flink-kafka-universal and kafka dependencies. > >> >>>>>> - With flink-jdbc. > >> >>>>>> Using this fat distribution, most users can run their jobs well. > >> >>>>>> > >> >>>>>> (jdbc > >> >>>>>> > >> >>>>>> driver jar required, but this is very natural to do) > >> >>>>>> Can these dependencies lead to kinds of conflicts? Only Kafka may > >> >>>>>> > >> >>>>>> have > >> >>>>>> > >> >>>>>> conflicts, but if our goal is to use kafka-universal to support > all > >> >>>>>> Kafka > >> >>>>>> versions, it is hopeful to target the vast majority of users. > >> >>>>>> > >> >>>>>> We don't want to plug all jars into the fat distribution. Only > need > >> >>>>>> less > >> >>>>>> conflict and common. of course, it is a matter of consideration > to > >> >>>>>> > >> >>>>>> put > >> >>>>>> > >> >>>>>> which jar into fat distribution. > >> >>>>>> We have the opportunity to facilitate the majority of users, but > >> >>>>>> also left > >> >>>>>> opportunities for customization. 
> >> >>>>>> > >> >>>>>> Best, > >> >>>>>> Jingsong Lee > >> >>>>>> > >> >>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imj...@gmail.com> < > >> >>>>> imj...@gmail.com> wrote: > >> >>>>>> > >> >>>>>> Hi, > >> >>>>>> > >> >>>>>> I think we should first reach an consensus on "what problem do we > >> >>>>>> want to > >> >>>>>> solve?" > >> >>>>>> (1) improve first experience? or (2) improve production > experience? > >> >>>>>> > >> >>>>>> As far as I can see, with the above discussion, I think what we > >> >>>>>> want to > >> >>>>>> solve is the "first experience". > >> >>>>>> And I think the slim jar is still the best distribution for > >> >>>>>> production, > >> >>>>>> because it's easier to assembling jars > >> >>>>>> than excluding jars and can avoid potential class conflicts. > >> >>>>>> > >> >>>>>> If we want to improve "first experience", I think it make sense > to > >> >>>>>> have a > >> >>>>>> fat distribution to give users a more smooth first experience. > >> >>>>>> But I would like to call it "playground distribution" or > something > >> >>>>>> like > >> >>>>>> that to explicitly differ from the "slim production-purpose > >> >>>>>> > >> >>>>>> distribution". > >> >>>>>> > >> >>>>>> The "playground distribution" can contains some widely used jars, > >> >>>>>> > >> >>>>>> like > >> >>>>>> > >> >>>>>> universal-kafka-sql-connector, elasticsearch7-sql-connector, > avro, > >> >>>>>> json, > >> >>>>>> csv, etc.. > >> >>>>>> Even we can provide a playground docker which may contain the fat > >> >>>>>> distribution, python3, and hive. > >> >>>>>> > >> >>>>>> Best, > >> >>>>>> Jark > >> >>>>>> > >> >>>>>> > >> >>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler < > ches...@apache.org> > >> < > >> >>>>> ches...@apache.org> > >> >>>>>> wrote: > >> >>>>>> > >> >>>>>> I don't see a lot of value in having multiple distributions. 
> >> >>>>>> > >> >>>>>> The simple reality is that no fat distribution we could provide > >> >>>>>> > >> >>>>>> would > >> >>>>>> > >> >>>>>> satisfy all use-cases, so why even try. > >> >>>>>> If users commonly run into issues for certain jars, then maybe > >> >>>>>> > >> >>>>>> those > >> >>>>>> > >> >>>>>> should be added to the current distribution. > >> >>>>>> > >> >>>>>> Personally though I still believe we should only distribute a > slim > >> >>>>>> version. I'd rather have users always add required jars to the > >> >>>>>> distribution than only when they go outside our "expected" > >> >>>>>> > >> >>>>>> use-cases. > >> >>>>>> > >> >>>>>> Then we might finally address this issue properly, i.e., tooling > to > >> >>>>>> assemble custom distributions and/or better error messages if > >> >>>>>> Flink-provided extensions cannot be found. > >> >>>>>> > >> >>>>>> On 15/04/2020 15:23, Kurt Young wrote: > >> >>>>>> > >> >>>>>> Regarding to the specific solution, I'm not sure about the "fat" > >> >>>>>> > >> >>>>>> and > >> >>>>>> > >> >>>>>> "slim" > >> >>>>>> > >> >>>>>> solution though. I get the idea > >> >>>>>> that we can make the slim one even more lightweight than current > >> >>>>>> distribution, but what about the "fat" > >> >>>>>> one? Do you mean that we would package all connectors and formats > >> >>>>>> > >> >>>>>> into > >> >>>>>> > >> >>>>>> this? I'm not sure if this is > >> >>>>>> feasible. For example, we can't put all versions of kafka and > hive > >> >>>>>> connector jars into lib directory, and > >> >>>>>> we also might need hadoop jars when using filesystem connector to > >> >>>>>> > >> >>>>>> access > >> >>>>>> > >> >>>>>> data from HDFS. 
> >> >>>>>> > >> >>>>>> So my guess would be we might hand-pick some of the most > >> >>>>>> > >> >>>>>> frequently > >> >>>>>> > >> >>>>>> used > >> >>>>>> > >> >>>>>> connectors and formats > >> >>>>>> into our "lib" directory, like kafka, csv, json metioned above, > >> >>>>>> > >> >>>>>> and > >> >>>>>> > >> >>>>>> still > >> >>>>>> > >> >>>>>> leave some other connectors out of it. > >> >>>>>> If this is the case, then why not we just provide this > >> >>>>>> > >> >>>>>> distribution > >> >>>>>> > >> >>>>>> to > >> >>>>>> > >> >>>>>> user? I'm not sure i get the benefit of > >> >>>>>> providing another super "slim" jar (we have to pay some costs to > >> >>>>>> > >> >>>>>> provide > >> >>>>>> > >> >>>>>> another suit of distribution). > >> >>>>>> > >> >>>>>> What do you think? > >> >>>>>> > >> >>>>>> Best, > >> >>>>>> Kurt > >> >>>>>> > >> >>>>>> > >> >>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li < > >> >>>>>> > >> >>>>>> jingsongl...@gmail.com > >> >>>>>> > >> >>>>>> wrote: > >> >>>>>> > >> >>>>>> Big +1. > >> >>>>>> > >> >>>>>> I like "fat" and "slim". > >> >>>>>> > >> >>>>>> For csv and json, like Jark said, they are quite small and don't > >> >>>>>> > >> >>>>>> have > >> >>>>>> > >> >>>>>> other > >> >>>>>> > >> >>>>>> dependencies. They are important to kafka connector, and > >> >>>>>> > >> >>>>>> important > >> >>>>>> > >> >>>>>> to upcoming file system connector too. > >> >>>>>> So can we move them to both "fat" and "slim"? They're so > >> >>>>>> > >> >>>>>> important, > >> >>>>>> > >> >>>>>> and > >> >>>>>> > >> >>>>>> they're so lightweight. > >> >>>>>> > >> >>>>>> Best, > >> >>>>>> Jingsong Lee > >> >>>>>> > >> >>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfre...@gmail.com> > < > >> >>>>> godfre...@gmail.com> > >> >>>>>> wrote: > >> >>>>>> > >> >>>>>> Big +1. > >> >>>>>> This will improve user experience (special for Flink new users). > >> >>>>>> We answered so many questions about "class not found". 
> >> >>>>>> > >> >>>>>> Best, > >> >>>>>> Godfrey > >> >>>>>> > >> >>>>>> Dian Fu <dian0511...@gmail.com> <dian0511...@gmail.com> > >> 于2020年4月15日周三 > >> >>>>> 下午4:30写道: > >> >>>>>> > >> >>>>>> +1 to this proposal. > >> >>>>>> > >> >>>>>> Missing connector jars is also a big problem for PyFlink users. > >> >>>>>> > >> >>>>>> Currently, > >> >>>>>> > >> >>>>>> after a Python user has installed PyFlink using `pip`, he has > >> >>>>>> > >> >>>>>> to > >> >>>>>> > >> >>>>>> manually > >> >>>>>> > >> >>>>>> copy the connector fat jars to the PyFlink installation > >> >>>>>> > >> >>>>>> directory > >> >>>>>> > >> >>>>>> for > >> >>>>>> > >> >>>>>> the > >> >>>>>> > >> >>>>>> connectors to be used if he wants to run jobs locally. This > >> >>>>>> > >> >>>>>> process > >> >>>>>> > >> >>>>>> is > >> >>>>>> > >> >>>>>> very > >> >>>>>> > >> >>>>>> confuse for users and affects the experience a lot. > >> >>>>>> > >> >>>>>> Regards, > >> >>>>>> Dian > >> >>>>>> > >> >>>>>> > >> >>>>>> 在 2020年4月15日,下午3:51,Jark Wu <imj...@gmail.com> <imj...@gmail.com > > > >> 写道: > >> >>>>>> > >> >>>>>> +1 to the proposal. I also found the "download additional jar" > >> >>>>>> > >> >>>>>> step > >> >>>>>> > >> >>>>>> is > >> >>>>>> > >> >>>>>> really verbose when I prepare webinars. > >> >>>>>> > >> >>>>>> At least, I think the flink-csv and flink-json should in the > >> >>>>>> > >> >>>>>> distribution, > >> >>>>>> > >> >>>>>> they are quite small and don't have other dependencies. > >> >>>>>> > >> >>>>>> Best, > >> >>>>>> Jark > >> >>>>>> > >> >>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjf...@gmail.com> < > >> >>>>> zjf...@gmail.com> > >> >>>>>> wrote: > >> >>>>>> > >> >>>>>> Hi Aljoscha, > >> >>>>>> > >> >>>>>> Big +1 for the fat flink distribution, where do you plan to > >> >>>>>> > >> >>>>>> put > >> >>>>>> > >> >>>>>> these > >> >>>>>> > >> >>>>>> connectors ? opt or lib ? 
> >> >>>>>> > >> >>>>>> Aljoscha Krettek <aljos...@apache.org> <aljos...@apache.org> > >> >>>>> 于2020年4月15日周三 > >> >>>>>> 下午3:30写道: > >> >>>>>> > >> >>>>>> > >> >>>>>> Hi Everyone, > >> >>>>>> > >> >>>>>> I'd like to discuss about releasing a more full-featured > >> >>>>>> > >> >>>>>> Flink > >> >>>>>> > >> >>>>>> distribution. The motivation is that there is friction for > >> >>>>>> > >> >>>>>> SQL/Table > >> >>>>>> > >> >>>>>> API > >> >>>>>> > >> >>>>>> users that want to use Table connectors which are not there > >> >>>>>> > >> >>>>>> in > >> >>>>>> > >> >>>>>> the > >> >>>>>> > >> >>>>>> current Flink Distribution. For these users the workflow is > >> >>>>>> > >> >>>>>> currently > >> >>>>>> > >> >>>>>> roughly: > >> >>>>>> > >> >>>>>> - download Flink dist > >> >>>>>> - configure csv/Kafka/json connectors per configuration > >> >>>>>> - run SQL client or program > >> >>>>>> - decrypt error message and research the solution > >> >>>>>> - download additional connector jars > >> >>>>>> - program works correctly > >> >>>>>> > >> >>>>>> I realize that this can be made to work but if every SQL > >> >>>>>> > >> >>>>>> user > >> >>>>>> > >> >>>>>> has > >> >>>>>> > >> >>>>>> this > >> >>>>>> > >> >>>>>> as their first experience that doesn't seem good to me. > >> >>>>>> > >> >>>>>> My proposal is to provide two versions of the Flink > >> >>>>>> > >> >>>>>> Distribution > >> >>>>>> > >> >>>>>> in > >> >>>>>> > >> >>>>>> the > >> >>>>>> > >> >>>>>> future: "fat" and "slim" (names to be discussed): > >> >>>>>> > >> >>>>>> - slim would be even trimmer than todays distribution > >> >>>>>> - fat would contain a lot of convenience connectors (yet > >> >>>>>> > >> >>>>>> to > >> >>>>>> > >> >>>>>> be > >> >>>>>> > >> >>>>>> determined which one) > >> >>>>>> > >> >>>>>> And yes, I realize that there are already more dimensions of > >> >>>>>> > >> >>>>>> Flink > >> >>>>>> > >> >>>>>> releases (Scala version and Java version). 
> >> >>>>>> > >> >>>>>> For background, our current Flink dist has these in the opt > >> >>>>>> > >> >>>>>> directory: > >> >>>>>> > >> >>>>>> - flink-azure-fs-hadoop-1.10.0.jar > >> >>>>>> - flink-cep-scala_2.12-1.10.0.jar > >> >>>>>> - flink-cep_2.12-1.10.0.jar > >> >>>>>> - flink-gelly-scala_2.12-1.10.0.jar > >> >>>>>> - flink-gelly_2.12-1.10.0.jar > >> >>>>>> - flink-metrics-datadog-1.10.0.jar > >> >>>>>> - flink-metrics-graphite-1.10.0.jar > >> >>>>>> - flink-metrics-influxdb-1.10.0.jar > >> >>>>>> - flink-metrics-prometheus-1.10.0.jar > >> >>>>>> - flink-metrics-slf4j-1.10.0.jar > >> >>>>>> - flink-metrics-statsd-1.10.0.jar > >> >>>>>> - flink-oss-fs-hadoop-1.10.0.jar > >> >>>>>> - flink-python_2.12-1.10.0.jar > >> >>>>>> - flink-queryable-state-runtime_2.12-1.10.0.jar > >> >>>>>> - flink-s3-fs-hadoop-1.10.0.jar > >> >>>>>> - flink-s3-fs-presto-1.10.0.jar > >> >>>>>> - > >> >>>>>> > >> >>>>>> flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar > >> >>>>>> > >> >>>>>> - flink-sql-client_2.12-1.10.0.jar > >> >>>>>> - flink-state-processor-api_2.12-1.10.0.jar > >> >>>>>> - flink-swift-fs-hadoop-1.10.0.jar > >> >>>>>> > >> >>>>>> Current Flink dist is 267M. If we removed everything from > >> >>>>>> > >> >>>>>> opt > >> >>>>>> > >> >>>>>> we > >> >>>>>> > >> >>>>>> would > >> >>>>>> > >> >>>>>> go down to 126M. I would reccomend this, because the large > >> >>>>>> > >> >>>>>> majority > >> >>>>>> > >> >>>>>> of > >> >>>>>> > >> >>>>>> the files in opt are probably unused. > >> >>>>>> > >> >>>>>> What do you think? > >> >>>>>> > >> >>>>>> Best, > >> >>>>>> Aljoscha > >> >>>>>> > >> >>>>>> > >> >>>>>> > >> >>>>>> -- > >> >>>>>> Best Regards > >> >>>>>> > >> >>>>>> Jeff Zhang > >> >>>>>> > >> >>>>>> > >> >>>>>> -- > >> >>>>>> Best, Jingsong Lee > >> >>>>>> > >> >>>>>> > >> >>>>>> -- > >> >>>>>> Best, Jingsong Lee > >> >>>>>> > >> >>>>>> > >> >>>>>> > >> >>>>>> > >> >> > >> > >> > > > > -- > > Best, Jingsong Lee > > > > > -- > Best, Jingsong Lee >
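[Editor's illustrative sketch, not part of the original thread.] Aljoscha's suggestion above is that the sql-client launcher script could pick connector/format jars up from opt/ instead of requiring them in lib/. A minimal sketch of that discovery step is below; the `FLINK_OPT_DIR` variable, the function name, and the jar patterns are assumptions for illustration only, not the actual Flink launcher code.

```shell
# Sketch: collect the small, dependency-free format jars from opt/ into a
# classpath fragment, the way a sql-client wrapper script might.
FLINK_OPT_DIR="${FLINK_OPT_DIR:-./opt}"

build_format_classpath() {
    cp=""
    # Look for the three format jars discussed in the thread.
    for pattern in flink-csv flink-json flink-avro; do
        for jar in "$FLINK_OPT_DIR"/${pattern}-*.jar; do
            # Skip the literal pattern when the glob matched nothing.
            [ -e "$jar" ] || continue
            # Append with ':' separator, but no leading ':' on the first jar.
            cp="${cp:+$cp:}$jar"
        done
    done
    printf '%s\n' "$cp"
}

build_format_classpath
```

The same loop could be extended to SQL connectors, which is essentially the "pick up from opt/" tooling the thread describes.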