A similar issue exists for the Dockerfiles. I also heard the same feedback from various users, for example: why don't we simply include all FS connectors in the images by default?
I actually like the idea of having a slim and a fat/convenience Dockerfile:
- If you build a clean production image, you start with slim and add the jars you need.
- If you just want to get started and play around, it is nice to have many popular connectors directly available.

Even if this only meets the 90% of popular cases, that is a good win. Users are not code, after all; the simplest minimal solution is not always what resonates best with them.

On Fri, Apr 17, 2020 at 5:22 AM Jark Wu <imj...@gmail.com> wrote:

Hi,

I like the idea of a web tool to assemble a fat distribution, and https://code.quarkus.io/ looks very nice. All the users need to do is select what they need (I think this step can't be omitted anyway). We can also provide a default fat distribution on the web which pre-selects some popular connectors.

Best,
Jark

On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <rafi.ar...@gmail.com> wrote:

As a reference for a nice first experience I had, take a look at https://code.quarkus.io/
You reach this page after you click "Start Coding" on the project homepage.

Rafi

On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <ykt...@gmail.com> wrote:

I'm not saying that pre-bundling some jars will make this problem go away, and you're right that it only hides the problem for some users. But what if this solution can hide the problem for 90% of users? Wouldn't that be good enough for us to try?

Regarding "would users following instructions really be such a big problem?": I'm afraid yes. Otherwise I wouldn't have answered such questions at least a dozen times, and I wouldn't see such questions coming up from time to time. During some periods, I even saw such questions every day.
Best,
Kurt

On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <ches...@apache.org> wrote:

The problem with having a distribution with "popular" stuff is that it doesn't really *solve* a problem, it just hides it for users who fall into these particular use-cases. Move out of them and you once again run into the exact same problems outlined.

This is exactly why I like the tooling approach; you have to deal with it from the start, and transitioning to a custom use-case is easier.

Would users following instructions really be such a big problem? I would expect that users generally know *what* they need, just not necessarily how it is assembled correctly (where to get which jar, which directory to put it in). It seems like these are exactly the problems this would solve? I just don't see how moving a jar corresponding to some feature from opt to some directory (lib/plugins) is less error-prone than selecting the feature and having the tool handle the rest.

As for re-distributions, it depends on the form the tool would take. It could be an application that runs locally and works against Maven Central (note: not necessarily *using* Maven); this should work in China, no?

A web tool would of course be fancy, but I don't know how feasible this is with the ASF infrastructure. You wouldn't be able to mirror the distribution, so the load can't be distributed. I doubt INFRA would like this.

Note that third parties could also start distributing use-case-oriented distributions, which would be perfectly fine as far as I'm concerned.

On 16/04/2020 16:57, Kurt Young wrote:

I'm not so sure about the web tool solution though.
The concern I have with this approach is that the final generated distribution is kind of non-deterministic. We might generate too many different combinations when users try to package different types of connectors, formats, and maybe even Hadoop releases. As far as I can tell, most open source projects and Apache projects only release some pre-defined distributions, which most users are already familiar with and which are thus hard to change, IMO. I have also seen cases where users try to re-distribute the release package because of unstable network access to the Apache website from China. With a web tool solution, I don't think this kind of re-distribution would be possible anymore.

In the meantime, I also have a concern that we will fall back into our trap again if we try to offer this smart & flexible solution, because it needs users to cooperate with such a mechanism. It's exactly the situation that we currently fell into:
1. We offered a smart solution.
2. We hope users will follow the correct instructions.
3. Everything will work as expected if users followed the right instructions.

In reality, I suspect not all users will do the second step correctly. And for new users who are only trying to have a quick experience with Flink, I would bet most will do it wrong.

So, my proposal would be one of the following 2 options:
1. Provide a slim distribution for advanced production users and a distribution which has some popular built-in jars.
2. Only provide a distribution which has some popular built-in jars.

If we are trying to reduce the distributions we release, I would prefer 2 over 1.
Best,
Kurt

On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <trohrm...@apache.org> wrote:

I think what Chesnay and Dawid proposed would be the ideal solution. Ideally, we would also have a nice web tool for the website which generates the corresponding distribution for download.

To get things started, we could begin with the script only supporting downloading/creating the "fat" version. The fat version would then consist of the slim distribution plus whatever we deem important for new users to get started.

Cheers,
Till

On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:

Hi all,

A few points from my side:

1. I like the idea of simplifying the experience for first-time users. As for production use cases, I share Jark's opinion that there I would expect users to assemble their distribution manually. I think in such scenarios it is important to understand the interconnections. Personally, I'd expect the slimmest possible distribution that I can extend further with what I need in my production scenario.

2. I think there is also the problem that the matrix of possible combinations that could be useful is already big. Do we want to have a distribution for:

- SQL users: which connectors should we include? Should we include Hive? Which other catalogs?
- DataStream users: which connectors should we include?
- For both of the above: should we include YARN/Kubernetes?

I would opt for providing only the "slim" distribution as a release artifact.

3.
However, as I said, I think it's worth investigating how we can improve the user experience. What do you think of providing a tool, e.g. a shell script, that constructs a distribution based on the user's choices? I think that is also what Chesnay mentioned as "tooling to assemble custom distributions". In the end, the difference I see between a slim and a fat distribution is which jars we put into lib, right? The tool could have a few "screens":

1. Which API are you interested in?
   a. SQL API
   b. DataStream API

2. [SQL] Which connectors do you want to use? [multichoice]
   a. Kafka
   b. Elasticsearch
   ...

3. [SQL] Which catalog do you want to use?

...

Such a tool would download all the dependencies from Maven and put them into the correct folder. In the future we could extend it with additional rules, e.g. kafka-0.9 cannot be chosen at the same time as kafka-universal, etc.

The benefit would be that the distribution we release could remain "slim", or we could even make it slimmer. I might be missing something here, though.

Best,

Dawid

On 16/04/2020 11:02, Aljoscha Krettek wrote:

I want to reinforce my opinion from earlier: this is about improving the situation both for first-time users and for experienced users that want to use a Flink dist in production. The current Flink dist is too "thin" for first-time SQL users and too "fat" for production users, so we are serving no-one properly with the current middle ground. That's why I think introducing those specialized "spins" of Flink dist would be good.
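A minimal sketch of what Dawid's shell-script assembly tool could look like. This is illustrative only: the selection list, the assumption that each choice maps to one `org.apache.flink` artifact on Maven Central, and the target path are all assumptions, not an agreed design.

```shell
#!/usr/bin/env sh
# Illustrative sketch of a distribution-assembly script; not an agreed design.
# Assumes each selected feature maps to a single org.apache.flink artifact
# published on Maven Central (artifact names below are examples).

FLINK_VERSION="1.10.0"

# Build the Maven Central download URL for group:artifact:version.
maven_url() {
  group_path=$(printf '%s' "$1" | tr '.' '/')
  printf 'https://repo1.maven.org/maven2/%s/%s/%s/%s-%s.jar\n' \
    "$group_path" "$2" "$3" "$2" "$3"
}

# In a real tool, this list would come from the interactive "screens".
SELECTED="flink-json flink-csv flink-sql-connector-kafka_2.11"

for artifact in $SELECTED; do
  url=$(maven_url org.apache.flink "$artifact" "$FLINK_VERSION")
  echo "would fetch $url into lib/"
  # e.g.: curl -sSfL "$url" -o "flink-dist/lib/$artifact-$FLINK_VERSION.jar"
done
```

The actual download is left commented out; compatibility rules (e.g. mutually exclusive Kafka connector versions) would be checks on the selection list before the download loop.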
By the way, at some point in the future production users might not even need to get a Flink dist anymore. They should be able to have Flink as a dependency of their project (including the runtime) and then build an image from this for Kubernetes, or a fat jar for YARN.

Aljoscha

On 15.04.20 18:14, wenlong.lwl wrote:

Hi all,

Regarding slim and fat distributions, I think different kinds of jobs may prefer different types of distribution:

For DataStream jobs, I think we may not want a fat distribution containing connectors, because the user always needs to depend on the connector in user code anyway, and it is easy to include the connector jar in the user lib. Fewer jars in lib means fewer class conflicts and problems.

For SQL jobs, I think we are trying to encourage users to use pure SQL (DDL + DML) to construct their jobs. In order to improve the user experience, it may be important for Flink not only to provide as many connector jars in the distribution as possible (especially the connectors and formats we have well documented), but also to provide a mechanism to load connectors according to the DDLs.

So I think it could be good to place connector/format jars in some dir like opt/connector, which would not affect jobs by default, and introduce a mechanism of dynamic discovery for SQL.

Best,
Wenlong

On Wed, 15 Apr 2020 at 22:46, Jingsong Li <jingsongl...@gmail.com> wrote:

Hi,

I am thinking about both "improve the first experience" and "improve the production experience".

I'm wondering: what is the common mode of using Flink?
Streaming jobs using Kafka? Batch jobs using Hive?

Hive 1.2.1 dependencies can be compatible with most Hive server versions, so Spark and Presto ship with a built-in Hive 1.2.1 dependency. Flink is currently mainly used for streaming, so let's not talk about Hive.

For streaming jobs, the jobs in my mind are (with respect to connectors):
- ETL jobs: Kafka -> Kafka
- Join jobs: Kafka -> DimJDBC -> Kafka
- Aggregation jobs: Kafka -> JDBCSink

So Kafka and JDBC are probably the most commonly used. Of course, this also includes the CSV and JSON formats.

So suppose we provide a fat distribution:
- with CSV and JSON;
- with flink-kafka-universal and its Kafka dependencies;
- with flink-jdbc.

Using this fat distribution, most users can run their jobs well (a JDBC driver jar is still required, but that is very natural to add). Can these dependencies lead to conflicts? Only Kafka may have conflicts, but if our goal is to use kafka-universal to support all Kafka versions, we can hope to cover the vast majority of users.

We don't want to put every jar into the fat distribution; only common, low-conflict ones. Of course, which jars to put into the fat distribution is a matter for consideration. We have the opportunity to make things convenient for the majority of users while still leaving room for customization.

Best,
Jingsong Lee

On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imj...@gmail.com> wrote:

Hi,

I think we should first reach a consensus on "what problem do we want to solve?":
(1) improve the first experience?
Or (2) improve the production experience?

As far as I can tell from the discussion above, what we want to solve is the "first experience". And I think the slim jar is still the best distribution for production, because it's easier to assemble jars than to exclude jars, and this avoids potential class conflicts.

If we want to improve the "first experience", I think it makes sense to have a fat distribution to give users a smoother first experience. But I would like to call it a "playground distribution" or something like that, to explicitly distinguish it from the "slim production-purpose distribution". The "playground distribution" could contain some widely used jars, like the universal-kafka-sql-connector, elasticsearch7-sql-connector, avro, json, csv, etc. We could even provide a playground Docker image which contains the fat distribution, python3, and Hive.

Best,
Jark

On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ches...@apache.org> wrote:

I don't see a lot of value in having multiple distributions.

The simple reality is that no fat distribution we could provide would satisfy all use-cases, so why even try? If users commonly run into issues for certain jars, then maybe those should be added to the current distribution.

Personally though, I still believe we should only distribute a slim version. I'd rather have users always add required jars to the distribution than only when they go outside our "expected" use-cases.
Then we might finally address this issue properly, i.e., tooling to assemble custom distributions and/or better error messages if Flink-provided extensions cannot be found.

On 15/04/2020 15:23, Kurt Young wrote:

Regarding the specific solution, I'm not sure about the "fat" and "slim" approach though. I get the idea that we can make the slim one even more lightweight than the current distribution, but what about the "fat" one? Do you mean that we would package all connectors and formats into it? I'm not sure that is feasible. For example, we can't put all versions of the Kafka and Hive connector jars into the lib directory, and we also might need Hadoop jars when using the filesystem connector to access data on HDFS.

So my guess would be that we hand-pick some of the most frequently used connectors and formats for the lib directory, like the kafka, csv, and json ones mentioned above, and still leave some other connectors out. If this is the case, then why don't we just provide that one distribution to users? I'm not sure I see the benefit of providing another super "slim" jar (we would have to pay some cost to maintain another suite of distributions).

What do you think?

Best,
Kurt

On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <jingsongl...@gmail.com> wrote:

Big +1.

I like "fat" and "slim".
For csv and json, like Jark said, they are quite small and don't have other dependencies. They are important to the Kafka connector, and important to the upcoming file system connector too. So can we put them into both "fat" and "slim"? They're so important, and they're so lightweight.

Best,
Jingsong Lee

On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfre...@gmail.com> wrote:

Big +1.
This will improve the user experience (especially for new Flink users). We have answered so many questions about "class not found".

Best,
Godfrey

Dian Fu <dian0511...@gmail.com> wrote on Wed, Apr 15, 2020 at 4:30 PM:

+1 to this proposal.

Missing connector jars is also a big problem for PyFlink users. Currently, after a Python user has installed PyFlink using `pip`, they have to manually copy the connector fat jars into the PyFlink installation directory for the connectors to be usable when running jobs locally. This process is very confusing for users and affects the experience a lot.

Regards,
Dian

On Apr 15, 2020, at 3:51 PM, Jark Wu <imj...@gmail.com> wrote:

+1 to the proposal. I have also found the "download additional jars" step to be really verbose when preparing webinars.
At least, I think flink-csv and flink-json should be in the distribution; they are quite small and don't have other dependencies.

Best,
Jark

On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjf...@gmail.com> wrote:

Hi Aljoscha,

Big +1 for the fat Flink distribution. Where do you plan to put these connectors? opt or lib?

Aljoscha Krettek <aljos...@apache.org> wrote on Wed, Apr 15, 2020 at 3:30 PM:

Hi Everyone,

I'd like to discuss releasing a more full-featured Flink distribution. The motivation is that there is friction for SQL/Table API users that want to use Table connectors which are not in the current Flink distribution. For these users the workflow is currently roughly:

- download Flink dist
- configure csv/Kafka/json connectors per configuration
- run SQL client or program
- decipher the error message and research the solution
- download additional connector jars
- program works correctly

I realize that this can be made to work, but if every SQL user has this as their first experience, that doesn't seem good to me.
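Concretely, the "download additional connector jars" step in the workflow above means locating the right artifacts on Maven Central by hand. A sketch of what a user ends up doing (the artifact names and version are examples, not a prescribed list; the real ones depend on the job):

```shell
# Sketch of the manual fix-up a SQL user performs today.
# Artifact names/version are examples; the real ones depend on the job.
FLINK_VERSION="1.10.0"
MAVEN_BASE="https://repo1.maven.org/maven2/org/apache/flink"

for artifact in flink-json flink-csv flink-sql-connector-kafka_2.11; do
  jar_url="$MAVEN_BASE/$artifact/$FLINK_VERSION/$artifact-$FLINK_VERSION.jar"
  echo "download $jar_url into lib/ before restarting the SQL client"
  # e.g.: wget -P lib/ "$jar_url"
done
```

Each wrong guess here typically surfaces only as a "class not found" or missing-factory error at runtime, which is the friction being discussed.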
My proposal is to provide two versions of the Flink distribution in the future: "fat" and "slim" (names to be discussed):

- slim would be even trimmer than today's distribution
- fat would contain a lot of convenience connectors (yet to be determined which ones)

And yes, I realize that there are already more dimensions of Flink releases (Scala version and Java version).

For background, our current Flink dist has these in the opt directory:

- flink-azure-fs-hadoop-1.10.0.jar
- flink-cep-scala_2.12-1.10.0.jar
- flink-cep_2.12-1.10.0.jar
- flink-gelly-scala_2.12-1.10.0.jar
- flink-gelly_2.12-1.10.0.jar
- flink-metrics-datadog-1.10.0.jar
- flink-metrics-graphite-1.10.0.jar
- flink-metrics-influxdb-1.10.0.jar
- flink-metrics-prometheus-1.10.0.jar
- flink-metrics-slf4j-1.10.0.jar
- flink-metrics-statsd-1.10.0.jar
- flink-oss-fs-hadoop-1.10.0.jar
- flink-python_2.12-1.10.0.jar
- flink-queryable-state-runtime_2.12-1.10.0.jar
- flink-s3-fs-hadoop-1.10.0.jar
- flink-s3-fs-presto-1.10.0.jar
- flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
- flink-sql-client_2.12-1.10.0.jar
- flink-state-processor-api_2.12-1.10.0.jar
- flink-swift-fs-hadoop-1.10.0.jar

The current Flink dist is 267M. If we removed everything from opt, we would go down to 126M. I would recommend this, because the large majority of the files in opt are probably unused.

What do you think?
Best,
Aljoscha

--
Best Regards

Jeff Zhang

--
Best, Jingsong Lee

--
Best, Jingsong Lee
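As a footnote to the opt/ listing in the original mail and the opt-vs-lib point raised later in the thread: enabling one of those optional components today amounts to a file move from opt/ to lib/. Demonstrated here on a mock layout (the jar is an empty stand-in, and whether a given component belongs in lib/ or plugins/ varies):

```shell
# Demonstrate the current opt/ -> lib/ mechanism on a mock dist layout;
# the jar file here is an empty stand-in for the real one.
mkdir -p flink-dist-demo/opt flink-dist-demo/lib
touch flink-dist-demo/opt/flink-metrics-prometheus-1.10.0.jar

# "Activating" the optional component is just a copy into lib/:
cp flink-dist-demo/opt/flink-metrics-prometheus-1.10.0.jar flink-dist-demo/lib/
ls flink-dist-demo/lib
```

This is the step the thread argues is error-prone for newcomers, and what both the fat distribution and the assembly-tool proposals aim to remove.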