Hi Jingsong! It sounds like with two pre-bundled versions (Hive 1.2.1 and Hive 2.3.6) you can cover a lot of versions.
Would it make sense to add these to flink-shaded (with proper exclusions
of unnecessary dependencies) and offer them as downloads, similar to how
we offer pre-shaded Hadoop downloads?

Best,
Stephan

On Thu, Feb 6, 2020 at 10:26 AM Jingsong Li <jingsongl...@gmail.com> wrote:

> Hi Stephan,
>
> The hive/lib/ directory has many jars; it covers execution, the
> metastore, the Hive client, and everything else.
> What we really depend on is hive-exec.jar (hive-metastore.jar is also
> required for low Hive versions).
> And hive-exec.jar is an uber jar; we only want about half of its
> classes. Those classes are not so clean, but it is OK to have them.
>
> Our current solution:
> - exclude Hive jars from the build
> - document dependencies for 8 versions; users choose by their Hive
> version. [1]
>
> Spark's solution:
> - build in Hive 1.2.1 dependencies to support Hive 0.12.0 through
> 2.3.3. [2]
> - its hive-exec.jar is hive-exec.spark.jar: Spark has modified the
> hive-exec build pom to exclude unnecessary classes, including ORC and
> Parquet.
> - build in ORC and Parquet dependencies to optimize performance.
> - support Hive versions above 2.3.3 via "mvn install -Phive-2.3",
> which builds in hive-exec-2.3.6.jar. It seems that since this version,
> Hive's API has become seriously incompatible.
> Most users are on Hive 0.12.0 through 2.3.3, so the default Spark
> build is good for most of them.
>
> Presto's solution:
> - build in Presto's own copy of Hive. [3] Shade the Hive classes, but
> not the thrift classes.
> - rewrite some client-related code to work around various issues.
> This approach is the heaviest, but also the cleanest. It can support
> all Hive versions with a single build.
>
> So I think we can do the following:
>
> - The eight dependency sets we now maintain are too many. I think we
> can move in the direction of Presto/Spark and try to reduce the number
> of supported dependency versions.
>
> - As you said, between providing fat/uber jars or a helper script, I
> prefer uber jars: users can download a single jar to get started. Just
> like Kafka.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
> [2]
> https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore
> [3] https://github.com/prestodb/presto-hive-apache
>
> Best,
> Jingsong Lee
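
A minimal sketch of the metastore-only style of integration described
above for Presto (hedged: the endpoint is a placeholder, and this
assumes the Hive metastore client jar and its thrift dependencies are
on the classpath):

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

// Talks to the Hive Metastore purely over thrift; none of hive-exec's
// execution classes are involved.
public class MetastoreOnlyExample {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        // "thrift://metastore-host:9083" is a placeholder endpoint.
        conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");

        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
            for (String db : client.getAllDatabases()) {
                System.out.println("database: " + db);
            }
            Table t = client.getTable("default", "some_table");
            System.out.println("location: " + t.getSd().getLocation());
        } finally {
            client.close();
        }
    }
}

The heavy part of Presto's approach is not this client code but shading
it, so that the bundled thrift/Hive classes cannot clash with whatever
else is on the classpath.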
>
> On Wed, Feb 5, 2020 at 10:15 PM Stephan Ewen <se...@apache.org> wrote:
>
>> Some thoughts about other options we have:
>>
>> - Put fat/shaded jars for the common versions into "flink-shaded" and
>> offer them for download on the website, similar to the pre-bundled
>> Hadoop versions.
>>
>> - Look at the Presto code (metastore protocol) and see if we can
>> reuse that.
>>
>> - Have a setup helper script that takes the versions and pulls the
>> required dependencies.
>>
>> Can you share how a "built-in" dependency could work, if there are so
>> many different conflicting versions?
>>
>> Thanks,
>> Stephan
>>
>>
>> On Tue, Feb 4, 2020 at 12:59 PM Rui Li <li...@apache.org> wrote:
>>
>>> Hi Stephan,
>>>
>>> As Jingsong stated, in our documentation the recommended way to add
>>> Hive deps is to use exactly what users have installed. It's just
>>> that we ask users to manually add those jars, instead of
>>> automatically finding them based on env variables. I prefer to keep
>>> it this way for a while, and see if there are real
>>> concerns/complaints in user feedback.
>>>
>>> Please also note the Hive jars are not the only ones needed to
>>> integrate with Hive: users have to make sure flink-connector-hive
>>> and Hadoop jars are on the classpath too. So I'm afraid a single
>>> "HIVE" env variable wouldn't save all the manual work for our users.
>>>
>>> On Tue, Feb 4, 2020 at 5:54 PM Jingsong Li <jingsongl...@gmail.com>
>>> wrote:
>>>
>>> > Hi all,
>>> >
>>> > For your information, we have documented the dependencies in
>>> > detail [1]. I think it's a lot clearer than before, but still
>>> > worse than Presto and Spark (they avoid the Hive dependency or
>>> > build it in).
>>> >
>>> > I thought about Stephan's suggestion:
>>> > - hive/lib has 200+ jars, but we only need hive-exec.jar plus two
>>> > or three others; if that many jars are introduced, there could be
>>> > big conflicts.
>>> > - And hive/lib is not available on every machine, so we would need
>>> > to upload a lot of jars.
>>> > - A separate classloader may be hard to make work too: our
>>> > flink-connector-hive needs the Hive jars, so we might have to
>>> > treat the flink-connector-hive jar specially as well.
>>> > CC: Rui Li
>>> >
>>> > I think the system that integrates best with Hive is Presto, which
>>> > only connects to the Hive metastore through the thrift protocol.
>>> > But I understand that rewriting the code that way costs a lot.
>>> >
>>> > [1]
>>> > https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
>>> >
>>> > Best,
>>> > Jingsong Lee
>>> >
>>> > On Tue, Feb 4, 2020 at 1:44 AM Stephan Ewen <se...@apache.org> wrote:
>>> >
>>> >> We have had much trouble in the past from "too deep, too custom"
>>> >> integrations that everyone got out of the box, i.e., Hadoop.
>>> >> Flink has such a broad spectrum of use cases that if we ship a
>>> >> custom build for every other framework in that spectrum, we'll be
>>> >> in trouble.
>>> >>
>>> >> So I would also be -1 for custom builds.
>>> >>
>>> >> Couldn't we do something similar to what we started doing for
>>> >> Hadoop? Moving away from convenience downloads to allowing users
>>> >> to "export" their setup for Flink?
>>> >>
>>> >> - We can have a "hive module (loader)" in flink/lib by default.
>>> >> - The module loader would look for an environment variable like
>>> >> "HIVE_CLASSPATH" and load those classes (ideally in a separate
>>> >> classloader).
>>> >> - The loader can search for certain classes and, when it finds
>>> >> them, instantiate the catalog / functions / etc. and the hive
>>> >> module referencing them.
>>> >> - That way, we use exactly what users have installed, without
>>> >> needing to build our own bundles.
>>> >>
>>> >> Could that work?
>>> >>
>>> >> Best,
>>> >> Stephan
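
A rough sketch of what such a module loader could look like (purely
illustrative: the environment variable name and the probe class are
placeholders, and none of this is existing Flink API):

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

// Picks up the user's Hive jars from an env variable and probes them in
// an isolated classloader; the hive module would only be activated when
// the probe succeeds.
public final class HiveModuleLoader {
    private static final String ENV_VAR = "HIVE_CLASSPATH"; // placeholder
    private static final String PROBE_CLASS =
            "org.apache.hadoop.hive.metastore.HiveMetaStoreClient";

    public static ClassLoader loadHiveClassLoader() throws Exception {
        String cp = System.getenv(ENV_VAR);
        if (cp == null || cp.isEmpty()) {
            return null; // no Hive setup exported; the module stays inactive
        }
        List<URL> urls = new ArrayList<>();
        for (String entry : cp.split(File.pathSeparator)) {
            urls.add(new File(entry).toURI().toURL());
        }
        // Separate classloader, so the user's Hive jars do not leak into
        // the classpath of flink/lib.
        URLClassLoader hiveLoader = new URLClassLoader(
                urls.toArray(new URL[0]), HiveModuleLoader.class.getClassLoader());
        // Probe without initializing; a ClassNotFoundException here means
        // the exported classpath is not a usable Hive setup.
        Class.forName(PROBE_CLASS, false, hiveLoader);
        return hiveLoader;
    }
}

Note Rui Li's caveat earlier in the thread: flink-connector-hive and the
Hadoop jars would also have to be visible alongside whatever such a
loader picks up.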
>>> >>
>>> >> On Wed, Dec 18, 2019 at 9:43 AM Till Rohrmann <trohrm...@apache.org>
>>> >> wrote:
>>> >>
>>> >> > Couldn't we simply document which jars are in the pre-built
>>> >> > convenience jars that can be downloaded from the website? Then
>>> >> > people who need a custom version would know which jars they have
>>> >> > to provide to Flink.
>>> >> >
>>> >> > Cheers,
>>> >> > Till
>>> >> >
>>> >> > On Tue, Dec 17, 2019 at 6:49 PM Bowen Li <bowenl...@gmail.com>
>>> >> > wrote:
>>> >> >
>>> >> > > I'm not sure providing an uber jar would be possible.
>>> >> > >
>>> >> > > Unlike the Kafka and Elasticsearch connectors, which depend on
>>> >> > > a specific Kafka/Elastic version, or the universal Kafka
>>> >> > > connector, which provides good compatibility, the Hive
>>> >> > > connector needs to deal with Hive jars across all 1.x, 2.x,
>>> >> > > and 3.x versions (let alone all the HDP/CDH distributions),
>>> >> > > with incompatibilities even between minor versions, and with
>>> >> > > differently versioned Hadoop and other extra dependency jars
>>> >> > > for each Hive version.
>>> >> > >
>>> >> > > Besides, users usually need to be able to easily see which
>>> >> > > individual jars are required, which is invisible in an uber
>>> >> > > jar. Hive users already have their Hive deployments. They
>>> >> > > usually have to use their own Hive jars because, unlike the
>>> >> > > Hive jars on mvn, their jars contain changes made in-house or
>>> >> > > by vendors. They need to easily tell which jars Flink requires
>>> >> > > for the corresponding open-source Hive version, and copy the
>>> >> > > in-house jars over from their Hive deployments as replacements.
>>> >> > >
>>> >> > > Providing a script to download all the individual jars for a
>>> >> > > specified Hive version could be an alternative.
>>> >> > >
>>> >> > > The goal is to provide a *product*, not a technology, to make
>>> >> > > things less of a hassle for Hive users. After all, it's Flink
>>> >> > > embracing the Hive community and ecosystem, not the other way
>>> >> > > around. I'd argue the Hive connector can be treated
>>> >> > > differently, because its community/ecosystem/user base is much
>>> >> > > larger than those of the other connectors, and it matters far
>>> >> > > more than other connectors to Flink's mission of becoming a
>>> >> > > unified batch/streaming engine and getting widely adopted.
>>> >> > >
>>> >> > > On Sun, Dec 15, 2019 at 10:03 PM Danny Chan <yuzhao....@gmail.com>
>>> >> > > wrote:
>>> >> > >
>>> >> > > > Also -1 on separate builds.
>>> >> > > >
>>> >> > > > After looking at how some other big data engines handle their
>>> >> > > > distributions [1], I didn't find a strong need to publish a
>>> >> > > > separate build just for a particular Hive version; there are,
>>> >> > > > however, builds for different Hadoop versions.
>>> >> > > >
>>> >> > > > Just like Seth and Aljoscha said, we could publish a
>>> >> > > > flink-hive-version-uber.jar to use as a lib for the SQL CLI
>>> >> > > > or other use cases.
>>> >> > > >
>>> >> > > > [1] https://spark.apache.org/downloads.html
>>> >> > > > [2]
>>> >> > > > https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
>>> >> > > >
>>> >> > > > Best,
>>> >> > > > Danny Chan
>>> >> > > > On Dec 14, 2019, 3:03 AM +0800, dev@flink.apache.org, wrote:
>>> >> > > > >
>>> >> > > > > https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies
>>> >> > > >
>>> >
>>> > --
>>> > Best, Jingsong Lee
>>> >
>>
> --
> Best, Jingsong Lee
>
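
For reference, the helper-script idea raised by Stephan and Bowen in the
thread above could be as small as pulling a per-version jar list from
Maven Central. A rough Java sketch with an illustrative, not vetted,
artifact list (a real helper would keep one curated list per supported
Hive version, matching the documentation [1] linked earlier):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Downloads the Hive jars for a requested version into ./hive-lib.
public class HiveDepsDownloader {
    private static final String CENTRAL = "https://repo1.maven.org/maven2";

    public static void main(String[] args) throws Exception {
        String hiveVersion = args.length > 0 ? args[0] : "2.3.6";
        // Illustrative group/artifact coordinates only.
        String[][] artifacts = {
                {"org/apache/hive", "hive-exec"},
                {"org/apache/hive", "hive-metastore"},
        };
        Path libDir = Paths.get("hive-lib");
        Files.createDirectories(libDir);
        for (String[] a : artifacts) {
            String jar = a[1] + "-" + hiveVersion + ".jar";
            URL url = new URL(
                    CENTRAL + "/" + a[0] + "/" + a[1] + "/" + hiveVersion + "/" + jar);
            try (InputStream in = url.openStream()) {
                Files.copy(in, libDir.resolve(jar), StandardCopyOption.REPLACE_EXISTING);
            }
            System.out.println("downloaded " + jar);
        }
    }
}

Such a helper would also sidestep Bowen's in-house-jar concern: the jar
list stays visible, and users can swap any downloaded jar for their own
build.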