Oh, actually, in order to decouple the Hadoop 3.2 and Hive 2.3 upgrades, we will need a hive-2.3 profile anyway, regardless of whether we keep a hive-1.2 profile or not.
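As a rough sketch, the decoupled build matrix could look something like the following (the hive-1.2/hive-2.3 profile names are still under discussion in this thread, so treat them as placeholders rather than final flags; hadoop-2.7 and hadoop-3.2 are the existing Hadoop profiles):

    $ ./build/mvn -Phive-2.3 -Phadoop-2.7 -DskipTests clean package
    $ ./build/mvn -Phive-1.2 -Phadoop-2.7 -DskipTests clean package
    $ ./build/mvn -Phive-2.3 -Phadoop-3.2 -DskipTests clean package

The point is that picking Hive 2.3 would no longer force anyone onto Hadoop 3.2, and vice versa.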
On Wed, Nov 20, 2019 at 3:33 PM Cheng Lian <lian.cs....@gmail.com> wrote:

> Just to summarize my points:
>
> 1. Let's still keep the Hive 1.2 dependency in Spark 3.0, but make it optional. End-users may choose between Hive 1.2/2.3 via a new profile (either adding a hive-1.2 profile or adding a hive-2.3 profile works for me, depending on which Hive version we pick as the default version).
> 2. Decouple the Hive version upgrade and the Hadoop version upgrade, so that people have more choices in production and the Spark 3.0 migration is easier (e.g., you don't have to switch to Hadoop 3 in order to pick up Hive 2.3 and/or JDK 11).
> 3. For the default Hadoop/Hive versions in Spark 3.0, I personally do not have a preference as long as the above two are met.
>
> On Wed, Nov 20, 2019 at 3:22 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> Dongjoon, I don't think we have any conflicts here. As stated in other threads multiple times, as long as the Hive 2.3 and Hadoop 3.2 version upgrades can be decoupled, I have no preference about which Hive/Hadoop version we pick as the default. So the following two plans both work for me:
>>
>> 1. Keep Hive 1.2 as the default Spark 3.0 execution Hive version, and have an extra hive-2.3 profile.
>> 2. Choose Hive 2.3 as the default Spark 3.0 execution Hive version, and have an extra hive-1.2 profile.
>>
>> BTW, I was also discussing Hive dependency issues with other people offline, and I realized that the Hive isolated client loader is not well known and has caused unnecessary confusion/worry. So I would like to provide some background context for readers who are not familiar with Spark's Hive integration. *Building Spark 3.0 with Hive 1.2.1 does NOT mean that you can only interact with Hive 1.2.1.*
>>
>> Spark does work with different versions of the Hive metastore via an isolated classloading mechanism. *Even if Spark itself is built with the Hive 1.2.1 fork, you can still interact with a Hive 2.3 metastore, and this has been true ever since Spark 1.x.* In order to do this, just set the following two options according to the instructions in our official doc page <http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore>:
>>
>> - spark.sql.hive.metastore.version
>> - spark.sql.hive.metastore.jars
>>
>> Say you set "spark.sql.hive.metastore.version" to "2.3.6" and "spark.sql.hive.metastore.jars" to "maven". Spark will then pull the Hive 2.3.6 dependencies from Maven at runtime when initializing the Hive metastore client. And those dependencies will NOT conflict with the built-in Hive 1.2.1 jars, because the downloaded jars are loaded using an isolated classloader (see here <https://github.com/apache/spark/blob/1febd373ea806326d269a60048ee52543a76c918/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala>). Historically, we call these two sets of Hive dependencies the "execution Hive" and the "metastore Hive". The former is mostly used for features like SerDes, while the latter is used to interact with the Hive metastore. The Hive version upgrade we are discussing here is about the execution Hive.
>>
>> Cheng
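A minimal sketch of those two settings, assuming a Hive 2.3.6 metastore and network access to Maven repositories at runtime:

    $ ./bin/spark-shell \
        --conf spark.sql.hive.metastore.version=2.3.6 \
        --conf spark.sql.hive.metastore.jars=maven

With this, the metastore client classes are downloaded and loaded in the isolated classloader, independently of the execution Hive that Spark was built with.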
>> On Wed, Nov 20, 2019 at 2:38 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>
>>> Nice. That's progress.
>>>
>>> Let's narrow down to the path. We need to clarify the criteria we can agree on.
>>>
>>> 1. What does `battle-tested for years` mean exactly?
>>>    How and when can we start the `battle-tested` stage for Hive 2.3?
>>>
>>> 2. What is the new "Hive integration in Spark"?
>>>    While introducing Hive 2.3, we fixed the compatibility stuff as you said.
>>>    Most of the code is shared between Hive 1.2 and Hive 2.3.
>>>    That means if there is a bug inside this shared code, both of them will be affected.
>>>    Of course, we can fix this because it's Spark code. We will learn and fix it as you said.
>>>
>>>    > Yes, there are issues, but people have learned how to get along with these issues.
>>>
>>>    The only non-shared code is the following.
>>>    Do you have a concern about the following directories?
>>>    If there are no bugs in the following codebase, can we switch?
>>>
>>>    $ find . -name v2.3.5
>>>    ./sql/core/v2.3.5
>>>    ./sql/hive-thriftserver/v2.3.5
>>>
>>> 3. We know that we can keep both code bases, but the community should choose Hive 2.3 officially.
>>>    That's the right choice from the Apache project policy perspective. At least, Sean and I prefer that.
>>>    If someone really wants to stick to the Hive 1.2 fork, they can use it at their own risk.
>>>
>>>    > for Spark 3.0 end-users who really don't want to interact with this Hive 1.2 fork, they can always use Hive 2.3 at their own risks.
>>>
>>> Specifically, what about having a profile `hive-1.2` at `3.0.0` with the default Hive 2.3 pom at least?
>>> What do you think about that approach, Cheng?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Wed, Nov 20, 2019 at 12:59 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>>>
>>>> Hey Dongjoon and Felix,
>>>>
>>>> I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we wouldn't even consider integrating with Hive 2.3 in Spark 3.0.
>>>>
>>>> However, *"Hive" and "Hive integration in Spark" are two quite different things*, and I don't think anybody has ever mentioned "the forked Hive 1.2.1 is stable" in any recent Hadoop/Hive version discussions (at least I double-checked all my replies).
>>>>
>>>> What I really care about is the stability and quality of the "Hive integration in Spark", which has gone through some major updates due to the recent Hive 2.3 upgrade in Spark 3.0. We have already found bugs in this piece, and empirically, for a significant upgrade like this one, it is not surprising that other bugs/regressions will be found in the near future. On the other hand, the Hive 1.2 integration code path in Spark has been battle-tested for years. Yes, there are issues, but people have learned how to get along with these issues. And please don't forget that, for Spark 3.0 end-users who really don't want to interact with this Hive 1.2 fork, they can always use Hive 2.3 at their own risks.
>>>>
>>>> True, "stable" is quite a vague criterion, and hard to prove. But that is exactly the reason why we may want to be conservative, wait for some time, and see whether there are further signals suggesting that the Hive 2.3 integration in Spark 3.0 is *unstable*. After one or two Spark 3.x minor releases, if we've fixed all the outstanding issues and no more significant ones are showing up, we can declare that the Hive 2.3 integration in Spark 3.x is stable, and then we can consider removing references to the Hive 1.2 fork. Does that make sense?
>>>>
>>>> Cheng
>>>>
>>>> On Wed, Nov 20, 2019 at 11:49 AM Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>
>>>>> Just to add - the Hive 1.2 fork is definitely not more stable.
>>>>> We know of a few critical bug fixes that we cherry-picked into a fork of that fork to maintain ourselves.
>>>>>
>>>>> ------------------------------
>>>>> *From:* Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>>> *Sent:* Wednesday, November 20, 2019 11:07:47 AM
>>>>> *To:* Sean Owen <sro...@gmail.com>
>>>>> *Cc:* dev <dev@spark.apache.org>
>>>>> *Subject:* Re: The Myth: the forked Hive 1.2.1 is stabler than XXX
>>>>>
>>>>> Thanks. That will be a giant step forward, Sean!
>>>>>
>>>>> > I'd prefer making it the default in the POM for 3.0.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> On Wed, Nov 20, 2019 at 11:02 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>> Yeah, 'stable' is ambiguous. It's old and buggy, but at least it's the same old and buggy that's been there a while - "stable" in that sense.
>>>>> I'm sure there is a lot more delta between Hive 1 and 2 in terms of bug fixes that are important; the question isn't just 1.x releases.
>>>>>
>>>>> What I don't know is how much of that affects Spark, as it's mostly a Hive client. Clearly some of it does.
>>>>>
>>>>> I'd prefer making it the default in the POM for 3.0, mostly on the grounds that its effects are on deployed clusters, not apps. And deployers can still choose a binary distro with 1.x or make the choice they want. Those that don't care should probably be nudged to 2.x. Spark 3.x is already full of behavior changes and 'unstable', so I think this is minor relative to the overall risk question.
>>>>>
>>>>> On Wed, Nov 20, 2019 at 12:53 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>> >
>>>>> > Hi, All.
>>>>> >
>>>>> > I'm sending this email because it's important to discuss this topic narrowly and make a clear conclusion.
>>>>> >
>>>>> > `The forked Hive 1.2.1 is stable`? It sounds like a myth we created by ignoring the existing bugs. If you want to say the forked Hive 1.2.1 is stabler than XXX, please give us the evidence. Then, we can fix it. Otherwise, let's stop making `The forked Hive 1.2.1` invincible.
>>>>> >
>>>>> > Historically, the following forked Hive 1.2.1 has never been stable. It's just frozen. Since the forked Hive is out of our control, we ignored bugs. That's all. The reality is far from a stable status.
>>>>> >
>>>>> > https://mvnrepository.com/artifact/org.spark-project.hive/
>>>>> > https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark (2015 August)
>>>>> > https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2 (2016 April)
>>>>> >
>>>>> > First, let's look at Hive itself by comparing with Apache Hive 1.2.2 and 1.2.3:
>>>>> >
>>>>> >     Apache Hive 1.2.2 has 50 bug fixes.
>>>>> >     Apache Hive 1.2.3 has 9 bug fixes.
>>>>> >
>>>>> > I will not cover all of them, but the Apache Hive community also backports important patches, just like the Apache Spark community.
>>>>> >
>>>>> > Second, let's move to SPARK issues, because we aren't exposed to all Hive issues.
>>>>> >
>>>>> >     SPARK-19109 ORC metadata section can sometimes exceed protobuf message size limit
>>>>> >     SPARK-22267 Spark SQL incorrectly reads ORC file when column order is different
>>>>> >
>>>>> > These have been reported since Apache Spark 1.6.x because the forked Hive doesn't have a proper upstream patch like HIVE-11592 (fixed in Apache Hive 1.3.0).
>>>>> >
>>>>> > Since we couldn't update the frozen forked Hive, we added an Apache ORC dependency in SPARK-20682 (2.3.0), added a switching configuration in SPARK-20728 (2.3.0), and turned on `spark.sql.hive.convertMetastoreOrc` by default in SPARK-22279 (2.4.0). However, if you turn off the switch and start to use the forked Hive, you will be exposed to the buggy forked Hive 1.2.1 again.
>>>>> >
>>>>> > Third, let's talk about the new features like Hadoop 3 and JDK 11. No one believes that the ancient forked Hive 1.2.1 will work with these. I saw that the following issue was mentioned as evidence of a Hive 2.3.6 bug:
>>>>> >
>>>>> >     SPARK-29245 ClassCastException during creating HiveMetaStoreClient
>>>>> >
>>>>> > Yes, I know that issue because I reported it and verified HIVE-21508. It's fixed already and will be released as Apache Hive 2.3.7.
>>>>> >
>>>>> > Can we imagine something like this in the forked Hive 1.2.1? 'No'. There is no future for it. It's frozen.
>>>>> >
>>>>> > From now on, I want to claim that the forked Hive 1.2.1 is the unstable one. I welcome all your positive and negative opinions. Please share your concerns and problems, and let's fix them together. Apache Spark is an open source project we share.
>>>>> >
>>>>> > Bests,
>>>>> > Dongjoon.
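A hedged illustration of the `spark.sql.hive.convertMetastoreOrc` switch mentioned above (only a sketch of how a user would opt back into the forked Hive ORC path; the flag defaults to true since SPARK-22279):

    $ ./bin/spark-sql --conf spark.sql.hive.convertMetastoreOrc=false

With the flag left at its default of true, ORC tables registered in the metastore are read through the native ORC support added in SPARK-20682 instead of the forked Hive 1.2.1 reader.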