+1 to having an always-releasable head of master. +1 to having test-verified API compliance. I was thinking that the project should set up verification tests that run nightly against supported versions of Hive, Impala, Spark, … (and obviously open to others to add their tests as well) so that we guarantee API stability.

Alan.

On Thu, Jul 6, 2017 at 2:10 AM, Peter Vary <pv...@cloudera.com> wrote:

> Hi folks,
>
> I agree with most of the things Edward said. I have faced similar issues in smaller scale when integrated Hive with Yetus. We are forced to keep patched Yetus files in Hive repo until they push their next release. Also followed one more serious problem when a patch was committed to Hive, Impala and Spark, and just few days before the release all of them was reverted from the projects due to concerns raised by the Spark committee (after the changes was already committed to Spark as well)
>
> Having said all of these, I still think that separating the HMS to a new top level project could be a step to the right direction with the following constraints. The new project should have:
> - Strict, stability oriented branching strategy following Edward's suggestions, so if a downstream project - for example Hive - needs some fix or easy change that could be incorporated, and released almost immediately. So we have to have these:
>   - Always releasable head
>   - Every multi commit feature should be added as a feature branch
> - Strict, enforced, stability oriented API strategy. So we will not be surprised by features added by other projects and break Hive compatibility. To avoid this situation we need to design for it, have pre-commit tests in place for catch the in-adverted changes, and most importantly have a clear commitment for it.
>
> I think, since the current HMS is already used by numerous other projects, we already should have these in mind when modifying anything in HMS related code. This is not the main focus of Hive, so we do not concentrate on this and there are often interoperability issues, problems. We can do this inside Hive as well, but the current approach followed by Hive, and the one required by the HMS are requiring a different mindset.
> We need a clear, well defined boundary and separating the 2 projects could help in this. We can focus on the different needs and goal and eventually we might have different culture as well which suits the specific needs of the specific part of the code.
>
> I think keeping these rules in the new to level HMS we can mitigate most of the issues mentioned below, and we will be better of overall.
> What do you think Edward?
>
> Thanks,
> Peter
>
> > On Jul 5, 2017, at 10:16 PM, Xuefu Zhang <xu...@apache.org> wrote:
> >
> > I think Edward's concern is valid. While I voiced my support for this proposal, which was more from the benefits of the whole Hadoop ecosystem, I don't see the equal benefits for Hive. Instead, it may even create more overhead for Hive. I'd really like to take time to see what are the road blocks for other projects to use HMS as it is. The issue of Spark including a Hive fork, which was brought up some time back, is certainly not one of them.
> >
> > Thanks,
> > Xuefu
> >
> > On Wed, Jul 5, 2017 at 12:33 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> >
> >> On Wed, Jul 5, 2017 at 1:51 PM, Alan Gates <alanfga...@gmail.com> wrote:
> >>
> >>> On Mon, Jul 3, 2017 at 6:20 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> >>>
> >>>>
> >>>> We already have things in the meta-store not directly tied to language features. For example hive metastore has a "retention" property which is not actively in use by anything. In reality, we rarely say 'no' or -1 to much. Which in part is why I believe our release process is grinding slower: we have so many things in flight I do not feel that any one person can keep track. You are working on porting the metastore to hbase. https://issues.apache.org/jira/browse/HIVE-9452 did you get a -1 or 'No' along the way?
> >>>> When I first noticed this I pointed out that someone has already ported the metastore to Cassandra https://github.com/riptano/brisk/blob/master/src/java/src/org/apache/cassandra/hadoop/hive/metastore/SchemaManagerService.java, but I was more exciting/rational for this multi-year approach using hbase so I let everyone 'have at it'.
> >>>>
> >>> Your example and mine are not equivalent. The HBase metastore is still a Hive feature, even if some thought it not worth while. That is different than people bringing features that will never interest Hive or that Hive could never use (e.g. Dain’s desire for the metastore to support Presto style views).
> >>>
> >>> I forgot to mention the issue these would be non-Hive contributors have with releases if they contribute their features to the metastore while it’s inside Hive. Is Hive going to do a release just to push out features in the metastore that it doesn’t care about?
> >>>
> >>> You seem to be asserting that doing this doesn’t really help non-Hive based systems that are using or would like to use the metastore. But it is interesting that people from three of those systems have commented in the thread so far, and all are positive (Dmitrias from Impala, Dain from Presto, and Sriharsha from the schema registry project).
> >>>
> >>>
> >>>> I am going to give a hypothetical but real world situation. Suppose I want to add the statement "CREATE permanent macro xyz", this feature I believe would cross cut calcite, hive, and hive metastore. To build this feature I would need to orchestrate the change across 3 separate groups of hive 'subcommittees' for lack of a better word. 3 git repos, 3 Jira's 3 releases.
> >>>> That is not counting if we run into some bug or misfeature (maybe with Tez or something else) so that brings in 4-5 releases of upstream to add a feature to hive. This does not take into account normal processes mess ups. For example say you get the metastore done, but now the people doing the calcite/antlr suggest the feature have different syntax because they did not read the 3-4 linked tickets when the process started? Now, you have to loop back around the process. Finding 1 person in 1 project to usher along the feature you want is difficult, having to find and clear time with 3 people across three projects is going to be a difficult along with then 'pushing' them all to kick out a release so you can finally use said feature.
> >>>>
> >>>
> >>> I partially agree with you. On the reviews, JIRAs, etc. I don’t think it adds much, if any, overhead. Hive is a big project and no one person knows all the code anymore. If you wanted to add a permanent macros feature you would need reviews from someone who knows the parser (probably Pengcheng), people who know the optimizer (Jesus, Ashutosh, …), and someone who knows the metastore (me, Thejas, …). And any large feature is going to be implemented over multiple JIRAs, all of which are linkable regardless of whether the JIRAs start with METASTORE- or HIVE-. I also don’t think it makes the feature disagreement any worse. If the optimizer team absolutely insists it has to have some feature and the metastore team insists that it can’t have that feature you’re going to have to work through the issue whether they all are in Hive or in two separate projects.
> >>>
> >>> Where I agree the split adds cost is releases. Before your macro feature could go live you need releases from each of the components.
> >>> And while in development the components need to use snapshot versions of the other components. My assertion is that the benefits out weigh this cost.
> >>>
> >>> Alan.
> >>>
> >>
> >>
> >> "You seem to be asserting that doing this doesn’t really help non-Hive based systems that are using or would like to use the metastore. But it is interesting that people from three of those systems have commented in the thread so far, and all are positive (Dmitrias from Impala, Dain from Presto, and Sriharsha from the schema registry project)."
> >>
> >> I notice that impala has a syntax for caching.
> >>
> >> https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_hdfs_caching.html
> >>
> >> Notice how the cache syntax did not way into Hive? It would make sense if this feature trickled it's way into hive and use HDFS caching for example. I have heard many people claim that using hive metastore is such a because it is packaged weird (like with ORC), but again besides claim/complaining no one has stepped up to deal with that.
> >>
> >> What I would suggest is going forward for maybe a trial period of 6 months, labeling JIRA tickets with a tag that would be "SeeThisProvesWeNeedATLPMetastore". Because right now I do not enough active use cases of people giving anything back to justify hurting our workflow so much.
> >>
> >>
> >> "I partially agree with you. On the reviews, JIRAs, etc. I don’t think it adds much, if any, overhead. Hive is a big project and no one person knows all the code anymore. If you wanted to add a permanent macros feature you would need reviews from someone who knows the parser (probably Pengcheng), people who know the optimizer (Jesus, Ashutosh, …), and someone who knows the metastore (me, Thejas, …).
> >> And any large feature is going to be implemented over multiple JIRAs, all of which are linkable regardless of whether the JIRAs start with METASTORE- or HIVE-. I also don’t think it makes the feature disagreement any worse. If the optimizer team absolutely insists it has to have some feature and the metastore team insists that it can’t have that feature you’re going to have to work through the issue whether they all are in Hive or in two separate projects"
> >>
> >> Macro was done in 1 patch and reviewed by 2 people. With 2-3 follow on bugs.
> >>
> >> https://issues.apache.org/jira/browse/HIVE-2655
> >>
> >> I think your perception is different then mine because of circumstances. I have waited weeks/months for reviews/merges (in Hive and other apache projects) from mundane udfs to cassandra-storage-handlers. You obviously work in a large company and you can more easily align objectives, go to the water cooler and say "hey bob you know it would be cool if you can release x so I can do y". When you are not in that situation its like, "hey mailing list, my patch was done for three months now and like I have had to rebase it three times and like I notice like other stuff is getting committed."
> >>
> >> If you look at it tactically, "create permanent macro xzy". I go over to calcite and suggest some changes there, if this concept is not "game changer" it is probably going to sit unreviewed. If it is "game changer" exciting that is 72 hours for release voting. Next go to hive-metastore repeat the process, but remember now I have to "wow" the metastore people with the "game changer" and if that crew is super focused on something about kafka well now Hive features are second fiddle. Now lets say a hive release is coming up, and I really want my feature in it.
> >> hive-metastore-tlp might currently have a broken trunk because mongo wants to add spaceships to wombats feature has a bug and frankly that should not effect us.
> >>
> >> I hate to draw in something else but I feel it is related:
> >>
> >> 8 December 2016 : release 2.1.1 available
> >> 07 April 2017 : release 1.2.2 available
> >> hive-dev [DISCUSS] Supporting Hadoop-1 and experimental features
> >> hive-dev Re: release chaos?
> >>
> >> I have been vocal about not liking certain branching strategies and proposals that take us away from releasable trunk. We have steadily headed in a direction where we are pulling things out of hive, and we are not able to turn out releases. We even had a thread "release chaos" talking about our 5 active branches (with friends I say "jumped the shark"). Pulling out the metastore is only going to make this worse. I do not even see the model as successful. You may say it is great that calcite lets people share our sql dialect or the ORC TLP has 5 committers, but if Hive can not get a release out the door I do not see us optimizing for the proper thing.
> >>