Commit-then-review and self-commit destroy the good things we get from our normal system.
http://anna.gs/blog/2013/08/12/code-review-ftw/

I am most worried about silos of knowledge, lax testing policies, and code quality, all of which I have now seen on several occasions when something is happening in a branch (not calling out the tez branch in particular).

On Fri, Aug 16, 2013 at 9:13 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> I am still not sure we are doing this the ideal way. I am not a believer in a commit-then-review branch.
>
> This issue is an example:
>
> https://issues.apache.org/jira/browse/HIVE-5108
>
> I ask myself these questions:
> Does this currently work? Are there tests? If so, which ones are broken?
> How does the patch fix them without tests to validate?
>
> Having a commit-then-review branch just seems subversive to our normal process, and a quick shortcut to avoid writing tests or involving anyone else.
>
> On Mon, Aug 5, 2013 at 1:54 PM, Alan Gates <ga...@hortonworks.com> wrote:
>
>> On Jul 29, 2013, at 9:53 PM, Edward Capriolo wrote:
>>
>> > Also watched http://www.ustream.tv/recorded/36323173
>> >
>> > I definitely see the win in being able to stream inter-stage output.
>> >
>> > I see some cases where small intermediate results can be kept "in memory". But I was somewhat under the impression that the map reduce spill settings kept stuff in memory; isn't that what spill settings are?
>>
>> No. MapReduce always writes shuffle data to local disk. And intermediate results between MR jobs are always persisted to HDFS, as there's no other option. When we talk of being able to keep intermediate results in memory we mean getting rid of both of these disk writes/reads when appropriate (meaning not always; there's a trade-off between speed and error handling to be made here, see below for more details).
>>
>> > There are a few bullet points that came up repeatedly that I do not follow:
>> >
>> > Something was said to the effect of "Container reuse makes X faster". Hadoop has jvm reuse. Not following what the difference is here? Not everyone has a 10K node cluster.
>>
>> Sharing JVMs across users is inherently insecure (we can't guarantee what code the first user left behind that may interfere with later users). As I understand container re-use in Tez, it constrains the re-use to one user for security reasons, but still avoids additional JVM start-up costs. But this is a question that the Tez guys could answer better on the Tez lists (d...@tez.incubator.apache.org).
>>
>> > "Joins in map reduce are hard" Really? I mean some of them are I guess, but the typical join is very easy. Just shuffle by the join key. There were not really enough low-level details here saying why joins are better in tez.
>>
>> Join is not a natural operation in MapReduce. MR gives you one input and one output. You end up having to bend the rules to have multiple inputs. The idea here is that Tez can provide operators that naturally work with joins and other operations that don't fit the one input/one output model (e.g. unions).
>>
>> > "Choosing the number of maps and reduces is hard" Really? I do not find it that hard. I think there are times when it's not perfect, but I do not find it hard. The talk did not really offer anything technical here on how tez makes this better, other than that it could make it better.
>>
>> Perhaps manual would be a better term here than hard. In our experience it takes quite a bit of engineer trial and error to determine the optimal numbers. This may be OK if you're going to invest the time once and then run the same query every day for 6 months. But obviously it doesn't work for the ad hoc case. Even in the batch case it's not optimal, because every once in a while an engineer has to go back and re-optimize the query to deal with changing data sizes, data characteristics, etc.
>> We want the optimizer to handle this without human intervention.
>>
>> > The presentations mentioned streaming data. How do two nodes stream data between tasks, and how is it reliable? If the sender or receiver dies, does the entire process have to start again?
>>
>> If the sender or receiver dies then the query has to be restarted from some previous point where data was persisted to disk. The idea here is that speed vs. error recovery trade-offs should be made by the optimizer. If the optimizer estimates that a query will complete in 5 seconds it can stream everything, and if a node fails it just re-runs the whole query. If it estimates that a particular phase of a query will run for an hour it can choose to persist the results to HDFS so that in the event of a failure downstream the long phase need not be re-run. Again, we want this to be done automatically by the system so the user doesn't need to control this level of detail.
>>
>> > Again, one of the talks implied there is a prototype out there that launches hive jobs into tez. I would like to see that; it might answer more questions than a PowerPoint, and I could profile some common queries.
>>
>> As mentioned in a previous email, afaik Gunther's pushed all these changes to the Tez branch in Hive.
>>
>> Alan.
>>
>> > Random late night thoughts over,
>> > Ed
>> >
>> > On Tue, Jul 30, 2013 at 12:02 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>> >
>> >> At ~25:00
>> >>
>> >> "There is a working prototype of hive which is using tez as the targeted runtime"
>> >>
>> >> Can I get a look at that code? Is it on github?
>> >>
>> >> Edward
>> >>
>> >> On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates <ga...@hortonworks.com> wrote:
>> >>
>> >>> Answers to some of your questions inlined.
>> >>>
>> >>> Alan.
>> >>>
>> >>> On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:
>> >>>
>> >>>> There are some points I want to bring up. First, I am on the PMC. Here is something I find relevant:
>> >>>>
>> >>>> http://www.apache.org/foundation/how-it-works.html
>> >>>>
>> >>>> ------------------------------
>> >>>>
>> >>>> The role of the PMC from a Foundation perspective is oversight. The main role of the PMC is not code and not coding - but to ensure that all legal issues are addressed, that procedure is followed, and that each and every release is the product of the community as a whole. That is key to our litigation protection mechanisms.
>> >>>>
>> >>>> Secondly the role of the PMC is to further the long term development and health of the community as a whole, and to ensure that balanced and wide scale peer review and collaboration does happen. Within the ASF we worry about any community which centers around a few individuals who are working virtually uncontested. We believe that this is detrimental to quality, stability, and robustness of both code and long term social structures.
>> >>>>
>> >>>> --------------------------------
>> >>>>
>> >>>> https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
>> >>>>
>> >>>> -------------------------------------
>> >>>>
>> >>>> All other decisions happen on the dev list, discussions on the private list are kept to a minimum.
>> >>>>
>> >>>> "If it didn't happen on the dev list, it didn't happen" - which leads to:
>> >>>>
>> >>>> a) Elections of committers and PMC members are published on the dev list once finalized.
>> >>>>
>> >>>> b) Out-of-band discussions (IRC etc.) are summarized on the dev list as soon as they have impact on the project, code or community.
>> >>>>
>> >>>> ---------------------------------
>> >>>>
>> >>>> https://issues.apache.org/jira/browse/HIVE-4660 ironically titled "Let their be Tez" has not been +1'ed by any committer. It was never discussed on the dev or the user list (as far as I can tell).
>> >>>
>> >>> As all JIRA creations and updates are sent to dev@hive, creating a JIRA is de facto posting to the list.
>> >>>
>> >>>> As a PMC member I feel we need more discussion on Tez on the dev list along with a wiki-fied design document. Topics of discussion should include:
>> >>>
>> >>> I talked with Gunther and he's working on posting a design doc on the wiki. He has a PDF on the JIRA but he doesn't have write permissions yet on the wiki.
>> >>>
>> >>>> 1) What is tez?
>> >>>
>> >>> In Hadoop 2.0, YARN opens up the ability to have multiple execution frameworks in Hadoop. Hadoop apps are no longer tied to MapReduce as the only execution option. Tez is an effort to build an execution engine that is optimized for relational data processing, such as Hive and Pig.
>> >>>
>> >>> The biggest change here is to move away from only Map and Reduce as processing options and to allow alternate combinations of processing, such as map -> reduce -> reduce or tasks that take multiple inputs or shuffles that avoid sorting when it isn't needed.
>> >>>
>> >>> For a good intro to Tez, see Arun's presentation on it at the recent Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212)
>> >>>
>> >>>> 2) How is tez different from oozie, http://code.google.com/p/hop/, http://cs.brown.edu/~backman/cmr.html , and other DAG and/or streaming map reduce tools/frameworks? Why should we use this and not those?
>> >>>
>> >>> Oozie is a completely different thing.
>> >>> Oozie is a workflow engine and a scheduler. Its core competencies are the ability to coordinate workflows of disparate job types (MR, Pig, Hive, etc.) and to schedule them. It is not intended as an execution engine for apps such as Pig and Hive.
>> >>>
>> >>> I am not familiar with these other engines, but the short answer is that Tez is built to work on YARN, which works well for Hive since it is tied to Hadoop.
>> >>>
>> >>>> 3) When can we expect the first tez release?
>> >>>
>> >>> I don't know, but I hope sometime this fall.
>> >>>
>> >>>> 4) How much effort is involved in integrating hive and tez?
>> >>>
>> >>> Covered in the design doc.
>> >>>
>> >>>> 5) Who is ready to commit to this effort?
>> >>>
>> >>> I'll let people speak for themselves on that one.
>> >>>
>> >>>> 6) Can we expect this work to be done in one hive release?
>> >>>
>> >>> Unlikely. Initial integration will be done in one release, but as Tez is a new project I expect it will be adding features in the future that Hive will want to take advantage of.
>> >>>
>> >>>> In my opinion we should not start any work on this tez-hive until these questions are answered to the satisfaction of the hive developers.
>> >>>
>> >>> Can we change this to "not commit patches"? We can't tell willing people not to work on it.
>> >>>>
>> >>>> On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>> >>>>
>> >>>>>>> The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out what votes are needed for what. I don't see anything there about needing 3 +1s for a branch. Branching would seem to fall under code change, which requires one vote and a minimum length of 1 day.
>> >>>>>
>> >>>>> You could argue that all you need is one +1 to create a branch, but this is more than a branch. If you are talking about something that is:
>> >>>>> 1) going to cause major re-factoring of critical pieces of hive like ExecDriver and MapRedTask
>> >>>>> 2) going to be very disruptive to the efforts of other committers
>> >>>>> 3) something that may be a major architectural change
>> >>>>>
>> >>>>> Getting the project on board with the idea is a good idea.
>> >>>>>
>> >>>>> Now I want to point something out. Here are some recent initiatives in hive:
>> >>>>>
>> >>>>> 1) At one point there was a big initiative to "support oracle". After the initial work, there are patches in Jira and no one seems to care about oracle support.
>> >>>>> 2) Another such decision was this "support windows" one; there are probably 4 windows patches waiting for review.
>> >>>>> 3) I still have no clue what the official hadoop1, hadoop2, hadoop 0.23 support perspective is, but every couple of weeks we get another jira about something not working/testing on one of those versions; seems like several builds are broken.
>> >>>>> 4) Hive storage handler: after the initial implementation no one cares to review any other storage handler implementation. There are 3 patches there or more; I could not even find anyone willing to review the cassandra storage handler I spent months on.
>> >>>>> 5) ORC, Vectorization
>> >>>>> 6) Windowing: committed, numerous check-style violations.
>> >>>>>
>> >>>>> We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active committers. We are spread very thin, and embarking on another side project not involved with core hive seems like the wrong direction at the moment.
>> >>>>>
>> >>>>> On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates <ga...@hortonworks.com> wrote:
>> >>>>>
>> >>>>>> On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
>> >>>>>>
>> >>>>>>> I have started to see several refactoring patches around tez.
>> >>>>>>> https://issues.apache.org/jira/browse/HIVE-4843
>> >>>>>>>
>> >>>>>>> This is the only mention on the hive list I can find with tez:
>> >>>>>>> "Makes sense. I will create the branch soon.
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>> Ashutosh
>> >>>>>>>
>> >>>>>>> On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner <ghagleit...@hortonworks.com> wrote:
>> >>>>>>>
>> >>>>>>>> Hi,
>> >>>>>>>>
>> >>>>>>>> I am starting to work on integrating Tez into Hive (see HIVE-4660, design doc has already been uploaded - any feedback will be much appreciated). This will be a fair amount of work that will take time to stabilize/test. I'd like to propose creating a branch in order to be able to do this incrementally and collaboratively. In order to progress rapidly with this, I would also like to go "commit-then-review".
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Gunther.
>> >>>>>>>> "
>> >>>>>>>
>> >>>>>>> These refactorings are largely destructive to a number of bug fixes and language improvements in hive that have been sitting in Jira for quite some time, marked patch-available and waiting for review.
>> >>>>>>>
>> >>>>>>> There are a few things I want to point out:
>> >>>>>>> 1) Normally we create design docs in our wiki (which it is not)
>> >>>>>>> 2) Normally when the change is significantly complex we get multiple committers to comment on it (which we did not)
>> >>>>>>> On point 2 no one -1'ed the branch, but this is really something that should have required a +1 from 3 committers.
>> >>>>>>
>> >>>>>> The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out what votes are needed for what. I don't see anything there about needing 3 +1s for a branch. Branching would seem to fall under code change, which requires one vote and a minimum length of 1 day.
>> >>>>>>
>> >>>>>>> I for one am not completely sold on Tez.
>> >>>>>>> http://incubator.apache.org/projects/tez.html.
>> >>>>>>> "directed-acyclic-graph of tasks for processing data": this description sounds like many things which have never become popular. One to think of is oozie: "Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.". I am sure I can find a number of libraries/frameworks that make this same claim. In general I do not feel like we have done our homework and pre-requisites to justify all this work. If we have done the homework, I am sure that it has not been communicated and accepted by hive developers at large.
>> >>>>>>
>> >>>>>> A request for better documentation on Tez and a project road map seems totally reasonable.
>> >>>>>>
>> >>>>>>> If we have a branch, why are we also committing on trunk? Scanning through the tez doc I keep finding language like "minimal changes to the planner", yet there are ALREADY lots of large changes going on!
>> >>>>>>>
>> >>>>>>> Really none of the above would bother me except for the fact that these "minimal changes" are causing many "patch available" ready-for-review bugs and core hive features to need to be rebased.
>> >>>>>>>
>> >>>>>>> I am sure I have mentioned this before, but I have to spend 12+ hours to test a single patch on my laptop. A few days ago I was testing a new core hive feature. After all the tests passed and before I was able to commit, someone unleashed a tez patch on trunk which caused the thing I was testing for 12 hours to need to be rebased.
>> >>>>>>>
>> >>>>>>> I'm not cool with this. Next time that happens to me I will seriously consider reverting the patch. Bug fixes and new hive features are more important to me than integrating with incubator projects.
>> >>>>>>
>> >>>>>> (With my Apache member hat on) Reverting patches that aren't breaking the build is considered very bad form in Apache. It does make sense to request that when people are going to commit a patch that will break many other patches they first give a few hours of notice so people can say something if they're about to commit another patch and avoid your fate of needing to rerun the tests. The other thing is we need to get the automated build of patches working on Hive so committers aren't forced to run all of the tests themselves. We are working on it, but we're not there yet.
>> >>>>>>
>> >>>>>> Alan.