Re: Tez branch and tez based patches

Edward Capriolo Fri, 16 Aug 2013 07:55:08 -0700

Commit-then-review, combined with self-commit, destroys the good things we
get from our normal system.


http://anna.gs/blog/2013/08/12/code-review-ftw/

I am most worried about knowledge silos, lax testing policies, and code
quality, all of which I have now seen on several occasions when something is
happening in a branch. (I am not calling out the tez branch in particular.)



On Fri, Aug 16, 2013 at 9:13 AM, Edward Capriolo <[email protected]> wrote:

> I still am not sure we are doing this the ideal way. I am not a believer
> in a commit-then-review branch.
>
> This issue is an example.
>
> https://issues.apache.org/jira/browse/HIVE-5108
>
> I ask myself these questions:
> Does this currently work? Are there tests? If so, which ones are broken?
> How does the patch fix them without tests to validate?
>
> Having a commit-then-review branch just seems subversive to our normal
> process, and a quick shortcut to avoid being bothered with writing tests
> or involving anyone else.
>
>
>
> On Mon, Aug 5, 2013 at 1:54 PM, Alan Gates <[email protected]> wrote:
>
>>
>> On Jul 29, 2013, at 9:53 PM, Edward Capriolo wrote:
>>
>> > Also watched http://www.ustream.tv/recorded/36323173
>> >
>> > I definitely see the win in being able to stream inter-stage output.
>> >
>> > I see some cases where small intermediate results can be kept "In memory".
>> > But I was somewhat under the impression that the map reduce spill settings
>> > kept stuff in memory, isn't that what spill settings are?
>>
>> No.  MapReduce always writes shuffle data to local disk.  And
>> intermediate results between MR jobs are always persisted to HDFS, as
>> there's no other option.  When we talk of being able to keep intermediate
>> results in memory we mean getting rid of both of these disk writes/reads
>> when appropriate (meaning not always, there's a trade off between speed and
>> error handling to be made here, see below for more details).
>>
>> >
>> > There are a few bullet points that came up repeatedly that I do not follow:
>> >
>> > Something was said to the effect of "Container reuse makes X faster".
>> > Hadoop has jvm reuse. Not following what the difference is here? Not
>> > everyone has a 10K node cluster.
>>
>> Sharing JVMs across users is inherently insecure (we can't guarantee what
>> code the first user left behind that may interfere with later users).  As I
>> understand container re-use in Tez it constrains the re-use to one user for
>> security reasons, but still avoids additional JVM start up costs.  But this
>> is a question that the Tez guys could answer better on the Tez lists (
>> [email protected])
>>
>> >
>> > "Joins in map reduce are hard" Really? I mean some of them are I guess, but
>> > the typical join is very easy. Just shuffle by the join key. There were not
>> > really enough low-level details here saying why joins are better in tez.
>>
>> Join is not a natural operation in MapReduce.  MR gives you one input and
>> one output.  You end up having to bend the rules to have multiple
>> inputs.  The idea here is that Tez can provide operators that naturally
>> work with joins and other operations that don't fit the one input/one
>> output model (eg unions, etc.).
>>
>> >
>> > "Choosing the number of maps and reduces is hard" Really? I do not find it
>> > that hard, I think there are times when it's not perfect but I do not find
>> > it hard. The talk did not really offer anything technical here on how tez
>> > makes this better, other than that it could make it better.
>>
>> Perhaps manual would be a better term here than hard.  In our experience
>> it takes quite a bit of engineer trial and error to determine the optimal
>> numbers.  This may be ok if you're going to invest the time once and then
>> run the same query every day for 6 months.  But obviously it doesn't work
>> for the ad hoc case.  Even in the batch case it's not optimal because every
>> once in a while an engineer has to go back and re-optimize the query to
>> deal with changing data sizes, data characteristics, etc.  We want the
>> optimizer to handle this without human intervention.
>>
>> >
>> > The presentations mentioned streaming data, how do two nodes stream data
>> > between tasks and how is it reliable? If the sender or receiver dies does
>> > the entire process have to start again?
>>
>> If the sender or receiver dies then the query has to be restarted from
>> some previous point where data was persisted to disk.  The idea here is
>> that speed vs error recovery trade offs should be made by the optimizer.
>>  If the optimizer estimates that a query will complete in 5 seconds it can
>> stream everything and if a node fails it just re-runs the whole query.  If
>> it estimates that a particular phase of a query will run for an hour it can
>> choose to persist the results to HDFS so that in the event of a failure
>> downstream the long phase need not be re-run.  Again we want this to be
>> done automatically by the system so the user doesn't need to control this
>> level of detail.
>>
>> >
>> > Again one of the talks implied there is a prototype out there that launches
>> > hive jobs into tez. I would like to see that, it might answer more
>> > questions than a PowerPoint, and I could profile some common queries.
>>
>> As mentioned in a previous email afaik Gunther's pushed all these changes
>> to the Tez branch in Hive.
>>
>> Alan.
>>
>> >
>> > Random late night thoughts over,
>> > Ed
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Tue, Jul 30, 2013 at 12:02 AM, Edward Capriolo <[email protected]> wrote:
>> >
>> >> At ~25:00
>> >>
>> >> "There is a working prototype of hive which is using tez as the targeted
>> >> runtime"
>> >>
>> >> Can I get a look at that code? Is it on github?
>> >>
>> >> Edward
>> >>
>> >>
>> >> On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates <[email protected]> wrote:
>> >>
>> >>> Answers to some of your questions inlined.
>> >>>
>> >>> Alan.
>> >>>
>> >>> On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:
>> >>>
>> >>>> There are some points I want to bring up. First, I am on the PMC. Here is
>> >>>> something I find relevant:
>> >>>>
>> >>>> http://www.apache.org/foundation/how-it-works.html
>> >>>>
>> >>>> ------------------------------
>> >>>>
>> >>>> The role of the PMC from a Foundation perspective is oversight. The main
>> >>>> role of the PMC is not code and not coding - but to ensure that all legal
>> >>>> issues are addressed, that procedure is followed, and that each and every
>> >>>> release is the product of the community as a whole. That is key to our
>> >>>> litigation protection mechanisms.
>> >>>>
>> >>>> Secondly the role of the PMC is to further the long term development and
>> >>>> health of the community as a whole, and to ensure that balanced and wide
>> >>>> scale peer review and collaboration does happen. Within the ASF we worry
>> >>>> about any community which centers around a few individuals who are working
>> >>>> virtually uncontested. We believe that this is detrimental to quality,
>> >>>> stability, and robustness of both code and long term social structures.
>> >>>>
>> >>>> --------------------------------
>> >>>>
>> >>>>
>> >>>> https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
>> >>>>
>> >>>> -------------------------------------
>> >>>>
>> >>>> All other decisions happen on the dev list, discussions on the private list
>> >>>> are kept to a minimum.
>> >>>>
>> >>>> "If it didn't happen on the dev list, it didn't happen" - which leads to:
>> >>>>
>> >>>> a) Elections of committers and PMC members are published on the dev list
>> >>>> once finalized.
>> >>>>
>> >>>> b) Out-of-band discussions (IRC etc.) are summarized on the dev list as
>> >>>> soon as they have impact on the project, code or community.
>> >>>> ---------------------------------
>> >>>>
>> >>>> https://issues.apache.org/jira/browse/HIVE-4660 ironically titled "Let
>> >>>> there be Tez" has not been +1'ed by any committer. It was never discussed
>> >>>> on the dev or the user list (as far as I can tell).
>> >>>
>> >>> As all JIRA creations and updates are sent to dev@hive, creating a JIRA
>> >>> is de facto posting to the list.
>> >>>
>> >>>>
>> >>>> As a PMC member I feel we need more discussion on Tez on the dev list along
>> >>>> with a wiki-fied design document. Topics of discussion should include:
>> >>>
>> >>> I talked with Gunther and he's working on posting a design doc on the
>> >>> wiki.  He has a PDF on the JIRA but he doesn't have write permissions yet
>> >>> on the wiki.
>> >>>
>> >>>>
>> >>>> 1) What is tez?
>> >>> In Hadoop 2.0, YARN opens up the ability to have multiple execution
>> >>> frameworks in Hadoop.  Hadoop apps are no longer tied to MapReduce as the
>> >>> only execution option.  Tez is an effort to build an execution engine that
>> >>> is optimized for relational data processing, such as Hive and Pig.
>> >>>
>> >>> The biggest change here is to move away from only Map and Reduce as
>> >>> processing options and to allow alternate combinations of processing, such
>> >>> as map -> reduce -> reduce or tasks that take multiple inputs or shuffles
>> >>> that avoid sorting when it isn't needed.
>> >>>
>> >>> For a good intro to Tez, see Arun's presentation on it at the recent
>> >>> Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides
>> >>> http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212)
>> >>>>
>> >>>> 2) How is tez different from oozie, http://code.google.com/p/hop/,
>> >>>> http://cs.brown.edu/~backman/cmr.html , and other DAG and/or streaming map
>> >>>> reduce tools/frameworks? Why should we use this and not those?
>> >>>
>> >>> Oozie is a completely different thing.  Oozie is a workflow engine and a
>> >>> scheduler.  Its core competencies are the ability to coordinate workflows
>> >>> of disparate job types (MR, Pig, Hive, etc.) and to schedule them.  It is
>> >>> not intended as an execution engine for apps such as Pig and Hive.
>> >>>
>> >>> I am not familiar with these other engines, but the short answer is that
>> >>> Tez is built to work on YARN, which works well for Hive since it is tied to
>> >>> Hadoop.
>> >>>>
>> >>>> 3) When can we expect the first tez release?
>> >>> I don't know, but I hope sometime this fall.
>> >>>
>> >>>>
>> >>>> 4) How much effort is involved in integrating hive and tez?
>> >>> Covered in the design doc.
>> >>>
>> >>>>
>> >>>> 5) Who is ready to commit to this effort?
>> >>> I'll let people speak for themselves on that one.
>> >>>
>> >>>>
>> >>>> 6) can we expect this work to be done in one hive release?
>> >>> Unlikely.  Initial integration will be done in one release, but as Tez is
>> >>> a new project I expect it will be adding features in the future that Hive
>> >>> will want to take advantage of.
>> >>>
>> >>>>
>> >>>> In my opinion we should not start any work on this tez-hive until these
>> >>>> questions are answered to the satisfaction of the hive developers.
>> >>>
>> >>> Can we change this to "not commit patches"?  We can't tell willing people
>> >>> not to work on it.
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo <[email protected]> wrote:
>> >>>>
>> >>>>>
>> >>>>> > The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws ,
>> >>>>> > lay out what votes are needed for what.  I don't see anything there about
>> >>>>> > needing 3 +1s for a branch.  Branching would seem to fall under code
>> >>>>> > change, which requires one vote and a minimum length of 1 day.
>> >>>>>
>> >>>>> You could argue that all you need is one +1 to create a branch, but this
>> >>>>> is more than a branch. If you are talking about something that is:
>> >>>>> 1) going to cause major re-factoring of critical pieces of hive like
>> >>>>> ExecDriver and MapRedTask
>> >>>>> 2) going to be very disruptive to the efforts of other committers
>> >>>>> 3) something that may be a major architectural change
>> >>>>>
>> >>>>> Getting the project on board with the idea is a good idea.
>> >>>>>
>> >>>>> Now I want to point something out. Here are some recent initiatives in
>> >>>>> hive:
>> >>>>>
>> >>>>> 1) At one point there was a big initiative to "support oracle". After the
>> >>>>> initial work, there are patches in Jira, but no one seems to care about
>> >>>>> oracle support.
>> >>>>> 2) Another such decision was this "support windows" one, there are
>> >>>>> probably 4 windows patches waiting for review.
>> >>>>> 3) I still have no clue what the official hadoop1, hadoop2, hadoop 0.23
>> >>>>> support perspective is, but every couple of weeks we get another jira about
>> >>>>> something not working/testing on one of those versions, seems like several
>> >>>>> builds are broken.
>> >>>>> 4) Hive-storage handler, after the initial implementation no one cares to
>> >>>>> review any other storage handler implementation, 3 patches there or more,
>> >>>>> could not even find anyone willing to review the cassandra storage handler
>> >>>>> I spent months on.
>> >>>>> 5) ORC, Vectorization
>> >>>>> 6) Windowing: committed, numerous check-style violations.
>> >>>>>
>> >>>>> We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active committers. We
>> >>>>> are spread very thin, and embarking on another side project not involved
>> >>>>> with core hive seems like the wrong direction at the moment.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates <[email protected]> wrote:
>> >>>>>
>> >>>>>>
>> >>>>>> On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
>> >>>>>>
>> >>>>>>> I have started to see several refactoring patches around tez.
>> >>>>>>> https://issues.apache.org/jira/browse/HIVE-4843
>> >>>>>>>
>> >>>>>>> This is the only mention on the hive list I can find with tez:
>> >>>>>>> "Makes sense. I will create the branch soon.
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>> Ashutosh
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner <
>> >>>>>>> [email protected]> wrote:
>> >>>>>>>
>> >>>>>>>> Hi,
>> >>>>>>>>
>> >>>>>>>> I am starting to work on integrating Tez into Hive (see HIVE-4660, design
>> >>>>>>>> doc has already been uploaded - any feedback will be much appreciated).
>> >>>>>>>> This will be a fair amount of work that will take time to stabilize/test.
>> >>>>>>>> I'd like to propose creating a branch in order to be able to do this
>> >>>>>>>> incrementally and collaboratively. In order to progress rapidly with this,
>> >>>>>>>> I would also like to go "commit-then-review".
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Gunther.
>> >>>>>>>> "
>> >>>>>>>
>> >>>>>>> These refactorings are largely destructive to a number of bug fixes and
>> >>>>>>> language improvements in hive. Those language improvements and bug fixes
>> >>>>>>> have been sitting in Jira for quite some time now, marked patch-available
>> >>>>>>> and waiting for review.
>> >>>>>>>
>> >>>>>>> There are a few things I want to point out:
>> >>>>>>> 1) Normally we create design docs in our wiki (which this is not)
>> >>>>>>> 2) Normally when the change is significantly complex we get multiple
>> >>>>>>> committers to comment on it (which we did not)
>> >>>>>>> On point 2, no one -1'd the branch, but this is really something that
>> >>>>>>> should have required a +1 from 3 committers.
>> >>>>>>
>> >>>>>> The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws,
>> >>>>>> lay out what votes are needed for what.  I don't see anything there about
>> >>>>>> needing 3 +1s for a branch.  Branching would seem to fall under code
>> >>>>>> change, which requires one vote and a minimum length of 1 day.
>> >>>>>>
>> >>>>>>>
>> >>>>>>> I for one am not completely sold on Tez.
>> >>>>>>> http://incubator.apache.org/projects/tez.html.
>> >>>>>>> "directed-acyclic-graph of tasks for processing data" this description
>> >>>>>>> sounds like many things which have never become popular. One to think of is
>> >>>>>>> oozie "Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of
>> >>>>>>> actions.". I am sure I can find a number of libraries/frameworks that make
>> >>>>>>> this same claim. In general I do not feel like we have done our homework
>> >>>>>>> and pre-requisites to justify all this work. If we have done the homework,
>> >>>>>>> I am sure that it has not been communicated and accepted by hive developers
>> >>>>>>> at large.
>> >>>>>>
>> >>>>>> A request for better documentation on Tez and a project road map seems
>> >>>>>> totally reasonable.
>> >>>>>>
>> >>>>>>>
>> >>>>>>> If we have a branch, why are we also committing on trunk? Scanning through
>> >>>>>>> the tez doc I keep finding language like "minimal changes to the planner",
>> >>>>>>> yet there are ALREADY lots of large changes going on!
>> >>>>>>>
>> >>>>>>> Really none of the above would bother me except for the fact that these
>> >>>>>>> "minimal changes" are causing many "patch available" ready-for-review bugs
>> >>>>>>> and core hive features to need to be rebased.
>> >>>>>>>
>> >>>>>>> I am sure I have mentioned this before, but I have to spend 12+ hours to
>> >>>>>>> test a single patch on my laptop. A few days ago I was testing a new core
>> >>>>>>> hive feature. After all the tests passed and before I was able to commit,
>> >>>>>>> someone unleashed a tez patch on trunk which caused the thing I was testing
>> >>>>>>> for 12 hours to need to be rebased.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> I'm not cool with this. Next time that happens to me I will seriously
>> >>>>>>> consider reverting the patch. Bug fixes and new hive features are more
>> >>>>>>> important to me than integrating with incubator projects.
>> >>>>>>
>> >>>>>> (With my Apache member hat on)  Reverting patches that aren't breaking
>> >>>>>> the build is considered very bad form in Apache.  It does make sense to
>> >>>>>> request that when people are going to commit a patch that will break many
>> >>>>>> other patches they first give a few hours of notice so people can say
>> >>>>>> something if they're about to commit another patch and avoid your fate of
>> >>>>>> needing to rerun the tests.  The other thing is we need to get the
>> >>>>>> automated build of patches working on Hive so committers aren't forced to
>> >>>>>> run all of the tests themselves.  We are working on it, but we're not
>> >>>>>> there yet.
>> >>>>>>
>> >>>>>> Alan.
>> >>>>>>
>> >>>>>>
>> >>>>>
>> >>>
>> >>>
>> >>
>>
>>
>
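
The speed-versus-recovery trade-off Alan describes earlier in the thread
(stream the output of short phases, persist the output of long ones, and on
failure restart from the last point where data was persisted) can be
sketched in a few lines. This is only an illustrative sketch under stated
assumptions, not Tez or Hive code: the Phase class, the plan_outputs and
restart_index functions, and the 600-second threshold are all hypothetical
names and values invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Phase:
    """One phase of a query plan (hypothetical type, not a Tez or Hive class)."""
    name: str
    est_seconds: float  # the optimizer's runtime estimate for this phase


def plan_outputs(phases: List[Phase],
                 persist_threshold_s: float = 600.0) -> Dict[str, str]:
    """Decide per phase whether to stream or persist its output.

    Assumption: a phase estimated to run at least `persist_threshold_s`
    seconds is expensive enough that its output is worth writing to
    durable storage; anything shorter is streamed to the next phase.
    """
    plan = {}
    for phase in phases:
        if phase.est_seconds >= persist_threshold_s:
            plan[phase.name] = "persist"
        else:
            plan[phase.name] = "stream"
    return plan


def restart_index(phases: List[Phase], plan: Dict[str, str],
                  failed_index: int) -> int:
    """Return the index of the phase to restart from after a failure.

    Walk backwards from the failed phase to the most recent phase whose
    output was persisted; execution resumes right after it. If nothing
    upstream was persisted, the whole query is re-run from phase 0.
    """
    for i in range(failed_index - 1, -1, -1):
        if plan[phases[i].name] == "persist":
            return i + 1
    return 0


if __name__ == "__main__":
    query = [Phase("scan", 5), Phase("big_join", 3600), Phase("agg", 5)]
    decisions = plan_outputs(query)
    print(decisions)
    print(restart_index(query, decisions, failed_index=2))
```

In this sketch a failure in the last phase only re-runs that phase, because
the hour-long join ahead of it was persisted, while a failure in the join
itself restarts the query from the beginning, since the short scan before it
was streamed.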
