Guys, first off, apologies for bringing up the topic of MR-based compactions.
But I was thinking more about the SpliceMachine approach of managing
compactions in Spark, where apparently they saw a lot of benefits. Apologies
for giving you that sore throat, Andrew; I really didn't mean to :-)

So on this issue, we have these options on the plate:
0. Somehow avoid MR and use something like it instead
1. Run a standalone service other than the master
2. Shell out from the master

I don't think we have a good answer to (0), and I don't think it's even worth
the effort of trying to build something when MR is already there and is
already used by HBase for some operations.

On (1), we have to deal with a myriad of issues, HA of the server not being
the least of them, along with security (Kerberos authentication, another
keytab to manage, etc.). IMO, that approach is DOA. Instead, let's substitute
the HBase Master for the standalone service in (1). I haven't seen any good
reason why the HBase master shouldn't launch MR jobs if needed. It's not
ideal; agreed.

Now before going to (2), let's look at the benefits of running the
backup/restore jobs from the master. I think Ted has summarized some of the
issues that we need to take care of. Basically, the master can keep track of
running jobs, and should it fail, the backup master can continue keeping track
of them (since the jobId would have been recorded in the proc WAL). The master
can also clean up after failed backup/restore processes. Security is another
issue: the job needs to run as 'hbase' since that user owns the data. Having
the master launch the job gives it that privilege. In the (2) approach, it's
hard to do some of the above management.
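
A minimal sketch of the failover bookkeeping described above, assuming a
hypothetical ProcLog class standing in for the proc WAL and a faked job
submission. None of these class or method names come from the HBASE-7912
patch or from HBase's actual Procedure V2 API; they only illustrate why
recording the jobId lets a backup master pick up tracking:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the procedure WAL: an append-only log that both
// the active and the backup master can read. Not HBase's real API.
final class ProcLog {
    private final List<String[]> entries = new ArrayList<>();

    void append(String procId, String jobId) {
        entries.add(new String[] {procId, jobId});
    }

    // Latest job id recorded for a procedure, if any.
    Optional<String> lastJobId(String procId) {
        for (int i = entries.size() - 1; i >= 0; i--) {
            if (entries.get(i)[0].equals(procId)) {
                return Optional.of(entries.get(i)[1]);
            }
        }
        return Optional.empty();
    }
}

// Hypothetical master-side coordinator; job submission is faked with a counter.
final class BackupCoordinator {
    private final ProcLog log;
    private int nextJob = 1;

    BackupCoordinator(ProcLog log) {
        this.log = log;
    }

    // "Submit" the fake job and record its id right away, so that a failover
    // after this point can still find the job.
    String startBackup(String procId) {
        String jobId = "job_" + nextJob++;
        log.append(procId, jobId);
        return jobId;
    }

    // A newly active master looks up the recorded job instead of resubmitting.
    Optional<String> resume(String procId) {
        return log.lastJobId(procId);
    }
}

final class BackupFailoverSketch {
    public static void main(String[] args) {
        ProcLog wal = new ProcLog();
        String jobId = new BackupCoordinator(wal).startBackup("backup-1");
        // Simulate failover: a fresh coordinator that shares only the log.
        Optional<String> tracked = new BackupCoordinator(wal).resume("backup-1");
        System.out.println(jobId + " tracked after failover as " + tracked.orElse("<none>"));
    }
}
```

The point of the sketch is only that the second coordinator shares nothing
with the first except the log, which is the property a client-driven or
shelled-out job would have to rebuild on its own.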

Guys, just to reiterate, the patch as such is ready from the overall
design/arch point of view (maybe code review is still pending from Matteo). If
in the future we find better ways of doing this without using MR, we can
certainly consider that. But IMO we shouldn't block this patch from getting
merged.

________________________________________
From: 张铎 <palomino...@gmail.com>
Sent: Thursday, September 22, 2016 8:32 PM
To: dev@hbase.apache.org
Subject: Re: [DISCUSSION] MR jobs started by Master or RS

So what about a standalone service other than master? You can use your own
procedure store in that service?

2016-09-23 11:28 GMT+08:00 Ted Yu <yuzhih...@gmail.com>:

> An earlier implementation was client driven.
>
> But with that approach, it is hard to resume if there is error midway.
> Using Procedure V2 makes the backup / restore more robust.
>
> Another consideration is for security. It is hard to enforce security (to
> be implemented) for client driven actions.
>
> Cheers
>
> > On Sep 22, 2016, at 8:15 PM, Andrew Purtell <andrew.purt...@gmail.com>
> > wrote:
> >
> > No, this misses Matteo's finer point, which is "shelling out" from the
> > master directly to run MR is a first. Why not drive this with a utility
> > derived from Tool?
> >
> > On Sep 22, 2016, at 7:57 PM, Vladimir Rodionov <vladrodio...@gmail.com>
> > wrote:
> >
> >>>> In our production cluster,  it is a common case we just have HDFS and
> >>>> HBase deployed.
> >>>> If our Master/RS depend on the MR framework (especially for features
> >>>> we have not used at all), it introduces another maintenance cost. I
> >>>> don't think it is a good idea.
> >>
> >> So, you are not backup users in this case. Many of our customers have
> >> the full stack deployed and want to see backup as a standard feature.
> >> Besides this, nothing will happen in your cluster if you won't be doing
> >> backups.
> >>
> >> This discussion (we do not want to see an M/R dependency) is going
> >> nowhere. We have asked already, at least twice, for suggestions of
> >> another framework (other than M/R) for bulk data copy with
> >> *conversion*. Still waiting for suggestions.
> >>
> >> -Vlad
> >>
> >>
> >>
> >>
> >>> On Thu, Sep 22, 2016 at 7:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>
> >>> If MR framework is not deployed in the cluster, hbase still functions
> >>> normally (post merge).
> >>>
> >>> In terms of build time dependency, we have long been depending on
> >>> mapreduce. Take a look at ExportSnapshot.
> >>>
> >>> Cheers
> >>>
> >>> On Thu, Sep 22, 2016 at 7:42 PM, Heng Chen <heng.chen.1...@gmail.com>
> >>> wrote:
> >>>
> >>>> In our production cluster,  it is a common case we just have HDFS and
> >>>> HBase deployed.
> >>>> If our Master/RS depend on the MR framework (especially for features
> >>>> we have not used at all), it introduces another maintenance cost. I
> >>>> don't think it is a good idea.
> >>>>
> >>>> 2016-09-23 10:28 GMT+08:00 张铎 <palomino...@gmail.com>:
> >>>>> To be specific, for example, our nice Backup/Restore feature: if we
> >>>>> think this is not a core feature of HBase, then we could make it
> >>>>> depend on MR, and start a standalone BackupManager instance that
> >>>>> submits MR jobs to do periodic maintenance. And if we think this is
> >>>>> a core feature that everyone should use, then we'd better implement
> >>>>> it without an MR dependency, like DLS.
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>> 2016-09-23 10:11 GMT+08:00 张铎 <palomino...@gmail.com>:
> >>>>>
> >>>>>> I'm -1 on letting the master or RS launch MR jobs. It is OK that some
> >>>>>> of our features depend on MR, but I think the bottom line is that we
> >>>>>> should launch the jobs from outside, manually or by other services.
> >>>>>>
> >>>>>> 2016-09-23 9:47 GMT+08:00 Andrew Purtell <andrew.purt...@gmail.com>:
> >>>>>>
> >>>>>>> Ok, got it. Well "shelling out" is on the line I think, so a fair
> >>>>>>> question.
> >>>>>>>
> >>>>>>> Can this be driven by a utility derived from Tool like our other MR
> >>>>>>> apps? The issue is needing the AccessController to decide if
> >>>>>>> allowed? But nothing prevents the user from running the job
> >>>>>>> manually/independently, right?
> >>>>>>>
> >>>>>>>> On Sep 22, 2016, at 3:44 PM, Matteo Bertozzi
> >>>>>>>> <theo.berto...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> just a remark. my query was not about tools using MR (everyone i
> >>>>>>>> think is ok with those).
> >>>>>>>> the topic was about: "are we ok with running MR jobs from Master
> >>>>>>>> and RSs code?" since this will be the first time we do this
> >>>>>>>>
> >>>>>>>> Matteo
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Thu, Sep 22, 2016 at 2:49 PM, Devaraj Das
> >>>>>>>>> <d...@hortonworks.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Very much agree; for tools like ExportSnapshot / Backup / Restore,
> >>>>>>>>> it's fine to be dependent on MR. MR is the right framework for
> >>>>>>>>> such. We should also do compactions using MR (just saying :) )
> >>>>>>>>> ________________________________________
> >>>>>>>>> From: Ted Yu <yuzhih...@gmail.com>
> >>>>>>>>> Sent: Thursday, September 22, 2016 2:00 PM
> >>>>>>>>> To: dev@hbase.apache.org
> >>>>>>>>> Subject: Re: [DISCUSSION] MR jobs started by Master or RS
> >>>>>>>>>
> >>>>>>>>> I agree - backup / restore is in the same category as import /
> >>>>>>>>> export.
> >>>>>>>>>
> >>>>>>>>> On Thu, Sep 22, 2016 at 1:58 PM, Andrew Purtell
> >>>>>>>>> <andrew.purt...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Backup is extra tooling around core in my opinion. Like import or
> >>>>>>>>>> export. Or the optional MOB tool. It's fine.
> >>>>>>>>>>
> >>>>>>>>>>> On Sep 22, 2016, at 1:50 PM, Matteo Bertozzi
> >>>>>>>>>>> <mberto...@apache.org> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> What's the latest opinion around running MR jobs from hbase
> >>>>>>>>>>> (Master or RS)?
> >>>>>>>>>>>
> >>>>>>>>>>> I remember in the past that there was discussion about not
> >>>>>>>>>>> having MR as a direct dependency of hbase.
> >>>>>>>>>>>
> >>>>>>>>>>> I think some of the discussion was around MOB, which had an MR
> >>>>>>>>>>> job to compact that was later transformed into a non-MR job in
> >>>>>>>>>>> order to be merged. I think we had a similar discussion for log
> >>>>>>>>>>> split/replay.
> >>>>>>>>>>>
> >>>>>>>>>>> the latest is the new Backup feature (HBASE-7912), which runs an
> >>>>>>>>>>> MR job from the master to copy data or restore data.
> >>>>>>>>>>> (backup is also "not really core" as in.. if you don't use
> >>>>>>>>>>> backup you'll not end up running MR jobs, but this was probably
> >>>>>>>>>>> true for MOB as in "if you don't enable MOB you don't need MR")
> >>>>>>>>>>>
> >>>>>>>>>>> any thoughts? do we have a rule that says "we don't want to have
> >>>>>>>>>>> hbase run MR jobs, only tools started manually by the user can
> >>>>>>>>>>> do that"? or can we start adding MR calls around without
> >>>>>>>>>>> problems?
> >>>
>
