Re: [DISCUSS] Properties for scheduling compactions on specific queues

2022-02-10 Thread Stamatis Zampetakis
Hi all,

@Janos: The patch didn't go through but I get the idea. FYI: in most apache
lists attachments are not allowed

Since people do not have strong feelings about this I am inclined to move
forward with Janos suggestion and accept all three properties for
specifying compactions.

I just logged HIVE-25947 about this.

Best,
Stamatis

On Tue, Feb 8, 2022 at 2:45 PM Janos Kovacs  wrote:

> Hi Stamatis,
>
> The attached one is a more generic proposal: basically moves out target
> queue resolution to CompactorUtil.getCompactorJobQueueName as it is
> already in use for the StatsUpdater.
> There then the old properties are used based on a new fallback config
> prop. Fallback is only for statement (ci.props) and table (t.props) and not
> global configuration.
>
> It's just a mock-up, I didn't even check if it compiles, but shows the
> logic which should be good enough for the discussion.
>
> R, Janos
>
>
> Stamatis Zampetakis  ezt írta (időpont: 2022. febr.
> 7., H, 23:44):
>
>> Thanks Janos for the feedback.
>>
>> If I understand well your suggestion is support all of the properties
>> below
>> for table level compactions and treat them as equivalent:
>> * compactor.mapred.job.queue.name
>> * compactor.mapreduce.job.queuename
>> * compactor.hive.compactor.job.queue
>>
>> It is something that crossed my mind as well but I am slightly skeptical
>> because like this we explicitly state that people are free to use whatever
>> they like. It might also have as a consequence MR properties affecting Tez
>> (as it happens a bit with HIVE-25595) which from my perspective is not
>> that
>> great. I am also thinking that it will lead to more requests for accepting
>> these MR specific properties in the query based compactor which cannot
>> (and
>> probably will never) use MR as the underlying engine. We should also keep
>> in mind that the MR engine was deprecated ~6years ago and the MR compactor
>> may follow soon.
>>
>> I am fine implementing this specific change (accepting all properties
>> above) as long as someone from the people contributing to the compactor
>> confirms it is the desired path going forward.
>>
>> Best,
>> Stamatis
>>
>> On Mon, Feb 7, 2022 at 11:50 AM Janos Kovacs  wrote:
>>
>> > Hi Stamatis,
>> >
>> > I agree that the [compactor.]*hive.compactor.queue.name
>> > * is a better solution as hive now
>> also
>> > supports query based compaction, not only MR.
>> > ...although I think this needs to be backward compatible!
>> >
>> > What do you think about a logic similar to this:
>> >
>> > ---
>> a/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java
>> 2022-02-07 10:31:28.0 +0100
>> > +++
>> b/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java
>> 2022-02-07 10:33:25.0 +0100
>> > @@ -145,10 +145,19 @@
>> >  overrideMRProps(job, t.getParameters()); // override MR properties
>> from tblproperties if applicable
>> >  if (ci.properties != null) {
>> >overrideTblProps(job, t.getParameters(), ci.properties);
>> >  }
>> >
>> > +// make queue configuration backward compatible
>> > +// at that point overrideMRProps and OverrideTblProps already
>> consolidated
>> > +// the final value, just need to use job.TBALE_PROPS
>> > +String queueNameLegacy =
>> > +  (new
>> StringableMap(job.get(TABLE_PROPS))).toProperties().getProperty("
>> compactor.mapred.job.queue.name");
>> > +if (queueNameLegacy != null && queueNameLegacy.length() > 0) {
>> > +  job.set(ConfVars.COMPACTOR_JOB_QUEUE, queueNameLegacy);
>> > +}
>> > +
>> >  String queueName = HiveConf.getVar(job,
>> ConfVars.COMPACTOR_JOB_QUEUE);
>> >  if (queueName != null && queueName.length() > 0) {
>> >job.setQueueName(queueName);
>> >  }
>> >
>> >
>> > Of course this can be wrapped around with a new config if needed, like
>> > hive.compaction.queue.name.use.legacy or whatever...
>> > FYI: we might also want to check legacy config not only for *"
>> compactor.mapred.job.queue.name
>> > "* but also for
>> > *"compactor.mapreduce.job.queuename" *as the first one was already on
>> the
>> > deprecated list as pointed out by Peter Vary.
>> >
>> > Please also note that the change introduced by HIVE-25595 is currently
>> not
>> > compatible with the new config as it was developed for the old
>> > compactor.mapred... property:
>> >
>> >
>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorUtil.java#L31
>> > This also needs to be handled - for both the new prop name and backward
>> > compatibility.
>> >
>> > R, Janos
>> >
>> >
>> > On 2022/01/31 09:50:49 Stamatis Zampetakis wrote:
>> > > Hi all,
>> > >
>> > > This email is an attempt to converge on which Hive/Tez/MR properties
>> > > someone should use in order to schedule a compaction on specific
>> queues.
>> > > For those who are not familiar with how queues are used the YARN
>

Re: [DISCUSS] Properties for scheduling compactions on specific queues

2022-02-08 Thread Janos Kovacs
Hi Stamatis,

The attached one is a more generic proposal: basically moves out target
queue resolution to CompactorUtil.getCompactorJobQueueName as it is already
in use for the StatsUpdater.
There then the old properties are used based on a new fallback config prop.
Fallback is only for statement (ci.props) and table (t.props) and not
global configuration.

It's just a mock-up, I didn't even check if it compiles, but shows the
logic which should be good enough for the discussion.

R, Janos


Stamatis Zampetakis  ezt írta (időpont: 2022. febr. 7.,
H, 23:44):

> Thanks Janos for the feedback.
>
> If I understand well your suggestion is support all of the properties below
> for table level compactions and treat them as equivalent:
> * compactor.mapred.job.queue.name
> * compactor.mapreduce.job.queuename
> * compactor.hive.compactor.job.queue
>
> It is something that crossed my mind as well but I am slightly skeptical
> because like this we explicitly state that people are free to use whatever
> they like. It might also have as a consequence MR properties affecting Tez
> (as it happens a bit with HIVE-25595) which from my perspective is not that
> great. I am also thinking that it will lead to more requests for accepting
> these MR specific properties in the query based compactor which cannot (and
> probably will never) use MR as the underlying engine. We should also keep
> in mind that the MR engine was deprecated ~6years ago and the MR compactor
> may follow soon.
>
> I am fine implementing this specific change (accepting all properties
> above) as long as someone from the people contributing to the compactor
> confirms it is the desired path going forward.
>
> Best,
> Stamatis
>
> On Mon, Feb 7, 2022 at 11:50 AM Janos Kovacs  wrote:
>
> > Hi Stamatis,
> >
> > I agree that the [compactor.]*hive.compactor.queue.name
> > * is a better solution as hive now
> also
> > supports query based compaction, not only MR.
> > ...although I think this needs to be backward compatible!
> >
> > What do you think about a logic similar to this:
> >
> > ---
> a/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java
> 2022-02-07 10:31:28.0 +0100
> > +++
> b/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java
> 2022-02-07 10:33:25.0 +0100
> > @@ -145,10 +145,19 @@
> >  overrideMRProps(job, t.getParameters()); // override MR properties
> from tblproperties if applicable
> >  if (ci.properties != null) {
> >overrideTblProps(job, t.getParameters(), ci.properties);
> >  }
> >
> > +// make queue configuration backward compatible
> > +// at that point overrideMRProps and OverrideTblProps already
> consolidated
> > +// the final value, just need to use job.TBALE_PROPS
> > +String queueNameLegacy =
> > +  (new
> StringableMap(job.get(TABLE_PROPS))).toProperties().getProperty("
> compactor.mapred.job.queue.name");
> > +if (queueNameLegacy != null && queueNameLegacy.length() > 0) {
> > +  job.set(ConfVars.COMPACTOR_JOB_QUEUE, queueNameLegacy);
> > +}
> > +
> >  String queueName = HiveConf.getVar(job,
> ConfVars.COMPACTOR_JOB_QUEUE);
> >  if (queueName != null && queueName.length() > 0) {
> >job.setQueueName(queueName);
> >  }
> >
> >
> > Of course this can be wrapped around with a new config if needed, like
> > hive.compaction.queue.name.use.legacy or whatever...
> > FYI: we might also want to check legacy config not only for *"
> compactor.mapred.job.queue.name
> > "* but also for
> > *"compactor.mapreduce.job.queuename" *as the first one was already on the
> > deprecated list as pointed out by Peter Vary.
> >
> > Please also note that the change introduced by HIVE-25595 is currently
> not
> > compatible with the new config as it was developed for the old
> > compactor.mapred... property:
> >
> >
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorUtil.java#L31
> > This also needs to be handled - for both the new prop name and backward
> > compatibility.
> >
> > R, Janos
> >
> >
> > On 2022/01/31 09:50:49 Stamatis Zampetakis wrote:
> > > Hi all,
> > >
> > > This email is an attempt to converge on which Hive/Tez/MR properties
> > > someone should use in order to schedule a compaction on specific
> queues.
> > > For those who are not familiar with how queues are used the YARN
> capacity
> > > scheduler documentation [1] gives the general idea.
> > >
> > > Using specific queues for compaction jobs is necessary to be able to
> > > efficiently allocate resources for maintenance tasks (compaction) and
> > > production workloads. Hive provides various ways to control the queues
> > used
> > > by the compactor and there have been various tickets with improvements
> > and
> > > fixes in this area (see list below).
> > >
> > > The granularity we can select queues for compactions (all tables vs.
> per
> > >

Re: [DISCUSS] Properties for scheduling compactions on specific queues

2022-02-07 Thread Stamatis Zampetakis
Thanks Janos for the feedback.

If I understand well your suggestion is support all of the properties below
for table level compactions and treat them as equivalent:
* compactor.mapred.job.queue.name
* compactor.mapreduce.job.queuename
* compactor.hive.compactor.job.queue

It is something that crossed my mind as well but I am slightly skeptical
because like this we explicitly state that people are free to use whatever
they like. It might also have as a consequence MR properties affecting Tez
(as it happens a bit with HIVE-25595) which from my perspective is not that
great. I am also thinking that it will lead to more requests for accepting
these MR specific properties in the query based compactor which cannot (and
probably will never) use MR as the underlying engine. We should also keep
in mind that the MR engine was deprecated ~6years ago and the MR compactor
may follow soon.

I am fine implementing this specific change (accepting all properties
above) as long as someone from the people contributing to the compactor
confirms it is the desired path going forward.

Best,
Stamatis

On Mon, Feb 7, 2022 at 11:50 AM Janos Kovacs  wrote:

> Hi Stamatis,
>
> I agree that the [compactor.]*hive.compactor.queue.name
> * is a better solution as hive now also
> supports query based compaction, not only MR.
> ...although I think this needs to be backward compatible!
>
> What do you think about a logic similar to this:
>
> --- a/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java
> 2022-02-07 10:31:28.0 +0100
> +++ b/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java
> 2022-02-07 10:33:25.0 +0100
> @@ -145,10 +145,19 @@
>  overrideMRProps(job, t.getParameters()); // override MR properties from 
> tblproperties if applicable
>  if (ci.properties != null) {
>overrideTblProps(job, t.getParameters(), ci.properties);
>  }
>
> +// make queue configuration backward compatible
> +// at that point overrideMRProps and OverrideTblProps already 
> consolidated
> +// the final value, just need to use job.TBALE_PROPS
> +String queueNameLegacy =
> +  (new 
> StringableMap(job.get(TABLE_PROPS))).toProperties().getProperty("compactor.mapred.job.queue.name");
> +if (queueNameLegacy != null && queueNameLegacy.length() > 0) {
> +  job.set(ConfVars.COMPACTOR_JOB_QUEUE, queueNameLegacy);
> +}
> +
>  String queueName = HiveConf.getVar(job, ConfVars.COMPACTOR_JOB_QUEUE);
>  if (queueName != null && queueName.length() > 0) {
>job.setQueueName(queueName);
>  }
>
>
> Of course this can be wrapped around with a new config if needed, like
> hive.compaction.queue.name.use.legacy or whatever...
> FYI: we might also want to check legacy config not only for 
> *"compactor.mapred.job.queue.name
> "* but also for
> *"compactor.mapreduce.job.queuename" *as the first one was already on the
> deprecated list as pointed out by Peter Vary.
>
> Please also note that the change introduced by HIVE-25595 is currently not
> compatible with the new config as it was developed for the old
> compactor.mapred... property:
>
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorUtil.java#L31
> This also needs to be handled - for both the new prop name and backward
> compatibility.
>
> R, Janos
>
>
> On 2022/01/31 09:50:49 Stamatis Zampetakis wrote:
> > Hi all,
> >
> > This email is an attempt to converge on which Hive/Tez/MR properties
> > someone should use in order to schedule a compaction on specific queues.
> > For those who are not familiar with how queues are used the YARN capacity
> > scheduler documentation [1] gives the general idea.
> >
> > Using specific queues for compaction jobs is necessary to be able to
> > efficiently allocate resources for maintenance tasks (compaction) and
> > production workloads. Hive provides various ways to control the queues
> used
> > by the compactor and there have been various tickets with improvements
> and
> > fixes in this area (see list below).
> >
> > The granularity we can select queues for compactions (all tables vs. per
> > table) currently depends on which compactor is in use (MR vs Query based)
> > and boils down to the following properties:
> >
> > Global configuration:
> > * hive.compactor.job.queue
> > * mapred.job.queue.name
> > * tez.queue.name
> >
> > Per table/statement configuration (table properties):
> > * compactor.mapred.job.queue.name (before HIVE-20723)
> > * compactor.hive.compactor.job.queue (after HIVE-20723)
> >
> > Things are a bit blurred with respect to what properties someone should
> use
> > to achieve the desired result. Some changes, such as HIVE-20723, raise
> > backward compatibility concerns and other changes seem to have a larger
> > impact than the one specifically designed for. For example, after
> > HIVE-25595, map reduce queue proper

Re: [DISCUSS] Properties for scheduling compactions on specific queues

2022-01-31 Thread Alessandro Solimando
Hi Stamatis,
the proposal seems reasonable to me.

I think that setting the two properties you mention, independently from the
underlying execution engine in use, should lead to the same result.

In addition, I also agree that we should deprecate the per-execution engine
properties.

Best regards,
Alessandro

On Mon, 31 Jan 2022 at 10:51, Stamatis Zampetakis  wrote:

> Hi all,
>
> This email is an attempt to converge on which Hive/Tez/MR properties
> someone should use in order to schedule a compaction on specific queues.
> For those who are not familiar with how queues are used the YARN capacity
> scheduler documentation [1] gives the general idea.
>
> Using specific queues for compaction jobs is necessary to be able to
> efficiently allocate resources for maintenance tasks (compaction) and
> production workloads. Hive provides various ways to control the queues used
> by the compactor and there have been various tickets with improvements and
> fixes in this area (see list below).
>
> The granularity we can select queues for compactions (all tables vs. per
> table) currently depends on which compactor is in use (MR vs Query based)
> and boils down to the following properties:
>
> Global configuration:
> * hive.compactor.job.queue
> * mapred.job.queue.name
> * tez.queue.name
>
> Per table/statement configuration (table properties):
> * compactor.mapred.job.queue.name (before HIVE-20723)
> * compactor.hive.compactor.job.queue (after HIVE-20723)
>
> Things are a bit blurred with respect to what properties someone should
> use to achieve the desired result. Some changes, such as HIVE-20723, raise
> backward compatibility concerns and other changes seem to have a larger
> impact than the one specifically designed for. For example, after
> HIVE-25595, map reduce queue properties can have an impact on the compactor
> queues even when Tez is in use.
>
> In order to avoid confusion and ensure long term support of these queue
> selection features we should clarify which of the above properties should
> be used.
>
> Given the current situation, I would propose to officially support only
> the following:
> * hive.compactor.job.queue
> * compactor.hive.compactor.job.queue
> and align the implementation based on these (if necessary). In other
> words, Hive users should not use mapred.job.queue.name and tez.queue.name
> explicitly at least when it comes to the compactor. Hive should set them
> transparently (as it happens now in various places) based on
> [compactor.]hive.compactor.job.queue.
>
> What do people think? Are there other ideas?
>
> Best,
> Stamatis
>
> [1]
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
>
> HIVE-11997: Add ability to send Compaction Jobs to specific queue
> HIVE-13354: Add ability to specify Compaction options per table and per
> request
> HIVE-20723: Allow per table specification of compaction yarn queue
> HIVE-24781: Allow to use custom queue for query based compaction
> HIVE-25801: Custom queue settings is not honoured by Query based
> compaction StatsUpdater
> HIVE-25595: Custom queue settings is not honoured by compaction
> StatsUpdater
>


[DISCUSS] Properties for scheduling compactions on specific queues

2022-01-31 Thread Stamatis Zampetakis
Hi all,

This email is an attempt to converge on which Hive/Tez/MR properties
someone should use in order to schedule a compaction on specific queues.
For those who are not familiar with how queues are used the YARN capacity
scheduler documentation [1] gives the general idea.

Using specific queues for compaction jobs is necessary to be able to
efficiently allocate resources for maintenance tasks (compaction) and
production workloads. Hive provides various ways to control the queues used
by the compactor and there have been various tickets with improvements and
fixes in this area (see list below).

The granularity we can select queues for compactions (all tables vs. per
table) currently depends on which compactor is in use (MR vs Query based)
and boils down to the following properties:

Global configuration:
* hive.compactor.job.queue
* mapred.job.queue.name
* tez.queue.name

Per table/statement configuration (table properties):
* compactor.mapred.job.queue.name (before HIVE-20723)
* compactor.hive.compactor.job.queue (after HIVE-20723)

Things are a bit blurred with respect to what properties someone should use
to achieve the desired result. Some changes, such as HIVE-20723, raise
backward compatibility concerns and other changes seem to have a larger
impact than the one specifically designed for. For example, after
HIVE-25595, map reduce queue properties can have an impact on the compactor
queues even when Tez is in use.

In order to avoid confusion and ensure long term support of these queue
selection features we should clarify which of the above properties should
be used.

Given the current situation, I would propose to officially support only the
following:
* hive.compactor.job.queue
* compactor.hive.compactor.job.queue
and align the implementation based on these (if necessary). In other words,
Hive users should not use mapred.job.queue.name and tez.queue.name
explicitly at least when it comes to the compactor. Hive should set them
transparently (as it happens now in various places) based on
[compactor.]hive.compactor.job.queue.

What do people think? Are there other ideas?

Best,
Stamatis

[1]
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html

HIVE-11997: Add ability to send Compaction Jobs to specific queue
HIVE-13354: Add ability to specify Compaction options per table and per
request
HIVE-20723: Allow per table specification of compaction yarn queue
HIVE-24781: Allow to use custom queue for query based compaction
HIVE-25801: Custom queue settings is not honoured by Query based compaction
StatsUpdater
HIVE-25595: Custom queue settings is not honoured by compaction StatsUpdater