Jacob Tolar created OOZIE-3668:
----------------------------------
Summary: Simplify setting oozie.launcher.mapreduce.job.hdfs-servers
Key: OOZIE-3668
URL: https://issues.apache.org/jira/browse/OOZIE-3668
Project: Oozie
Issue Type: New Feature
Reporter: Jacob Tolar
When running Oozie jobs that depend on cross-cluster HDFS paths, I am required
to provide the parameter {{oozie.launcher.mapreduce.job.hdfs-servers}}.
This is a pain to manage when there are many data sources, or when the same
coordinator/workflow is deployed to multiple clusters (e.g. staging,
production) that have different cross-cluster data access requirements. We
need to keep track of the datasets and nameNode lists in two places.
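For reference, this is roughly what the manual setup looks like today in a
workflow's global configuration ({{hdfs://prod-nn:8020}} and
{{hdfs://archive-nn:8020}} are placeholder NameNode URIs):
{code:xml}
<!-- Sketch of the status quo: every remote NameNode has to be listed by
     hand, duplicating information the coordinator's datasets already
     contain. -->
<global>
    <configuration>
        <property>
            <name>oozie.launcher.mapreduce.job.hdfs-servers</name>
            <value>hdfs://prod-nn:8020,hdfs://archive-nn:8020</value>
        </property>
    </configuration>
</global>
{code}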
It's especially obnoxious if you are using something like an HCatalog table
with partitions registered on a different HDFS. In that case, you can define
your dataset and Oozie's coordinator takes care of all the details no matter
where the partitions are stored, but the workflow will fail unless you inspect
the table and add the correct name nodes to the hdfs-servers setting.
If you are using Oozie coordinators with data dependencies to schedule jobs,
Oozie should have access to all the required information to provide this
setting automatically, which would help eliminate errors when the setting is
missing or set incorrectly.
I think there are two reasonable approaches, both of which should be feasible.
They're not necessarily mutually exclusive, but I would be happy with just one
of them:
1. Oozie sets the value automatically
In this case, Oozie coordinator execution is updated to compute the list of
hdfs-servers and pass it through to the workflow via the configuration. Oozie
workflow execution is updated to use the value computed by the coordinator as
the default value for {{oozie.launcher.mapreduce.job.hdfs-servers}} when the
setting is not otherwise provided.
The user should still be able to override the setting if needed. It would also
be helpful if there were a way for the user to specify *additional*
hdfs-servers, i.e. everything computed by the coordinator plus something else,
for example
{{oozie.launcher.mapreduce.job.hdfs-servers=${oozie.coord.hdfs-servers},hdfs://name-node}},
but that may be an uncommon use case. A rough sketch of how this could look
from the workflow side follows.
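A minimal sketch, assuming the coordinator exposes its computed list under a
property such as {{oozie.coord.hdfs-servers}} (a hypothetical name): the
workflow appends one extra NameNode while keeping everything the coordinator
computed.
{code:xml}
<!-- Hypothetical: ${oozie.coord.hdfs-servers} holds the list computed by
     the coordinator; hdfs://extra-nn:8020 is a placeholder for a NameNode
     the coordinator does not know about. -->
<global>
    <configuration>
        <property>
            <name>oozie.launcher.mapreduce.job.hdfs-servers</name>
            <value>${oozie.coord.hdfs-servers},hdfs://extra-nn:8020</value>
        </property>
    </configuration>
</global>
{code}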
2. Oozie provides EL functions for easily computing the {{hdfs-servers}} setting
In this case, Oozie could be updated to provide three new coordinator EL
functions. The output could be passed through to the workflow and used as
needed by the user (see the sketch after the list).
1. {{coord:getAllDatasetHdfsServers()}}: Takes no parameters and outputs a
string.
This function iterates over all {{dataIn}} and {{dataOut}} datasets configured
in the coordinator, and constructs a string suitable for passing to the
workflow parameter {{oozie.launcher.mapreduce.job.hdfs-servers}}. It should
work for all supported dataset types (e.g. HDFS, HCatalog).
2. {{coord:getDataInHdfsServers(String dataIn)}}: Takes one parameter and
outputs a string.
This function does the same thing as (1), but only for the specified dataIn
dataset.
3. {{coord:getDataOutHdfsServers(String dataOut)}}: Takes one parameter and
outputs a string.
This function does the same thing as (1), but only for the specified dataOut
dataset.
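To illustrate approach 2, a coordinator could forward the computed list to its
workflow like this. This is a sketch only: the EL function is the proposed
{{coord:getAllDatasetHdfsServers()}} above, and the app name, dates, and
{{${workflowAppPath}}} are placeholders.
{code:xml}
<coordinator-app name="example-coord" frequency="${coord:days(1)}"
                 start="2021-01-01T00:00Z" end="2022-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <!-- dataset definitions (HDFS, HCatalog, ...) elided -->
    </datasets>
    <input-events>
        <!-- data-in definitions elided -->
    </input-events>
    <action>
        <workflow>
            <app-path>${workflowAppPath}</app-path>
            <configuration>
                <property>
                    <name>oozie.launcher.mapreduce.job.hdfs-servers</name>
                    <!-- Proposed function: collects the NameNodes of all
                         dataIn/dataOut datasets declared above. -->
                    <value>${coord:getAllDatasetHdfsServers()}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
{code}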