[
https://issues.apache.org/jira/browse/GOBBLIN-2022?focusedWorklogId=912899&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-912899
]
ASF GitHub Bot logged work on GOBBLIN-2022:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 03/Apr/24 19:52
Start Date: 03/Apr/24 19:52
Worklog Time Spent: 10m
Work Description: phet commented on code in PR #3896:
URL: https://github.com/apache/gobblin/pull/3896#discussion_r1550342146
##########
gobblin-service/src/main/java/org/apache/gobblin/service/monitoring/KafkaJobStatusMonitor.java:
##########
@@ -236,8 +238,9 @@ protected void processMessage(DecodeableKafkaRecord<byte[],byte[]> message) {
* It fills missing fields in job status and also merge the fields with the
* existing job status in the state store. Merging is required because we
* do not want to lose the information sent by other GobblinTrackingEvents.
- * Returns false if adding this state transitions the job status of the job to final, otherwise returns false.
- * It will also return false if the job status was already final before calling this method.
+ * Returns an absent Optional if adding this state transitions the job status of the job to final, otherwise returns
+ * the updated job status wrapped inside an Optional.
+ * It will also return an absent Optional if the job status was already final before calling this method.
Review Comment:
"exactly-once guarantee" might be clearest way to phrase.
with not choosing "at-least-once", we must consider what happens if this
method returns (and did successfully updated the JobStateStore), but we fail
before completing the update to the DagActionStore.
seems the reevaluate action would be lost. in fact, it would be lost even
if there is never another GTE that might cause the KJSM to work again on this
same job's state.
what can we do to prevent?
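One possible answer to that question is to reorder the two writes and lean on at-least-once redelivery instead: record the reevaluate action durably before finalizing the job status, and make both writes idempotent so replaying the same Kafka record after a crash simply redoes them. A minimal, self-contained sketch of that pattern, with in-memory stores standing in for the real JobStateStore/DagActionStore (all names here are hypothetical, not Gobblin's actual classes):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: write the follow-up action *before* finalizing the job
// status, so a crash between the two writes is repaired by reprocessing the same
// Kafka record (the offset was never committed). Both writes are idempotent, so
// replays are harmless: at-least-once delivery with exactly-once effects.
class ReevaluateDelivery {
  // dedup key -> presence; Set semantics make the action write a no-op on replay
  private final Set<String> dagActionStore = ConcurrentHashMap.newKeySet();
  private final Map<String, String> jobStateStore = new ConcurrentHashMap<>();

  /** Processes one terminal job-status record; safe to call again on replay. */
  boolean process(String jobId, String finalStatus) {
    // 1. record the reevaluate action first; a duplicate add is a no-op
    dagActionStore.add(jobId + ":REEVALUATE");
    // 2. only then persist the final job status; if we crash before this line,
    //    the uncommitted Kafka offset causes a replay that redoes both steps
    jobStateStore.put(jobId, finalStatus);
    return true; // caller may now commit the consumer offset
  }

  boolean hasReevaluateAction(String jobId) {
    return dagActionStore.contains(jobId + ":REEVALUATE");
  }

  String jobStatus(String jobId) {
    return jobStateStore.get(jobId);
  }
}
```

The key property is that every failure mode leaves the system retryable: either nothing was written (replay redoes everything), or only the action was written (replay rewrites it harmlessly and completes the status update).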
##########
gobblin-service/src/main/java/org/apache/gobblin/service/monitoring/KafkaJobStatusMonitor.java:
##########
@@ -275,6 +279,7 @@ static void addJobStatusToStateStore(org.apache.gobblin.configuration.State jobS
modifyStateIfRetryRequired(jobStatus);
Review Comment:
the KJSM would send a `ReevaluateDagAction` to trigger consideration. then
`ReevaluateDagProc` would make the determination in the same way as
`modifyStateIfRetryRequired` is doing here. basically, move that logic from
this class to `ReevaluateDagProc`, as a way to centralize all determination of
how to follow job "completion" (success, failure, or cancel).
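A rough sketch of what that centralized determination might look like once the retry logic moves out of the monitor: a single decision point keyed by the finish status and retry budget. Names, the enum, and the method signature are illustrative only, not the PR's actual code:

```java
// Hypothetical sketch of centralizing "what to do after a job completes" in one
// place (per the suggestion, inside ReevaluateDagProc), instead of splitting it
// between KafkaJobStatusMonitor.modifyStateIfRetryRequired and the DagProcs.
class CompletionPolicy {
  enum NextAction { RETRY_JOB, SUBMIT_NEXT_JOBS, FAIL_DAG, CANCEL_DAG }

  /** Decides the follow-up to a finished job from its status and retry budget. */
  static NextAction onJobFinish(String executionStatus, int attempts, int maxAttempts) {
    switch (executionStatus) {
      case "FAILED":
        // the retry determination previously made by modifyStateIfRetryRequired
        return attempts < maxAttempts ? NextAction.RETRY_JOB : NextAction.FAIL_DAG;
      case "CANCELLED":
        return NextAction.CANCEL_DAG;
      case "COMPLETE":
        return NextAction.SUBMIT_NEXT_JOBS;
      default:
        throw new IllegalArgumentException("not a finished status: " + executionStatus);
    }
  }
}
```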
##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/proc/ReevaluateDagProc.java:
##########
@@ -188,7 +187,7 @@ void submitNext(DagManagementStateStore dagManagementStateStore) throws IOExcept
  }
 }
-  private void handleMultipleJobs(Set<Dag.DagNode<JobExecutionPlan>> nextNodes) {
+  private static void handleMultipleJobs(Set<Dag.DagNode<JobExecutionPlan>> nextNodes) {
Review Comment:
I don't believe this should be `static`... it's not in `LaunchDagProc`. why
do you think so?
##########
gobblin-service/src/main/java/org/apache/gobblin/service/monitoring/KafkaJobStatusMonitor.java:
##########
@@ -193,7 +197,18 @@ protected void processMessage(DecodeableKafkaRecord<byte[],byte[]> message) {
     org.apache.gobblin.configuration.State jobStatus = parseJobStatus(gobblinTrackingEvent);
     if (jobStatus != null) {
       try (Timer.Context context = getMetricContext().timer(GET_AND_SET_JOB_STATUS).time()) {
-        addJobStatusToStateStore(jobStatus, this.stateStore, this.eventProducer);
+        Optional<org.apache.gobblin.configuration.State> updatedJobStatus = addJobStatusToStateStore(jobStatus, this.stateStore);
+        boolean isJobStatusUpdated = updatedJobStatus.isPresent();
+        // todo - retried/resumed jobs *may* not be handled here, we may want to create their dag action elsewhere
+        if (isJobStatusUpdated) {
+          jobStatus = updatedJobStatus.get();
+          this.eventProducer.emitObservabilityEvent(jobStatus);
+          String flowName = jobStatus.getProp(TimingEvent.FlowEventConstants.FLOW_NAME_FIELD);
+          String flowGroup = jobStatus.getProp(TimingEvent.FlowEventConstants.FLOW_GROUP_FIELD);
+          String flowExecutionId = jobStatus.getProp(TimingEvent.FlowEventConstants.FLOW_EXECUTION_ID_FIELD);
+          String jobName = jobStatus.getProp(TimingEvent.FlowEventConstants.JOB_NAME_FIELD);
+          this.dagActionStore.addJobDagAction(flowGroup, flowName, flowExecutionId, jobName, DagActionStore.DagActionType.REEVALUATE);
Review Comment:
sure, we can wait for now... but let's be on the lookout: if we read the same props in the same way elsewhere, please make a utility.
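If such a utility materializes, it might look like the minimal sketch below. The property keys mirror what the diff reads via `TimingEvent.FlowEventConstants`, but their literal values here are assumptions, and `java.util.Properties` stands in for Gobblin's `State`:

```java
import java.util.Properties;

// Sketch of the suggested utility: one place that pulls the flow/job identity
// out of a job-status record, so callers stop re-reading the same four props.
final class FlowIdentityUtil {
  // assumed literal values; the real code uses TimingEvent.FlowEventConstants
  static final String FLOW_GROUP = "flowGroup";
  static final String FLOW_NAME = "flowName";
  static final String FLOW_EXECUTION_ID = "flowExecutionId";
  static final String JOB_NAME = "jobName";

  private FlowIdentityUtil() {}

  /** Returns {flowGroup, flowName, flowExecutionId, jobName} read from the props. */
  static String[] extract(Properties jobStatus) {
    return new String[] {
        jobStatus.getProperty(FLOW_GROUP),
        jobStatus.getProperty(FLOW_NAME),
        jobStatus.getProperty(FLOW_EXECUTION_ID),
        jobStatus.getProperty(JOB_NAME)
    };
  }
}
```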
##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/proc/ReevaluateDagProc.java:
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.service.modules.orchestration.proc;
+
+import java.io.IOException;
+import java.util.Optional;
+import java.util.Set;
+
+import org.apache.commons.lang3.tuple.ImmutablePair;
+import org.apache.commons.lang3.tuple.Pair;
+
+import lombok.extern.slf4j.Slf4j;
+
+import org.apache.gobblin.metrics.event.TimingEvent;
+import org.apache.gobblin.service.ExecutionStatus;
+import org.apache.gobblin.service.modules.flowgraph.Dag;
+import org.apache.gobblin.service.modules.flowgraph.DagNodeId;
+import org.apache.gobblin.service.modules.orchestration.DagManagementStateStore;
+import org.apache.gobblin.service.modules.orchestration.DagManagerUtils;
+import org.apache.gobblin.service.modules.orchestration.task.ReevaluateDagTask;
+import org.apache.gobblin.service.modules.spec.JobExecutionPlan;
+import org.apache.gobblin.service.monitoring.FlowStatusGenerator;
+import org.apache.gobblin.service.monitoring.JobStatus;
+
+
+/**
+ * An implementation for {@link DagProc} that launches a new job if there exists a job whose pre-requisite jobs are
+ * completed successfully. If there are no more jobs to run and no job is running for the Dag, it cleans up the Dag.
+ * (In future), if there are multiple new jobs to be launched, separate launch dag actions are created for each of them.
+ */
+@Slf4j
+public class ReevaluateDagProc extends DagProc<Pair<Optional<Dag.DagNode<JobExecutionPlan>>, Optional<JobStatus>>> {
+
+ public ReevaluateDagProc(ReevaluateDagTask reEvaluateDagTask) {
+ super(reEvaluateDagTask);
+ }
+
+  @Override
+  protected Pair<Optional<Dag.DagNode<JobExecutionPlan>>, Optional<JobStatus>> initialize(DagManagementStateStore dagManagementStateStore)
+      throws IOException {
+    Pair<Optional<Dag.DagNode<JobExecutionPlan>>, Optional<JobStatus>> dagNodeWithJobStatus =
+        dagManagementStateStore.getDagNodeWithJobStatus(this.dagNodeId);
+
+    if (!dagNodeWithJobStatus.getLeft().isPresent() || !dagNodeWithJobStatus.getRight().isPresent()) {
+      // this is possible when MALA malfunctions and a duplicated reevaluate dag proc is launched for a dag node that is
+      // already "reevaluated" and cleaned up.
+      return ImmutablePair.of(Optional.empty(), Optional.empty());
+    }
+
+    ExecutionStatus executionStatus = ExecutionStatus.valueOf(dagNodeWithJobStatus.getRight().get().getEventName());
+    if (!FlowStatusGenerator.FINISHED_STATUSES.contains(executionStatus.name())) {
+      log.warn("Job status for dagNode {} is {}. Re-evaluate dag action should have been created only for finished status - {}",
+          dagNodeId, executionStatus, FlowStatusGenerator.FINISHED_STATUSES);
+      return ImmutablePair.of(Optional.empty(), Optional.empty());
+    }
+
+    setStatus(dagManagementStateStore, dagNodeWithJobStatus.getLeft().get(), executionStatus);
+    return dagNodeWithJobStatus;
+  }
+
+  @Override
+  protected void act(DagManagementStateStore dagManagementStateStore, Pair<Optional<Dag.DagNode<JobExecutionPlan>>, Optional<JobStatus>> dagNodeWithJobStatus)
+      throws IOException {
+    if (!dagNodeWithJobStatus.getLeft().isPresent()) {
+      log.error("DagNode or its job status not found for a Reevaluate DagAction with dag node id {}", this.dagNodeId);
+      return;
+    }
+
+    Dag.DagNode<JobExecutionPlan> dagNode = dagNodeWithJobStatus.getLeft().get();
+    JobStatus jobStatus = dagNodeWithJobStatus.getRight().get();
+    ExecutionStatus executionStatus = dagNode.getValue().getExecutionStatus();
+    onJobFinish(dagManagementStateStore, dagNode, executionStatus);
+    Dag<JobExecutionPlan> dag = dagManagementStateStore.getDag(getDagId()).get();
+
+    if (jobStatus.isShouldRetry()) {
+      log.info("Retrying job: {}, current attempts: {}, max attempts: {}",
+          DagManagerUtils.getFullyQualifiedJobName(dagNode), jobStatus.getCurrentAttempts(), jobStatus.getMaxAttempts());
+      // todo - be careful when unsetting this, it is possible that this is set to FAILED because some other job in the
+      // dag failed and is also not retryable. in that case if this job's retry passes, overall status of the dag can be
+      // set to PASS, which would be incorrect.
+      dag.setFlowEvent(null);
+      DagProcUtils.submitJobToExecutor(dagManagementStateStore, dagNode, getDagId());
+    } else if (!dagManagementStateStore.hasRunningJobs(getDagId())) {
+      if (dag.getFlowEvent() == null) {
+        // If the dag flow event is not set and there are no more jobs running, then it is successful
+        // also note that `onJobFinish` method does whatever is required to do after job finish, determining a Dag's
+        // status is not possible on individual job's finish status
+        dag.setFlowEvent(TimingEvent.FlowTimings.FLOW_SUCCEEDED);
+      }
+      String flowEvent = dag.getFlowEvent();
+      DagManagerUtils.emitFlowEvent(eventSubmitter, dag, flowEvent);
+      if (flowEvent.equals(TimingEvent.FlowTimings.FLOW_SUCCEEDED)) {
+        // todo - verify if work from PR#3641 is required
+        dagManagementStateStore.deleteDag(getDagId());
+      } else {
+        dagManagementStateStore.markDagFailed(dag);
+      }
+    }
+  }
+
+  /**
+   * Sets status of a dag node inside the given Dag.
+   * todo - DMSS should support this functionality like an atomic get-and-set operation.
+   */
+  private void setStatus(DagManagementStateStore dagManagementStateStore,
+      Dag.DagNode<JobExecutionPlan> dagNode, ExecutionStatus executionStatus) throws IOException {
+    Dag<JobExecutionPlan> dag = dagManagementStateStore.getDag(getDagId()).get();
+    DagNodeId dagNodeId = dagNode.getValue().getId();
+    for (Dag.DagNode<JobExecutionPlan> node : dag.getNodes()) {
+      if (node.getValue().getId().equals(dagNodeId)) {
+        node.getValue().setExecutionStatus(executionStatus);
+        dagManagementStateStore.checkpointDag(dag);
+        return;
+      }
+    }
+    log.error("DagNode with id {} not found in Dag {}", dagNodeId, getDagId());
+  }
+
+  /**
+   * Method that defines the actions to be performed when a job finishes either successfully or with failure.
+   * This method updates the state of the dag and performs clean up actions as necessary.
+   */
+  private void onJobFinish(DagManagementStateStore dagManagementStateStore,
+      Dag.DagNode<JobExecutionPlan> dagNode, ExecutionStatus executionStatus)
+      throws IOException {
+    String jobName = DagManagerUtils.getFullyQualifiedJobName(dagNode);
+    log.info("Job {} of Dag {} has finished with status {}", jobName, getDagId(), executionStatus.name());
+    // Only decrement counters and quota for jobs that actually ran on the executor, not from a GaaS side failure/skip event
+    if (dagManagementStateStore.releaseQuota(dagNode)) {
+      dagManagementStateStore.getDagManagerMetrics().decrementRunningJobMetrics(dagNode);
+    }
+
+    Dag<JobExecutionPlan> dag = dagManagementStateStore.getDag(getDagId()).get();
+
+    switch (executionStatus) {
+      case FAILED:
+        dag.setMessage("Flow failed because job " + jobName + " failed");
+        dag.setFlowEvent(TimingEvent.FlowTimings.FLOW_FAILED);
+        dagManagementStateStore.getDagManagerMetrics().incrementExecutorFailed(dagNode);
+        break;
+      case CANCELLED:
+        dag.setFlowEvent(TimingEvent.FlowTimings.FLOW_CANCELLED);
+        break;
+      case COMPLETE:
+        dagManagementStateStore.getDagManagerMetrics().incrementExecutorSuccess(dagNode);
+        submitNext(dagManagementStateStore);
+        break;
+      default:
+        log.warn("It should not reach here. Job status {} is unexpected.", executionStatus);
+    }
+
+    // Checkpoint the dag state, it should have an updated value of dag fields
+    dagManagementStateStore.checkpointDag(dag);
+    dagManagementStateStore.deleteDagNodeState(getDagId(), dagNode);
+  }
+
+  /**
+   * Submit next set of Dag nodes in the Dag identified by the provided dagId
+   */
+  void submitNext(DagManagementStateStore dagManagementStateStore) throws IOException {
Review Comment:
I really don't think those should be `static` (in `DagProcUtils`); they belong in a common base class for both procs to derive from.
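A toy illustration of the base-class alternative this comment proposes, shared behavior inherited rather than reached through static helpers (class names echo the PR, but the bodies are placeholders, not the real DagProc logic):

```java
// Hypothetical shape: common per-dag behavior lives in an abstract base class,
// keeping instance state (dag id, metrics, store handles) available to shared
// methods instead of passing everything into static DagProcUtils helpers.
abstract class BaseDagProc {
  private final String dagId;

  BaseDagProc(String dagId) {
    this.dagId = dagId;
  }

  /** Shared logic both procs need; overridable where a subclass must differ. */
  String submitNext() {
    return name() + " proc submitted next nodes of " + dagId;
  }

  abstract String name();
}

class LaunchDagProc extends BaseDagProc {
  LaunchDagProc(String dagId) { super(dagId); }
  @Override String name() { return "launch"; }
}

class ReevaluateDagProc extends BaseDagProc {
  ReevaluateDagProc(String dagId) { super(dagId); }
  @Override String name() { return "reevaluate"; }
}
```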
Issue Time Tracking
-------------------
Worklog Id: (was: 912899)
Time Spent: 13h 10m (was: 13h)
> create dag proc for taking actions on job completion
> ----------------------------------------------------
>
> Key: GOBBLIN-2022
> URL: https://issues.apache.org/jira/browse/GOBBLIN-2022
> Project: Apache Gobblin
> Issue Type: Task
> Reporter: Arjun Singh Bora
> Priority: Major
> Time Spent: 13h 10m
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)