homatthew opened a new pull request, #3603:
URL: https://github.com/apache/gobblin/pull/3603

   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I 
have checked off all the steps below!
   
   
   ### JIRA
   - [X] My PR addresses the following [Gobblin 
JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references 
them in the PR title. For example, "[GOBBLIN-1744] My Gobblin PR"
       - https://issues.apache.org/jira/browse/GOBBLIN-1744
   
   ```
   2022-11-17 18:23:02 PST ERROR [pool-35-thread-1] 
org.apache.gobblin.cluster.HelixAssignedParticipantCheck 143 - The current 
assigned participant is null. This implies that 
                (a)Helix failed to write to zookeeper, which is often caused by 
lack of compression leading / exceeding zookeeper jute max buffer size (Default 
1MB)
                (b)Helix reassigned the task (unlikely if this current task has 
been running without issue. Helix does not have code for reassigning "running" 
tasks)
   Note: This logic is true as of Helix version 1.0.2 and ZK version 3.6
   ```
   
   
   
   ### Description
   - [X] Here are some details about my PR, including screenshots (if 
applicable):
   
   #### HelixAssignedParticipantCheck:
   In production, we've seen the Helix assigned participant check fail because 
of Helix issues rather than a split brain. When Helix returns null, it means 
that the data does not exist. This is an unexpected case, and we can assume 
that Helix itself is having issues (i.e. it is not a Gobblin-side issue).
   
   I am adding this log because a failed Helix assigned participant check is 
most likely a Helix issue, but it is not immediately obvious what the exact 
issue is. The log lists the 2 scenarios that oncall has most often identified 
internally as the root cause.
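   
   As a rough illustration of the intent (not the actual Gobblin code), the 
   null case can be separated from a genuine mismatch so that each gets its own 
   message. The class name `AssignedParticipantSketch`, the method `diagnose`, 
   and the abbreviated hint text are all hypothetical; the real check reads the 
   assigned participant from Helix, which this sketch replaces with a plain 
   `String` argument.
   
   ```java
   import java.util.Optional;

   // Hypothetical, self-contained sketch; it does not call any Helix API.
   public class AssignedParticipantSketch {

     // Abbreviated version of the hint in the log above.
     static final String NULL_PARTICIPANT_HINT =
         "The current assigned participant is null. This implies that "
             + "(a) Helix failed to write to ZooKeeper, or "
             + "(b) Helix reassigned the task (unlikely for a running task).";

     /**
      * Returns an empty Optional when the local instance is the assigned
      * participant, otherwise a message describing why the check failed.
      */
     public static Optional<String> diagnose(String assignedParticipant, String localInstance) {
       if (assignedParticipant == null) {
         // Helix has no data for this task: point at Helix, not at a split brain.
         return Optional.of(NULL_PARTICIPANT_HINT);
       }
       if (!assignedParticipant.equals(localInstance)) {
         // A different participant owns the task: the split-brain case.
         return Optional.of("Task is assigned to " + assignedParticipant
             + " but is running on " + localInstance);
       }
       return Optional.empty();
     }
   }
   ```
   
   Keeping the two failure messages distinct is the point of the sketch: a 
   null participant and a mismatched participant call for different 
   follow-up by oncall.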
   
   #### HelixUtils#getWorkflowIdsFromJobNames(HelixManager helixManager, Collection<String> jobNames)
   This is a similar case where Helix returns a null value. It can happen 
when this util is called during a replan / restart of the Helix workflow, or 
because of a Helix data consistency issue. The code doesn't expect a null and 
fails with an NPE. It is much better to fail gracefully and leave a 
descriptive log. We do not want to fail loudly because the job can exist in 
another workflow, in which case we want to continue checking the remaining 
workflows.
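   
   A minimal sketch of the "log and keep going" behaviour described above, 
   under simplifying assumptions: the Helix lookup is replaced by a plain map 
   from workflow ID to job names, where a null value stands in for Helix 
   returning null. `WorkflowLookupSketch` and `findWorkflowIds` are 
   hypothetical names, not the real `HelixUtils` signature.
   
   ```java
   import java.util.Collection;
   import java.util.HashMap;
   import java.util.Map;
   import java.util.Set;
   import java.util.logging.Logger;

   // Hypothetical, self-contained sketch; it does not call any Helix API.
   public class WorkflowLookupSketch {
     private static final Logger LOG = Logger.getLogger(WorkflowLookupSketch.class.getName());

     /**
      * Maps each requested job name to the ID of a workflow that contains it.
      * A workflow whose job set is null is logged and skipped instead of
      * causing a NullPointerException.
      */
     public static Map<String, String> findWorkflowIds(
         Map<String, Set<String>> jobsByWorkflow, Collection<String> jobNames) {
       Map<String, String> workflowIdByJob = new HashMap<>();
       for (Map.Entry<String, Set<String>> entry : jobsByWorkflow.entrySet()) {
         Set<String> jobs = entry.getValue();
         if (jobs == null) {
           // Stand-in for Helix returning null, e.g. mid-restart or inconsistent data.
           LOG.warning("No job data for workflow " + entry.getKey() + "; skipping it");
           continue;
         }
         for (String jobName : jobNames) {
           if (jobs.contains(jobName)) {
             workflowIdByJob.put(jobName, entry.getKey());
           }
         }
       }
       return workflowIdByJob;
     }
   }
   ```
   
   With one workflow missing its data and another containing the job, the 
   lookup still resolves the job through the healthy workflow rather than 
   aborting on the first null.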
   
   
   ### Tests
   - [X] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   The existing Helix assigned participant check test exercises this path, 
because the mock Helix returns a null participant.
   
   
   
   ### Commits
   - [ ] My commits all reference JIRA issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       2. Subject is limited to 50 characters
       3. Subject does not end with a period
       4. Subject uses the imperative mood ("add", not "adding")
       5. Body wraps at 72 characters
       6. Body explains "what" and "why", not "how"
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
