homatthew opened a new pull request, #3603: URL: https://github.com/apache/gobblin/pull/3603
Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

### JIRA
- [X] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-1744] My Gobblin PR"
    - https://issues.apache.org/jira/browse/GOBBLIN-1744

```
2022-11-17 18:23:02 PST ERROR [pool-35-thread-1] org.apache.gobblin.cluster.HelixAssignedParticipantCheck 143 - The current assigned participant is null. This implies that
(a)Helix failed to write to zookeeper, which is often caused by lack of compression leading / exceeding zookeeper jute max buffer size (Default 1MB)
(b)Helix reassigned the task (unlikely if this current task has been running without issue. Helix does not have code for reassigning "running" tasks)
Note: This logic is true as of Helix version 1.0.2 and ZK version 3.6
```

### Description
- [X] Here are some details about my PR, including screenshots (if applicable):

#### HelixAssignedParticipantCheck
In production, we've seen the Helix assigned participant check fail because of Helix issues rather than a split brain. When Helix returns null, the data does not exist. This is an unexpected case, and we can assume that Helix itself is having issues (i.e. not a Gobblin-side issue). I am adding this log because when the Helix assigned participant check fails this way, it is most likely a Helix issue, but it is not immediately obvious what the exact issue is. The log lists the 2 scenarios that on-call has most commonly seen internally as the root cause.

#### HelixUtils#getWorkflowIdsFromJobNames(HelixManager helixManager, Collection<String> jobNames)
This is a similar case where Helix returns a null value. This can happen when this util is called during a replanner / restart of the Helix workflow. It can also be caused by a Helix data consistency issue.
The code doesn't expect a null and will fail with an NPE. It is much better to fail gracefully and leave a descriptive log. We do not want to fail loudly because the job can exist in other workflows, in which case we want to proceed with checking the other workflows gracefully.

### Tests
- [X] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: The existing Helix assigned participant check triggers this because it gets a null participant from the mock Helix.

### Commits
- [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"
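
As a rough illustration of the `getWorkflowIdsFromJobNames` behavior described above (skip a workflow whose data comes back null, log it, and keep checking the others), here is a minimal sketch. It is not the code in this PR and does not call Helix: the class `NullSafeWorkflowLookup`, the method `workflowIdsForJobs`, and the plain `Map` standing in for Helix workflow data are all hypothetical, with a null map value modelling Helix returning null for that workflow.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.logging.Logger;

/** Hypothetical stand-in for a null-tolerant lookup; not the actual HelixUtils code. */
final class NullSafeWorkflowLookup {
  private static final Logger LOG = Logger.getLogger(NullSafeWorkflowLookup.class.getName());

  private NullSafeWorkflowLookup() {
  }

  /**
   * Maps each requested job name to the ID of a workflow that contains it.
   *
   * @param jobsByWorkflow workflow ID to job names; a null value models Helix
   *                       returning null for that workflow (assumption of this sketch)
   * @param jobNames       job names to resolve
   * @return job name to workflow ID; jobs found in no workflow are absent
   */
  static Map<String, String> workflowIdsForJobs(Map<String, Set<String>> jobsByWorkflow,
      Collection<String> jobNames) {
    Map<String, String> result = new HashMap<>();
    for (Map.Entry<String, Set<String>> entry : jobsByWorkflow.entrySet()) {
      Set<String> jobs = entry.getValue();
      if (jobs == null) {
        // Fail gracefully: leave a descriptive log and move on, because the
        // job may still exist in one of the remaining workflows.
        LOG.warning("Workflow " + entry.getKey() + " returned null data; skipping it. "
            + "Possible causes: workflow restart in progress or a data consistency issue.");
        continue;
      }
      for (String jobName : jobNames) {
        if (jobs.contains(jobName)) {
          // If a job appears in several workflows, the last match wins in this sketch.
          result.put(jobName, entry.getKey());
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Set<String>> workflows = new LinkedHashMap<>();
    workflows.put("workflow-restarting", null); // simulated null from Helix
    workflows.put("workflow-healthy", Collections.singleton("job-a"));

    // Prints {job-a=workflow-healthy}; the null workflow is logged and skipped.
    System.out.println(workflowIdsForJobs(workflows, Arrays.asList("job-a", "job-b")));
  }
}
```

The sketch logs at warning level and continues rather than throwing, which mirrors the intent stated above: a null for one workflow should not prevent the remaining workflows from being checked.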