Hi Lee,

Thanks for the trick. I didn't know that we could poke the controller like that :) However, tasks are now moving smoothly in our staging setup. This behavior shows up from time to time and resolves itself within a few hours. I can't find a particular pattern, but my best guess is that it happens when the load is high. I will put some load on the testing setup, see if I can reproduce the issue, try your instructions, and then get back to you.
Thanks
Dimuthu

On Thu, Mar 21, 2019 at 5:27 PM Hunter Lee <[email protected]> wrote:

> Hi Dimuthu,
>
> What Junkai meant by touching the IdealState is this:
>
> 1) Use ZooInspector to log into ZK.
> 2) Locate the IDEALSTATES/ path.
> 3) Grab any ZNode under that path, modify it (just add a whitespace), and save.
> 4) This will trigger a ZK callback, which should tell the Helix Controller to
>    rebalance/schedule things.
>
> On Thu, Mar 21, 2019 at 11:30 AM DImuthu Upeksha <[email protected]> wrote:
>
>> Hi Junkai,
>>
>> What do you mean by touching the ideal state to trigger an event? I didn't
>> quite get what you said. Is that like creating some path in ZooKeeper?
>> Workflows are eventually scheduled, but the problem is that it is very slow
>> due to that 30s freeze.
>>
>> Thanks
>> Dimuthu
>>
>> On Thu, Mar 21, 2019 at 2:26 PM Xue Junkai <[email protected]> wrote:
>>
>> > Can you try one thing? Touch the ideal state to trigger an event. If
>> > workflows are still not scheduled, then the scheduling itself has a problem.
>> >
>> > Best,
>> >
>> > Junkai
>> >
>> > On Wed, Mar 20, 2019 at 10:31 PM DImuthu Upeksha <[email protected]> wrote:
>> >
>> >> Hi Junkai,
>> >>
>> >> We are using 0.8.1
>> >>
>> >> Dimuthu
>> >>
>> >> On Thu, Mar 21, 2019 at 12:14 AM Xue Junkai <[email protected]> wrote:
>> >>
>> >> > Hi Dimuthu,
>> >> >
>> >> > What's the version of Helix you are using?
>> >> >
>> >> > Best,
>> >> >
>> >> > Junkai
>> >> >
>> >> > On Wed, Mar 20, 2019 at 8:54 PM DImuthu Upeksha <[email protected]> wrote:
>> >> >
>> >> > > Hi Helix Dev,
>> >> > >
>> >> > > We are again seeing this delay in task execution. Please have a look at
>> >> > > the screencast [1] of logs printed in the participant (top shell) and
>> >> > > the controller (bottom shell). When I recorded this, there were about
>> >> > > 90 - 100 workflows pending to be executed.
>> >> > > As you can see, some tasks were suddenly executed and then the
>> >> > > participant froze for about 30 seconds before executing the next set
>> >> > > of tasks. I can see some WARN logs in the controller log. I feel like
>> >> > > this 30 second delay is some sort of a pattern. What do you think is
>> >> > > the reason for this? I can provide more information by turning on
>> >> > > verbose logs on the controller if you want.
>> >> > >
>> >> > > [1] https://youtu.be/3EUdSxnIxVw
>> >> > >
>> >> > > Thanks
>> >> > > Dimuthu
>> >> > >
>> >> > > On Thu, Oct 4, 2018 at 4:46 PM DImuthu Upeksha <[email protected]> wrote:
>> >> > >
>> >> > > > Hi Junkai,
>> >> > > >
>> >> > > > I'm CCing the Airavata dev list as this is directly related to the
>> >> > > > project.
>> >> > > >
>> >> > > > I just went through ZooKeeper paths like /<Cluster Name>/EXTERNALVIEW
>> >> > > > and /<Cluster Name>/CONFIGS/RESOURCE, as I have noticed that the Helix
>> >> > > > controller periodically monitors the children of those paths even
>> >> > > > though all the workflows have moved into a saturated state like
>> >> > > > COMPLETED or STOPPED. In our case, we have a lot of completed
>> >> > > > workflows piled up in those paths. I believe that Helix clears up
>> >> > > > those resources after some TTL. What I did was write an external
>> >> > > > spectator [1] that continuously monitors for saturated workflows and
>> >> > > > clears up their resources before the controller does that after a
>> >> > > > TTL. After that, we didn't see such delays in workflow execution and
>> >> > > > everything seems to be running smoothly. However, we are continuously
>> >> > > > monitoring our deployments for any form of adverse effect introduced
>> >> > > > by that improvement.
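The cleanup idea described in the paragraph above can be sketched roughly as follows. This is only an illustration of the selection logic, not the code of the linked WorkflowCleanupAgent and not the Helix API: `select_finished`, `cleanup_once`, and the `delete_workflow` callback are hypothetical names, and the workflow states are assumed to arrive as a plain name-to-state mapping. Only COMPLETED and STOPPED are treated as finished, because those are the two states named in the thread.

```python
from typing import Callable, Dict, Iterable, List

# States treated as finished ("saturated" in the message above). COMPLETED and
# STOPPED are the two named in the thread; adding others would be an assumption.
TERMINAL_STATES = frozenset({"COMPLETED", "STOPPED"})


def select_finished(workflow_states: Dict[str, str],
                    terminal_states: Iterable[str] = TERMINAL_STATES) -> List[str]:
    """Return the names of workflows in a terminal state, sorted by name."""
    terminal = set(terminal_states)
    return sorted(name for name, state in workflow_states.items()
                  if state in terminal)


def cleanup_once(workflow_states: Dict[str, str],
                 delete_workflow: Callable[[str], None]) -> List[str]:
    """Run one cleanup pass and return the workflows that were deleted.

    delete_workflow is a hypothetical callback standing in for whatever
    actually removes the workflow's resources from the cluster.
    """
    deleted = []
    for name in select_finished(workflow_states):
        try:
            delete_workflow(name)
        except Exception as exc:
            # One failed delete should not stop the rest of the pass.
            print(f"could not delete {name}: {exc}")
            continue
        deleted.append(name)
    return deleted
```

A real agent would call something like `cleanup_once` on a timer, with the state mapping and the delete callback backed by the task framework. Keeping the selection separate from the deletion makes it easy to log or dry-run the list before anything is removed.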
>> >> > > >
>> >> > > > Please let us know if we are doing something wrong in this
>> >> > > > improvement, or whether there is a better way to achieve this
>> >> > > > directly through the Helix task framework.
>> >> > > >
>> >> > > > [1]
>> >> > > > https://github.com/apache/airavata/blob/staging/modules/airavata-helix/helix-spectator/src/main/java/org/apache/airavata/helix/impl/controller/WorkflowCleanupAgent.java
>> >> > > >
>> >> > > > Thanks
>> >> > > > Dimuthu
>> >> > > >
>> >> > > > On Tue, Oct 2, 2018 at 1:12 PM Xue Junkai <[email protected]> wrote:
>> >> > > >
>> >> > > >> Could you please check the log to see how long each pipeline stage
>> >> > > >> takes?
>> >> > > >>
>> >> > > >> Also, did you set an expiry for workflows? Are they piled up for a
>> >> > > >> long time? How long does each workflow take to complete?
>> >> > > >>
>> >> > > >> Best,
>> >> > > >>
>> >> > > >> Junkai
>> >> > > >>
>> >> > > >> On Wed, Sep 26, 2018 at 8:52 AM DImuthu Upeksha <[email protected]> wrote:
>> >> > > >>
>> >> > > >> > Hi Junkai,
>> >> > > >> >
>> >> > > >> > The average load is about 10 - 20 workflows per minute. In some
>> >> > > >> > cases it's less than that. However, based on the observations, I
>> >> > > >> > feel like it does not depend on the load and is sporadic. Are
>> >> > > >> > there particular log lines that I can filter in the controller
>> >> > > >> > and participant to capture the timeline of a workflow, so that I
>> >> > > >> > can figure out which component is malfunctioning? We use Helix
>> >> > > >> > v 0.8.1.
>> >> > > >> >
>> >> > > >> > Thanks
>> >> > > >> > Dimuthu
>> >> > > >> >
>> >> > > >> > On Tue, Sep 25, 2018 at 5:19 PM Xue Junkai <[email protected]> wrote:
>> >> > > >> >
>> >> > > >> > > Hi Dimuthu,
>> >> > > >> > >
>> >> > > >> > > At what rate do you keep submitting workflows? Usually,
>> >> > > >> > > workflow scheduling is very fast. And which version of Helix
>> >> > > >> > > are you using?
>> >> > > >> > >
>> >> > > >> > > Best,
>> >> > > >> > >
>> >> > > >> > > Junkai
>> >> > > >> > >
>> >> > > >> > > On Tue, Sep 25, 2018 at 8:58 AM DImuthu Upeksha <[email protected]> wrote:
>> >> > > >> > >
>> >> > > >> > > > Hi Folks,
>> >> > > >> > > >
>> >> > > >> > > > We have noticed some delays between workflow submission and
>> >> > > >> > > > the actual pickup by participants, and that delay seems to
>> >> > > >> > > > be somewhat constant, around 2 - 3 minutes. We continuously
>> >> > > >> > > > submit workflows, and after 2 - 3 minutes a bulk of
>> >> > > >> > > > workflows is picked up by the participant and executed.
>> >> > > >> > > > Then it remains silent for the next 2 - 3 minutes even if we
>> >> > > >> > > > submit more workflows. It's like the participant picks up
>> >> > > >> > > > workflows at discrete time intervals. I'm not sure whether
>> >> > > >> > > > this is an issue with the controller or the participant. Do
>> >> > > >> > > > you have any experience with this sort of behavior?
>> >> > > >> > > >
>> >> > > >> > > > Thanks
>> >> > > >> > > > Dimuthu
>> >> > > >> > > >
>> >> > > >> > >
>> >> > > >> > > --
>> >> > > >> > > Junkai Xue
>> >> > > >> > >
>> >> > > >>
>> >> > > >> --
>> >> > > >> Junkai Xue
>> >> > > >>
>> >> >
>> >> > --
>> >> > Junkai Xue
>> >> >
>> >
>> > --
>> > Junkai Xue
>> >
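For reference, the ZooInspector steps quoted near the top of the thread could also be scripted. The following is a minimal sketch under several assumptions: the IdealState znodes live under `/<Cluster Name>/IDEALSTATES`, the `client` object exposes `get_children`, `get`, and `set` in the style of the kazoo `KazooClient`, and the function name `touch_ideal_state` is made up for illustration. It also swaps one detail: instead of adding a whitespace, it writes the znode's current bytes back unchanged, relying on ZooKeeper delivering a data-change watch event for any successful set. Whether the controller then reschedules is exactly what the thread suggests testing.

```python
def touch_ideal_state(client, cluster, resource=None):
    """Rewrite one IdealState znode with its current bytes; return its path.

    client   -- object with kazoo-style get_children(path), get(path) and
                set(path, value) methods (an assumption, see the note above)
    cluster  -- Helix cluster name, used as the root znode
    resource -- znode name under IDEALSTATES; the first child (sorted by
                name) is used when omitted
    """
    base = f"/{cluster}/IDEALSTATES"
    if resource is None:
        children = sorted(client.get_children(base))
        if not children:
            raise LookupError(f"no znodes under {base}")
        resource = children[0]
    path = f"{base}/{resource}"
    data, _stat = client.get(path)  # kazoo-style: returns (bytes, stat)
    client.set(path, data)          # same content, new version, watch fires
    return path
```

With kazoo this might be used as `zk = KazooClient(hosts="localhost:2181"); zk.start(); touch_ideal_state(zk, "MyCluster"); zk.stop()`, where the host and cluster name are placeholders. Writing the same bytes back avoids accumulating whitespace in the znode if the touch is repeated.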
