Re: Sporadic delays in task execution

DImuthu Upeksha Fri, 22 Mar 2019 08:59:46 -0700

Hi Lee,

Thanks for the trick. I didn't know that we could poke the controller like
that :) However, tasks are now moving smoothly in our staging setup. This
behavior appears from time to time and resolves itself within a few hours.
I can't find a particular pattern, but my best guess is that it happens
when the load is high. I will put some load on the testing setup, see if I
can reproduce the issue, try your instructions, and then get back to you.


Thanks
Dimuthu
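
The IdealState touch described in the quoted steps below could also be
scripted rather than done by hand. The following is only a minimal sketch:
the cluster and resource names are hypothetical, the FakeZk class is an
in-memory stand-in so the example runs without a ZooKeeper server, and it
writes the znode's bytes back unchanged instead of adding whitespace, on the
assumption that any write bumps the znode version and fires data watches.

```python
from typing import Any, Protocol, Tuple


class ZkClient(Protocol):
    """Minimal client surface; kazoo's KazooClient offers get/set of this shape."""

    def get(self, path: str) -> Tuple[bytes, Any]: ...

    def set(self, path: str, value: bytes) -> Any: ...


def touch_znode(zk: ZkClient, path: str) -> bytes:
    """Rewrite a znode with its current bytes so that data watches fire."""
    data, _stat = zk.get(path)
    zk.set(path, data)  # content is unchanged; only the version moves
    return data


class FakeZk:
    """In-memory stand-in (an assumption for illustration, not a real client)."""

    def __init__(self, nodes):
        self.nodes = dict(nodes)
        self.versions = {path: 0 for path in nodes}

    def get(self, path):
        return self.nodes[path], self.versions[path]

    def set(self, path, value):
        self.nodes[path] = value
        self.versions[path] += 1
        return self.versions[path]


# Hypothetical cluster and resource names, for illustration only.
IDEAL_STATE = "/MyCluster/IDEALSTATES/MyWorkflow"
zk = FakeZk({IDEAL_STATE: b'{"id": "MyWorkflow"}'})
touch_znode(zk, IDEAL_STATE)
print(zk.versions[IDEAL_STATE])
```

Against a real cluster, a started kazoo KazooClient could be passed in place
of FakeZk. Whether the controller then reschedules is exactly what the touch
is meant to reveal.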

On Thu, Mar 21, 2019 at 5:27 PM Hunter Lee <[email protected]> wrote:

> Hi Dimuthu,
>
> What Junkai meant by touching the IdealState is this:
>
> 1) Use ZooInspector to log into ZK
> 2) Locate the IDEALSTATES/ path
> 3) Grab any ZNode under that path, modify it (just add a whitespace
> character), and save
> 4) This will trigger a ZK callback, which should tell the Helix Controller
> to rebalance/schedule things
>
> On Thu, Mar 21, 2019 at 11:30 AM DImuthu Upeksha <
> [email protected]> wrote:
>
>> Hi Junkai,
>>
>> What do you mean by touching the ideal state to trigger an event? I didn't
>> quite get what you said. Is that like creating some path in ZooKeeper?
>> Workflows are eventually scheduled, but the problem is that scheduling is
>> very slow due to that 30s freeze.
>>
>> Thanks
>> Dimuthu
>>
>> On Thu, Mar 21, 2019 at 2:26 PM Xue Junkai <[email protected]> wrote:
>>
>> > Can you try one thing? Touch the ideal state to trigger an event. If
>> > workflows are still not scheduled, the scheduling itself has a problem.
>> >
>> > Best,
>> >
>> > Junkai
>> >
>> > On Wed, Mar 20, 2019 at 10:31 PM DImuthu Upeksha <
>> > [email protected]> wrote:
>> >
>> >> Hi Junkai,
>> >>
>> >> We are using 0.8.1
>> >>
>> >> Dimuthu
>> >>
>> >> On Thu, Mar 21, 2019 at 12:14 AM Xue Junkai <[email protected]> wrote:
>> >>
>> >> > Hi Dimuthu,
>> >> >
>> >> > What's the version of Helix you are using?
>> >> >
>> >> > Best,
>> >> >
>> >> > Junkai
>> >> >
>> >> > On Wed, Mar 20, 2019 at 8:54 PM DImuthu Upeksha <
>> >> > [email protected]>
>> >> > wrote:
>> >> >
>> >> > > Hi Helix Dev,
>> >> > >
>> >> > > We are again seeing this delay in task execution. Please have a look
>> >> > > at the screencast [1] of logs printed in the participant (top shell)
>> >> > > and the controller (bottom shell). When I recorded this, there were
>> >> > > about 90 - 100 workflows pending execution. As you can see, some
>> >> > > tasks were suddenly executed and then the participant froze for about
>> >> > > 30 seconds before executing the next set of tasks. I can see some
>> >> > > WARN logs in the controller log. I feel like this 30-second delay is
>> >> > > some sort of a pattern. What do you think is the reason for this? I
>> >> > > can provide more information by turning on verbose logs on the
>> >> > > controller if you want.
>> >> > >
>> >> > > [1] https://youtu.be/3EUdSxnIxVw
>> >> > >
>> >> > > Thanks
>> >> > > Dimuthu
>> >> > >
>> >> > > On Thu, Oct 4, 2018 at 4:46 PM DImuthu Upeksha <
>> >> > > [email protected]> wrote:
>> >> > >
>> >> > > > Hi Junkai,
>> >> > > >
>> >> > > > I'm CCing the Airavata dev list as this is directly related to the
>> >> > > > project.
>> >> > > >
>> >> > > > I just went through ZooKeeper paths such as
>> >> > > > /<Cluster Name>/EXTERNALVIEW and /<Cluster Name>/CONFIGS/RESOURCE,
>> >> > > > as I have noticed that the Helix controller periodically monitors
>> >> > > > the children of those paths even though all the workflows have
>> >> > > > moved into a terminal state such as COMPLETED or STOPPED. In our
>> >> > > > case, a lot of completed workflows have piled up in those paths. I
>> >> > > > believe that Helix clears up those resources after some TTL. What I
>> >> > > > did was write an external spectator [1] that continuously monitors
>> >> > > > for terminal workflows and clears up their resources before the
>> >> > > > controller does so after the TTL. After that, we didn't see such
>> >> > > > delays in workflow execution and everything seems to be running
>> >> > > > smoothly. However, we are continuously monitoring our deployments
>> >> > > > for any adverse effect introduced by that change.
>> >> > > >
>> >> > > > Please let us know if we are doing something wrong in this
>> >> > > > improvement, or whether there is a better way to achieve this
>> >> > > > directly through the Helix task framework.
>> >> > > >
>> >> > > > [1]
>> >> > > > https://github.com/apache/airavata/blob/staging/modules/airavata-helix/helix-spectator/src/main/java/org/apache/airavata/helix/impl/controller/WorkflowCleanupAgent.java
>> >> > > >
>> >> > > > Thanks
>> >> > > > Dimuthu
>> >> > > >
>> >> > > > On Tue, Oct 2, 2018 at 1:12 PM Xue Junkai <[email protected]> wrote:
>> >> > > >
>> >> > > >> Could you please check the log to see how long each pipeline stage
>> >> > > >> takes?
>> >> > > >>
>> >> > > >> Also, did you set an expiry for workflows? Have they been piled up
>> >> > > >> for a long time? How long does each workflow take to complete?
>> >> > > >>
>> >> > > >> best,
>> >> > > >>
>> >> > > >> Junkai
>> >> > > >>
>> >> > > >> On Wed, Sep 26, 2018 at 8:52 AM DImuthu Upeksha <
>> >> > > >> [email protected]>
>> >> > > >> wrote:
>> >> > > >>
>> >> > > >> > Hi Junkai,
>> >> > > >> >
>> >> > > >> > Average load is about 10 - 20 workflows per minute. In some cases
>> >> > > >> > it's less than that. However, based on my observations, I feel
>> >> > > >> > like it does not depend on the load and is sporadic. Are there
>> >> > > >> > particular log lines that I can filter in the controller and
>> >> > > >> > participant to capture the timeline of a workflow, so that I can
>> >> > > >> > figure out which component is malfunctioning? We use Helix
>> >> > > >> > v0.8.1.
>> >> > > >> >
>> >> > > >> > Thanks
>> >> > > >> > Dimuthu
>> >> > > >> >
>> >> > > >> > On Tue, Sep 25, 2018 at 5:19 PM Xue Junkai <[email protected]>
>> >> > > >> > wrote:
>> >> > > >> >
>> >> > > >> > > Hi Dimuthu,
>> >> > > >> > >
>> >> > > >> > > At what rate are you submitting workflows? Usually, workflow
>> >> > > >> > > scheduling is very fast. And which version of Helix are you
>> >> > > >> > > using?
>> >> > > >> > >
>> >> > > >> > > Best,
>> >> > > >> > >
>> >> > > >> > > Junkai
>> >> > > >> > >
>> >> > > >> > > On Tue, Sep 25, 2018 at 8:58 AM DImuthu Upeksha <
>> >> > > >> > > [email protected]>
>> >> > > >> > > wrote:
>> >> > > >> > >
>> >> > > >> > > > Hi Folks,
>> >> > > >> > > >
>> >> > > >> > > > We have noticed some delays between workflow submission and
>> >> > > >> > > > the actual pickup by participants, and the delay seems to be
>> >> > > >> > > > somewhat constant at around 2 - 3 minutes. We continuously
>> >> > > >> > > > submit workflows, and after 2 - 3 minutes a batch of
>> >> > > >> > > > workflows is picked up by the participant and executed. Then
>> >> > > >> > > > it remains silent for the next 2 - 3 minutes even if we
>> >> > > >> > > > submit more workflows. It's as if the participant picks up
>> >> > > >> > > > workflows at discrete time intervals. I'm not sure whether
>> >> > > >> > > > this is an issue with the controller or the participant. Do
>> >> > > >> > > > you have any experience with this sort of behavior?
>> >> > > >> > > >
>> >> > > >> > > > Thanks
>> >> > > >> > > > Dimuthu
>> >> > > >> > > >
>> >> > > >> > >
>> >> > > >> > >
>> >> > > >> > > --
>> >> > > >> > > Junkai Xue
>> >> > > >> > >
>> >> > > >> >
>> >> > > >>
>> >> > > >>
>> >> > > >> --
>> >> > > >> Junkai Xue
>> >> > > >>
>> >> > > >
>> >> > >
>> >> >
>> >> >
>> >> > --
>> >> > Junkai Xue
>> >> >
>> >>
>> >
>> >
>> > --
>> > Junkai Xue
>> >
>>
>
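
The cleanup approach discussed in the thread (removing workflows that have
reached a finished state such as COMPLETED or STOPPED before they pile up)
can be sketched roughly as follows. This is an illustration under stated
assumptions, not the WorkflowCleanupAgent linked above: WorkflowInfo, its
finished_at field, the grace period, and the delete callback are all
hypothetical names, and the wiring to Helix's task API is deliberately left
out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# States treated as finished here; the thread names COMPLETED and STOPPED.
TERMINAL_STATES = {"COMPLETED", "STOPPED"}


@dataclass
class WorkflowInfo:
    state: str
    finished_at: float  # epoch seconds; hypothetical field for illustration


def find_expired(workflows: Dict[str, WorkflowInfo], now: float,
                 grace_seconds: float) -> List[str]:
    """Return names of finished workflows older than the grace period."""
    return sorted(
        name for name, wf in workflows.items()
        if wf.state in TERMINAL_STATES and now - wf.finished_at >= grace_seconds
    )


def cleanup(workflows: Dict[str, WorkflowInfo], delete: Callable[[str], None],
            now: float, grace_seconds: float = 60.0) -> List[str]:
    """Call the delete callback for every expired workflow; return their names."""
    expired = find_expired(workflows, now, grace_seconds)
    for name in expired:
        delete(name)
    return expired


# Made-up sample data: one old completed workflow, one still running,
# and one stopped too recently to be removed.
workflows = {
    "wf-1": WorkflowInfo("COMPLETED", finished_at=100.0),
    "wf-2": WorkflowInfo("IN_PROGRESS", finished_at=0.0),
    "wf-3": WorkflowInfo("STOPPED", finished_at=990.0),
}
deleted: List[str] = []
removed = cleanup(workflows, deleted.append, now=1000.0, grace_seconds=60.0)
print(removed)
```

The grace period is there so that a workflow which has only just finished
is not removed while something else may still be reading its final state.
The value of 60 seconds is arbitrary.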
