Re: Sporadic delays in task execution

Hunter Lee Fri, 22 Mar 2019 13:29:29 -0700

Let me add a caveat to my previous email. Although it comes with
scalability improvements, there are currently a few known issues with the
latest version. We'd encourage you to check back to make sure your current
usage isn't affected.


Hunter

On Fri, Mar 22, 2019 at 12:35 PM Hunter Lee <[email protected]> wrote:

> No problem. If you have further questions, let us know what kind of load
> you're putting on Helix as well. The newest version of Helix contains Task
> Framework 2.0, and has greater scalability in scheduling tasks, so you
> might want to consider using the newest version as well.
>
> Hunter
>
> On Fri, Mar 22, 2019 at 8:59 AM DImuthu Upeksha <
> [email protected]> wrote:
>
>> Hi Lee,
>>
>> Thanks for the trick. I didn't know that we could poke the controller like
>> that :) However, tasks are now moving smoothly in our staging setup. This
>> behavior appears from time to time and resolves itself within a few hours.
>> I can't find a particular pattern, but my best guess is that it happens
>> when the load is high. I will put some load on the testing setup, see if I
>> can reproduce the issue, try your instructions, and then get back to you.
>>
>> Thanks
>> Dimuthu
>>
>> On Thu, Mar 21, 2019 at 5:27 PM Hunter Lee <[email protected]> wrote:
>>
>> > Hi Dimuthu,
>> >
>> > What Junkai meant by touching the IdealState is this:
>> >
>> > 1) Use ZooInspector to log into ZK
>> > 2) Locate the IDEALSTATES/ path
>> > 3) Pick any ZNode under that path, modify it (just add a whitespace), and
>> > save
>> > 4) This will trigger a ZK callback, which should tell the Helix Controller
>> > to rebalance/schedule things
>> >
>> > On Thu, Mar 21, 2019 at 11:30 AM DImuthu Upeksha <
>> > [email protected]> wrote:
>> >
>> >> Hi Junkai,
>> >>
>> >> What do you mean by touching the ideal state to trigger an event? I didn't
>> >> quite get what you said. Is that like creating some path in ZooKeeper?
>> >> Workflows are eventually scheduled, but the problem is that scheduling is
>> >> very slow due to that 30s freeze.
>> >>
>> >> Thanks
>> >> Dimuthu
>> >>
>> >> On Thu, Mar 21, 2019 at 2:26 PM Xue Junkai <[email protected]> wrote:
>> >>
>> >> > Can you try one thing? Touch the ideal state to trigger an event. If the
>> >> > workflows are still not scheduled, it would mean scheduling has a problem.
>> >> >
>> >> > Best,
>> >> >
>> >> > Junkai
>> >> >
>> >> > On Wed, Mar 20, 2019 at 10:31 PM DImuthu Upeksha <
>> >> > [email protected]> wrote:
>> >> >
>> >> >> Hi Junkai,
>> >> >>
>> >> >> We are using 0.8.1
>> >> >>
>> >> >> Dimuthu
>> >> >>
>> >> >> On Thu, Mar 21, 2019 at 12:14 AM Xue Junkai <[email protected]> wrote:
>> >> >>
>> >> >> > Hi Dimuthu,
>> >> >> >
>> >> >> > What's the version of Helix you are using?
>> >> >> >
>> >> >> > Best,
>> >> >> >
>> >> >> > Junkai
>> >> >> >
>> >> >> > On Wed, Mar 20, 2019 at 8:54 PM DImuthu Upeksha <
>> >> >> > [email protected]>
>> >> >> > wrote:
>> >> >> >
>> >> >> > > Hi Helix Dev,
>> >> >> > >
>> >> >> > > We are again seeing this delay in task execution. Please have a
>> >> >> > > look at the screencast [1] of logs printed in the participant
>> >> >> > > (top shell) and the controller (bottom shell). When I recorded
>> >> >> > > this, there were about 90 - 100 workflows pending execution. As
>> >> >> > > you can see, some tasks were suddenly executed and then the
>> >> >> > > participant froze for about 30 seconds before executing the next
>> >> >> > > set of tasks. I can see some WARN logs in the controller log. I
>> >> >> > > feel like this 30 second delay is some sort of pattern. What do
>> >> >> > > you think is the reason for this? I can provide more information
>> >> >> > > by turning on verbose logs on the controller if you want.
>> >> >> > >
>> >> >> > > [1] https://youtu.be/3EUdSxnIxVw
>> >> >> > >
>> >> >> > > Thanks
>> >> >> > > Dimuthu
>> >> >> > >
>> >> >> > > On Thu, Oct 4, 2018 at 4:46 PM DImuthu Upeksha <[email protected]>
>> >> >> > > wrote:
>> >> >> > >
>> >> >> > > > Hi Junkai,
>> >> >> > > >
>> >> >> > > > I'm CCing the Airavata dev list as this is directly related to
>> >> >> > > > the project.
>> >> >> > > >
>> >> >> > > > I just went through ZooKeeper paths like
>> >> >> > > > /<Cluster Name>/EXTERNALVIEW and /<Cluster Name>/CONFIGS/RESOURCE,
>> >> >> > > > as I have noticed that the Helix controller periodically monitors
>> >> >> > > > the children of those paths even though all the workflows have
>> >> >> > > > moved into a saturated state like COMPLETED or STOPPED. In our
>> >> >> > > > case, we have a lot of completed workflows piled up in those
>> >> >> > > > paths. I believe that Helix clears up those resources after some
>> >> >> > > > TTL. What I did was write an external spectator [1] that
>> >> >> > > > continuously monitors for saturated workflows and clears up
>> >> >> > > > their resources before the controller does so after the TTL.
>> >> >> > > > After that, we didn't see such delays in workflow execution, and
>> >> >> > > > everything seems to be running smoothly. However, we are
>> >> >> > > > continuously monitoring our deployments for any adverse effects
>> >> >> > > > introduced by that improvement.
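
For illustration, the selection step of such a cleanup agent could look
roughly like the Java sketch below. It is not the code of the linked
WorkflowCleanupAgent and it does not call any Helix API: the class name
TerminalWorkflowFilter, the method selectForCleanup, and the plain-string
states are all hypothetical, and treating COMPLETED and STOPPED as terminal
simply follows the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: picks out workflows whose state is terminal, so that a
// cleanup agent could delete their resources without waiting for the TTL.
final class TerminalWorkflowFilter {

    // States treated as terminal here. COMPLETED and STOPPED are the two named
    // in the thread; this set is an assumption, not Helix's own definition.
    static final Set<String> TERMINAL_STATES =
            new HashSet<>(Arrays.asList("COMPLETED", "STOPPED"));

    // Returns the names of all workflows currently in a terminal state,
    // in the iteration order of the given map.
    static List<String> selectForCleanup(Map<String, String> workflowStates) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : workflowStates.entrySet()) {
            if (TERMINAL_STATES.contains(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

A real agent would pass the returned names to whatever deletion call its
Helix version provides; that part is left out here because it depends on the
deployment.
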
>> >> >> > > >
>> >> >> > > > Please let us know if we are doing something wrong in this
>> >> >> > > > improvement, or whether there is a better way to achieve this
>> >> >> > > > directly through the Helix task framework.
>> >> >> > > >
>> >> >> > > > [1] https://github.com/apache/airavata/blob/staging/modules/airavata-helix/helix-spectator/src/main/java/org/apache/airavata/helix/impl/controller/WorkflowCleanupAgent.java
>> >> >> > > >
>> >> >> > > > Thanks
>> >> >> > > > Dimuthu
>> >> >> > > >
>> >> >> > > > On Tue, Oct 2, 2018 at 1:12 PM Xue Junkai <[email protected]> wrote:
>> >> >> > > >
>> >> >> > > >> Could you please check the log to see how long each pipeline
>> >> >> > > >> stage takes?
>> >> >> > > >>
>> >> >> > > >> Also, did you set an expiry for the workflows? Have they been
>> >> >> > > >> piled up for a long time? How long does each workflow take to
>> >> >> > > >> complete?
>> >> >> > > >>
>> >> >> > > >> best,
>> >> >> > > >>
>> >> >> > > >> Junkai
>> >> >> > > >>
>> >> >> > > >> On Wed, Sep 26, 2018 at 8:52 AM DImuthu Upeksha <
>> >> >> > > >> [email protected]>
>> >> >> > > >> wrote:
>> >> >> > > >>
>> >> >> > > >> > Hi Junkai,
>> >> >> > > >> >
>> >> >> > > >> > The average load is about 10 - 20 workflows per minute. In
>> >> >> > > >> > some cases it's less than that. However, based on my
>> >> >> > > >> > observations, I feel like it does not depend on the load and
>> >> >> > > >> > is sporadic. Are there particular log lines that I can
>> >> >> > > >> > filter for in the controller and participant to capture the
>> >> >> > > >> > timeline of a workflow, so that I can figure out which
>> >> >> > > >> > component is malfunctioning? We use Helix v0.8.1.
>> >> >> > > >> >
>> >> >> > > >> > Thanks
>> >> >> > > >> > Dimuthu
>> >> >> > > >> >
>> >> >> > > >> > On Tue, Sep 25, 2018 at 5:19 PM Xue Junkai <[email protected]> wrote:
>> >> >> > > >> >
>> >> >> > > >> > > Hi Dimuthu,
>> >> >> > > >> > >
>> >> >> > > >> > > At what rate are you submitting workflows? Usually,
>> >> >> > > >> > > workflow scheduling is very fast. And which version of
>> >> >> > > >> > > Helix are you using?
>> >> >> > > >> > >
>> >> >> > > >> > > Best,
>> >> >> > > >> > >
>> >> >> > > >> > > Junkai
>> >> >> > > >> > >
>> >> >> > > >> > > On Tue, Sep 25, 2018 at 8:58 AM DImuthu Upeksha <
>> >> >> > > >> > > [email protected]>
>> >> >> > > >> > > wrote:
>> >> >> > > >> > >
>> >> >> > > >> > > > Hi Folks,
>> >> >> > > >> > > >
>> >> >> > > >> > > > We have noticed some delays between workflow submission
>> >> >> > > >> > > > and the actual pickup by participants, and the delay
>> >> >> > > >> > > > seems to be somewhat constant at around 2-3 minutes. We
>> >> >> > > >> > > > submit workflows continuously, and after 2-3 minutes a
>> >> >> > > >> > > > batch of workflows is picked up by a participant and
>> >> >> > > >> > > > executed. Then it remains silent for the next 2-3
>> >> >> > > >> > > > minutes, even if we submit more workflows. It's as if
>> >> >> > > >> > > > the participant picks up workflows at discrete time
>> >> >> > > >> > > > intervals. I'm not sure whether this is an issue with
>> >> >> > > >> > > > the controller or the participant. Do you have any
>> >> >> > > >> > > > experience with this sort of behavior?
>> >> >> > > >> > > >
>> >> >> > > >> > > > Thanks
>> >> >> > > >> > > > Dimuthu
>> >> >> > > >> > > >
>> >> >> > > >> > >
>> >> >> > > >> > >
>> >> >> > > >> > > --
>> >> >> > > >> > > Junkai Xue
>> >> >> > > >> > >
>> >> >> > > >> >
>> >> >> > > >>
>> >> >> > > >>
>> >> >> > > >> --
>> >> >> > > >> Junkai Xue
>> >> >> > > >>
>> >> >> > > >
>> >> >> > >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Junkai Xue
>> >> >> >
>> >> >>
>> >> >
>> >> >
>> >> > --
>> >> > Junkai Xue
>> >> >
>> >>
>> >
>>
>
