On 03/09/15 02:56, Angus Salkeld wrote:
> On Thu, Sep 3, 2015 at 3:53 AM Zane Bitter <[email protected]> wrote:
>
>> On 02/09/15 04:55, Steven Hardy wrote:
>>> On Wed, Sep 02, 2015 at 04:33:36PM +1200, Robert Collins wrote:
>>>> On 2 September 2015 at 11:53, Angus Salkeld <[email protected]> wrote:
>>>>
>>>>> 1. limit the number of resource actions in parallel (maybe base on
>>>>> the number of cores)
>>>>
>>>> I'm having trouble mapping that back to 'and heat-engine is running
>>>> on 3 separate servers'.
>>>
>>> I think Angus was responding to my test feedback, which was a
>>> different setup, one 4-core laptop running heat-engine with 4 worker
>>> processes.
>>>
>>> In that environment, the level of additional concurrency becomes a
>>> problem because all heat workers become so busy that creating a large
>>> stack DoSes the Heat services, and in my case also the DB.
>>>
>>> If we had a configurable option, similar to num_engine_workers, which
>>> enabled control of the number of resource actions in parallel, I
>>> probably could have controlled that explosion in activity to a more
>>> manageable series of tasks, e.g. I'd set num_resource_actions to
>>> (num_engine_workers*2) or something.
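Just to make that suggestion concrete, a hypothetical option along those
lines might be registered roughly like this (a sketch only; no such
option exists in Heat, and the name and default here are assumed):

    # Hypothetical sketch of the suggested option; not part of Heat.
    from oslo_config import cfg

    opts = [
        cfg.IntOpt('num_resource_actions',
                   default=8,  # e.g. num_engine_workers * 2
                   help='Maximum number of resource actions processed '
                        'in parallel by one heat-engine (hypothetical).'),
    ]
    cfg.CONF.register_opts(opts)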
>> I think that's actually the opposite of what we need.
>>
>> The resource actions are just sent to the worker queue to get processed
>> whenever. One day we will get to the point where we are overflowing the
>> queue, but I guarantee that we are nowhere near that day. If we are
>> DoSing ourselves, it can only be because we're pulling *everything* off
>> the queue and starting it in separate greenthreads.
> worker does not use a greenthread per job like service.py does.
> The issue is that if you have actions that are fast, you can hit the db
> hard:
>
>   QueuePool limit of size 5 overflow 10 reached, connection timed out,
>   timeout 30
>
> It seems like it's not very hard to hit this limit. It comes from simply
> loading the resource in the worker:
>
>   "/home/angus/work/heat/heat/engine/worker.py", line 276, in check_resource
>   "/home/angus/work/heat/heat/engine/worker.py", line 145, in _load_resource
>   "/home/angus/work/heat/heat/engine/resource.py", line 290, in load
>     resource_objects.Resource.get_obj(context, resource_id)
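Those numbers line up with SQLAlchemy's default QueuePool settings
(pool_size=5, max_overflow=10, timeout=30). If the per-process pool
itself is the immediate bottleneck, it can be widened through the usual
oslo.db options in heat.conf; something like the snippet below, with the
caveat that the values are purely illustrative:

    [database]
    # Illustrative values only - widens the SQLAlchemy pool per engine
    # process; it does not fix the underlying concurrency problem.
    max_pool_size = 20
    max_overflow = 40
    pool_timeout = 30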
This is probably me being naive, but that sounds strange. I would have
thought that there is no way to exhaust the connection pool by doing
lots of actions in rapid succession. I'd have guessed that the only way
to exhaust a connection pool would be to have lots of connections open
simultaneously. That suggests to me that either we are failing to
expeditiously close connections and return them to the pool, or that we
are - explicitly or implicitly - processing a bunch of messages in parallel.
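A minimal, self-contained sketch of that reasoning (plain SQLAlchemy and
sqlite3, nothing Heat-specific): sequential checkouts never exhaust a
QueuePool, but holding more than pool_size + max_overflow connections
open at once reproduces exactly the error above:

    # Sketch only: a QueuePool runs dry from *simultaneous* checkouts,
    # not from many checkouts in rapid succession.
    import sqlite3
    from sqlalchemy.pool import QueuePool

    pool = QueuePool(lambda: sqlite3.connect(':memory:'),
                     pool_size=5, max_overflow=10, timeout=3)

    # Rapid succession: each connection is returned before the next
    # checkout, so 1000 iterations never hold more than one at a time.
    for _ in range(1000):
        conn = pool.connect()
        conn.close()  # back to the pool immediately

    # Simultaneous use: hold all 15 (5 + 10 overflow) connections open;
    # the 16th checkout blocks, then raises
    #   "QueuePool limit of size 5 overflow 10 reached, ..."
    held = [pool.connect() for _ in range(15)]
    try:
        pool.connect()
    except Exception as exc:
        print(exc)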
>> In an ideal world, we might only ever pull one task off that queue at a
>> time. Any time the task is sleeping would be used for processing stuff
>> off the engine queue (which needs a quick response, since it is serving
>> the ReST API). The trouble is that you need a *huge* number of
>> heat-engines to handle stuff in parallel. In the reductio-ad-absurdum
>> case of a single engine only processing a single task at a time, we're
>> back to creating resources serially. So we probably want a higher number
>> than 1. (Phase 2 of convergence will make tasks much smaller, and may
>> even get us down to the point where we can pull only a single task at a
>> time.)
>>
>> However, the fewer engines you have, the more greenthreads we'll have to
>> allow to get some semblance of parallelism. To the extent that more
>> cores means more engines (which assumes all running on one box, but
>> still), the number of cores is negatively correlated with the number of
>> tasks that we want to allow.
>>
>> Note that all of the greenthreads run in a single CPU thread, so having
>> more cores doesn't help us at all with processing more stuff in
>> parallel.
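That last point is easy to demonstrate outside of Heat. In the toy
sketch below (plain eventlet, nothing Heat-specific), four CPU-bound
greenthreads take as long as running the same work serially, because
they all share one OS thread:

    # Toy demonstration: eventlet greenthreads cooperate within one OS
    # thread, so CPU-bound work gains nothing from spawning more of them.
    import time
    import eventlet

    def busy():
        # CPU-bound; never yields to the hub.
        sum(i * i for i in range(10 ** 6))

    start = time.time()
    pool = eventlet.GreenPool(4)
    for _ in range(4):
        pool.spawn(busy)
    pool.waitall()
    print('4 "parallel" greenthreads: %.2fs' % (time.time() - start))

    start = time.time()
    for _ in range(4):
        busy()
    print('4 serial calls:            %.2fs' % (time.time() - start))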
> Except, as I said above, we are not creating greenthreads in worker.
Well, maybe we'll need to in order to make things still work sanely with
a low number of engines :) (Should be pretty easy to do with a semaphore.)
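A minimal sketch of that idea, assuming eventlet (which Heat already
uses); the cap, the helper names and the spawning pattern here are
illustrative placeholders, not the actual worker.py code:

    # Sketch: cap how many resource tasks one worker runs concurrently.
    # MAX_CONCURRENT_CHECKS and _do_check_resource are placeholders.
    import eventlet
    from eventlet import semaphore

    MAX_CONCURRENT_CHECKS = 4   # hypothetical per-worker limit
    _check_limit = semaphore.Semaphore(MAX_CONCURRENT_CHECKS)

    def _do_check_resource(context, resource_id):
        # Stand-in for the real work (loading the resource, hitting the
        # DB, and so on).
        eventlet.sleep(0.1)

    def check_resource(context, resource_id):
        # Every incoming message still gets a greenthread, but only
        # MAX_CONCURRENT_CHECKS of them do real work (and hold DB
        # connections) at any one time; the rest wait on the semaphore
        # instead of piling onto the connection pool.
        def _run():
            with _check_limit:
                _do_check_resource(context, resource_id)
        eventlet.spawn_n(_run)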
I think what y'all are suggesting is limiting the number of jobs that go
into the queue... that's quite wrong IMO. Apart from the fact it's
impossible (resources put jobs into the queue entirely independently,
and have no knowledge of the global state required to throttle inputs),
we shouldn't implement an in-memory queue with long-running tasks
containing state that can be lost if the process dies - the whole point
of convergence is we have... a message queue for that. We need to limit
the rate that stuff comes *out* of the queue. And, again, since we have
no knowledge of global state, we can only control the rate at which an
individual worker processes tasks. The way to avoid killing the DB is to
put a constant ceiling on the workers * concurrent_tasks_per_worker product.
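As a back-of-the-envelope illustration (every number below is made up,
not a recommendation):

    # Illustrative arithmetic only.
    num_engine_workers = 4           # heat-engine processes on the box
    tasks_per_worker = 4             # hypothetical per-worker concurrency cap
    pool_size, max_overflow = 5, 10  # SQLAlchemy QueuePool defaults

    peak_tasks = num_engine_workers * tasks_per_worker               # 16
    peak_db_conns = num_engine_workers * (pool_size + max_overflow)  # 60

    # Keeping tasks_per_worker <= pool_size + max_overflow means a
    # worker's own tasks can't exhaust its pool; the database's
    # max_connections then only has to cover peak_db_conns overall.
    print(peak_tasks, peak_db_conns)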
cheers,
Zane.
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev