Clint Byrum wrote:
Excerpts from Joshua Harlow's message of 2015-11-30 10:42:53 -0800:
Hi all,

I just wanted to bring up an issue, a possible solution, and get
feedback on it from folks, because it seems to be an ongoing problem
that shows up not when an application is initially deployed but as
operation and running of that application proceeds (i.e. after running
for a period of time).

The gist of the problem is the following:

A <<pick your favorite openstack project>> has a need to ensure that no
other application on the same machine manipulates a given resource on
that same machine at the same time, so it uses the lock file pattern
(acquire a *local* lock file for that resource, manipulate that
resource, release that lock file) to act on that resource in a safe
manner (note this does not ensure safety outside of that machine; lock
files are *not* distributed locks).
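
(For illustration only, a minimal sketch of that pattern using
fasteners.InterProcessLock; the resource name and lock path below are
made up:)

    import fasteners

    # Hypothetical per-resource lock file; one file per resource is the
    # pattern that leaves files behind on disk over time.
    LOCK_PATH = '/tmp/myproject-volume-1234.lock'

    def manipulate_resource():
        print('doing work on the resource')

    # Only one process on this machine holds this lock at a time; it is
    # not a distributed lock, so other machines are unaffected.
    with fasteners.InterProcessLock(LOCK_PATH):
        manipulate_resource()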

The API that we expose from oslo is typically accessed via the following:

    oslo_concurrency.lockutils.synchronized(name, lock_file_prefix=None,
                                            external=False, lock_path=None,
                                            semaphores=None, delay=0.01)

or via its underlying library (that I extracted from oslo.concurrency
and have improved to add more usefulness) @
http://fasteners.readthedocs.org/
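
(As a rough example of how a project typically consumes that API; the
lock name, prefix and lock_path here are illustrative, not taken from
any particular project:)

    from oslo_concurrency import lockutils

    # external=True is what produces the on-disk lock files discussed below.
    @lockutils.synchronized('volume-1234', lock_file_prefix='myproject-',
                            external=True, lock_path='/var/lock/myproject')
    def resize_volume():
        print('resizing while holding the inter-process lock')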

The issue though for <<your favorite openstack project>> is that each of
these projects now typically has a large number of lock files that exist
or have existed, and no easy way to determine when those lock files can
be deleted (afaik no periodic task exists in said projects to clean up
lock files, or to delete them when they are no longer in use...) so what
happens is bugs like https://bugs.launchpad.net/cinder/+bug/1432387
appear and there is no simple solution for cleaning lock files up (since
oslo.concurrency is really not the right layer to know when a lock can
or can not be deleted; only the application knows that...)

So then we get a few creative solutions like the following:

- https://review.openstack.org/#/c/241663/
- https://review.openstack.org/#/c/239678/
- (and others?)

So I wanted to ask the question: how are people involved in <<your
favorite openstack project>> cleaning up these files (are they being
cleaned up at all?)

Another idea that I have also been proposing is to use offset locks.

This would allow for not creating X lock files, but instead creating a
*single* lock file per project and using offsets into it as the way to
lock. For example, nova could/would create a 1MB (or larger/smaller)
*empty* file for locks; that would allow for 1,048,576 locks to be used
at the same time, which honestly should be way more than enough, and
then there would not need to be any lock cleanup at all... Is there any
reason this wasn't done back when this lock file code was initially
created? (https://github.com/harlowja/fasteners/pull/10 adds this
functionality to the underlying library if people want to look it over)
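
(Roughly, the idea is to map each lock name to a byte offset in one
preallocated file and take a range lock on that single byte. A hedged
sketch of the idea follows, not the actual pull request code; the
name-to-slot mapping and paths are assumptions:)

    import fcntl
    import os
    import zlib

    LOCK_FILE = '/tmp/myproject.locks'   # one preallocated file per project
    SLOTS = 1024 * 1024                  # 1MB file -> 1,048,576 one-byte slots

    def _offset_for(name):
        # Stable name -> slot mapping (hash() is randomized per process in
        # Python 3, so it cannot be used across processes); a collision
        # just means two names share a slot, it does not break safety.
        return zlib.crc32(name.encode('utf-8')) % SLOTS

    def offset_lock(name):
        fd = os.open(LOCK_FILE, os.O_RDWR | os.O_CREAT, 0o644)
        os.ftruncate(fd, SLOTS)          # keep the file at its fixed size
        # Exclusive POSIX record lock on a single byte at the name's offset.
        fcntl.lockf(fd, fcntl.LOCK_EX, 1, _offset_for(name))
        return fd

    def offset_unlock(fd, name):
        fcntl.lockf(fd, fcntl.LOCK_UN, 1, _offset_for(name))
        os.close(fd)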

This is really complicated, and basically just makes the directory of
lock files _look_ clean. But it still leaves each offset stale, and has
to be cleaned anyway.

What do you mean here (out of curiosity) by each offset being stale? The
file would basically never change size after startup (pick a large
enough number: 10 million, a trillion billion...) and it would just be
used appropriately from there on out...


Fasteners already has process locks that use fcntl/flock.

These locks provide enough to allow you to infer things about the owner
of the lock file. If there's no process still holding the exclusive lock
when you try to lock it, then YOU own it, and thus control the resource.
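
(In other words, a non-blocking probe along these lines tells you
whether any live process still holds the lock; the path is made up:)

    import errno
    import fcntl

    def is_lock_free(path):
        # Returns True if no live process currently holds an exclusive
        # flock on the file (the probe lock is dropped on close).
        with open(path, 'a') as f:
            try:
                fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except (IOError, OSError) as e:
                if e.errno in (errno.EACCES, errno.EAGAIN):
                    return False
                raise
            return True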

Well, not really; python doesn't expose the ability to introspect who
has the handle afaik. I tried to look into that and it looks like fcntl
(the C API) might have a way to get it, but you can't really introspect
that without, as you stated, acquiring the lock yourself... I can try to
recall more of this investigation from when I was trying to add an
@owner_pid property onto fasteners' interprocess lock class, but from my
simple memory the exposed API isn't there in python.


A cron job which tries to flock anything older than ${REASONABLE_TIME}
and deletes whatever it can lock seems fine. Whatever process was trying
to interact with the resource is gone at that point.
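
(A sketch of that cleanup pass; the directory, age threshold and
try-lock probe are all assumptions, and note the well-known race between
unlinking a lock file and another process opening it, so this mirrors
the cron-job idea rather than a fully watertight protocol:)

    import errno
    import fcntl
    import os
    import time

    LOCK_DIR = '/var/lock/myproject'   # assumed lock_path for the project
    MAX_AGE = 24 * 3600                # the "reasonable time", in seconds

    def sweep_stale_lock_files():
        now = time.time()
        for name in os.listdir(LOCK_DIR):
            path = os.path.join(LOCK_DIR, name)
            try:
                if now - os.stat(path).st_mtime < MAX_AGE:
                    continue           # recently touched, leave it alone
                with open(path, 'a') as f:
                    # If we can grab the lock, no live process owns it.
                    fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                    os.unlink(path)
            except (IOError, OSError) as e:
                if e.errno in (errno.EACCES, errno.EAGAIN, errno.ENOENT):
                    continue           # still held, or already gone
                raise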

Yes, or a periodic thread in the application that can do this in a safe
manner (using its ability to know exactly what its own app's internals
are doing...)


Now, anything that needs to safely manage a resource beyond the life of a
live process will need to keep track of its own state and be idempotent
anyway. IMO this isn't something lock files alone solve well. I believe
you're familiar with a library named taskflow that is supposed to help
write code that does this better ;). Even without taskflow, if you are
trying to do something exclusive without a single process that stays
alive, you need to do _something_ to keep track of state and restart
or revert that flow. That is a state management problem, not a locking
problem.
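
(For what it's worth, a minimal taskflow-style sketch of "track state
and be able to revert"; the task names are invented and this is only an
illustration of the idea, not anyone's actual code:)

    from taskflow import engines
    from taskflow import task
    from taskflow.patterns import linear_flow

    class ClaimResource(task.Task):
        def execute(self):
            print('claiming the resource')

        def revert(self, *args, **kwargs):
            # Runs if a later task fails, so the claim never leaks.
            print('releasing the claim')

    class UseResource(task.Task):
        def execute(self):
            print('doing the exclusive work')

    flow = linear_flow.Flow('exclusive-work').add(ClaimResource(),
                                                  UseResource())
    engines.run(flow)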


Agreed. ;)
