Re: [openstack-dev] memory usage in devstack-gate (the oom-killer strikes again)
On Tue, Sep 9, 2014 at 12:24 AM, Joe Gordon joe.gord...@gmail.com wrote:

> 1) Should we explicitly set the number of workers that services use in devstack? Why have so many workers in a small all-in-one environment? What is the right balance here?

This is what we do for Swift; without setting this up, it would have killed devstack even before the tempest runs.

Chmouel

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] memory usage in devstack-gate (the oom-killer strikes again)
Yes. guppy seems to have some nicer string formatting for this dump as well, but I was unable to figure out how to get that format written to a file; the tool seems very geared towards interactive console use. We should pick a memory formatter we like (there’s a bunch of them) and then add it to our standard toolset.

On Sep 9, 2014, at 10:35 AM, Doug Hellmann d...@doughellmann.com wrote:

> On Sep 8, 2014, at 8:12 PM, Mike Bayer mba...@redhat.com wrote:
>
>> Hi All -
>>
>> Joe had me do some quick memory profiling on nova. Just an FYI if anyone wants to play with this technique: I place a little bit of memory profiling code using Guppy into nova/api/__init__.py, or anywhere in your favorite app that will definitely get imported when the thing first runs:
>>
>>     from guppy import hpy
>>     import signal
>>     import datetime
>>
>>     def handler(signum, frame):
>>         # Dump the current heap profile to a timestamped file in /tmp.
>>         print("guppy memory dump")
>>         fname = "/tmp/memory_%s.txt" % datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
>>         prof = hpy().heap()
>>         with open(fname, 'w') as handle:
>>             prof.dump(handle)
>>         del prof
>>
>>     # Trigger a dump whenever the process receives SIGUSR2.
>>     signal.signal(signal.SIGUSR2, handler)
>
> This looks like something we could build into our standard service startup code. Maybe in http://git.openstack.org/cgit/openstack/oslo-incubator/tree/openstack/common/service.py for example?
>
> Doug
>
>> Then run nova-api, run some API calls, and hit the nova-api process with a SIGUSR2 signal; it will dump a profile into /tmp/ like this:
>>
>> http://paste.openstack.org/show/108536/
>>
>> Now obviously everyone is like, oh boy memory, let’s go beat up SQLAlchemy again… which is fine, I can take it. In that particular profile, there’s a bunch of SQLAlchemy stuff, but that is all structural to the classes that are mapped in Nova API, e.g. 52 classes with a total of 656 attributes mapped. That stuff sets up once and doesn’t change. If Nova used less ORM, e.g. didn’t map everything, that would be less. But in that profile there’s no “data” lying around.
Re: [openstack-dev] memory usage in devstack-gate (the oom-killer strikes again)
Hi All -

Joe had me do some quick memory profiling on nova. Just an FYI if anyone wants to play with this technique: I place a little bit of memory profiling code using Guppy into nova/api/__init__.py, or anywhere in your favorite app that will definitely get imported when the thing first runs:

    from guppy import hpy
    import signal
    import datetime

    def handler(signum, frame):
        # Dump the current heap profile to a timestamped file in /tmp.
        print("guppy memory dump")
        fname = "/tmp/memory_%s.txt" % datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
        prof = hpy().heap()
        with open(fname, 'w') as handle:
            prof.dump(handle)
        del prof

    # Trigger a dump whenever the process receives SIGUSR2.
    signal.signal(signal.SIGUSR2, handler)

Then run nova-api, run some API calls, and hit the nova-api process with a SIGUSR2 signal; it will dump a profile into /tmp/ like this:

http://paste.openstack.org/show/108536/

Now obviously everyone is like, oh boy memory, let’s go beat up SQLAlchemy again… which is fine, I can take it. In that particular profile, there’s a bunch of SQLAlchemy stuff, but that is all structural to the classes that are mapped in Nova API, e.g. 52 classes with a total of 656 attributes mapped. That stuff sets up once and doesn’t change. If Nova used less ORM, e.g. didn’t map everything, that would be less. But in that profile there’s no “data” lying around.

But even if you don’t have that many objects resident, your Python process might still be using up a ton of memory. The reason for this is that the CPython interpreter has a model where it will grab all the memory it needs to do something, a time-consuming process by the way, but then it really doesn’t ever release it (see http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm for the “classic” answer on this; things may have improved/modernized in 2.7 but I think this is still the general idea).

So in terms of SQLAlchemy, a good way to suck up a ton of memory all at once that probably won’t get released is to:

1. fetch full ORM objects with all of their data
2. fetch lots of them all at once

How to avoid doing that isn’t necessarily simple. The quick win for loading full objects is to… not load the whole thing! E.g. assuming we can get OpenStack onto 0.9 in requirements.txt, we can start using load_only():

    from sqlalchemy.orm import load_only

    session.query(MyObject).options(load_only("id", "name", "ip"))

or, with any version, just load those columns. We should be using this as much as possible for any query that is row/time intensive and doesn’t need full ORM behaviors (like relationships, persistence):

    session.query(MyObject.id, MyObject.name, MyObject.ip)

Another quick win, if we *really* need an ORM object, not a row, and we have to fetch a ton of them in one big result, is to fetch them using yield_per():

    for obj in session.query(MyObject).yield_per(100):
        # work with obj and then make sure to lose all references to it
        pass

yield_per() will dish out objects drawing from batches of the number you give it. But it has two huge caveats. One is that it isn’t compatible with most forms of eager loading, except for many-to-one joined loads. The other is that the DBAPI, e.g. the MySQL driver, does *not* stream the rows; virtually all DBAPIs by default load a result set fully before you ever see the first row. psycopg2 is one of the only DBAPIs that even offers a special mode to work around this (server side cursors).

Which means it’s even *better* to paginate result sets, so that you only ask the database for a chunk at a time and only ever hold a subset of the objects in memory at once. Pagination itself is tricky: if you are using a naive LIMIT/OFFSET approach, it takes a while once you are working with a large OFFSET. It’s better to SELECT into windows of data, where you can specify start and end criteria (against an indexed column, like a timestamp) for each window.

Then of course, using Core only is another level of fastness/low memory.
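The windowing idea above can be sketched without SQLAlchemy at all. This is only a minimal illustration using the standard library's sqlite3 module, not code from Nova: the table `my_object`, its columns, and the helper `iter_windows` are hypothetical names, and it assumes an integer, indexed `id` column as the window key (a timestamp column would work the same way).

```python
import sqlite3


def iter_windows(conn, span):
    """Yield lists of (id, name) rows, one bounded window at a time.

    Each query carries explicit start and end criteria on the indexed
    id column, so the database never has to skip over a large OFFSET.
    """
    low, high = conn.execute("SELECT MIN(id), MAX(id) FROM my_object").fetchone()
    if low is None:  # empty table, nothing to yield
        return
    for start in range(low, high + 1, span):
        rows = conn.execute(
            "SELECT id, name FROM my_object "
            "WHERE id >= ? AND id < ? ORDER BY id",
            (start, start + span),
        ).fetchall()
        if rows:  # sparse id ranges can produce empty windows
            yield rows


# Hypothetical sample data: ten rows with ids 1..10.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_object (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO my_object (id, name) VALUES (?, ?)",
    [(i, "obj%d" % i) for i in range(1, 11)],
)

window_sizes = [len(rows) for rows in iter_windows(conn, 4)]
print(window_sizes)  # [4, 4, 2]
```

Only one window of rows is held in memory at a time, provided the caller drops its reference to each list before asking for the next one.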
Though querying for individual columns with ORM is not far off, and I’ve also made some major improvements to that in 1.0 so that query(*cols) is pretty competitive with straight Core (and Core is… well, I’d say becoming visible in raw DBAPI’s rear view mirror, at least…).

What I’d suggest here is that we start to be mindful of memory/performance patterns and start to rework naive ORM use into more savvy patterns: being aware of what columns are needed, what rows, how many SQL queries we really need to emit, and what the “worst case” number of rows will be for sections that really need to scale. By far the hardest part is recognizing and reimplementing when something might have to deal with an arbitrarily large number of rows, which means organizing that code to deal with a “streaming” pattern where you never have all the rows in memory at once. On other projects I’ve had tasks that would normally take about a day, but in order to organize it to “scale”, took weeks - such as being able
Re: [openstack-dev] memory usage in devstack-gate (the oom-killer strikes again)
Excerpts from Joe Gordon's message of 2014-09-08 15:24:29 -0700:

> Hi All,
>
> We have recently started seeing assorted memory issues in the gate, including the oom-killer [0] and libvirt throwing memory errors [1]. Luckily we run ps and dstat on every devstack run, so we have some insight into why we are running out of memory. Based on the output from a job taken at random [2][3], a typical run consists of:
>
> * 68 openstack api processes alone
> * the following services are running 8 processes (number of CPUs on test nodes)
>   * nova-api (we actually run 24 of these, 8 compute, 8 EC2, 8 metadata)
>   * nova-conductor
>   * cinder-api
>   * glance-api
>   * trove-api
>   * glance-registry
>   * trove-conductor
> * together nova-api, nova-conductor, cinder-api alone take over 45 %MEM (note: some of that memory usage is counted multiple times as RSS includes shared libraries)
> * based on dstat numbers, it looks like we don't use that much memory before tempest runs, and after tempest runs we use a lot of memory.
>
> Based on this information I have two categories of questions:
>
> 1) Should we explicitly set the number of workers that services use in devstack? Why have so many workers in a small all-in-one environment? What is the right balance here?

I'm kind of wondering why we aren't pushing everything to go the same direction keystone did with apache. I may be crazy, but apache gives us all kinds of tools to tune around process forking that we'll have to reinvent in our own daemon bits (like MaxRequestsPerChild to prevent leaky or slow GC from eating all our memory over time).

Meanwhile, the idea behind running ncpu API processes per service is that we don't want to block an API request if there is a CPU available to it. Of course, if we have enough cinder, nova, keystone, trove, etc. requests all at one time that we do need to block, we defer to the CPU scheduler of the box to do it, rather than queue things up at the event level.
This can lead to quite ugly CPU starvation issues, and that is a lot easier to tune for if you have one tuning knob for apache + mod_wsgi instead of nservices knobs.

In production systems I'd hope that memory would be quite a bit more available than on the bazillions of cloud instances that run tests. So, while process-per-cpu-per-service is a large percentage of 8G, it is a very small percentage of 24G+, which is a pretty normal amount of memory to have on an all-in-one type of server that one might choose as a baremetal controller. For VMs that are handling production loads, it's a pretty easy trade-off to give them a little more RAM so they can take advantage of all the CPUs as needed.

All this to say, since devstack is always expected to be run in a dev context, and not production, I think it would make sense to dial it back to 4 from ncpu.

> 2) Should we be worried that some OpenStack services such as nova-api, nova-conductor and cinder-api take up so much memory? Does their memory usage keep growing over time? Does anyone have any numbers to answer this? Why do these processes take up so much memory?

Yes, I do think we should be worried that they grow quite a bit. I've experienced this problem a few times in a few scripting languages, and almost every time it turned out to be too much data being read from the database or MQ. Moving to tighter messages, and tighter database interaction, nearly always results in less wasted RAM.

I like the other suggestion to start graphing this. Since we have all that dstat data, I wonder if we can just process that directly into graphite.
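As a rough sketch of the dstat-to-graphite idea: Graphite's plaintext protocol takes one `path value timestamp` line per data point, so the conversion is mostly string formatting. The example below assumes the dstat output has already been reduced to a simple CSV with an `epoch` column; real dstat CSV files carry extra header lines that would need skipping first. The function name, the metric prefix, and the sample values are all hypothetical.

```python
import csv
import io


def to_graphite_lines(csv_text, prefix):
    """Turn CSV rows with an 'epoch' column into Graphite plaintext lines."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Graphite expects an integer Unix timestamp as the last field.
        timestamp = int(float(row.pop("epoch")))
        for name, value in row.items():
            lines.append("%s.%s %s %d" % (prefix, name, value, timestamp))
    return lines


# Hypothetical sample: one dstat-like reading of used and free memory in bytes.
sample = "epoch,mem_used,mem_free\n1410217200,6442450944,2147483648\n"
lines = to_graphite_lines(sample, "devstack.dstat")
for line in lines:
    print(line)
```

Each line would then be written, newline-terminated, to the Carbon plaintext listener (TCP port 2003 by default).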