Hi, I have been investigating a pretty serious Swift performance problem for a while now. I have a single-node Swift instance with 16 cores, 64GB of memory, and 8 MDs of 3TB each. I write only 256KB objects into this Swift instance, with high concurrency: 256 parallel object PUTs. The objects are sharded equally across 32 containers.
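For reference, sharding objects equally across the 32 containers can be done by hashing the object name. This is a minimal sketch of that idea; the container naming scheme and the choice of MD5 are my assumptions, not necessarily what my test harness used:

```python
import hashlib

NUM_CONTAINERS = 32  # assumption: matches the 32 containers in the test setup

def container_for(obj_name: str) -> str:
    """Pick one of NUM_CONTAINERS containers by hashing the object name,
    so PUT load spreads roughly evenly across containers."""
    h = int(hashlib.md5(obj_name.encode("utf-8")).hexdigest(), 16)
    return "objects_%02d" % (h % NUM_CONTAINERS)
```

The mapping is deterministic, so the same object always goes to the same container, and MD5's uniformity keeps the per-container load balanced.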
On a completely clean system, we were getting ~375 object PUTs per second. But this kept dropping fairly quickly, and by the time we had 600GB of data in Swift, the throughput was ~100 objects per second. We used sysdig to trace what was happening in the system and found that open system calls were taking far longer than expected: several hundred milliseconds, sometimes even a full second.

Investigating this further revealed a problem in the way Swift writes objects on XFS. Swift's object server creates a temp directory directly under the mount point /srv/node/r0. It first creates a file under this temp directory (say /srv/node/r0/tmp/tmpASDF) and eventually renames it to its final destination:

rename /srv/node/r0/tmp/tmpASDF -> /srv/node/r0/objects/312/eef/deadbeef/33453453454323424.data

XFS creates an inode in the same allocation group as its parent directory. So when the temp file tmpASDF is created, its inode goes into the same allocation group as "tmp". When the rename happens, only the filesystem metadata is modified; the inode's allocation group does not change. Since all object PUTs start off in the tmp directory, all inodes end up in the same allocation group. The B-tree that tracks these inodes in the allocation group grows bigger and bigger as more files are written, and traversing this tree for existence checks or to allocate new inodes becomes more and more expensive. See this discussion [1] I had on the XFS mailing list, where this issue was brought to light, and this other, slightly older thread where the problem was identical [2].

I validated this theory by periodically deleting the temp directory and observed that the objects-per-second rate no longer dropped at the same rate as before: starting at ~375 obj/s, after 600GB of data in Swift I was still getting ~340 obj/s.

Now, how do we fix this? One option would be to put the temp directory somewhere deeper in the filesystem rather than immediately under the mount point. E.g.
create one temp directory under each of the 3-byte hash directories, and use the temp directory corresponding to the object's hash. But it's unclear what other repercussions this would have. Would the replicator start replicating these temp directories?

Another option is to actually delete the tmp directory periodically. The problem is that we don't know when: whenever we decide to do it, the temp directory may still contain files from in-flight PUTs, making it impossible to delete.

Any other options? Thanks in advance.

-Shri

[1] http://www.spinics.net/lists/xfs/msg32868.html
[2] http://xfs.9218.n7.nabble.com/Performance-degradation-over-time-td28514.html
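To make the two write paths concrete, here is a minimal sketch of the current pattern (temp file under the mount-point tmp directory) versus the proposed per-hash-directory variant. This is illustrative pseudocode in the spirit of Swift's object server, not Swift's actual DiskFile code; all paths and function names are my own:

```python
import os
import tempfile

def put_object_current(mount_point: str, final_path: str, data: bytes) -> None:
    """Current pattern: every object starts life under <mount>/tmp.
    On XFS, each new inode lands in the allocation group of that single
    tmp directory; the later rename only touches metadata, so the inode
    never moves to another allocation group."""
    tmp_dir = os.path.join(mount_point, "tmp")
    os.makedirs(tmp_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.makedirs(os.path.dirname(final_path), exist_ok=True)
    os.rename(tmp_path, final_path)  # same filesystem: metadata-only rename

def put_object_per_hashdir(mount_point: str, final_path: str, data: bytes) -> None:
    """Proposed variant: create the temp file in a tmp directory under the
    object's 3-byte suffix directory, so new inodes spread across the
    allocation groups of the many suffix directories instead of piling
    up in the one belonging to <mount>/tmp."""
    suffix_dir = os.path.dirname(os.path.dirname(final_path))
    tmp_dir = os.path.join(suffix_dir, "tmp")
    os.makedirs(tmp_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.makedirs(os.path.dirname(final_path), exist_ok=True)
    os.rename(tmp_path, final_path)
```

Both variants keep the crash-safety property of the original design (the object only appears at its final path via an atomic rename); the difference is solely where the inode gets allocated.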