Great, thanks. Sounds like a pretty interesting performance improvement. --John
> On Apr 30, 2015, at 11:27 AM, Shrinand Javadekar <shrin...@maginatics.com> wrote:
>
> I was able to make the code change to create the tmp directory in the
> 3-byte hash directory and fix the unit tests to get this to work. I
> will file a bug to get a discussion started on this, in case there are
> people not following this thread.
>
> On Wed, Apr 29, 2015 at 4:08 PM, Shrinand Javadekar
> <shrin...@maginatics.com> wrote:
>> Hi,
>>
>> I have been investigating a pretty serious Swift performance problem
>> for a while now. I have a single-node Swift instance with 16 cores,
>> 64GB of memory, and 8 MDs of 3TB each. I only write 256KB objects into
>> this Swift instance with high concurrency: 256 parallel object PUTs.
>> I was also sharding the objects equally across 32 containers.
>>
>> On a completely clean system, we were getting ~375 object PUTs per
>> second. But this kept dropping pretty quickly, and by the time we had
>> 600GB of data in Swift, the throughput was down to ~100 objects per
>> second.
>>
>> We used sysdig to trace what was happening in the system and found
>> that the open system calls were taking much longer: several hundred
>> milliseconds, sometimes even 1 second.
>>
>> Investigating this further revealed a problem in the way Swift writes
>> objects on XFS. Swift's object server creates a temp directory
>> under the mount point /srv/node/r0. It first creates a file under this
>> temp directory (say /srv/node/r0/tmp/tmpASDF) and eventually renames
>> that file to its final destination:
>>
>> rename /srv/node/r0/tmp/tmpASDF ->
>> /srv/node/r0/objects/312/eef/deadbeef/33453453454323424.data
>>
>> XFS creates an inode in the same allocation group as its parent. So,
>> when the temp file tmpASDF is created, it goes into the allocation
>> group of "tmp". When the rename happens, only the filesystem metadata
>> gets modified; the allocation groups of the inodes don't change.
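[For readers following along: the write path described above can be sketched roughly as below. This is an illustrative sketch of the create-then-rename pattern, not Swift's actual object-server code; the function and path names are made up for the example.]

```python
import os
import tempfile

def put_object(datadir, tmpdir, obj_hash, timestamp, body):
    """Sketch of the create-then-rename write path described above.

    The temp file is created under <mount>/tmp, so XFS places its inode
    in the allocation group of that one tmp directory; the later rename
    only rewrites directory entries and never moves the inode.
    """
    os.makedirs(tmpdir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=tmpdir)   # e.g. /srv/node/r0/tmp/tmpASDF
    try:
        os.write(fd, body)
        os.fsync(fd)                              # durable before the rename
    finally:
        os.close(fd)
    final_dir = os.path.join(datadir, obj_hash)
    os.makedirs(final_dir, exist_ok=True)
    final_path = os.path.join(final_dir, timestamp + ".data")
    os.rename(tmp_path, final_path)               # metadata-only operation on XFS
    return final_path
```

Since every PUT starts in the same tmp directory, every object inode ends up in that directory's allocation group, regardless of where the rename moves the file in the namespace.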
>>
>> Since all object PUTs start off in the tmp directory, all inodes get
>> created in the same allocation group. The B-tree used to keep track
>> of these inodes in the allocation group grows bigger and bigger as
>> more files are written, and traversing this tree for existence checks
>> or for creating new inodes becomes more and more expensive.
>>
>> See this discussion [1] I had on the XFS mailing list, where this
>> issue was brought to light, and this other, slightly older thread
>> where the problem was identical [2].
>>
>> I validated this theory by periodically deleting the temp directory
>> and observed that the objects-per-second rate was not dropping as
>> fast as before. Starting at ~375 obj/s, after 600GB of data in Swift
>> I was still getting ~340 obj/s.
>>
>> Now, how do we fix this?
>>
>> One option would be to create the temp directory deeper in the
>> filesystem rather than immediately under the mount point, e.g. one
>> temp directory under each of the 3-byte hash directories, and use
>> the temp directory corresponding to the object's hash.
>>
>> But it's unclear what other repercussions this would have. Will the
>> replicator start replicating these temp directories?
>>
>> Another option is to actually delete the tmp directory periodically.
>> The problem is that we don't know when. And whenever we decide to do
>> it, the temp directory may have files in it, making it impossible to
>> delete the directory.
>>
>> Any other options?
>>
>> Thanks in advance.
>> -Shri
>>
>> [1] http://www.spinics.net/lists/xfs/msg32868.html
>> [2] http://xfs.9218.n7.nabble.com/Performance-degradation-over-time-td28514.html
>
> _______________________________________________
> Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
> Post to     : openstack@lists.openstack.org
> Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
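[The first option above, which the follow-up message says was prototyped, amounts to creating the temp file near its final home so its inode is allocated in the destination's allocation group rather than in the allocation group of <mount>/tmp. A hedged sketch of that idea; the directory layout and function names are illustrative, not the actual Swift patch:]

```python
import os
import tempfile

def put_object_local_tmp(datadir, partition, suffix, obj_hash, timestamp, body):
    """Sketch of the proposed fix: a tmp dir inside each 3-byte hash dir.

    Because the temp file's parent directory lives under the object's
    own suffix directory, XFS allocates its inode near the final file's
    parent, spreading new inodes across allocation groups instead of
    piling them all into the allocation group of one global tmp dir.
    """
    suffix_dir = os.path.join(datadir, partition, suffix)
    tmpdir = os.path.join(suffix_dir, "tmp")      # hypothetical per-suffix tmp dir
    os.makedirs(tmpdir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=tmpdir)
    try:
        os.write(fd, body)
        os.fsync(fd)
    finally:
        os.close(fd)
    final_dir = os.path.join(suffix_dir, obj_hash)
    os.makedirs(final_dir, exist_ok=True)
    final_path = os.path.join(final_dir, timestamp + ".data")
    os.rename(tmp_path, final_path)               # rename stays within one subtree
    return final_path
```

As the thread notes, the open questions are not about the rename itself but about the surrounding machinery, e.g. whether the replicator would try to replicate these scattered tmp directories.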