Hi List,

I have introduced flashcache (https://github.com/facebook/flashcache), aiming to reduce Ceph metadata IOs to the OSD's disk. Basically, for every data write, Ceph needs to write 3 things:

 Pg log
 Pg info
 Actual data

The first 2 requests are small, but on a non-btrfs filesystem those 2 writes force the OSD disk to do 2 extra seeks, which is critical for a spindle disk's throughput, as mentioned in an earlier mail.
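As a side note, these three writes can be observed directly in the OSD log by raising the filestore debug level in ceph.conf; the log excerpt quoted near the bottom of this thread shows them at that verbosity:

    [osd]
        debug filestore = 15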
I list the details of my experiment below; any input is highly appreciated.

[Setup]
2 hosts, each with 1 SSD and 1 SATA disk; the SSD is partitioned into 4 partitions: P1 as the OSD journal, P2 as the flashcache for the SATA disk, P3 as the XFS metadata journal.
1 client, with 1 RBD volume created and mounted.

[FlashCache setup]

[Create cached device]
flashcache_create -v -p back fsdc /dev/sda2 /dev/sdc

[Create filesystem]
mkfs.xfs -f -i size=2048 -d agcount=1 -l logdev=/dev/sda3,size=128m /dev/mapper/fsdc

[Mount]
mount -o logdev=/dev/sda3 -o logbsize=256k -o delaylog -o inode64 /dev/mapper/fsdc /data/osd.21/

[Tuning]
sysctl dev.flashcache.sda2+sdc.skip_seq_thresh_kb=32

Since I am aiming to cache only the Ceph metadata, and the metadata writes are very small, I configured flashcache to skip all sequential writes larger than 32K. You could set this as low as 1K, because the meta writes are all smaller than 1K; I set it to 32K just for a quick test.

[Experiment]
Running dd from the client on top of the RBD volume.

[Result]
Throughput boosted from 37MB/s to ~90MB/s. Since flashcache works at the DM level, it is transparent to Ceph. (A quick sanity check of the cache counters is sketched in the P.S. at the end of this mail.) This was just a quick test; further tests (including sequential R/W and random R/W) are scheduled. I will come back to you when there is progress.

							Xiaoxi

-----Original Message-----
From: Sage Weil [mailto:s...@inktank.com]
Sent: January 16, 2013 5:43
To: Chen, Xiaoxi
Cc: Mark Nelson; Yan, Zheng
Subject: RE: Seperate metadata disk for OSD

On Tue, 15 Jan 2013, Chen, Xiaoxi wrote:
> Hi Sage,
>      FlashCache works well for this scenario. I created a hybrid disk with
> 1 SSD partition (sharing the same SSD as the Ceph journal and XFS journal,
> but a different partition) and 1 SATA disk, and configured FlashCache to
> ignore all sequential requests larger than 32K (well, it can be set to a
> smaller number).
>      The results show performance comparable to the CephMeta-to-SSD
> solution.
>      Since flashcache works in the DM layer, I suppose it's transparent
> to Ceph, right?

Right.  That's great to hear that it works well.  If you don't mind, it
would be great if you could report the same thing to ceph-devel with a bit
of detail about how you configured FlashCache so that others can do the
same.

Thanks!
sage

>
>                Xiaoxi
>
> -----Original Message-----
> From: Sage Weil [mailto:s...@inktank.com]
> Sent: January 15, 2013 2:19
> To: Chen, Xiaoxi
> Cc: Mark Nelson; Yan, Zheng; ceph-devel@vger.kernel.org
> Subject: RE: Seperate metadata disk for OSD
>
> On Mon, 14 Jan 2013, Chen, Xiaoxi wrote:
> > Hi Sage,
> >      Thanks for your mail~
> >      Do you have a timetable for when such an improvement could be
> > ready? It's critical for non-btrfs filesystems.
> >      I am thinking about introducing flashcache into my configuration to
> > cache such meta writes; since flashcache works underneath the filesystem,
> > I suppose it will not break the assumptions inside Ceph. I will try it
> > tomorrow and come back to you~
> >      Thanks again for the help!
>
> I think avoiding the pginfo change may be pretty simple.  The log one I am
> a bit less concerned about (the appends from many rbd IOs will get
> aggregated into a small number of XFS IOs), and changing that around would
> be a bigger deal.
>
> sage
>
> >
> >
> >                Xiaoxi
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:s...@inktank.com]
> > Sent: January 13, 2013 0:57
> > To: Chen, Xiaoxi
> > Cc: Mark Nelson; Yan, Zheng; ceph-devel@vger.kernel.org
> > Subject: RE: Seperate metadata disk for OSD
> >
> > On Sat, 12 Jan 2013, Chen, Xiaoxi wrote:
> > > Hi Zheng,
> > >      I have put the XFS log on a separate disk; indeed it provides some
> > > performance gain, but not that significant.
> > >      Ceph's metadata is somewhat separate (it is a set of files residing
> > > on the OSD's disk); therefore, it cannot be helped by either the XFS
> > > journal log or the OSD's journal. That's why I am trying to put Ceph's
> > > metadata (the /data/osd.x/meta folder) on a separate SSD disk.
> > > To Nelson,
> > >      I did the experiment with just 1 client; with more clients, the
> > > gain will not be as large.
> > >      It looks to me that a single write from the client side becoming 3
> > > writes to disk is a big overhead for an in-place-update filesystem such
> > > as XFS, since it introduces more seeks. An out-of-place-update filesystem
> > > will not suffer much from such a pattern; I didn't find this problem when
> > > using BTRFS as the backend filesystem. But for BTRFS, fragmentation is
> > > another performance killer: for a single RBD volume, after a lot of
> > > random writes on it, the sequential read performance drops to 30% of that
> > > of a new RBD volume. This makes BTRFS unusable in production.
> > >      Separating the Ceph meta seems quite easy to me (I just mount a
> > > partition to /data/osd.X/meta); is that right? Is there any potential
> > > problem with it?
> >
> > Unfortunately, yes.  The ceph journal and fs sync are carefully timed.
> > The ceph-osd assumes that syncfs(2) on the $osd_data/current/commit_op_seq
> > file will sync everything, but if meta/ is another fs that isn't true.  At
> > the very least, the code needs to be modified to sync that as well.
> >
> > That said, there is a lot of improvement that can be had here.  The three
> > things we write are:
> >
> >  the pg log
> >  the pg info, spread across the pg dir xattr and that pginfo file
> >  the actual io
> >
> > The pg log could go in leveldb, which would translate those writes into a
> > single sequential stream across the entire OSD.  And the PG info split
> > between the xattr and the file is far from optimal: most of that data
> > doesn't actually change on each write.  What little does is very small, and
> > could be moved into the xattr, avoiding touching the file (which means an
> > inode + data block write) at all.
> >
> > We need to look a bit more closely to see how difficult that will really
> > be to implement, but I think it is promising!
> >
> > sage
> >
> > >
> > >                Xiaoxi
> > >
> > > -----Original Message-----
> > > From: Mark Nelson [mailto:mark.nel...@inktank.com]
> > > Sent: January 12, 2013 21:36
> > > To: Yan, Zheng
> > > Cc: Chen, Xiaoxi; ceph-devel@vger.kernel.org
> > > Subject: Re: Seperate metadata disk for OSD
> > >
> > > Hi Xiaoxi and Zheng,
> > >
> > > We've played with both of these a bit internally, but not for a
> > > production deployment.  Mostly just for diagnosing performance problems.
> > > It's been a while since I last played with this, but I hadn't seen a
> > > whole lot of performance improvement at the time.  That may have been
> > > due to the hardware in use, or perhaps other parts of Ceph have improved
> > > to the point where this matters now!
> > >
> > > On a side note, Btrfs also had a Google Summer of Code project to let
> > > you put metadata on an external device.  Originally I think that was
> > > supposed to make it into 3.7, but I am not sure if that happened.
> > >
> > > Mark
> > >
> > > On 01/12/2013 06:21 AM, Yan, Zheng wrote:
> > > > On Sat, Jan 12, 2013 at 2:57 PM, Chen, Xiaoxi <xiaoxi.c...@intel.com> wrote:
> > > >>
> > > >> Hi list,
> > > >>      For an rbd write request, Ceph needs to do 3 writes:
> > > >> 2013-01-10 13:10:15.539967 7f52f516c700 10 filestore(/data/osd.21) _do_transaction on 0x327d790
> > > >> 2013-01-10 13:10:15.539979 7f52f516c700 15 filestore(/data/osd.21) write meta/516b801c/pglog_2.1a/0//-1 36015~147
> > > >> 2013-01-10 13:10:15.540016 7f52f516c700 15 filestore(/data/osd.21) path: /data/osd.21/current/meta/DIR_C/pglog\u2.1a__0_516B801C__none
> > > >> 2013-01-10 13:10:15.540164 7f52f516c700 15 filestore(/data/osd.21) write meta/28d2f4a8/pginfo_2.1a/0//-1 0~496
> > > >> 2013-01-10 13:10:15.540189 7f52f516c700 15 filestore(/data/osd.21) path: /data/osd.21/current/meta/DIR_8/pginfo\u2.1a__0_28D2F4A8__none
> > > >> 2013-01-10 13:10:15.540217 7f52f516c700 10 filestore(/data/osd.21) _do_transaction on 0x327d708
> > > >> 2013-01-10 13:10:15.540222 7f52f516c700 15 filestore(/data/osd.21) write 2.1a_head/8abf341a/rb.0.106e.6b8b4567.0000000002d3/head//2 3227648~524288
> > > >> 2013-01-10 13:10:15.540245 7f52f516c700 15 filestore(/data/osd.21) path: /data/osd.21/current/2.1a_head/rb.0.106e.6b8b4567.0000000002d3__head_8ABF341A__2
> > > >>      If using XFS as the backend file system and running XFS on top of
> > > >> a traditional SATA disk, it will introduce a lot of seeks and
> > > >> therefore reduce bandwidth; a blktrace demonstrating this issue is
> > > >> available here (http://ww3.sinaimg.cn/mw690/6e1aee47jw1e0qsbxbvddj.jpg),
> > > >> taken with a single client running dd on top of a new RBD volume.
> > > >>      Then I tried to move /osd.X/current/meta to a separate disk, and
> > > >> the bandwidth boosted (see the blktrace at
> > > >> http://ww4.sinaimg.cn/mw690/6e1aee47jw1e0qsadz1bij.jpg).
> > > >>      I haven't tested other access patterns or anything else, but it
> > > >> looks to me that moving such meta to a separate disk (SSD, or SATA
> > > >> with btrfs) will benefit Ceph write performance; is that true? Will
> > > >> Ceph introduce this feature in the future? Is there any potential
> > > >> problem with such a hack?
> > > >>
> > > >
> > > > Did you try putting the XFS metadata log on a separate and fast device
> > > > (mkfs.xfs -l logdev=/dev/sdbx,size=10000b)?  I think it will
> > > > boost performance too.
> > > >
> > > > Regards
> > > > Yan, Zheng
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in the body of a message to majord...@vger.kernel.org
> > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
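P.S. As referenced above, a quick way to sanity-check that flashcache is absorbing only the small metadata writes is to inspect the device-mapper and flashcache counters after a run. This is just a sketch; the cache name sda2+sdc follows from the flashcache_create arguments earlier in this mail, and the exact counter names may differ between flashcache versions:

    dmsetup status fsdc                               # hit/miss and dirty block counts for the cached device
    cat /proc/flashcache/sda2+sdc/flashcache_stats    # detailed counters, e.g. write_hits and
                                                      # uncached_sequential_writes

With skip_seq_thresh_kb set, the large dd writes should show up as uncached sequential writes, while the small pglog/pginfo writes should land in the cache.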