RE: Separate metadata disk for OSD

2013-01-14 Thread Chen, Xiaoxi
Hi Sage,
Thanks for your mail~
Would you have a timetable for when such an improvement could be 
ready? It's critical for non-btrfs filesystems.
I am thinking about introducing flashcache into my configuration to 
cache such meta writes; since flashcache sits underneath the filesystem, I suppose 
it will not break the assumptions inside Ceph. I will try it tomorrow and 
come back to you ~
Thanks again for the help!


Xiaoxi

-Original Message-
From: Sage Weil [mailto:s...@inktank.com] 
Sent: January 13, 2013, 0:57
To: Chen, Xiaoxi
Cc: Mark Nelson; Yan, Zheng ; ceph-devel@vger.kernel.org
Subject: RE: Separate metadata disk for OSD

On Sat, 12 Jan 2013, Chen, Xiaoxi wrote:
 Hi Zheng,
   I have put the XFS log on a separate disk; it does provide some 
 performance gain, but not a significant one.
   Ceph's metadata is somewhat separate (it is a set of files residing on the 
 OSD's disk), so it cannot be helped by either the XFS log or the OSD's 
 journal. That's why I am trying to put Ceph's metadata (the /data/osd.x/meta 
 folder) on a separate SSD.
 To Nelson,
   I did the experiment with just one client; with more clients the gain will 
 not be as large.
   It looks to me that a single write from the client side becoming three 
 writes to disk is a big overhead for an in-place-update filesystem such as 
 XFS, since it introduces more seeks. An out-of-place-update filesystem does 
 not suffer as much from this pattern; I did not see this problem when using 
 BTRFS as the backend filesystem. But for BTRFS, fragmentation is another 
 performance killer: for a single RBD volume, after a lot of random writes the 
 sequential read performance drops to 30% of that of a fresh volume. This 
 makes BTRFS unusable in production.
   Separating the Ceph meta seems quite easy to me (I just mount a partition 
 at /data/osd.X/meta). Is that right? Is there any potential problem with it? 

Unfortunately, yes.  The ceph journal and fs sync are carefully timed.  
The ceph-osd assumes that syncfs(2) on the $osd_data/current/commit_op_seq file 
will sync everything, but if meta/ is another fs that isn't true.  At the very 
least, the code needs to be modified to sync that as well.
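
For concreteness, a minimal C++ sketch (not actual FileStore code) of what 
"sync that as well" would have to look like if meta/ were mounted from its own 
device.  The helper and the paths are illustrative only:

    // Hypothetical helper: flush the filesystem that backs 'path'.
    #include <fcntl.h>
    #include <unistd.h>   // syncfs(2); g++ defines _GNU_SOURCE, so it is declared
    #include <cstdio>

    static int sync_fs_of(const char* path) {
      int fd = open(path, O_RDONLY | O_DIRECTORY);
      if (fd < 0) { perror(path); return -1; }
      int r = syncfs(fd);   // flushes all dirty data of the fs containing 'path'
      if (r < 0) perror("syncfs");
      close(fd);
      return r;
    }

    int main() {
      // With everything on one filesystem, the first call alone is what the
      // OSD's commit logic relies on today.
      sync_fs_of("/data/osd.21/current");
      // If meta/ were a separate mount, the commit path would also need
      // something like this, or pglog/pginfo could lag the committed op_seq.
      sync_fs_of("/data/osd.21/current/meta");
      return 0;
    }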

That said, there is a lot of improvement that can be had here.  The three 
things we write are:

 the pg log
 the pg info, spread across the pg dir xattr and that pginfo file
 the actual io

The pg log could go in leveldb, which would translate those writes into a 
single sequential stream across the entire OSD.  And the PG info split 
between the xattr and the file is far from optimal: most of that data doesn't 
actually change on each write.  What little does is very small, and could be 
moved into the xattr, avoiding touching the file (which means an inode + data 
block write) at all.

We need to look a bit more closely to see how difficult that will really be to 
implement, but I think it is promising!
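
To make the leveldb idea concrete, here is a minimal sketch against the stock 
leveldb C++ API (built with something like g++ -std=c++11 pglog.cc -lleveldb). 
The store path, key scheme, and value encoding are invented for illustration 
and are not the actual FileStore design:

    #include <cstdio>
    #include <string>
    #include <leveldb/db.h>
    #include <leveldb/write_batch.h>

    int main() {
      leveldb::DB* db = nullptr;
      leveldb::Options opts;
      opts.create_if_missing = true;
      // One store per OSD, so log appends from every PG funnel into one
      // sequential write stream (leveldb's log and level-0 files).
      leveldb::Status s = leveldb::DB::Open(opts, "/data/osd.21/pglog.db", &db);
      if (!s.ok()) { std::fprintf(stderr, "open: %s\n", s.ToString().c_str()); return 1; }

      // Key = pgid plus a zero-padded sequence number, so each PG's entries
      // stay ordered; the value would be an encoded pg log entry in real life.
      leveldb::WriteBatch batch;
      batch.Put("pglog/2.1a/0000000001", "<encoded pg log entry>");
      batch.Put("pglog/2.1a/0000000002", "<encoded pg log entry>");

      leveldb::WriteOptions wo;
      wo.sync = false;  // durability still governed by the OSD journal/commit cycle
      s = db->Write(wo, &batch);
      if (!s.ok()) std::fprintf(stderr, "write: %s\n", s.ToString().c_str());

      delete db;
      return 0;
    }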

sage


 
   
   
   
 Xiaoxi
 -Original Message-
 From: Mark Nelson [mailto:mark.nel...@inktank.com]
 Sent: January 12, 2013, 21:36
 To: Yan, Zheng
 Cc: Chen, Xiaoxi; ceph-devel@vger.kernel.org
 Subject: Re: Separate metadata disk for OSD
 
 Hi Xiaoxi and Zheng,
 
 We've played with both of these some internally, but not for a production 
 deployment. Mostly just for diagnosing performance problems. 
   It's been a while since I last played with this, but I hadn't seen a whole 
 lot of performance improvements at the time.  That may have been due to the 
 hardware in use, or perhaps other parts of Ceph have improved to the point 
 where this matters now!
 
 On a side note, Btrfs also had a Google Summer of Code project to let you put 
 metadata on an external device.  Originally I think that was supposed to make 
 it into 3.7, but am not sure if that happened.
 
 Mark
 
 On 01/12/2013 06:21 AM, Yan, Zheng wrote:
  On Sat, Jan 12, 2013 at 2:57 PM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:
 
  Hi list,
   For an RBD write request, Ceph needs to do 3 writes:
  2013-01-10 13:10:15.539967 7f52f516c700 10 filestore(/data/osd.21) 
 _do_transaction on 0x327d790
  2013-01-10 13:10:15.539979 7f52f516c700 15 filestore(/data/osd.21) 
 write meta/516b801c/pglog_2.1a/0//-1 36015~147
  2013-01-10 13:10:15.540016 7f52f516c700 15 filestore(/data/osd.21)
 path: /data/osd.21/current/meta/DIR_C/pglog\u2.1a__0_516B801C__none
  2013-01-10 13:10:15.540164 7f52f516c700 15 filestore(/data/osd.21) 
 write meta/28d2f4a8/pginfo_2.1a/0//-1 0~496
  2013-01-10 13:10:15.540189 7f52f516c700 15 filestore(/data/osd.21)
 path

RE: Separate metadata disk for OSD

2013-01-14 Thread Sage Weil
On Mon, 14 Jan 2013, Chen, Xiaoxi wrote:
 Hi Sage,
 Thanks for your mail~
   Would you have a timetable for when such an improvement could be 
 ready? It's critical for non-btrfs filesystems.
   I am thinking about introducing flashcache into my configuration to 
 cache such meta writes; since flashcache sits underneath the filesystem, I 
 suppose it will not break the assumptions inside Ceph. I will try it tomorrow 
 and come back to you ~
   Thanks again for the help!

I think avoiding the pginfo change may be pretty simple.  The log one I am 
a bit less concerned about (the appends from many rbd IOs will get 
aggregated into a small number of XFS IOs), and changing that around would 
be a bigger deal.
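
A tiny sketch of what the pginfo side could look like at the filesystem level: 
keep the small, frequently-changing fields in an xattr on the PG directory so 
the pginfo_* file (an inode plus a data block) is not rewritten on every IO. 
The xattr name, payload, and values below are hypothetical, and the real 
change would live inside FileStore's transaction path, not a standalone tool:

    #include <sys/xattr.h>
    #include <cstdint>
    #include <cstdio>

    int main() {
      // The PG collection directory, as seen in the FileStore logs in this thread.
      const char* pg_dir = "/data/osd.21/current/2.1a_head";

      // Pretend the only per-write mutable state is an (epoch, version) pair;
      // 16 bytes with a short name can normally stay inline in the inode.
      uint64_t last_update[2] = {1, 1};  // dummy values
      if (setxattr(pg_dir, "user.ceph.pginfo_fast", last_update,
                   sizeof(last_update), 0) < 0) {
        perror("setxattr");
        return 1;
      }

      uint64_t readback[2] = {0, 0};
      if (getxattr(pg_dir, "user.ceph.pginfo_fast", readback,
                   sizeof(readback)) < 0) {
        perror("getxattr");
        return 1;
      }
      std::printf("last_update = %llu'%llu\n",
                  (unsigned long long)readback[0],
                  (unsigned long long)readback[1]);
      return 0;
    }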

sage


   
   
   Xiaoxi
 
 -Original Message-
 From: Sage Weil [mailto:s...@inktank.com] 
 Sent: January 13, 2013, 0:57
 To: Chen, Xiaoxi
 Cc: Mark Nelson; Yan, Zheng ; ceph-devel@vger.kernel.org
 Subject: RE: Separate metadata disk for OSD
 
 On Sat, 12 Jan 2013, Chen, Xiaoxi wrote:
  Hi Zheng,
  I have put the XFS log on a separate disk; it does provide some 
  performance gain, but not a significant one.
  Ceph's metadata is somewhat separate (it is a set of files residing on the 
  OSD's disk), so it cannot be helped by either the XFS log or the OSD's 
  journal. That's why I am trying to put Ceph's metadata (the /data/osd.x/meta 
  folder) on a separate SSD.
  To Nelson,
  I did the experiment with just one client; with more clients the gain will 
  not be as large.
  It looks to me that a single write from the client side becoming three 
  writes to disk is a big overhead for an in-place-update filesystem such as 
  XFS, since it introduces more seeks. An out-of-place-update filesystem does 
  not suffer as much from this pattern; I did not see this problem when using 
  BTRFS as the backend filesystem. But for BTRFS, fragmentation is another 
  performance killer: for a single RBD volume, after a lot of random writes 
  the sequential read performance drops to 30% of that of a fresh volume. This 
  makes BTRFS unusable in production.
  Separating the Ceph meta seems quite easy to me (I just mount a partition 
  at /data/osd.X/meta). Is that right? Is there any potential problem with it? 
 
 Unfortunately, yes.  The ceph journal and fs sync are carefully timed.  
 The ceph-osd assumes that syncfs(2) on the $osd_data/current/commit_op_seq 
 file will sync everything, but if meta/ is another fs that isn't true.  At 
 the very least, the code needs to be modified to sync that as well.
 
 That said, there is a lot of improvement that can be had here.  The three 
 things we write are:
 
  the pg log
  the pg info, spread across the pg dir xattr and that pginfo file
  the actual io
 
 The pg log could go in leveldb, which would translate those writes into a 
 single sequential stream across the entire OSD.  And the PG info split 
 between the xattr and the file is far from optimal: most of that data doesn't 
 actually change on each write.  What little does is very small, and could be 
 moved into the xattr, avoiding touching the file (which means an inode + data 
 block write) at all.
 
 We need to look a bit more closely to see how difficult that will really be 
 to implement, but I think it is promising!
 
 sage
 
 
  
  
  
  
  Xiaoxi
  -Original Message-
  From: Mark Nelson [mailto:mark.nel...@inktank.com]
  Sent: January 12, 2013, 21:36
  To: Yan, Zheng
  Cc: Chen, Xiaoxi; ceph-devel@vger.kernel.org
  Subject: Re: Separate metadata disk for OSD
  
  Hi Xiaoxi and Zheng,
  
  We've played with both of these some internally, but not for a production 
  deployment. Mostly just for diagnosing performance problems. 
It's been a while since I last played with this, but I hadn't seen a 
  whole lot of performance improvements at the time.  That may have been due 
  to the hardware in use, or perhaps other parts of Ceph have improved to the 
  point where this matters now!
  
  On a side note, Btrfs also had a Google Summer of Code project to let you 
  put metadata on an external device.  Originally I think that was supposed 
  to make it into 3.7, but am not sure if that happened.
  
  Mark
  
  On 01/12/2013 06:21 AM, Yan, Zheng wrote:
   On Sat, Jan 12, 2013 at 2:57 PM, Chen, Xiaoxi xiaoxi.c...@intel.com 
   wrote:
  
   Hi list,
    For an RBD write request, Ceph needs to do 3 writes:
   2013-01-10 13:10:15.539967 7f52f516c700 10 filestore(/data/osd.21) 
  _do_transaction on 0x327d790
   2013-01-10 13:10:15.539979 7f52f516c700 15 filestore(/data/osd.21

Re: Separate metadata disk for OSD

2013-01-12 Thread Yan, Zheng
On Sat, Jan 12, 2013 at 2:57 PM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:

 Hi list,
 For an RBD write request, Ceph needs to do 3 writes:
 2013-01-10 13:10:15.539967 7f52f516c700 10 filestore(/data/osd.21) 
 _do_transaction on 0x327d790
 2013-01-10 13:10:15.539979 7f52f516c700 15 filestore(/data/osd.21) write 
 meta/516b801c/pglog_2.1a/0//-1 36015~147
 2013-01-10 13:10:15.540016 7f52f516c700 15 filestore(/data/osd.21) path: 
 /data/osd.21/current/meta/DIR_C/pglog\u2.1a__0_516B801C__none
 2013-01-10 13:10:15.540164 7f52f516c700 15 filestore(/data/osd.21) write 
 meta/28d2f4a8/pginfo_2.1a/0//-1 0~496
 2013-01-10 13:10:15.540189 7f52f516c700 15 filestore(/data/osd.21) path: 
 /data/osd.21/current/meta/DIR_8/pginfo\u2.1a__0_28D2F4A8__none
 2013-01-10 13:10:15.540217 7f52f516c700 10 filestore(/data/osd.21) 
 _do_transaction on 0x327d708
 2013-01-10 13:10:15.540222 7f52f516c700 15 filestore(/data/osd.21) write 
 2.1a_head/8abf341a/rb.0.106e.6b8b4567.02d3/head//2 3227648~524288
 2013-01-10 13:10:15.540245 7f52f516c700 15 filestore(/data/osd.21) path: 
 /data/osd.21/current/2.1a_head/rb.0.106e.6b8b4567.02d3__head_8ABF341A__2

 If using XFS as the backend filesystem, running on top of a traditional SATA 
 disk, this introduces a lot of seeks and therefore reduces bandwidth; a 
 blktrace is available here 
 (http://ww3.sinaimg.cn/mw690/6e1aee47jw1e0qsbxbvddj.jpg) to demonstrate the 
 issue (a single client running dd on top of a new RBD volume).
 Then I tried moving /osd.X/current/meta to a separate disk, and the 
 bandwidth increased significantly (see the blktrace at 
 http://ww4.sinaimg.cn/mw690/6e1aee47jw1e0qsadz1bij.jpg).
 I haven't tested other access patterns yet, but it looks to me that moving 
 this meta to a separate disk (SSD, or SATA with btrfs) will benefit Ceph 
 write performance. Is that true? Will Ceph introduce this feature in the 
 future? Is there any potential problem with such a hack?


Did you try putting the XFS metadata log on a separate, fast device
(mkfs.xfs -l logdev=/dev/sdbx,size=1b)? I think it will boost
performance too.

Regards
Yan, Zheng


Re: Separate metadata disk for OSD

2013-01-12 Thread Mark Nelson

Hi Xiaoxi and Zheng,

We've played with both of these some internally, but not for a 
production deployment. Mostly just for diagnosing performance problems. 
 It's been a while since I last played with this, but I hadn't seen a 
whole lot of performance improvements at the time.  That may have been 
due to the hardware in use, or perhaps other parts of Ceph have improved 
to the point where this matters now!


On a side note, Btrfs also had a Google Summer of Code project to let 
you put metadata on an external device.  Originally I think that was 
supposed to make it into 3.7, but am not sure if that happened.


Mark

On 01/12/2013 06:21 AM, Yan, Zheng wrote:

On Sat, Jan 12, 2013 at 2:57 PM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:


Hi list,
 For an RBD write request, Ceph needs to do 3 writes:
2013-01-10 13:10:15.539967 7f52f516c700 10 filestore(/data/osd.21) 
_do_transaction on 0x327d790
2013-01-10 13:10:15.539979 7f52f516c700 15 filestore(/data/osd.21) write 
meta/516b801c/pglog_2.1a/0//-1 36015~147
2013-01-10 13:10:15.540016 7f52f516c700 15 filestore(/data/osd.21) path: 
/data/osd.21/current/meta/DIR_C/pglog\u2.1a__0_516B801C__none
2013-01-10 13:10:15.540164 7f52f516c700 15 filestore(/data/osd.21) write 
meta/28d2f4a8/pginfo_2.1a/0//-1 0~496
2013-01-10 13:10:15.540189 7f52f516c700 15 filestore(/data/osd.21) path: 
/data/osd.21/current/meta/DIR_8/pginfo\u2.1a__0_28D2F4A8__none
2013-01-10 13:10:15.540217 7f52f516c700 10 filestore(/data/osd.21) 
_do_transaction on 0x327d708
2013-01-10 13:10:15.540222 7f52f516c700 15 filestore(/data/osd.21) write 
2.1a_head/8abf341a/rb.0.106e.6b8b4567.02d3/head//2 3227648~524288
2013-01-10 13:10:15.540245 7f52f516c700 15 filestore(/data/osd.21) path: 
/data/osd.21/current/2.1a_head/rb.0.106e.6b8b4567.02d3__head_8ABF341A__2
 If using XFS as the backend filesystem, running on top of a traditional SATA 
disk, this introduces a lot of seeks and therefore reduces bandwidth; a 
blktrace is available here 
(http://ww3.sinaimg.cn/mw690/6e1aee47jw1e0qsbxbvddj.jpg) to demonstrate the 
issue (a single client running dd on top of a new RBD volume).
 Then I tried moving /osd.X/current/meta to a separate disk, and the 
bandwidth increased significantly (see the blktrace at 
http://ww4.sinaimg.cn/mw690/6e1aee47jw1e0qsadz1bij.jpg).
 I haven't tested other access patterns yet, but it looks to me that moving 
this meta to a separate disk (SSD, or SATA with btrfs) will benefit Ceph write 
performance. Is that true? Will Ceph introduce this feature in the future? Is 
there any potential problem with such a hack?



Did you try putting the XFS metadata log on a separate, fast device
(mkfs.xfs -l logdev=/dev/sdbx,size=1b)? I think it will boost
performance too.

Regards
Yan, Zheng


RE: Separate metadata disk for OSD

2013-01-12 Thread Chen, Xiaoxi
Hi Zheng,
I have put the XFS log on a separate disk; it does provide some 
performance gain, but not a significant one.
Ceph's metadata is somewhat separate (it is a set of files residing on the 
OSD's disk), so it cannot be helped by either the XFS log or the OSD's 
journal. That's why I am trying to put Ceph's metadata (the /data/osd.x/meta 
folder) on a separate SSD.
To Nelson,
I did the experiment with just one client; with more clients the gain will not 
be as large.
It looks to me that a single write from the client side becoming three writes 
to disk is a big overhead for an in-place-update filesystem such as XFS, since 
it introduces more seeks. An out-of-place-update filesystem does not suffer as 
much from this pattern; I did not see this problem when using BTRFS as the 
backend filesystem. But for BTRFS, fragmentation is another performance 
killer: for a single RBD volume, after a lot of random writes the sequential 
read performance drops to 30% of that of a fresh volume. This makes BTRFS 
unusable in production.
Separating the Ceph meta seems quite easy to me (I just mount a partition at 
/data/osd.X/meta). Is that right? Is there any potential problem with it? 
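
As a minimal sketch of how to check what that mount actually gives you: if 
meta/ ends up on a different device than current/, it is a separate 
filesystem, which is exactly the case Sage flags (in his reply quoted earlier 
in this digest) as breaking the single-syncfs assumption.  The paths follow 
this thread's examples:

    #include <sys/stat.h>
    #include <cstdio>

    int main() {
      struct stat cur, meta;
      if (stat("/data/osd.21/current", &cur) != 0 ||
          stat("/data/osd.21/current/meta", &meta) != 0) {
        perror("stat");
        return 1;
      }
      if (cur.st_dev == meta.st_dev)
        std::printf("meta/ shares the OSD data filesystem; one syncfs covers both\n");
      else
        std::printf("meta/ is a separate filesystem; a single syncfs on the data "
                    "fs will not flush it\n");
      return 0;
    }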




Xiaoxi
-Original Message-
From: Mark Nelson [mailto:mark.nel...@inktank.com] 
Sent: January 12, 2013, 21:36
To: Yan, Zheng 
Cc: Chen, Xiaoxi; ceph-devel@vger.kernel.org
Subject: Re: Separate metadata disk for OSD

Hi Xiaoxi and Zheng,

We've played with both of these some internally, but not for a production 
deployment. Mostly just for diagnosing performance problems. 
  It's been a while since I last played with this, but I hadn't seen a whole 
lot of performance improvements at the time.  That may have been due to the 
hardware in use, or perhaps other parts of Ceph have improved to the point 
where this matters now!

On a side note, Btrfs also had a Google Summer of Code project to let you put 
metadata on an external device.  Originally I think that was supposed to make 
it into 3.7, but am not sure if that happened.

Mark

On 01/12/2013 06:21 AM, Yan, Zheng wrote:
 On Sat, Jan 12, 2013 at 2:57 PM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:

 Hi list,
  For an RBD write request, Ceph needs to do 3 writes:
 2013-01-10 13:10:15.539967 7f52f516c700 10 filestore(/data/osd.21) 
_do_transaction on 0x327d790
 2013-01-10 13:10:15.539979 7f52f516c700 15 filestore(/data/osd.21) 
write meta/516b801c/pglog_2.1a/0//-1 36015~147
 2013-01-10 13:10:15.540016 7f52f516c700 15 filestore(/data/osd.21) 
path: /data/osd.21/current/meta/DIR_C/pglog\u2.1a__0_516B801C__none
 2013-01-10 13:10:15.540164 7f52f516c700 15 filestore(/data/osd.21) 
write meta/28d2f4a8/pginfo_2.1a/0//-1 0~496
 2013-01-10 13:10:15.540189 7f52f516c700 15 filestore(/data/osd.21) 
path: /data/osd.21/current/meta/DIR_8/pginfo\u2.1a__0_28D2F4A8__none
 2013-01-10 13:10:15.540217 7f52f516c700 10 filestore(/data/osd.21) 
_do_transaction on 0x327d708
 2013-01-10 13:10:15.540222 7f52f516c700 15 filestore(/data/osd.21) 
write 2.1a_head/8abf341a/rb.0.106e.6b8b4567.02d3/head//2 
3227648~524288
 2013-01-10 13:10:15.540245 7f52f516c700 15 filestore(/data/osd.21) 
path: /data/osd.21/current/2.1a_head/rb.0.106e.6b8b4567.02d3__head_8ABF341A__2
  If using XFS as the backend filesystem, running on top of a traditional 
 SATA disk, this introduces a lot of seeks and therefore reduces bandwidth; a 
 blktrace is available here 
 (http://ww3.sinaimg.cn/mw690/6e1aee47jw1e0qsbxbvddj.jpg) to demonstrate the 
 issue (a single client running dd on top of a new RBD volume).
  Then I tried moving /osd.X/current/meta to a separate disk, and the 
 bandwidth increased significantly (see the blktrace at 
 http://ww4.sinaimg.cn/mw690/6e1aee47jw1e0qsadz1bij.jpg).
  I haven't tested other access patterns yet, but it looks to me that moving 
 this meta to a separate disk (SSD, or SATA with btrfs) will benefit Ceph 
 write performance. Is that true? Will Ceph introduce this feature in the 
 future? Is there any potential problem with such a hack?


 Did you try putting the XFS metadata log on a separate, fast device 
 (mkfs.xfs -l logdev=/dev/sdbx,size=1b)? I think it will boost 
 performance too.

 Regards
 Yan, Zheng