Re: Impact of page cache on OSD read performance for SSD

2014-09-24 Thread Sage Weil
On Wed, 24 Sep 2014, Haomai Wang wrote:
 I agree that direct reads will help for disk reads. But if the read data 
 is hot and small enough to fit in memory, the page cache is a good place to 
 hold cached data. If we discard the page cache, we need to implement a cache 
 with an efficient lookup implementation.

This is true for some workloads, but not necessarily true for all.  Many
clients (notably RBD) will be caching at the client side (in VM's fs, and 
possibly in librbd itself) such that caching at the OSD is largely wasted 
effort.  For RGW the opposite is likely true, unless there is a varnish cache 
or something in front.

We should probably have a direct_io config option for filestore.  But even 
better would be some hint from the client about whether it is caching or 
not so that FileStore could conditionally cache...
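
For what it's worth, here is a minimal sketch of what a conditionally direct read path could look like (this is not the actual FileStore code, and the direct_io_read flag is a hypothetical config option); the main practical cost of O_DIRECT is the alignment it demands:

#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdlib>
#include <string>

// Sketch only, not Ceph's FileStore: read `len` bytes at `off` from an
// object file, bypassing the page cache when `direct_io_read` is set.
// For simplicity this assumes `off` and `len` are already 4 KB aligned,
// which O_DIRECT requires on most filesystems.
ssize_t store_read(const std::string &path, off_t off, size_t len,
                   char **out, bool direct_io_read) {
  int flags = O_RDONLY;
  if (direct_io_read)
    flags |= O_DIRECT;                 // skip the page cache entirely
  int fd = ::open(path.c_str(), flags);
  if (fd < 0)
    return -errno;

  void *buf = nullptr;
  if (posix_memalign(&buf, 4096, len) != 0) {  // O_DIRECT needs an aligned buffer
    ::close(fd);
    return -ENOMEM;
  }
  ssize_t r = ::pread(fd, buf, len, off);
  int err = (r < 0) ? errno : 0;
  ::close(fd);
  if (r < 0) {
    free(buf);
    return -err;
  }
  *out = static_cast<char *>(buf);     // caller frees
  return r;
}

With the flag off, the same pread goes through the page cache as it does today; a per-client caching hint could flip the flag per read.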

sage

  
 BTW, on whether to use direct I/O, we can refer to MySQL's InnoDB engine
 (which uses direct I/O) and PostgreSQL (which relies on the page cache).
 
 On Wed, Sep 24, 2014 at 10:29 AM, Somnath Roy somnath@sandisk.com wrote:
  Haomai,
  I am considering only random reads, and the changes I made affect only 
  reads. For writes, I have not measured yet. But, yes, the page cache 
  may be helpful for write coalescing. Still need to evaluate how it 
  behaves compared to direct_io on SSD though. I think the Ceph code path will be 
  much shorter if we use direct_io in the write path where it is actually 
  executing the transactions. Probably, the sync thread and all will not be 
  needed.
 
  I am trying to analyze where the extra reads are coming from in the case of 
  buffered I/O by using blktrace etc. This should give us a clear 
  understanding of what exactly is going on there, and it may turn out that by 
  tuning kernel parameters alone we can achieve performance similar to 
  direct_io.
 
  Thanks & Regards
  Somnath
 
  -Original Message-
  From: Haomai Wang [mailto:haomaiw...@gmail.com]
  Sent: Tuesday, September 23, 2014 7:07 PM
  To: Sage Weil
  Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
  Subject: Re: Impact of page cache on OSD read performance for SSD
 
  Good point, but have you considered the impact on write ops? 
  And if we skip the page cache, is FileStore then responsible for data caching?
 
  On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil sw...@redhat.com wrote:
  On Tue, 23 Sep 2014, Somnath Roy wrote:
  Milosz,
  Thanks for the response. I will see if I can get any information out of 
  perf.
 
  Here is my OS information.
 
  root@emsclient:~# lsb_release -a
  No LSB modules are available.
  Distributor ID: Ubuntu
  Description:Ubuntu 13.10
  Release:13.10
  Codename:   saucy
  root@emsclient:~# uname -a
  Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46
  UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
 
  BTW, it's not a 45% drop; as you can see, by tuning the OSD parameters I 
  was able to get almost a *2X* performance improvement with direct_io.
  It's not only the page cache (memory) lookup; in the buffered_io case the 
  following could be problems.
 
  1. Double copy (disk -> file buffer cache, file buffer cache -> user
  buffer)
 
  2. As the iostat output shows, it is not reading 4K only; it is
  reading more data from disk than required, and in the end it will be
  wasted in the case of a random workload..
 
  It might be worth using blktrace to see what the IOs it is issuing are.
  Which ones are > 4K and what they point to...
 
  sage
 
 
 
  Thanks & Regards
  Somnath
 
  -Original Message-
  From: Milosz Tanski [mailto:mil...@adfin.com]
  Sent: Tuesday, September 23, 2014 12:09 PM
  To: Somnath Roy
  Cc: ceph-devel@vger.kernel.org
  Subject: Re: Impact of page cache on OSD read performance for SSD
 
  Somnath,
 
  I wonder if there's a bottleneck or a point of contention in the kernel. 
  For an entirely uncached workload I expect the page cache lookup to cause 
  a slowdown (since the lookup should be wasted). What I wouldn't expect 
  is a 45% performance drop. Memory speed should be an order of magnitude faster 
  than a modern SATA SSD drive (so the overhead should be mostly negligible).
 
  Is there any way you could perform the same test but monitor what's going 
  on with the OSD process using the perf tool? Whatever the default CPU 
  time spent hardware counter is, that's fine. Make sure you have the kernel debug 
  info package installed so we can get symbol information for kernel and 
  module calls. With any luck the diff of the perf output from the two runs will show 
  us the culprit.
 
  Also, can you tell us what OS/kernel version you're using on the OSD 
  machines?
 
  - Milosz
 
  On Tue, Sep 23, 2014 at 2:05 PM, Somnath Roy somnath@sandisk.com 
  wrote:
   Hi Sage,
   I have created the following setup in order to examine how a single OSD 
   behaves if, say, ~80-90% of IOs are hitting the SSDs.
  
   My test includes the following steps.
  
   1. Created a single OSD cluster.
   2. Created two rbd images (110GB each) on 2 different pools.
   

Re: [ceph-users] Can ceph-deploy be used with 'osd objectstore = keyvaluestore-dev' in config file ?

2014-09-24 Thread Sage Weil
On Wed, 24 Sep 2014, Mark Kirkwood wrote:
 On 24/09/14 14:29, Aegeaner wrote:
  I run ceph on Red Hat Enterprise Linux Server 6.4 Santiago, and when I
  run service ceph start I got:
 
  # service ceph start
 
  ERROR:ceph-disk:Failed to activate
  ceph-disk: Does not look like a Ceph OSD, or incompatible version:
  /var/lib/ceph/tmp/mnt.I71N5T
  mount: /dev/hioa1 already mounted or /var/lib/ceph/tmp/mnt.02sVHj busy
  ceph-disk: Mounting filesystem failed: Command '['/bin/mount', '-t',
  'xfs', '-o', 'noatime', '--',
 
 '/dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.6d726c93-41f9-453d-858e-ab4132b5c8fd',
  '/var/lib/ceph/tmp/mnt.02sVHj']' returned non-zero exit status 32
  ceph-disk: Error: One or more partitions failed to activate
 
  Someone told me service ceph start still tries to call ceph-disk which
  will create a filestore type OSD, and create a journal partition, is it
  true?
 
  ls -l /dev/disk/by-parttypeuuid/
 
  lrwxrwxrwx. 1 root root 11 Sep 23 16:56 45b0969e-9b03-4f30-b4c6-b4b80ceff106.00dbee5e-fb68-47c4-aa58-924c904c4383 -> ../../hioa2
  lrwxrwxrwx. 1 root root 10 Sep 23 17:02 45b0969e-9b03-4f30-b4c6-b4b80ceff106.c30e5b97-b914-4eb8-8306-a9649e1c20ba -> ../../sdb2
  lrwxrwxrwx. 1 root root 11 Sep 23 16:56 4fbd7e29-9d25-41b8-afd0-062c0ceff05d.6d726c93-41f9-453d-858e-ab4132b5c8fd -> ../../hioa1
  lrwxrwxrwx. 1 root root 10 Sep 23 17:02 4fbd7e29-9d25-41b8-afd0-062c0ceff05d.b56ec699-e134-4b90-8f55-4952453e1b7e -> ../../sdb1
  lrwxrwxrwx. 1 root root 11 Sep 23 16:52 89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be.6d726c93-41f9-453d-858e-ab4132b5c8fd -> ../../hioa1
 
  There seem to be two hioa1 partitions there, maybe remaining from the last
  time I created the OSD using ceph-deploy osd prepare?
 
 
 Crap - it is fighting you, yes - looks like the startup script has tried 
 to build an osd for you using ceph-disk (which will make two partitions 
 by default). So that's toasted the setup that your script did.
 
 Growl - that's made it more complicated for sure.

Hrm, yeah.  I think ceph-disk needs to have an option (via ceph.conf) that 
will avoid creating a journal [partition], and we need to make sure that 
the journal behavior is all conditional on the journal symlink being 
present.  Do you mind opening a bug for this?  It could condition itself 
off of the osd objectstore option (we'd need to teach ceph-disk about the 
various backends), or we could add a secondary option (awkward to 
configure), or we could call into ceph-osd with something like 'ceph-osd 
-i 0 --does-backend-need-journal' so that a call into the backend 
code itself can control things.  The latter is probably ideal.
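
A minimal sketch of that last idea (hypothetical; neither the flag nor this exact interface exists as written here): each backend answers whether it needs a journal, and the proposed ceph-osd flag just prints that answer so ceph-disk can decide whether to create the partition.

#include <iostream>
#include <memory>
#include <string>

// Sketch only: a hypothetical ObjectStore query that a
// "--does-backend-need-journal" flag in ceph-osd could expose.
struct ObjectStore {
  virtual ~ObjectStore() = default;
  virtual bool needs_journal() const = 0;   // does this backend want a journal partition?
};

struct FileStore : ObjectStore {
  bool needs_journal() const override { return true; }   // uses a write-ahead journal
};

struct KeyValueStore : ObjectStore {
  bool needs_journal() const override { return false; }  // journals internally, no partition needed
};

std::unique_ptr<ObjectStore> make_store(const std::string &objectstore) {
  if (objectstore == "keyvaluestore-dev")
    return std::make_unique<KeyValueStore>();
  return std::make_unique<FileStore>();
}

// "ceph-osd --does-backend-need-journal" would reduce to roughly this,
// with the objectstore name coming from ceph.conf instead of argv.
int main(int argc, char **argv) {
  std::string objectstore = argc > 1 ? argv[1] : "filestore";
  std::cout << (make_store(objectstore)->needs_journal() ? "yes" : "no") << std::endl;
  return 0;
}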

Opened http://tracker.ceph.com/issues/9580 and copying ceph-devel

sage


Re: Impact of page cache on OSD read performance for SSD

2014-09-24 Thread Mark Nelson

On 09/24/2014 07:38 AM, Sage Weil wrote:

On Wed, 24 Sep 2014, Haomai Wang wrote:

I agree with that direct read will help for disk read. But if read data
is hot and small enough to fit in memory, page cache is a good place to
hold data cache. If discard page cache, we need to implement a cache to
provide with effective lookup impl.


This is true for some workloads, but not necessarily true for all.  Many
clients (notably RBD) will be caching at the client side (in VM's fs, and
possibly in librbd itself) such that caching at the OSD is largely wasted
effort.  For RGW the opposite is likely true, unless there is a varnish cache
or something in front.

We should probably have a direct_io config option for filestore.  But even
better would be some hint from the client about whether it is caching or
not so that FileStore could conditionally cache...


I like the hinting idea.  Having said that, if the effect being seen is 
due to the page cache, it seems like something is off.  We've seen 
performance issues in the kernel before, so it's not unprecedented. 
Working around it with direct IO could be the right way to go, but it 
might be that this is something that could be fixed higher up and 
improve performance in other scenarios too.  I'd hate to let it go by 
the wayside if we could find something actionable.




sage

  

BTW, whether to use direct io we can refer to MySQL Innodb engine with
direct io and PostgreSQL with page cache.

On Wed, Sep 24, 2014 at 10:29 AM, Somnath Roy somnath@sandisk.com wrote:

Haomai,
I am considering only about random reads and the changes I made only affecting 
reads. For write, I have not measured yet. But, yes, page cache may be helpful 
for write coalescing. Still need to evaluate how it is behaving comparing 
direct_io on SSD though. I think Ceph code path will be much shorter if we use 
direct_io in the write path where it is actually executing the transactions. 
Probably, the sync thread and all will not be needed.

I am trying to analyze where is the extra reads coming from in case of buffered 
io by using blktrace etc. This should give us a clear understanding what 
exactly is going on there and it may turn out that tuning kernel parameters 
only  we can achieve similar performance as direct_io.

Thanks  Regards
Somnath

-Original Message-
From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Tuesday, September 23, 2014 7:07 PM
To: Sage Weil
Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD

Good point, but do you have considered that the impaction for write ops? And if 
skip page cache, FileStore is responsible for data cache?

On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil sw...@redhat.com wrote:

On Tue, 23 Sep 2014, Somnath Roy wrote:

Milosz,
Thanks for the response. I will see if I can get any information out of perf.

Here is my OS information.

root@emsclient:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 13.10
Release:13.10
Codename:   saucy
root@emsclient:~# uname -a
Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46
UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

BTW, it's not a 45% drop, as you can see, by tuning the OSD parameter I was 
able to get almost *2X* performance improvement with direct_io.
It's not only page cache (memory) lookup, in case of buffered_io  the following 
could be problem.

1. Double copy (disk -> file buffer cache, file buffer cache -> user
buffer)

2. As the iostat output shows, it is not reading 4K only, it is
reading more data from disk than required and in the end it will be
wasted in case of random workload..


It might be worth using blktrace to see what the IOs it is issuing are.
Which ones are > 4K and what they point to...

sage




Thanks  Regards
Somnath

-Original Message-
From: Milosz Tanski [mailto:mil...@adfin.com]
Sent: Tuesday, September 23, 2014 12:09 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD

Somnath,

I wonder if there's a bottleneck or a point of contention for the kernel. For a 
entirely uncached workload I expect the page cache lookup to cause a slow down 
(since the lookup should be wasted). What I wouldn't expect is a 45% 
performance drop. Memory speed should be one magnitude faster then a modern 
SATA SSD drive (so it should be more negligible overhead).

Is there anyway you could perform the same test but monitor what's going on 
with the OSD process using the perf tool? Whatever is the default cpu time 
spent hardware counter is fine. Make sure you have the kernel debug info 
package installed so can get symbol information for kernel and module calls. 
With any luck the diff in perf output in two runs will show us the culprit.

Also, can you tell us what OS/kernel version you're using on the OSD machines?

- Milosz

On Tue, Sep 23, 2014 at 2:05 PM, Somnath Roy 

Re: Fwd: S3 API Compatibility support

2014-09-24 Thread M Ranga Swami Reddy
There is the main #4099 issue for object expiration, but there is no real
detail there.  The plan is (as always) to have equivalent functionality to S3.

Do you mind creating a new feature ticket that specifically references the
ability to move objects to a second storage tier based on policy?  Any
references to AWS docs about the API or functionality would be helpful in
the ticket.


Sure, I will create a new feature ticket and add the needful information  
there.

Created a new ticket: http://tracker.ceph.com/issues/9581


Thanks
Swami

On Fri, Sep 19, 2014 at 9:23 PM, M Ranga Swami Reddy
swamire...@gmail.com wrote:
What do you mean by RRS storage-low cost storage?  My read of the RRS
numbers is that they simply have a different tier of S3 that runs fewer
replicas and (probably) cheaper disks.  In radosgw-land, this would just
be a different rados pool with 2x replicas and (probably) a CRUSH rule
mapping it to different hardware (with bigger and/or cheaper disks).

 That's correct. If we could do this with a different rados pool using
 2x replicas along with a CRUSH rule
 mapping it to different h/w (with bigger and cheaper disks), then it's the
 same as RRS support in AWS.


 What isn't currently supported is the ability to reduce the redundancy of
 individual objects in a bucket.  I don't think there is anything
 architecturally preventing that, but it is not implemented or supported.

 OK. Do we have the issue id for the above? Else, we can file one. Please 
 advise.

There is the main #4099 issue for object expiration, but there is no real
detail there.  The plan is (as always) to have equivalent functionality to S3.

Do you mind creating a new feature ticket that specifically references the
ability to move objects to a second storage tier based on policy?  Any
references to AWS docs about the API or functionality would be helpful in
the ticket.


 Sure, I will create a new feature ticket and add the needful information  
 there.

 Thanks
 Swami

 On Fri, Sep 19, 2014 at 9:08 PM, Sage Weil sw...@redhat.com wrote:
 On Fri, 19 Sep 2014, M Ranga Swami Reddy wrote:
 Hi Sage,
 Thanks for quick reply.

 what you mean.
 For RRS, though, I assume you mean the ability to create buckets with
 reduced redundancy with radosgw?  That is supported, although not quite
 the way AWS does it.  You can create different pools that back RGW
 buckets, and each bucket is stored in one of those pools.  So you could
 make one of them 2x instead of 3x, or use an erasure code of your choice.

 Yes, we can configure ceph to use 2x replicas, which will look like
 reduced redundancy, but AWS uses a separate low-cost RRS storage tier
 (instead of
 standard storage) for this purpose. I am checking if we could do
 the same in ceph too.

 What do you mean by RRS storage-low cost storage?  My read of the RRS
 numbers is that they simply have a different tier of S3 that runs fewer
 replicas and (probably) cheaper disks.  In radosgw-land, this would just
 be a different rados pool with 2x replicas and (probably) a CRUSH rule
 mapping it to different hardware (with bigger and/or cheaper disks).

 What isn't currently supported is the ability to reduce the redundancy of
 individual objects in a bucket.  I don't think there is anything
 architecturally preventing that, but it is not implemented or supported.

 OK. Do we have the issue id for the above? Else, we can file one. Please 
 advise.

 There is the main #4099 issue for object expiration, but there is no real
 detail there.  The plan is (as always) to have equivalent functionality to
 S3.

 Do you mind creating a new feature ticket that specifically references the
 ability to move objects to a second storage tier based on policy?  Any
 references to AWS docs about the API or functionality would be helpful in
 the ticket.

 When we look at the S3 archival features in more detail (soon!) I'm sure
 this will come up!  The current plan is to address object versioning
 first.  That is, unless a developer surfaces who wants to start hacking on
 this right away...

 Great to know this. Even we are keen with S3 support in Ceph and we
 are happy support you here.

 Great to hear!

 Thanks-
 sage



 Thanks
 Swami

 On Fri, Sep 19, 2014 at 11:08 AM, Sage Weil sw...@redhat.com wrote:
  On Fri, 19 Sep 2014, M Ranga Swami Reddy wrote:
  Hi Sage,
  Could you please advise if Ceph supports low-cost object
  storage (like Amazon Glacier or RRS) for archiving objects like log
  files etc.?
 
  Ceph doesn't interact at all with AWS services like Glacier, if that's
  what you mean.
 
  For RRS, though, I assume you mean the ability to create buckets with
  reduced redundancy with radosgw?  That is supported, although not quite
  the way AWS does it.  You can create different pools that back RGW
  buckets, and each bucket is stored in one of those pools.  So you could
  make one of them 2x instead of 3x, or use an erasure code of your choice.
 
  What isn't currently supported is the ability to reduce the redundancy 

Re: Impact of page cache on OSD read performance for SSD

2014-09-24 Thread Milosz Tanski
On Wed, Sep 24, 2014 at 9:27 AM, Mark Nelson mark.nel...@inktank.com wrote:
 On 09/24/2014 07:38 AM, Sage Weil wrote:

 On Wed, 24 Sep 2014, Haomai Wang wrote:

 I agree with that direct read will help for disk read. But if read data
 is hot and small enough to fit in memory, page cache is a good place to
 hold data cache. If discard page cache, we need to implement a cache to
 provide with effective lookup impl.


 This is true for some workloads, but not necessarily true for all.  Many
 clients (notably RBD) will be caching at the client side (in VM's fs, and
 possibly in librbd itself) such that caching at the OSD is largely wasted
 effort.  For RGW the opposite is likely true, unless there is a varnish cache
 or something in front.

 We should probably have a direct_io config option for filestore.  But even
 better would be some hint from the client about whether it is caching or
 not so that FileStore could conditionally cache...


 I like the hinting idea.  Having said that, if the effect being seen is due
 to page cache, it seems like something is off.  We've seen performance
 issues in the kernel before so it's not unprecedented. Working around it
 with direct IO could be the right way to go, but it might be that this is
 something that could be fixed higher up and improve performance in other
 scenarios too.  I'd hate to let it go by the wayside if we could find
 something actionable.



 sage

   

 BTW, whether to use direct io we can refer to MySQL Innodb engine with
 direct io and PostgreSQL with page cache.

 On Wed, Sep 24, 2014 at 10:29 AM, Somnath Roy somnath@sandisk.com
 wrote:

 Haomai,
 I am considering only about random reads and the changes I made only
 affecting reads. For write, I have not measured yet. But, yes, page cache
 may be helpful for write coalescing. Still need to evaluate how it is
 behaving comparing direct_io on SSD though. I think Ceph code path will be
 much shorter if we use direct_io in the write path where it is actually
 executing the transactions. Probably, the sync thread and all will not be
 needed.

 I am trying to analyze where is the extra reads coming from in case of
 buffered io by using blktrace etc. This should give us a clear 
 understanding
 what exactly is going on there and it may turn out that tuning kernel
 parameters only  we can achieve similar performance as direct_io.

 Thanks  Regards
 Somnath

 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Tuesday, September 23, 2014 7:07 PM
 To: Sage Weil
 Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
 Subject: Re: Impact of page cache on OSD read performance for SSD

 Good point, but do you have considered that the impaction for write ops?
 And if skip page cache, FileStore is responsible for data cache?

 On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil sw...@redhat.com wrote:

 On Tue, 23 Sep 2014, Somnath Roy wrote:

 Milosz,
 Thanks for the response. I will see if I can get any information out
 of perf.

 Here is my OS information.

 root@emsclient:~# lsb_release -a
 No LSB modules are available.
 Distributor ID: Ubuntu
 Description:Ubuntu 13.10
 Release:13.10
 Codename:   saucy
 root@emsclient:~# uname -a
 Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46
 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

 BTW, it's not a 45% drop, as you can see, by tuning the OSD parameter
 I was able to get almost *2X* performance improvement with direct_io.
 It's not only page cache (memory) lookup, in case of buffered_io  the
 following could be problem.

 1. Double copy (disk -> file buffer cache, file buffer cache -> user
 buffer)

 2. As the iostat output shows, it is not reading 4K only, it is
 reading more data from disk than required and in the end it will be
 wasted in case of random workload..


 It might be worth using blktrace to see what the IOs it is issuing
 are.
 Which ones are > 4K and what they point to...

 sage



 Thanks  Regards
 Somnath

 -Original Message-
 From: Milosz Tanski [mailto:mil...@adfin.com]
 Sent: Tuesday, September 23, 2014 12:09 PM
 To: Somnath Roy
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: Impact of page cache on OSD read performance for SSD

 Somnath,

 I wonder if there's a bottleneck or a point of contention for the
 kernel. For a entirely uncached workload I expect the page cache lookup 
 to
 cause a slow down (since the lookup should be wasted). What I wouldn't
 expect is a 45% performance drop. Memory speed should be one magnitude
 faster then a modern SATA SSD drive (so it should be more negligible
 overhead).

 Is there anyway you could perform the same test but monitor what's
 going on with the OSD process using the perf tool? Whatever is the 
 default
 cpu time spent hardware counter is fine. Make sure you have the kernel 
 debug
 info package installed so can get symbol information for kernel and 
 module
 calls. With any luck the diff in perf output in two runs will show us the
 culprit.


Re: Impact of page cache on OSD read performance for SSD

2014-09-24 Thread Haomai Wang
On Wed, Sep 24, 2014 at 8:38 PM, Sage Weil sw...@redhat.com wrote:
 On Wed, 24 Sep 2014, Haomai Wang wrote:
 I agree with that direct read will help for disk read. But if read data
 is hot and small enough to fit in memory, page cache is a good place to
 hold data cache. If discard page cache, we need to implement a cache to
 provide with effective lookup impl.

 This is true for some workloads, but not necessarily true for all.  Many
 clients (notably RBD) will be caching at the client side (in VM's fs, and
 possibly in librbd itself) such that caching at the OSD is largely wasted
 effort.  For RGW the opposite is likely true, unless there is a varnish cache
 or something in front.

Still, I don't think the librbd cache can meet all the cache demands
for rbd usage. Even if
we have an effective librbd cache implementation, we still need a buffer cache at the
ObjectStore level,
just like what databases do. A client cache and a host cache are both needed.
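
To make that concrete, here is a minimal sketch of the kind of ObjectStore-level buffer cache being suggested (an illustration only, not an existing Ceph class): an LRU keyed by object and offset, with O(1) average lookup.

#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Sketch only: a tiny LRU data cache of the sort an ObjectStore-level
// buffer cache would need if reads bypassed the kernel page cache.
class BufferCache {
  struct Key {
    std::string object;
    uint64_t offset;
    bool operator==(const Key &o) const {
      return object == o.object && offset == o.offset;
    }
  };
  struct KeyHash {
    size_t operator()(const Key &k) const {
      return std::hash<std::string>()(k.object) ^ std::hash<uint64_t>()(k.offset);
    }
  };
  using Entry = std::pair<Key, std::string>;   // key -> cached extent data
  size_t max_entries;
  std::list<Entry> lru;                        // front = most recently used
  std::unordered_map<Key, std::list<Entry>::iterator, KeyHash> index;

public:
  explicit BufferCache(size_t max) : max_entries(max) {}

  // O(1) average lookup; a hit is promoted to the front of the LRU list.
  bool lookup(const std::string &object, uint64_t offset, std::string *out) {
    auto it = index.find(Key{object, offset});
    if (it == index.end())
      return false;
    lru.splice(lru.begin(), lru, it->second);
    *out = it->second->second;
    return true;
  }

  void insert(const std::string &object, uint64_t offset, std::string data) {
    Key k{object, offset};
    auto it = index.find(k);
    if (it != index.end()) {           // replace an existing entry
      lru.erase(it->second);
      index.erase(it);
    }
    lru.emplace_front(k, std::move(data));
    index[k] = lru.begin();
    if (lru.size() > max_entries) {    // evict the least recently used extent
      index.erase(lru.back().first);
      lru.pop_back();
    }
  }
};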


 We should probably have a direct_io config option for filestore.  But even
 better would be some hint from the client about whether it is caching or
 not so that FileStore could conditionally cache...

Yes, I remember we already did some early work on that.


 sage

  
 BTW, whether to use direct io we can refer to MySQL Innodb engine with
 direct io and PostgreSQL with page cache.

 On Wed, Sep 24, 2014 at 10:29 AM, Somnath Roy somnath@sandisk.com 
 wrote:
  Haomai,
  I am considering only about random reads and the changes I made only 
  affecting reads. For write, I have not measured yet. But, yes, page cache 
  may be helpful for write coalescing. Still need to evaluate how it is 
  behaving comparing direct_io on SSD though. I think Ceph code path will be 
  much shorter if we use direct_io in the write path where it is actually 
  executing the transactions. Probably, the sync thread and all will not be 
  needed.
 
  I am trying to analyze where is the extra reads coming from in case of 
  buffered io by using blktrace etc. This should give us a clear 
  understanding what exactly is going on there and it may turn out that 
  tuning kernel parameters only  we can achieve similar performance as 
  direct_io.
 
  Thanks  Regards
  Somnath
 
  -Original Message-
  From: Haomai Wang [mailto:haomaiw...@gmail.com]
  Sent: Tuesday, September 23, 2014 7:07 PM
  To: Sage Weil
  Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
  Subject: Re: Impact of page cache on OSD read performance for SSD
 
  Good point, but do you have considered that the impaction for write ops? 
  And if skip page cache, FileStore is responsible for data cache?
 
  On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil sw...@redhat.com wrote:
  On Tue, 23 Sep 2014, Somnath Roy wrote:
  Milosz,
  Thanks for the response. I will see if I can get any information out of 
  perf.
 
  Here is my OS information.
 
  root@emsclient:~# lsb_release -a
  No LSB modules are available.
  Distributor ID: Ubuntu
  Description:Ubuntu 13.10
  Release:13.10
  Codename:   saucy
  root@emsclient:~# uname -a
  Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46
  UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
 
  BTW, it's not a 45% drop, as you can see, by tuning the OSD parameter I 
  was able to get almost *2X* performance improvement with direct_io.
  It's not only page cache (memory) lookup, in case of buffered_io  the 
  following could be problem.
 
  1. Double copy (disk -> file buffer cache, file buffer cache -> user
  buffer)
 
  2. As the iostat output shows, it is not reading 4K only, it is
  reading more data from disk than required and in the end it will be
  wasted in case of random workload..
 
  It might be worth using blktrace to see what the IOs it is issuing are.
  Which ones are > 4K and what they point to...
 
  sage
 
 
 
  Thanks  Regards
  Somnath
 
  -Original Message-
  From: Milosz Tanski [mailto:mil...@adfin.com]
  Sent: Tuesday, September 23, 2014 12:09 PM
  To: Somnath Roy
  Cc: ceph-devel@vger.kernel.org
  Subject: Re: Impact of page cache on OSD read performance for SSD
 
  Somnath,
 
  I wonder if there's a bottleneck or a point of contention for the 
  kernel. For a entirely uncached workload I expect the page cache lookup 
  to cause a slow down (since the lookup should be wasted). What I 
  wouldn't expect is a 45% performance drop. Memory speed should be one 
  magnitude faster then a modern SATA SSD drive (so it should be more 
  negligible overhead).
 
  Is there anyway you could perform the same test but monitor what's going 
  on with the OSD process using the perf tool? Whatever is the default cpu 
  time spent hardware counter is fine. Make sure you have the kernel debug 
  info package installed so can get symbol information for kernel and 
  module calls. With any luck the diff in perf output in two runs will 
  show us the culprit.
 
  Also, can you tell us what OS/kernel version you're using on the OSD 
  machines?
 
  - Milosz
 
  On 

BlaumRoth with w=7 : what are the consequences ?

2014-09-24 Thread Loic Dachary
Hi Kevin,

When implementing the plugin for Ceph, the check for isprime(w+1) was not added 
and w=7 was made the default for the BlaumRoth technique. As a result, all 
content encoded with the BlaumRoth technique has used this parameter. I do not 
know what the consequences are. It does not crash, and from what I've tried, 
content can be encoded/decoded/repaired fine. But maybe I was lucky? 

Your expert opinion on the consequences of choosing w=7 would be greatly 
appreciated :-)

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: [ceph-users] Status of snapshots in CephFS

2014-09-24 Thread Florian Haas
On Fri, Sep 19, 2014 at 5:25 PM, Sage Weil sw...@redhat.com wrote:
 On Fri, 19 Sep 2014, Florian Haas wrote:
 Hello everyone,

 Just thought I'd circle back on some discussions I've had with people
 earlier in the year:

 Shortly before firefly, snapshot support for CephFS clients was
 effectively disabled by default at the MDS level, and can only be
 enabled after accepting a scary warning that your filesystem is highly
 likely to break if snapshot support is enabled. Has any progress been
 made on this in the interim?

 With libcephfs support slowly maturing in Ganesha, the option of
 deploying a Ceph-backed userspace NFS server is becoming more
 attractive -- and it's probably a better use of resources than mapping
 a boatload of RBDs on an NFS head node and then exporting all the data
 from there. Recent snapshot trimming issues notwithstanding, RBD
 snapshot support is reasonably stable, but even so, making snapshot
 data available via NFS, that way, is rather ugly. In addition, the
 libcephfs/Ganesha approach would obviously include much better
 horizontal scalability.

 We haven't done any work on snapshot stability.  It is probably moderately
 stable if snapshots are only done at the root or at a consistent point in
 the hierarchy (as opposed to random directories), but there are still some
 basic problems that need to be resolved.  I would not suggest deploying
 this in production!  But some stress testing would as always be very
 welcome.  :)

OK, on a semi-related note: is there any reasonably current
authoritative list of features that are supported and unsupported in
either ceph-fuse or kernel cephfs, and if so, at what minimal version?

The most comprehensive overview that seems to be available is one from
Greg, which however is a year and a half old:

http://ceph.com/dev-notes/cephfs-mds-status-discussion/

 In addition, 
 https://github.com/nfs-ganesha/nfs-ganesha/wiki/ReleaseNotes_2.0#CEPH
 states:

 The current requirement to build and use the Ceph FSAL is a Ceph
 build environment which includes Ceph client enhancements staged on
 the libwipcephfs development branch. These changes are expected to be
 part of the Ceph Firefly release.

 ... though it's not clear whether they ever did make it into firefly.
 Could someone in the know comment on that?

 I think this is referring to the libcephfs API changes that the cohortfs
 folks did.  That all merged shortly before firefly.

Great, thanks for the clarification.

 By the way, we have some basic samba integration tests in our regular
 regression tests, but nothing based on ganesha.  If you really want this
 to work, the most valuable thing you could do would be to help
 get the tests written and integrated into ceph-qa-suite.git.  Probably the
 biggest piece of work there is creating a task/ganesha.py that installs
 and configures ganesha with the ceph FSAL.

Hmmm, given the excellent writeup that Niels de Vos of Gluster fame
wrote about this topic, I might actually be able to cargo-cult some of
what's in the Samba task and adapt it for ganesha.

Sorry while I'm being ignorant about Teuthology: what platform does it
normally run on? I ask because I understand most of your testing is done
on Ubuntu, and Ubuntu currently doesn't ship a Ganesha package, which
would make the install task a bit more complex.

Cheers,
Florian







Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-24 Thread Florian Haas
On Wed, Sep 24, 2014 at 1:05 AM, Sage Weil sw...@redhat.com wrote:
 Sam and I discussed this on IRC and have we think two simpler patches that
 solve the problem more directly.  See wip-9487.

So I understand this makes Dan's patch (and the config parameter that
it introduces) unnecessary, but is it correct to assume that, just like
Dan's patch, yours too will not be effective unless osd snap trim sleep
> 0?

 Queued for testing now.
 Once that passes we can backport and test for firefly and dumpling too.

 Note that this won't make the next dumpling or firefly point releases
 (which are imminent).  Should be in the next ones, though.

OK, just in case anyone else runs into problems after removing tons of
snapshots with <= 0.67.11, what's the plan to get them going again
until 0.67.12 comes out? Install the autobuild package from the wip
branch?

Cheers,
Florian


Re: BlaumRoth with w=7 : what are the consequences ?

2014-09-24 Thread Loic Dachary
Hi Kevin,

On 24/09/2014 20:40, Kevin Greenan wrote:
 The constraint guarantees the MDS property.  I believe there are conditions 
 where w+1 is composite and you still have an MDS code, but there are 
 restrictions on 'n' (codeword length).  So, you may have chosen the right 
 parameters.  Did you verify that all possible combinations of erasures are 
 tolerated?

I tried all combinations that are likely to have been used and they all work 
out. Here is the script I used:

for w in 7 11 13 17 19 ; do
  for k in $(seq 2 $w) ; do
    for m in $(seq 1 $k) ; do
      for erasures in $(seq 1 $m) ; do
        ./ceph_erasure_code_benchmark --plugin jerasure --workload decoded \
          --iterations 1 --size 4096 --erasures $erasures \
          --parameter w=$w --parameter k=$k --parameter m=2 \
          --parameter technique=blaum_roth
      done
    done
  done
done

Does that mean we're safe despite the fact that w+1 is not prime in all 
settings?

 For sake of safety, you probably do not want to get experimental with 
 'production' code.  IMHO, you should put the check in, especially if 'n' is 
 tunable.

n is tunable and the check is now enforced. I'm worried about the content that 
was previously encoded with a codeword length that does not satisfy the 
isprime(codeword length + 1) constraint.
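
For reference, the check itself is cheap; here is a minimal sketch (not the plugin's actual code) of enforcing the Blaum-Roth constraint that w+1 be prime:

#include <cstdio>
#include <initializer_list>

// Sketch only: validate the Blaum-Roth requirement that w+1 is prime,
// so w=7 is rejected (8 is composite) while w=6, 10, 12, ... pass.
static bool is_prime(int p) {
  if (p < 2) return false;
  for (int d = 2; d * d <= p; ++d)
    if (p % d == 0) return false;
  return true;
}

static int check_blaum_roth_w(int w) {
  if (!is_prime(w + 1)) {
    fprintf(stderr, "blaum_roth: w=%d is invalid, w+1=%d must be prime\n", w, w + 1);
    return -22;  // -EINVAL
  }
  return 0;
}

int main() {
  for (int w : {6, 7, 10, 11})
    printf("w=%d -> %s\n", w, check_blaum_roth_w(w) == 0 ? "ok" : "rejected");
  return 0;
}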

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre





RE: Impact of page cache on OSD read performance for SSD

2014-09-24 Thread Somnath Roy
Hi,
After going through the blktrace, I think I have figured out what is going on 
there. I think kernel read_ahead is causing the extra reads in the case of 
buffered reads. If I set read_ahead = 0, the performance I am getting is similar 
to direct_io (or better when a cache hit actually happens) :-)
IMHO, if a user doesn't want these nasty kernel effects and is sure of a random 
workload pattern, we should provide a configurable direct_io read option 
(need to quantify direct_io writes also) as Sage suggested.
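
For what it's worth, a per-fd alternative to zeroing read_ahead_kb for the whole device is to hint the kernel with posix_fadvise; a minimal sketch (not Ceph code), where POSIX_FADV_RANDOM switches readahead off for that descriptor only:

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

// Sketch only: open a file for random reads and tell the kernel not to
// read ahead on this descriptor, so a 4K read stays a 4K read at the device.
int open_for_random_reads(const char *path) {
  int fd = ::open(path, O_RDONLY);
  if (fd < 0)
    return -1;
  int ret = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
  if (ret != 0)
    fprintf(stderr, "posix_fadvise failed: %d\n", ret);
  return fd;
}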

Thanks & Regards
Somnath


-Original Message-
From: Haomai Wang [mailto:haomaiw...@gmail.com] 
Sent: Wednesday, September 24, 2014 9:06 AM
To: Sage Weil
Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD

On Wed, Sep 24, 2014 at 8:38 PM, Sage Weil sw...@redhat.com wrote:
 On Wed, 24 Sep 2014, Haomai Wang wrote:
 I agree with that direct read will help for disk read. But if read 
 data is hot and small enough to fit in memory, page cache is a good 
 place to hold data cache. If discard page cache, we need to implement 
 a cache to provide with effective lookup impl.

 This is true for some workloads, but not necessarily true for all.  
 Many clients (notably RBD) will be caching at the client side (in VM's 
 fs, and possibly in librbd itself) such that caching at the OSD is 
 largely wasted effort.  For RGW the opposite is likely true, unless there 
 is a varnish cache or something in front.

Still now, I don't think librbd cache can meet all the cache demand for rbd 
usage. Even though we have a effective librbd cache impl, we still need a 
buffer cache in ObjectStore level just like what database did. Client cache and 
host cache are both needed.


 We should probably have a direct_io config option for filestore.  But 
 even better would be some hint from the client about whether it is 
 caching or not so that FileStore could conditionally cache...

Yes, I remember we already did some early works like it.


 sage

  
 BTW, whether to use direct io we can refer to MySQL Innodb engine 
 with direct io and PostgreSQL with page cache.

 On Wed, Sep 24, 2014 at 10:29 AM, Somnath Roy somnath@sandisk.com 
 wrote:
  Haomai,
  I am considering only about random reads and the changes I made only 
  affecting reads. For write, I have not measured yet. But, yes, page cache 
  may be helpful for write coalescing. Still need to evaluate how it is 
  behaving comparing direct_io on SSD though. I think Ceph code path will be 
  much shorter if we use direct_io in the write path where it is actually 
  executing the transactions. Probably, the sync thread and all will not be 
  needed.
 
  I am trying to analyze where is the extra reads coming from in case of 
  buffered io by using blktrace etc. This should give us a clear 
  understanding what exactly is going on there and it may turn out that 
  tuning kernel parameters only  we can achieve similar performance as 
  direct_io.
 
  Thanks  Regards
  Somnath
 
  -Original Message-
  From: Haomai Wang [mailto:haomaiw...@gmail.com]
  Sent: Tuesday, September 23, 2014 7:07 PM
  To: Sage Weil
  Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
  Subject: Re: Impact of page cache on OSD read performance for SSD
 
  Good point, but do you have considered that the impaction for write ops? 
  And if skip page cache, FileStore is responsible for data cache?
 
  On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil sw...@redhat.com wrote:
  On Tue, 23 Sep 2014, Somnath Roy wrote:
  Milosz,
  Thanks for the response. I will see if I can get any information out of 
  perf.
 
  Here is my OS information.
 
  root@emsclient:~# lsb_release -a
  No LSB modules are available.
  Distributor ID: Ubuntu
  Description:Ubuntu 13.10
  Release:13.10
  Codename:   saucy
  root@emsclient:~# uname -a
  Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 
  16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
 
  BTW, it's not a 45% drop, as you can see, by tuning the OSD parameter I 
  was able to get almost *2X* performance improvement with direct_io.
  It's not only page cache (memory) lookup, in case of buffered_io  the 
  following could be problem.
 
  1. Double copy (disk -> file buffer cache, file buffer cache -> 
  user
  buffer)
 
  2. As the iostat output shows, it is not reading 4K only, it is 
  reading more data from disk than required and in the end it will be 
  wasted in case of random workload..
 
  It might be worth using blktrace to see what the IOs it is issuing are.
  Which ones are > 4K and what they point to...
 
  sage
 
 
 
  Thanks  Regards
  Somnath
 
  -Original Message-
  From: Milosz Tanski [mailto:mil...@adfin.com]
  Sent: Tuesday, September 23, 2014 12:09 PM
  To: Somnath Roy
  Cc: ceph-devel@vger.kernel.org
  Subject: Re: Impact of page cache on OSD read performance for SSD
 
  Somnath,
 
  I wonder if there's a bottleneck or a point of contention for the 
 

Fwd: question about client's cluster aware

2014-09-24 Thread yue longguang
-- Forwarded message --
From: yue longguang yuelonggu...@gmail.com
Date: Tue, Sep 23, 2014 at 5:53 PM
Subject: question about client's cluster aware
To: ceph-devel@vger.kernel.org


hi,all

My question comes from my testing.
Let's take an example:   object1 (4MB) -- pg 0.1 -- osd 1,2,3, primary osd1

While the client is writing object1, osd1 goes down in the middle of the
write. Let's suppose 2MB has been written.
1.
   When the connection to osd1 is down, what does the client do?  Ask the
monitor for a new osdmap, or only the pg map?

2.
  Now the client gets a newer map and continues the write; the primary osd
should be osd2, and the remaining 2MB is written out.
 Now what does ceph do to integrate the two parts of the data, and to ensure
that there are enough replicas?

3.
 Where is the code?  Please point me to where the code is.

It is a very difficult question.

Thanks so much


Fwd: question about object replication theory

2014-09-24 Thread yue longguang
-- Forwarded message --
From: yue longguang yuelonggu...@gmail.com
Date: Tue, Sep 23, 2014 at 5:37 PM
Subject: question about object replication theory
To: ceph-devel@vger.kernel.org


hi, all
   Take a look at the link,
http://www.ceph.com/docs/master/architecture/#smart-daemons-enable-hyperscale
Could you explain points 2 and 3 in that picture?

1.
At points 2 and 3, before the primary writes data to the next OSD, where is the
data?  Is it in memory or on disk already?

2. Where is the code for points 2 and 3, where the primary distributes
data to the others?

thanks
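
As a purely conceptual illustration (this is not Ceph's actual code; in the tree of that era the write path lives around src/osd/ReplicatedPG.cc and the ObjectStore/journal code), the data at points 2 and 3 is held in memory on the primary while it is fanned out to the replicas, and the client is acked only once every copy is safe:

#include <cstdio>
#include <string>
#include <vector>

// Conceptual sketch of primary-copy replication, NOT Ceph's real code.
struct Replica {
  std::string name;
  bool write(const std::string &obj, const std::string &data) {
    // the network send and the remote journal/commit would happen here
    printf("replica %s committed %zu bytes of %s\n", name.c_str(), data.size(), obj.c_str());
    return true;
  }
};

struct Primary {
  std::vector<Replica> replicas;
  bool local_commit(const std::string &obj, const std::string &data) {
    printf("primary committed %zu bytes of %s\n", data.size(), obj.c_str());
    return true;
  }
  // A client write enters here; `data` lives in memory until all acks arrive.
  bool handle_client_write(const std::string &obj, const std::string &data) {
    size_t acks = 0;
    for (auto &r : replicas)           // points 2/3: fan the data out
      if (r.write(obj, data))
        ++acks;
    if (!local_commit(obj, data))      // the primary's own copy
      return false;
    return acks == replicas.size();    // ack the client only when all copies are safe
  }
};

int main() {
  Primary p{{{"osd.2"}, {"osd.3"}}};
  return p.handle_client_write("object1", std::string(4096, 'x')) ? 0 : 1;
}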


Re: Impact of page cache on OSD read performance for SSD

2014-09-24 Thread Haomai Wang
On Thu, Sep 25, 2014 at 7:49 AM, Somnath Roy somnath@sandisk.com wrote:
 Hi,
 After going through the blktrace, I think I have figured out what is going on 
 there. I think kernel read_ahead is causing the extra reads in case of 
 buffered read. If I set read_ahead = 0 , the performance I am getting similar 
 (or more when cache hit actually happens) to direct_io :-)

Hmm, BTW if you set read_ahead=0, what about the seq read performance compared
to before?

 IMHO, if any user doesn't want these nasty kernel effects and be sure of the 
 random work pattern, we should provide a configurable direct_io read option 
 (Need to quantify direct_io write also) as Sage suggested.

 Thanks  Regards
 Somnath


 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Wednesday, September 24, 2014 9:06 AM
 To: Sage Weil
 Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
 Subject: Re: Impact of page cache on OSD read performance for SSD

 On Wed, Sep 24, 2014 at 8:38 PM, Sage Weil sw...@redhat.com wrote:
 On Wed, 24 Sep 2014, Haomai Wang wrote:
 I agree with that direct read will help for disk read. But if read
 data is hot and small enough to fit in memory, page cache is a good
 place to hold data cache. If discard page cache, we need to implement
 a cache to provide with effective lookup impl.

 This is true for some workloads, but not necessarily true for all.
 Many clients (notably RBD) will be caching at the client side (in VM's
 fs, and possibly in librbd itself) such that caching at the OSD is
 largely wasted effort.  For RGW the opposite is likely true, unless there
 is a varnish cache or something in front.

 Still now, I don't think librbd cache can meet all the cache demand for rbd 
 usage. Even though we have a effective librbd cache impl, we still need a 
 buffer cache in ObjectStore level just like what database did. Client cache 
 and host cache are both needed.


 We should probably have a direct_io config option for filestore.  But
 even better would be some hint from the client about whether it is
 caching or not so that FileStore could conditionally cache...

 Yes, I remember we already did some early works like it.


 sage

  
 BTW, whether to use direct io we can refer to MySQL Innodb engine
 with direct io and PostgreSQL with page cache.

 On Wed, Sep 24, 2014 at 10:29 AM, Somnath Roy somnath@sandisk.com 
 wrote:
  Haomai,
  I am considering only about random reads and the changes I made only 
  affecting reads. For write, I have not measured yet. But, yes, page cache 
  may be helpful for write coalescing. Still need to evaluate how it is 
  behaving comparing direct_io on SSD though. I think Ceph code path will 
  be much shorter if we use direct_io in the write path where it is 
  actually executing the transactions. Probably, the sync thread and all 
  will not be needed.
 
  I am trying to analyze where is the extra reads coming from in case of 
  buffered io by using blktrace etc. This should give us a clear 
  understanding what exactly is going on there and it may turn out that 
  tuning kernel parameters only  we can achieve similar performance as 
  direct_io.
 
  Thanks  Regards
  Somnath
 
  -Original Message-
  From: Haomai Wang [mailto:haomaiw...@gmail.com]
  Sent: Tuesday, September 23, 2014 7:07 PM
  To: Sage Weil
  Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
  Subject: Re: Impact of page cache on OSD read performance for SSD
 
  Good point, but do you have considered that the impaction for write ops? 
  And if skip page cache, FileStore is responsible for data cache?
 
  On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil sw...@redhat.com wrote:
  On Tue, 23 Sep 2014, Somnath Roy wrote:
  Milosz,
  Thanks for the response. I will see if I can get any information out of 
  perf.
 
  Here is my OS information.
 
  root@emsclient:~# lsb_release -a
  No LSB modules are available.
  Distributor ID: Ubuntu
  Description:Ubuntu 13.10
  Release:13.10
  Codename:   saucy
  root@emsclient:~# uname -a
  Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9
  16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
 
  BTW, it's not a 45% drop, as you can see, by tuning the OSD parameter I 
  was able to get almost *2X* performance improvement with direct_io.
  It's not only page cache (memory) lookup, in case of buffered_io  the 
  following could be problem.
 
  1. Double copy (disk -> file buffer cache, file buffer cache ->
  user
  buffer)
 
  2. As the iostat output shows, it is not reading 4K only, it is
  reading more data from disk than required and in the end it will be
  wasted in case of random workload..
 
  It might be worth using blktrace to see what the IOs it is issuing are.
  Which ones are > 4K and what they point to...
 
  sage
 
 
 
  Thanks  Regards
  Somnath
 
  -Original Message-
  From: Milosz Tanski [mailto:mil...@adfin.com]
  Sent: Tuesday, September 23, 2014 12:09 PM
  To: Somnath Roy
  Cc: 

RE: Impact of page cache on OSD read performance for SSD

2014-09-24 Thread Somnath Roy
It will definitely be hampered.
There will not be a single solution that fits all. These parameters need to be 
tuned based on the workload.

Thanks & Regards
Somnath

-Original Message-
From: Haomai Wang [mailto:haomaiw...@gmail.com] 
Sent: Wednesday, September 24, 2014 7:56 PM
To: Somnath Roy
Cc: Sage Weil; Milosz Tanski; ceph-devel@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD

On Thu, Sep 25, 2014 at 7:49 AM, Somnath Roy somnath@sandisk.com wrote:
 Hi,
 After going through the blktrace, I think I have figured out what is 
 going on there. I think kernel read_ahead is causing the extra reads 
 in case of buffered read. If I set read_ahead = 0 , the performance I 
 am getting similar (or more when cache hit actually happens) to 
 direct_io :-)

Hmm, BTW if set read_ahead=0, what about seq read performance compared to 
before?

 IMHO, if any user doesn't want these nasty kernel effects and be sure of the 
 random work pattern, we should provide a configurable direct_io read option 
 (Need to quantify direct_io write also) as Sage suggested.

 Thanks  Regards
 Somnath


 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Wednesday, September 24, 2014 9:06 AM
 To: Sage Weil
 Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
 Subject: Re: Impact of page cache on OSD read performance for SSD

 On Wed, Sep 24, 2014 at 8:38 PM, Sage Weil sw...@redhat.com wrote:
 On Wed, 24 Sep 2014, Haomai Wang wrote:
 I agree with that direct read will help for disk read. But if read 
 data is hot and small enough to fit in memory, page cache is a good 
 place to hold data cache. If discard page cache, we need to 
 implement a cache to provide with effective lookup impl.

 This is true for some workloads, but not necessarily true for all.
 Many clients (notably RBD) will be caching at the client side (in 
 VM's fs, and possibly in librbd itself) such that caching at the OSD 
 is largely wasted effort.  For RGW the opposite is likely true, unless 
 there is a varnish cache or something in front.

 Still now, I don't think librbd cache can meet all the cache demand for rbd 
 usage. Even though we have a effective librbd cache impl, we still need a 
 buffer cache in ObjectStore level just like what database did. Client cache 
 and host cache are both needed.


 We should probably have a direct_io config option for filestore.  But 
 even better would be some hint from the client about whether it is 
 caching or not so that FileStore could conditionally cache...

 Yes, I remember we already did some early works like it.


 sage

  
 BTW, whether to use direct io we can refer to MySQL Innodb engine 
 with direct io and PostgreSQL with page cache.

 On Wed, Sep 24, 2014 at 10:29 AM, Somnath Roy somnath@sandisk.com 
 wrote:
  Haomai,
  I am considering only about random reads and the changes I made only 
  affecting reads. For write, I have not measured yet. But, yes, page cache 
  may be helpful for write coalescing. Still need to evaluate how it is 
  behaving comparing direct_io on SSD though. I think Ceph code path will 
  be much shorter if we use direct_io in the write path where it is 
  actually executing the transactions. Probably, the sync thread and all 
  will not be needed.
 
  I am trying to analyze where is the extra reads coming from in case of 
  buffered io by using blktrace etc. This should give us a clear 
  understanding what exactly is going on there and it may turn out that 
  tuning kernel parameters only  we can achieve similar performance as 
  direct_io.
 
  Thanks  Regards
  Somnath
 
  -Original Message-
  From: Haomai Wang [mailto:haomaiw...@gmail.com]
  Sent: Tuesday, September 23, 2014 7:07 PM
  To: Sage Weil
  Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
  Subject: Re: Impact of page cache on OSD read performance for SSD
 
  Good point, but do you have considered that the impaction for write ops? 
  And if skip page cache, FileStore is responsible for data cache?
 
  On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil sw...@redhat.com wrote:
  On Tue, 23 Sep 2014, Somnath Roy wrote:
  Milosz,
  Thanks for the response. I will see if I can get any information out of 
  perf.
 
  Here is my OS information.
 
  root@emsclient:~# lsb_release -a No LSB modules are available.
  Distributor ID: Ubuntu
  Description:Ubuntu 13.10
  Release:13.10
  Codename:   saucy
  root@emsclient:~# uname -a
  Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9
  16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
 
  BTW, it's not a 45% drop, as you can see, by tuning the OSD parameter I 
  was able to get almost *2X* performance improvement with direct_io.
  It's not only page cache (memory) lookup, in case of buffered_io  the 
  following could be problem.
 
  1. Double copy (disk -> file buffer cache, file buffer cache -> 
  user
  buffer)
 
  2. As the iostat output shows, it is not reading 4K only, it is 

RE: Impact of page cache on OSD read performance for SSD

2014-09-24 Thread Chen, Xiaoxi
Have you ever seen a large readahead_kb hurt random performance?

We usually set it very large (2M), and the random read performance keeps steady, 
even in an all-SSD setup. Maybe with your optimization code for the OP_QUEUE, 
things may be different?
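
For anyone reproducing this, a minimal sketch of reading and changing the knob under discussion (the device name sdb is only an example; adjust it for the OSD data disk, and note that writing it needs root):

#include <fstream>
#include <iostream>
#include <string>

int main() {
  const std::string knob = "/sys/block/sdb/queue/read_ahead_kb";

  std::ifstream in(knob);
  std::string current;
  std::getline(in, current);
  std::cout << knob << " = " << current << " KB" << std::endl;

  std::ofstream out(knob);   // 0 disables readahead; larger values (e.g. 2048)
  out << 0 << std::endl;     // favor sequential reads instead
  if (!out)
    std::cerr << "failed to update " << knob << std::endl;
  return 0;
}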

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Thursday, September 25, 2014 11:15 AM
To: Haomai Wang
Cc: Sage Weil; Milosz Tanski; ceph-devel@vger.kernel.org
Subject: RE: Impact of page cache on OSD read performance for SSD

It will be definitely hampered.
There will not be a single solution fits all. These parameters needs to be 
tuned based on the workload.

Thanks  Regards
Somnath

-Original Message-
From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Wednesday, September 24, 2014 7:56 PM
To: Somnath Roy
Cc: Sage Weil; Milosz Tanski; ceph-devel@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD

On Thu, Sep 25, 2014 at 7:49 AM, Somnath Roy somnath@sandisk.com wrote:
 Hi,
 After going through the blktrace, I think I have figured out what is 
 going on there. I think kernel read_ahead is causing the extra reads 
 in case of buffered read. If I set read_ahead = 0 , the performance I 
 am getting similar (or more when cache hit actually happens) to 
 direct_io :-)

Hmm, BTW if set read_ahead=0, what about seq read performance compared to 
before?

 IMHO, if any user doesn't want these nasty kernel effects and be sure of the 
 random work pattern, we should provide a configurable direct_io read option 
 (Need to quantify direct_io write also) as Sage suggested.

 Thanks  Regards
 Somnath


 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Wednesday, September 24, 2014 9:06 AM
 To: Sage Weil
 Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
 Subject: Re: Impact of page cache on OSD read performance for SSD

 On Wed, Sep 24, 2014 at 8:38 PM, Sage Weil sw...@redhat.com wrote:
 On Wed, 24 Sep 2014, Haomai Wang wrote:
 I agree with that direct read will help for disk read. But if read 
 data is hot and small enough to fit in memory, page cache is a good 
 place to hold data cache. If discard page cache, we need to 
 implement a cache to provide with effective lookup impl.

 This is true for some workloads, but not necessarily true for all.
 Many clients (notably RBD) will be caching at the client side (in 
 VM's fs, and possibly in librbd itself) such that caching at the OSD 
 is largely wasted effort.  For RGW the opposite is likely true, unless 
 there is a varnish cache or something in front.

 Still now, I don't think librbd cache can meet all the cache demand for rbd 
 usage. Even though we have a effective librbd cache impl, we still need a 
 buffer cache in ObjectStore level just like what database did. Client cache 
 and host cache are both needed.


 We should probably have a direct_io config option for filestore.  But 
 even better would be some hint from the client about whether it is 
 caching or not so that FileStore could conditionally cache...

 Yes, I remember we already did some early works like it.


 sage

  
 BTW, whether to use direct io we can refer to MySQL Innodb engine 
 with direct io and PostgreSQL with page cache.

 On Wed, Sep 24, 2014 at 10:29 AM, Somnath Roy somnath@sandisk.com 
 wrote:
  Haomai,
  I am considering only about random reads and the changes I made only 
  affecting reads. For write, I have not measured yet. But, yes, page cache 
  may be helpful for write coalescing. Still need to evaluate how it is 
  behaving comparing direct_io on SSD though. I think Ceph code path will 
  be much shorter if we use direct_io in the write path where it is 
  actually executing the transactions. Probably, the sync thread and all 
  will not be needed.
 
  I am trying to analyze where is the extra reads coming from in case of 
  buffered io by using blktrace etc. This should give us a clear 
  understanding what exactly is going on there and it may turn out that 
  tuning kernel parameters only  we can achieve similar performance as 
  direct_io.
 
  Thanks  Regards
  Somnath
 
  -Original Message-
  From: Haomai Wang [mailto:haomaiw...@gmail.com]
  Sent: Tuesday, September 23, 2014 7:07 PM
  To: Sage Weil
  Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
  Subject: Re: Impact of page cache on OSD read performance for SSD
 
  Good point, but do you have considered that the impaction for write ops? 
  And if skip page cache, FileStore is responsible for data cache?
 
  On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil sw...@redhat.com wrote:
  On Tue, 23 Sep 2014, Somnath Roy wrote:
  Milosz,
  Thanks for the response. I will see if I can get any information out of 
  perf.
 
  Here is my OS information.
 
  root@emsclient:~# lsb_release -a No LSB modules are available.
  Distributor ID: Ubuntu
  Description:Ubuntu 13.10
  Release:13.10
 

Re: [ceph-users] Can ceph-deploy be used with 'osd objectstore = keyvaluestore-dev' in config file ?

2014-09-24 Thread Mark Kirkwood

On 25/09/14 01:03, Sage Weil wrote:

On Wed, 24 Sep 2014, Mark Kirkwood wrote:

On 24/09/14 14:29, Aegeaner wrote:

I run ceph on Red Hat Enterprise Linux Server 6.4 Santiago, and when I
run service ceph start i got:

# service ceph start

 ERROR:ceph-disk:Failed to activate
 ceph-disk: Does not look like a Ceph OSD, or incompatible version:
 /var/lib/ceph/tmp/mnt.I71N5T
 mount: /dev/hioa1 already mounted or /var/lib/ceph/tmp/mnt.02sVHj busy
 ceph-disk: Mounting filesystem failed: Command '['/bin/mount', '-t',
 'xfs', '-o', 'noatime', '--',


'/dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.6d726c93-41f9-453d-858e-ab4132b5c8fd',

 '/var/lib/ceph/tmp/mnt.02sVHj']' returned non-zero exit status 32
 ceph-disk: Error: One or more partitions failed to activate

Someone told me service ceph start still tries to call ceph-disk which
will create a filestore type OSD, and create a journal partition, is it
true?

ls -l /dev/disk/by-parttypeuuid/

 lrwxrwxrwx. 1 root root 11 Sep 23 16:56 45b0969e-9b03-4f30-b4c6-b4b80ceff106.00dbee5e-fb68-47c4-aa58-924c904c4383 -> ../../hioa2
 lrwxrwxrwx. 1 root root 10 Sep 23 17:02 45b0969e-9b03-4f30-b4c6-b4b80ceff106.c30e5b97-b914-4eb8-8306-a9649e1c20ba -> ../../sdb2
 lrwxrwxrwx. 1 root root 11 Sep 23 16:56 4fbd7e29-9d25-41b8-afd0-062c0ceff05d.6d726c93-41f9-453d-858e-ab4132b5c8fd -> ../../hioa1
 lrwxrwxrwx. 1 root root 10 Sep 23 17:02 4fbd7e29-9d25-41b8-afd0-062c0ceff05d.b56ec699-e134-4b90-8f55-4952453e1b7e -> ../../sdb1
 lrwxrwxrwx. 1 root root 11 Sep 23 16:52 89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be.6d726c93-41f9-453d-858e-ab4132b5c8fd -> ../../hioa1

There seem to be two hioa1 partitions there, maybe remaining from the last
time I created the OSD using ceph-deploy osd prepare?



Crap - it is fighting you, yes - looks like the startup script has tried
to build an osd for you using ceph-disk (which will make two partitions
by default). So that's toasted the setup that your script did.

Growl - that's made it more complicated for sure.


Hrm, yeah.  I think ceph-disk needs to have an option (via ceph.conf) that
will avoid creating a journal [partition], and we need to make sure that
the journal behavior is all conditional on the journal symlink being
present.  Do you mind opening a bug for this?  It could condition itself
off of the osd objectstore option (we'd need to teach ceph-disk about the
various backends), or we could add a secondary option (awkward to
configure), or we could call into ceph-osd with something like 'ceph-osd
-i 0 --does-backend-need-journal' so that a call into the backend
code itself can control things.  The latter is probably ideal.

Opened http://tracker.ceph.com/issues/9580 and copying ceph-devel



Yeah, looks good - the approach to ask ceph-osd if it needs a journal 
seems sound.


Regards

Mark
