Re: Impact of page cache on OSD read performance for SSD
On Wed, 24 Sep 2014, Haomai Wang wrote:

I agree that a direct read will help for disk reads. But if the read data is hot and small enough to fit in memory, the page cache is a good place to hold the data cache. If we discard the page cache, we need to implement a cache that provides an effective lookup implementation.

This is true for some workloads, but not necessarily true for all. Many clients (notably RBD) will be caching at the client side (in the VM's fs, and possibly in librbd itself) such that caching at the OSD is largely wasted effort. For RGW the same is likely true, unless there is a varnish cache or something in front.

We should probably have a direct_io config option for filestore. But even better would be some hint from the client about whether it is caching or not, so that FileStore could conditionally cache...

sage

BTW, on whether to use direct io we can refer to the MySQL InnoDB engine (direct io) and PostgreSQL (page cache).

On Wed, Sep 24, 2014 at 10:29 AM, Somnath Roy somnath@sandisk.com wrote:

Haomai, I am considering only random reads, and the changes I made only affect reads. For writes, I have not measured yet. But, yes, the page cache may be helpful for write coalescing. Still need to evaluate how it behaves compared to direct_io on SSD, though. I think the Ceph code path will be much shorter if we use direct_io in the write path where it is actually executing the transactions. Probably the sync thread and all will not be needed. I am trying to analyze where the extra reads are coming from in the case of buffered io by using blktrace etc. This should give us a clear understanding of what exactly is going on there, and it may turn out that by tuning kernel parameters only we can achieve similar performance to direct_io.

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Tuesday, September 23, 2014 7:07 PM
To: Sage Weil
Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD

Good point, but have you considered the impact on write ops? And if we skip the page cache, is FileStore then responsible for the data cache?

On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil sw...@redhat.com wrote:

On Tue, 23 Sep 2014, Somnath Roy wrote:

Milosz, Thanks for the response. I will see if I can get any information out of perf. Here is my OS information.

root@emsclient:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 13.10
Release:        13.10
Codename:       saucy
root@emsclient:~# uname -a
Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

BTW, it's not a 45% drop; as you can see, by tuning the OSD parameter I was able to get almost *2X* performance improvement with direct_io. It's not only the page cache (memory) lookup; in the case of buffered_io the following could be problems. 1. Double copy (disk -> file buffer cache, file buffer cache -> user buffer). 2. As the iostat output shows, it is not reading 4K only; it is reading more data from disk than required, and in the end it will be wasted in the case of a random workload.

It might be worth using blktrace to see what the IOs it is issuing are. Which ones are 4K and what they point to...
sage

Thanks & Regards
Somnath

-----Original Message-----
From: Milosz Tanski [mailto:mil...@adfin.com]
Sent: Tuesday, September 23, 2014 12:09 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD

Somnath, I wonder if there's a bottleneck or a point of contention in the kernel. For an entirely uncached workload I expect the page cache lookup to cause a slowdown (since the lookup should be wasted). What I wouldn't expect is a 45% performance drop. Memory speed should be one magnitude faster than a modern SATA SSD drive (so the overhead should be more negligible). Is there any way you could perform the same test but monitor what's going on with the OSD process using the perf tool? Whatever the default cpu-time-spent hardware counter is, is fine. Make sure you have the kernel debug info package installed so you can get symbol information for kernel and module calls. With any luck the diff between the perf output of the two runs will show us the culprit. Also, can you tell us what OS/kernel version you're using on the OSD machines?

- Milosz

On Tue, Sep 23, 2014 at 2:05 PM, Somnath Roy somnath@sandisk.com wrote:

Hi Sage, I have created the following setup in order to examine how a single OSD behaves if, say, ~80-90% of ios are hitting the SSDs. My test includes the following steps.

1. Created a single OSD cluster.
2. Created two rbd images (110GB each) on 2 different pools.
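For anyone wanting to reproduce the buffered-versus-direct gap outside of Ceph, a minimal fio sketch along the lines of the test being discussed; /dev/sdb, the queue depth and the runtime are illustrative placeholders, not the parameters Somnath used:

  # 4K random reads through the page cache
  fio --name=buffered-randread --filename=/dev/sdb --rw=randread --bs=4k \
      --ioengine=libaio --iodepth=32 --direct=0 --runtime=60 --time_based
  # the same workload with O_DIRECT, bypassing the page cache
  fio --name=direct-randread --filename=/dev/sdb --rw=randread --bs=4k \
      --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based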
Re: [ceph-users] Can ceph-deploy be used with 'osd objectstore = keyvaluestore-dev' in config file ?
On Wed, 24 Sep 2014, Mark Kirkwood wrote:

On 24/09/14 14:29, Aegeaner wrote:

I run ceph on Red Hat Enterprise Linux Server 6.4 Santiago, and when I run service ceph start I got:

# service ceph start
ERROR:ceph-disk:Failed to activate
ceph-disk: Does not look like a Ceph OSD, or incompatible version: /var/lib/ceph/tmp/mnt.I71N5T
mount: /dev/hioa1 already mounted or /var/lib/ceph/tmp/mnt.02sVHj busy
ceph-disk: Mounting filesystem failed: Command '['/bin/mount', '-t', 'xfs', '-o', 'noatime', '--', '/dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.6d726c93-41f9-453d-858e-ab4132b5c8fd', '/var/lib/ceph/tmp/mnt.02sVHj']' returned non-zero exit status 32
ceph-disk: Error: One or more partitions failed to activate

Someone told me service ceph start still tries to call ceph-disk, which will create a filestore type OSD and create a journal partition; is that true?

ls -l /dev/disk/by-parttypeuuid/
lrwxrwxrwx. 1 root root 11 9月 23 16:56 45b0969e-9b03-4f30-b4c6-b4b80ceff106.00dbee5e-fb68-47c4-aa58-924c904c4383 -> ../../hioa2
lrwxrwxrwx. 1 root root 10 9月 23 17:02 45b0969e-9b03-4f30-b4c6-b4b80ceff106.c30e5b97-b914-4eb8-8306-a9649e1c20ba -> ../../sdb2
lrwxrwxrwx. 1 root root 11 9月 23 16:56 4fbd7e29-9d25-41b8-afd0-062c0ceff05d.6d726c93-41f9-453d-858e-ab4132b5c8fd -> ../../hioa1
lrwxrwxrwx. 1 root root 10 9月 23 17:02 4fbd7e29-9d25-41b8-afd0-062c0ceff05d.b56ec699-e134-4b90-8f55-4952453e1b7e -> ../../sdb1
lrwxrwxrwx. 1 root root 11 9月 23 16:52 89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be.6d726c93-41f9-453d-858e-ab4132b5c8fd -> ../../hioa1

There seem to be two hioa1 entries there, maybe left over from the last time I created the OSD using ceph-deploy osd prepare?

Crap - it is fighting you, yes - looks like the startup script has tried to build an osd for you using ceph-disk (which will make two partitions by default). So that's toasted the setup that your script did. Growl - that's made it more complicated for sure.

Hrm, yeah. I think ceph-disk needs to have an option (via ceph.conf) that will avoid creating a journal [partition], and we need to make sure that the journal behavior is all conditional on the journal symlink being present. Do you mind opening a bug for this? It could condition itself off of the osd objectstore option (we'd need to teach ceph-disk about the various backends), or we could add a secondary option (awkward to configure), or we could call into ceph-osd with something like 'ceph-osd -i 0 --does-backend-need-journal' so that a call into the backend code itself can control things. The latter is probably ideal.

Opened http://tracker.ceph.com/issues/9580 and copying ceph-devel

sage
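For illustration only, a sketch of how the activation path could use the probe Sage proposes; the --does-backend-need-journal flag is the hypothetical interface from the paragraph above (it does not exist in any released ceph-osd), and the two helpers are stand-ins for ceph-disk's real partitioning and mount logic:

  # Hypothetical: ask the backend whether it wants a journal before
  # creating one. Exit status 0 would mean "journal needed".
  if ceph-osd -i "$OSD_ID" --does-backend-need-journal; then
      create_journal_partition "$DEV"    # placeholder helper
  fi
  prepare_data_partition "$DEV"          # placeholder helper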
Re: Impact of page cache on OSD read performance for SSD
On 09/24/2014 07:38 AM, Sage Weil wrote:

On Wed, 24 Sep 2014, Haomai Wang wrote:

I agree that a direct read will help for disk reads. But if the read data is hot and small enough to fit in memory, the page cache is a good place to hold the data cache. If we discard the page cache, we need to implement a cache that provides an effective lookup implementation.

This is true for some workloads, but not necessarily true for all. Many clients (notably RBD) will be caching at the client side (in the VM's fs, and possibly in librbd itself) such that caching at the OSD is largely wasted effort. For RGW the same is likely true, unless there is a varnish cache or something in front.

We should probably have a direct_io config option for filestore. But even better would be some hint from the client about whether it is caching or not, so that FileStore could conditionally cache...

I like the hinting idea. Having said that, if the effect being seen is due to page cache, it seems like something is off. We've seen performance issues in the kernel before, so it's not unprecedented. Working around it with direct IO could be the right way to go, but it might be that this is something that could be fixed higher up and improve performance in other scenarios too. I'd hate to let it go by the wayside if we could find something actionable.
Re: Fwd: S3 API Compatibility support
There is the main #4099 issue for object expiration, but there is no real detail there. The plan is (as always) to have equivalent functionality to S3. Do you mind creating a new feature ticket that specifically references the ability to move objects to a second storage tier based on policy? Any references to AWS docs about the API or functionality would be helpful in the ticket.

Sure, I will create a new feature ticket and add the needed information there. Created a new ticket: http://tracker.ceph.com/issues/9581

Thanks
Swami

On Fri, Sep 19, 2014 at 9:23 PM, M Ranga Swami Reddy swamire...@gmail.com wrote:

What do you mean by RRS storage-low cost storage? My read of the RRS numbers is that they simply have a different tier of S3 that runs fewer replicas and (probably) cheaper disks. In radosgw-land, this would just be a different rados pool with 2x replicas and (probably) a CRUSH rule mapping it to different hardware (with bigger and/or cheaper disks).

That's correct. If we could do this with a different rados pool using 2x replicas along with a CRUSH rule mapping it to different h/w (with bigger and cheaper disks), then it's the same as RRS support in AWS.

What isn't currently supported is the ability to reduce the redundancy of individual objects in a bucket. I don't think there is anything architecturally preventing that, but it is not implemented or supported.

OK. Do we have the issue id for the above? Else, we can file one. Please advise.

There is the main #4099 issue for object expiration, but there is no real detail there. The plan is (as always) to have equivalent functionality to S3. Do you mind creating a new feature ticket that specifically references the ability to move objects to a second storage tier based on policy? Any references to AWS docs about the API or functionality would be helpful in the ticket.

Sure, I will create a new feature ticket and add the needed information there.

Thanks
Swami

On Fri, Sep 19, 2014 at 9:08 PM, Sage Weil sw...@redhat.com wrote:

On Fri, 19 Sep 2014, M Ranga Swami Reddy wrote:

Hi Sage, Thanks for the quick reply.

Ceph doesn't interact at all with AWS services like Glacier, if that's what you mean. For RRS, though, I assume you mean the ability to create buckets with reduced redundancy with radosgw? That is supported, although not quite the way AWS does it. You can create different pools that back RGW buckets, and each bucket is stored in one of those pools. So you could make one of them 2x instead of 3x, or use an erasure code of your choice.

Yes, we can configure ceph to use 2x replicas, which will look like reduced redundancy, but AWS uses a separate RRS low-cost storage class (instead of standard) for this purpose. I am checking if we could do similarly in ceph too.

What do you mean by RRS storage-low cost storage? My read of the RRS numbers is that they simply have a different tier of S3 that runs fewer replicas and (probably) cheaper disks. In radosgw-land, this would just be a different rados pool with 2x replicas and (probably) a CRUSH rule mapping it to different hardware (with bigger and/or cheaper disks). What isn't currently supported is the ability to reduce the redundancy of individual objects in a bucket. I don't think there is anything architecturally preventing that, but it is not implemented or supported.

OK. Do we have the issue id for the above? Else, we can file one. Please advise.

There is the main #4099 issue for object expiration, but there is no real detail there. The plan is (as always) to have equivalent functionality to S3. Do you mind creating a new feature ticket that specifically references the ability to move objects to a second storage tier based on policy? Any references to AWS docs about the API or functionality would be helpful in the ticket. When we look at the S3 archival features in more detail (soon!) I'm sure this will come up! The current plan is to address object versioning first. That is, unless a developer surfaces who wants to start hacking on this right away...

Great to know this. We too are keen on S3 support in Ceph, and we are happy to support you here.

Great to hear! Thanks-
sage

Thanks
Swami

On Fri, Sep 19, 2014 at 11:08 AM, Sage Weil sw...@redhat.com wrote:

On Fri, 19 Sep 2014, M Ranga Swami Reddy wrote:

Hi Sage, Could you please advise if Ceph supports low cost object storage (like Amazon Glacier or RRS) for archiving objects like log files etc.?

Ceph doesn't interact at all with AWS services like Glacier, if that's what you mean. For RRS, though, I assume you mean the ability to create buckets with reduced redundancy with radosgw? That is supported, although not quite the way AWS does it. You can create different pools that back RGW buckets, and each bucket is stored in one of those pools. So you could make one of them 2x instead of 3x, or use an erasure code of your choice. What isn't currently supported is the ability to reduce the redundancy
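For context, a reduced-redundancy pool of the kind described above can be sketched with standard commands; the pool name, PG counts and the cheap-disks CRUSH root are placeholders, and wiring the pool into an RGW placement target is a separate step:

  # CRUSH rule targeting a hypothetical root of bigger/cheaper disks
  ceph osd crush rule create-simple rrs-rule cheap-disks host
  # 2x pool instead of the default 3x
  ceph osd pool create rrs-pool 256 256
  ceph osd pool set rrs-pool size 2
  # attach the rule (rule id from 'ceph osd crush rule dump rrs-rule')
  ceph osd pool set rrs-pool crush_ruleset 1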
Re: Impact of page cache on OSD read performance for SSD
On Wed, Sep 24, 2014 at 9:27 AM, Mark Nelson mark.nel...@inktank.com wrote:

I like the hinting idea. Having said that, if the effect being seen is due to page cache, it seems like something is off. We've seen performance issues in the kernel before, so it's not unprecedented. Working around it with direct IO could be the right way to go, but it might be that this is something that could be fixed higher up and improve performance in other scenarios too. I'd hate to let it go by the wayside if we could find something actionable.
Re: Impact of page cache on OSD read performance for SSD
On Wed, Sep 24, 2014 at 8:38 PM, Sage Weil sw...@redhat.com wrote:

On Wed, 24 Sep 2014, Haomai Wang wrote:

I agree that a direct read will help for disk reads. But if the read data is hot and small enough to fit in memory, the page cache is a good place to hold the data cache. If we discard the page cache, we need to implement a cache that provides an effective lookup implementation.

This is true for some workloads, but not necessarily true for all. Many clients (notably RBD) will be caching at the client side (in the VM's fs, and possibly in librbd itself) such that caching at the OSD is largely wasted effort. For RGW the same is likely true, unless there is a varnish cache or something in front.

Still, I don't think the librbd cache can meet all the cache demands for rbd usage. Even if we have an effective librbd cache implementation, we still need a buffer cache at the ObjectStore level, just like what databases do. Client cache and host cache are both needed.

We should probably have a direct_io config option for filestore. But even better would be some hint from the client about whether it is caching or not, so that FileStore could conditionally cache...

Yes, I remember we already did some early work like that.
BlaumRoth with w=7 : what are the consequences ?
Hi Kevin,

When implementing the plugin for Ceph, the check for isprime(w+1) was not added, and w=7 was made the default for the BlaumRoth technique. As a result, all content encoded with the BlaumRoth technique has used this parameter. I do not know what the consequences are. It does not crash, and from what I've tried, content can be encoded/decoded/repaired fine. But maybe I was lucky? Your expert opinion on the consequences of choosing w=7 would be greatly appreciated :-)

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
Re: [ceph-users] Status of snapshots in CephFS
On Fri, Sep 19, 2014 at 5:25 PM, Sage Weil sw...@redhat.com wrote:

On Fri, 19 Sep 2014, Florian Haas wrote:

Hello everyone,

Just thought I'd circle back on some discussions I've had with people earlier in the year: Shortly before firefly, snapshot support for CephFS clients was effectively disabled by default at the MDS level, and can only be enabled after accepting a scary warning that your filesystem is highly likely to break if snapshot support is enabled. Has any progress been made on this in the interim? With libcephfs support slowly maturing in Ganesha, the option of deploying a Ceph-backed userspace NFS server is becoming more attractive -- and it's probably a better use of resources than mapping a boatload of RBDs on an NFS head node and then exporting all the data from there. Recent snapshot trimming issues notwithstanding, RBD snapshot support is reasonably stable, but even so, making snapshot data available via NFS that way is rather ugly. In addition, the libcephfs/Ganesha approach would obviously include much better horizontal scalability.

We haven't done any work on snapshot stability. It is probably moderately stable if snapshots are only done at the root or at a consistent point in the hierarchy (as opposed to random directories), but there are still some basic problems that need to be resolved. I would not suggest deploying this in production! But some stress testing would, as always, be very welcome. :)

OK, on a semi-related note: is there any reasonably current authoritative list of features that are supported and unsupported in either ceph-fuse or kernel cephfs, and if so, at what minimal version? The most comprehensive overview that seems to be available is one from Greg, which however is a year and a half old: http://ceph.com/dev-notes/cephfs-mds-status-discussion/

In addition, https://github.com/nfs-ganesha/nfs-ganesha/wiki/ReleaseNotes_2.0#CEPH states: The current requirement to build and use the Ceph FSAL is a Ceph build environment which includes Ceph client enhancements staged on the libwipcephfs development branch. These changes are expected to be part of the Ceph Firefly release. ... though it's not clear whether they ever did make it into firefly. Could someone in the know comment on that?

I think this is referring to the libcephfs API changes that the cohortfs folks did. That all merged shortly before firefly.

Great, thanks for the clarification.

By the way, we have some basic samba integration tests in our regular regression tests, but nothing based on ganesha. If you really want this to work, the most valuable thing you could do would be to help get the tests written and integrated into ceph-qa-suite.git. Probably the biggest piece of work there is creating a task/ganesha.py that installs and configures ganesha with the ceph FSAL.

Hmmm, given the excellent writeup that Niels de Vos of Gluster fame wrote about this topic, I might actually be able to cargo-cult some of what's in the Samba task and adapt it for ganesha. Sorry while I'm being ignorant about Teuthology: what platform does it normally run on? I ask because I understand most of your testing is done on Ubuntu, and Ubuntu currently doesn't ship a Ganesha package, which would make the install task a bit more complex.

Cheers,
Florian
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Wed, Sep 24, 2014 at 1:05 AM, Sage Weil sw...@redhat.com wrote:

Sam and I discussed this on IRC and have, we think, two simpler patches that solve the problem more directly. See wip-9487.

So I understand this makes Dan's patch (and the config parameter that it introduces) unnecessary, but is it correct to assume that, just like Dan's patch, yours too will not be effective unless osd snap trim sleep > 0?

Queued for testing now. Once that passes we can backport and test for firefly and dumpling too. Note that this won't make the next dumpling or firefly point releases (which are imminent). Should be in the next ones, though.

OK, just in case anyone else runs into problems after removing tons of snapshots with <= 0.67.11, what's the plan to get them going again until 0.67.12 comes out? Install the autobuild package from the wip branch?

Cheers,
Florian
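For reference, the throttle being discussed can be inspected and changed at runtime; the value shown is illustrative only, not a recommendation from this thread:

  # current value on one OSD (via its admin socket)
  ceph daemon osd.0 config get osd_snap_trim_sleep
  # inject a new value into all OSDs
  ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.05'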
Re: BlaumRoth with w=7 : what are the consequences ?
Hi Kevin,

On 24/09/2014 20:40, Kevin Greenan wrote:

The constraint guarantees the MDS property. I believe there are conditions where w+1 is composite and you still have an MDS code, but there are restrictions on 'n' (codeword length). So, you may have chosen the right parameters. Did you verify that all possible combinations of erasures are tolerated?

I tried all combinations that are likely to have been used and they all work out. Here is the script I used:

for w in 7 11 13 17 19 ; do
  for k in $(seq 2 $w) ; do
    for m in $(seq 1 $k) ; do
      for erasures in $(seq 1 $m) ; do
        ./ceph_erasure_code_benchmark --plugin jerasure --workload decoded --iterations 1 --size 4096 --erasures $erasures --parameter w=$w --parameter k=$k --parameter m=2 --parameter technique=blaum_roth
      done
    done
  done
done

Does that mean we're safe despite the fact that w+1 is not prime in all settings?

For the sake of safety, you probably do not want to get experimental with 'production' code. IMHO, you should put the check in, especially if 'n' is tunable.

n is tunable and the check is now enforced. I'm worried about the content that was previously encoded with codeword lengths that do not match prime(codeword + 1).

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
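The isprime(w+1) guard Kevin recommends is cheap to express; a small shell sketch, checking the w values from the script above (note that w+1 is composite for every one of them, which is exactly the worry here):

  is_prime() {
    local n=$1 i
    [ "$n" -lt 2 ] && return 1
    for (( i = 2; i * i <= n; i++ )); do
      (( n % i == 0 )) && return 1
    done
    return 0
  }
  for w in 7 11 13 17 19; do
    if is_prime $((w + 1)); then
      echo "w=$w: w+1 is prime"
    else
      echo "w=$w: w+1 is NOT prime"
    fi
  done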
RE: Impact of page cache on OSD read performance for SSD
Hi,

After going through the blktrace, I think I have figured out what is going on there. I think kernel read_ahead is causing the extra reads in the case of buffered reads. If I set read_ahead = 0, the performance I am getting is similar (or better, when a cache hit actually happens) to direct_io :-)

IMHO, if any user doesn't want these nasty kernel effects and is sure of a random workload pattern, we should provide a configurable direct_io read option (need to quantify direct_io writes also) as Sage suggested.

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Wednesday, September 24, 2014 9:06 AM
To: Sage Weil
Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD

Still, I don't think the librbd cache can meet all the cache demands for rbd usage. Even if we have an effective librbd cache implementation, we still need a buffer cache at the ObjectStore level, just like what databases do. Client cache and host cache are both needed.
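The knobs referenced above, for anyone retracing the analysis; the device name is a placeholder for whatever backs the OSD data directory:

  # inspect and disable kernel readahead for the device
  cat /sys/block/sdb/queue/read_ahead_kb
  echo 0 > /sys/block/sdb/queue/read_ahead_kb
  # capture the I/Os actually hitting the disk, then inspect their sizes
  blktrace -d /dev/sdb -o osdtrace &
  blkparse -i osdtrace | less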
Fwd: question about client's cluster aware
-- Forwarded message --
From: yue longguang yuelonggu...@gmail.com
Date: Tue, Sep 23, 2014 at 5:53 PM
Subject: question about client's cluster aware
To: ceph-devel@vger.kernel.org

Hi all,

My question comes from my testing. Let's take an example: object1 (4MB) -> pg 0.1 -> osd 1,2,3 (p1). While a client is writing object1, osd1 goes down mid-write. Let's suppose 2MB has been written.

1. When the connection to osd1 goes down, what does the client do? Ask the monitor for a new osdmap, or only the pg map?
2. Now the client gets a newer map and continues the write; the primary osd should now be osd2, and the remaining 2MB is written out. What does ceph do to integrate the two parts of the data, and to guarantee that there are enough replicas?
3. Where is the code? Please be sure to tell me where the code is. It is a very difficult question.

Thanks so much
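While waiting for an answer on the internals, the object-to-PG-to-OSD mapping in the example can at least be observed from the command line; the object name is taken from the question, with "rbd" assumed as the pool:

  # which pg, and which up/acting OSD set, an object maps to
  ceph osd map rbd object1
  # the current osdmap epoch a client would fetch from the monitors
  ceph osd dump | grep ^epoch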
Fwd: question about object replication theory
-- Forwarded message --
From: yue longguang yuelonggu...@gmail.com
Date: Tue, Sep 23, 2014 at 5:37 PM
Subject: question about object replication theory
To: ceph-devel@vger.kernel.org

Hi all,

Take a look at this link: http://www.ceph.com/docs/master/architecture/#smart-daemons-enable-hyperscale. Could you explain points 2 and 3 in that picture?

1. At points 2 and 3, before the primary writes data to the next osd, where is the data? Is it in memory or on disk already?
2. Where is the code for points 2 and 3, where the primary distributes data to the others?

Thanks
Re: Impact of page cache on OSD read performance for SSD
On Thu, Sep 25, 2014 at 7:49 AM, Somnath Roy somnath@sandisk.com wrote:

Hi,

After going through the blktrace, I think I have figured out what is going on there. I think kernel read_ahead is causing the extra reads in the case of buffered reads. If I set read_ahead = 0, the performance I am getting is similar (or better, when a cache hit actually happens) to direct_io :-)

Hmm, BTW, if you set read_ahead=0, what about seq read performance compared to before?

IMHO, if any user doesn't want these nasty kernel effects and is sure of a random workload pattern, we should provide a configurable direct_io read option (need to quantify direct_io writes also) as Sage suggested.
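A sketch of the follow-up measurement Haomai is asking for: the same device, sequential reads, with readahead disabled and then restored; 128 KB is a common default for read_ahead_kb, and the device name is a placeholder:

  echo 0 > /sys/block/sdb/queue/read_ahead_kb
  fio --name=seqread-no-ra --filename=/dev/sdb --rw=read --bs=4k \
      --ioengine=libaio --iodepth=32 --direct=0 --runtime=60 --time_based
  echo 128 > /sys/block/sdb/queue/read_ahead_kb
  fio --name=seqread-ra --filename=/dev/sdb --rw=read --bs=4k \
      --ioengine=libaio --iodepth=32 --direct=0 --runtime=60 --time_based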
RE: Impact of page cache on OSD read performance for SSD
It will definitely be hampered. There will not be a single solution that fits all. These parameters need to be tuned based on the workload.

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Wednesday, September 24, 2014 7:56 PM
To: Somnath Roy
Cc: Sage Weil; Milosz Tanski; ceph-devel@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD

Hmm, BTW, if you set read_ahead=0, what about seq read performance compared to before?
RE: Impact of page cache on OSD read performance for SSD
Have you ever seen a large readahead_kb hurt random performance? We usually set it very large (2M), and the random read performance keeps steady, even in an all-SSD setup. Maybe with your optimization code for OP_QUEUE, things may be different?

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Thursday, September 25, 2014 11:15 AM
To: Haomai Wang
Cc: Sage Weil; Milosz Tanski; ceph-devel@vger.kernel.org
Subject: RE: Impact of page cache on OSD read performance for SSD

It will definitely be hampered. There will not be a single solution that fits all. These parameters need to be tuned based on the workload.
Re: [ceph-users] Can ceph-deploy be used with 'osd objectstore = keyvaluestore-dev' in config file ?
On 25/09/14 01:03, Sage Weil wrote:

Hrm, yeah. I think ceph-disk needs to have an option (via ceph.conf) that will avoid creating a journal [partition], and we need to make sure that the journal behavior is all conditional on the journal symlink being present. Do you mind opening a bug for this? It could condition itself off of the osd objectstore option (we'd need to teach ceph-disk about the various backends), or we could add a secondary option (awkward to configure), or we could call into ceph-osd with something like 'ceph-osd -i 0 --does-backend-need-journal' so that a call into the backend code itself can control things. The latter is probably ideal.

Opened http://tracker.ceph.com/issues/9580 and copying ceph-devel

Yeah, looks good - the approach of asking ceph-osd if it needs a journal seems sound.

Regards
Mark