Re: wip-user status
On 08/04/2015 10:53 PM, Sage Weil wrote: I rebased the wip-user patches from wip-selinux-policy onto wip-selinux-policy-no-user + merge to master so that it sits on top of the newly-merged systemd changes. Great, so if it is in a build-ready state, I can try it with our virtual cluster install. Notes/issues: - ceph-osd-prestart.sh verifies that the osd_data dir is owned by either 'root' or 'ceph' or else it exits with an error. (Presumably systemd will fail to start the unit in this case.) It prints a helpful message pointing the user at 'ceph-disk chown ...'. - 'ceph-disk chown ...' is not implemented yet. Should it take the base device, like activate and prepare? Or a mounted path? Or either? It should be easy to convert device/mountpoint by using findmnt, so I would prefer whatever is more consistent with the user interface... IIRC, if the parameter is a base device, what should happen if the device is not mounted? If a mount path - then what about the other data/journal partitions? It seems to me that the parameter could be the base OSD device and chown will simply handle all its partitions. (So for an encrypted OSD it needs to get the key to unlock it, etc...) - Currently ceph-osd@.service unconditionally passes --setuser ceph to ceph-osd... even if the data directory is owned by root. I don't think systemd is smart enough to do this conditionally unless we make an ugly wrapper script that starts ceph-osd. Alternatively, we could make ceph-osd conditionally do the setuid based on the ownership of the directory, but... meh. The idea was to do the setuid *very* early in the startup process so that logging and so on are opened as the ceph user. Ideas? Well, systemd could do that if the service is generated (like e.g. cryptsetup activation jobs are generated according to crypttab). But this adds complexity that we do not need... Maybe another option is to use an environment variable (CEPH_USER or so), set it in the service Environment=/EnvironmentFile... and ceph-osd will use that... But I think some systemd gurus will find something better here :) Thanks, Milan
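As a rough illustration of the findmnt-based conversion suggested above (a sketch only; the OSD path, device, and the final chown step are placeholders, not part of the actual wip-user patches):

  OSD_PATH=/var/lib/ceph/osd/ceph-0                  # hypothetical OSD data directory
  DEV=$(findmnt -n -o SOURCE --target "$OSD_PATH")   # mount point -> backing device
  MNT=$(findmnt -n -o TARGET "$DEV")                 # device -> mount point (empty if not mounted)
  chown -R ceph:ceph "$OSD_PATH"                     # roughly what a per-partition 'ceph-disk chown' might do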
Re: Erasure Code Plugins : PLUGINS_V3 feature
Hi Sam, How does this proposal sound ? It would be great if that was done before the feature freeze. Cheers On 29/07/2015 11:16, Loic Dachary wrote: Hi Sam, The SHEC plugin[0] has been running in the rados runs[1] in the past few months. It also has a matching corpus verification which runs on every make check[2] as well as its optimized variants. I believe the flag experimental can now be removed. In order to do so, we need to use a PLUGINS_V3 feature, in the same way we did back in Giant when the ISA and LRC plugins were introduced[3]. This won't be necessary in the future, when there is a generic plugin mechanism, but right now that's what we need. It would be a commit very similar to the one implementing PLUGINS_V2[4]. Is this agreeable to you ? Or would you rather see another way to resolve this ? Cheers [0] https://github.com/ceph/ceph/tree/master/src/erasure-code/shec [1] https://github.com/ceph/ceph-qa-suite/tree/master/suites/rados/thrash-erasure-code-shec [2] https://github.com/ceph/ceph-erasure-code-corpus/blob/master/v0.92-988/non-regression.sh#L52 [3] http://tracker.ceph.com/issues/9343 [4] https://github.com/ceph/ceph/commit/9687150ceac9cc7e506bc227f430d4207a6d7489 -- Loïc Dachary, Artisan Logiciel Libre
Re: Transitioning Ceph from Autotools to CMake
On Tue, 2015-08-04 at 00:38 +0100, John Spray wrote: OK, here are vstart+ceph.in changes that work well enough in my out of tree build: https://github.com/ceph/ceph/pull/5457 Great! John On Mon, Aug 3, 2015 at 11:09 AM, John Spray jsp...@redhat.com wrote: On Sat, Aug 1, 2015 at 8:24 PM, Orit Wasserman owass...@redhat.com wrote: 3. no vstart.sh , starting working on this too but have less progress here. At the moment in order to use vstart I copy the exe and libs to src dir. I just started playing with CMake on Friday, adding some missing cephfs bits. I was going to fix (3) as well, but I don't want to duplicate work -- do you have an existing branch at all? Presumably this will mostly be a case of adding appropriate prefixes to commands. Cheers, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ceph-users] Is it safe to increase pg number in a production environment
Hi, comments inline. On 05 Aug 2015, at 05:45, Jevon Qiao qiaojianf...@unitedstack.com wrote: Hi Jan, Thank you for the detailed suggestion. Please see my reply in-line. On 5/8/15 01:23, Jan Schermer wrote: I think I wrote about my experience with this about 3 months ago, including what techniques I used to minimize impact on production. Basically we had to 1) increase pg_num in small increments only, because creating the placement groups themselves caused slow requests on OSDs 2) increase pgp_num in small increments and then go higher So you totally completed step 1 before jumping into step 2. Have you ever tried mixing them together? Increase pg_number, increase pgp_number, increase pg_number… Actually we first increased both to 8192 and then decided to go higher, but that doesn’t matter. The only reason for this was that the first step could run unattended at night without disturbing the workload.* The second step had to be attended. * in other words, we didn’t see “slow requests” because of our threshold settings, but while PGs were creating the cluster paused IO for non-trivial amounts of time. I suggest you do this in as small steps as possible, depending on your SLAs. We went from 4096 placement groups up to 16384. pg_num (the number of on-disk created placement groups) was increased like this: # for i in `seq 4096 64 16384` ; do ceph osd pool set $pool pg_num $i ; sleep 60 ; done this ran overnight (and the step was upped to 128 during the night) Increasing pgp_num was trickier in our case, first because it was heavy production and we wanted to minimize the visible impact and second because of wildly differing free space on the OSDs. We did it again in steps and waited for the cluster to settle before continuing. Each step upped pgp_num by about 2% and as we got higher (8192) we increased this to much more - the last step was 15360-16384 with the same impact the initial 4096-4160 had. The strategy you adopted looks great. I'll do some experiments on a test cluster to evaluate the real impact in each step. The end result is much better but still nowhere near optimal - a bigger impact would be upgrading to a newer Ceph release and setting the new tunables because we’re running Dumpling. Be aware that PGs cost some space (rough estimate is 5GB per OSD in our case), and also quite a bit of memory - each OSD has 1.7-2.0GB RSS right now while it only had about 1GB before. That’s a lot of memory and space with higher OSD counts... This is a good point. So along with the increment of PGs, we also need to take the current status of the cluster (the available disk space and memory for each OSD) into account and evaluate whether it is needed to add more resources. Depends on how much free space you have. We had some OSDs at close to 85% capacity before we started (and other OSDs at only 30%). When increasing the number of PGs the data shuffled greatly - but this depends on what CRUSH rules you have (and what version you are running). Newer versions with newer tunables will make this a lot easier I guess. And while I haven’t calculated the number of _objects_ per PG, we have differing numbers of _placement_groups_ per OSD (one OSD hosts 500, another hosts 1300) and this seems to be the cause of poor data balancing. In our environment, we also encountered the imbalanced mapping between PGs and OSDs. What kind of bucket algorithm was used in your environment? Any idea on how to minimize it? We are using straw because of Dumpling.
Straw2 should make everything better :-) Jan Thanks, Jevon Jan On 04 Aug 2015, at 18:52, Marek Dohojda mdoho...@altitudedigital.com wrote: I have done this not that long ago. My original PG estimates were wrong and I had to increase them. After increasing the PG numbers the Ceph rebalanced, and that took a while. To be honest in my case the slowdown wasn’t really visible, but it took a while. My strong suggestion to you would be to do it during a period of low IO, and be prepared that this will take quite a long time to accomplish. Do it slowly and do not increase multiple pools at once. It isn’t recommended practice but doable. On Aug 4, 2015, at 10:46 AM, Samuel Just sj...@redhat.com wrote: It will cause a large amount of data movement. Each new pg after the split will relocate. It might be ok if you do it slowly. Experiment on a test cluster. -Sam On Mon, Aug 3, 2015 at 12:57 AM, 乔建峰 scaleq...@gmail.com wrote: Hi Cephers, This is a greeting from Jevon. Currently, I'm experiencing an issue which troubles me a lot, so I'm writing to ask for your comments/help/suggestions. More details are provided below. Issue: I set up a cluster having 24 OSDs and created one pool with 1024 placement groups on it for a small startup company. The number 1024 was calculated per the equation 'OSDs * 100'/pool size. The cluster has been running
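For reference, the incremental pgp_num stepping Jan describes (small steps, waiting for the cluster to settle in between) could be sketched along these lines; the pool name, start value and step size below are placeholders rather than the values used in the thread:

  pool=rbd                              # placeholder pool name
  for i in `seq 4224 128 16384` ; do    # step chosen as roughly 2% of the target
      ceph osd pool set $pool pgp_num $i
      # crude "settled" check: wait until health no longer reports peering/backfill/recovery
      while ceph health | grep -qE 'peering|backfill|recover' ; do sleep 60 ; done
  done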
Re: More ondisk_finisher thread?
On Wed, 5 Aug 2015, Ding Dinghua wrote: 2015-08-05 0:13 GMT+08:00 Somnath Roy somnath@sandisk.com: Yes, it has to re-acquire pg_lock today.. But, between journal write and initiating the ondisk ack, there is one context switch in the code path. So, I guess the pg_lock is not the only one that is causing this 1 ms delay... Not sure increasing the finisher threads will help in the pg_lock case as it will be more or less serialized by this pg_lock.. My concern is, if the pg lock of pg A has been grabbed, not only is the ondisk callback of pg A delayed; since ondisk_finisher has only one thread, the ondisk callbacks of other pgs will be delayed too. I wonder if an optimistic approach might help here by making the completion synchronous and doing something like if (pg->lock.TryLock()) { pg->_finish_thing(completion->op); delete completion; } else { finisher.queue(completion); } or whatever. We'd need to ensure that we aren't holding any lock or throttle budget that the pg could deadlock against. sage
Re: Transitioning Ceph from Autotools to CMake
Dear Ali, my point is no longer relevant, but your reassurances are still very relevant. Thanks Owen On 08/04/2015 08:26 PM, Ali Maredia wrote: Owen, I understand your concern, and don't think any transition will be made to CMake until all the functionality is in it and until it has been thoroughly vetted by the entire community to ensure a smooth transition. I pushed a branch earlier today called wip-cmake (https://github.com/ceph/ceph/tree/wip-cmake) and plan to continue Orit's make check work, and coordinate with John on the vstart work he's done already as my very first action items. -Ali - Original Message - From: Owen Synge osy...@suse.com To: Ali Maredia amare...@redhat.com, ceph-devel@vger.kernel.org Sent: Tuesday, August 4, 2015 6:42:31 AM Subject: Re: Transitioning Ceph from Autotools to CMake Dear Ali, I am glad you are making progress. Sadly I don't yet know cmake. Please consider the systemd wip branch. It might be wise to leave autotools around a little longer, until all functionality is in CMake. Best regards Owen On 07/30/2015 09:01 PM, Ali Maredia wrote: After discussing with several other Ceph developers and Sage, I wanted to start a discussion about making CMake the primary build system for Ceph. CMake works just fine as it is (make -j4 on master with CMake builds 350% faster than autotools!), but there's more work needed to make it into a first-class build system. Short term (1-2 weeks): - Making sure CMake works on all supported platforms: CentOS 7, RHEL 7, Ubuntu 14.04, 12.04, Fedora 22, Debian Jessie, Debian Wheezy are the target platforms I have noted to test on already. - Adding a target similar to make check - Creating CMake targets that build packages (such as for rpm or debian) - Writing documentation for those who haven't used CMake before to smooth the transition over - Making sure no targets or dependencies are missing from the current CMake build, and that CMake supports all current targets, configurations and options - Replacing the integration autotools has with any automated build/test systems such as the gitbuilder Longer term (2-4 weeks): - Removing the current autotools files, to avoid doubling build system workload - Adding more but shorter CMakeLists.txt files to a tree-like structure where a CMakeLists.txt would be in every subdirectory I'm already working on a target similar to the make check target, and plan on working on the other short term goals over the next weeks and beyond. I wanted to get feedback from the community on any reasons why someone started using CMake but stopped (ex: lack of functionality), and more broadly, on what other obstacles there might be for the transition. -Ali -- SUSE LINUX GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Maxfeldstraße 5 90409 Nürnberg Germany
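For anyone who has not tried the CMake build yet, the out-of-tree invocation referred to in this thread is just standard CMake usage; a minimal sketch (generic CMake, with no Ceph-specific options shown) is:

  mkdir build && cd build
  cmake ..      # configure against the top of the source tree
  make -j4      # the parallel build mentioned above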
Should pgls filters be part of object classes?
So, I've got this very-cephfs-specific piece of pgls filtering code in ReplicatedPG: https://github.com/ceph/ceph/commit/907a3c5a2ba8e3edda18d7edf89ccae7b9d91dc5 I'm not sure I'm sufficiently motivated to create some whole new plugin framework for these, but what about piggy-backing on object classes? I guess it would be an additional cls_register_filter(myfilter, my_callback_constructor) fn. tl;dr; How do people feel about extending object classes to include providing PGLS filters as well? John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Should pgls filters be part of object classes?
On Wed, 5 Aug 2015, John Spray wrote: So, I've got this very-cephfs-specific piece of pgls filtering code in ReplicatedPG: https://github.com/ceph/ceph/commit/907a3c5a2ba8e3edda18d7edf89ccae7b9d91dc5 I'm not sure I'm sufficiently motivated to create some whole new plugin framework for these, but what about piggy-backing on object classes? I guess it would be an additional cls_register_filter(myfilter, my_callback_constructor) fn. tl;dr; How do people feel about extending object classes to include providing PGLS filters as well? This seems like the way to do it... sage -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: radosgw - stuck ops
On Tue, Aug 4, 2015 at 3:23 PM, GuangYang yguan...@outlook.com wrote: Thanks for Sage, Yehuda and Sam's quick reply. Given the discussion so far, could I summarize into the following bullet points: 1 The first step we would like to pursue is to implement the following mechanism to avoid infinite waiting at radosgw side: 1.1. radosgw - send OP with a *fast_fail* flag 1.2. OSD - reply with -EAGAIN if the PG is *inactive* and the *fast_fail* flag is set 1.3. radosgw - upon receiving -EAGAIN, retry till a timeout interval is reached (properly with some back-off?), and if it eventually fails, convert -EAGAIN to some other error code and passes to upper layer. I'm not crazy about the 'fast_fail' name, maybe we can come up with a better describing term. Also, not 100% sure the EAGAIN is the error we want to see. Maybe the flag on the request could specify what would be the error code to return in this case? I think it's a good plan to start with, we can adjust things later. 2 In terms of management of radosgw's worker threads, I think we either pursue Sage's proposal (which could linearly increase the time it takes to stuck all worker threads depending how many threads we expand), or simply try sharding work queue (which we already has some basic building block)? The problem that I see with that proposal (missed it earlier, only seeing it now), is that when the threads actually wake up the system could become unusable. In any case, it's probably a lower priority at this point, we could rethink this area again later. Yehuda Can I start working on patch for 1 and then 2 as a lower priority? Thanks, Guang Date: Tue, 4 Aug 2015 10:14:06 -0700 Subject: Re: radosgw - stuck ops From: ysade...@redhat.com To: sw...@redhat.com CC: yguan...@outlook.com; sj...@redhat.com; yeh...@redhat.com; ceph-devel@vger.kernel.org On Tue, Aug 4, 2015 at 10:03 AM, Sage Weil sw...@redhat.com wrote: On Tue, 4 Aug 2015, Yehuda Sadeh-Weinraub wrote: On Tue, Aug 4, 2015 at 9:55 AM, Sage Weil sw...@redhat.com wrote: One solution that I can think of is to determine before the read/write whether the pg we're about to access is healthy (or has been unhealthy for a short period of time), and if not to cancel the request before sending the operation. This could mitigate the problem you're seeing at the expense of availability in some cases. We'd need to have a way to query pg health through librados which we don't have right now afaik. Sage / Sam, does that make sense, and/or possible? This seems mostly impossible because we don't know ahead of time which PG(s) a request is going to touch (it'll generally be a lot of them)? Barring pgls() and such, each rados request that radosgw produces will only touch a single pg, right? Oh, yeah. I thought you meant before each RGW request. If it's at the rados level then yeah, you could avoid stuck pgs, although I think a better approach would be to make the OSD reply with -EAGAIN in that case so that you know the op didn't happen. There would still be cases (though more rare) where you weren't sure if the op happened or not (e.g., when you send to osd A, it goes down, you resend to osd B, and then you get EAGAIN/timeout). If done on the client side then we should only make it apply to the first request sent. Is it actually a problem if the osd triggered the error? What would you do when you get that failure/timeout, though? Is it practical to abort the rgw request handling completely? It should be like any error that happens through the transaction (e.g., client disconnection). 
Yehuda -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
08/05/2015 Weekly Ceph Performance Meeting IS ON!
8AM PST as usual (that's in 13 minutes folks!) No specific topics for this week, please feel free to add your own! Here's the links: Etherpad URL: http://pad.ceph.com/p/performance_weekly To join the Meeting: https://bluejeans.com/268261044 To join via Browser: https://bluejeans.com/268261044/browser To join with Lync: https://bluejeans.com/268261044/lync To join via Room System: Video Conferencing System: bjn.vc -or- 199.48.152.152 Meeting ID: 268261044 To join via Phone: 1) Dial: +1 408 740 7256 +1 888 240 2560(US Toll Free) +1 408 317 9253(Alternate Number) (see all numbers - http://bluejeans.com/numbers) 2) Enter Conference ID: 268261044 Mark -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
civetweb health check
Hi, We are planning to move our radosgw setup from apache to civetweb. We were able to successfully set up and run civetweb on a test cluster. The radosgw instances are fronted by a VIP which currently checks their health by fetching a /status.html file; after moving to civetweb the VIP is unable to get the health of the radosgw server using the /status.html endpoint and assumes the server is down. I looked at the ceph radosgw documentation and did not find any configuration to rewrite URLs. What is the best approach for the VIP to get the health of radosgw? Thanks Srikanth
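One possible interim approach, assuming the VIP can run an arbitrary HTTP probe: check a URL that radosgw itself serves on the civetweb port, e.g. the root URL (which normally returns a 200 for an anonymous request), instead of a static /status.html. A curl-based sketch, with a placeholder host and the port from the config above:

  curl -sf -o /dev/null http://rgw-host:5632/ && echo up || echo down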
Re: Erasure Code Plugins : PLUGINS_V3 feature
On Wed, 5 Aug 2015, Loic Dachary wrote: Hi Sam, How does this proposal sound ? It would be great if that was done before the feature freeze. I think it's a good time. Takeshi, note that what this really means is that the on-disk encoding needs to remain fixed. If we decide to change it down the line, we'll have to make a 'shec2' or similar so that the old format is still decodable (or ensure that existing data can still be read in some other way). Sound good? sage Cheers On 29/07/2015 11:16, Loic Dachary wrote: Hi Sam, The SHEC plugin[0] has been running in the rados runs[1] in the past few months. It also has a matching corpus verification which runs on every make check[2] as well as its optimized variants. I believe the flag experimental can now be removed. In order to do so, we need to use a PLUGINS_V3 feature, in the same way we did back in Giant when the ISA and LRC plugins were introduced[3]. This won't be necessary in the future, when there is a generic plugin mechanism, but right now that's what we need. It would be a commit very similar to the one implementing PLUGINS_V2[4]. Is this agreeable to you ? Or would you rather see another way to resolve this ? Cheers [0] https://github.com/ceph/ceph/tree/master/src/erasure-code/shec [1] https://github.com/ceph/ceph-qa-suite/tree/master/suites/rados/thrash-erasure-code-shec [2] https://github.com/ceph/ceph-erasure-code-corpus/blob/master/v0.92-988/non-regression.sh#L52 [3] http://tracker.ceph.com/issues/9343 [4] https://github.com/ceph/ceph/commit/9687150ceac9cc7e506bc227f430d4207a6d7489 -- Loïc Dachary, Artisan Logiciel Libre
Re: rgw and the next hammer release v0.94.3
On Tue, Aug 4, 2015 at 3:41 AM, Loic Dachary l...@dachary.org wrote: Hi Yehuda, The next hammer release as found at https://github.com/ceph/ceph/tree/hammer passed the rgw suite (http://tracker.ceph.com/issues/11990#rgw and http://tracker.ceph.com/issues/12502#note-6). Do you think the hammer branch is ready for QE to start their own round of testing ? Looks fine to me. Yehuda Cheers P.S. http://tracker.ceph.com/issues/11990#Release-information has direct links to the pull requests merged into hammer since v0.94.2 in case you need more context about one of them. -- Loïc Dachary, Artisan Logiciel Libre -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: test/run-rbd-unit-tests.sh : pure virtual method called
Hi, Here is another make check fail. They don't seem to be related. To the best of my knowledge these are the only two rbd related failures in make check during the past week. http://jenkins.ceph.dachary.org/job/ceph/LABELS=ubuntu-14.04x86_64/6884/console [ RUN ] TestLibRBD.ObjectMapConsistentSnap using new format! test/librbd/test_librbd.cc:2790: Failure Value of: passed Actual: false Expected: true [ FAILED ] TestLibRBD.ObjectMapConsistentSnap (396 ms) [--] Global test environment tear-down [==] 98 tests from 6 test cases ran. (10554 ms total) [ PASSED ] 97 tests. [ FAILED ] 1 test, listed below: [ FAILED ] TestLibRBD.ObjectMapConsistentSnap On 03/08/2015 18:01, Loic Dachary wrote: Hi, test/run-rbd-unit-tests.sh failed today on master on Ubuntu 14.04, when run by the make check bot on an unrelated pull request (modifying do_autogen which is not used by the make check bot). http://jenkins.ceph.dachary.org/job/ceph/LABELS=ubuntu-14.04x86_64/6834/console [ RUN ] TestInternal.MultipleResize pure virtual method called terminate called without an active exception Cheers -- Loïc Dachary, Artisan Logiciel Libre
Re: 08/05/2015 Weekly Ceph Performance Meeting IS ON!
Hi Mark, I Missed todays call :-( Could you please point me to the recording ? Etherpad link only shows recording until 07/08. __Neo On Wed, Aug 5, 2015 at 7:47 AM, Mark Nelson mnel...@redhat.com wrote: 8AM PST as usual (that's in 13 minutes folks!) No specific topics for this week, please feel free to add your own! Here's the links: Etherpad URL: http://pad.ceph.com/p/performance_weekly To join the Meeting: https://bluejeans.com/268261044 To join via Browser: https://bluejeans.com/268261044/browser To join with Lync: https://bluejeans.com/268261044/lync To join via Room System: Video Conferencing System: bjn.vc -or- 199.48.152.152 Meeting ID: 268261044 To join via Phone: 1) Dial: +1 408 740 7256 +1 888 240 2560(US Toll Free) +1 408 317 9253(Alternate Number) (see all numbers - http://bluejeans.com/numbers) 2) Enter Conference ID: 268261044 Mark -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
FileStore should not use syncfs(2)
Today I learned that syncfs(2) does an O(n) search of the superblock's inode list searching for dirty items. I've always assumed that it was only traversing dirty inodes (e.g., a list of dirty inodes), but that appears not to be the case, even on the latest kernels. That means that the more RAM in the box, the larger (generally) the inode cache, the longer syncfs(2) will take, and the more CPU you'll waste doing it. The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 servicing a very light workload, and each syncfs(2) call was taking ~7 seconds (usually to write out a single inode). A possible workaround for such boxes is to turn /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching pages instead of inodes/dentries)... I think the take-away though is that we do need to bite the bullet and make FileStore f[data]sync all the right things so that the syncfs call can be avoided. This is the path you were originally headed down, Somnath, and I think it's the right one. The main thing to watch out for is that according to POSIX you really need to fsync directories. With XFS that isn't the case since all metadata operations are going into the journal and that's fully ordered, but we don't want to allow data loss on e.g. ext4 (we need to check what the metadata ordering behavior is there) or other file systems. :( sage -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
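For completeness, the vfs_cache_pressure workaround mentioned above is a one-line sysctl change; the value 200 here is only an example (the default is 100, and higher values make the kernel reclaim inodes/dentries more aggressively):

  sysctl -w vm.vfs_cache_pressure=200
  # equivalently: echo 200 > /proc/sys/vm/vfs_cache_pressure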
RE: FileStore should not use syncfs(2)
Thanks Sage for digging down..I was suspecting something similar.. As I mentioned in today's call, in idle time also syncfs is taking ~60ms. I have 64 GB of RAM in the system. The workaround I was talking about today is working pretty good so far. In this implementation, I am not giving much work to syncfs as each worker thread is writing with o_dsync mode. I am issuing syncfs before trimming the journal and most of the time I saw it is taking 100 ms. I have to wake up the sync_thread now after each worker thread finished writing. I will benchmark both the approaches. As we discussed earlier, in case of only fsync approach, we still need to do a db sync to make sure the leveldb stuff persisted, right ? Thanks Regards Somnath -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Wednesday, August 05, 2015 2:27 PM To: Somnath Roy Cc: ceph-devel@vger.kernel.org; sj...@redhat.com Subject: FileStore should not use syncfs(2) Today I learned that syncfs(2) does an O(n) search of the superblock's inode list searching for dirty items. I've always assumed that it was only traversing dirty inodes (e.g., a list of dirty inodes), but that appears not to be the case, even on the latest kernels. That means that the more RAM in the box, the larger (generally) the inode cache, the longer syncfs(2) will take, and the more CPU you'll waste doing it. The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 servicing a very light workload, and each syncfs(2) call was taking ~7 seconds (usually to write out a single inode). A possible workaround for such boxes is to turn /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching pages instead of inodes/dentries)... I think the take-away though is that we do need to bite the bullet and make FileStore f[data]sync all the right things so that the syncfs call can be avoided. This is the path you were originally headed down, Somnath, and I think it's the right one. The main thing to watch out for is that according to POSIX you really need to fsync directories. With XFS that isn't the case since all metadata operations are going into the journal and that's fully ordered, but we don't want to allow data loss on e.g. ext4 (we need to check what the metadata ordering behavior is there) or other file systems. :( sage PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: FileStore should not use syncfs(2)
On 08/05/2015 04:26 PM, Sage Weil wrote: Today I learned that syncfs(2) does an O(n) search of the superblock's inode list searching for dirty items. I've always assumed that it was only traversing dirty inodes (e.g., a list of dirty inodes), but that appears not to be the case, even on the latest kernels. That means that the more RAM in the box, the larger (generally) the inode cache, the longer syncfs(2) will take, and the more CPU you'll waste doing it. The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 servicing a very light workload, and each syncfs(2) call was taking ~7 seconds (usually to write out a single inode). A possible workaround for such boxes is to turn /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching pages instead of inodes/dentries)... FWIW, I often see performance increase when favoring inode/dentry cache, but probably with far fewer inodes that the setup you just saw. It sounds like there needs to be some maximum limit on the inode/dentry cache to prevent this kind of behavior but still favor it up until that point. Having said that, maybe avoiding syncfs is best as you say below. I think the take-away though is that we do need to bite the bullet and make FileStore f[data]sync all the right things so that the syncfs call can be avoided. This is the path you were originally headed down, Somnath, and I think it's the right one. The main thing to watch out for is that according to POSIX you really need to fsync directories. With XFS that isn't the case since all metadata operations are going into the journal and that's fully ordered, but we don't want to allow data loss on e.g. ext4 (we need to check what the metadata ordering behavior is there) or other file systems. :( sage -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Consult some problems of Ceph when reading source code
Dear developers, My name is Cai Yi, and I am a graduate student majoring in CS at Xi’an Jiaotong University in China. From Ceph’s homepage, I know Sage is the author of Ceph, and I got the email address from your GitHub and Ceph’s official website. Because Ceph is an excellent distributed file system, I have recently been reading the source code of Ceph (the Hammer edition) to understand the IO good path and the performance of Ceph. However, I am facing some problems for which I could not find a solution on the Internet or solve by myself and my partners. So I was wondering if you could help us solve some problems. The problems are as follows: 1) In Ceph, there is the concept of a transaction. When the OSD receives a write request, it is encapsulated in a transaction. But when the OSD receives many requests, is there a transaction queue to receive the messages? If there is a queue, are these transactions submitted serially or in parallel for the next operation? If it is serial, could the transaction operations influence the performance? 2) From some documents about Ceph, if the OSD receives a read request, the OSD can only read data from the primary and then send it back to the client. Is that description right? Is there any way to read the data from a replica OSD? Do we have to request the data from the primary OSD when dealing with a read request? If not, and we can read from a replica OSD, can we still guarantee consistency? 3) When the OSD receives a message, the message’s attribute may be normal dispatch or fast dispatch. What is the difference between the normal dispatch and the fast dispatch? If the attribute is normal dispatch, it enters the dispatch queue. Is there a single dispatch queue or are there multiple dispatch queues to deal with all the messages? These are the problems I am facing. Thank you for your patience and cooperation, and I look forward to hearing from you. Yours sincerely Cai
Re: [ANN] ceph-deploy 1.5.27 released
Hi Nigel, On Wed, Aug 5, 2015 at 9:00 PM, Nigel Williams nigel.willi...@utas.edu.au wrote: On 6/08/2015 9:45 AM, Travis Rhoden wrote: A new version of ceph-deploy has been released. Version 1.5.27 includes the following: Has the syntax for use of --zap-disk changed? I moved it around but it is no longer recognised; worked around by doing a ceph-disk zap before running ceph-deploy. A few things in this area changed with 1.5.26. ceph-deploy's options are much more strictly attached only to the commands where they make sense. This worked previously: ceph-deploy --overwrite-conf osd --zap-disk prepare ceph05:/dev/sdb:/dev/sdd --zap-disk is an option to 'prepare', not to 'osd'. ceph-deploy osd --zap-disk list doesn't make any sense, for example. The help menus should make this clear: # ceph-deploy osd --help usage: ceph-deploy osd [-h] {list,create,prepare,activate} ... # ceph-deploy osd prepare --help usage: ceph-deploy osd prepare [-h] [--zap-disk] [--fs-type FS_TYPE] [--dmcrypt] [--dmcrypt-key-dir KEYDIR]
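Applied to the command from the original report, the new option placement would presumably look like this (same host and disk spec, shown only as an illustration):

  ceph-deploy --overwrite-conf osd prepare --zap-disk ceph05:/dev/sdb:/dev/sdd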
Re: FileStore should not use syncfs(2)
Agree On Thu, Aug 6, 2015 at 5:38 AM, Somnath Roy somnath@sandisk.com wrote: Thanks Sage for digging down..I was suspecting something similar.. As I mentioned in today's call, in idle time also syncfs is taking ~60ms. I have 64 GB of RAM in the system. The workaround I was talking about today is working pretty good so far. In this implementation, I am not giving much work to syncfs as each worker thread is writing with o_dsync mode. I am issuing syncfs before trimming the journal and most of the time I saw it is taking 100 ms. Actually I prefer we don't use syncfs anymore. I more like to use aio+dio+Filestore custom cache to deal with all syncfs+pagecache things. So we even can make cache more smart to aware of upper levels instead of fadvise* calls. Second we can use checkpoint method like mysql innodb, we can know the bw of frontend(filejournal) and decide how much and how often we want to flush(using aio+dio). Anyway, because it's a big project, we may prefer to work at newstore instead of filestore. I have to wake up the sync_thread now after each worker thread finished writing. I will benchmark both the approaches. As we discussed earlier, in case of only fsync approach, we still need to do a db sync to make sure the leveldb stuff persisted, right ? Thanks Regards Somnath -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Wednesday, August 05, 2015 2:27 PM To: Somnath Roy Cc: ceph-devel@vger.kernel.org; sj...@redhat.com Subject: FileStore should not use syncfs(2) Today I learned that syncfs(2) does an O(n) search of the superblock's inode list searching for dirty items. I've always assumed that it was only traversing dirty inodes (e.g., a list of dirty inodes), but that appears not to be the case, even on the latest kernels. That means that the more RAM in the box, the larger (generally) the inode cache, the longer syncfs(2) will take, and the more CPU you'll waste doing it. The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 servicing a very light workload, and each syncfs(2) call was taking ~7 seconds (usually to write out a single inode). A possible workaround for such boxes is to turn /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching pages instead of inodes/dentries)... I think the take-away though is that we do need to bite the bullet and make FileStore f[data]sync all the right things so that the syncfs call can be avoided. This is the path you were originally headed down, Somnath, and I think it's the right one. The main thing to watch out for is that according to POSIX you really need to fsync directories. With XFS that isn't the case since all metadata operations are going into the journal and that's fully ordered, but we don't want to allow data loss on e.g. ext4 (we need to check what the metadata ordering behavior is there) or other file systems. I guess there only a little directory modify operations, is it true? Maybe we only need to do syncfs when modifying directories? :( sage PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Best Regards, Wheat -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANN] ceph-deploy 1.5.27 released
On 6/08/2015 2:22 PM, Travis Rhoden wrote: A few things in this area changed with 1.5.26. ceph-deploys options are much more strictly attached only to the commands where they make sense. Oh much better, thanks. I did wonder about that, but as it worked I didn't revisit. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANN] ceph-deploy 1.5.27 released
On 6/08/2015 9:45 AM, Travis Rhoden wrote: A new version of ceph-deploy has been released. Version 1.5.27 includes the following: Has the syntax for use of --zap-disk changed? I moved it around but it is no longer recognised; worked around by doing a ceph-disk zap before running ceph-deploy. This worked previously: ceph-deploy --overwrite-conf osd --zap-disk prepare ceph05:/dev/sdb:/dev/sdd -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hammer branch for v0.94.3 ready for QE
Hi Yuri, The hammer branch for v0.94.3, as found at https://github.com/ceph/ceph/commits/hammer, is approved by the leads (Sam, Greg, Josh, Yehuda) and is ready for QE. The backport itself is tracked at http://tracker.ceph.com/issues/11990, which has a record of all the test runs so far. The tip of the branch is at 88e7ee716fdd7bcf81845087021a677de5a50da8. Regards Abhishek
[ANN] ceph-deploy 1.5.27 released
Hi everyone, A new version of ceph-deploy has been released. Version 1.5.27 includes the following: - a new ceph-deploy repo command that allows for adding and removing custom repo definitions - Makes commands like ceph-deploy install --rgw only install the RGW component of Ceph. This works for daemons/components such as --rgw, --mds, and --cli, depending on how packages are split on your distro. For example, Debian packages the Ceph MDS into a separate 'ceph-mds' package, and therefore if you use install --mds only the ceph-mds package will be installed. RPM packages do not do this, so it has to install ceph, which includes MDS, MON, and OSD daemons. Further package splits are coming, but right now we do what we can. - Some fixes around using DNF (Fedora >= 22) - Early support for systemd (Fedora 22 and development Ceph builds only) - Loads of internal changes. Full changelog is at [1]. Updated packages have been uploaded to {rpm,debian}-{firefly,hammer,testing} repos on ceph.com, and to PyPI. Cheers, - Travis [1] http://ceph.com/ceph-deploy/docs/changelog.html#id2
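As an example of the component-specific install mentioned above (the hostname is hypothetical):

  ceph-deploy install --rgw gateway-node1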
radosgw + civetweb latency issue on Hammer
Hi, After upgrading to Hammer and moving from apache to civetweb, we started seeing high PUT latency on the order of 2 sec for every PUT request. The GET request latency looks fine. Attaching the radosgw logs for a single request. The ceph.conf has the following configuration for civetweb: [client.radosgw.gateway] rgw frontends = civetweb port=5632 Further investigation revealed that the call to get_data() at https://github.com/ceph/ceph/blob/hammer/src/rgw/rgw_op.cc#L1786 is taking 2 sec to respond. The cluster is running the Hammer 0.94.2 release. Did anyone face this issue before? Is there some configuration I am missing? Regards Srikanth
RE: Erasure Code Plugins : PLUGINS_V3 feature
Dear Sage, note that what this really means is that the on-disk encoding needs to remain fixed. Thank you for letting us know the important notice. We have no plan to change shec's format at this moment, but we will remember the comment for any future events. Best Regards, Takeshi Miyamae -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Thursday, August 6, 2015 3:45 AM To: Loic Dachary; Miyamae, Takeshi/宮前 剛 Cc: Samuel Just; Ceph Development Subject: Re: Erasure Code Plugins : PLUGINS_V3 feature On Wed, 5 Aug 2015, Loic Dachary wrote: Hi Sam, How does this proposal sound ? It would be great if that was done before the feature freeze. I think it's a good time. Takeshi, note that what this really means is that the on-disk encoding needs to remain fixed. If we decide to change it down the line, we'll have to make a 'shec2' or similar so that the old format is still decodable (or ensure that existing data can still be read in some other way). Sound good? sage Cheers On 29/07/2015 11:16, Loic Dachary wrote: Hi Sam, The SHEC plugin[0] has been running in the rados runs[1] in the past few months. It also has a matching corpus verification which runs on every make check[2] as well as its optimized variants. I believe the flag experimental can now be removed. In order to do so, we need to use a PLUGINS_V3 feature, in the same way we did back in Giant when the ISA and LRC plugins were introduced[3]. This won't be necessary in the future, when there is a generic plugin mechanism, but right now that's what we need. It would be a commit very similar to the one implementing PLUGINS_V2[4]. Is this agreeable to you ? Or would you rather see another way to resolve this ? Cheers [0] https://github.com/ceph/ceph/tree/master/src/erasure-code/shec [1] https://github.com/ceph/ceph-qa-suite/tree/master/suites/rados/thras h-erasure-code-shec [2] https://github.com/ceph/ceph-erasure-code-corpus/blob/master/v0.92-9 88/non-regression.sh#L52 [3] http://tracker.ceph.com/issues/9343 [4] https://github.com/ceph/ceph/commit/9687150ceac9cc7e506bc227f430d420 7a6d7489 -- Loïc Dachary, Artisan Logiciel Libre