Re: timeout 120 teuthology-kill is highly recommended
I was thinking of teuthology-nuke though! Thx YuriW

- Original Message -
From: Yuri Weinstein ywein...@redhat.com
To: Loic Dachary l...@dachary.org
Cc: Ceph Development ceph-devel@vger.kernel.org
Sent: Tuesday, July 21, 2015 9:33:26 AM
Subject: Re: timeout 120 teuthology-kill is highly recommended

Loic, I don't use teuthology-kill simultaneously, only sequentially. As far as run time, just as a note, when we use the 'stale' arg and it invokes the ipmitool interface it does take a while to finish. Thx YuriW

- Original Message -
From: Loic Dachary l...@dachary.org
To: Ceph Development ceph-devel@vger.kernel.org
Sent: Tuesday, July 21, 2015 9:13:04 AM
Subject: timeout 120 teuthology-kill is highly recommended

Hi Ceph,

Today I did something wrong and that blocked the lab for a good half hour:

a) I ran two teuthology-kill simultaneously, and that made them deadlock each other;
b) I let them run unattended, only to come back to the terminal 30 minutes later and see them stuck.

Sure, two simultaneous teuthology-kill runs should not deadlock, and that needs to be fixed. But the easy workaround to avoid that trouble is simply not to let it run forever. Even for ~200 jobs it takes at most a minute or two, and if it takes longer it probably means another teuthology-kill is competing with it and it should be interrupted and restarted later. From now on I'll use

timeout 120 teuthology-kill || echo FAIL!

as a generic safeguard. Apologies for the troubles.

--
Loïc Dachary, Artisan Logiciel Libre

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: teuthology rados runs for next
Ok!

teuthology-kill -m multi -r teuthology-2015-07-18_21:00:09-rados-next-distro-basic-multi
teuthology-kill -m multi -r teuthology-2015-07-20_21:00:09-rados-next-distro-basic-multi

I observed that the older http://pulpito.ceph.com/teuthology-2015-07-17_21:00:10-rados-next-distro-basic-multi is halfway through but does not seem to make progress, while the one scheduled today, http://pulpito.ceph.com/teuthology-2015-07-19_21:00:10-rados-next-distro-basic-multi, has one job running. Do you think it best to kill the newer so it does not compete for resources that would prevent the older from finishing? I'd be tempted to kill the newer, because it's so difficult to get jobs running right now that it makes sense to preserve a run that already managed to pass 100 jobs :-)

Cheers

On 21/07/2015 15:43, Sage Weil wrote:
On Tue, 21 Jul 2015, Loic Dachary wrote:
Hi Sam, I noticed today that http://pulpito.ceph.com/?suite=rados&branch=next is lagging three days behind. Do we want to keep all the runs or should we kill the older ones? I suppose there would be value in having the results for all of them, but given the current load in the sepia lab it also significantly delays them. What do you think?

I think it's better to kill old scheduled runs.

--
Loïc Dachary, Artisan Logiciel Libre
Re: timeout 120 teuthology-kill is highly recommended
Loic, I don't use teuthology-kill simultaneously, only sequentially. As far as run time, just as a note, when we use the 'stale' arg and it invokes the ipmitool interface it does take a while to finish. Thx YuriW

- Original Message -
From: Loic Dachary l...@dachary.org
To: Ceph Development ceph-devel@vger.kernel.org
Sent: Tuesday, July 21, 2015 9:13:04 AM
Subject: timeout 120 teuthology-kill is highly recommended

Hi Ceph,

Today I did something wrong and that blocked the lab for a good half hour:

a) I ran two teuthology-kill simultaneously, and that made them deadlock each other;
b) I let them run unattended, only to come back to the terminal 30 minutes later and see them stuck.

Sure, two simultaneous teuthology-kill runs should not deadlock, and that needs to be fixed. But the easy workaround to avoid that trouble is simply not to let it run forever. Even for ~200 jobs it takes at most a minute or two, and if it takes longer it probably means another teuthology-kill is competing with it and it should be interrupted and restarted later. From now on I'll use

timeout 120 teuthology-kill || echo FAIL!

as a generic safeguard. Apologies for the troubles.

--
Loïc Dachary, Artisan Logiciel Libre
9.0.2 test/perf_local.cc on non-x86 architectures
I was trying to do an rpmbuild of v9.0.2 for aarch64 and got the following error:

test/perf_local.cc: In function 'double div32()':
test/perf_local.cc:396:31: error: impossible constraint in 'asm'
cc);

It probably should have an #if defined(__i386__) guard around it.

--
Tom
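A guard along the lines Tom suggests might look like the sketch below. This is only an illustration of the shape of the fix, not the actual upstream change: the function name comes from the error output, and the non-x86 fallback value is an assumption.

```cpp
// Sketch: compile the x86-only inline-asm benchmark path conditionally,
// so non-x86 builds (e.g. aarch64) do not hit "impossible constraint in 'asm'".
double div32()
{
#if defined(__i386__) || defined(__x86_64__)
    // x86 path: the real code times the 32-bit "div" instruction via
    // inline asm with a "cc" clobber; elided here.
    return 0.0;  // placeholder for the measured time
#else
    return -1.0; // assumed sentinel: benchmark unsupported on this arch
#endif
}
```

The caller would then be expected to treat the sentinel as "benchmark skipped" rather than as a measurement.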
Re: [Documentation] Hardware recommendation : RAM and PGLog
OK, I now understand the need for transactions so that the trim takes place after changing the settings. What is the risk of setting too low a value for the osd_min_pg_log_entries parameter (as opposed to osd_max_pg_log_entries, which applies in a degraded environment)?

David.

On 07/20/2015 03:13 PM, Sage Weil wrote:
On Sun, 19 Jul 2015, David Casier AEVOO wrote:
Hi, I have a question about PGLog and RAM consumption. In the documentation we read: OSDs do not require as much RAM for regular operations (e.g., 500MB of RAM per daemon instance); however, during recovery they need significantly more RAM (e.g., ~1GB per 1TB of storage per daemon). But in fact, all pg logs are read at the start of the ceph-osd daemon and put in RAM ( pg->read_state(store, bl); ). Is this normal behavior or do I have a defect in my environment?

There are two tunables that control how many pg log entries we keep around. When the PG is healthy, we keep ~1000, and when the PG is degraded, we keep more, to expand the time window over which a recovering OSD will be able to do regular log-based recovery instead of a more expensive backfill. This is one source of additional memory. Others are the missing sets (lists of missing/degraded objects) and messages/data/state associated with objects that are being recovered/copied.

Note that the numbers in the documentation are pretty rough rules of thumb. At some point it would be great to build a model for how much RAM the osd consumes as a function of the various configurables (pg log size, pg count, avg object size, etc.).

sage
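For reference, the two tunables Sage describes correspond to the config options below; the values shown are purely illustrative, not recommendations:

```ini
[osd]
# entries kept per PG while the PG is healthy (the trimming floor)
osd min pg log entries = 3000
# entries kept per PG while the PG is degraded, to widen the window
# for log-based recovery before falling back to backfill
osd max pg log entries = 10000
```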
Re: local teuthology testing
Hi,

Since July 18th teuthology no longer uses chef, so this issue has been resolved! Using ansible requires configuration ( http://dachary.org/?p=3752 explains it shortly; maybe there is something in the documentation, but I did not pay enough attention to be sure ). At the end of http://dachary.org/?p=3752 you will see a list of configurable values, and I suspect Andrew Zack would be more than happy to explain how any hardcoded leftover can be stripped :-)

Cheers

On 21/07/2015 14:58, Shinobu Kinjo wrote:
Hi, I think that you have to show us such URLs for anyone who would hit the same issue.
Sincerely, Kinjo

On Tue, Jul 21, 2015 at 9:52 PM, Zhou, Yuan yuan.z...@intel.com wrote:
Hi David/Loic, I was also trying to make some local Teuthology clusters here. The biggest issue I met is in ceph-qa-chef: there are lots of hardcoded URLs related to the sepia lab, and I have to trace the code and change them line by line. Can you please kindly share how you got this to work? Is there an easy way to fix this?
Thanks, -yuan

--
Life w/ Linux http://i-shinobu.hatenablog.com/

--
Loïc Dachary, Artisan Logiciel Libre
timeout 120 teuthology-kill is highly recommended
Hi Ceph,

Today I did something wrong and that blocked the lab for a good half hour:

a) I ran two teuthology-kill simultaneously, and that made them deadlock each other;
b) I let them run unattended, only to come back to the terminal 30 minutes later and see them stuck.

Sure, two simultaneous teuthology-kill runs should not deadlock, and that needs to be fixed. But the easy workaround to avoid that trouble is simply not to let it run forever. Even for ~200 jobs it takes at most a minute or two, and if it takes longer it probably means another teuthology-kill is competing with it and it should be interrupted and restarted later. From now on I'll use

timeout 120 teuthology-kill || echo FAIL!

as a generic safeguard. Apologies for the troubles.

--
Loïc Dachary, Artisan Logiciel Libre
Re: timeout 120 teuthology-kill is highly recommended
On Tue, Jul 21, 2015 at 5:13 PM, Loic Dachary l...@dachary.org wrote:
Hi Ceph, Today I did something wrong and that blocked the lab for a good half hour. a) I ran two teuthology-kill simultaneously, and that made them deadlock each other; b) I let them run unattended, only to come back to the terminal 30 minutes later and see them stuck. Sure, two simultaneous teuthology-kill runs should not deadlock, and that needs to be fixed. But the easy workaround to avoid that trouble is simply not to let it run forever. Even for ~200 jobs it takes at most a minute or two.

Mmm, I'm not sure that's correct if you're killing jobs which are actually running — teuthology-nuke (which it will invoke) can take a while, and you definitely don't want to time that out! So beware for in-process runs.
-Greg
Re: timeout 120 teuthology-kill is highly recommended
Greg, Yuri: I stand corrected; I should have been less affirmative on a topic I know little about. Thanks!

On 21/07/2015 18:33, Yuri Weinstein wrote:
Loic, I don't use teuthology-kill simultaneously, only sequentially. As far as run time, just as a note, when we use the 'stale' arg and it invokes the ipmitool interface it does take a while to finish. Thx YuriW

- Original Message -
From: Loic Dachary l...@dachary.org
To: Ceph Development ceph-devel@vger.kernel.org
Sent: Tuesday, July 21, 2015 9:13:04 AM
Subject: timeout 120 teuthology-kill is highly recommended

Hi Ceph, Today I did something wrong and that blocked the lab for a good half hour. a) I ran two teuthology-kill simultaneously, and that made them deadlock each other; b) I let them run unattended, only to come back to the terminal 30 minutes later and see them stuck. Sure, two simultaneous teuthology-kill runs should not deadlock, and that needs to be fixed. But the easy workaround to avoid that trouble is simply not to let it run forever. Even for ~200 jobs it takes at most a minute or two, and if it takes longer it probably means another teuthology-kill is competing with it and it should be interrupted and restarted later. From now on I'll use timeout 120 teuthology-kill || echo FAIL! as a generic safeguard. Apologies for the troubles.

--
Loïc Dachary, Artisan Logiciel Libre
Re: The design of the eviction improvement
Thanks for the explanations, Greg.

- Gregory Farnum g...@gregs42.com wrote:
On Tue, Jul 21, 2015 at 3:15 PM, Matt W. Benjamin m...@cohortfs.com wrote:
Hi, Couple of points.
1) A successor to 2Q is MQ (Li et al). We have an intrusive MQ LRU implementation with 2 levels currently, plus a pinned queue, that addresses stuff like partitioning (sharding), scan resistance, and coordination with lookup tables. We might extend/re-use it.
2) I'm a bit confused by the active/inactive vocabulary, the dimensioning of cache segments (are you proposing to/do we now always cache whole objects?), and the cost of looking for dirty objects; I suspect that it makes sense to amortize the cost of locating segments eligible to be flushed, rather than to minimize bookkeeping.

We make caching decisions in terms of whole objects right now, yeah. There's really nothing in the system that's capable of doing segments within an object, and it's not just about tracking a little more metadata about dirty objects — the way we handle snapshots, etc. would have to be reworked if we were allowing partial-object caching. Plus keep in mind the IO cost of the bookkeeping — it needs to be either consistently persisted to disk or reconstructable from whatever happens to be in the object. That can get expensive really fast.
-Greg

With the current semantics/structure of PGs plus a specific tier held fixed, that makes sense. For our object addressing currently, we have a greater requirement for partial object caching. (Partly, we did this to achieve periodicity with sequential I/O.) I think, broadly, there are large performance tradeoffs here. In AFS and DCE, there is full consistency in materialized caches. Also, caches are dimensioned by chunks. If the cache is materialized in memory, the semantics aren't those of disk. Basically, consistency guarantees are policy. Different snapshot mechanisms, or omitting them, e.g., should logically enable relaxed consistency, modulo policy.
Matt

- Zhiqiang Wang zhiqiang.w...@intel.com wrote:
-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Tuesday, July 21, 2015 6:38 AM
To: Wang, Zhiqiang
Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: Re: The design of the eviction improvement

On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
Hi all, This is a follow-up of one of the CDS sessions at http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction. We discussed the drawbacks of the current eviction algorithm and several ways to improve it. Seems like the LRU variants are the right way to go. I came up with some design points after the CDS, and want to discuss them with you. It is an approximate 2Q algorithm, combining some benefits of the clock algorithm, similar to what the linux kernel does for the page cache.

Unfortunately I missed this last CDS, so I'm behind on the discussion. I have a few questions though...

# Design points:
## LRU lists
- Maintain LRU lists at the PG level. The SharedLRU and SimpleLRU implementations in the current code have a max_size, which limits the max number of elements in the list. This mostly looks like an MRU, though its name implies they are LRUs. Since the object size may vary in a PG, it's not possible to calculate the total number of objects which the cache tier can hold ahead of time. We need a new LRU implementation with no limit on the size.

This last sentence seems to me to be the crux of it. Assuming we have an OSD backed by flash storing O(n) objects, we need a way to maintain an LRU of O(n) objects in memory. The current hitset-based approach was taken based on the assumption that this wasn't feasible--or at least we didn't know how to implement such a thing. If it is, or we simply want to stipulate that cache tier OSDs get gobs of RAM to make it possible, then lots of better options become possible...
Let's say you have a 1TB SSD with an average object size of 1MB -- that's 1 million objects. At maybe ~100 bytes of RAM per object for an LRU entry, that's 100MB... so not so unreasonable, perhaps!

I was having the same question before proposing this. I did a similar calculation and thought it would be OK to use this much memory :-)

- Two lists for each PG: active and inactive. Objects are first put into the inactive list when they are accessed, and moved between these two lists based on some criteria.
- Object flags: active, referenced, unevictable, dirty.
- When an object is accessed:
1) If it's in neither of the lists, it's put on the top of the inactive list
2) If it's in the inactive list and the referenced flag is not set, the referenced flag is set, and it's
Re: MVC in ceph-deploy.
Hi Owen,

I think the primary concern I have, and the one I want to discuss more, is cluster state discovery. I'm worried about how this scales. Normally when I think about MVC, I think of a long-running application or something with a persistent data store for the model (or both). ceph-deploy is neither of these, so the act of querying the cluster first and loading all the data into the (memory-backed) model upon every ceph-deploy invocation concerns me. For querying pools and monitors it seems fine, but what happens when we have 1000s of OSDs? I don't think we'd want to add any kind of data persistence to ceph-deploy (not that you've suggested it), so that means we would have to load the model every time. Right now we take more of a Pythonic approach of "it's better to ask for forgiveness than permission", meaning that we just try to do the action in question and deal with any exceptions that arise. I'm a little hesitant to try to infer or pull much knowledge of a Ceph cluster's state into ceph-deploy, as the cluster's state is quite complicated and dynamic simply because it's a large distributed system. Of course the monitors deal with that for us, and I think we would just be querying the monitor(s) for the latest state.

In general that's my primary feedback. Are there issues with scaling to large clusters? Does it become a large overhead to load the model if someone runs ceph-deploy repeatedly (say, adding OSDs to one node, then the next, then the next, with separate calls to ceph-deploy each time)? How do we deal with updating the model in failure scenarios? We have to re-query the monitor and update the model to make sure our local representation is accurate. I suppose that applies even for non-failure cases. For example, when we are proposing a change to the model, say we want to add an OSD: we go off to a node to create/deploy the OSD and everything comes back okay. But to really be sure, we probably need to query the monitor to see what the status of the OSD is (is it defined? is it up?) and to re-sync the model. It seems like a lot of back-and-forth interaction to keep the model up to date, and ultimately we lose all that information when the application exits.

That's my initial feedback.
- Travis

On Fri, Jul 17, 2015 at 2:31 AM, Owen Synge osy...@suse.com wrote:
On 07/17/2015 08:16 AM, Travis Rhoden wrote:
Hi Owen, I'm still giving this one some thought. I've gone back and reviewed https://github.com/ceph/ceph-deploy/pull/320 a few more times. I do understand how it works (it took a couple of times through it), and, cosmetic things notwithstanding, I can appreciate what it is doing. I also fully get that the choice of sqlalchemy vs the choice of data store makes no difference to the merit of the idea. I'm still formulating my opinion, but wanted to let you know I was thinking about it.

Thanks for this reply, but please don't get too focused on the patch. At the time of writing it I thought MVC would be completely uncontentious. It was never intended to illustrate the benefits of MVC; that is more work than this patch was intended to achieve. It was written to show that a model can practically be done with a sqlalchemy implementation of an MVC's model with no important deployment overhead, and to illustrate how SQL queries can easily be mapped, as an advantage of using an RDBMS model, rather than to illustrate MVC best practice.

The rest of the email tries to shoehorn the patch into the discussion of the validity of MVC with respect to the patch: https://github.com/ceph/ceph-deploy/pull/320. It's a night's work and only a discussion point; it's half existing code I already have (for rgw setup), and half MVC in itself. Please think of it more as an aid to a conversation (like slides) rather than as final or a good example of MVC best practice. It would need more work to be that, and it's probably a day's work to get close to a good example. The model is clear, but the views could be clearer in the way they are abstracted and could be extended in consistent ways, so I made a few comments on the patch showing where I think the code is not very MVC. A good example of how it's not MVC enough to be an MVC example is the set operations, which should be in a controller method; when comparing data in the model, the data should be loaded via views. In addition, if the model is based on an RDBMS, the set operations would be performed by the RDBMS and not in python. The first post in this thread is more standalone in terms of where I would see us going if we went MVC. The development of the patch helped me see many places where we could add consistency by being more MVC, rather than actually following the design pattern enough to show best practice. I would be happy to chat about the code face to face, but reviewing this code directly without comments does not show the benefits of MVC.

In summary about this patch: (1) It is
Re: upstream/firefly exporting the same snap 2 times results in different exports
On 21.07.2015 at 16:32, Jason Dillaman wrote:
Any chance that the snapshot was just created prior to the first export and you have a process actively writing to the image?

Sadly not. I executed those commands exactly as I've posted them, manually at a bash prompt. I can reproduce this on 5 different ceph clusters with 500 VMs each.

Stefan
Ceph Tech Talk next week
Hey cephers,

Just a reminder that the Ceph Tech Talk on CephFS that was scheduled for last month (and cancelled due to technical difficulties) has been rescheduled for this month's talk. It will be happening next Thurs at 17:00 UTC (1p EST) on our Blue Jeans conferencing system. If you have any questions feel free to let me know. Thanks.

http://ceph.com/ceph-tech-talks/

--
Best Regards, Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com || http://community.redhat.com
@scuttlemonkey || @ceph
Re: upstream/firefly exporting the same snap 2 times results in different exports
Does this still occur if you export the images to the console (i.e. rbd export cephstor/disk-116@snap - > dump_file)? Would it be possible for you to provide logs from the two rbd export runs on your smallest VM image? If so, please add the following to the [client] section of your ceph.conf:

log file = /valid/path/to/logs/$name.$pid.log
debug rbd = 20

I opened a ticket [1] where you can attach the logs (if they aren't too large).

[1] http://tracker.ceph.com/issues/12422

--
Jason Dillaman
Red Hat
dilla...@redhat.com
http://www.redhat.com

- Original Message -
From: Stefan Priebe s.pri...@profihost.ag
To: Jason Dillaman dilla...@redhat.com
Cc: ceph-devel@vger.kernel.org
Sent: Tuesday, July 21, 2015 12:55:43 PM
Subject: Re: upstream/firefly exporting the same snap 2 times results in different exports

On 21.07.2015 at 16:32, Jason Dillaman wrote:
Any chance that the snapshot was just created prior to the first export and you have a process actively writing to the image?

Sadly not. I executed those commands exactly as I've posted them, manually at a bash prompt. I can reproduce this on 5 different ceph clusters with 500 VMs each.

Stefan
Re: hdparm -W redux, bug in _check_disk_write_cache for RHEL6?
On Tue, Jul 21, 2015 at 4:20 PM, Ilya Dryomov idryo...@gmail.com wrote:
This one, I think:

commit ab0a9735e06914ce4d2a94ffa41497dbc142fe7f
Author: Christoph Hellwig h...@lst.de
Date: Thu Oct 29 14:14:04 2009 +0100

    blkdev: flush disk cache on ->fsync

Thanks, that looks relevant! Looks to me like all RHEL 6 kernels have (a version of) that patch.

Cheers, Dan
RE: The design of the eviction improvement
-Original Message-
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Tuesday, July 21, 2015 9:29 PM
To: Wang, Zhiqiang
Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: RE: The design of the eviction improvement

On Tue, 21 Jul 2015, Wang, Zhiqiang wrote:
-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Tuesday, July 21, 2015 6:38 AM
To: Wang, Zhiqiang
Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: Re: The design of the eviction improvement

On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
Hi all, This is a follow-up of one of the CDS sessions at http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction. We discussed the drawbacks of the current eviction algorithm and several ways to improve it. Seems like the LRU variants are the right way to go. I came up with some design points after the CDS, and want to discuss them with you. It is an approximate 2Q algorithm, combining some benefits of the clock algorithm, similar to what the linux kernel does for the page cache.

Unfortunately I missed this last CDS, so I'm behind on the discussion. I have a few questions though...

# Design points:
## LRU lists
- Maintain LRU lists at the PG level. The SharedLRU and SimpleLRU implementations in the current code have a max_size, which limits the max number of elements in the list. This mostly looks like an MRU, though its name implies they are LRUs. Since the object size may vary in a PG, it's not possible to calculate the total number of objects which the cache tier can hold ahead of time. We need a new LRU implementation with no limit on the size.

This last sentence seems to me to be the crux of it. Assuming we have an OSD backed by flash storing O(n) objects, we need a way to maintain an LRU of O(n) objects in memory.
The current hitset-based approach was taken based on the assumption that this wasn't feasible--or at least we didn't know how to implement such a thing. If it is, or we simply want to stipulate that cache tier OSDs get gobs of RAM to make it possible, then lots of better options become possible... Let's say you have a 1TB SSD with an average object size of 1MB -- that's 1 million objects. At maybe ~100 bytes of RAM per object for an LRU entry, that's 100MB... so not so unreasonable, perhaps!

I was having the same question before proposing this. I did a similar calculation and thought it would be OK to use this much memory :-)

The part that worries me now is the speed with which we can load and manage such a list. Assuming it is several hundred MB, it'll take a while to load that into memory and set up all the pointers (assuming a conventional linked list structure). Maybe tens of seconds...

I'm thinking of maintaining the lists at the PG level. That is to say, we have an active/inactive list for every PG. We can load the lists in parallel during rebooting. Also, the ~100MB of lists is split among different OSD nodes. Perhaps it does not need such a long time to load them?

I wonder if instead we should construct some sort of flat model where we load slabs of contiguous memory, 10's of MB each, and have the next/previous pointers be a (slab, position) pair. That way we can load it into memory in big chunks, quickly, and be able to operate on it (adjust links) immediately.

Another thought: currently we use the hobject_t hash only instead of the full object name. We could continue to do the same, or we could do a hash pair (hobject_t hash + a different hash of the rest of the object) to keep the representation compact. With a model like the above, that could get the object representation down to 2 u32's. A link could be a slab + position (2 more u32's), and if we have prev + next that'd be just 6x4=24 bytes per object.
Looks like for an object, the head and the snapshot version have the same hobject hash, so we have to use the hash pair instead of just the hobject hash. But I still have two questions if we use the hash pair to represent an object:
1) Does the hash pair uniquely identify an object? That is to say, is it possible for two objects to have the same hash pair?
2) We need a way to get the full object name from the hash pair, so that we know which objects to evict. But it seems like we don't have a good way to do this?

With fixed-sized slots on the slabs, the slab allocator could be very simple... maybe just a bitmap, a free counter, and any other trivial optimizations to make finding a slab's next free slot nice and quick.

- Two lists for each PG: active and inactive. Objects are first put into the inactive list when they are accessed, and moved between these two lists based on some criteria.
- Object flags: active, referenced,
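A minimal sketch of the compact entry layout discussed in this exchange: only the 6x4=24-byte arithmetic comes from the mail; the struct and field names are illustrative assumptions, not Ceph code.

```cpp
#include <cstdint>

// Links are (slab, position) pairs instead of raw pointers, so a slab of
// entries can be loaded from disk in one contiguous chunk and used as-is.
struct lru_link {
    uint32_t slab;  // which contiguous slab the neighbor lives in
    uint32_t pos;   // fixed-size slot index within that slab
};

// One LRU entry: two u32 hashes identify the object (hobject_t hash plus a
// second hash over the rest of the name), prev/next give list order.
struct lru_entry {
    uint32_t hobject_hash;  // hash already used by the OSD today
    uint32_t name_hash;     // assumed second hash to disambiguate head/snaps
    lru_link prev;
    lru_link next;
};

// 6 x 4 bytes = 24 bytes per tracked object, matching the estimate above.
static_assert(sizeof(lru_entry) == 24, "entry should stay at 24 bytes");
```

Note that this layout inherits the open question from the mail: the hash pair is compact but not provably unique, and recovering the full object name from it still needs a separate lookup.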
Re: The design of the eviction improvement
Hi,

- Zhiqiang Wang zhiqiang.w...@intel.com wrote:
Hi Matt,
-Original Message-
From: Matt W. Benjamin [mailto:m...@cohortfs.com]
Sent: Tuesday, July 21, 2015 10:16 PM
To: Wang, Zhiqiang
Cc: sj...@redhat.com; ceph-devel@vger.kernel.org; Sage Weil
Subject: Re: The design of the eviction improvement

Hi, Couple of points.
1) A successor to 2Q is MQ (Li et al). We have an intrusive MQ LRU implementation with 2 levels currently, plus a pinned queue, that addresses stuff like partitioning (sharding), scan resistance, and coordination with lookup tables. We might extend/re-use it.

The MQ algorithm is more complex, and seems like it has more overhead than 2Q. The approximate 2Q algorithm here combines some benefits of the clock algorithm, and works well in the linux kernel. MQ could be another choice. There are some other candidates like LIRS, ARC, etc., which have been deployed in some practical systems.

MQ has been deployed in practical systems, and is more general.

2) I'm a bit confused by the active/inactive vocabulary, the dimensioning of cache segments (are you proposing to/do we now always cache whole objects?), and the cost of looking for dirty objects; I suspect that it makes sense to amortize the cost of locating segments eligible to be flushed, rather than to minimize bookkeeping.

Though currently the caching decisions are made in units of whole objects, as Greg pointed out in another mail, I think we still have something to improve here. I'll come back to this some time later.
Matt

- Zhiqiang Wang zhiqiang.w...@intel.com wrote:
-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Tuesday, July 21, 2015 6:38 AM
To: Wang, Zhiqiang
Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: Re: The design of the eviction improvement

On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
Hi all, This is a follow-up of one of the CDS sessions at http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction. We discussed the drawbacks of the current eviction algorithm and several ways to improve it. Seems like the LRU variants are the right way to go. I came up with some design points after the CDS, and want to discuss them with you. It is an approximate 2Q algorithm, combining some benefits of the clock algorithm, similar to what the linux kernel does for the page cache.

Unfortunately I missed this last CDS, so I'm behind on the discussion. I have a few questions though...

# Design points:
## LRU lists
- Maintain LRU lists at the PG level. The SharedLRU and SimpleLRU implementations in the current code have a max_size, which limits the max number of elements in the list. This mostly looks like an MRU, though its name implies they are LRUs. Since the object size may vary in a PG, it's not possible to calculate the total number of objects which the cache tier can hold ahead of time. We need a new LRU implementation with no limit on the size.

This last sentence seems to me to be the crux of it. Assuming we have an OSD backed by flash storing O(n) objects, we need a way to maintain an LRU of O(n) objects in memory. The current hitset-based approach was taken based on the assumption that this wasn't feasible--or at least we didn't know how to implement such a thing. If it is, or we simply want to stipulate that cache tier OSDs get gobs of RAM to make it possible, then lots of better options become possible...
Let's say you have a 1TB SSD, with an average object size of 1MB -- that's 1 million objects. At maybe ~100 bytes per object of RAM for an LRU entry that's 100MB... so not so unreasonable, perhaps! I was having the same question before proposing this. I did a similar calculation and thought it would be OK to use this much memory :-) - Two lists for each PG: active and inactive Objects are first put into the inactive list when they are accessed, and moved between these two lists based on some criteria. Object flag: active, referenced, unevictable, dirty. - When an object is accessed: 1) If it's not in either of the lists, it's put on the top of the inactive list 2) If it's in the inactive list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the inactive list. 3) If it's in the inactive list, and the referenced flag is set, the referenced flag is cleared, and it's removed from the inactive list, and put on top of the active list. 4) If it's in the active list, and the referenced flag is not set, the referenced flag is
Subscription to the ceph-devel mailing list
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: The design of the eviction improvement
Hi Matt, -Original Message- From: Matt W. Benjamin [mailto:m...@cohortfs.com] Sent: Tuesday, July 21, 2015 10:16 PM To: Wang, Zhiqiang Cc: sj...@redhat.com; ceph-devel@vger.kernel.org; Sage Weil Subject: Re: The design of the eviction improvement Hi, Couple of points. 1) a successor to 2Q is MQ (Li et al). We have an intrusive MQ LRU implementation with 2 levels currently, plus a pinned queue, that addresses stuff like partitioning (sharding), scan resistance, and coordination w/lookup tables. We might extend/re-use it. The MQ algorithm is more complex, and seems like it has more overhead than 2Q. The approximate 2Q algorithm here combines some benefits of the clock algorithm, and works well on the linux kernel. MQ could be another choice. There are some other candidates like LIRS, ARC, etc., which have been deployed in some practical systems. 2) I'm a bit confused by active/inactive vocabulary, dimensioning of cache segments (are you proposing to/do we now always cache whole objects?), and cost of looking for dirty objects; I suspect that it makes sense to amortize the cost of locating segments eligible to be flushed, rather than minimize bookkeeping. Though currently the caching decisions are made in the unit of object as Greg pointed out in another mail, I think we still have something to improve here. I'll come back to this some time later. Matt - Zhiqiang Wang zhiqiang.w...@intel.com wrote: -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Tuesday, July 21, 2015 6:38 AM To: Wang, Zhiqiang Cc: sj...@redhat.com; ceph-devel@vger.kernel.org Subject: Re: The design of the eviction improvement On Mon, 20 Jul 2015, Wang, Zhiqiang wrote: Hi all, This is a follow-up of one of the CDS session at http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_ti eri ng_eviction. We discussed the drawbacks of the current eviction algorithm and several ways to improve it. 
Seems like the LRU variants are the right way to go. I came up with some design points after the CDS, and want to discuss them with you. It is an approximate 2Q algorithm, combining some benefits of the clock algorithm, similar to what the Linux kernel does for the page cache. Unfortunately I missed this last CDS so I'm behind on the discussion. I have a few questions though... # Design points: ## LRU lists - Maintain LRU lists at the PG level. The SharedLRU and SimpleLRU implementations in the current code have a max_size, which limits the max number of elements in the list. This mostly looks like an MRU, though its name implies they are LRUs. Since the object size may vary in a PG, it's not possible to calculate the total number of objects which the cache tier can hold ahead of time. We need a new LRU implementation with no limit on the size. This last sentence seems to me to be the crux of it. Assuming we have an OSD backed by flash storing O(n) objects, we need a way to maintain an LRU of O(n) objects in memory. The current hitset-based approach was taken based on the assumption that this wasn't feasible--or at least we didn't know how to implement such a thing. If it is, or we simply want to stipulate that cache tier OSDs get gobs of RAM to make it possible, then lots of better options become possible... Let's say you have a 1TB SSD, with an average object size of 1MB -- that's 1 million objects. At maybe ~100 bytes per object of RAM for an LRU entry that's 100MB... so not so unreasonable, perhaps! I was having the same question before proposing this. I did a similar calculation and thought it would be OK to use this much memory :-) - Two lists for each PG: active and inactive Objects are first put into the inactive list when they are accessed, and moved between these two lists based on some criteria. Object flag: active, referenced, unevictable, dirty. 
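As a sanity check, the back-of-the-envelope numbers above work out as follows (the ~100 bytes of RAM per LRU entry is the assumption from the mail, not a measured figure):

```python
# Rough RAM estimate for keeping a full in-memory LRU of every cached object.
# The per-entry overhead (100 bytes) is an assumption, not a measured value.
ssd_bytes = 1 * 10**12          # 1 TB cache-tier SSD
avg_object_bytes = 1 * 10**6    # 1 MB average object size
lru_entry_bytes = 100           # assumed LRU bookkeeping per object

num_objects = ssd_bytes // avg_object_bytes
lru_ram_bytes = num_objects * lru_entry_bytes

print(num_objects)                      # 1 million objects
print(lru_ram_bytes // 10**6, "MB")     # ~100 MB of RAM for the LRU
```

So the LRU metadata scales linearly with the object count, and shrinking the average object size (say, to 64 KB) would multiply the RAM cost accordingly.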
- When an object is accessed: 1) If it's not in either of the lists, it's put on the top of the inactive list. 2) If it's in the inactive list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the inactive list. 3) If it's in the inactive list, and the referenced flag is set, the referenced flag is cleared, and it's removed from the inactive list, and put on top of the active list. 4) If it's in the active list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the active list. 5) If it's in the active list, and the referenced flag is set, it's moved to the top of the active list. - When selecting objects to evict: 1) Objects at the bottom of the inactive list are selected to
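For illustration only, the five access rules above can be sketched as a toy model (hypothetical Python, not the proposed OSD implementation; the unevictable and dirty flags and the actual flush/evict work are omitted):

```python
from collections import OrderedDict

class ApproximateTwoQ:
    """Toy model of the active/inactive list scheme described above.

    Each OrderedDict is kept with the most recently promoted key first;
    the stored value is the object's 'referenced' flag.
    """
    def __init__(self):
        self.inactive = OrderedDict()  # obj -> referenced flag
        self.active = OrderedDict()

    def access(self, obj):
        if obj in self.inactive:
            if not self.inactive[obj]:
                # rule 2: set referenced, move to top of inactive
                self.inactive[obj] = True
                self.inactive.move_to_end(obj, last=False)
            else:
                # rule 3: clear referenced, promote to top of active
                del self.inactive[obj]
                self.active[obj] = False
                self.active.move_to_end(obj, last=False)
        elif obj in self.active:
            if not self.active[obj]:
                # rule 4: set referenced
                self.active[obj] = True
            # rules 4 and 5: move to top of active
            self.active.move_to_end(obj, last=False)
        else:
            # rule 1: first touch goes to the top of inactive, unreferenced
            self.inactive[obj] = False
            self.inactive.move_to_end(obj, last=False)

    def evict_candidate(self):
        # eviction scans from the bottom of the inactive list
        return next(reversed(self.inactive), None)
```

The point of the two-list split is scan resistance: an object must be touched repeatedly before it reaches the active list, so a one-pass scan only churns the inactive list.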
Re: upstream/firefly exporting the same snap 2 times results in different exports
On 21.07.2015 21:46, Josh Durgin wrote: On 07/21/2015 12:22 PM, Stefan Priebe wrote: On 21.07.2015 19:19, Jason Dillaman wrote: Does this still occur if you export the images to the console (i.e. rbd export cephstor/disk-116@snap - dump_file)? Would it be possible for you to provide logs from the two rbd export runs on your smallest VM image? If so, please add the following to the [client] section of your ceph.conf: log file = /valid/path/to/logs/$name.$pid.log debug rbd = 20 I opened a ticket [1] where you can attach the logs (if they aren't too large). [1] http://tracker.ceph.com/issues/12422 Will post some more details to the tracker in a few hours. It seems it is related to using discard inside the guest but not on the FS the OSD is on. That sounds very odd. Could you verify via 'rados listwatchers' on an in-use rbd image's header object that there's still a watch established? How can I do this exactly? Have you increased pgs in all those clusters recently? Yes, I bumped from 2048 to 4096 as I doubled the OSDs. Stefan Josh
Re: upstream/firefly exporting the same snap 2 times results in different exports
So this is really this old bug? http://tracker.ceph.com/issues/9806 Stefan On 21.07.2015 21:46, Josh Durgin wrote: On 07/21/2015 12:22 PM, Stefan Priebe wrote: On 21.07.2015 19:19, Jason Dillaman wrote: Does this still occur if you export the images to the console (i.e. rbd export cephstor/disk-116@snap - dump_file)? Would it be possible for you to provide logs from the two rbd export runs on your smallest VM image? If so, please add the following to the [client] section of your ceph.conf: log file = /valid/path/to/logs/$name.$pid.log debug rbd = 20 I opened a ticket [1] where you can attach the logs (if they aren't too large). [1] http://tracker.ceph.com/issues/12422 Will post some more details to the tracker in a few hours. It seems it is related to using discard inside the guest but not on the FS the OSD is on. That sounds very odd. Could you verify via 'rados listwatchers' on an in-use rbd image's header object that there's still a watch established? Have you increased pgs in all those clusters recently? Josh
Re: upstream/firefly exporting the same snap 2 times results in different exports
On 21.07.2015 19:19, Jason Dillaman wrote: Does this still occur if you export the images to the console (i.e. rbd export cephstor/disk-116@snap - dump_file)? Would it be possible for you to provide logs from the two rbd export runs on your smallest VM image? If so, please add the following to the [client] section of your ceph.conf: log file = /valid/path/to/logs/$name.$pid.log debug rbd = 20 I opened a ticket [1] where you can attach the logs (if they aren't too large). [1] http://tracker.ceph.com/issues/12422 Will post some more details to the tracker in a few hours. It seems it is related to using discard inside the guest but not on the FS the OSD is on. Stefan
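One way to confirm that two exports of the same snapshot really differ is to checksum the dump files. A small sketch (the dump file names are hypothetical, and this assumes both runs were saved to local files):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a (possibly large) export file and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical dump files produced by two runs of something like:
#   rbd export cephstor/disk-116@snap dump_file_N
# A snapshot is supposed to be immutable, so the digests should match:
#   sha256_of("dump_file_1") == sha256_of("dump_file_2")
```

If the digests differ, diffing the block offsets of the two files would then show whether the divergence lines up with discarded regions.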
Re: upstream/firefly exporting the same snap 2 times results in different exports
On 07/21/2015 12:22 PM, Stefan Priebe wrote: On 21.07.2015 19:19, Jason Dillaman wrote: Does this still occur if you export the images to the console (i.e. rbd export cephstor/disk-116@snap - dump_file)? Would it be possible for you to provide logs from the two rbd export runs on your smallest VM image? If so, please add the following to the [client] section of your ceph.conf: log file = /valid/path/to/logs/$name.$pid.log debug rbd = 20 I opened a ticket [1] where you can attach the logs (if they aren't too large). [1] http://tracker.ceph.com/issues/12422 Will post some more details to the tracker in a few hours. It seems it is related to using discard inside the guest but not on the FS the OSD is on. That sounds very odd. Could you verify via 'rados listwatchers' on an in-use rbd image's header object that there's still a watch established? Have you increased pgs in all those clusters recently? Josh
Re: Ceph Tech Talk next week
On Tue, Jul 21, 2015 at 6:09 PM, Patrick McGarry pmcga...@redhat.com wrote: Hey cephers, Just a reminder that the Ceph Tech Talk on CephFS that was scheduled for last month (and cancelled due to technical difficulties) has been rescheduled for this month's talk. It will be happening next Thurs at 17:00 UTC (1p EST) So that's July 30, according to the website, right? :)
Re: upstream/firefly exporting the same snap 2 times results in different exports
Yes, I'm afraid it sounds like it is. You can double check whether the watch exists on an image by getting the id of the image from 'rbd info $pool/$image | grep block_name_prefix': block_name_prefix: rbd_data.105674b0dc51 The id is the hex number there. Append that to 'rbd_header.' and you have the header object name. Check whether it has watchers with: rados listwatchers -p $pool rbd_header.105674b0dc51 If that doesn't show any watchers while the image is in use by a vm, it's #9806. I just merged the backport for firefly, so it'll be in 0.80.11. Sorry it took so long to get to firefly :(. We'll need to be more vigilant about checking non-trivial backports when we're going through all the bugs periodically. Josh On 07/21/2015 12:52 PM, Stefan Priebe wrote: So this is really this old bug? http://tracker.ceph.com/issues/9806 Stefan Am 21.07.2015 um 21:46 schrieb Josh Durgin: On 07/21/2015 12:22 PM, Stefan Priebe wrote: Am 21.07.2015 um 19:19 schrieb Jason Dillaman: Does this still occur if you export the images to the console (i.e. rbd export cephstor/disk-116@snap - dump_file)? Would it be possible for you to provide logs from the two rbd export runs on your smallest VM image? If so, please add the following to the [client] section of your ceph.conf: log file = /valid/path/to/logs/$name.$pid.log debug rbd = 20 I opened a ticket [1] where you can attach the logs (if they aren't too large). [1] http://tracker.ceph.com/issues/12422 Will post some more details to the tracker in a few hours. It seems it is related to using discard inside guest but not on the FS the osd is on. That sounds very odd. Could you verify via 'rados listwatchers' on an in-use rbd image's header object that there's still a watch established? Have you increased pgs in all those clusters recently? 
Josh
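The name derivation Josh describes is plain string manipulation: strip the `rbd_data.` prefix reported by `rbd info` and prepend `rbd_header.`. A small illustrative sketch (the function name is made up):

```python
def header_object_name(block_name_prefix):
    """Derive the format-2 RBD header object name from the
    block_name_prefix line of 'rbd info'.

    E.g. 'rbd_data.105674b0dc51' -> 'rbd_header.105674b0dc51'.
    The result is what gets passed to:
        rados listwatchers -p <pool> rbd_header.<id>
    """
    prefix = "rbd_data."
    if not block_name_prefix.startswith(prefix):
        raise ValueError("unexpected block_name_prefix: %r" % block_name_prefix)
    return "rbd_header." + block_name_prefix[len(prefix):]

print(header_object_name("rbd_data.105674b0dc51"))  # rbd_header.105674b0dc51
```

This mirrors the sed pipeline Jason posts later in the thread; no watchers on that header object while a VM is using the image indicates the stale-watch bug.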
quick way to rebuild deb packages
Hi all, I'm currently working on a test environment for ceph where we're using deb files to deploy new versions on a test cluster. To make this work efficiently I'd have to quickly build deb packages. I tried dpkg-buildpackage -nc, which should keep the results of the previous build, but it ends up in a linking error:

  CXXLD ceph_rgw_jsonparser
./.libs/libglobal.a(json_spirit_reader.o): In function `~thread_specific_ptr':
/usr/include/boost/thread/tss.hpp:79: undefined reference to `boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
[the same undefined reference is reported four more times]
./.libs/libglobal.a(json_spirit_reader.o):/usr/include/boost/thread/tss.hpp:79: more undefined references to `boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)' follow
./.libs/libglobal.a(json_spirit_reader.o): In function `call_once<void (*)()>': ...

Any ideas on what could go wrong here? The version I'm compiling is v0.94.1 but I've observed the same results with 9.0.1. -- Bartlomiej Swiecki bartlomiej.swie...@corp.ovh.com
Re: upstream/firefly exporting the same snap 2 times results in different exports
That sounds very odd. Could you verify via 'rados listwatchers' on an in-use rbd image's header object that there's still a watch established? How can I do this exactly? You need to determine the RBD header object name. For format 1 images (the default for Firefly), the image header object is named <image name>.rbd. For format 2 images, you can determine the header object name via: rbd info <image spec> | grep 'block_name_prefix' | sed 's/.*rbd_data\.\(.*\)/rbd_header.\1/g'. Once you have the RBD image header object name, you can run: rados listwatchers -p <pool name> <RBD image header object name>. -- Jason Dillaman Red Hat dilla...@redhat.com http://www.redhat.com
RE: local teuthology testing
Loic, thanks for the notes! Will try the new code and report the issues I meet. Thanks, -yuan -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Loic Dachary Sent: Tuesday, July 21, 2015 11:48 PM To: shin...@linux.com; Zhou, Yuan Cc: David Casier AEVOO; Ceph Devel; se...@lists.ceph.com Subject: Re: local teuthology testing Hi, Since July 18th teuthology no longer uses chef, so this issue has been resolved! Using ansible requires configuration ( http://dachary.org/?p=3752 explains it briefly; maybe there is something in the documentation but I did not pay enough attention to be sure ). At the end of http://dachary.org/?p=3752 you will see a list of configurable values, and I suspect Andrew Zack would be more than happy to explain how any hardcoded leftover can be stripped :-) Cheers On 21/07/2015 14:58, Shinobu Kinjo wrote: Hi, I think you should show us such URLs for anyone who has the same issue. Sincerely, Kinjo On Tue, Jul 21, 2015 at 9:52 PM, Zhou, Yuan yuan.z...@intel.com wrote: Hi David/Loic, I was also trying to set up some local Teuthology clusters here. The biggest issue I met is in ceph-qa-chef - there are lots of hardcoded URLs related to the sepia lab. I had to trace the code and change them line by line. Can you please share how you got this to work? Is there an easy way to fix this? Thanks, -yuan -- Life w/ Linux http://i-shinobu.hatenablog.com/ -- Loïc Dachary, Artisan Logiciel Libre
Re: rados/thrash on OpenStack
Hi Kefu, The following runs on OpenStack and the next branch http://integration.ceph.dachary.org:8081/ubuntu-2015-07-21_00:04:04-rados-next---basic-openstack/ and 15 out of the 16 dead jobs (timed out after 3 hours) are from rados/thrash. A rados suite run on next dated a few days ago in the sepia lab ( http://pulpito.ceph.com/teuthology-2015-07-15_21:00:10-rados-next-distro-basic-multi/ ) also has a few dead jobs but only two of them are from rados/thrash. Cheers On 20/07/2015 16:23, Loic Dachary wrote: More information about this run. I'll run a rados suite on master on OpenStack to get a baseline of what we should expect. http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/12/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/14/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/15/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/17/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/20/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/21/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/22/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/23/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/26/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/28/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/2/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/5/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/6/ 
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/7/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/9/ I see 2015-07-20T10:02:10.567 INFO:tasks.ceph.osd.5.ovh165019.stderr:osd/ReplicatedPG.cc: In function 'bool ReplicatedPG::is_degraded_or_backfilling_object(const hobject_t)' thread 7f2af94df700 time 2015-07-20 10:02:10.481916 2015-07-20T10:02:10.567 INFO:tasks.ceph.osd.5.ovh165019.stderr:osd/ReplicatedPG.cc: 412: FAILED assert(!actingbackfill.empty()) 2015-07-20T10:02:10.567 INFO:tasks.ceph.osd.5.ovh165019.stderr: ceph version 9.0.2-799-gba9c2ae (ba9c2ae4bffd3fd7b26a2e0ce843913b77940b8a) 2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xc45d1b] 2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 2: ceph-osd() [0x88535d] 2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 3: (ReplicatedPG::hit_set_remove_all()+0x7c) [0x8b039c] 2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 4: (ReplicatedPG::on_pool_change()+0x161) [0x8b1a21] 2015-07-20T10:02:10.569 INFO:tasks.ceph.osd.5.ovh165019.stderr: 5: (PG::handle_advance_map(std::tr1::shared_ptrOSDMap const, std::tr1::shared_ptrOSDMap const, std::vectorint, std::allocatorint , int, std::vectorint, std::allocatorint , int, PG::RecoveryCtx*)+0x60c) [0x8348fc] 2015-07-20T10:02:10.569 INFO:tasks.ceph.osd.5.ovh165019.stderr: 6: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle, PG::RecoveryCtx*, std::setboost::intrusive_ptrPG, std::lessboost::intrusive_ptrPG , std::allocatorboost::intrusive_ptrPG *)+0x2c3) [0x6dcc73] 2015-07-20T10:02:10.569 INFO:tasks.ceph.osd.5.ovh165019.stderr: 7: (OSD::process_peering_events(std::listPG*, std::allocatorPG* const, ThreadPool::TPHandle)+0x1f1) [0x6dd721] 2015-07-20T10:02:10.572 INFO:tasks.ceph.osd.5.ovh165019.stderr: 8: (OSD::PeeringWQ::_process(std::listPG*, 
std::allocatorPG* const, ThreadPool::TPHandle)+0x18) [0x7328d8] 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xc3677e] 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 10: (ThreadPool::WorkThread::entry()+0x10) [0xc37820] 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 11: (()+0x8182) [0x7f2b149e3182] 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 12: (clone()+0x6d) [0x7f2b12d2847d] In http://149.202.164.239/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/24/ I see the same error as below. In http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/8/ it looks like the run was about to finish, just took a long time, and should be ignored as a false negative. On 20/07/2015 14:52, Loic Dachary wrote: Hi, I checked one of the timeout (dead) at
Re: rados/thrash on OpenStack
Note however that only one of the dead (timed out) jobs has an assert (it looks like it's because the file system is not as it should be, which is expected since there are no disks attached to the instances, and therefore no way for the job to mkfs the file system of choice). All the others timed out just because they either need more disk or just more time. On 21/07/2015 09:30, Loic Dachary wrote: Hi Kefu, The following runs on OpenStack and the next branch http://integration.ceph.dachary.org:8081/ubuntu-2015-07-21_00:04:04-rados-next---basic-openstack/ and 15 out of the 16 dead jobs (timed out after 3 hours) are from rados/thrash. A rados suite run on next dated a few days ago in the sepia lab ( http://pulpito.ceph.com/teuthology-2015-07-15_21:00:10-rados-next-distro-basic-multi/ ) also has a few dead jobs but only two of them are from rados/thrash. Cheers On 20/07/2015 16:23, Loic Dachary wrote: More information about this run. I'll run a rados suite on master on OpenStack to get a baseline of what we should expect. 
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/12/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/14/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/15/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/17/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/20/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/21/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/22/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/23/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/26/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/28/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/2/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/5/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/6/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/7/ http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/9/ I see 2015-07-20T10:02:10.567 INFO:tasks.ceph.osd.5.ovh165019.stderr:osd/ReplicatedPG.cc: In function 'bool ReplicatedPG::is_degraded_or_backfilling_object(const hobject_t)' thread 7f2af94df700 time 2015-07-20 10:02:10.481916 2015-07-20T10:02:10.567 INFO:tasks.ceph.osd.5.ovh165019.stderr:osd/ReplicatedPG.cc: 412: FAILED assert(!actingbackfill.empty()) 2015-07-20T10:02:10.567 INFO:tasks.ceph.osd.5.ovh165019.stderr: ceph version 9.0.2-799-gba9c2ae (ba9c2ae4bffd3fd7b26a2e0ce843913b77940b8a) 
2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xc45d1b] 2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 2: ceph-osd() [0x88535d] 2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 3: (ReplicatedPG::hit_set_remove_all()+0x7c) [0x8b039c] 2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 4: (ReplicatedPG::on_pool_change()+0x161) [0x8b1a21] 2015-07-20T10:02:10.569 INFO:tasks.ceph.osd.5.ovh165019.stderr: 5: (PG::handle_advance_map(std::tr1::shared_ptrOSDMap const, std::tr1::shared_ptrOSDMap const, std::vectorint, std::allocatorint , int, std::vectorint, std::allocatorint , int, PG::RecoveryCtx*)+0x60c) [0x8348fc] 2015-07-20T10:02:10.569 INFO:tasks.ceph.osd.5.ovh165019.stderr: 6: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle, PG::RecoveryCtx*, std::setboost::intrusive_ptrPG, std::lessboost::intrusive_ptrPG , std::allocatorboost::intrusive_ptrPG *)+0x2c3) [0x6dcc73] 2015-07-20T10:02:10.569 INFO:tasks.ceph.osd.5.ovh165019.stderr: 7: (OSD::process_peering_events(std::listPG*, std::allocatorPG* const, ThreadPool::TPHandle)+0x1f1) [0x6dd721] 2015-07-20T10:02:10.572 INFO:tasks.ceph.osd.5.ovh165019.stderr: 8: (OSD::PeeringWQ::_process(std::listPG*, std::allocatorPG* const, ThreadPool::TPHandle)+0x18) [0x7328d8] 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xc3677e] 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 10: (ThreadPool::WorkThread::entry()+0x10) [0xc37820] 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 11: (()+0x8182) [0x7f2b149e3182] 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 12: (clone()+0x6d) [0x7f2b12d2847d] In
Re: dmcrypt with luks keys in hammer
Hi, On Mon, 20 Jul 2015 15:21:50 -0700 (PDT), Sage Weil wrote: On Mon, 20 Jul 2015, Wyllys Ingersoll wrote: No luck with ceph-disk-activate (all or just one device).

$ sudo ceph-disk-activate /dev/sdv1
mount: unknown filesystem type 'crypto_LUKS'
ceph-disk: Mounting filesystem failed: Command '['/bin/mount', '-t', 'crypto_LUKS', '-o', '', '--', '/dev/sdv1', '/var/lib/ceph/tmp/mnt.QHe3zK']' returned non-zero exit status 32

It's odd that it should complain about the crypto_LUKS filesystem not being recognized, because it did mount some of the LUKS systems successfully, though sometimes just the data and not the journal (or vice versa).

$ lsblk /dev/sdb
NAME                                             MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sdb                                                8:16   0  3.7T  0 disk
├─sdb1                                             8:17   0  3.6T  0 part
│ └─e8bc1531-a187-4fd2-9e3f-cf90255f89d0 (dm-0)  252:0    0  3.6T  0 crypt /var/lib/ceph/osd/ceph-54
└─sdb2                                             8:18   0   10G  0 part
  └─temporary-cryptsetup-1235 (dm-6)             252:6    0  125K  1 crypt

$ blkid /dev/sdb1
/dev/sdb1: UUID=d6194096-a219-4732-8d61-d0c125c49393 TYPE=crypto_LUKS

A race condition (or other issue) with udev seems likely given that it's rather random which ones come up and which ones don't. A race condition during creation or activation? If it's activation I would expect ceph-disk activate ... to work reasonably reliably when called manually (on a single device at a time). We encountered similar issues on a non-dmcrypt firefly deployment with 10 OSDs per node. I've been working on a patch set to defer device activation to systemd services. ceph-disk activate is extended to support mapping of dmcrypt devices prior to OSD startup. The master-based changes aren't ready for upstream yet, but can be found in my WIP branch at: https://github.com/ddiss/ceph/tree/wip_bnc926756_split_udev_systemd_master There are a few things that I'd still like to address before submitting upstream, mostly covering activate-journal: - The test/ceph-disk.sh unit tests need to be extended and fixed. 
- The activate-journal --dmcrypt changes are less than optimal, and leave me with a few unanswered questions: + Does get_journal_osd_uuid(dev) return the plaintext or ciphertext uuid? + If a journal is encrypted, is the data partition also always encrypted? - dmcrypt journal device mapping should probably also be split out into a separate systemd service, as that'll be needed for the future network-based key retrieval feature. Feedback on the approach taken would be appreciated. Cheers, David
Re: About Fio backend with ObjectStore API
Hi Casey, I checked your commits and know what you fixed. I cherry-picked your new commits but I still hit the same problem. It's strange that it always hits a segmentation fault when entering _fio_setup_ceph_filestore_data; gdb says td->io_ops is NULL, but when I go up the stack, td->io_ops is not null. Maybe it's related to dlopen? Do you have any hint about this? On Thu, Jul 16, 2015 at 5:23 AM, Casey Bodley cbod...@gmail.com wrote: Hi Haomai, I was able to run this after a couple of changes to the filestore.fio job file. Two of the config options were using the wrong names. I pushed a fix for the job file, as well as a patch that renames everything from filestore to objectstore (thanks James), to https://github.com/linuxbox2/linuxbox-ceph/commits/fio-objectstore. I found that the read support doesn't appear to work anymore, so give rw=write a try. And because it does a mkfs(), make sure you're pointing it to an empty xfs directory with the directory= option. Casey On Tue, Jul 14, 2015 at 2:45 AM, Haomai Wang haomaiw...@gmail.com wrote: Has anyone successfully run fio with this external io engine ceph_objectstore? It's strange that it always hits a segmentation fault when entering _fio_setup_ceph_filestore_data; gdb says td->io_ops is NULL, but when I go up the stack, td->io_ops is not null. Maybe it's related to dlopen? On Fri, Jul 10, 2015 at 3:51 PM, Haomai Wang haomaiw...@gmail.com wrote: I have rebased the branch with master, and pushed it to the ceph upstream repo: https://github.com/ceph/ceph/compare/fio-objectstore?expand=1 Please let me know who is working on this. Otherwise, I would like to improve this to be merge ready. On Fri, Jul 10, 2015 at 4:26 AM, Matt W. Benjamin m...@cohortfs.com wrote: That makes sense. Matt - James (Fei) Liu-SSI james@ssi.samsung.com wrote: Hi Casey, Got it. I was directed to the old code base. By the way, since the test case was used to exercise all of the object stores. 
I strongly recommend renaming fio_ceph_filestore.cc to fio_ceph_objectstore.cc, and the code in it should be refactored to reflect that all object stores will be supported by fio_ceph_objectstore.cc. What do you think? Let me know if you need any help from my side. Regards, James -Original Message- From: Casey Bodley [mailto:cbod...@gmail.com] Sent: Thursday, July 09, 2015 12:32 PM To: James (Fei) Liu-SSI Cc: Haomai Wang; ceph-devel@vger.kernel.org Subject: Re: About Fio backend with ObjectStore API Hi James, Are you looking at the code from https://github.com/linuxbox2/linuxbox-ceph/tree/fio-objectstore? It uses ObjectStore::create() instead of new FileStore(). This allows us to exercise all of the object stores with the same code. Casey On Thu, Jul 9, 2015 at 2:01 PM, James (Fei) Liu-SSI james@ssi.samsung.com wrote: Hi Casey, Here is the code in fio_ceph_filestore.cc. Basically, it creates a FileStore as the backend engine for IO exercises. If we want to send IO commands to the KeyValue Store or Newstore, we have to change the code accordingly, right? I did not see any other files like fio_ceph_keyvaluestore.cc or fio_ceph_newstore.cc. In my humble opinion, we might need to create two more fio engines for keyvaluestore and newstore if we want to exercise those two, right?
Regards, James

static int fio_ceph_filestore_init(struct thread_data *td)
{
  vector<const char*> args;
  struct ceph_filestore_data *ceph_filestore_data =
      (struct ceph_filestore_data *) td->io_ops->data;
  ObjectStore::Transaction ft;

  global_init(NULL, args, CEPH_ENTITY_TYPE_OSD, CODE_ENVIRONMENT_UTILITY, 0);
  //g_conf->journal_dio = false;
  common_init_finish(g_ceph_context);
  //g_ceph_context->_conf->set_val("debug_filestore", "20");
  //g_ceph_context->_conf->set_val("debug_throttle", "20");
  g_ceph_context->_conf->apply_changes(NULL);

  ceph_filestore_data->osd_path = strdup("/mnt/fio_ceph_filestore.XXX");
  ceph_filestore_data->journal_path = strdup("/var/lib/ceph/osd/journal-ram/fio_ceph_filestore.XXX");

  if (!mkdtemp(ceph_filestore_data->osd_path)) {
    cout << "mkdtemp failed: " << strerror(errno) << std::endl;
    return 1;
  }
  //mktemp(ceph_filestore_data->journal_path); // NOSPC issue

  ObjectStore *fs = new FileStore(ceph_filestore_data->osd_path,
                                  ceph_filestore_data->journal_path);
  ceph_filestore_data->fs = fs;

  if (fs->mkfs() < 0) {
    cout << "mkfs failed" << std::endl;
    goto failed;
  }

  if (fs->mount() < 0) {
    cout << "mount failed" << std::endl;
    goto failed;
  }

  ft.create_collection(coll_t());
  fs->apply_transaction(ft);

  return 0;
hdparm -W redux, bug in _check_disk_write_cache for RHEL6?
Hi, Following the sf.net corruption report I've been checking our config w.r.t. data consistency. AFAIK the two main recommendations are: 1) don't mount FileStores with nobarrier 2) disable write-caching (hdparm -W 0 /dev/sdX) when using block dev journals and your kernel is < 2.6.33 Obviously we don't do (1) because that would be crazy, but for (2) we didn't yet disable write-caching, probably because we didn't notice the doc. But my lame excuse is that apparently _check_disk_write_cache in FileJournal.cc doesn't print a warning when it should, because hdparm -W doesn't always work on partitions (as opposed to whole block devices). See:

GOOD: ceph 0.94.2, kernel 3.10.0-229.7.2.el7.x86_64, hdparm v9.43:
10 journal _open_block_device: ignoring osd journal size. We'll use the entire block device (size: 21474836480)
20 journal _check_disk_write_cache: disk write cache is on, but your kernel is new enough to handle it correctly. (fn:/var/lib/ceph/osd/ceph-96/journal)
1 journal _open /var/lib/ceph/osd/ceph-96/journal fd 20: 21474836480 bytes, block size 4096 bytes, directio = 1, aio = 1

BAD: ceph 0.94.2, kernel 2.6.32-431.29.2.el6.x86_64, hdparm v9.43:
10 journal _open_block_device: ignoring osd journal size. We'll use the entire block device (size: 21474836480)
1 journal _open /var/lib/ceph/osd/ceph-56/journal fd 19: 21474836480 bytes, block size 4096 bytes, directio = 1, aio = 1

In other words, running hammer on EL6, _check_disk_write_cache exits without printing anything, but actually it should log the scary WARNING: disk write cache is ON. I guess it's because of this:

GOOD
# uname -r
# hdparm -W /dev/sda
# hdparm -W /dev/sda1
3.10.0-229.7.2.el7.x86_64
/dev/sda1: write-caching = 1 (on)
/dev/sda: write-caching = 1 (on)

BAD
# uname -r
# hdparm -W /dev/sda
# hdparm -W /dev/sda1
2.6.32-431.23.3.el6.x86_64
/dev/sda: write-caching = 1 (on)
/dev/sda1: HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device

(In both cases /dev/sda is an INTEL SSDSC2BA20.)
So a few questions to end this: 1) What was the magic patch in 2.6.33 which made write-caching safe? 2) What's the recommended recourse here: hopefully Red Hat backported the necessary patches to their 2.6.32 kernel, but if not, should we fix _check_disk_write_cache and make some publicity for people to check their configs? Best Regards, Dan
teuthology rados runs for next
Hi Sam, I noticed today that http://pulpito.ceph.com/?suite=rados&branch=next is lagging three days behind. Do we want to keep all the runs or should we kill the older ones ? I suppose there would be value in having the results for all of them but given the current load in the sepia lab it also significantly delays them. What do you think ? Cheers -- Loïc Dachary, Artisan Logiciel Libre
Re: teuthology rados runs for next
On Tue, 21 Jul 2015, Loic Dachary wrote: Hi Sam, I noticed today that http://pulpito.ceph.com/?suite=rados&branch=next is lagging three days behind. Do we want to keep all the runs or should we kill the older ones ? I suppose there would be value in having the results for all of them but given the current load in the sepia lab it also significantly delays them. What do you think ? I think it's better to kill old scheduled runs.
Re: hdparm -W redux, bug in _check_disk_write_cache for RHEL6?
On Tue, 21 Jul 2015, Dan van der Ster wrote: Hi, Following the sf.net corruption report I've been checking our config w.r.t data consistency. AFAIK the two main recommendations are: 1) don't mount FileStores with nobarrier 2) disable write-caching (hdparm -W 0 /dev/sdX) when using block dev journals and your kernel is < 2.6.33 Obviously we don't do (1) because that would be crazy, but for (2) we didn't disable yet write-caching, probably because we didn't notice the doc. But my lame excuse is that apparently _check_disk_write_cache in FileJournal.cc doesn't print a warning when it should, because hdparm -W doesn't always work on partitions rather than whole block devices. See: GOOD: ceph 0.94.2, kernel 3.10.0-229.7.2.el7.x86_64, hdparm v9.43: 10 journal _open_block_device: ignoring osd journal size. We'll use the entire block device (size: 21474836480) 20 journal _check_disk_write_cache: disk write cache is on, but your kernel is new enough to handle it correctly. (fn:/var/lib/ceph/osd/ceph-96/journal) 1 journal _open /var/lib/ceph/osd/ceph-96/journal fd 20: 21474836480 bytes, block size 4096 bytes, directio = 1, aio = 1 BAD: ceph 0.94.2, kernel 2.6.32-431.29.2.el6.x86_64, hdparm v9.43: 10 journal _open_block_device: ignoring osd journal size. We'll use the entire block device (size: 21474836480) 1 journal _open /var/lib/ceph/osd/ceph-56/journal fd 19: 21474836480 bytes, block size 4096 bytes, directio = 1, aio = 1 In other words, running hammer on EL6, _check_disk_write_cache exits without printing anything, but actually it should log the scary WARNING: disk write cache is ON.
I guess it's because of this: GOOD # uname -r hdparm -W /dev/sda hdparm -W /dev/sda1 3.10.0-229.7.2.el7.x86_64 /dev/sda1: write-caching = 1 (on) /dev/sda: write-caching = 1 (on) BAD # uname -r hdparm -W /dev/sda hdparm -W /dev/sda1 2.6.32-431.23.3.el6.x86_64 /dev/sda: write-caching = 1 (on) /dev/sda1: HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device (in both cases /dev/sda is an INTEL SSDSC2BA20). So a few questions to end this: 1) What was the magic patch in 2.6.33 which made write-caching safe? The specific behavior is that we want fsync or fdatasync to flush the write cache on the underlying device. Unfortunately I've lost track of which commit led me to the magic 2.6.33 number. However, this reference seems to confirm that 2.6.33 is a safe upper bound: http://monolight.cc/2011/06/barriers-caches-filesystems/ 2) What's the recommended recourse here: hopefully Red Hat backported the necessary to their 2.6.32 kernel, but if not should we fix _check_disk_write_cache and make some publicity for people to check their configs? I have no doubt that any and all patches related to flushing caches on fsync are part of the el6 kernel. What's embarrassing is that hdparm fails on kernels old enough to fail the test :). The fix is probably to strip off the partition number (ideally using the helpers in blkdev.cc so that it works even for weirdly-named devices) and run hdparm on that. sage Best Regards, Dan
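As an illustration of that suggestion, stripping the trailing partition digits to find the parent device could look something like this (a naive sketch only, not the blkdev.cc helpers; it assumes simple sdXN-style names):

```cpp
#include <cctype>
#include <string>

// Naive sketch (illustration only): map a partition node such as
// "/dev/sda1" back to its parent disk "/dev/sda" by stripping trailing
// digits, so that hdparm -W can be queried against the whole device,
// which works even where querying the partition fails (as on EL6 above).
// The real fix should use the blkdev.cc helpers; this version mishandles
// names like /dev/mmcblk0p1 or /dev/cciss/c0d0p1.
std::string whole_disk_of(const std::string &dev)
{
  std::string d = dev;
  while (!d.empty() && isdigit(static_cast<unsigned char>(d.back())))
    d.pop_back();
  return d;
}
```

With the parent device in hand, the cache check could then query /dev/sda instead of /dev/sda1, which the hdparm output above shows works on both kernels.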
Re: dmcrypt with luks keys in hammer
On Tue, 21 Jul 2015, David Disseldorp wrote: Hi, On Mon, 20 Jul 2015 15:21:50 -0700 (PDT), Sage Weil wrote: On Mon, 20 Jul 2015, Wyllys Ingersoll wrote: No luck with ceph-disk-activate (all or just one device).

$ sudo ceph-disk-activate /dev/sdv1
mount: unknown filesystem type 'crypto_LUKS'
ceph-disk: Mounting filesystem failed: Command '['/bin/mount', '-t', 'crypto_LUKS', '-o', '', '--', '/dev/sdv1', '/var/lib/ceph/tmp/mnt.QHe3zK']' returned non-zero exit status 32

It's odd that it should complain about the crypto_LUKS filesystem not being recognized, because it did mount some of the LUKS systems successfully, though sometimes just the data and not the journal (or vice versa).

$ lsblk /dev/sdb
NAME                                            MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sdb                                               8:16   0  3.7T  0 disk
├─sdb1                                            8:17   0  3.6T  0 part
│ └─e8bc1531-a187-4fd2-9e3f-cf90255f89d0 (dm-0) 252:0    0  3.6T  0 crypt /var/lib/ceph/osd/ceph-54
└─sdb2                                            8:18   0   10G  0 part
  └─temporary-cryptsetup-1235 (dm-6)            252:6    0  125K  1 crypt

$ blkid /dev/sdb1
/dev/sdb1: UUID=d6194096-a219-4732-8d61-d0c125c49393 TYPE=crypto_LUKS

A race condition (or other issue) with udev seems likely given that it's rather random which ones come up and which ones don't. A race condition during creation or activation? If it's activation I would expect ceph-disk activate ... to work reasonably reliably when called manually (on a single device at a time). We encountered similar issues on a non-dmcrypt firefly deployment with 10 OSDs per node. I've been working on a patch set to defer device activation to systemd services. ceph-disk activate is extended to support mapping of dmcrypt devices prior to OSD startup. The master-based changes aren't ready for upstream yet, but can be found in my WIP branch at: https://github.com/ddiss/ceph/tree/wip_bnc926756_split_udev_systemd_master This approach looks to be MUCH MUCH better than what we're doing right now!
There are a few things that I'd still like to address before submitting upstream, mostly covering activate-journal: - The test/ceph-disk.sh unit tests need to be extended and fixed. - The activate-journal --dmcrypt changes are less than optimal, and leave me with a few unanswered questions: + Does get_journal_osd_uuid(dev) return the plaintext or cyphertext uuid? The uuid is never encrypted. + If a journal is encrypted, is the data partition also always encrypted? Yes (I don't think it's useful to support a mixed encrypted/unencrypted OSD). - dmcrypt journal device mapping should probably also be split out into a separate systemd service, as that'll be needed for the future network based key retrieval feature. Feedback on the approach taken would be appreciated. My only regret is that it won't help non-systemd cases, but I'm okay with leaving those as is (users can use the existing workarounds, like 'ceph-disk activate-all' in rc.local to mop up stragglers) and focus instead on the new systemd world. Let us know if there's anything else we can do to help! sage
RE: The design of the eviction improvement
On Tue, 21 Jul 2015, Wang, Zhiqiang wrote: -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Tuesday, July 21, 2015 6:38 AM To: Wang, Zhiqiang Cc: sj...@redhat.com; ceph-devel@vger.kernel.org Subject: Re: The design of the eviction improvement On Mon, 20 Jul 2015, Wang, Zhiqiang wrote: Hi all, This is a follow-up of one of the CDS sessions at http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction. We discussed the drawbacks of the current eviction algorithm and several ways to improve it. It seems like the LRU variants are the right way to go. I came up with some design points after the CDS, and want to discuss them with you. It is an approximate 2Q algorithm, combining some benefits of the clock algorithm, similar to what the Linux kernel does for the page cache. Unfortunately I missed this last CDS so I'm behind on the discussion. I have a few questions though... # Design points: ## LRU lists - Maintain LRU lists at the PG level. The SharedLRU and SimpleLRU implementations in the current code have a max_size, which limits the max number of elements in the list. This mostly looks like an MRU, though its name implies they are LRUs. Since the object size may vary in a PG, it's not possible to calculate the total number of objects which the cache tier can hold ahead of time. We need a new LRU implementation with no limit on the size. This last sentence seems to me to be the crux of it. Assuming we have an OSD backed by flash storing O(n) objects, we need a way to maintain an LRU of O(n) objects in memory. The current hitset-based approach was taken based on the assumption that this wasn't feasible--or at least we didn't know how to implement such a thing. If it is, or we simply want to stipulate that cache tier OSDs get gobs of RAM to make it possible, then lots of better options become possible...
Let's say you have a 1TB SSD, with an average object size of 1MB -- that's 1 million objects. At maybe ~100 bytes per object of RAM for an LRU entry that's 100MB... so not so unreasonable, perhaps! I was having the same question before proposing this. I did a similar calculation and thought it would be ok to use this much memory :-) The part that worries me now is the speed with which we can load and manage such a list. Assuming it is several hundred MB, it'll take a while to load that into memory and set up all the pointers (assuming a conventional linked-list structure). Maybe tens of seconds... I wonder if instead we should construct some sort of flat model where we load slabs of contiguous memory, 10's of MB each, and have the next/previous pointers be a (slab, position) pair. That way we can load it into memory in big chunks, quickly, and be able to operate on it (adjust links) immediately. Another thought: currently we use the hobject_t hash only instead of the full object name. We could continue to do the same, or we could do a hash pair (hobject_t hash + a different hash of the rest of the object) to keep the representation compact. With a model like the above, that could get the object representation down to 2 u32's. A link could be a slab + position (2 more u32's), and if we have prev + next that'd be just 6x4=24 bytes per object. With fixed-sized slots on the slabs, the slab allocator could be very simple... maybe just a bitmap, a free counter, and any other trivial optimizations to make finding a slab's next free slot nice and quick. - Two lists for each PG: active and inactive. Objects are first put into the inactive list when they are accessed, and moved between these two lists based on some criteria. Object flags: active, referenced, unevictable, dirty.
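To make the flat model concrete, here is a rough sketch of what such a slab layout could look like (hypothetical code, not Ceph's; the names and the simple per-slot free tracking are assumptions, and a real version would persist slabs to disk and use an actual bitmap):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the flat slab model described above (not Ceph
// code). Each LRU entry is exactly six u32's: two hashes identifying the
// object, plus prev/next links expressed as (slab, position) pairs, so a
// slab can be loaded from disk as one contiguous chunk and its links used
// immediately, with no per-node pointer fixup.
struct SlabRef {
  uint32_t slab;
  uint32_t pos;
};

struct Entry {
  uint32_t hobject_hash;  // hobject_t hash
  uint32_t name_hash;     // hash of the rest of the object name
  SlabRef prev, next;
};

class SlabAllocator {
public:
  enum : uint32_t { SLOTS_PER_SLAB = 4096 };

  // Find a free slot, adding a new slab when all existing ones are full.
  SlabRef alloc() {
    for (uint32_t s = 0; s < slabs_.size(); ++s) {
      if (free_count_[s] == 0)
        continue;
      for (uint32_t p = 0; p < SLOTS_PER_SLAB; ++p) {
        if (!used_[s][p]) {
          used_[s][p] = true;
          --free_count_[s];
          return {s, p};
        }
      }
    }
    slabs_.emplace_back(SLOTS_PER_SLAB);
    used_.emplace_back(SLOTS_PER_SLAB, false);
    free_count_.push_back(SLOTS_PER_SLAB - 1);
    used_.back()[0] = true;
    return {uint32_t(slabs_.size() - 1), 0u};
  }

  void release(SlabRef r) {
    used_[r.slab][r.pos] = false;
    ++free_count_[r.slab];
  }

  Entry &at(SlabRef r) { return slabs_[r.slab][r.pos]; }
  std::size_t num_slabs() const { return slabs_.size(); }

private:
  std::vector<std::vector<Entry> > slabs_;  // fixed-size slot arrays
  std::vector<std::vector<bool> > used_;    // stand-in for a real bitmap
  std::vector<uint32_t> free_count_;        // quick "any free slot?" check
};
```

Since each entry is exactly six u32's, sizeof(Entry) matches the 24 bytes per object estimated above, and a slab travels to and from disk as one contiguous chunk.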
- When an object is accessed: 1) If it's not in either of the lists, it's put on the top of the inactive list. 2) If it's in the inactive list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the inactive list. 3) If it's in the inactive list, and the referenced flag is set, the referenced flag is cleared, and it's removed from the inactive list and put on top of the active list. 4) If it's in the active list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the active list. 5) If it's in the active list, and the referenced flag is set, it's moved to the top of the active list. - When selecting objects to evict: 1) Objects at the bottom of the inactive list are selected to evict. They are removed from the inactive list. 2) If the number of objects in the inactive list becomes low, some of the objects at the bottom of the active list are moved to the inactive list. Those objects which have the referenced flag set are given one more chance in the active list: they are moved to the top of the active list with the referenced flag cleared. Those which don't have it set are moved to the inactive list with the referenced flag set, so that they can be quickly promoted back to the active list when necessary.
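The access rules above might be sketched like this (an illustrative model only, not proposed Ceph code; it uses std::list and a hash map for brevity, whereas a real implementation would want intrusive links so that moves avoid std::list::remove's O(n) scan):

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

// Illustrative model of the access rules above (not Ceph code). Objects
// are stand-in u64 ids; the front of each std::list is the "top".
class TwoQueueLru {
public:
  enum Where { NONE, INACTIVE, ACTIVE };

  void access(uint64_t obj) {
    State &st = state_[obj];
    switch (st.where) {
    case NONE:  // rule 1: unseen object goes to the top of inactive
      st.where = INACTIVE;
      st.referenced = false;
      inactive_.push_front(obj);
      break;
    case INACTIVE:
      inactive_.remove(obj);   // O(n) here; intrusive links fix this
      if (!st.referenced) {    // rule 2: mark referenced, move to top
        st.referenced = true;
        inactive_.push_front(obj);
      } else {                 // rule 3: promote to active, clear flag
        st.referenced = false;
        st.where = ACTIVE;
        active_.push_front(obj);
      }
      break;
    case ACTIVE:  // rules 4 and 5: set referenced, move to top of active
      active_.remove(obj);
      st.referenced = true;
      active_.push_front(obj);
      break;
    }
  }

  Where where(uint64_t obj) const {
    auto it = state_.find(obj);
    return it == state_.end() ? NONE : it->second.where;
  }

private:
  struct State {
    Where where = NONE;
    bool referenced = false;
  };
  std::list<uint64_t> inactive_, active_;
  std::unordered_map<uint64_t, State> state_;
};
```

Eviction would then pop from the back of inactive_, refilling it from the back of active_ according to the selection rules above.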
upstream/firefly exporting the same snap 2 times results in different exports
Hi, I remember there was a bug in ceph before (not sure in which release) where exporting the same rbd snap multiple times resulted in different raw images. Currently running upstream/firefly and I'm seeing the same again.

# rbd export cephstor/disk-116@snap dump1
# sleep 10
# rbd export cephstor/disk-116@snap dump2
# md5sum -b dump*
b89198f118de59b3aa832db1bfddaf8f *dump1
f63ed9345ac2d5898483531e473772b1 *dump2

Can anybody help? Greets, Stefan
local teuthology testing
Hi David/Loic, I was also trying to set up some local Teuthology clusters here. The biggest issue I met is in ceph-qa-chef: there are lots of hardcoded URLs related to the sepia lab. I had to trace the code and change them line by line. Can you please share how you got this working? Is there an easier way to fix this? Thanks, -yuan
Re: The design of the eviction improvement
Hi, A couple of points. 1) A successor to 2Q is MQ (Li et al). We have an intrusive MQ LRU implementation with 2 levels currently, plus a pinned queue, that addresses stuff like partitioning (sharding), scan resistance, and coordination w/lookup tables. We might extend/re-use it. 2) I'm a bit confused by the active/inactive vocabulary, the dimensioning of cache segments (are you proposing to/do we now always cache whole objects?), and the cost of looking for dirty objects; I suspect that it makes sense to amortize the cost of locating segments eligible to be flushed, rather than minimize bookkeeping. Matt - Zhiqiang Wang zhiqiang.w...@intel.com wrote: -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Tuesday, July 21, 2015 6:38 AM To: Wang, Zhiqiang Cc: sj...@redhat.com; ceph-devel@vger.kernel.org Subject: Re: The design of the eviction improvement On Mon, 20 Jul 2015, Wang, Zhiqiang wrote: Hi all, This is a follow-up of one of the CDS sessions at http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction. We discussed the drawbacks of the current eviction algorithm and several ways to improve it. Seems like the LRU variants is the right way to go. I come up with some design points after the CDS, and want to discuss it with you. It is an approximate 2Q algorithm, combining some benefits of the clock algorithm, similar to what the linux kernel does for the page cache. Unfortunately I missed this last CDS so I'm behind on the discussion. I have a few questions though... # Design points: ## LRU lists - Maintain LRU lists at the PG level. The SharedLRU and SimpleLRU implementation in the current code have a max_size, which limits the max number of elements in the list. This mostly looks like a MRU, though its name implies they are LRUs. Since the object size may vary in a PG, it's not possible to calculate the total number of objects which the cache tier can hold ahead of time.
We need a new LRU implementation with no limit on the size. This last sentence seems to me to be the crux of it. Assuming we have an OSD based by flash storing O(n) objects, we need a way to maintain an LRU of O(n) objects in memory. The current hitset-based approach was taken based on the assumption that this wasn't feasible--or at least we didn't know how to implmement such a thing. If it is, or we simply want to stipulate that cache tier OSDs get gobs of RAM to make it possible, then lots of better options become possible... Let's say you have a 1TB SSD, with an average object size of 1MB -- that's 1 million objects. At maybe ~100bytes per object of RAM for an LRU entry that's 100MB... so not so unreasonable, perhaps! I was having the same question before proposing this. I did the similar calculation and thought it would be ok to use this many memory :-) - Two lists for each PG: active and inactive Objects are first put into the inactive list when they are accessed, and moved between these two lists based on some criteria. Object flag: active, referenced, unevictable, dirty. - When an object is accessed: 1) If it's not in both of the lists, it's put on the top of the inactive list 2) If it's in the inactive list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the inactive list. 3) If it's in the inactive list, and the referenced flag is set, the referenced flag is cleared, and it's removed from the inactive list, and put on top of the active list. 4) If it's in the active list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the active list. 5) If it's in the active list, and the referenced flag is set, it's moved to the top of the active list. - When selecting objects to evict: 1) Objects at the bottom of the inactive list are selected to evict. They are removed from the inactive list. 
2) If the number of the objects in the inactive list becomes low, some of the objects at the bottom of the active list are moved to the inactive list. For those objects which have the referenced flag set, they are given one more chance in the active list. They are moved to the top of the active list with the referenced flag cleared. For those objects which don't have the referenced flag set, they are moved to the inactive list, with the referenced flag set. So that they can be quickly promoted to the active list when necessary. ## Combine flush with eviction - When evicting an object, if it's dirty, it's flushed first. After flushing, it's evicted. If not dirty, it's evicted directly. - This means that we won't have separate activities and won't set different ratios for flush and evict. Is there a need to do so?
Re: hdparm -W redux, bug in _check_disk_write_cache for RHEL6?
On Tue, Jul 21, 2015 at 4:54 PM, Sage Weil s...@newdream.net wrote: On Tue, 21 Jul 2015, Dan van der Ster wrote: Hi, Following the sf.net corruption report I've been checking our config w.r.t data consistency. AFAIK the two main recommendations are: 1) don't mount FileStores with nobarrier 2) disable write-caching (hdparm -W 0 /dev/sdX) when using block dev journals and your kernel is < 2.6.33 Obviously we don't do (1) because that would be crazy, but for (2) we didn't disable yet write-caching, probably because we didn't notice the doc. But my lame excuse is that apparently _check_disk_write_cache in FileJournal.cc doesn't print a warning when it should, because hdparm -W doesn't always work on partitions rather than whole block devices. See: GOOD: ceph 0.94.2, kernel 3.10.0-229.7.2.el7.x86_64, hdparm v9.43: 10 journal _open_block_device: ignoring osd journal size. We'll use the entire block device (size: 21474836480) 20 journal _check_disk_write_cache: disk write cache is on, but your kernel is new enough to handle it correctly. (fn:/var/lib/ceph/osd/ceph-96/journal) 1 journal _open /var/lib/ceph/osd/ceph-96/journal fd 20: 21474836480 bytes, block size 4096 bytes, directio = 1, aio = 1 BAD: ceph 0.94.2, kernel 2.6.32-431.29.2.el6.x86_64, hdparm v9.43: 10 journal _open_block_device: ignoring osd journal size. We'll use the entire block device (size: 21474836480) 1 journal _open /var/lib/ceph/osd/ceph-56/journal fd 19: 21474836480 bytes, block size 4096 bytes, directio = 1, aio = 1 In other words, running hammer on EL6, _check_disk_write_cache exits without printing anything, but actually it should log the scary WARNING: disk write cache is ON.
I guess it's because of this: GOOD # uname -r hdparm -W /dev/sda hdparm -W /dev/sda1 3.10.0-229.7.2.el7.x86_64 /dev/sda1: write-caching = 1 (on) /dev/sda: write-caching = 1 (on) BAD # uname -r hdparm -W /dev/sda hdparm -W /dev/sda1 2.6.32-431.23.3.el6.x86_64 /dev/sda: write-caching = 1 (on) /dev/sda1: HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device (in both cases /dev/sda is an INTEL SSDSC2BA20). So a few questions to end this: 1) What was the magic patch in 2.6.33 which made write-caching safe? The specific behavior is that we want fsync or fdatasync to flush the write cache on the underlying device. Unfortunately I've lost track of which commit led me to the magic 2.6.33 number. However, this reference seems to confirm that 2.6.33 is a safe upper bound: http://monolight.cc/2011/06/barriers-caches-filesystems/ This one, I think:

commit ab0a9735e06914ce4d2a94ffa41497dbc142fe7f
Author: Christoph Hellwig h...@lst.de
Date: Thu Oct 29 14:14:04 2009 +0100

    blkdev: flush disk cache on ->fsync

    Currently there is no barrier support in the block device code. That means we cannot guarantee any sort of data integerity when using the block device node with dis kwrite caches enabled. Using the raw block device node is a typical use case for virtualization (and I assume databases, too). This patch changes block_fsync to issue a cache flush and thus make fsync on block device nodes actually useful.

    Note that in mainline we would also need to add such code to the ->aio_write method for O_SYNC handling, but assuming that Jan's patch series for the O_SYNC rewrite goes in it will also call into ->fsync for 2.6.32.

    Signed-off-by: Christoph Hellwig h...@lst.de
    Signed-off-by: Jens Axboe jens.ax...@oracle.com

2) What's the recommended recourse here: hopefully Red Hat backported the necessary to their 2.6.32 kernel, but if not should we fix _check_disk_write_cache and make some publicity for people to check their configs?
I have no doubt that any and all patches related to flushing caches on fsync are part of the el6 kernel. What's embarassing is that hdparm fails on kernels old enough to fail the test :). The fix is probably to strip off the partition number (ideally using the helpers in blkdev.cc so that it works even for weirdly-named devices) and run hdparm on that. We should look into using libblkid for this and nuking blkdev.cc. rbd unmap supports unmap by partition and already relies on libblkid to do the partition - whole disk thing. I can't remember if that function is old enough to be in el6 base, I can take a stab at this if it is... Thanks, Ilya
Re: The design of the eviction improvement
On Tue, Jul 21, 2015 at 3:15 PM, Matt W. Benjamin m...@cohortfs.com wrote: Hi, Couple of points. 1) a successor to 2Q is MQ (Li et al). We have an intrusive MQ LRU implementation with 2 levels currently, plus a pinned queue, that addresses stuff like partitioning (sharding), scan resistance, and coordination w/lookup tables. We might extend/re-use it. 2) I'm a bit confused by active/inactive vocabulary, dimensioning of cache segments (are you proposing to/do we now always cache whole objects?), and cost of looking for dirty objects; I suspect that it makes sense to amortize the cost of locating segments eligible to be flushed, rather than minimize bookkeeping. We make caching decisions in terms of whole objects right now, yeah. There's really nothing in the system that's capable of doing segments within an object, and it's not just about tracking a little more metadata about dirty objects — the way we handle snapshots, etc would have to be reworked if we were allowing partial-object caching. Plus keep in mind the IO cost of the bookkeeping — it needs to be either consistently persisted to disk or reconstructable from whatever happens to be in the object. That can get expensive really fast. -Greg Matt - Zhiqiang Wang zhiqiang.w...@intel.com wrote: -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Tuesday, July 21, 2015 6:38 AM To: Wang, Zhiqiang Cc: sj...@redhat.com; ceph-devel@vger.kernel.org Subject: Re: The design of the eviction improvement On Mon, 20 Jul 2015, Wang, Zhiqiang wrote: Hi all, This is a follow-up of one of the CDS session at http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tieri ng_eviction. We discussed the drawbacks of the current eviction algorithm and several ways to improve it. Seems like the LRU variants is the right way to go. I come up with some design points after the CDS, and want to discuss it with you. 
It is an approximate 2Q algorithm, combining some benefits of the clock algorithm, similar to what the linux kernel does for the page cache. Unfortunately I missed this last CDS so I'm behind on the discussion. I have a few questions though... # Design points: ## LRU lists - Maintain LRU lists at the PG level. The SharedLRU and SimpleLRU implementation in the current code have a max_size, which limits the max number of elements in the list. This mostly looks like a MRU, though its name implies they are LRUs. Since the object size may vary in a PG, it's not possible to caculate the total number of objects which the cache tier can hold ahead of time. We need a new LRU implementation with no limit on the size. This last sentence seems to me to be the crux of it. Assuming we have an OSD based by flash storing O(n) objects, we need a way to maintain an LRU of O(n) objects in memory. The current hitset-based approach was taken based on the assumption that this wasn't feasible--or at least we didn't know how to implmement such a thing. If it is, or we simply want to stipulate that cache tier OSDs get gobs of RAM to make it possible, then lots of better options become possible... Let's say you have a 1TB SSD, with an average object size of 1MB -- that's 1 million objects. At maybe ~100bytes per object of RAM for an LRU entry that's 100MB... so not so unreasonable, perhaps! I was having the same question before proposing this. I did the similar calculation and thought it would be ok to use this many memory :-) - Two lists for each PG: active and inactive Objects are first put into the inactive list when they are accessed, and moved between these two lists based on some criteria. Object flag: active, referenced, unevictable, dirty. 
- When an object is accessed:
1) If it's in neither of the lists, it's put on the top of the inactive list.
2) If it's in the inactive list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the inactive list.
3) If it's in the inactive list, and the referenced flag is set, the referenced flag is cleared, and it's removed from the inactive list and put on top of the active list.
4) If it's in the active list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the active list.
5) If it's in the active list, and the referenced flag is set, it's moved to the top of the active list.

- When selecting objects to evict:
1) Objects at the bottom of the inactive list are selected to evict. They are removed from the inactive list.
2) If the number of the objects in the inactive list becomes low, some of the objects at the bottom of the active list are moved to the inactive list. For those objects which have the referenced flag set, they are given one
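The access and eviction rules above can be sketched with two ordered lists and a referenced bit per object. This is a hypothetical illustration, not Ceph code: the class name, the low-water refill of the inactive list, and the handling of demoted referenced objects (the rule is truncated in the mail) are all assumptions.

```python
from collections import OrderedDict

class TwoListLRU:
    """Sketch of the proposed per-PG active/inactive lists (hypothetical,
    not Ceph code). The first entry of each OrderedDict is the top of the
    list; the mapped value is the object's referenced flag."""

    def __init__(self, inactive_low_water=2):
        self.inactive = OrderedDict()
        self.active = OrderedDict()
        self.inactive_low_water = inactive_low_water

    def access(self, obj):
        if obj in self.inactive:
            if not self.inactive[obj]:
                # rule 2: set referenced, move to top of inactive
                self.inactive[obj] = True
                self.inactive.move_to_end(obj, last=False)
            else:
                # rule 3: clear referenced, promote to top of active
                del self.inactive[obj]
                self.active[obj] = False
                self.active.move_to_end(obj, last=False)
        elif obj in self.active:
            # rules 4/5: set referenced, move to top of active
            self.active[obj] = True
            self.active.move_to_end(obj, last=False)
        else:
            # rule 1: unknown objects enter at the top of the inactive list
            self.inactive[obj] = False
            self.inactive.move_to_end(obj, last=False)

    def select_victim(self):
        # eviction rules: refill a low inactive list from the bottom of
        # the active list, then evict from the bottom of the inactive list.
        if len(self.inactive) <= self.inactive_low_water and self.active:
            obj, ref = self.active.popitem(last=True)
            # demote to the top of the inactive list; the second-chance
            # treatment of referenced objects is truncated in the mail,
            # so we simply carry the flag over here.
            self.inactive[obj] = ref
            self.inactive.move_to_end(obj, last=False)
        if self.inactive:
            return self.inactive.popitem(last=True)[0]
        return None
```

A cold object needs three accesses to reach the active list (inserted, referenced, promoted), which is what gives the scheme its scan resistance: a one-pass scan never gets past the inactive list.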
Re: dmcrypt with luks keys in hammer
On 07/21/2015 01:14 PM, David Disseldorp wrote:

A race condition (or other issue) with udev seems likely given that it's rather random which ones come up and which ones don't.

A race condition during creation or activation? If it's activation I would expect ceph-disk activate ... to work reasonably reliably when called manually (on a single device at a time).

I still do not understand completely how the dmcrypt activation in Ceph is designed, but there are clear problems in the current design. Activation of another device-mapper device inside udev rules (here a LUKS or plain dmcrypt device) is broken by design; it can work only with ugly workarounds.

The first reason is correctly mentioned in the wip branch you mentioned: udev RUN is intended for short-running commands. For example, I think if you increase the iteration count in a LUKS device, the whole Ceph udev rule fails completely because udev thread processing will kill it on timeout... (Unlocking can take even minutes when you move an encrypted disk to a very slow machine.)

The second reason is even more serious - cryptsetup itself uses udev (through libdevmapper) to create nodes and must synchronize with some other device-mapper udev rules. So here it is a race by design... udev waits for another udev process. Ditto for creating the /dev/by-* links (created by a udev rule as well).

(And add to the mix the +watch rules, which react to close-after-write on every node by running another udev rule, a blkid scan. If you see some leftover temporary-cryptsetup* devices, something is really wrong. These devices are internal to libcryptsetup and map keyslots only; they are never kept open in correct operation.)

So moving activation outside of the udev rules is the correct solution here; only the processing of device nodes should happen there, and the rest should be offloaded until after the udev rules run.

We encountered similar issues on a non-dmcrypt firefly deployment with 10 OSDs per node. I've been working on a patch set to defer device activation to systemd services.
ceph-disk activate is extended to support mapping of dmcrypt devices prior to OSD startup.

Well, using a systemd service is one option. But then it should handle all cryptsetup device activations.

Milan
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
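For reference, deferring activation out of udev and into systemd could look roughly like the sketch below: the udev rule only tags the device, and a templated oneshot service does the long-running LUKS unlock and mount outside the udev worker's timeout. The unit and rule are made up for illustration and do not match David's actual branch.

```ini
# Hypothetical /etc/systemd/system/ceph-disk-activate@.service
# A udev rule would only queue the unit, e.g.:
#   ENV{SYSTEMD_WANTS}+="ceph-disk-activate@%k.service"
[Unit]
Description=Ceph disk activation for /dev/%i
After=local-fs.target

[Service]
Type=oneshot
# The long-running work (cryptsetup luksOpen, mount, OSD start) happens
# here, outside the udev RUN timeout.
ExecStart=/usr/sbin/ceph-disk activate /dev/%i
TimeoutSec=0
```

This addresses both of Milan's objections: the udev worker returns immediately, and the cryptsetup/libdevmapper udev synchronization happens in an ordinary process rather than inside another udev rule.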
Re: dmcrypt with luks keys in hammer
ceph-disk activate-all does not fix the problem for non-systemd users. Once they are in the temporary-cryptsetup-PID state, they have to be manually cleared and remounted as follows:

1. cryptsetup close all of the ones in the temporary-cryptsetup state
2. find the UUID for each block device (journal and data partitions)
3. cryptsetup luksOpen on those devices individually

for i in `ls /dev/sd?[12] | grep -v sda`
do
  UUID=`sudo blkid -p $i | sed 's/ /\n/g' | grep PART_ENTRY_UUID | cut -f2 -d= | tr -d '"'`
  sudo cryptsetup luksOpen $i $UUID --key-file /etc/ceph/dmcrypt-keys/${UUID}.luks.key
done

$ sudo start ceph-osd-all

On Tue, Jul 21, 2015 at 10:00 AM, Sage Weil s...@newdream.net wrote:
On Tue, 21 Jul 2015, David Disseldorp wrote:
Hi,
On Mon, 20 Jul 2015 15:21:50 -0700 (PDT), Sage Weil wrote:
On Mon, 20 Jul 2015, Wyllys Ingersoll wrote:

No luck with ceph-disk-activate (all or just one device).

$ sudo ceph-disk-activate /dev/sdv1
mount: unknown filesystem type 'crypto_LUKS'
ceph-disk: Mounting filesystem failed: Command '['/bin/mount', '-t', 'crypto_LUKS', '-o', '', '--', '/dev/sdv1', '/var/lib/ceph/tmp/mnt.QHe3zK']' returned non-zero exit status 32

It's odd that it should complain about the crypto_LUKS filesystem not being recognized, because it did mount some of the LUKS systems successfully, though sometimes just the data and not the journal (or vice versa).

$ lsblk /dev/sdb
NAME                                           MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sdb                                              8:16   0  3.7T  0 disk
├─sdb1                                           8:17   0  3.6T  0 part
│ └─e8bc1531-a187-4fd2-9e3f-cf90255f89d0 (dm-0) 252:0    0  3.6T  0 crypt /var/lib/ceph/osd/ceph-54
└─sdb2                                           8:18   0   10G  0 part
  └─temporary-cryptsetup-1235 (dm-6)            252:6    0  125K  1 crypt

$ blkid /dev/sdb1
/dev/sdb1: UUID=d6194096-a219-4732-8d61-d0c125c49393 TYPE=crypto_LUKS

A race condition (or other issue) with udev seems likely given that it's rather random which ones come up and which ones don't.

A race condition during creation or activation? If it's activation I would expect ceph-disk activate ...
to work reasonably reliably when called manually (on a single device at a time).

We encountered similar issues on a non-dmcrypt firefly deployment with 10 OSDs per node. I've been working on a patch set to defer device activation to systemd services. ceph-disk activate is extended to support mapping of dmcrypt devices prior to OSD startup. The master-based changes aren't ready for upstream yet, but can be found in my WIP branch at:
https://github.com/ddiss/ceph/tree/wip_bnc926756_split_udev_systemd_master

This approach looks to be MUCH MUCH better than what we're doing right now!

There are a few things that I'd still like to address before submitting upstream, mostly covering activate-journal:
- The test/ceph-disk.sh unit tests need to be extended and fixed.
- The activate-journal --dmcrypt changes are less than optimal, and leave me with a few unanswered questions:
  + Does get_journal_osd_uuid(dev) return the plaintext or ciphertext uuid?

The uuid is never encrypted.

  + If a journal is encrypted, is the data partition also always encrypted?

Yes (I don't think it's useful to support a mixed encrypted/unencrypted OSD).

- dmcrypt journal device mapping should probably also be split out into a separate systemd service, as that'll be needed for the future network-based key retrieval feature.

Feedback on the approach taken would be appreciated.

My only regret is that it won't help non-systemd cases, but I'm okay with leaving those as is (users can use the existing workarounds, like 'ceph-disk activate-all' in rc.local to mop up stragglers) and focusing instead on the new systemd world.

Let us know if there's anything else we can do to help!

sage
Re: upstream/firefly exporting the same snap 2 times results in different exports
Any chance that the snapshot was just created prior to the first export and you have a process actively writing to the image?

-- Jason Dillaman Red Hat dilla...@redhat.com http://www.redhat.com

----- Original Message -----
From: Stefan Priebe - Profihost AG s.pri...@profihost.ag
To: ceph-devel@vger.kernel.org
Sent: Tuesday, July 21, 2015 8:29:46 AM
Subject: upstream/firefly exporting the same snap 2 times results in different exports

Hi,

I remember there was a bug before in ceph (not sure in which release) where exporting the same rbd snap multiple times resulted in different raw images. Currently running upstream/firefly and I'm seeing the same again.

# rbd export cephstor/disk-116@snap dump1
# sleep 10
# rbd export cephstor/disk-116@snap dump2
# md5sum -b dump*
b89198f118de59b3aa832db1bfddaf8f *dump1
f63ed9345ac2d5898483531e473772b1 *dump2

Can anybody help?

Greets,
Stefan