[ceph-users] May I know the exact date of Nautilus release? Thanks!
- Vivian
SSG OTC NST Storage
Tel: (8621) 61167437
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] crush map has straw_calc_version=0 and legacy tunables on luminous
For future reference, I found these 2 links which answer most of the questions:

http://docs.ceph.com/docs/master/rados/operations/crush-map/
https://www.openstack.org/assets/presentation-media/Advanced-Tuning-and-Operation-guide-for-Block-Storage-using-Ceph-Boston-2017-final.pdf

We have about 250TB (x3) in our cluster, so I am leaning toward not changing things at this point, because it sounds like there would be a significant amount of data movement involved for not a lot in return. If anyone knows of a strong reason I should change the tunables profile away from what I have, then please let me know, so I don't end up running the cluster in a sub-optimal state for no reason.

Thanks,
Shain

--
Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649

From: ceph-users on behalf of Shain Miley
Date: Monday, February 4, 2019 at 3:03 PM
To: "ceph-users@lists.ceph.com"
Subject: [ceph-users] crush map has straw_calc_version=0 and legacy tunables on luminous

Hello,

I just upgraded our cluster to 12.2.11 and I have a few questions around straw_calc_version and tunables. Currently ceph status shows the following:

crush map has straw_calc_version=0
crush map has legacy tunables (require argonaut, min is firefly)

1. Will setting tunables to optimal also change the straw_calc_version, or do I need to set that separately?

2. Right now I have a set of rbd kernel clients connecting using kernel version 4.4. The 'ceph daemon mon.id sessions' command shows that this client is still connecting using the hammer feature set (and a few others on jewel as well):

"MonSession(client.113933130 10.35.100.121:0/3425045489 is open allow *, features 0x7fddff8ee8cbffb (jewel))",
"MonSession(client.112250505 10.35.100.99:0/4174610322 is open allow *, features 0x106b84a842a42 (hammer))",

My question is: what is the minimum kernel version I would need to upgrade the 4.4 kernel server to in order to get to jewel or luminous?

3. Will setting the tunables to optimal on luminous prevent jewel and hammer clients from connecting? I want to make sure I don't do anything that will prevent my existing clients from connecting to the cluster.

Thanks in advance,
Shain

--
Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649
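A quick way to answer "which clients are still on hammer or jewel?" is to tally the release names that appear in the session dump. A minimal sketch, assuming the MonSession line format quoted above (the helper name is made up for illustration):

```python
import re

# Hypothetical helper: tally client releases from the lines printed by
# "ceph daemon mon.<id> sessions". The "features 0x... (<release>)" format
# is assumed from the two MonSession examples quoted above.
def count_client_releases(session_lines):
    counts = {}
    pattern = re.compile(r"features\s+0x[0-9a-f]+\s+\((\w+)\)")
    for line in session_lines:
        m = pattern.search(line)
        if m:
            counts[m.group(1)] = counts.get(m.group(1), 0) + 1
    return counts

sessions = [
    '"MonSession(client.113933130 10.35.100.121:0/3425045489 is open allow *, features 0x7fddff8ee8cbffb (jewel))",',
    '"MonSession(client.112250505 10.35.100.99:0/4174610322 is open allow *, features 0x106b84a842a42 (hammer))",',
]
print(count_client_releases(sessions))  # -> {'jewel': 1, 'hammer': 1}
```

Running this periodically before changing tunables shows whether any pre-jewel clients are still connected.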
Re: [ceph-users] CephFS MDS journal
On Mon, Feb 4, 2019 at 8:03 AM Mahmoud Ismail wrote:
> On Mon, Feb 4, 2019 at 4:35 PM Gregory Farnum wrote:
>> On Mon, Feb 4, 2019 at 7:32 AM Mahmoud Ismail <mahmoudahmedism...@gmail.com> wrote:
>>> On Mon, Feb 4, 2019 at 4:16 PM Gregory Farnum wrote:
>>>> On Fri, Feb 1, 2019 at 2:29 AM Mahmoud Ismail <mahmoudahmedism...@gmail.com> wrote:
>>>>> Hello,
>>>>>
>>>>> I'm a bit confused about how the journaling actually works in the MDS.
>>>>>
>>>>> I was reading about these two configuration parameters (journal write head interval) and (mds early reply). Does the MDS flush the journal synchronously after each operation? And by setting mds early reply to true, does it allow operations to return without flushing? If so, what does the other parameter (journal write head interval) do, or isn't it for the MDS? Also, can all operations return without flushing with mds early reply, or is it specific to a subset of operations?
>>>>
>>>> In general, the MDS journal is flushed every five seconds (by default), and client requests get an early reply when the operation is done in memory but not yet committed to RADOS. Some operations will trigger an immediate flush, and there may be some operations that can't get an early reply or that need to wait for part of the operation to get committed (like renames that move a file's authority to a different MDS).
>>>> IIRC the journal write head interval controls how often it flushes out the journal's header, which limits how out-of-date its hints on restart can be. (When the MDS restarts, it asks the journal head where the journal's unfinished start and end points are, but of course more of the journaled operations may have been fully completed since the head was written.)
>>>
>>> Thanks for the explanation. Which operations trigger an immediate flush? Is readdir one of these operations? I noticed that the readdir operation latency goes higher under load when the OSDs are hitting the limit of the underlying HDD throughput. Can I assume that this is happening due to the journal flushing then?
>>
>> Not directly, but a readdir might ask to know the size of each file and that will force the other clients in the system to flush their dirty data in the directory (so that the readdir can return valid results).
>> -Greg
>
> Could it be also due to the MDS lock (operations waiting for the lock under load)?

Well, that's not going to cause high OSD usage, and the MDS lock is not held while writes are happening. But if the MDS is using 100% CPU, yes, it could be contended.

> Also, I assume that the journal is using a different thread for flushing, right?

Yes, that's correct.
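For reference, the two options discussed in this thread would live in the [mds] section of ceph.conf, roughly as sketched below. The values are illustrative only, and the exact option names on your version should be verified with "ceph daemon mds.<id> config show"; this is an assumption-laden sketch, not a recommendation:

```ini
[mds]
# Allow early replies: the client gets an answer once the op is applied
# in memory, before its journal entry is committed to RADOS, as described
# above. Reportedly on by default.
mds early reply = true

# How often (in seconds) the journal header is persisted; this bounds how
# stale the journal's start/end hints can be after an MDS restart.
# The value shown is illustrative, not a tuning recommendation.
journal write head interval = 15
```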
Re: [ceph-users] Optane still valid
I think one limitation would be the 375GB, since BlueStore needs a larger amount of space than FileStore did.

On Mon, Feb 4, 2019 at 10:20 AM Florian Engelmann <florian.engelm...@everyware.ch> wrote:
> Hi,
>
> we have built a 6 Node NVMe only Ceph Cluster with 4x Intel DC P4510 8TB each and one Intel DC P4800X 375GB Optane each. Up to 10x P4510 can be installed in each node.
> WAL and RocksDBs for all P4510 should be stored on the Optane (approx. 30GB per RocksDB incl. WAL).
> Internally, discussions arose whether the Optane would become a bottleneck from a certain number of P4510 on.
> For us, the lowest possible latency is very important. Therefore the Optane NVMes were bought. In view of the good performance of the P4510, the question arises whether the Optanes still have a noticeable effect or whether they are actually just SPOFs?
>
> All the best,
> Florian
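The sizing in the question can be sanity-checked with simple arithmetic, taking the thread's own ~30 GB per RocksDB-plus-WAL estimate at face value:

```python
# Sanity-checking the sizing above: one 375 GB Optane P4800X holding
# WAL + RocksDB for up to 10 P4510 OSDs, at the thread's own estimate of
# ~30 GB per OSD (that figure is the poster's, not a recommendation).
optane_gb = 375
per_osd_gb = 30

max_osds_by_space = optane_gb // per_osd_gb
print(max_osds_by_space)   # -> 12, so 10 OSDs fit with 75 GB to spare

for n in (4, 10):
    print(n, "OSDs would use", n * per_osd_gb, "GB of", optane_gb, "GB")
```

So capacity is not the constraint at 10 drives; the open question in the thread is throughput and the single point of failure.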
[ceph-users] crush map has straw_calc_version=0 and legacy tunables on luminous
Hello,

I just upgraded our cluster to 12.2.11 and I have a few questions around straw_calc_version and tunables. Currently ceph status shows the following:

crush map has straw_calc_version=0
crush map has legacy tunables (require argonaut, min is firefly)

1. Will setting tunables to optimal also change the straw_calc_version, or do I need to set that separately?

2. Right now I have a set of rbd kernel clients connecting using kernel version 4.4. The 'ceph daemon mon.id sessions' command shows that this client is still connecting using the hammer feature set (and a few others on jewel as well):

"MonSession(client.113933130 10.35.100.121:0/3425045489 is open allow *, features 0x7fddff8ee8cbffb (jewel))",
"MonSession(client.112250505 10.35.100.99:0/4174610322 is open allow *, features 0x106b84a842a42 (hammer))",

My question is: what is the minimum kernel version I would need to upgrade the 4.4 kernel server to in order to get to jewel or luminous?

3. Will setting the tunables to optimal on luminous prevent jewel and hammer clients from connecting? I want to make sure I don't do anything that will prevent my existing clients from connecting to the cluster.

Thanks in advance,
Shain

--
Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649
Re: [ceph-users] CephFS MDS journal
On Mon, Feb 4, 2019 at 4:35 PM Gregory Farnum wrote:
> On Mon, Feb 4, 2019 at 7:32 AM Mahmoud Ismail <mahmoudahmedism...@gmail.com> wrote:
>> On Mon, Feb 4, 2019 at 4:16 PM Gregory Farnum wrote:
>>> On Fri, Feb 1, 2019 at 2:29 AM Mahmoud Ismail <mahmoudahmedism...@gmail.com> wrote:
>>>> Hello,
>>>>
>>>> I'm a bit confused about how the journaling actually works in the MDS.
>>>>
>>>> I was reading about these two configuration parameters (journal write head interval) and (mds early reply). Does the MDS flush the journal synchronously after each operation? And by setting mds early reply to true, does it allow operations to return without flushing? If so, what does the other parameter (journal write head interval) do, or isn't it for the MDS? Also, can all operations return without flushing with mds early reply, or is it specific to a subset of operations?
>>>
>>> In general, the MDS journal is flushed every five seconds (by default), and client requests get an early reply when the operation is done in memory but not yet committed to RADOS. Some operations will trigger an immediate flush, and there may be some operations that can't get an early reply or that need to wait for part of the operation to get committed (like renames that move a file's authority to a different MDS).
>>> IIRC the journal write head interval controls how often it flushes out the journal's header, which limits how out-of-date its hints on restart can be. (When the MDS restarts, it asks the journal head where the journal's unfinished start and end points are, but of course more of the journaled operations may have been fully completed since the head was written.)
>>
>> Thanks for the explanation. Which operations trigger an immediate flush? Is readdir one of these operations? I noticed that the readdir operation latency goes higher under load when the OSDs are hitting the limit of the underlying HDD throughput. Can I assume that this is happening due to the journal flushing then?
>
> Not directly, but a readdir might ask to know the size of each file and that will force the other clients in the system to flush their dirty data in the directory (so that the readdir can return valid results).
> -Greg

Could it be also due to the MDS lock (operations waiting for the lock under load)?

Also, I assume that the journal is using a different thread for flushing, right?

>>>> Another question, are open operations also written to the journal?
>>>
>>> Not opens per se, but we do persist when clients have permission to operate on files.
>>> -Greg
>>>
>>>> Regards,
>>>> Mahmoud
Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>but I don't see l_bluestore_fragmentation counter.
>>(but I have bluestore_fragmentation_micros)

ok, this is the same:

b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros",
          "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000");

Here is a graph over the last month, with bluestore_fragmentation_micros and latency:

http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png

- Mail original -
From: "Alexandre Derumier"
To: "Igor Fedotov"
Cc: "Stefan Priebe, Profihost AG", "Mark Nelson", "Sage Weil", "ceph-users", "ceph-devel"
Sent: Monday, February 4, 2019 16:04:38
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Thanks Igor,

>>Could you please collect BlueStore performance counters right after OSD
>>startup and once you get high latency.
>>
>>Specifically 'l_bluestore_fragmentation' parameter is of interest.

I'm already monitoring with "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), but I don't see the l_bluestore_fragmentation counter (I do have bluestore_fragmentation_micros).

>>Also if you're able to rebuild the code I can probably make a simple
>>patch to track latency and some other internal allocator's parameters to
>>make sure it's degraded and learn more details.

Sorry, it's a critical production cluster, I can't test on it :( But I have a test cluster; maybe I can try to put some load on it and try to reproduce.

>>More vigorous fix would be to backport bitmap allocator from Nautilus
>>and try the difference...

Any plan to backport it to Mimic? (But I can wait for Nautilus.) The perf results of the new bitmap allocator seem very promising from what I've seen in the PR.
- Mail original -
From: "Igor Fedotov"
To: "Alexandre Derumier", "Stefan Priebe, Profihost AG", "Mark Nelson"
Cc: "Sage Weil", "ceph-users", "ceph-devel"
Sent: Monday, February 4, 2019 15:51:30
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Hi Alexandre,

looks like a bug in StupidAllocator.

Could you please collect BlueStore performance counters right after OSD startup and once you get high latency. Specifically the 'l_bluestore_fragmentation' parameter is of interest.

Also if you're able to rebuild the code, I can probably make a simple patch to track latency and some other internal allocator parameters, to make sure it's degraded and learn more details.

A more vigorous fix would be to backport the bitmap allocator from Nautilus and try the difference...

Thanks,
Igor

On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote:
> Hi again,
>
> I spoke too fast; the problem has occurred again, so it's not tcmalloc cache size related.
>
> I have noticed something using a simple "perf top": each time I have this problem (I have seen exactly the same behaviour 4 times), when latency is bad, perf top gives me:
>
> StupidAllocator::_aligned_len
> and
> btree::btree_iterator<...>::increment_slow()
>
> (around 10-20% of time for both)
>
> when latency is good, I don't see them at all.
>
> I have used Mark's wallclock profiler; here are the results:
>
> http://odisoweb1.odiso.net/gdbpmp-ok.txt
> http://odisoweb1.odiso.net/gdbpmp-bad.txt
>
> Here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len (call graph abridged; the template arguments were mangled in the archive, and the full output is at the links above):
>
> + 100.00% clone -> start_thread -> ShardedThreadPool::WorkThreadSharded::entry() -> OSD::ShardedOpWQ::_process(...)
> + 70.00% OSD::dequeue_op -> PrimaryLogPG::do_request -> ReplicatedBackend::do_repop
> + 67.00% BlueStore::queue_transactions -> _txc_add_transaction -> _write -> _do_write
> + 65.00% BlueStore::_do_alloc_write
> + 64.00% StupidAllocator::allocate -> StupidAllocator::allocate_int
> + 34.00% btree::btree_iterator<...>::increment_slow()
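For illustration, the metric described in the counter string quoted earlier in this thread, "(free extents / max possible number of free extents) * 1000", can be sketched as follows. This is a simplified toy model, not BlueStore's actual implementation, and the allocation unit below is a made-up example:

```python
# Toy model of bluestore_fragmentation_micros as described above:
# (number of free extents / max possible number of free extents) * 1000.
# free_extents is a list of (offset, length) pairs; the worst case is
# every allocation unit of free space being an isolated extent.
def fragmentation_micros(free_extents, alloc_unit):
    free_bytes = sum(length for _, length in free_extents)
    max_extents = free_bytes // alloc_unit
    if max_extents <= 0:
        return 0
    return len(free_extents) * 1000 // max_extents

# One contiguous 4 MiB free chunk vs. the same free space shattered into
# 64 isolated pieces, with a hypothetical 64 KiB allocation unit:
au = 64 * 1024
contiguous = [(0, 4 * 1024 * 1024)]
shattered = [(i * 2 * au, au) for i in range(64)]
print(fragmentation_micros(contiguous, au))  # -> 15  (barely fragmented)
print(fragmentation_micros(shattered, au))   # -> 1000 (fully fragmented)
```

The point of the toy model: the counter climbing toward 1000 means free space is splintering, which matches the allocator search cost seen in the profiles.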
Re: [ceph-users] CephFS MDS journal
On Mon, Feb 4, 2019 at 7:32 AM Mahmoud Ismail wrote:
> On Mon, Feb 4, 2019 at 4:16 PM Gregory Farnum wrote:
>> On Fri, Feb 1, 2019 at 2:29 AM Mahmoud Ismail <mahmoudahmedism...@gmail.com> wrote:
>>> Hello,
>>>
>>> I'm a bit confused about how the journaling actually works in the MDS.
>>>
>>> I was reading about these two configuration parameters (journal write head interval) and (mds early reply). Does the MDS flush the journal synchronously after each operation? And by setting mds early reply to true, does it allow operations to return without flushing? If so, what does the other parameter (journal write head interval) do, or isn't it for the MDS? Also, can all operations return without flushing with mds early reply, or is it specific to a subset of operations?
>>
>> In general, the MDS journal is flushed every five seconds (by default), and client requests get an early reply when the operation is done in memory but not yet committed to RADOS. Some operations will trigger an immediate flush, and there may be some operations that can't get an early reply or that need to wait for part of the operation to get committed (like renames that move a file's authority to a different MDS).
>> IIRC the journal write head interval controls how often it flushes out the journal's header, which limits how out-of-date its hints on restart can be. (When the MDS restarts, it asks the journal head where the journal's unfinished start and end points are, but of course more of the journaled operations may have been fully completed since the head was written.)
>
> Thanks for the explanation. Which operations trigger an immediate flush? Is readdir one of these operations? I noticed that the readdir operation latency goes higher under load when the OSDs are hitting the limit of the underlying HDD throughput. Can I assume that this is happening due to the journal flushing then?

Not directly, but a readdir might ask to know the size of each file and that will force the other clients in the system to flush their dirty data in the directory (so that the readdir can return valid results).
-Greg

>>> Another question, are open operations also written to the journal?
>>
>> Not opens per se, but we do persist when clients have permission to operate on files.
>> -Greg
>>
>>> Regards,
>>> Mahmoud
Re: [ceph-users] CephFS MDS journal
On Mon, Feb 4, 2019 at 4:16 PM Gregory Farnum wrote:
> On Fri, Feb 1, 2019 at 2:29 AM Mahmoud Ismail <mahmoudahmedism...@gmail.com> wrote:
>> Hello,
>>
>> I'm a bit confused about how the journaling actually works in the MDS.
>>
>> I was reading about these two configuration parameters (journal write head interval) and (mds early reply). Does the MDS flush the journal synchronously after each operation? And by setting mds early reply to true, does it allow operations to return without flushing? If so, what does the other parameter (journal write head interval) do, or isn't it for the MDS? Also, can all operations return without flushing with mds early reply, or is it specific to a subset of operations?
>
> In general, the MDS journal is flushed every five seconds (by default), and client requests get an early reply when the operation is done in memory but not yet committed to RADOS. Some operations will trigger an immediate flush, and there may be some operations that can't get an early reply or that need to wait for part of the operation to get committed (like renames that move a file's authority to a different MDS).
> IIRC the journal write head interval controls how often it flushes out the journal's header, which limits how out-of-date its hints on restart can be. (When the MDS restarts, it asks the journal head where the journal's unfinished start and end points are, but of course more of the journaled operations may have been fully completed since the head was written.)

Thanks for the explanation. Which operations trigger an immediate flush? Is readdir one of these operations? I noticed that the readdir operation latency goes higher under load when the OSDs are hitting the limit of the underlying HDD throughput. Can I assume that this is happening due to the journal flushing then?

>> Another question, are open operations also written to the journal?
>
> Not opens per se, but we do persist when clients have permission to operate on files.
> -Greg
>
>> Regards,
>> Mahmoud
Re: [ceph-users] CephFS MDS journal
On Fri, Feb 1, 2019 at 2:29 AM Mahmoud Ismail wrote:
> Hello,
>
> I'm a bit confused about how the journaling actually works in the MDS.
>
> I was reading about these two configuration parameters (journal write head interval) and (mds early reply). Does the MDS flush the journal synchronously after each operation? And by setting mds early reply to true, does it allow operations to return without flushing? If so, what does the other parameter (journal write head interval) do, or isn't it for the MDS? Also, can all operations return without flushing with mds early reply, or is it specific to a subset of operations?

In general, the MDS journal is flushed every five seconds (by default), and client requests get an early reply when the operation is done in memory but not yet committed to RADOS. Some operations will trigger an immediate flush, and there may be some operations that can't get an early reply or that need to wait for part of the operation to get committed (like renames that move a file's authority to a different MDS).
IIRC the journal write head interval controls how often it flushes out the journal's header, which limits how out-of-date its hints on restart can be. (When the MDS restarts, it asks the journal head where the journal's unfinished start and end points are, but of course more of the journaled operations may have been fully completed since the head was written.)

> Another question, are open operations also written to the journal?

Not opens per se, but we do persist when clients have permission to operate on files.
-Greg

> Regards,
> Mahmoud
Re: [ceph-users] ceph osd commit latency increase over time, until restart
Thanks Igor,

>>Could you please collect BlueStore performance counters right after OSD
>>startup and once you get high latency.
>>
>>Specifically 'l_bluestore_fragmentation' parameter is of interest.

I'm already monitoring with "ceph daemon osd.x perf dump" (I have 2 months of history with all counters), but I don't see the l_bluestore_fragmentation counter (I do have bluestore_fragmentation_micros).

>>Also if you're able to rebuild the code I can probably make a simple
>>patch to track latency and some other internal allocator's parameters to
>>make sure it's degraded and learn more details.

Sorry, it's a critical production cluster, I can't test on it :( But I have a test cluster; maybe I can try to put some load on it and try to reproduce.

>>More vigorous fix would be to backport bitmap allocator from Nautilus
>>and try the difference...

Any plan to backport it to Mimic? (But I can wait for Nautilus.) The perf results of the new bitmap allocator seem very promising from what I've seen in the PR.

- Mail original -
From: "Igor Fedotov"
To: "Alexandre Derumier", "Stefan Priebe, Profihost AG", "Mark Nelson"
Cc: "Sage Weil", "ceph-users", "ceph-devel"
Sent: Monday, February 4, 2019 15:51:30
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Hi Alexandre,

looks like a bug in StupidAllocator.

Could you please collect BlueStore performance counters right after OSD startup and once you get high latency. Specifically the 'l_bluestore_fragmentation' parameter is of interest.

Also if you're able to rebuild the code, I can probably make a simple patch to track latency and some other internal allocator parameters, to make sure it's degraded and learn more details.

A more vigorous fix would be to backport the bitmap allocator from Nautilus and try the difference...

Thanks,
Igor

On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote:
> Hi again,
>
> I spoke too fast; the problem has occurred again, so it's not tcmalloc cache size related.
> I have noticed something using a simple "perf top": each time I have this problem (I have seen exactly the same behaviour 4 times), when latency is bad, perf top gives me:
>
> StupidAllocator::_aligned_len
> and
> btree::btree_iterator<...>::increment_slow()
>
> (around 10-20% of time for both)
>
> when latency is good, I don't see them at all.
>
> I have used Mark's wallclock profiler; here are the results:
>
> http://odisoweb1.odiso.net/gdbpmp-ok.txt
> http://odisoweb1.odiso.net/gdbpmp-bad.txt
>
> Here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len (call graph abridged; the template arguments were mangled in the archive, and the full output is at the links above):
>
> + 100.00% clone -> start_thread -> ShardedThreadPool::WorkThreadSharded::entry() -> OSD::ShardedOpWQ::_process(...)
> + 70.00% OSD::dequeue_op -> PrimaryLogPG::do_request -> ReplicatedBackend::do_repop
> + 67.00% BlueStore::queue_transactions -> _txc_add_transaction -> _write -> _do_write
> + 65.00% BlueStore::_do_alloc_write
> + 64.00% StupidAllocator::allocate -> StupidAllocator::allocate_int
> + 34.00% btree::btree_iterator<...>::increment_slow()
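The monitoring described in this thread boils down to pulling one field out of the perf-dump JSON. A hedged sketch, using a made-up, abridged sample of what "ceph daemon osd.<id> perf dump" prints (the exact section layout may differ by version, so treat the keys as assumptions):

```python
import json

# Abridged, fabricated sample of a "ceph daemon osd.<id> perf dump"
# output, reduced to the one counter discussed above.
sample = json.loads("""
{
  "bluestore": {
    "bluestore_fragmentation_micros": 480
  }
}
""")

# Extract the fragmentation counter, returning None if it is absent
# (e.g. on versions that name or nest it differently).
def fragmentation(perf_dump):
    return perf_dump.get("bluestore", {}).get("bluestore_fragmentation_micros")

print(fragmentation(sample))  # -> 480
```

In practice one would feed this the live JSON (via the admin socket) on a timer and graph the value next to commit latency, as the graph linked earlier in the thread does.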
Re: [ceph-users] ceph osd commit latency increase over time, until restart
Hi Alexandre,

looks like a bug in StupidAllocator.

Could you please collect BlueStore performance counters right after OSD startup and once you get high latency. Specifically the 'l_bluestore_fragmentation' parameter is of interest.

Also if you're able to rebuild the code, I can probably make a simple patch to track latency and some other internal allocator parameters, to make sure it's degraded and learn more details.

A more vigorous fix would be to backport the bitmap allocator from Nautilus and try the difference...

Thanks,
Igor

On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote:

Hi again,

I spoke too fast; the problem has occurred again, so it's not tcmalloc cache size related.

I have noticed something using a simple "perf top": each time I have this problem (I have seen exactly the same behaviour 4 times), when latency is bad, perf top gives me:

StupidAllocator::_aligned_len
and
btree::btree_iterator<...>::increment_slow()

(around 10-20% of time for both)

when latency is good, I don't see them at all.
I have used Mark's wallclock profiler; here are the results:

http://odisoweb1.odiso.net/gdbpmp-ok.txt
http://odisoweb1.odiso.net/gdbpmp-bad.txt

Here is an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len (call graph abridged; the template arguments were mangled in the archive, and the full output is at the links above):

+ 100.00% clone -> start_thread -> ShardedThreadPool::WorkThreadSharded::entry() -> ShardedThreadPool::shardedthreadpool_worker(unsigned int) -> OSD::ShardedOpWQ::_process(...)
+ 70.00% PGOpItem::run -> OSD::dequeue_op -> PrimaryLogPG::do_request
| + 68.00% PGBackend::handle_message -> ReplicatedBackend::do_repop
| + 67.00% PrimaryLogPG::queue_transactions -> BlueStore::queue_transactions
| + 66.00% BlueStore::_txc_add_transaction -> _write -> _do_write
| + 65.00% BlueStore::_do_alloc_write
| + 64.00% StupidAllocator::allocate -> StupidAllocator::allocate_int
| + 34.00% btree::btree_iterator<...>::increment_slow()
| + 26.00% StupidAllocator::_aligned_len(interval_set<...>::iterator, unsigned long)

- Mail original -
From: "Alexandre Derumier"
To: "Stefan Priebe, Profihost AG"
Cc: "Sage Weil", "ceph-users", "ceph-devel"
Sent: Monday, February 4, 2019 09:38:11
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Hi,

some news:

I have tried with different transparent hugepage values (madvise, never): no change.

I have tried to increase bluestore_cache_size_ssd to 8G: no change.

I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256mb: it seems to help; after 24h I'm still around 1.5ms. (I need to wait some more days to be sure.)

Note that this behaviour seems to happen much faster (< 2 days) on my big 6TB NVMe drives; my other clusters use 1.6TB SSDs. Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 iops per OSD), but I'll try this week with 2 OSDs per NVMe, to see if that helps.

BTW, has somebody already tested Ceph without tcmalloc, with glibc >= 2.26 (which also has a thread cache)?

Regards,
Alexandre

- Mail original -
From: "aderumier"
To: "Stefan Priebe, Profihost AG"
Cc: "Sage Weil", "ceph-users", "ceph-devel"
Sent: Wednesday, January 30, 2019 19:58:15
Subject: Re: [ceph-us
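For reference, the tcmalloc cache-size change tried above is usually made through the environment the OSD daemons start with; on Debian-based installs that is often /etc/default/ceph. Both the path and the stock default vary by distro, so treat this fragment as an assumption to verify, not a given:

```
# /etc/default/ceph  (path varies by distro; shown here as an assumption)
# Raise tcmalloc's total thread cache to the 256 MiB value tried in the
# mail above (268435456 bytes = 256 MiB). Restart the OSDs afterwards
# for the change to take effect.
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456
```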
[ceph-users] Optane still valid
Hi, we have built a 6-node NVMe-only Ceph cluster with 4x Intel DC P4510 8TB and one Intel DC P4800X 375GB Optane per node. Up to 10x P4510 can be installed in each node. The WAL and RocksDB for each P4510 should be stored on the Optane (approx. 30GB per RocksDB incl. WAL). Internally, discussions arose whether the Optane would become a bottleneck beyond a certain number of P4510s. For us, the lowest possible latency is very important; that is why the Optane NVMes were bought. In view of the good performance of the P4510, the question arises whether the Optanes still have a noticeable effect or whether they are actually just SPOFs? All the best, Florian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
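[Editor's note] The DB/WAL-on-Optane layout described above can be set up with ceph-volume. This is a hedged sketch only: the device names (/dev/nvme0n1 for the Optane, /dev/nvme1n1 for one P4510) and the volume group/LV names are assumptions, not taken from the thread.

```shell
# Hypothetical devices: /dev/nvme0n1 = P4800X Optane, /dev/nvme1n1 = one P4510.
# Carve a ~30GB DB logical volume per OSD on the Optane:
vgcreate ceph-db /dev/nvme0n1
lvcreate -L 30G -n db-nvme1 ceph-db

# Create the OSD with data on the P4510 and RocksDB on the Optane LV.
# When only --block.db is given, the RocksDB WAL lives on the same
# block.db device, so no separate --block.wal is needed:
ceph-volume lvm create --bluestore --data /dev/nvme1n1 --block.db ceph-db/db-nvme1
```

Repeat the lvcreate/create pair for each P4510; with 10 drives per node the Optane holds ten ~30GB DB volumes, which is the scenario the poster is asking about.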
Re: [ceph-users] ceph osd commit latency increase over time, until restart
Hi again, I speak too fast, the problem has occured again, so it's not tcmalloc cache size related. I have notice something using a simple "perf top", each time I have this problem (I have seen exactly 4 times the same behaviour), when latency is bad, perf top give me : StupidAllocator::_aligned_len and btree::btree_iterator, mempoo l::pool_allocator<(mempool::pool_index_t)1, std::pair >, 256> >, std::pair&, std::pair*>::increment_slow() (around 10-20% time for both) when latency is good, I don't see them at all. I have used the Mark wallclock profiler, here the results: http://odisoweb1.odiso.net/gdbpmp-ok.txt http://odisoweb1.odiso.net/gdbpmp-bad.txt here an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len + 100.00% clone + 100.00% start_thread + 100.00% ShardedThreadPool::WorkThreadSharded::entry() + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&, ThreadPool::TPHandle&) | + 70.00% OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr, ThreadPool::TPHandle&) | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr&, ThreadPool::TPHandle&) | + 68.00% PGBackend::handle_message(boost::intrusive_ptr) | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr) | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr) | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector >&, boost::intrusive_ptr) | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr&, std::vector >&, boost::intrusive_ptr, ThreadPool::TPHandle*) | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr&, boost::intrusive_ptr&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) | | | | + 66.00% 
BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr&, boost::intrusive_ptr, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr, boost::intrusive_ptr, BlueStore::WriteContext*) | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector >*) | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) | | | | | | + 34.00% btree::btree_iterator, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair >, 256> >, std::pair&, std::pair*>::increment_slow() | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair >, 256> >::iterator, unsigned long) - Mail original - De: "Alexandre Derumier" À: "Stefan Priebe, Profihost AG" Cc: "Sage Weil" , "ceph-users" , "ceph-devel" Envoyé: Lundi 4 Février 2019 09:38:11 Objet: Re: [ceph-users] ceph osd commit latency increase over time, until restart Hi, some news: I have tried with different transparent hugepage values (madvise, never) : no change I have tried to increase bluestore_cache_size_ssd to 8G: no change I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256mb : it seem to help, after 24h I'm still around 1,5ms. (need to wait some more days to be sure) Note that this behaviour seem to happen really faster (< 2 days) on my big nvme drives (6TB), my others clusters user 1,6TB ssd. Currently I'm using only 1 osd by nvme (I don't have more than 5000iops by osd), but I'll try this week with 2osd by nvme, to see if it's helping. BTW, does somebody have already tested ceph without tcmalloc, with glibc >= 2.26 (which have also thread cache) ? 
Regards, Alexandre - Mail original - De: "aderumier" À: "Stefan Priebe, Profihost AG" Cc: "Sage Weil" , "ceph-users" , "ceph-devel" Envoyé: Mercredi 30 Janvier 2019 19:58:15 Objet: Re: [ceph-users] ceph osd commit latency increase over time, until restart >>Thanks. Is there any reason you monitor op_w_latency but not >>op_r_latency but instead op_latency? >> >>Also why do you monitor op_w_process_latency? but not op_r_process_latency? I monitor read too. (I have all metrics for osd sockets, and a lot of graphs). I just don't see latency difference on reads. (or they are very very small vs the write latency increase) - Mail original - De: "Stefan Priebe, Profihost AG" À: "aderumier" Cc: "Sage Weil" , "ceph-users" , "ceph-devel" Envoyé: Mercredi
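[Editor's note] The TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES experiment Alexandre mentions can be applied via the environment file the ceph service units read. A hedged sketch; the exact file depends on the distribution (Debian/Ubuntu use /etc/default/ceph, RHEL/CentOS use /etc/sysconfig/ceph), and a rolling restart is safer in practice than restarting all OSDs at once.

```shell
# 256MB tcmalloc thread cache (default is typically 32MB):
echo 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456' >> /etc/default/ceph

# Restart OSDs so they pick up the new environment
# (restart one OSD at a time on a production cluster):
systemctl restart ceph-osd.target
```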
Re: [ceph-users] Kernel requirements for balancer in upmap mode
Thanks a lot ! On Mon, Feb 4, 2019 at 12:35 PM Konstantin Shalygin wrote: > So, if I am using ceph just to provide block storage to an OpenStack > cluster (so using libvirt), the kernel version on the client nodes > shouldn't matter, right ? > > Yep, just make sure your librbd on compute hosts is Luminous. > > k
Re: [ceph-users] Bluestore deploys to tmpfs?
On Mon, Feb 4, 2019 at 4:43 AM Hector Martin wrote: > > On 02/02/2019 05:07, Stuart Longland wrote: > > On 1/2/19 10:43 pm, Alfredo Deza wrote: > >>> The tmpfs setup is expected. All persistent data for bluestore OSDs > >>> setup with LVM are stored in LVM metadata. The LVM/udev handler for > >>> bluestore volumes create these tmpfs filesystems on the fly and populate > >>> them with the information from the metadata. > >> That is mostly what happens. There isn't a dependency on UDEV anymore > >> (yay), but the reason why files are mounted on tmpfs > >> is because *bluestore* spits them out on activation, this makes the > >> path fully ephemeral (a great thing!) > >> > >> The step-by-step is documented in this summary section of 'activate' > >> http://docs.ceph.com/docs/master/ceph-volume/lvm/activate/#summary > >> > >> Filestore doesn't have any of these capabilities and it is why it does > >> have an actual existing path (vs. tmpfs), and the files come from the > >> data partition that > >> gets mounted. > >> > > > > Well, for whatever reason, ceph-osd isn't calling the activate script > > before it starts up. > > > > It is worth noting that the systems I'm using do not use systemd out of > > simplicity. I might need to write an init script to do that. It wasn't > > clear last weekend what commands I needed to run to activate a BlueStore > > OSD. > > The way you do this on Gentoo is by writing the OSD FSID into > /etc/conf.d/ceph-osd.. You need to make note of the ID when the OSD > is first deployed. > > # echo "bluestore_osd_fsid=$(cat /var/lib/ceph/osd/ceph-0/fsid)" > > /etc/conf.d/ceph-osd.0 > > And then of course do the usual initscript symlink enable on Gentoo: > > # ln -s ceph /etc/init.d/ceph-osd.0 > # rc-update add ceph-osd.0 default > > This will then call `ceph-volume lvm activate` for that OSD for you > before bringing it up, which will populate the tmpfs. 
It is the Gentoo > OpenRC equivalent of enabling the systemd unit for that osd-fsid on > systemd systems (but ceph-volume won't do it for you). This is spot on. The whole systemd script for ceph-volume is merely passing the ID and FSID over to ceph-volume, which ends up doing something like: ceph-volume lvm activate ID FSID > > You may also want to add some dependencies for all OSDs depending on > your setup (e.g. I run single-host and the mon has the OSD dm-crypt > keys, so that has to come first): > > # echo 'rc_need="ceph-mon.0"' > /etc/conf.d/ceph-osd > > The Gentoo initscript setup for Ceph is unfortunately not very well > documented. I've been meaning to write a blogpost about this to try to > share what I've learned :-) > > -- > Hector Martin (hec...@marcansoft.com) > Public Key: https://marcan.st/marcan.asc
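[Editor's note] On a host without systemd, the activation step Alfredo describes can be run by hand. A minimal sketch; the OSD id and fsid below are placeholders that must be taken from the `ceph-volume lvm list` output on the actual host.

```shell
# Show the OSD id and osd fsid recorded in the LVM metadata:
ceph-volume lvm list

# Mount the tmpfs, populate it from bluestore, and start nothing else
# (--no-systemd skips the systemd unit enabling, which is what you want
# on OpenRC/sysvinit hosts):
ceph-volume lvm activate 0 <osd-fsid> --no-systemd

# Or activate every OSD discovered on the host in one go:
ceph-volume lvm activate --all --no-systemd
```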
Re: [ceph-users] Problem replacing osd with ceph-deploy
On Fri, Feb 1, 2019 at 6:07 PM Shain Miley wrote: > > Hi, > > I went to replace a disk today (which I had not had to do in a while) > and after I added it the results looked rather odd compared to times past: > > I was attempting to replace /dev/sdk on one of our osd nodes: > > #ceph-deploy disk zap hqosd7 /dev/sdk > #ceph-deploy osd create --data /dev/sdk hqosd7 > > [ceph_deploy.conf][DEBUG ] found configuration file at: > /root/.cephdeploy.conf > [ceph_deploy.cli][INFO ] Invoked (2.0.1): /usr/local/bin/ceph-deploy > osd create --data /dev/sdk hqosd7 > [ceph_deploy.cli][INFO ] ceph-deploy options: > [ceph_deploy.cli][INFO ] verbose : False > [ceph_deploy.cli][INFO ] bluestore : None > [ceph_deploy.cli][INFO ] cd_conf : > > [ceph_deploy.cli][INFO ] cluster : ceph > [ceph_deploy.cli][INFO ] fs_type : xfs > [ceph_deploy.cli][INFO ] block_wal : None > [ceph_deploy.cli][INFO ] default_release : False > [ceph_deploy.cli][INFO ] username : None > [ceph_deploy.cli][INFO ] journal : None > [ceph_deploy.cli][INFO ] subcommand: create > [ceph_deploy.cli][INFO ] host : hqosd7 > [ceph_deploy.cli][INFO ] filestore : None > [ceph_deploy.cli][INFO ] func : at 0x7fa3b14b3398> > [ceph_deploy.cli][INFO ] ceph_conf : None > [ceph_deploy.cli][INFO ] zap_disk : False > [ceph_deploy.cli][INFO ] data : /dev/sdk > [ceph_deploy.cli][INFO ] block_db : None > [ceph_deploy.cli][INFO ] dmcrypt : False > [ceph_deploy.cli][INFO ] overwrite_conf: False > [ceph_deploy.cli][INFO ] dmcrypt_key_dir : > /etc/ceph/dmcrypt-keys > [ceph_deploy.cli][INFO ] quiet : False > [ceph_deploy.cli][INFO ] debug : False > [ceph_deploy.osd][DEBUG ] Creating OSD on cluster ceph with data device > /dev/sdk > [hqosd7][DEBUG ] connected to host: hqosd7 > [hqosd7][DEBUG ] detect platform information from remote host > [hqosd7][DEBUG ] detect machine type > [hqosd7][DEBUG ] find the location of an executable > [ceph_deploy.osd][INFO ] Distro info: Ubuntu 16.04 xenial > [ceph_deploy.osd][DEBUG ] Deploying osd to hqosd7 > 
[hqosd7][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf > [hqosd7][DEBUG ] find the location of an executable > [hqosd7][INFO ] Running command: /usr/sbin/ceph-volume --cluster ceph > lvm create --bluestore --data /dev/sdk > [hqosd7][DEBUG ] Running command: /usr/bin/ceph-authtool --gen-print-key > [hqosd7][DEBUG ] Running command: /usr/bin/ceph --cluster ceph --name > client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring > -i - osd new c98a11d1-9b7f-487e-8c69-72fc662927d4 > [hqosd7][DEBUG ] Running command: vgcreate --force --yes > ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1 /dev/sdk > [hqosd7][DEBUG ] stdout: Physical volume "/dev/sdk" successfully created > [hqosd7][DEBUG ] stdout: Volume group > "ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1" successfully created > [hqosd7][DEBUG ] Running command: lvcreate --yes -l 100%FREE -n > osd-block-c98a11d1-9b7f-487e-8c69-72fc662927d4 > ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1 > [hqosd7][DEBUG ] stdout: Logical volume > "osd-block-c98a11d1-9b7f-487e-8c69-72fc662927d4" created. 
> [hqosd7][DEBUG ] Running command: /usr/bin/ceph-authtool --gen-print-key > [hqosd7][DEBUG ] Running command: mount -t tmpfs tmpfs > /var/lib/ceph/osd/ceph-81 > [hqosd7][DEBUG ] Running command: chown -R ceph:ceph /dev/dm-0 > [hqosd7][DEBUG ] Running command: ln -s > /dev/ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1/osd-block-c98a11d1-9b7f-487e-8c69-72fc662927d4 > /var/lib/ceph/osd/ceph-81/block > [hqosd7][DEBUG ] Running command: ceph --cluster ceph --name > client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring > mon getmap -o /var/lib/ceph/osd/ceph-81/activate.monmap > [hqosd7][DEBUG ] stderr: got monmap epoch 2 > [hqosd7][DEBUG ] Running command: ceph-authtool > /var/lib/ceph/osd/ceph-81/keyring --create-keyring --name osd.81 > --add-key AQCyyFRcSwWqGBAAKZR8rcWIEknj/o3rsehOdA== > [hqosd7][DEBUG ] stdout: creating /var/lib/ceph/osd/ceph-81/keyring > [hqosd7][DEBUG ] stdout: added entity osd.81 auth auth(auid = > 18446744073709551615 key=AQCyyFRcSwWqGBAAKZR8rcWIEknj/o3rsehOdA== with 0 > caps) > [hqosd7][DEBUG ] Running command: chown -R ceph:ceph > /var/lib/ceph/osd/ceph-81/keyring > [hqosd7][DEBUG ] Running command: chown -R ceph:ceph > /var/lib/ceph/osd/ceph-81/ > [hqosd7][DEBUG ] Running command: /usr/bin/ceph-osd --cluster ceph > --osd-objectstore bluestore --mkfs -i 81 --monmap > /var/lib/ceph/osd/ceph-81/activate.monmap --keyfile - --osd-data > /var/lib/ceph/osd/ceph-8
[ceph-users] ceph OSD cache ratio usage
Hello - We are using ceph OSD nodes with a controller cache of 1GB. Are there any recommendations for splitting the cache between read and write? Here we are using HDDs with colocated journals. For the SSD journal: 0% read cache and 100% write. Thanks Swami
Re: [ceph-users] Problem replacing osd with ceph-deploy
On Fri, Feb 1, 2019 at 6:35 PM Vladimir Prokofev wrote: > > Your output looks a bit weird, but still, this is normal for bluestore. It > creates small separate data partition that is presented as XFS mounted in > /var/lib/ceph/osd, while real data partition is hidden as raw(bluestore) > block device. That is not right for this output. It is using ceph-volume with LVM, there are no partitions being created. > It's no longer possible to check disk utilisation with df using bluestore. > To check your osd capacity use 'ceph osd df' > > сб, 2 февр. 2019 г. в 02:07, Shain Miley : >> >> Hi, >> >> I went to replace a disk today (which I had not had to do in a while) >> and after I added it the results looked rather odd compared to times past: >> >> I was attempting to replace /dev/sdk on one of our osd nodes: >> >> #ceph-deploy disk zap hqosd7 /dev/sdk >> #ceph-deploy osd create --data /dev/sdk hqosd7 >> >> [ceph_deploy.conf][DEBUG ] found configuration file at: >> /root/.cephdeploy.conf >> [ceph_deploy.cli][INFO ] Invoked (2.0.1): /usr/local/bin/ceph-deploy >> osd create --data /dev/sdk hqosd7 >> [ceph_deploy.cli][INFO ] ceph-deploy options: >> [ceph_deploy.cli][INFO ] verbose : False >> [ceph_deploy.cli][INFO ] bluestore : None >> [ceph_deploy.cli][INFO ] cd_conf : >> >> [ceph_deploy.cli][INFO ] cluster : ceph >> [ceph_deploy.cli][INFO ] fs_type : xfs >> [ceph_deploy.cli][INFO ] block_wal : None >> [ceph_deploy.cli][INFO ] default_release : False >> [ceph_deploy.cli][INFO ] username : None >> [ceph_deploy.cli][INFO ] journal : None >> [ceph_deploy.cli][INFO ] subcommand: create >> [ceph_deploy.cli][INFO ] host : hqosd7 >> [ceph_deploy.cli][INFO ] filestore : None >> [ceph_deploy.cli][INFO ] func : > at 0x7fa3b14b3398> >> [ceph_deploy.cli][INFO ] ceph_conf : None >> [ceph_deploy.cli][INFO ] zap_disk : False >> [ceph_deploy.cli][INFO ] data : /dev/sdk >> [ceph_deploy.cli][INFO ] block_db : None >> [ceph_deploy.cli][INFO ] dmcrypt : False >> [ceph_deploy.cli][INFO ] 
overwrite_conf: False >> [ceph_deploy.cli][INFO ] dmcrypt_key_dir : >> /etc/ceph/dmcrypt-keys >> [ceph_deploy.cli][INFO ] quiet : False >> [ceph_deploy.cli][INFO ] debug : False >> [ceph_deploy.osd][DEBUG ] Creating OSD on cluster ceph with data device >> /dev/sdk >> [hqosd7][DEBUG ] connected to host: hqosd7 >> [hqosd7][DEBUG ] detect platform information from remote host >> [hqosd7][DEBUG ] detect machine type >> [hqosd7][DEBUG ] find the location of an executable >> [ceph_deploy.osd][INFO ] Distro info: Ubuntu 16.04 xenial >> [ceph_deploy.osd][DEBUG ] Deploying osd to hqosd7 >> [hqosd7][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf >> [hqosd7][DEBUG ] find the location of an executable >> [hqosd7][INFO ] Running command: /usr/sbin/ceph-volume --cluster ceph >> lvm create --bluestore --data /dev/sdk >> [hqosd7][DEBUG ] Running command: /usr/bin/ceph-authtool --gen-print-key >> [hqosd7][DEBUG ] Running command: /usr/bin/ceph --cluster ceph --name >> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring >> -i - osd new c98a11d1-9b7f-487e-8c69-72fc662927d4 >> [hqosd7][DEBUG ] Running command: vgcreate --force --yes >> ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1 /dev/sdk >> [hqosd7][DEBUG ] stdout: Physical volume "/dev/sdk" successfully created >> [hqosd7][DEBUG ] stdout: Volume group >> "ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1" successfully created >> [hqosd7][DEBUG ] Running command: lvcreate --yes -l 100%FREE -n >> osd-block-c98a11d1-9b7f-487e-8c69-72fc662927d4 >> ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1 >> [hqosd7][DEBUG ] stdout: Logical volume >> "osd-block-c98a11d1-9b7f-487e-8c69-72fc662927d4" created. 
>> [hqosd7][DEBUG ] Running command: /usr/bin/ceph-authtool --gen-print-key >> [hqosd7][DEBUG ] Running command: mount -t tmpfs tmpfs >> /var/lib/ceph/osd/ceph-81 >> [hqosd7][DEBUG ] Running command: chown -R ceph:ceph /dev/dm-0 >> [hqosd7][DEBUG ] Running command: ln -s >> /dev/ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1/osd-block-c98a11d1-9b7f-487e-8c69-72fc662927d4 >> /var/lib/ceph/osd/ceph-81/block >> [hqosd7][DEBUG ] Running command: ceph --cluster ceph --name >> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring >> mon getmap -o /var/lib/ceph/osd/ceph-81/activate.monmap >> [hqosd7][DEBUG ] stderr: got monmap epoch 2 >> [hqosd7][DEBUG ] Running command: ceph-authtool >> /var/lib/ceph/osd/ceph-81/keyring --create-keyring --name osd.81 >> --add-key AQCyyFRcSwWqGBAAKZR8rcWIEknj/o3r
Re: [ceph-users] Kernel requirements for balancer in upmap mode
So, if I am using ceph just to provide block storage to an OpenStack cluster (so using libvirt), the kernel version on the client nodes shouldn't matter, right ? Yep, just make sure your librbd on compute hosts is Luminous. k
Re: [ceph-users] Kernel requirements for balancer in upmap mode
Thanks a lot! So, if I am using ceph just to provide block storage to an OpenStack cluster (so using libvirt), the kernel version on the client nodes shouldn't matter, right ? Thanks again, Massimo On Mon, Feb 4, 2019 at 10:02 AM Ilya Dryomov wrote: > On Mon, Feb 4, 2019 at 9:25 AM Massimo Sgaravatto > wrote: > > > > The official documentation [*] says that the only requirement to use the > balancer in upmap mode is that all clients must run at least luminous. > > But I read somewhere (also in this mailing list) that there are also > requirements wrt the kernel. > > If so: > > > > 1) Could you please specify what is the minimum required kernel ? > > 4.13 or CentOS 7.5. See [1] for details. > > > 2) Does this kernel requirement apply only to the OSD nodes ? Or also to > the clients ? > > No, only to the kernel client nodes. If the kernel client isn't used, > there is no requirement at all. > > [1] > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/027002.html > > Thanks, > > Ilya
Re: [ceph-users] Bluestore deploys to tmpfs?
On 02/02/2019 05:07, Stuart Longland wrote: On 1/2/19 10:43 pm, Alfredo Deza wrote: The tmpfs setup is expected. All persistent data for bluestore OSDs setup with LVM are stored in LVM metadata. The LVM/udev handler for bluestore volumes create these tmpfs filesystems on the fly and populate them with the information from the metadata. That is mostly what happens. There isn't a dependency on UDEV anymore (yay), but the reason why files are mounted on tmpfs is because *bluestore* spits them out on activation, this makes the path fully ephemeral (a great thing!) The step-by-step is documented in this summary section of 'activate' http://docs.ceph.com/docs/master/ceph-volume/lvm/activate/#summary Filestore doesn't have any of these capabilities and it is why it does have an actual existing path (vs. tmpfs), and the files come from the data partition that gets mounted. Well, for whatever reason, ceph-osd isn't calling the activate script before it starts up. It is worth noting that the systems I'm using do not use systemd out of simplicity. I might need to write an init script to do that. It wasn't clear last weekend what commands I needed to run to activate a BlueStore OSD. The way you do this on Gentoo is by writing the OSD FSID into /etc/conf.d/ceph-osd.. You need to make note of the ID when the OSD is first deployed. # echo "bluestore_osd_fsid=$(cat /var/lib/ceph/osd/ceph-0/fsid)" > /etc/conf.d/ceph-osd.0 And then of course do the usual initscript symlink enable on Gentoo: # ln -s ceph /etc/init.d/ceph-osd.0 # rc-update add ceph-osd.0 default This will then call `ceph-volume lvm activate` for that OSD for you before bringing it up, which will populate the tmpfs. It is the Gentoo OpenRC equivalent of enabling the systemd unit for that osd-fsid on systemd systems (but ceph-volume won't do it for you). You may also want to add some dependencies for all OSDs depending on your setup (e.g. 
I run single-host and the mon has the OSD dm-crypt keys, so that has to come first): # echo 'rc_need="ceph-mon.0"' > /etc/conf.d/ceph-osd The Gentoo initscript setup for Ceph is unfortunately not very well documented. I've been meaning to write a blogpost about this to try to share what I've learned :-) -- Hector Martin (hec...@marcansoft.com) Public Key: https://marcan.st/marcan.asc
Re: [ceph-users] Kernel requirements for balancer in upmap mode
On Mon, Feb 4, 2019 at 9:25 AM Massimo Sgaravatto wrote: > > The official documentation [*] says that the only requirement to use the > balancer in upmap mode is that all clients must run at least luminous. > But I read somewhere (also in this mailing list) that there are also > requirements wrt the kernel. > If so: > > 1) Could you please specify what is the minimum required kernel ? 4.13 or CentOS 7.5. See [1] for details. > 2) Does this kernel requirement apply only to the OSD nodes ? Or also to the > clients ? No, only to the kernel client nodes. If the kernel client isn't used, there is no requirement at all. [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/027002.html Thanks, Ilya
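[Editor's note] Putting Ilya's answer into practice: before turning upmap on, confirm that no pre-luminous clients are connected, then raise the minimum client feature level and enable the balancer. A sketch of the usual sequence on a Luminous cluster:

```shell
# List connected clients grouped by feature release
# (anything reporting jewel/hammer must be upgraded first):
ceph features

# Refuse pre-luminous clients from now on (required for upmap):
ceph osd set-require-min-compat-client luminous

# Enable the mgr balancer in upmap mode:
ceph balancer mode upmap
ceph balancer on
ceph balancer status
```

Note the set-require-min-compat-client step will fail (by design) if any currently connected client is older than luminous.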
Re: [ceph-users] ceph osd commit latency increase over time, until restart
Hi, some news: I have tried with different transparent hugepage values (madvise, never) : no change I have tried to increase bluestore_cache_size_ssd to 8G: no change I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256mb : it seem to help, after 24h I'm still around 1,5ms. (need to wait some more days to be sure) Note that this behaviour seem to happen really faster (< 2 days) on my big nvme drives (6TB), my others clusters user 1,6TB ssd. Currently I'm using only 1 osd by nvme (I don't have more than 5000iops by osd), but I'll try this week with 2osd by nvme, to see if it's helping. BTW, does somebody have already tested ceph without tcmalloc, with glibc >= 2.26 (which have also thread cache) ? Regards, Alexandre - Mail original - De: "aderumier" À: "Stefan Priebe, Profihost AG" Cc: "Sage Weil" , "ceph-users" , "ceph-devel" Envoyé: Mercredi 30 Janvier 2019 19:58:15 Objet: Re: [ceph-users] ceph osd commit latency increase over time, until restart >>Thanks. Is there any reason you monitor op_w_latency but not >>op_r_latency but instead op_latency? >> >>Also why do you monitor op_w_process_latency? but not op_r_process_latency? I monitor read too. (I have all metrics for osd sockets, and a lot of graphs). I just don't see latency difference on reads. (or they are very very small vs the write latency increase) - Mail original - De: "Stefan Priebe, Profihost AG" À: "aderumier" Cc: "Sage Weil" , "ceph-users" , "ceph-devel" Envoyé: Mercredi 30 Janvier 2019 19:50:20 Objet: Re: [ceph-users] ceph osd commit latency increase over time, until restart Hi, Am 30.01.19 um 14:59 schrieb Alexandre DERUMIER: > Hi Stefan, > >>> currently i'm in the process of switching back from jemalloc to tcmalloc >>> like suggested. This report makes me a little nervous about my change. > Well,I'm really not sure that it's a tcmalloc bug. 
> maybe bluestore related (don't have filestore anymore to compare) > I need to compare with bigger latencies > > here an example, when all osd at 20-50ms before restart, then after restart > (at 21:15), 1ms > http://odisoweb1.odiso.net/latencybad.png > > I observe the latency in my guest vm too, on disks iowait. > > http://odisoweb1.odiso.net/latencybadvm.png > >>> Also i'm currently only monitoring latency for filestore osds. Which >>> exact values out of the daemon do you use for bluestore? > > here my influxdb queries: > > It take op_latency.sum/op_latency.avgcount on last second. > > > SELECT non_negative_derivative(first("op_latency.sum"), > 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" > WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter > GROUP BY time($interval), "host", "id" fill(previous) > > > SELECT non_negative_derivative(first("op_w_latency.sum"), > 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" > WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ > AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) > > > SELECT non_negative_derivative(first("op_w_process_latency.sum"), > 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM > "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ > /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" > fill(previous) Thanks. Is there any reason you monitor op_w_latency but not op_r_latency but instead op_latency? Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
greets, Stefan > > > > > > - Mail original - > De: "Stefan Priebe, Profihost AG" > À: "aderumier" , "Sage Weil" > Cc: "ceph-users" , "ceph-devel" > > Envoyé: Mercredi 30 Janvier 2019 08:45:33 > Objet: Re: [ceph-users] ceph osd commit latency increase over time, until > restart > > Hi, > > Am 30.01.19 um 08:33 schrieb Alexandre DERUMIER: >> Hi, >> >> here some new results, >> different osd/ different cluster >> >> before osd restart latency was between 2-5ms >> after osd restart is around 1-1.5ms >> >> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) >> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) >> http://odisoweb1.odiso.net/cephperf2/diff.txt >> >> From what I see in diff, the biggest difference is in tcmalloc, but maybe >> I'm wrong. >> (I'm using tcmalloc 2.5-2.2) > > currently i'm in the process of switching back from jemalloc to tcmalloc > like suggested. This report makes me a little nervous about my change. > > Also i'm currently only monitoring latency for filestore osds. Which > exact values out of the daemon do you use for bluestore? > > I would like to check if i see the same behaviour. > > Greets, > Stefan > >> >> - Mail original - >> De: "Sage Weil" >> À: "aderumier" >> Cc: "ceph-users" , "ceph-devel" >> >> Envoyé: Vendredi 25 Janvier 2019 10:49:02 >> Objet: Re: ce
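[Editor's note] Alexandre's InfluxDB queries above all compute the same thing: the derivative of a latency counter's sum divided by the derivative of its avgcount between two samples. The same arithmetic can be checked by hand from two `ceph daemon osd.N perf dump` snapshots; the sample numbers below are made up for illustration.

```shell
# Two successive samples of an OSD's op_w_latency counter, e.g. from:
#   ceph daemon osd.0 perf dump | jq '.osd.op_w_latency'
# sum is cumulative seconds, avgcount the cumulative op count.
s1=12.5; c1=1000   # sample at t1
s2=14.9; c2=2200   # sample at t2

# Average write latency over the interval, in milliseconds:
awk -v s1=$s1 -v c1=$c1 -v s2=$s2 -v c2=$c2 \
    'BEGIN { printf "avg write latency: %.3f ms\n", (s2-s1)/(c2-c1)*1000 }'
# prints: avg write latency: 2.000 ms
```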
Re: [ceph-users] Luminous cluster in very bad state need some assistance.
ceph pg ls | grep 11.182 11.182 10 25 35 0 2534648064 1306 1306 active+recovery_wait+undersized+degraded 2019-02-04 09:23:26.461468 70238'1306 70673:24924 [64] 64 [64] 64 46843'56759413 2019-01-26 16:31:32.607109 46843'56628962 2019-01-24 08:56:59.228615 root@storage-node-1-l3:~# ceph pg 11.182 query { "state": "active+recovery_wait+undersized+degraded", "snap_trimq": "[1~b]", "snap_trimq_len": 11, "epoch": 70673, "up": [ 64 ], "acting": [ 64 ], "actingbackfill": [ "64" ], "info": { "pgid": "11.182", "last_update": "70238'1306", "last_complete": "46843'56787837", "log_tail": "0'0", "last_user_version": 1301, "last_backfill": "MAX", "last_backfill_bitwise": 0, "purged_snaps": [], "history": { "epoch_created": 54817, "epoch_pool_created": 278, "last_epoch_started": 70656, "last_interval_started": 70655, "last_epoch_clean": 67924, "last_interval_clean": 54687, "last_epoch_split": 54817, "last_epoch_marked_full": 0, "same_up_since": 70655, "same_interval_since": 70655, "same_primary_since": 70655, "last_scrub": "46843'56759413", "last_scrub_stamp": "2019-01-26 16:31:32.607109", "last_deep_scrub": "46843'56628962", "last_deep_scrub_stamp": "2019-01-24 08:56:59.228615", "last_clean_scrub_stamp": "2019-01-26 16:31:32.607109" }, "stats": { "version": "70238'1306", "reported_seq": "24940", "reported_epoch": "70673", "state": "active+recovery_wait+undersized+degraded", "last_fresh": "2019-02-04 09:25:56.966952", "last_change": "2019-02-04 09:25:56.966952", "last_active": "2019-02-04 09:25:56.966952", "last_peered": "2019-02-04 09:25:56.966952", "last_clean": "0.00", "last_became_active": "2019-02-04 07:57:08.769839", "last_became_peered": "2019-02-04 07:57:08.769839", "last_unstale": "2019-02-04 09:25:56.966952", "last_undegraded": "2019-02-04 07:57:08.762164", "last_fullsized": "2019-02-04 07:57:08.761962", "mapping_epoch": 70655, "log_start": "0'0", "ondisk_log_start": "0'0", "created": 54817, "last_epoch_clean": 67924, "parent": "0.0", "parent_split_bits": 0, 
"last_scrub": "46843'56759413", "last_scrub_stamp": "2019-01-26 16:31:32.607109", "last_deep_scrub": "46843'56628962", "last_deep_scrub_stamp": "2019-01-24 08:56:59.228615", "last_clean_scrub_stamp": "2019-01-26 16:31:32.607109", "log_size": 1306, "ondisk_log_size": 1306, "stats_invalid": false, "dirty_stats_invalid": false, "omap_stats_invalid": false, "hitset_stats_invalid": false, "hitset_bytes_stats_invalid": false, "pin_stats_invalid": false, "snaptrimq_len": 11, "stat_sum": { "num_bytes": 34648064, "num_objects": 10, "num_object_clones": 0, "num_object_copies": 20, "num_objects_missing_on_primary": 25, "num_objects_missing": 0, "num_objects_degraded": 35, "num_objects_misplaced": 0, "num_objects_unfound": 25, "num_objects_dirty": 10, "num_whiteouts": 0, "num_read": 1274, "num_read_kb": 33808, "num_write": 1388, "num_write_kb": 42956, "num_scrub_errors": 0, "num_shallow_scrub_errors": 0, "num_deep_scrub_errors": 0, "num_objects_recovered": 0, "num_bytes_recovered": 0, "num_keys_recovered": 0, "num_objects_omap": 0, "num_objects_hit_set_archive": 0, "num_bytes_hit_set_archive": 0, "num_flush": 0, "num_flush_kb": 0, "num_evict": 0, "num_evict_kb": 0, "num_promote": 0, "num_flush_mode_high": 0, "num_flush_mode_low": 0, "num_evict_mode_some": 0, "num_evict_mode_full": 0, "num_objects_pinned": 0, "num_legacy_snapsets": 0 }, "up": [ 64 ], "acting": [ 64 ], "blocked_by": [], "up_primary": 64, "acting_primary": 64
[ceph-users] Kernel requirements for balancer in upmap mode
The official documentation [*] says that the only requirement to use the balancer in upmap mode is that all clients must run at least luminous. But I read somewhere (also in this mailing list) that there are also requirements wrt the kernel. If so: 1) Could you please specify what is the minimum required kernel ? 2) Does this kernel requirement apply only to the OSD nodes ? Or also to the clients ? The ceph version I am interested in is Luminous Thanks a lot, Massimo [*] http://docs.ceph.com/docs/luminous/mgr/balancer/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Luminous cluster in very bad state need some assistance.
On Mon, 4 Feb 2019, Philippe Van Hecke wrote: > So i restarted the osd but he stop after some time. But this is an effect on > the cluster and cluster is on a partial recovery process. > > please find here log file of osd 49 after this restart > https://filesender.belnet.be/?s=download&token=8c9c39f2-36f6-43f7-bebb-175679d27a22 It's the same PG 11.182 hitting the same assert when it tries to recover to that OSD. I think the problem will go away once there has been some write traffic, but it may be tricky to prevent it from doing any recovery until then. I just noticed you pasted the wrong 'pg ls' result before: > > result of ceph pg ls | grep 11.118 > > > > 11.118 9788 00 0 0 40817837568 1584 1584 active+clean 2019-02-01 12:48:41.343228 70238'19811673 70493:34596887 [121,24] 121 [121,24] 121 69295'19811665 2019-02-01 12:48:41.343144 66131'19810044 2019-01-30 11:44:36.006505 What does 11.182 look like? We can try something slightly different. From before it looked like your only 'incomplete' pg was 11.ac (ceph pg ls incomplete), and the needed state is either on osd.49 or osd.63. On osd.49, do ceph-objectstore-tool --op export on that pg, and then find an otherwise healthy OSD (that doesn't have 11.ac), stop it, and ceph-objectstore-tool --op import it there. When you start it up, 11.ac will hopefully peer and recover. (Or, alternatively, osd.63 may have the needed state.) sage
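Sage's export/import suggestion above maps onto roughly the following commands (a sketch only: the target osd.70 is a placeholder, paths follow the ones used earlier in this thread, and both OSD daemons must be stopped while the tool runs; keep the export file as a backup):

```shell
# On the node hosting osd.49 (daemon stopped): export PG 11.ac.
ceph-objectstore-tool \
    --data-path /var/lib/ceph/osd/ceph-49 \
    --journal-path /var/lib/ceph/osd/ceph-49/journal \
    --pgid 11.ac --op export --file /tmp/export-pg/11.ac.export

# Copy the export to a node with a healthy OSD that does NOT already
# hold 11.ac (osd.70 here is hypothetical), stop that OSD, then import:
ceph-objectstore-tool \
    --data-path /var/lib/ceph/osd/ceph-70 \
    --journal-path /var/lib/ceph/osd/ceph-70/journal \
    --op import --file /tmp/export-pg/11.ac.export

# Restart the target OSD; 11.ac should then peer and recover.
systemctl start ceph-osd@70
```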
Re: [ceph-users] Luminous cluster in very bad state need some assistance.
Hi, It seems that the recovery process stopped and we are back to the same situation as before. I hope that the log can provide more info. Anyway, thanks already for your assistance. Kr Philippe.

From: Philippe Van Hecke Sent: 04 February 2019 07:53 To: Sage Weil Cc: ceph-users@lists.ceph.com; Belnet Services Subject: Re: [ceph-users] Luminous cluster in very bad state need some assistance. So i restarted the osd but it stopped after some time. But this had an effect on the cluster and the cluster is in a partial recovery process. please find here log file of osd 49 after this restart https://filesender.belnet.be/?s=download&token=8c9c39f2-36f6-43f7-bebb-175679d27a22 Kr Philippe.

From: Philippe Van Hecke Sent: 04 February 2019 07:42 To: Sage Weil Cc: ceph-users@lists.ceph.com; Belnet Services Subject: Re: [ceph-users] Luminous cluster in very bad state need some assistance. root@ls-node-5-lcl:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-49/ --journal /var/lib/ceph/osd/ceph-49/journal --pgid 11.182 --op remove --debug --force 2> ceph-objectstore-tool-export-remove.txt marking collection for removal setting '_remove' omap key finish_remove_pgs 11.182_head removing 11.182 Remove successful So now i suppose i restart the osd and see.

From: Sage Weil Sent: 04 February 2019 07:37 To: Philippe Van Hecke Cc: ceph-users@lists.ceph.com; Belnet Services Subject: Re: [ceph-users] Luminous cluster in very bad state need some assistance. On Mon, 4 Feb 2019, Philippe Van Hecke wrote: > result of ceph pg ls | grep 11.118 > > 11.118 9788 00 0 0 40817837568 1584 1584 active+clean 2019-02-01 12:48:41.343228 70238'19811673 70493:34596887 [121,24] 121 [121,24] 121 69295'19811665 2019-02-01 12:48:41.343144 66131'19810044 2019-01-30 11:44:36.006505 > > cp done. > > So i can make the ceph-objectstore-tool --op remove command ? yep!
> > > From: Sage Weil > Sent: 04 February 2019 07:26 > To: Philippe Van Hecke > Cc: ceph-users@lists.ceph.com; Belnet Services > Subject: Re: [ceph-users] Luminous cluster in very bad state need some > assistance. > > On Mon, 4 Feb 2019, Philippe Van Hecke wrote: > > Hi Sage, > > > > I tried the following. > > > > ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-49/ --journal > > /var/lib/ceph/osd/ceph-49/journal --pgid 11.182 --op export-remove --debug > > --file /tmp/export-pg/18.182 2>ceph-objectstore-tool-export-remove.txt > > but this raised an exception > > > > find here > > https://filesender.belnet.be/?s=download&token=e2b1fdbc-0739-423f-9d97-0bd258843a33 > > file ceph-objectstore-tool-export-remove.txt > > In that case, cp --preserve=all > /var/lib/ceph/osd/ceph-49/current/11.182_head to a safe location and then > use the ceph-objectstore-tool --op remove command. But first confirm that > 'ceph pg ls' shows the PG as active. > > sage > > > > > Kr > > > > Philippe. > > > > > > From: Sage Weil > > Sent: 04 February 2019 06:59 > > To: Philippe Van Hecke > > Cc: ceph-users@lists.ceph.com; Belnet Services > > Subject: Re: [ceph-users] Luminous cluster in very bad state need some > > assistance. > > > > On Mon, 4 Feb 2019, Philippe Van Hecke wrote: > > > Hi Sage, First of all thanks for your help > > > > > > Please find here > > > https://filesender.belnet.be/?s=download&token=dea0edda-5b6a-4284-9ea1-c1fdf88b65e9 > > > the osd log with debug info for osd.49. and indeed if all buggy osds can > > > restart, that may solve the issue. > > > But i am also happy that you confirm my understanding that in the worst case > > > removing the pool can also resolve the problem, even if in this case i lose data > > > but finish with a working cluster. > > > > If PGs are damaged, removing the pool would be part of getting to > > HEALTH_OK, but you'd probably also need to remove any problematic PGs that > > are preventing the OSD from starting.
> > > > But keep in mind that (1) i see 3 PGs that don't peer spread across pools > > 11 and 12; not sure which one you are considering deleting. Also (2) if > > one pool isn't fully available it generally won't be a problem for other > > pools, as long as the osds start. And doing ceph-objectstore-tool > > export-remove is a pretty safe way to move any problem PGs out of the way > > to get your OSDs starting--just make sure you hold onto that backup/export > > because you may need it later! > > > > > PS: don't know and don't want to open debat about top/bottom posting but > > > would like to know the preference of this list :-) > > > > No preference :) > > > > sage > > > >