[ceph-users] May I know the exact date of Nautilus release? Thanks!

2019-02-04 Thread Zhu, Vivian




- Vivian
SSG OTC NST Storage
 Tel: (8621)61167437

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] crush map has straw_calc_version=0 and legacy tunables on luminous

2019-02-04 Thread Shain Miley
For future reference I found these 2 links which answer most of the questions:

http://docs.ceph.com/docs/master/rados/operations/crush-map/

https://www.openstack.org/assets/presentation-media/Advanced-Tuning-and-Operation-guide-for-Block-Storage-using-Ceph-Boston-2017-final.pdf

We have about 250TB (x3) in our cluster so I am leaning toward not changing 
things at this point because it sounds like there will be a significant amount 
of data movement involved for not a lot in return.

If anyone knows of a strong reason I should change the tunables profile away 
from what I have…then please let me know so I don’t end up running the cluster 
in a sub-optimal state for no reason.
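For anyone who lands on this thread later, here is a minimal sketch of how one
might inspect these settings and change only the straw_calc_version on luminous.
The command names below are from memory of the stock ceph CLI, so verify them
against your own version; the last one can trigger substantial data movement and
is shown only for completeness:

# show the current tunables profile and straw_calc_version
ceph osd crush show-tunables

# change only straw_calc_version (affects how straw bucket weights are computed)
ceph osd crush set-tunable straw_calc_version 1

# switch the whole profile to optimal (expect significant rebalancing)
ceph osd crush tunables optimal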

Thanks,
Shain

--
Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
smi...@npr.org | 202.513.3649

From: ceph-users  on behalf of Shain Miley 

Date: Monday, February 4, 2019 at 3:03 PM
To: "ceph-users@lists.ceph.com" 
Subject: [ceph-users] crush map has straw_calc_version=0 and legacy tunables on 
luminous

Hello,
I just upgraded our cluster to 12.2.11 and I have a few questions around 
straw_calc_version and tunables.

Currently ceph status shows the following:

crush map has straw_calc_version=0
crush map has legacy tunables (require argonaut, min is firefly)


  1.  Will setting tunables to optimal also change the straw_calc_version or do 
I need to set that separately?


  2.  Right now I have a set of rbd kernel clients connecting using kernel 
version 4.4.  The ‘ceph daemon mon.id sessions’ command shows that this client 
is still connecting using the hammer feature set (and a few others on jewel as 
well):

"MonSession(client.113933130 10.35.100.121:0/3425045489 is open allow *, 
features 0x7fddff8ee8cbffb (jewel))",  "MonSession(client.112250505 
10.35.100.99:0/4174610322 is open allow *, features 0x106b84a842a42 (hammer))",

My question is what is the minimum kernel version I would need to upgrade the 
4.4 kernel server to in order to get to jewel or luminous?



  3.  Will setting the tunables to optimal on luminous prevent jewel and hammer 
clients from connecting?  I want to make sure I don’t do anything that will prevent 
my existing clients from connecting to the cluster.



Thanks in advance,
Shain

--
Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
smi...@npr.org | 202.513.3649
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS MDS journal

2019-02-04 Thread Gregory Farnum
On Mon, Feb 4, 2019 at 8:03 AM Mahmoud Ismail 
wrote:

> On Mon, Feb 4, 2019 at 4:35 PM Gregory Farnum  wrote:
>
>>
>>
>> On Mon, Feb 4, 2019 at 7:32 AM Mahmoud Ismail <
>> mahmoudahmedism...@gmail.com> wrote:
>>
>>> On Mon, Feb 4, 2019 at 4:16 PM Gregory Farnum 
>>> wrote:
>>>
 On Fri, Feb 1, 2019 at 2:29 AM Mahmoud Ismail <
 mahmoudahmedism...@gmail.com> wrote:

> Hello,
>
> I'm a bit confused about how the journaling actually works in the MDS.
>
> I was reading about these two configuration parameters (journal write
> head interval)  and (mds early reply). Does the MDS flush the journal
> synchronously after each operation, and does setting mds early reply to true
> allow operations to return without flushing? If so, what does the other
> parameter (journal write head interval) do, or is it not used by the MDS? Also,
> can all operations return without flushing with mds early reply, or is it
> specific to a subset of operations?
>

 In general, the MDS journal is flushed every five seconds (by default),
 and client requests get an early reply when the operation is done in memory
 but not yet committed to RADOS. Some operations will trigger an immediate
 flush, and there may be some operations that can't get an early reply or
 that need to wait for part of the operation to get committed (like renames
 that move a file's authority to a different MDS).
 IIRC the journal write head interval controls how often it flushes out
 the journal's header, which limits how out-of-date its hints on restart can
 be. (When the MDS restarts, it asks the journal head where the journal's
 unfinished start and end points are, but of course more of the journaled
 operations may have been fully completed since the head was written.)

>>>
>>> Thanks for the explanation. Which operations trigger an immediate flush?
>>> Is the readdir one of these operations? I noticed that the readdir
>>> operation latency is going higher under load when the OSDs are hitting the
>>> limit of the underlying hdd throughput. Can I assume that this is happening
>>> due to the journal flushing then?
>>>
>>
>> Not directly, but a readdir might ask to know the size of each file and
>> that will force the other clients in the system to flush their dirty data
>> in the directory (so that the readdir can return valid results).
>> -Greg
>>
>>
>
> Could it be also due to the MDS lock (operations waiting for the lock
> under load)?
>

Well that's not going to cause high OSD usage, and the MDS lock is not held
while writes are happening. But if the MDS is using 100% CPU, yes, it could
be contended.
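A quick way to check that, as a rough sketch (MDS_NAME is a placeholder):

# is the ceph-mds process itself CPU-bound?
top -p $(pgrep -d, ceph-mds)

# requests currently in flight inside the MDS
ceph daemon mds.$MDS_NAME ops | head -40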


> Also, I assume that the journal is using a different thread for flushing,
> right?
>

Yes, that's correct.

>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Optane still valid

2019-02-04 Thread solarflow99
I think one limitation would be the 375GB since bluestore needs a larger
amount of space than filestore did.
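For reference, a rough sketch of how a 375GB Optane is typically carved up for
this kind of layout. The device names, VG/LV names and the 30GB size are
assumptions used to illustrate the idea, not taken from Florian's actual setup:

# one ~30GB DB (incl. WAL) logical volume per P4510 on the Optane
vgcreate optane /dev/nvme0n1
lvcreate -L 30G -n db-0 optane
lvcreate -L 30G -n db-1 optane

# each data NVMe becomes a bluestore OSD with its RocksDB/WAL on the Optane LV
ceph-volume lvm create --bluestore --data /dev/nvme1n1 --block.db optane/db-0
ceph-volume lvm create --bluestore --data /dev/nvme2n1 --block.db optane/db-1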

On Mon, Feb 4, 2019 at 10:20 AM Florian Engelmann <
florian.engelm...@everyware.ch> wrote:

> Hi,
>
> we have built a 6 Node NVMe only Ceph Cluster with 4x Intel DC P4510 8TB
> each and one Intel DC P4800X 375GB Optane each. Up to 10x P4510 can be
> installed in each node.
> WAL and RocksDBs for all P4510 should be stored on the Optane (approx.
> 30GB per RocksDB incl. WAL).
> Internally, discussions arose whether the Optane would become a
> bottleneck beyond a certain number of P4510s.
> For us, the lowest possible latency is very important. Therefore the
> Optane NVMes were bought. In view of the good performance of the P4510,
> the question arises whether the Optanes still have a noticeable effect
> or whether they are actually just SPOFs?
>
>
> All the best,
> Florian
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] crush map has straw_calc_version=0 and legacy tunables on luminous

2019-02-04 Thread Shain Miley
Hello,
I just upgraded our cluster to 12.2.11 and I have a few questions around 
straw_calc_version and tunables.

Currently ceph status shows the following:

crush map has straw_calc_version=0
crush map has legacy tunables (require argonaut, min is firefly)


  1.  Will setting tunables to optimal also change the straw_calc_version or do 
I need to set that separately?

  2.  Right now I have a set of rbd kernel clients connecting using kernel 
version 4.4.  The ‘ceph daemon mon.id sessions’ command shows that this client 
is still connecting using the hammer feature set (and a few others on jewel as 
well):

"MonSession(client.113933130 10.35.100.121:0/3425045489 is open allow *, 
features 0x7fddff8ee8cbffb (jewel))",  "MonSession(client.112250505 
10.35.100.99:0/4174610322 is open allow *, features 0x106b84a842a42 (hammer))",

My question is what is the minimum kernel version I would need to upgrade the 
4.4 kernel server to in order to get to jewel or luminous?



  3.  Will setting the tunables to optimal on luminous prevent jewel and hammer 
clients from connecting?  I want to make sure I don’t do anything that will prevent 
my existing clients from connecting to the cluster.
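In case it helps, a quick sketch of how to see what the connected clients
currently report before changing anything (the mon name is a placeholder and
`ceph features` needs luminous mons):

# per-monitor list of sessions with their feature bits and release names
ceph daemon mon.$(hostname -s) sessions

# cluster-wide summary of client releases
ceph features

# the release the cluster currently requires of clients
ceph osd dump | grep min_compat_client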


Thanks in advance,
Shain

--
Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
smi...@npr.org | 202.513.3649
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS MDS journal

2019-02-04 Thread Mahmoud Ismail
On Mon, Feb 4, 2019 at 4:35 PM Gregory Farnum  wrote:

>
>
> On Mon, Feb 4, 2019 at 7:32 AM Mahmoud Ismail <
> mahmoudahmedism...@gmail.com> wrote:
>
>> On Mon, Feb 4, 2019 at 4:16 PM Gregory Farnum  wrote:
>>
>>> On Fri, Feb 1, 2019 at 2:29 AM Mahmoud Ismail <
>>> mahmoudahmedism...@gmail.com> wrote:
>>>
 Hello,

 I'm a bit confused about how the journaling actually works in the MDS.

 I was reading about these two configuration parameters (journal write
 head interval)  and (mds early reply). Does the MDS flush the journal
 synchronously after each operation, and does setting mds early reply to true
 allow operations to return without flushing? If so, what does the other
 parameter (journal write head interval) do, or is it not used by the MDS? Also,
 can all operations return without flushing with mds early reply, or is it
 specific to a subset of operations?

>>>
>>> In general, the MDS journal is flushed every five seconds (by default),
>>> and client requests get an early reply when the operation is done in memory
>>> but not yet committed to RADOS. Some operations will trigger an immediate
>>> flush, and there may be some operations that can't get an early reply or
>>> that need to wait for part of the operation to get committed (like renames
>>> that move a file's authority to a different MDS).
>>> IIRC the journal write head interval controls how often it flushes out
>>> the journal's header, which limits how out-of-date its hints on restart can
>>> be. (When the MDS restarts, it asks the journal head where the journal's
>>> unfinished start and end points are, but of course more of the journaled
>>> operations may have been fully completed since the head was written.)
>>>
>>
>> Thanks for the explanation. Which operations trigger an immediate flush?
>> Is the readdir one of these operations? I noticed that the readdir
>> operation latency is going higher under load when the OSDs are hitting the
>> limit of the underlying hdd throughput. Can I assume that this is happening
>> due to the journal flushing then?
>>
>
> Not directly, but a readdir might ask to know the size of each file and
> that will force the other clients in the system to flush their dirty data
> in the directory (so that the readdir can return valid results).
> -Greg
>
>

Could it also be due to the MDS lock (operations waiting for the lock under
load)? Also, I assume that the journal is using a different thread for
flushing, right?


>

>>
>>>
>>
 Another question, are open operations also written to the journal?

>>>
>>> Not opens per se, but we do persist when clients have permission to
>>> operate on files.
>>> -Greg
>>>
>>>

 Regards,
 Mahmoud

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-04 Thread Alexandre DERUMIER
>>but I don't see l_bluestore_fragmentation counter.
>>(but I have bluestore_fragmentation_micros)

ok, this is the same

  b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros",
"How fragmented bluestore free space is (free extents / max 
possible number of free extents) * 1000");


Here is a graph over the last month, with bluestore_fragmentation_micros and latency:

http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png
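For anyone who wants to plot the same thing, a minimal sketch of pulling both
values from the admin socket (the osd id and the jq paths are my assumptions
based on the counter names above):

# fragmentation estimate: (free extents / max possible free extents) * 1000
ceph daemon osd.0 perf dump | jq '.bluestore.bluestore_fragmentation_micros'

# average write/commit latency since the counters were created
ceph daemon osd.0 perf dump | jq '.osd.op_w_latency.sum / .osd.op_w_latency.avgcount'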

- Mail original -
De: "Alexandre Derumier" 
À: "Igor Fedotov" 
Cc: "Stefan Priebe, Profihost AG" , "Mark Nelson" 
, "Sage Weil" , "ceph-users" 
, "ceph-devel" 
Envoyé: Lundi 4 Février 2019 16:04:38
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Thanks Igor, 

>>Could you please collect BlueStore performance counters right after OSD 
>>startup and once you get high latency. 
>> 
>>Specifically 'l_bluestore_fragmentation' parameter is of interest. 

I'm already monitoring with 
"ceph daemon osd.x perf dump", (I have 2 months of history with all counters) 

but I don't see l_bluestore_fragmentation counter. 

(but I have bluestore_fragmentation_micros) 


>>Also if you're able to rebuild the code I can probably make a simple 
>>patch to track latency and some other internal allocator's paramter to 
>>make sure it's degraded and learn more details. 

Sorry, It's a critical production cluster, I can't test on it :( 
But I have a test cluster, maybe I can try to put some load on it, and try to 
reproduce. 



>>More vigorous fix would be to backport bitmap allocator from Nautilus 
>>and try the difference... 

Any plan to backport it to mimic ? (But I can wait for Nautilus) 
perf results of new bitmap allocator seem very promising from what I've seen in 
PR. 



- Mail original - 
De: "Igor Fedotov"  
À: "Alexandre Derumier" , "Stefan Priebe, Profihost AG" 
, "Mark Nelson"  
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel"  
Envoyé: Lundi 4 Février 2019 15:51:30 
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

Hi Alexandre, 

looks like a bug in StupidAllocator. 

Could you please collect BlueStore performance counters right after OSD 
startup and once you get high latency. 

Specifically 'l_bluestore_fragmentation' parameter is of interest. 

Also if you're able to rebuild the code I can probably make a simple 
patch to track latency and some other internal allocator's paramter to 
make sure it's degraded and learn more details. 


More vigorous fix would be to backport bitmap allocator from Nautilus 
and try the difference... 


Thanks, 

Igor 


On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
> Hi again, 
> 
> I speak too fast, the problem has occurred again, so it's not tcmalloc cache 
> size related. 
> 
> 
> I have notice something using a simple "perf top", 
> 
> each time I have this problem (I have seen exactly 4 times the same 
> behaviour), 
> 
> when latency is bad, perf top give me : 
> 
> StupidAllocator::_aligned_len 
> and 
> btree::btree_iterator long, unsigned long, std::less, mempoo 
> l::pool_allocator<(mempool::pool_index_t)1, std::pair unsigned long> >, 256> >, std::pair&, 
> std::pair const, unsigned long>*>::increment_slow() 
> 
> (around 10-20% time for both) 
> 
> 
> when latency is good, I don't see them at all. 
> 
> 
> I have used the Mark wallclock profiler, here the results: 
> 
> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
> 
> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
> 
> 
> here an extract of the thread with btree::btree_iterator && 
> StupidAllocator::_aligned_len 
> 
> 
> + 100.00% clone 
> + 100.00% start_thread 
> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
> ThreadPool::TPHandle&) 
> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr, 
> boost::intrusive_ptr, ThreadPool::TPHandle&) 
> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr&, 
> ThreadPool::TPHandle&) 
> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr) 
> | | + 68.00% 
> ReplicatedBackend::_handle_message(boost::intrusive_ptr) 
> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr) 
> | | + 67.00% non-virtual thunk to 
> PrimaryLogPG::queue_transactions(std::vector std::allocator >&, boost::intrusive_ptr) 
> | | | + 67.00% 
> BlueStore::queue_transactions(boost::intrusive_ptr&,
>  std::vector std::allocator >&, boost::intrusive_ptr, 
> ThreadPool::TPHandle*) 
> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
> ObjectStore::Transaction*) 
> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, 
> boost::intrusive_ptr&, 
> boost::intrusive_ptr&, unsigned long, unsigned long, 
> ceph::buffer::list&, unsigned int) 
> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, 
> boost::intrusive_ptr&, 

Re: [ceph-users] CephFS MDS journal

2019-02-04 Thread Gregory Farnum
On Mon, Feb 4, 2019 at 7:32 AM Mahmoud Ismail 
wrote:

> On Mon, Feb 4, 2019 at 4:16 PM Gregory Farnum  wrote:
>
>> On Fri, Feb 1, 2019 at 2:29 AM Mahmoud Ismail <
>> mahmoudahmedism...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I'm a bit confused about how the journaling actually works in the MDS.
>>>
>>> I was reading about these two configuration parameters (journal write
>>> head interval)  and (mds early reply). Does the MDS flush the journal
>>> synchronously after each operation, and does setting mds early reply to true
>>> allow operations to return without flushing? If so, what does the other
>>> parameter (journal write head interval) do, or is it not used by the MDS? Also,
>>> can all operations return without flushing with mds early reply, or is it
>>> specific to a subset of operations?
>>>
>>
>> In general, the MDS journal is flushed every five seconds (by default),
>> and client requests get an early reply when the operation is done in memory
>> but not yet committed to RADOS. Some operations will trigger an immediate
>> flush, and there may be some operations that can't get an early reply or
>> that need to wait for part of the operation to get committed (like renames
>> that move a file's authority to a different MDS).
>> IIRC the journal write head interval controls how often it flushes out
>> the journal's header, which limits how out-of-date its hints on restart can
>> be. (When the MDS restarts, it asks the journal head where the journal's
>> unfinished start and end points are, but of course more of the journaled
>> operations may have been fully completed since the head was written.)
>>
>
> Thanks for the explanation. Which operations trigger an immediate flush?
> Is the readdir one of these operations? I noticed that the readdir
> operation latency is going higher under load when the OSDs are hitting the
> limit of the underlying hdd throughput. Can I assume that this is happening
> due to the journal flushing then?
>

Not directly, but a readdir might ask to know the size of each file and
that will force the other clients in the system to flush their dirty data
in the directory (so that the readdir can return valid results).
-Greg


>
>
>>
>
>>> Another question, are open operations also written to the journal?
>>>
>>
>> Not opens per se, but we do persist when clients have permission to
>> operate on files.
>> -Greg
>>
>>
>>>
>>> Regards,
>>> Mahmoud
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS MDS journal

2019-02-04 Thread Mahmoud Ismail
On Mon, Feb 4, 2019 at 4:16 PM Gregory Farnum  wrote:

> On Fri, Feb 1, 2019 at 2:29 AM Mahmoud Ismail <
> mahmoudahmedism...@gmail.com> wrote:
>
>> Hello,
>>
>> I'm a bit confused about how the journaling actually works in the MDS.
>>
>> I was reading about these two configuration parameters (journal write
>> head interval)  and (mds early reply). Does the MDS flush the journal
>> synchronously after each operation, and does setting mds early reply to true
>> allow operations to return without flushing? If so, what does the other
>> parameter (journal write head interval) do, or is it not used by the MDS? Also,
>> can all operations return without flushing with mds early reply, or is it
>> specific to a subset of operations?
>>
>
> In general, the MDS journal is flushed every five seconds (by default),
> and client requests get an early reply when the operation is done in memory
> but not yet committed to RADOS. Some operations will trigger an immediate
> flush, and there may be some operations that can't get an early reply or
> that need to wait for part of the operation to get committed (like renames
> that move a file's authority to a different MDS).
> IIRC the journal write head interval controls how often it flushes out the
> journal's header, which limits how out-of-date its hints on restart can be.
> (When the MDS restarts, it asks the journal head where the journal's
> unfinished start and end points are, but of course more of the journaled
> operations may have been fully completed since the head was written.)
>

Thanks for the explanation. Which operations trigger an immediate flush? Is
the readdir one of these operations? I noticed that the readdir operation
latency is going higher under load when the OSDs are hitting the limit of
the underlying hdd throughput. Can I assume that this is happening due to
the journal flushing then?


>

>> Another question, are open operations also written to the journal?
>>
>
> Not opens per se, but we do persist when clients have permission to
> operate on files.
> -Greg
>
>
>>
>> Regards,
>> Mahmoud
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS MDS journal

2019-02-04 Thread Gregory Farnum
On Fri, Feb 1, 2019 at 2:29 AM Mahmoud Ismail 
wrote:

> Hello,
>
> I'm a bit confused about how the journaling actually works in the MDS.
>
> I was reading about these two configuration parameters (journal write head
> interval)  and (mds early reply). Does the MDS flush the journal
> synchronously after each operation, and does setting mds early reply to true
> allow operations to return without flushing? If so, what does the other
> parameter (journal write head interval) do, or is it not used by the MDS? Also,
> can all operations return without flushing with mds early reply, or is it
> specific to a subset of operations?
>

In general, the MDS journal is flushed every five seconds (by default), and
client requests get an early reply when the operation is done in memory but
not yet committed to RADOS. Some operations will trigger an immediate
flush, and there may be some operations that can't get an early reply or
that need to wait for part of the operation to get committed (like renames
that move a file's authority to a different MDS).
IIRC the journal write head interval controls how often it flushes out the
journal's header, which limits how out-of-date its hints on restart can be.
(When the MDS restarts, it asks the journal head where the journal's
unfinished start and end points are, but of course more of the journaled
operations may have been fully completed since the head was written.)
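For anyone who wants to poke at this on a live system, a small sketch; the
option names are how I believe the settings discussed above are spelled, so
check `config show` on your own version, and MDS_NAME is a placeholder:

# current values of the two settings
ceph daemon mds.$MDS_NAME config get mds_early_reply
ceph daemon mds.$MDS_NAME config get journaler_write_head_interval

# journal counters (events/segments written, expiry and flush activity)
ceph daemon mds.$MDS_NAME perf dump mds_log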


>
> Another question, are open operations also written to the journal?
>

Not opens per se, but we do persist when clients have permission to operate
on files.
-Greg


>
> Regards,
> Mahmoud
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-04 Thread Alexandre DERUMIER
Thanks Igor,

>>Could you please collect BlueStore performance counters right after OSD 
>>startup and once you get high latency. 
>>
>>Specifically 'l_bluestore_fragmentation' parameter is of interest. 

I'm already monitoring with
"ceph daemon osd.x perf dump",  (I have 2 months of history with all counters)

but I don't see l_bluestore_fragmentation counter.

(but I have bluestore_fragmentation_micros)


>>Also if you're able to rebuild the code I can probably make a simple 
>>patch to track latency and some other internal allocator's paramter to 
>>make sure it's degraded and learn more details. 

Sorry, it's a critical production cluster, I can't test on it :(
But I have a test cluster; maybe I can put some load on it and try to 
reproduce.



>>More vigorous fix would be to backport bitmap allocator from Nautilus 
>>and try the difference... 

Any plan to backport it to mimic? (But I can wait for Nautilus.)
The perf results of the new bitmap allocator seem very promising from what I've seen in 
the PR.



- Mail original -
De: "Igor Fedotov" 
À: "Alexandre Derumier" , "Stefan Priebe, Profihost AG" 
, "Mark Nelson" 
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel" 
Envoyé: Lundi 4 Février 2019 15:51:30
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi Alexandre, 

looks like a bug in StupidAllocator. 

Could you please collect BlueStore performance counters right after OSD 
startup and once you get high latency. 

Specifically 'l_bluestore_fragmentation' parameter is of interest. 

Also if you're able to rebuild the code I can probably make a simple 
patch to track latency and some other internal allocator's paramter to 
make sure it's degraded and learn more details. 


More vigorous fix would be to backport bitmap allocator from Nautilus 
and try the difference... 


Thanks, 

Igor 


On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
> Hi again, 
> 
> I speak too fast, the problem has occurred again, so it's not tcmalloc cache 
> size related. 
> 
> 
> I have notice something using a simple "perf top", 
> 
> each time I have this problem (I have seen exactly 4 times the same 
> behaviour), 
> 
> when latency is bad, perf top give me : 
> 
> StupidAllocator::_aligned_len 
> and 
> btree::btree_iterator long, unsigned long, std::less, mempoo 
> l::pool_allocator<(mempool::pool_index_t)1, std::pair unsigned long> >, 256> >, std::pair&, 
> std::pair const, unsigned long>*>::increment_slow() 
> 
> (around 10-20% time for both) 
> 
> 
> when latency is good, I don't see them at all. 
> 
> 
> I have used the Mark wallclock profiler, here the results: 
> 
> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
> 
> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
> 
> 
> here an extract of the thread with btree::btree_iterator && 
> StupidAllocator::_aligned_len 
> 
> 
> + 100.00% clone 
> + 100.00% start_thread 
> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
> ThreadPool::TPHandle&) 
> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr, 
> boost::intrusive_ptr, ThreadPool::TPHandle&) 
> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr&, 
> ThreadPool::TPHandle&) 
> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr) 
> | | + 68.00% 
> ReplicatedBackend::_handle_message(boost::intrusive_ptr) 
> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr) 
> | | + 67.00% non-virtual thunk to 
> PrimaryLogPG::queue_transactions(std::vector std::allocator >&, boost::intrusive_ptr) 
> | | | + 67.00% 
> BlueStore::queue_transactions(boost::intrusive_ptr&,
>  std::vector std::allocator >&, boost::intrusive_ptr, 
> ThreadPool::TPHandle*) 
> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
> ObjectStore::Transaction*) 
> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, 
> boost::intrusive_ptr&, 
> boost::intrusive_ptr&, unsigned long, unsigned long, 
> ceph::buffer::list&, unsigned int) 
> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, 
> boost::intrusive_ptr&, 
> boost::intrusive_ptr, unsigned long, unsigned long, 
> ceph::buffer::list&, unsigned int) 
> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, 
> boost::intrusive_ptr, 
> boost::intrusive_ptr, BlueStore::WriteContext*) 
> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, 
> unsigned long, long, std::vector mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned 
> long, long, unsigned long*, unsigned int*) 
> | | | | | | + 34.00% 
> btree::btree_iterator long, unsigned long, std::less, 
> mempool::pool_allocator<(mempool::pool_index_t)1, std::pair const, unsigned long> >, 256> >, std::pair long>&, std::pair*>

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-04 Thread Igor Fedotov

Hi Alexandre,

looks like a bug in StupidAllocator.

Could you please collect BlueStore performance counters right after OSD 
startup and once you get high latency.


Specifically 'l_bluestore_fragmentation' parameter is of interest.

Also if you're able to rebuild the code I can probably make a simple 
patch to track latency and some other internal allocator's paramter to 
make sure it's degraded and learn more details.



More vigorous fix would be to backport bitmap allocator from Nautilus 
and try the difference...



Thanks,

Igor


On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote:

Hi again,

I speak too fast, the problem has occurred again, so it's not tcmalloc cache 
size related.


I have notice something using a simple "perf top",

each time I have this problem (I have seen exactly 4 times the same behaviour),

when latency is bad, perf top give me :

StupidAllocator::_aligned_len
and
btree::btree_iterator, mempoo
l::pool_allocator<(mempool::pool_index_t)1, std::pair >, 
256> >, std::pair&, std::pair*>::increment_slow()

(around 10-20% time for both)


when latency is good, I don't see them at all.


I have used the Mark wallclock profiler, here the results:

http://odisoweb1.odiso.net/gdbpmp-ok.txt

http://odisoweb1.odiso.net/gdbpmp-bad.txt


here an extract of the thread with btree::btree_iterator && 
StupidAllocator::_aligned_len


+ 100.00% clone
   + 100.00% start_thread
 + 100.00% ShardedThreadPool::WorkThreadSharded::entry()
   + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
 + 100.00% OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)
   + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
ThreadPool::TPHandle&)
   | + 70.00% OSD::dequeue_op(boost::intrusive_ptr, 
boost::intrusive_ptr, ThreadPool::TPHandle&)
   |   + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr&, 
ThreadPool::TPHandle&)
   | + 68.00% 
PGBackend::handle_message(boost::intrusive_ptr)
   | | + 68.00% 
ReplicatedBackend::_handle_message(boost::intrusive_ptr)
   | |   + 68.00% 
ReplicatedBackend::do_repop(boost::intrusive_ptr)
   | | + 67.00% non-virtual thunk to 
PrimaryLogPG::queue_transactions(std::vector >&, boost::intrusive_ptr)
   | | | + 67.00% 
BlueStore::queue_transactions(boost::intrusive_ptr&, 
std::vector >&, 
boost::intrusive_ptr, ThreadPool::TPHandle*)
   | | |   + 66.00% 
BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
ObjectStore::Transaction*)
   | | |   | + 66.00% BlueStore::_write(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr&, unsigned long, unsigned long, 
ceph::buffer::list&, unsigned int)
   | | |   |   + 66.00% BlueStore::_do_write(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr, unsigned long, unsigned long, 
ceph::buffer::list&, unsigned int)
   | | |   | + 65.00% 
BlueStore::_do_alloc_write(BlueStore::TransContext*, 
boost::intrusive_ptr, 
boost::intrusive_ptr, BlueStore::WriteContext*)
   | | |   | | + 64.00% StupidAllocator::allocate(unsigned long, 
unsigned long, unsigned long, long, std::vector >*)
   | | |   | | | + 64.00% 
StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned 
long*, unsigned int*)
   | | |   | | |   + 34.00% btree::btree_iterator, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair >, 256> >, std::pair&, std::pair*>::increment_slow()
   | | |   | | |   + 26.00% StupidAllocator::_aligned_len(interval_set, 
mempool::pool_allocator<(mempool::pool_index_t)1, std::pair 
>, 256> >::iterator, unsigned long)



- Mail original -
De: "Alexandre Derumier" 
À: "Stefan Priebe, Profihost AG" 
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel" 
Envoyé: Lundi 4 Février 2019 09:38:11
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi,

some news:

I have tried with different transparent hugepage values (madvise, never) : no 
change

I have tried to increase bluestore_cache_size_ssd to 8G: no change

I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256mb : it 
seem to help, after 24h I'm still around 1,5ms. (need to wait some more days to 
be sure)


Note that this behaviour seem to happen really faster (< 2 days) on my big nvme 
drives (6TB),
my other clusters use 1.6TB ssd.

Currently I'm using only 1 osd by nvme (I don't have more than 5000iops by 
osd), but I'll try this week with 2osd by nvme, to see if it's helping.


BTW, does somebody have already tested ceph without tcmalloc, with glibc >= 
2.26 (which have also thread cache) ?


Regards,

Alexandre


- Mail original -
De: "aderumier" 
À: "Stefan Priebe, Profihost AG" 
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel" 
Envoyé: Mercredi 30 Janvier 2019 19:58:15
Objet: Re: [ceph-us

[ceph-users] Optane still valid

2019-02-04 Thread Florian Engelmann

Hi,

we have built a 6 Node NVMe only Ceph Cluster with 4x Intel DC P4510 8TB 
each and one Intel DC P4800X 375GB Optane each. Up to 10x P4510 can be 
installed in each node.
WAL and RocksDBs for all P4510 should be stored on the Optane (approx. 
30GB per RocksDB incl. WAL).
Internally, discussions arose whether the Optane would become a 
bottleneck beyond a certain number of P4510s.
For us, the lowest possible latency is very important. Therefore the 
Optane NVMes were bought. In view of the good performance of the P4510, 
the question arises whether the Optanes still have a noticeable effect 
or whether they are actually just SPOFs?



All the best,
Florian


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-04 Thread Alexandre DERUMIER
Hi again,

I speak too fast, the problem has occurred again, so it's not tcmalloc cache 
size related.


I have noticed something using a simple "perf top":

each time I have this problem (I have seen exactly the same behaviour 4 times),

when latency is bad, perf top gives me: 

StupidAllocator::_aligned_len
and
btree::btree_iterator, mempoo
l::pool_allocator<(mempool::pool_index_t)1, std::pair >, 256> >, std::pair&, 
std::pair*>::increment_slow()

(around 10-20% time for both)


when latency is good, I don't see them at all.


I have used the Mark wallclock profiler, here the results:

http://odisoweb1.odiso.net/gdbpmp-ok.txt

http://odisoweb1.odiso.net/gdbpmp-bad.txt


here an extract of the thread with btree::btree_iterator && 
StupidAllocator::_aligned_len


+ 100.00% clone
  + 100.00% start_thread
+ 100.00% ShardedThreadPool::WorkThreadSharded::entry()
  + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
+ 100.00% OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)
  + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
ThreadPool::TPHandle&)
  | + 70.00% OSD::dequeue_op(boost::intrusive_ptr, 
boost::intrusive_ptr, ThreadPool::TPHandle&)
  |   + 70.00% 
PrimaryLogPG::do_request(boost::intrusive_ptr&, 
ThreadPool::TPHandle&)
  | + 68.00% 
PGBackend::handle_message(boost::intrusive_ptr)
  | | + 68.00% 
ReplicatedBackend::_handle_message(boost::intrusive_ptr)
  | |   + 68.00% 
ReplicatedBackend::do_repop(boost::intrusive_ptr)
  | | + 67.00% non-virtual thunk to 
PrimaryLogPG::queue_transactions(std::vector >&, boost::intrusive_ptr)
  | | | + 67.00% 
BlueStore::queue_transactions(boost::intrusive_ptr&,
 std::vector 
>&, boost::intrusive_ptr, ThreadPool::TPHandle*)
  | | |   + 66.00% 
BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
ObjectStore::Transaction*)
  | | |   | + 66.00% 
BlueStore::_write(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr&, unsigned long, unsigned long, 
ceph::buffer::list&, unsigned int)
  | | |   |   + 66.00% 
BlueStore::_do_write(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr, unsigned long, unsigned long, 
ceph::buffer::list&, unsigned int)
  | | |   | + 65.00% 
BlueStore::_do_alloc_write(BlueStore::TransContext*, 
boost::intrusive_ptr, 
boost::intrusive_ptr, BlueStore::WriteContext*)
  | | |   | | + 64.00% StupidAllocator::allocate(unsigned 
long, unsigned long, unsigned long, long, std::vector >*)
  | | |   | | | + 64.00% 
StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned 
long*, unsigned int*)
  | | |   | | |   + 34.00% 
btree::btree_iterator, 
mempool::pool_allocator<(mempool::pool_index_t)1, std::pair >, 256> >, std::pair&, std::pair*>::increment_slow()
  | | |   | | |   + 26.00% 
StupidAllocator::_aligned_len(interval_set, 
mempool::pool_allocator<(mempool::pool_index_t)1, std::pair >, 256> >::iterator, unsigned long)



- Mail original -
De: "Alexandre Derumier" 
À: "Stefan Priebe, Profihost AG" 
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel" 
Envoyé: Lundi 4 Février 2019 09:38:11
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi, 

some news: 

I have tried with different transparent hugepage values (madvise, never) : no 
change 

I have tried to increase bluestore_cache_size_ssd to 8G: no change 

I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256mb : it 
seem to help, after 24h I'm still around 1,5ms. (need to wait some more days to 
be sure) 


Note that this behaviour seem to happen really faster (< 2 days) on my big nvme 
drives (6TB), 
my other clusters use 1.6TB ssd. 

Currently I'm using only 1 osd by nvme (I don't have more than 5000iops by 
osd), but I'll try this week with 2osd by nvme, to see if it's helping. 


BTW, does somebody have already tested ceph without tcmalloc, with glibc >= 
2.26 (which have also thread cache) ? 


Regards, 

Alexandre 


- Mail original - 
De: "aderumier"  
À: "Stefan Priebe, Profihost AG"  
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel"  
Envoyé: Mercredi 30 Janvier 2019 19:58:15 
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

>>Thanks. Is there any reason you monitor op_w_latency but not 
>>op_r_latency but instead op_latency? 
>> 
>>Also why do you monitor op_w_process_latency? but not op_r_process_latency? 

I monitor read too. (I have all metrics for osd sockets, and a lot of graphs). 

I just don't see latency difference on reads. (or they are very very small vs 
the write latency increase) 



- Mail original - 
De: "Stefan Priebe, Profihost AG"  
À: "aderumier"  
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel"  
Envoyé: Mercredi

Re: [ceph-users] Kernel requirements for balancer in upmap mode

2019-02-04 Thread Massimo Sgaravatto
Thanks a lot !

On Mon, Feb 4, 2019 at 12:35 PM Konstantin Shalygin  wrote:

> So, if I am using ceph just to provide block storage to an OpenStack
> cluster (so using libvirt), the kernel version on the client nodes
> shouldn't matter, right ?
>
> Yep, just make sure your librbd on compute hosts is Luminous.
>
>
>
> k
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore deploys to tmpfs?

2019-02-04 Thread Alfredo Deza
On Mon, Feb 4, 2019 at 4:43 AM Hector Martin  wrote:
>
> On 02/02/2019 05:07, Stuart Longland wrote:
> > On 1/2/19 10:43 pm, Alfredo Deza wrote:
> >>> The tmpfs setup is expected. All persistent data for bluestore OSDs
> >>> setup with LVM are stored in LVM metadata. The LVM/udev handler for
> >>> bluestore volumes create these tmpfs filesystems on the fly and populate
> >>> them with the information from the metadata.
> >> That is mostly what happens. There isn't a dependency on UDEV anymore
> >> (yay), but the reason why files are mounted on tmpfs
> >> is because *bluestore* spits them out on activation, this makes the
> >> path fully ephemeral (a great thing!)
> >>
> >> The step-by-step is documented in this summary section of  'activate'
> >> http://docs.ceph.com/docs/master/ceph-volume/lvm/activate/#summary
> >>
> >> Filestore doesn't have any of these capabilities and it is why it does
> >> have an actual existing path (vs. tmpfs), and the files come from the
> >> data partition that
> >> gets mounted.
> >>
> >
> > Well, for whatever reason, ceph-osd isn't calling the activate script
> > before it starts up.
> >
> > It is worth noting that the systems I'm using do not use systemd out of
> > simplicity.  I might need to write an init script to do that.  It wasn't
> > clear last weekend what commands I needed to run to activate a BlueStore
> > OSD.
>
> The way you do this on Gentoo is by writing the OSD FSID into
> /etc/conf.d/ceph-osd.. You need to make note of the ID when the OSD
> is first deployed.
>
> # echo "bluestore_osd_fsid=$(cat /var/lib/ceph/osd/ceph-0/fsid)" >
> /etc/conf.d/ceph-osd.0
>
> And then of course do the usual initscript symlink enable on Gentoo:
>
> # ln -s ceph /etc/init.d/ceph-osd.0
> # rc-update add ceph-osd.0 default
>
> This will then call `ceph-volume lvm activate` for that OSD for you
> before bringing it up, which will populate the tmpfs. It is the Gentoo
> OpenRC equivalent of enabling the systemd unit for that osd-fsid on
> systemd systems (but ceph-volume won't do it for you).

This is spot on. The whole systemd script for ceph-volume is merely
passing the ID and FSID over to ceph-volume, which ends up doing
something like:

ceph-volume lvm activate ID FSID
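For completeness, a sketch of doing that by hand on a non-systemd box (OSD_ID
and OSD_FSID are placeholders for the values printed by `ceph-volume lvm list`):

# list the OSDs ceph-volume knows about, including osd id and osd fsid
ceph-volume lvm list

# activate a single OSD (creates the tmpfs mount and populates it from LV tags)
ceph-volume lvm activate $OSD_ID $OSD_FSID

# or simply activate everything found on this host
ceph-volume lvm activate --all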

>
> You may also want to add some dependencies for all OSDs depending on
> your setup (e.g. I run single-host and the mon has the OSD dm-crypt
> keys, so that has to come first):
>
> # echo 'rc_need="ceph-mon.0"' > /etc/conf.d/ceph-osd
>
> The Gentoo initscript setup for Ceph is unfortunately not very well
> documented. I've been meaning to write a blogpost about this to try to
> share what I've learned :-)
>
> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://marcan.st/marcan.asc
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem replacing osd with ceph-deploy

2019-02-04 Thread Alfredo Deza
On Fri, Feb 1, 2019 at 6:07 PM Shain Miley  wrote:
>
> Hi,
>
> I went to replace a disk today (which I had not had to do in a while)
> and after I added it the results looked rather odd compared to times past:
>
> I was attempting to replace /dev/sdk on one of our osd nodes:
>
> #ceph-deploy disk zap hqosd7 /dev/sdk
> #ceph-deploy osd create --data /dev/sdk hqosd7
>
> [ceph_deploy.conf][DEBUG ] found configuration file at:
> /root/.cephdeploy.conf
> [ceph_deploy.cli][INFO  ] Invoked (2.0.1): /usr/local/bin/ceph-deploy
> osd create --data /dev/sdk hqosd7
> [ceph_deploy.cli][INFO  ] ceph-deploy options:
> [ceph_deploy.cli][INFO  ]  verbose   : False
> [ceph_deploy.cli][INFO  ]  bluestore : None
> [ceph_deploy.cli][INFO  ]  cd_conf   :
> 
> [ceph_deploy.cli][INFO  ]  cluster   : ceph
> [ceph_deploy.cli][INFO  ]  fs_type   : xfs
> [ceph_deploy.cli][INFO  ]  block_wal : None
> [ceph_deploy.cli][INFO  ]  default_release   : False
> [ceph_deploy.cli][INFO  ]  username  : None
> [ceph_deploy.cli][INFO  ]  journal   : None
> [ceph_deploy.cli][INFO  ]  subcommand: create
> [ceph_deploy.cli][INFO  ]  host  : hqosd7
> [ceph_deploy.cli][INFO  ]  filestore : None
> [ceph_deploy.cli][INFO  ]  func  :  at 0x7fa3b14b3398>
> [ceph_deploy.cli][INFO  ]  ceph_conf : None
> [ceph_deploy.cli][INFO  ]  zap_disk  : False
> [ceph_deploy.cli][INFO  ]  data  : /dev/sdk
> [ceph_deploy.cli][INFO  ]  block_db  : None
> [ceph_deploy.cli][INFO  ]  dmcrypt   : False
> [ceph_deploy.cli][INFO  ]  overwrite_conf: False
> [ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   :
> /etc/ceph/dmcrypt-keys
> [ceph_deploy.cli][INFO  ]  quiet : False
> [ceph_deploy.cli][INFO  ]  debug : False
> [ceph_deploy.osd][DEBUG ] Creating OSD on cluster ceph with data device
> /dev/sdk
> [hqosd7][DEBUG ] connected to host: hqosd7
> [hqosd7][DEBUG ] detect platform information from remote host
> [hqosd7][DEBUG ] detect machine type
> [hqosd7][DEBUG ] find the location of an executable
> [ceph_deploy.osd][INFO  ] Distro info: Ubuntu 16.04 xenial
> [ceph_deploy.osd][DEBUG ] Deploying osd to hqosd7
> [hqosd7][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
> [hqosd7][DEBUG ] find the location of an executable
> [hqosd7][INFO  ] Running command: /usr/sbin/ceph-volume --cluster ceph
> lvm create --bluestore --data /dev/sdk
> [hqosd7][DEBUG ] Running command: /usr/bin/ceph-authtool --gen-print-key
> [hqosd7][DEBUG ] Running command: /usr/bin/ceph --cluster ceph --name
> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
> -i - osd new c98a11d1-9b7f-487e-8c69-72fc662927d4
> [hqosd7][DEBUG ] Running command: vgcreate --force --yes
> ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1 /dev/sdk
> [hqosd7][DEBUG ]  stdout: Physical volume "/dev/sdk" successfully created
> [hqosd7][DEBUG ]  stdout: Volume group
> "ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1" successfully created
> [hqosd7][DEBUG ] Running command: lvcreate --yes -l 100%FREE -n
> osd-block-c98a11d1-9b7f-487e-8c69-72fc662927d4
> ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1
> [hqosd7][DEBUG ]  stdout: Logical volume
> "osd-block-c98a11d1-9b7f-487e-8c69-72fc662927d4" created.
> [hqosd7][DEBUG ] Running command: /usr/bin/ceph-authtool --gen-print-key
> [hqosd7][DEBUG ] Running command: mount -t tmpfs tmpfs
> /var/lib/ceph/osd/ceph-81
> [hqosd7][DEBUG ] Running command: chown -R ceph:ceph /dev/dm-0
> [hqosd7][DEBUG ] Running command: ln -s
> /dev/ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1/osd-block-c98a11d1-9b7f-487e-8c69-72fc662927d4
> /var/lib/ceph/osd/ceph-81/block
> [hqosd7][DEBUG ] Running command: ceph --cluster ceph --name
> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
> mon getmap -o /var/lib/ceph/osd/ceph-81/activate.monmap
> [hqosd7][DEBUG ]  stderr: got monmap epoch 2
> [hqosd7][DEBUG ] Running command: ceph-authtool
> /var/lib/ceph/osd/ceph-81/keyring --create-keyring --name osd.81
> --add-key AQCyyFRcSwWqGBAAKZR8rcWIEknj/o3rsehOdA==
> [hqosd7][DEBUG ]  stdout: creating /var/lib/ceph/osd/ceph-81/keyring
> [hqosd7][DEBUG ]  stdout: added entity osd.81 auth auth(auid =
> 18446744073709551615 key=AQCyyFRcSwWqGBAAKZR8rcWIEknj/o3rsehOdA== with 0
> caps)
> [hqosd7][DEBUG ] Running command: chown -R ceph:ceph
> /var/lib/ceph/osd/ceph-81/keyring
> [hqosd7][DEBUG ] Running command: chown -R ceph:ceph
> /var/lib/ceph/osd/ceph-81/
> [hqosd7][DEBUG ] Running command: /usr/bin/ceph-osd --cluster ceph
> --osd-objectstore bluestore --mkfs -i 81 --monmap
> /var/lib/ceph/osd/ceph-81/activate.monmap --keyfile - --osd-data
> /var/lib/ceph/osd/ceph-8

[ceph-users] ceph OSD cache ration usage

2019-02-04 Thread M Ranga Swami Reddy
Hello - We are using ceph OSD nodes with a controller cache of 1G.
Are there any recommendations for splitting the cache between read and write?
Here we are using HDDs with colocated journals.
For SSD journals - 0% cache and 100% write.

Thanks
Swami
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem replacing osd with ceph-deploy

2019-02-04 Thread Alfredo Deza
On Fri, Feb 1, 2019 at 6:35 PM Vladimir Prokofev  wrote:
>
> Your output looks a bit weird, but still, this is normal for bluestore. It 
> creates a small separate data partition that is presented as XFS mounted in 
> /var/lib/ceph/osd, while the real data partition is hidden as a raw (bluestore) 
> block device.

That is not right for this output. It is using ceph-volume with LVM;
there are no partitions being created.
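A quick sketch of how to confirm what was created and how full the new OSD is
(the osd id is taken from the log below; adjust as needed):

# show the logical volumes and tags behind each ceph-volume OSD on this host
ceph-volume lvm list

# per-OSD utilisation, since df on the tmpfs mount tells you nothing for bluestore
ceph osd df tree | grep -w osd.81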

> It's no longer possible to check disk utilisation with df using bluestore.
> To check your osd capacity use 'ceph osd df'
>
> сб, 2 февр. 2019 г. в 02:07, Shain Miley :
>>
>> Hi,
>>
>> I went to replace a disk today (which I had not had to do in a while)
>> and after I added it the results looked rather odd compared to times past:
>>
>> I was attempting to replace /dev/sdk on one of our osd nodes:
>>
>> #ceph-deploy disk zap hqosd7 /dev/sdk
>> #ceph-deploy osd create --data /dev/sdk hqosd7
>>
>> [ceph_deploy.conf][DEBUG ] found configuration file at:
>> /root/.cephdeploy.conf
>> [ceph_deploy.cli][INFO  ] Invoked (2.0.1): /usr/local/bin/ceph-deploy
>> osd create --data /dev/sdk hqosd7
>> [ceph_deploy.cli][INFO  ] ceph-deploy options:
>> [ceph_deploy.cli][INFO  ]  verbose   : False
>> [ceph_deploy.cli][INFO  ]  bluestore : None
>> [ceph_deploy.cli][INFO  ]  cd_conf   :
>> 
>> [ceph_deploy.cli][INFO  ]  cluster   : ceph
>> [ceph_deploy.cli][INFO  ]  fs_type   : xfs
>> [ceph_deploy.cli][INFO  ]  block_wal : None
>> [ceph_deploy.cli][INFO  ]  default_release   : False
>> [ceph_deploy.cli][INFO  ]  username  : None
>> [ceph_deploy.cli][INFO  ]  journal   : None
>> [ceph_deploy.cli][INFO  ]  subcommand: create
>> [ceph_deploy.cli][INFO  ]  host  : hqosd7
>> [ceph_deploy.cli][INFO  ]  filestore : None
>> [ceph_deploy.cli][INFO  ]  func  : > at 0x7fa3b14b3398>
>> [ceph_deploy.cli][INFO  ]  ceph_conf : None
>> [ceph_deploy.cli][INFO  ]  zap_disk  : False
>> [ceph_deploy.cli][INFO  ]  data  : /dev/sdk
>> [ceph_deploy.cli][INFO  ]  block_db  : None
>> [ceph_deploy.cli][INFO  ]  dmcrypt   : False
>> [ceph_deploy.cli][INFO  ]  overwrite_conf: False
>> [ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   :
>> /etc/ceph/dmcrypt-keys
>> [ceph_deploy.cli][INFO  ]  quiet : False
>> [ceph_deploy.cli][INFO  ]  debug : False
>> [ceph_deploy.osd][DEBUG ] Creating OSD on cluster ceph with data device
>> /dev/sdk
>> [hqosd7][DEBUG ] connected to host: hqosd7
>> [hqosd7][DEBUG ] detect platform information from remote host
>> [hqosd7][DEBUG ] detect machine type
>> [hqosd7][DEBUG ] find the location of an executable
>> [ceph_deploy.osd][INFO  ] Distro info: Ubuntu 16.04 xenial
>> [ceph_deploy.osd][DEBUG ] Deploying osd to hqosd7
>> [hqosd7][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
>> [hqosd7][DEBUG ] find the location of an executable
>> [hqosd7][INFO  ] Running command: /usr/sbin/ceph-volume --cluster ceph
>> lvm create --bluestore --data /dev/sdk
>> [hqosd7][DEBUG ] Running command: /usr/bin/ceph-authtool --gen-print-key
>> [hqosd7][DEBUG ] Running command: /usr/bin/ceph --cluster ceph --name
>> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
>> -i - osd new c98a11d1-9b7f-487e-8c69-72fc662927d4
>> [hqosd7][DEBUG ] Running command: vgcreate --force --yes
>> ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1 /dev/sdk
>> [hqosd7][DEBUG ]  stdout: Physical volume "/dev/sdk" successfully created
>> [hqosd7][DEBUG ]  stdout: Volume group
>> "ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1" successfully created
>> [hqosd7][DEBUG ] Running command: lvcreate --yes -l 100%FREE -n
>> osd-block-c98a11d1-9b7f-487e-8c69-72fc662927d4
>> ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1
>> [hqosd7][DEBUG ]  stdout: Logical volume
>> "osd-block-c98a11d1-9b7f-487e-8c69-72fc662927d4" created.
>> [hqosd7][DEBUG ] Running command: /usr/bin/ceph-authtool --gen-print-key
>> [hqosd7][DEBUG ] Running command: mount -t tmpfs tmpfs
>> /var/lib/ceph/osd/ceph-81
>> [hqosd7][DEBUG ] Running command: chown -R ceph:ceph /dev/dm-0
>> [hqosd7][DEBUG ] Running command: ln -s
>> /dev/ceph-bbe0e44e-afc9-4cf1-9f1a-ed7d20f796c1/osd-block-c98a11d1-9b7f-487e-8c69-72fc662927d4
>> /var/lib/ceph/osd/ceph-81/block
>> [hqosd7][DEBUG ] Running command: ceph --cluster ceph --name
>> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
>> mon getmap -o /var/lib/ceph/osd/ceph-81/activate.monmap
>> [hqosd7][DEBUG ]  stderr: got monmap epoch 2
>> [hqosd7][DEBUG ] Running command: ceph-authtool
>> /var/lib/ceph/osd/ceph-81/keyring --create-keyring --name osd.81
>> --add-key AQCyyFRcSwWqGBAAKZR8rcWIEknj/o3r

Re: [ceph-users] Kernel requirements for balancer in upmap mode

2019-02-04 Thread Konstantin Shalygin

So, if I am using ceph just to provide block storage to an OpenStack
cluster (so using libvirt), the kernel version on the client nodes
shouldn't matter, right ?


Yep, just make sure your librbd on compute hosts is Luminous.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel requirements for balancer in upmap mode

2019-02-04 Thread Massimo Sgaravatto
Thanks a lot

So, if I am using ceph just to provide block storage to an OpenStack
cluster (so using libvirt), the kernel version on the client nodes
shouldn't matter, right ?

Thanks again, Massimo

On Mon, Feb 4, 2019 at 10:02 AM Ilya Dryomov  wrote:

> On Mon, Feb 4, 2019 at 9:25 AM Massimo Sgaravatto
>  wrote:
> >
> > The official documentation [*] says that the only requirement to use the
> balancer in upmap mode is that all clients must run at least luminous.
> > But I read somewhere (also in this mailing list) that there are also
> requirements wrt the kernel.
> > If so:
> >
> > 1) Could you please specify what is the minimum required kernel ?
>
> 4.13 or CentOS 7.5.  See [1] for details.
>
> > 2) Does this kernel requirement apply only to the OSD nodes ? Or also to
> the clients ?
>
> No, only to the kernel client nodes.  If the kernel client isn't used,
> there is no requirement at all.
>
> [1]
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/027002.html
>
> Thanks,
>
> Ilya
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore deploys to tmpfs?

2019-02-04 Thread Hector Martin

On 02/02/2019 05:07, Stuart Longland wrote:

On 1/2/19 10:43 pm, Alfredo Deza wrote:

The tmpfs setup is expected. All persistent data for bluestore OSDs
setup with LVM are stored in LVM metadata. The LVM/udev handler for
bluestore volumes create these tmpfs filesystems on the fly and populate
them with the information from the metadata.

That is mostly what happens. There isn't a dependency on UDEV anymore
(yay), but the reason why files are mounted on tmpfs
is because *bluestore* spits them out on activation, this makes the
path fully ephemeral (a great thing!)

The step-by-step is documented in this summary section of  'activate'
http://docs.ceph.com/docs/master/ceph-volume/lvm/activate/#summary

Filestore doesn't have any of these capabilities and it is why it does
have an actual existing path (vs. tmpfs), and the files come from the
data partition that
gets mounted.



Well, for whatever reason, ceph-osd isn't calling the activate script
before it starts up.

It is worth noting that the systems I'm using do not use systemd out of
simplicity.  I might need to write an init script to do that.  It wasn't
clear last weekend what commands I needed to run to activate a BlueStore
OSD.


The way you do this on Gentoo is by writing the OSD FSID into 
/etc/conf.d/ceph-osd.. You need to make note of the ID when the OSD 
is first deployed.


# echo "bluestore_osd_fsid=$(cat /var/lib/ceph/osd/ceph-0/fsid)" > 
/etc/conf.d/ceph-osd.0


And then of course do the usual initscript symlink enable on Gentoo:

# ln -s ceph /etc/init.d/ceph-osd.0
# rc-update add ceph-osd.0 default

This will then call `ceph-volume lvm activate` for that OSD for you 
before bringing it up, which will populate the tmpfs. It is the Gentoo 
OpenRC equivalent of enabling the systemd unit for that osd-fsid on 
systemd systems (but ceph-volume won't do it for you).


You may also want to add some dependencies for all OSDs depending on 
your setup (e.g. I run single-host and the mon has the OSD dm-crypt 
keys, so that has to come first):


# echo 'rc_need="ceph-mon.0"' > /etc/conf.d/ceph-osd

The Gentoo initscript setup for Ceph is unfortunately not very well 
documented. I've been meaning to write a blogpost about this to try to 
share what I've learned :-)


--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel requirements for balancer in upmap mode

2019-02-04 Thread Ilya Dryomov
On Mon, Feb 4, 2019 at 9:25 AM Massimo Sgaravatto
 wrote:
>
> The official documentation [*] says that the only requirement to use the 
> balancer in upmap mode is that all clients must run at least luminous.
> But I read somewhere (also in this mailing list) that there are also 
> requirements wrt the kernel.
> If so:
>
> 1) Could you please specify what is the minimum required kernel ?

4.13 or CentOS 7.5.  See [1] for details.

> 2) Does this kernel requirement apply only to the OSD nodes ? Or also to the 
> clients ?

No, only to the kernel client nodes.  If the kernel client isn't used,
there is no requirement at all.
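
For completeness, a minimal sketch of checking and enabling this on a
Luminous cluster (standard commands; the output will obviously vary):

# Show the release/feature level of everything currently connected;
# pre-luminous or old kernel clients will show up here
ceph features

# Once nothing older than luminous is connected
ceph osd set-require-min-compat-client luminous
ceph mgr module enable balancer
ceph balancer mode upmap
ceph balancer on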

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/027002.html

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-04 Thread Alexandre DERUMIER
Hi,

some news:

I have tried different transparent hugepage values (madvise, never): no 
change

I have tried to increase bluestore_cache_size_ssd to 8G: no change

I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256 MB: it 
seems to help; after 24h I'm still around 1.5 ms. (I need to wait some more days 
to be sure.)


Note that this behaviour seems to show up much faster (< 2 days) on my big NVMe 
drives (6TB); my other clusters use 1.6TB SSDs.

Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per 
OSD), but I'll try 2 OSDs per NVMe this week, to see if it helps.
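
For reference, a rough sketch of how those settings can be applied (the
environment file location is distro-dependent, and runtime injection of the
BlueStore cache size may still need an OSD restart to fully apply):

# Transparent hugepages (runtime change, not persistent across reboots)
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

# BlueStore cache at 8 GiB for SSD/NVMe OSDs, via ceph.conf [osd] or injected
ceph tell osd.* injectargs '--bluestore_cache_size_ssd=8589934592'

# Bigger tcmalloc thread cache: set in the daemon environment file
# (/etc/default/ceph on Debian/Ubuntu, /etc/sysconfig/ceph on RHEL/CentOS),
# then restart the OSDs:
#   TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456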


BTW, has anybody already tested Ceph without tcmalloc, with glibc >= 
2.26 (which also has a thread cache)?


Regards,

Alexandre


- Original Message -
From: "aderumier" 
To: "Stefan Priebe, Profihost AG" 
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel" 
Sent: Wednesday 30 January 2019 19:58:15
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

>>Thanks. Is there any reason you monitor op_w_latency but not 
>>op_r_latency but instead op_latency? 
>> 
>>Also why do you monitor op_w_process_latency? but not op_r_process_latency? 

I monitor reads too. (I have all the metrics from the OSD admin sockets, and a lot of graphs.) 

I just don't see a latency difference on reads. (Or it is very, very small compared to 
the write latency increase.) 
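
As a concrete example, this is roughly what pulling the raw write-latency
counters from one OSD admin socket looks like (osd.0 and jq are just for
illustration):

# Dump all perf counters for one OSD over its admin socket
ceph daemon osd.0 perf dump > perf.json

# op_w_latency is a (sum, avgcount) pair; sum/avgcount is the average write
# latency since the daemon started, and taking per-interval derivatives of
# both (as the InfluxDB queries below do) gives the per-interval latency
jq '.osd.op_w_latency' perf.json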



- Original Message - 
From: "Stefan Priebe, Profihost AG"  
To: "aderumier"  
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel"  
Sent: Wednesday 30 January 2019 19:50:20 
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

Hi, 

Am 30.01.19 um 14:59 schrieb Alexandre DERUMIER: 
> Hi Stefan, 
> 
>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>> like suggested. This report makes me a little nervous about my change. 
> Well, I'm really not sure that it's a tcmalloc bug. 
> maybe bluestore related (don't have filestore anymore to compare) 
> I need to compare with bigger latencies 
> 
> here an example, when all osd at 20-50ms before restart, then after restart 
> (at 21:15), 1ms 
> http://odisoweb1.odiso.net/latencybad.png 
> 
> I observe the latency in my guest vm too, on disks iowait. 
> 
> http://odisoweb1.odiso.net/latencybadvm.png 
> 
>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>> exact values out of the daemon do you use for bluestore? 
> 
> here my influxdb queries: 
> 
> It take op_latency.sum/op_latency.avgcount on last second. 
> 
> 
> SELECT non_negative_derivative(first("op_latency.sum"), 
> 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" 
> WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter 
> GROUP BY time($interval), "host", "id" fill(previous) 
> 
> 
> SELECT non_negative_derivative(first("op_w_latency.sum"), 
> 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" 
> WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ 
> AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
> 
> 
> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 
> 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM 
> "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ 
> /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" 
> fill(previous) 

Thanks. Is there any reason you monitor op_w_latency but not 
op_r_latency but instead op_latency? 

Also why do you monitor op_w_process_latency? but not op_r_process_latency? 

greets, 
Stefan 

> 
> 
> 
> 
> 
> - Original Message - 
> From: "Stefan Priebe, Profihost AG"  
> To: "aderumier" , "Sage Weil"  
> Cc: "ceph-users" , "ceph-devel" 
>  
> Sent: Wednesday 30 January 2019 08:45:33 
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> Hi, 
> 
> Am 30.01.19 um 08:33 schrieb Alexandre DERUMIER: 
>> Hi, 
>> 
>> here some new results, 
>> different osd/ different cluster 
>> 
>> before osd restart latency was between 2-5ms 
>> after osd restart is around 1-1.5ms 
>> 
>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>> 
>> From what I see in diff, the biggest difference is in tcmalloc, but maybe 
>> I'm wrong. 
>> (I'm using tcmalloc 2.5-2.2) 
> 
> currently i'm in the process of switching back from jemalloc to tcmalloc 
> like suggested. This report makes me a little nervous about my change. 
> 
> Also i'm currently only monitoring latency for filestore osds. Which 
> exact values out of the daemon do you use for bluestore? 
> 
> I would like to check if i see the same behaviour. 
> 
> Greets, 
> Stefan 
> 
>> 
>> - Original Message - 
>> From: "Sage Weil"  
>> To: "aderumier"  
>> Cc: "ceph-users" , "ceph-devel" 
>>  
>> Sent: Friday 25 January 2019 10:49:02 
>> Subject: Re: ceph osd commit latency increase over time, until restart

Re: [ceph-users] Luminous cluster in very bad state need some assistance.

2019-02-04 Thread Philippe Van Hecke
 ceph pg ls | grep 11.182

11.182   10 25   35 0 25   34648064 1306  
   1306 active+recovery_wait+undersized+degraded 2019-02-04 09:23:26.461468 
 70238'1306 70673:24924  [64] 64  [64] 64  
46843'56759413 2019-01-26 16:31:32.607109  46843'56628962 2019-01-24 
08:56:59.228615

root@storage-node-1-l3:~# ceph pg 11.182 query
{
"state": "active+recovery_wait+undersized+degraded",
"snap_trimq": "[1~b]",
"snap_trimq_len": 11,
"epoch": 70673,
"up": [
64
],
"acting": [
64
],
"actingbackfill": [
"64"
],
"info": {
"pgid": "11.182",
"last_update": "70238'1306",
"last_complete": "46843'56787837",
"log_tail": "0'0",
"last_user_version": 1301,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [],
"history": {
"epoch_created": 54817,
"epoch_pool_created": 278,
"last_epoch_started": 70656,
"last_interval_started": 70655,
"last_epoch_clean": 67924,
"last_interval_clean": 54687,
"last_epoch_split": 54817,
"last_epoch_marked_full": 0,
"same_up_since": 70655,
"same_interval_since": 70655,
"same_primary_since": 70655,
"last_scrub": "46843'56759413",
"last_scrub_stamp": "2019-01-26 16:31:32.607109",
"last_deep_scrub": "46843'56628962",
"last_deep_scrub_stamp": "2019-01-24 08:56:59.228615",
"last_clean_scrub_stamp": "2019-01-26 16:31:32.607109"
},
"stats": {
"version": "70238'1306",
"reported_seq": "24940",
"reported_epoch": "70673",
"state": "active+recovery_wait+undersized+degraded",
"last_fresh": "2019-02-04 09:25:56.966952",
"last_change": "2019-02-04 09:25:56.966952",
"last_active": "2019-02-04 09:25:56.966952",
"last_peered": "2019-02-04 09:25:56.966952",
"last_clean": "0.00",
"last_became_active": "2019-02-04 07:57:08.769839",
"last_became_peered": "2019-02-04 07:57:08.769839",
"last_unstale": "2019-02-04 09:25:56.966952",
"last_undegraded": "2019-02-04 07:57:08.762164",
"last_fullsized": "2019-02-04 07:57:08.761962",
"mapping_epoch": 70655,
"log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 54817,
"last_epoch_clean": 67924,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "46843'56759413",
"last_scrub_stamp": "2019-01-26 16:31:32.607109",
"last_deep_scrub": "46843'56628962",
"last_deep_scrub_stamp": "2019-01-24 08:56:59.228615",
"last_clean_scrub_stamp": "2019-01-26 16:31:32.607109",
"log_size": 1306,
"ondisk_log_size": 1306,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"snaptrimq_len": 11,
"stat_sum": {
"num_bytes": 34648064,
"num_objects": 10,
"num_object_clones": 0,
"num_object_copies": 20,
"num_objects_missing_on_primary": 25,
"num_objects_missing": 0,
"num_objects_degraded": 35,
"num_objects_misplaced": 0,
"num_objects_unfound": 25,
"num_objects_dirty": 10,
"num_whiteouts": 0,
"num_read": 1274,
"num_read_kb": 33808,
"num_write": 1388,
"num_write_kb": 42956,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0
},
"up": [
64
],
"acting": [
64
],
"blocked_by": [],
"up_primary": 64,
"acting_primary": 64

[ceph-users] Kernel requirements for balancer in upmap mode

2019-02-04 Thread Massimo Sgaravatto
The official documentation [*] says that the only requirement to use the
balancer in upmap mode is that all clients must run at least luminous.
But I read somewhere (also in this mailing list) that there are also
requirements wrt the kernel.
If so:

1) Could you please specify what is the minimum required kernel ?
2) Does this kernel requirement apply only to the OSD nodes ? Or also to
the clients ?

The ceph version I am interested in is Luminous

Thanks a lot, Massimo



[*]  http://docs.ceph.com/docs/luminous/mgr/balancer/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous cluster in very bad state need some assistance.

2019-02-04 Thread Sage Weil
On Mon, 4 Feb 2019, Philippe Van Hecke wrote:
> So I restarted the OSD but it stopped after some time. However, this has had an effect on 
> the cluster and the cluster is now in a partial recovery process.
> 
> please find here log file of osd 49 after this restart 
> https://filesender.belnet.be/?s=download&token=8c9c39f2-36f6-43f7-bebb-175679d27a22

It's the same PG 11.182 hitting the same assert when it tries to recover 
to that OSD.  I think the problem will go away once there has been some 
write traffic, but it may be tricky to prevent it from doing any recovery 
until then.
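
For what it's worth, one standard way to hold recovery off temporarily while 
client writes keep flowing is the cluster-wide flags (to be unset again 
afterwards):

ceph osd set norecover
ceph osd set nobackfill

# ...and later, once there has been some write traffic
ceph osd unset norecover
ceph osd unset nobackfill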

I just noticed you pasted the wrong 'pg ls' result before:

> > result of  ceph pg ls | grep 11.118
> >
> > 11.118 9788  00 0   0 40817837568 
> > 1584 1584 active+clean 2019-02-01 
> > 12:48:41.343228  70238'19811673  70493:34596887  [121,24]121  
> > [121,24]121  69295'19811665 2019-02-01 12:48:41.343144  
> > 66131'19810044 2019-01-30 11:44:36.006505

What does 11.182 look like?

We can try something slightly different.  From before it looked like your 
only 'incomplete' pg was 11.ac (ceph pg ls incomplete), and the needed 
state is either on osd.49 or osd.63.  On osd.49, do ceph-objectstore-tool 
--op export on that pg, and then find an otherwise healthy OSD (that 
doesn't have 11.ac), stop it, and ceph-objectstore-tool --op import it 
there.  When you start it up, 11.ac will hopefully peer and recover.  (Or, 
alternatively, osd.63 may have the needed state.)
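
A sketch of that sequence, reusing the same flags as your earlier commands 
(osd.NN is a placeholder for the destination OSD; both daemons must be 
stopped while ceph-objectstore-tool runs against them):

# On the node with osd.49 (daemon stopped), export 11.ac
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-49/ \
    --journal /var/lib/ceph/osd/ceph-49/journal \
    --pgid 11.ac --op export --file /tmp/export-pg/11.ac

# On an otherwise healthy OSD that does not hold 11.ac (daemon stopped), import it
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN/ \
    --journal /var/lib/ceph/osd/ceph-NN/journal \
    --op import --file /tmp/export-pg/11.ac

# Start that OSD again and check whether 11.ac peers
ceph pg ls incomplete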

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous cluster in very bad state need some assistance.

2019-02-04 Thread Philippe Van Hecke
Hi,
It seems that the recovery process stopped and we are back in the same situation as 
before.
I hope that the log can provide more info. Anyway, thanks already for your 
assistance.

Kr

Philippe.


From: Philippe Van Hecke
Sent: 04 February 2019 07:53
To: Sage Weil
Cc: ceph-users@lists.ceph.com; Belnet Services
Subject: Re: [ceph-users] Luminous cluster in very bad state need some 
assistance.

So I restarted the OSD but it stopped after some time. However, this has had an effect on 
the cluster and the cluster is now in a partial recovery process.

please find here log file of osd 49 after this restart
https://filesender.belnet.be/?s=download&token=8c9c39f2-36f6-43f7-bebb-175679d27a22

Kr

Philippe.


From: Philippe Van Hecke
Sent: 04 February 2019 07:42
To: Sage Weil
Cc: ceph-users@lists.ceph.com; Belnet Services
Subject: Re: [ceph-users] Luminous cluster in very bad state need some 
assistance.

root@ls-node-5-lcl:~# ceph-objectstore-tool --data-path 
/var/lib/ceph/osd/ceph-49/ --journal /var/lib/ceph/osd/ceph-49/journal --pgid 
11.182 --op remove --debug --force  2> ceph-objectstore-tool-export-remove.txt
 marking collection for removal
setting '_remove' omap key
finish_remove_pgs 11.182_head removing 11.182
Remove successful

So now I suppose I restart the OSD and see what happens.



From: Sage Weil 
Sent: 04 February 2019 07:37
To: Philippe Van Hecke
Cc: ceph-users@lists.ceph.com; Belnet Services
Subject: Re: [ceph-users] Luminous cluster in very bad state need some 
assistance.

On Mon, 4 Feb 2019, Philippe Van Hecke wrote:
> result of  ceph pg ls | grep 11.118
>
> 11.118 9788  00 0   0 40817837568 
> 1584 1584 active+clean 2019-02-01 
> 12:48:41.343228  70238'19811673  70493:34596887  [121,24]121  
> [121,24]121  69295'19811665 2019-02-01 12:48:41.343144  
> 66131'19810044 2019-01-30 11:44:36.006505
>
> cp done.
>
> So I can run the ceph-objectstore-tool --op remove command?

yep!


>
> 
> From: Sage Weil 
> Sent: 04 February 2019 07:26
> To: Philippe Van Hecke
> Cc: ceph-users@lists.ceph.com; Belnet Services
> Subject: Re: [ceph-users] Luminous cluster in very bad state need some 
> assistance.
>
> On Mon, 4 Feb 2019, Philippe Van Hecke wrote:
> > Hi Sage,
> >
> > I try to make the following.
> >
> > ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-49/ --journal 
> > /var/lib/ceph/osd/ceph-49/journal --pgid 11.182 --op export-remove --debug 
> > --file /tmp/export-pg/18.182 2>ceph-objectstore-tool-export-remove.txt
> > but this rise exception
> >
> > find here  
> > https://filesender.belnet.be/?s=download&token=e2b1fdbc-0739-423f-9d97-0bd258843a33
> >  file ceph-objectstore-tool-export-remove.txt
>
> In that case, cp --preserve=all
> /var/lib/ceph/osd/ceph-49/current/11.182_head to a safe location and then
> use the ceph-objectstore-tool --op remove command.  But first confirm that
> 'ceph pg ls' shows the PG as active.
>
> sage
>
>
>  > > Kr
> >
> > Philippe.
> >
> > 
> > From: Sage Weil 
> > Sent: 04 February 2019 06:59
> > To: Philippe Van Hecke
> > Cc: ceph-users@lists.ceph.com; Belnet Services
> > Subject: Re: [ceph-users] Luminous cluster in very bad state need some 
> > assistance.
> >
> > On Mon, 4 Feb 2019, Philippe Van Hecke wrote:
> > > Hi Sage, First of all thanks for your help
> > >
> > > Please find here  
> > > https://filesender.belnet.be/?s=download&token=dea0edda-5b6a-4284-9ea1-c1fdf88b65e9
> > > the OSD log with debug info for osd.49, and indeed if all the buggy OSDs can 
> > > restart, that may solve the issue.
> > > But I'm also happy that you confirm my understanding that, in the worst case, 
> > > removing the pool can also resolve the problem, even if in that case I lose data  
> > > but end up with a working cluster.
> >
> > If PGs are damaged, removing the pool would be part of getting to
> > HEALTH_OK, but you'd probably also need to remove any problematic PGs that
> > are preventing the OSD from starting.
> >
> > But keep in mind that (1) I see 3 PGs that don't peer spread across pools
> > 11 and 12; not sure which one you are considering deleting.  Also (2) if
> > one pool isn't fully available it generally won't be a problem for other
> > pools, as long as the OSDs start.  And doing ceph-objectstore-tool
> > export-remove is a pretty safe way to move any problem PGs out of the way
> > to get your OSDs starting--just make sure you hold onto that backup/export
> > because you may need it later!
> >
> > > PS: I don't know and don't want to open a debate about top/bottom posting, but 
> > > would like to know the preference of this list :-)
> >
> > No preference :)
> >
> > sage
> >
> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com