[ceph-users] trying to understanding crush more deeply

2017-09-21 Thread Will Zhao
Hi Sage and all:
I am trying to understand CRUSH more deeply. I have tried to read
the code and the paper, and to search the mailing list archives, but I still
have some questions and can't understand it well.
If I have 100 OSDs and I add an OSD, the osdmap changes;
how are the PGs recalculated so that data movement is
minimal? I tried to use crushtool --show-mappings --num-rep 3 --test
-i map, changing the map from 100 OSDs to 101 OSDs and comparing
the results, and it looks like the PG map changed a lot. Shouldn't the
remapping only affect some of the PGs? Or is CRUSH on adding a PG
different from a new osdmap? I know I must be misunderstanding
something. I would appreciate it if you could explain more about the logic
of adding an OSD, or point me to more documentation I can read. Thank you
very much!!! : )
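
(For reference, this is roughly how I compared the two maps; just a sketch,
and the file names and the 0-1023 input range are my own choices:)

# dump mappings for the same set of inputs from the 100-OSD and 101-OSD maps
crushtool -i map.100 --test --show-mappings --num-rep 3 --min-x 0 --max-x 1023 > before.txt
crushtool -i map.101 --test --show-mappings --num-rep 3 --min-x 0 --max-x 1023 > after.txt
# count how many inputs ended up mapped to a different OSD set
diff before.txt after.txt | grep -c '^>'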
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librmb: Mail storage on RADOS with Dovecot

2017-09-21 Thread Adrian Saul

Thanks for bringing this to our attention, Wido - it's of interest to us as we are 
currently looking to migrate mail platforms onto Ceph using NFS, but this seems 
far more practical.


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Wido den Hollander
> Sent: Thursday, 21 September 2017 6:40 PM
> To: ceph-us...@ceph.com
> Subject: [ceph-users] librmb: Mail storage on RADOS with Dovecot
>
> Hi,
>
> A tracker issue has been out there for a while:
> http://tracker.ceph.com/issues/12430
>
> Storing e-mail in RADOS with Dovecot, the IMAP/POP3/LDA server with a
> huge marketshare.
>
> It took a while, but last year Deutsche Telekom took on the heavy work and
> started a project to develop librmb: LibRadosMailBox
>
> Together with Deutsche Telekom and Tallence GmbH (DE) this project came
> to life.
>
> First, the Github link: https://github.com/ceph-dovecot/dovecot-ceph-
> plugin
>
> I am not going to repeat everything which is on Github, but here is a short summary:
>
> - CephFS is used for storing Mailbox Indexes
> - E-Mails are stored directly as RADOS objects
> - It's a Dovecot plugin
>
> We would like everybody to test librmb and report back issues on Github so
> that further development can be done.
>
> It's not finalized yet, but all the help is welcome to make librmb the best
> solution for storing your e-mails on Ceph with Dovecot.
>
> Danny Al-Gaaf has written a small blogpost about it and a presentation:
>
> - https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/
> - http://blog.bisect.de/2017/09/ceph-meetup-berlin-followup-librmb.html
>
> To get an idea of the scale: 4.7PB of RAW storage over 1,200 OSDs is the final
> goal (last slide in the presentation). That will provide roughly 1.2PB of usable
> storage capacity for storing e-mail, a lot of e-mail.
>
> To see this project finally go into the Open Source world excites me a lot :-)
>
> A very, very big thanks to Deutsche Telekom for funding this awesome
> project!
>
> A big thanks as well to Tallence as they did an awesome job in developing
> librmb in such a short time.
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Power outages!!! help!

2017-09-21 Thread hjcho616
Ronny,
Could you help me with this log?  I got this with debug osd=20 filestore=20 
ms=20.  This one is running "ceph pg repair 2.7".  This is one of the smaller 
PGs, so its log was smaller.  Others have similar errors.  I can see the lines 
with ERR, but other than that is there something I should be paying attention 
to? 
https://drive.google.com/file/d/0By7YztAJNGUWNkpCV090dHBmOWc/view?usp=sharing
The error messages look like this:

2017-09-21 23:53:31.545510 7f51682df700 -1 log_channel(cluster) log [ERR] : 2.7 shard 2: soid 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head data_digest 0x62b74a1f != data_digest 0x43d61c5d from auth oi 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head(12962'694 osd.2.0:90545 dirty|data_digest|omap_digest s 4194304 uv 484 dd 43d61c5d od  alloc_hint [0 0])
2017-09-21 23:53:31.545520 7f51682df700 -1 log_channel(cluster) log [ERR] : 2.7 shard 7: soid 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head data_digest 0x62b74a1f != data_digest 0x43d61c5d from auth oi 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head(12962'694 osd.2.0:90545 dirty|data_digest|omap_digest s 4194304 uv 484 dd 43d61c5d od  alloc_hint [0 0])
2017-09-21 23:53:31.545531 7f51682df700 -1 log_channel(cluster) log [ERR] : 2.7 soid 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head: failed to pick suitable auth object
I did try to move that object to a different location as suggested on this 
page: http://ceph.com/geen-categorie/ceph-manually-repair-object/

This is what I ran:

systemctl stop ceph-osd@7
ceph-osd -i 7 --flush-journal
cd /var/lib/ceph/osd/ceph-7
cd current/2.7_head/
mv rb.0.145d.2ae8944a.00bb__head_6F5DBE87__2 ~/
ceph osd tree
systemctl start ceph-osd@7
ceph pg repair 2.7
Then I just get this:

2017-09-22 00:41:06.495399 7f22ac3bd700 -1 log_channel(cluster) log [ERR] : 2.7 shard 2: soid 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head data_digest 0x62b74a1f != data_digest 0x43d61c5d from auth oi 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head(12962'694 osd.2.0:90545 dirty|data_digest|omap_digest s 4194304 uv 484 dd 43d61c5d od  alloc_hint [0 0])
2017-09-22 00:41:06.495417 7f22ac3bd700 -1 log_channel(cluster) log [ERR] : 2.7 shard 7 missing 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head
2017-09-22 00:41:06.495424 7f22ac3bd700 -1 log_channel(cluster) log [ERR] : 2.7 soid 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head: failed to pick suitable auth object
Moving the object from osd.2 results in a similar error message; it just says 
missing for that shard instead. =P

I was hoping for a different result this time, as I let one more OSD copy the 
object from OSD1 by taking osd.7 down with noout set.  But it doesn't appear 
to care about that extra copy.  Maybe that only helps when size is 3?  Basically, 
since I had most OSDs alive on OSD1, I was trying to favor the data from OSD1. =P

What can I do in this case?  According to 
http://ceph.com/geen-categorie/incomplete-pgs-oh-my/ inconsistent data can be 
expected with --skip-journal-replay, and I had to use it as the export crashed 
without it. =P  But it doesn't say much about what to do in that case:

"If all went well, then your cluster is now back to 100% active+clean / HEALTH_OK 
state.  Note that you may still have inconsistent or stale data stored inside the PG. 
This is because the state of the data on the OSD that failed is a bit unknown, 
especially if you had to use the '--skip-journal-replay' option on the export. 
For RBD data, the client which utilizes the RBD should run a filesystem check 
against the RBD."
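
(For what it's worth, my understanding of that last step is something like the
following; just a sketch, and the pool/image names are placeholders for mine:)

rbd map rbd-pool/my-image        # map the image on the client (placeholder names)
fsck -n /dev/rbd0                # read-only check first, then repair if needed
rbd unmap /dev/rbd0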

Regards,
Hong

On Thursday, September 21, 2017 1:46 AM, Ronny Aasen 
 wrote:
 

 On 21. sep. 2017 00:35, hjcho616 wrote:
> # rados list-inconsistent-pg data
> ["0.0","0.5","0.a","0.e","0.1c","0.29","0.2c"]
> # rados list-inconsistent-pg metadata
> ["1.d","1.3d"]
> # rados list-inconsistent-pg rbd
> ["2.7"]
> # rados list-inconsistent-obj 0.0 --format=json-pretty
> {
>      "epoch": 23112,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 0.5 --format=json-pretty
> {
>      "epoch": 23078,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 0.a --format=json-pretty
> {
>      "epoch": 22954,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 0.e --format=json-pretty
> {
>      "epoch": 23068,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 0.1c --format=json-pretty
> {
>      "epoch": 22954,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 0.29 --format=json-pretty
> {
>      "epoch": 22974,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 0.2c --format=json-pretty
> {
>      "epoch": 23194,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 1.d --format=json-pretty
> {
>      "epoch": 23072,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 1.3d --format=json-pretty
> {
>      "epoch": 23221,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 2.7 --format=json-pretty
> {
>      "epoch": 23032,
>      "inconsiste

Re: [ceph-users] librmb: Mail storage on RADOS with Dovecot

2017-09-21 Thread Brad Hubbard
This looks great Wido!


Kudos to all involved.

On Thu, Sep 21, 2017 at 6:40 PM, Wido den Hollander  wrote:
> Hi,
>
> A tracker issue has been out there for a while: 
> http://tracker.ceph.com/issues/12430
>
> Storing e-mail in RADOS with Dovecot, the IMAP/POP3/LDA server with a huge 
> marketshare.
>
> It took a while, but last year Deutsche Telekom took on the heavy work and 
> started a project to develop librmb: LibRadosMailBox
>
> Together with Deutsche Telekom and Tallence GmbH (DE) this project came to 
> life.
>
> First, the Github link: https://github.com/ceph-dovecot/dovecot-ceph-plugin
>
> I am not going to repeat everything which is on Github, but here is a short summary:
>
> - CephFS is used for storing Mailbox Indexes
> - E-Mails are stored directly as RADOS objects
> - It's a Dovecot plugin
>
> We would like everybody to test librmb and report back issues on Github so 
> that further development can be done.
>
> It's not finalized yet, but all the help is welcome to make librmb the best 
> solution for storing your e-mails on Ceph with Dovecot.
>
> Danny Al-Gaaf has written a small blogpost about it and a presentation:
>
> - https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/
> - http://blog.bisect.de/2017/09/ceph-meetup-berlin-followup-librmb.html
>
> To get an idea of the scale: 4.7PB of RAW storage over 1,200 OSDs is the final 
> goal (last slide in the presentation). That will provide roughly 1.2PB of usable 
> storage capacity for storing e-mail, a lot of e-mail.
>
> To see this project finally go into the Open Source world excites me a lot :-)
>
> A very, very big thanks to Deutsche Telekom for funding this awesome project!
>
> A big thanks as well to Tallence as they did an awesome job in developing 
> librmb in such a short time.
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Benjeman Meekhof
Some of this thread seems to contradict the documentation and confuses
me.  Is the statement below correct?

"The BlueStore journal will always be placed on the fastest device
available, so using a DB device will provide the same benefit that the
WAL device would while also allowing additional metadata to be stored
there (if it will fit)."

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices

It seems to be saying that there's no reason to create separate WAL
and DB partitions if they are on the same device: specifying one
large DB partition per OSD will cover both uses.
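
(If I read that right, then when WAL and DB share a device something like the
following should be enough - a sketch only, host and device names are just
examples, and there is no separate --block-wal:)

ceph-deploy osd create --bluestore osd-host:/dev/sdb --block-db /dev/nvme0n1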

thanks,
Ben

On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
 wrote:
> On 09/21/2017 05:03 PM, Mark Nelson wrote:
>>
>>
>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
 On 2017-09-21 07:56, Lazuardi Nasution wrote:

> Hi,
>
> I'm still looking for the answer of these questions. Maybe someone can
> share their thought on these. Any comment will be helpful too.
>
> Best regards,
>
> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> mailto:mrxlazuar...@gmail.com>> wrote:
>
> Hi,
>
> 1. Is it possible configure use osd_data not as small partition on
> OSD but a folder (ex. on root disk)? If yes, how to do that with
> ceph-disk and any pros/cons of doing that?
> 2. Is WAL & DB size calculated based on OSD size or expected
> throughput like on journal device of filestore? If no, what is the
> default value and pro/cons of adjusting that?
> 3. Is partition alignment matter on Bluestore, including WAL & DB
> if using separate device for them?
>
> Best regards,
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 I am also looking for recommendations on wal/db partition sizes. Some
 hints:

 ceph-disk defaults used in case it does not find
 bluestore_block_wal_size or bluestore_block_db_size in config file:

 wal =  512MB

 db = if bluestore_block_size (data size) is in config file it uses 1/100
 of it else it uses 1G.

 There is also a presentation by Sage back in March, see page 16:

 https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in


 wal: 512 MB

 db: "a few" GB

 the wal size is probably not debatable, it will be like a journal for
 small block sizes which are constrained by iops hence 512 MB is more
 than enough. Probably we will see more on the db size in the future.
>>>
>>> This is what I understood so far.
>>> I wonder if it makes sense to set the db size as big as possible and
>>> divide the entire db device by the number of OSDs it will serve.
>>>
>>> E.g. 10 OSDs / 1 NVME (800GB)
>>>
>>>  (800GB - 10x1GB wal) / 10 = ~79GB db size per OSD
>>>
>>> Is this smart/stupid?
>>
>> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
>> amp but mean larger memtables and potentially higher overhead scanning
>> through memtables).  4x256MB buffers works pretty well, but it means
>> memory overhead too.  Beyond that, I'd devote the entire rest of the
>> device to DB partitions.
>>
>
> thanks for your suggestion Mark!
>
> So, just to make sure I understood this right:
>
> You'd use a separate 512MB-2GB WAL partition for each OSD and the
> entire rest for DB partitions.
>
> In the example case with 10 HDD OSDs and 1 NVMe it would then be 10 WAL
> partitions of 512MB-2GB each and 10 equally sized DB partitions
> consuming the rest of the NVMe.
>
>
> Thanks
>   Dietmar
> --
> _
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Dietmar Rieder
On 09/21/2017 05:03 PM, Mark Nelson wrote:
> 
> 
> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>>>
 Hi,

 I'm still looking for the answer of these questions. Maybe someone can
 share their thought on these. Any comment will be helpful too.

 Best regards,

 On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
 mailto:mrxlazuar...@gmail.com>> wrote:

     Hi,

     1. Is it possible configure use osd_data not as small partition on
     OSD but a folder (ex. on root disk)? If yes, how to do that with
     ceph-disk and any pros/cons of doing that?
     2. Is WAL & DB size calculated based on OSD size or expected
     throughput like on journal device of filestore? If no, what is the
     default value and pro/cons of adjusting that?
     3. Is partition alignment matter on Bluestore, including WAL & DB
     if using separate device for them?

     Best regards,


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com 
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> I am also looking for recommendations on wal/db partition sizes. Some
>>> hints:
>>>
>>> ceph-disk defaults used in case it does not find
>>> bluestore_block_wal_size or bluestore_block_db_size in config file:
>>>
>>> wal =  512MB
>>>
>>> db = if bluestore_block_size (data size) is in config file it uses 1/100
>>> of it else it uses 1G.
>>>
>>> There is also a presentation by Sage back in March, see page 16:
>>>
>>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>>>
>>>
>>> wal: 512 MB
>>>
>>> db: "a few" GB
>>>
>>> the wal size is probably not debatable, it will be like a journal for
>>> small block sizes which are constrained by iops hence 512 MB is more
>>> than enough. Probably we will see more on the db size in the future.
>>
>> This is what I understood so far.
>> I wonder if it makes sense to set the db size as big as possible and
>> divide the entire db device by the number of OSDs it will serve.
>>
>> E.g. 10 OSDs / 1 NVME (800GB)
>>
>>  (800GB - 10x1GB wal) / 10 = ~79GB db size per OSD
>>
>> Is this smart/stupid?
> 
> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
> amp but mean larger memtables and potentially higher overhead scanning
> through memtables).  4x256MB buffers works pretty well, but it means
> memory overhead too.  Beyond that, I'd devote the entire rest of the
> device to DB partitions.
> 

thanks for your suggestion Mark!

So, just to make sure I understood this right:

You'd use a separate 512MB-2GB WAL partition for each OSD and the
entire rest for DB partitions.

In the example case with 10 HDD OSDs and 1 NVMe it would then be 10 WAL
partitions of 512MB-2GB each and 10 equally sized DB partitions
consuming the rest of the NVMe.


Thanks
  Dietmar
-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Mark Nelson



On 09/21/2017 03:17 AM, Dietmar Rieder wrote:

On 09/21/2017 09:45 AM, Maged Mokhtar wrote:

On 2017-09-21 07:56, Lazuardi Nasution wrote:


Hi,

I'm still looking for the answer of these questions. Maybe someone can
share their thought on these. Any comment will be helpful too.

Best regards,

On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
mailto:mrxlazuar...@gmail.com>> wrote:

Hi,

1. Is it possible configure use osd_data not as small partition on
OSD but a folder (ex. on root disk)? If yes, how to do that with
ceph-disk and any pros/cons of doing that?
2. Is WAL & DB size calculated based on OSD size or expected
throughput like on journal device of filestore? If no, what is the
default value and pro/cons of adjusting that?
3. Is partition alignment matter on Bluestore, including WAL & DB
if using separate device for them?

Best regards,


___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




I am also looking for recommendations on wal/db partition sizes. Some hints:

ceph-disk defaults used in case it does not find
bluestore_block_wal_size or bluestore_block_db_size in config file:

wal =  512MB

db = if bluestore_block_size (data size) is in config file it uses 1/100
of it else it uses 1G.

There is also a presentation by Sage back in March, see page 16:

https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in

wal: 512 MB

db: "a few" GB

the wal size is probably not debatable, it will be like a journal for
small block sizes which are constrained by iops hence 512 MB is more
than enough. Probably we will see more on the db size in the future.


This is what I understood so far.
I wonder if it makes sense to set the db size as big as possible and
divide the entire db device by the number of OSDs it will serve.

E.g. 10 OSDs / 1 NVME (800GB)

 (800GB - 10x1GB wal) / 10 = ~79GB db size per OSD

Is this smart/stupid?


Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write 
amp but mean larger memtables and potentially higher overhead scanning 
through memtables).  4x256MB buffers works pretty well, but it means 
memory overhead too.  Beyond that, I'd devote the entire rest of the 
device to DB partitions.
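
(If it helps, that ends up looking roughly like this in ceph.conf - as far as I
understand these sizes are only consulted when ceph-disk creates the OSD, and
the numbers below are just an illustration, not a recommendation:)

[osd]
# ~1 GB WAL per OSD
bluestore_block_wal_size = 1073741824
# give the DB everything that is left on the fast device, e.g. ~79 GB per OSD
# when splitting one 800 GB NVMe across 10 OSDs
bluestore_block_db_size = 84825604096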


Mark




Dietmar
 --
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] monitor takes long time to join quorum: STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH got BADAUTHORIZER

2017-09-21 Thread Sean Purdy
On Thu, 21 Sep 2017, Marc Roos said:
>  
> 
> In my case it was syncing, and was syncing slowly (hour or so?). You 
> should see this in the log file. I wanted to report this, because my 
> store.db is only 200MB, and I guess you want your monitors up and 
> running quickly.

Well I wondered about that, but if it can't talk to the monitor quorum leader, 
it's not going to start copying data.

And no new files had been added to this test cluster.

 
> I also noticed that when the 3rd monitor left the quorum, ceph -s 
> command was slow timing out. Probably trying to connect to the 3rd 
> monitor, but why? When this monitor is not in quorum.

There's a setting for client timeouts.  I forget where.
 

Sean
 
 
 
 
 
> -Original Message-
> From: Sean Purdy [mailto:s.pu...@cv-library.co.uk] 
> Sent: donderdag 21 september 2017 12:02
> To: Gregory Farnum
> Cc: ceph-users
> Subject: Re: [ceph-users] monitor takes long time to join quorum: 
> STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH got BADAUTHORIZER
> 
> On Wed, 20 Sep 2017, Gregory Farnum said:
> > That definitely sounds like a time sync issue. Are you *sure* they 
> > matched each other?
> 
> NTP looked OK at the time.  But see below.
> 
> 
> > Is it reproducible on restart?
> 
> Today I did a straight reboot - and it was fine, no issues.
> 
> 
> The issue occurs after the machine is off for a number of hours, or has 
> been worked on in the BIOS for a number of hours and then booted.  And 
> then perhaps waited at the disk decrypt key prompt.
> 
> So I'd suspect hardware clock drift at those times.  (Using Dell R720xd 
> machines)
> 
> 
> Logs show a time change a few seconds after boot.  After boot it's 
> running NTP and within that 45 minute period the NTP state looks the 
> same as the other nodes in the (small) cluster.
> 
> How much drift is allowed between monitors?
> 
> 
> Logs say:
> 
> Sep 20 09:45:21 store03 ntp[2329]: Starting NTP server: ntpd.
> Sep 20 09:45:21 store03 ntpd[2462]: proto: precision = 0.075 usec (-24) 
> ...
> Sep 20 09:46:44 store03 systemd[1]: Time has been changed Sep 20 
> 09:46:44 store03 ntpd[2462]: receive: Unexpected origin timestamp 
> 0xdd6ca972.c694801d does not match aorg 00. from 
> server@172.16.0.16 xmt 0xdd6ca974.0c5c18f
> 
> So system time was changed about 6 seconds after disks were 
> unlocked/boot proceeded.  But there was still 45 minutes of monitor 
> messages after that.  Surely the time should have converged sooner than 
> 45 minutes?
> 
> 
> 
> NTP from today, post-problem.  But ntpq at the time of the problem 
> looked just as OK:
> 
> store01:~$ ntpstat
> synchronised to NTP server (172.16.0.19) at stratum 3
>time correct to within 47 ms
> 
> store02$ ntpstat
> synchronised to NTP server (172.16.0.19) at stratum 3
>time correct to within 63 ms
> 
> store03:~$ sudo ntpstat
> synchronised to NTP server (172.16.0.19) at stratum 3
>time correct to within 63 ms
> 
> store03:~$ ntpq -p
>  remote   refid  st t when poll reach   delay   offset  
> jitter
> 
> ==
> +172.16.0.16 85.91.1.164  3 u  561 1024  3770.2870.554   
> 0.914
> +172.16.0.18 94.125.129.7 3 u  411 1024  3770.388   -0.331   
> 0.139
> *172.16.0.19 158.43.128.332 u  289 1024  3770.282   -0.005   
> 0.103
> 
> 
> Sean
> 
>  
> > On Wed, Sep 20, 2017 at 2:50 AM Sean Purdy  
> wrote:
> > 
> > >
> > > Hi,
> > >
> > >
> > > Luminous 12.2.0
> > >
> > > Three node cluster, 18 OSD, debian stretch.
> > >
> > >
> > > One node is down for maintenance for several hours.  When bringing 
> > > it back up, OSDs rejoin after 5 minutes, but health is still 
> > > warning.  monitor has not joined quorum after 40 minutes and logs 
> > > show BADAUTHORIZER message every time the monitor tries to connect 
> to the leader.
> > >
> > > 2017-09-20 09:46:05.581590 7f49e2b29700  0 -- 172.16.0.45:0/2243 >>
> > > 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1 
> > > s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 
> > > l=0).handle_connect_reply connect got BADAUTHORIZER
> > >
> > > Then after ~45 minutes monitor *does* join quorum.
> > >
> > > I'm presuming this isn't normal behaviour?  Or if it is, let me know 
> 
> > > and I won't worry.
> > >
> > > All three nodes are using ntp and look OK timewise.
> > >
> > >
> > > ceph-mon log:
> > >
> > > (.43 is leader, .45 is rebooted node, .44 is other live node in 
> > > quorum)
> > >
> > > Boot:
> > >
> > > 2017-09-20 09:45:21.874152 7f49efeb8f80  0 ceph version 12.2.0
> > > (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process 
> > > (unknown), pid 2243
> > >
> > > 2017-09-20 09:46:01.824708 7f49e1b27700  0 -- 172.16.0.45:6789/0 >> 
> > > 172.16.0.44:6789/0 conn(0x56007244d000 :6789 
> > > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 
> > > l=0).handle_connect_msg accept connect_seq 3 vs existing csq=0 
> > > existing_state=STATE_CONNECTING 

Re: [ceph-users] monitor takes long time to join quorum: STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH got BADAUTHORIZER

2017-09-21 Thread Marc Roos
 

In my case it was syncing, and it was syncing slowly (an hour or so?). You 
should see this in the log file. I wanted to report this because my 
store.db is only 200MB, and I guess you want your monitors up and 
running quickly.

I also noticed that when the 3rd monitor left the quorum, the ceph -s 
command was slow and timing out. Probably it was trying to connect to the 3rd 
monitor, but why, when this monitor is not in quorum?
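
(A quick way to check this, besides the log: if you have access to the admin
socket on the mon host, the state field shows "synchronizing" while it is
catching up. This assumes the mon id is the short hostname:)

ceph daemon mon.$(hostname -s) mon_status | grep state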






-Original Message-
From: Sean Purdy [mailto:s.pu...@cv-library.co.uk] 
Sent: donderdag 21 september 2017 12:02
To: Gregory Farnum
Cc: ceph-users
Subject: Re: [ceph-users] monitor takes long time to join quorum: 
STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH got BADAUTHORIZER

On Wed, 20 Sep 2017, Gregory Farnum said:
> That definitely sounds like a time sync issue. Are you *sure* they 
> matched each other?

NTP looked OK at the time.  But see below.


> Is it reproducible on restart?

Today I did a straight reboot - and it was fine, no issues.


The issue occurs after the machine is off for a number of hours, or has 
been worked on in the BIOS for a number of hours and then booted.  And 
then perhaps waited at the disk decrypt key prompt.

So I'd suspect hardware clock drift at those times.  (Using Dell R720xd 
machines)


Logs show a time change a few seconds after boot.  After boot it's 
running NTP and within that 45 minute period the NTP state looks the 
same as the other nodes in the (small) cluster.

How much drift is allowed between monitors?


Logs say:

Sep 20 09:45:21 store03 ntp[2329]: Starting NTP server: ntpd.
Sep 20 09:45:21 store03 ntpd[2462]: proto: precision = 0.075 usec (-24) 
...
Sep 20 09:46:44 store03 systemd[1]: Time has been changed Sep 20 
09:46:44 store03 ntpd[2462]: receive: Unexpected origin timestamp 
0xdd6ca972.c694801d does not match aorg 00. from 
server@172.16.0.16 xmt 0xdd6ca974.0c5c18f

So system time was changed about 6 seconds after disks were 
unlocked/boot proceeded.  But there was still 45 minutes of monitor 
messages after that.  Surely the time should have converged sooner than 
45 minutes?



NTP from today, post-problem.  But ntpq at the time of the problem 
looked just as OK:

store01:~$ ntpstat
synchronised to NTP server (172.16.0.19) at stratum 3
   time correct to within 47 ms

store02$ ntpstat
synchronised to NTP server (172.16.0.19) at stratum 3
   time correct to within 63 ms

store03:~$ sudo ntpstat
synchronised to NTP server (172.16.0.19) at stratum 3
   time correct to within 63 ms

store03:~$ ntpq -p
 remote   refid  st t when poll reach   delay   offset  
jitter

==
+172.16.0.16 85.91.1.164  3 u  561 1024  3770.2870.554   
0.914
+172.16.0.18 94.125.129.7 3 u  411 1024  3770.388   -0.331   
0.139
*172.16.0.19 158.43.128.332 u  289 1024  3770.282   -0.005   
0.103


Sean

 
> On Wed, Sep 20, 2017 at 2:50 AM Sean Purdy  
wrote:
> 
> >
> > Hi,
> >
> >
> > Luminous 12.2.0
> >
> > Three node cluster, 18 OSD, debian stretch.
> >
> >
> > One node is down for maintenance for several hours.  When bringing 
> > it back up, OSDs rejoin after 5 minutes, but health is still 
> > warning.  monitor has not joined quorum after 40 minutes and logs 
> > show BADAUTHORIZER message every time the monitor tries to connect 
to the leader.
> >
> > 2017-09-20 09:46:05.581590 7f49e2b29700  0 -- 172.16.0.45:0/2243 >>
> > 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1 
> > s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 
> > l=0).handle_connect_reply connect got BADAUTHORIZER
> >
> > Then after ~45 minutes monitor *does* join quorum.
> >
> > I'm presuming this isn't normal behaviour?  Or if it is, let me know 

> > and I won't worry.
> >
> > All three nodes are using ntp and look OK timewise.
> >
> >
> > ceph-mon log:
> >
> > (.43 is leader, .45 is rebooted node, .44 is other live node in 
> > quorum)
> >
> > Boot:
> >
> > 2017-09-20 09:45:21.874152 7f49efeb8f80  0 ceph version 12.2.0
> > (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process 
> > (unknown), pid 2243
> >
> > 2017-09-20 09:46:01.824708 7f49e1b27700  0 -- 172.16.0.45:6789/0 >> 
> > 172.16.0.44:6789/0 conn(0x56007244d000 :6789 
> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 
> > l=0).handle_connect_msg accept connect_seq 3 vs existing csq=0 
> > existing_state=STATE_CONNECTING 2017-09-20 09:46:01.824723 
> > 7f49e1b27700  0 -- 172.16.0.45:6789/0 >> 172.16.0.44:6789/0 
> > conn(0x56007244d000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH 
> > pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 
> > 3, 0x5600722c.cseq = 0), sending RESETSESSION 2017-09-20 
> > 09:46:01.825247 7f49e1b27700  0 -- 172.16.0.45:6789/0 >> 
> > 172.16.0.44:6789/0 conn(0x56007244d000 :6789 
> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 
> > l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 
> > existing_state=STATE

Re: [ceph-users] luminous vs jewel rbd performance

2017-09-21 Thread Mark Nelson

Hi Rafael,

In the original email you mentioned 4M block size, seq read, but here it 
looks like you are doing 4k writes?  Can you clarify?  If you are doing 
4k direct sequential writes with iodepth=1 and are also using librbd 
cache, please make sure that librbd is set to writeback mode in both 
cases.  RBD by default will not kick into WB mode until it sees a flush 
request, and the librbd engine in fio doesn't issue one before a test is 
started.  It can be pretty easy to end up in a situation where writeback 
cache is active on some tests but not others if you aren't careful.  IE 
If one of your tests was done after a flush and the other was not, you'd 
likely see a dramatic difference in performance during this test.


You can avoid this by telling librbd to always use WB mode (at least 
when benchmarking):


rbd cache writethrough until flush = false
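
(For example, on the fio client this can simply go in ceph.conf; putting it in
the [client] section is an assumption about how you organize client-side
settings:)

[client]
rbd cache = true
rbd cache writethrough until flush = false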

Mark

On 09/20/2017 01:51 AM, Rafael Lopez wrote:

Hi Alexandre,

Yeah we are using filestore for the moment with luminous. With regards
to client, I tried both jewel and luminous librbd versions against the
luminous cluster - similar results.

I am running fio on a physical machine with fio rbd engine. This is a
snippet of the fio config for the runs (the complete jobfile adds
variations of read/write/block size/iodepth).

[global]
ioengine=rbd
clientname=cinder-volume
pool=rbd-bronze
invalidate=1
ramp_time=5
runtime=30
time_based
direct=1

[write-rbd1-4k-depth1]
rbdname=rbd-tester-fio
bs=4k
iodepth=1
rw=write
stonewall

[write-rbd2-4k-depth16]
rbdname=rbd-tester-fio-2
bs=4k
iodepth=16
rw=write
stonewall

Raf

On 20 September 2017 at 16:43, Alexandre DERUMIER mailto:aderum...@odiso.com>> wrote:

Hi

so, you use also filestore on luminous ?

do you have also upgraded librbd on client ? (are you benching
inside a qemu machine ? or directly with fio-rbd ?)



(I'm going to do a lot of benchmarks in coming week, I'll post
results on mailing soon.)



- Mail original -
De: "Rafael Lopez" mailto:rafael.lo...@monash.edu>>
À: "ceph-users" mailto:ceph-users@lists.ceph.com>>
Envoyé: Mercredi 20 Septembre 2017 08:17:23
Objet: [ceph-users] luminous vs jewel rbd performance

hey guys.
wondering if anyone else has done some solid benchmarking of jewel
vs luminous, in particular on the same cluster that has been
upgraded (same cluster, client and config).

we have recently upgraded a cluster from 10.2.9 to 12.2.0, and
unfortunately i only captured results from a single fio (librbd) run
with a few jobs in it before upgrading. i have run the same fio
jobfile many times at different times of the day since upgrading,
and been unable to produce a close match to the pre-upgrade (jewel)
run from the same client. one particular job is significantly slower
(4M block size, iodepth=1, seq read), up to 10x in one run.

i realise i havent supplied much detail and it could be dozens of
things, but i just wanted to see if anyone else had done more
quantitative benchmarking or had similar experiences. keep in mind
all we changed was daemons were restarted to use luminous code,
everything else exactly the same. granted it is possible that
some/all osds had some runtime config injected that differs from
now, but i'm fairly confident this is not the case as they were
recently restarted (on jewel code) after OS upgrades.

cheers,
Raf

___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
*Rafael Lopez*
Research Devops Engineer
Monash University eResearch Centre

T: +61 3 9905 9118 
M: +61 (0)427682670 
E: rafael.lo...@monash.edu 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph mgr dashboard, no socket could be created

2017-09-21 Thread Bryan Banister
I'm not sure what happened, but the dashboard module can no longer start up now:

2017-09-21 09:28:34.646369 7fffef2e6700 -1 mgr got_mgr_map mgrmap module list 
changed to (dashboard), respawn
2017-09-21 09:28:34.646372 7fffef2e6700  1 mgr respawn  e: '/usr/bin/ceph-mgr'
2017-09-21 09:28:34.646374 7fffef2e6700  1 mgr respawn  0: '/usr/bin/ceph-mgr'
2017-09-21 09:28:34.646375 7fffef2e6700  1 mgr respawn  1: '-f'
2017-09-21 09:28:34.646376 7fffef2e6700  1 mgr respawn  2: '--cluster'
2017-09-21 09:28:34.646376 7fffef2e6700  1 mgr respawn  3: 'ceph'
2017-09-21 09:28:34.646377 7fffef2e6700  1 mgr respawn  4: '--id'
2017-09-21 09:28:34.646378 7fffef2e6700  1 mgr respawn  5: 'carf-ceph-osd15'
2017-09-21 09:28:34.646379 7fffef2e6700  1 mgr respawn  6: '--setuser'
2017-09-21 09:28:34.646379 7fffef2e6700  1 mgr respawn  7: 'ceph'
2017-09-21 09:28:34.646380 7fffef2e6700  1 mgr respawn  8: '--setgroup'
2017-09-21 09:28:34.646380 7fffef2e6700  1 mgr respawn  9: 'ceph'
2017-09-21 09:28:34.646398 7fffef2e6700  1 mgr respawn respawning with exe 
/usr/bin/ceph-mgr
2017-09-21 09:28:34.646399 7fffef2e6700  1 mgr respawn  exe_path /proc/self/exe
2017-09-21 09:28:34.670145 77fdd480  0 ceph version 12.1.4 
(a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc), process (unknown), 
pid 11714
2017-09-21 09:28:34.670163 77fdd480  0 pidfile_write: ignore empty 
--pid-file
2017-09-21 09:28:34.689929 77fdd480  1 mgr send_beacon standby
2017-09-21 09:28:35.647600 7fffef2e6700  1 mgr handle_mgr_map Activating!
2017-09-21 09:28:35.647762 7fffef2e6700  1 mgr handle_mgr_map I am now 
activating
2017-09-21 09:28:35.684790 7fffeb2de700  1 mgr init Loading python module 
'dashboard'
2017-09-21 09:28:35.774439 7fffeb2de700  1 mgr load Constructed class from 
module: dashboard
2017-09-21 09:28:35.774448 7fffeb2de700  1 mgr start Creating threads for 1 
modules
2017-09-21 09:28:35.774489 7fffeb2de700  1 mgr send_beacon active
2017-09-21 09:28:35.880124 7fffd3742700 -1 mgr serve dashboard.serve:
2017-09-21 09:28:35.880132 7fffd3742700 -1 mgr serve Traceback (most recent 
call last):
  File "/usr/lib64/ceph/mgr/dashboard/module.py", line 989, in serve
cherrypy.engine.start()
  File "/usr/lib/python2.7/site-packages/cherrypy/process/wspbus.py", line 250, 
in start
raise e_info
ChannelFailures: error('No socket could be created',)

Any ideas what causes this?

Thanks,
-Bryan



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore disk colocation using NVRAM, SSD and SATA

2017-09-21 Thread Mark Nelson



On 09/21/2017 03:19 AM, Maged Mokhtar wrote:

On 2017-09-21 10:01, Dietmar Rieder wrote:


Hi,

I'm in the same situation (NVMEs, SSDs, SAS HDDs). I asked the same
questions to myself.
For now I decided to use the NVMEs as wal and db devices for the SAS
HDDs and on the SSDs I colocate wal and  db.

However, I'm still wonderin how (to what size) and if I should change
the default sizes of wal and db.

Dietmar

On 09/21/2017 01:18 AM, Alejandro Comisario wrote:

But for example, on the same server i have 3 disks technologies to
deploy pools, SSD, SAS and SATA.
The NVME were bought just thinking on the journal for SATA and SAS,
since journals for SSD were colocated.

But now, exactly the same scenario, should i trust the NVME for the SSD
pool ? are there that much of a  gain ? against colocating block.* on
the same SSD?

best.

On Wed, Sep 20, 2017 at 6:36 PM, Nigel Williams
mailto:nigel.willi...@tpac.org.au>
>> wrote:

On 21 September 2017 at 04:53, Maximiliano Venesio
mailto:mass...@nubeliu.com>
>> wrote:

Hi guys i'm reading different documents about bluestore, and it
never recommends to use NVRAM to store the bluefs db,
nevertheless the official documentation says that, is better to
use the faster device to put the block.db in.


​Likely not mentioned since no one yet has had the opportunity to
test it.​

So how do i have to deploy using bluestore, regarding where i
should put block.wal and block.db ?


​block.* would be best on your NVRAM device, like this:

​ceph-deploy osd create --bluestore c0osd-136:/dev/sda --block-wal
/dev/nvme0n1 --block-db /dev/nvme0n1



___
ceph-users mailing list
ceph-users@lists.ceph.com 
>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejan...@nubeliu.com 
>Cell: +54 9
11 3770 1857
_
www.nubeliu.com  



___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




My guess is, for the wal: you are dealing with a 2-step io operation, so if
it is colocated on your SSDs your iops for small writes will be
halved. The decision is: if you add a small NVMe as wal for 4 or 5
(large) SSDs, you will double their iops for small io sizes. This is not
the case for the db.

For wal size:  512 MB is recommended ( ceph-disk default )

For db size: a "few" GB..probably 10GB is a good number. I guess we will
hear more in the future.


There's a pretty good chance that if you are writing out lots of small 
RGW or rados objects you'll blow past 10GB of metadata once rocksdb 
space-amp is factored in.  I can pretty routinely do it when writing out 
millions of rados objects per OSD.  Bluestore will switch to write 
metadata out to the block disk and in this case it might not be that bad 
of a transition (NVMe to SSD).  If you have spare room, you might as 
well give the DB partition whatever you have available on the device.  A 
harder question is how much fast storage to buy for the WAL/DB.  It's 
not straightforward, and rocksdb can be tuned in various ways to favor 
reducing space/write/read amplification, but not all 3 at once.  Right 
now we are likely favoring reducing write-amplification over space/read 
amp, but one could imagine that with a small amount of incredibly fast 
storage it might be better to favor reducing space-amp.


Mark



Maged Mokhtar




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about the Ceph's performance with spdk

2017-09-21 Thread Alejandro Comisario
Bump! I saw this in the Bluestore documentation as well:
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#spdk-usage

Does anyone have any experience with it?
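
(From that page, if I'm reading it right, the core of the setup is pointing
bluestore at the NVMe via its serial number with an "spdk:" prefix; the serial
below is just a placeholder:)

[osd]
bluestore_block_path = spdk:55cd2e404bd73932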

On Thu, Jun 8, 2017 at 2:27 AM, Li,Datong  wrote:

> Hi all,
>
> I’m new in Ceph, and I wonder to know the performance report exactly about
> Ceph’s spdk, but I couldn’t find it. The most thing I want to know is the
> performance improvement before spdk and after.
>
> Thanks,
> Datong Li
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous vs jewel rbd performance

2017-09-21 Thread Alexandre DERUMIER
ok, thanks.

I'll try to do the same benchmarks in the coming week, and I'll keep you posted with the results.


- Mail original -
De: "Rafael Lopez" 
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Mercredi 20 Septembre 2017 08:51:22
Objet: Re: [ceph-users] luminous vs jewel rbd performance

Hi Alexandre, 
Yeah we are using filestore for the moment with luminous. With regards to 
client, I tried both jewel and luminous librbd versions against the luminous 
cluster - similar results. 

I am running fio on a physical machine with fio rbd engine. This is a snippet 
of the fio config for the runs (the complete jobfile adds variations of 
read/write/block size/iodepth). 

[global] 
ioengine=rbd 
clientname=cinder-volume 
pool=rbd-bronze 
invalidate=1 
ramp_time=5 
runtime=30 
time_based 
direct=1 

[write-rbd1-4k-depth1] 
rbdname=rbd-tester-fio 
bs=4k 
iodepth=1 
rw=write 
stonewall 

[write-rbd2-4k-depth16] 
rbdname=rbd-tester-fio-2 
bs=4k 
iodepth=16 
rw=write 
stonewall 

Raf 

On 20 September 2017 at 16:43, Alexandre DERUMIER < [ 
mailto:aderum...@odiso.com | aderum...@odiso.com ] > wrote: 


Hi 

so, you use also filestore on luminous ? 

do you have also upgraded librbd on client ? (are you benching inside a qemu 
machine ? or directly with fio-rbd ?) 



(I'm going to do a lot of benchmarks in coming week, I'll post results on 
mailing soon.) 



- Mail original - 
De: "Rafael Lopez" < [ mailto:rafael.lo...@monash.edu | rafael.lo...@monash.edu 
] > 
À: "ceph-users" < [ mailto:ceph-users@lists.ceph.com | 
ceph-users@lists.ceph.com ] > 
Envoyé: Mercredi 20 Septembre 2017 08:17:23 
Objet: [ceph-users] luminous vs jewel rbd performance 

hey guys. 
wondering if anyone else has done some solid benchmarking of jewel vs luminous, 
in particular on the same cluster that has been upgraded (same cluster, client 
and config). 

we have recently upgraded a cluster from 10.2.9 to 12.2.0, and unfortunately i 
only captured results from a single fio (librbd) run with a few jobs in it 
before upgrading. i have run the same fio jobfile many times at different times 
of the day since upgrading, and been unable to produce a close match to the 
pre-upgrade (jewel) run from the same client. one particular job is 
significantly slower (4M block size, iodepth=1, seq read), up to 10x in one 
run. 

i realise i havent supplied much detail and it could be dozens of things, but i 
just wanted to see if anyone else had done more quantitative benchmarking or 
had similar experiences. keep in mind all we changed was daemons were restarted 
to use luminous code, everything else exactly the same. granted it is possible 
that some/all osds had some runtime config injected that differs from now, but 
i'm fairly confident this is not the case as they were recently restarted (on 
jewel code) after OS upgrades. 

cheers, 
Raf 

___ 
ceph-users mailing list 
[ mailto:ceph-users@lists.ceph.com | ceph-users@lists.ceph.com ] 
[ http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com | 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ] 







-- 
Rafael Lopez 
Research Devops Engineer 
Monash University eResearch Centre 

T: [ tel:%2B61%203%209905%209118 | +61 3 9905 9118 ] 
M: [ tel:%2B61%204%2027682%20670 | +61 (0)427682670 ] 
E: [ mailto:rafael.lo...@monash.edu | rafael.lo...@monash.edu ] 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore disk colocation using NVRAM, SSD and SATA

2017-09-21 Thread Dietmar Rieder
On 09/21/2017 10:19 AM, Maged Mokhtar wrote:
> On 2017-09-21 10:01, Dietmar Rieder wrote:
> 
>> Hi,
>>
>> I'm in the same situation (NVMEs, SSDs, SAS HDDs). I asked the same
>> questions to myself.
>> For now I decided to use the NVMEs as wal and db devices for the SAS
>> HDDs and on the SSDs I colocate wal and  db.
>>
>> However, I'm still wonderin how (to what size) and if I should change
>> the default sizes of wal and db.
>>
>> Dietmar
>>
>> On 09/21/2017 01:18 AM, Alejandro Comisario wrote:
>>> But for example, on the same server i have 3 disks technologies to
>>> deploy pools, SSD, SAS and SATA.
>>> The NVME were bought just thinking on the journal for SATA and SAS,
>>> since journals for SSD were colocated.
>>>
>>> But now, exactly the same scenario, should i trust the NVME for the SSD
>>> pool ? are there that much of a  gain ? against colocating block.* on
>>> the same SSD? 
>>>
>>> best.
>>>
>>> On Wed, Sep 20, 2017 at 6:36 PM, Nigel Williams
>>> mailto:nigel.willi...@tpac.org.au>
>>> >> >> wrote:
>>>
>>> On 21 September 2017 at 04:53, Maximiliano Venesio
>>> mailto:mass...@nubeliu.com>
>>> >> wrote:
>>>
>>> Hi guys i'm reading different documents about bluestore, and it
>>> never recommends to use NVRAM to store the bluefs db,
>>> nevertheless the official documentation says that, is better to
>>> use the faster device to put the block.db in.
>>>
>>>
>>> ​Likely not mentioned since no one yet has had the opportunity to
>>> test it.​
>>>
>>> So how do i have to deploy using bluestore, regarding where i
>>> should put block.wal and block.db ? 
>>>
>>>
>>> ​block.* would be best on your NVRAM device, like this:
>>>
>>> ​ceph-deploy osd create --bluestore c0osd-136:/dev/sda --block-wal
>>> /dev/nvme0n1 --block-db /dev/nvme0n1
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com 
>>> >
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>>>
>>>
>>>
>>>
>>> -- 
>>> *Alejandro Comisario*
>>> *CTO | NUBELIU*
>>> E-mail: alejan...@nubeliu.com 
>>> >Cell: +54 9
>>> 11 3770 1857
>>> _
>>> www.nubeliu.com  
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
>  
> 
> My guess is for wal: you are dealing with a 2 step io operation so in
> case it is collocated on your SSDs your iops for small writes will be
> halfed. The decision is if you add a small NVMEs as wal for 4 or 5
> (large) SSDs, you will double their iops for small io sized. This is not
> the case for db.
> 
> For wal size:  512 MB is recommended ( ceph-disk default )
> 
> For db size: a "few" GB..probably 10GB is a good number. I guess we will
> hear more in the future.
> 

Hi,

you are right, putting the wal/db for the SSDs (we don't have many,
2/node) on the NVMes as well might be good.

Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] s3cmd not working with luminous radosgw

2017-09-21 Thread Yoann Moulin
Hi Matt,

 Does anyone have tested s3cmd or other tools to manage ACL on luminous 
 radosGW ?
>>>
>>> Don't know about ACL, but s3cmd for other things works for me.  Version 
>>> 1.6.1
>>
>> Finally, I found out what happened, I had 2 issues. One, on s3cmd config 
>> file, radosgw with luminous does not support signature v2 anymore, only
>> v4 is supported, I had to add this to my .s3cfg file :
> 
> V4 is supported, but to the best of my knowledge, you can use sigv2 if 
> desired.

Indeed, it seems to work in sigv2 :)

>> The second was in the rgw section into ceph.conf file. The line "rgw dns 
>> name" was missing.
> 
> Depending on your setup, "rgw dns name" may be required, yes.

in my case, it seems to be mandatory

Best regards,

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] s3cmd not working with luminous radosgw

2017-09-21 Thread Yoann Moulin
Hello,

>> Does anyone have tested s3cmd or other tools to manage ACL on luminous 
>> radosGW ?
> 
> Don't know about ACL, but s3cmd for other things works for me.  Version 1.6.1

Finally, I found out what happened; I had 2 issues. The first was in the s3cmd config 
file: radosgw with Luminous does not support signature v2 anymore, only
v4 is supported, so I had to add this to my .s3cfg file :

> signature_v2 = False

The second was in the rgw section of the ceph.conf file: the line "rgw dns name" 
was missing. I have deployed my cluster with ceph-ansible, and it
seems that I need a new option in the all.yml file :

> radosgw_resolve_cname: true # enable for radosgw to resolve DNS CNAME based 
> bucket names

I have added it manually and now it works (ansible-playbook didn't add it, I 
must figure out why).
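
(For the record, the ceph.conf side of it boils down to something like this;
the section name and hostname are just from my setup, adjust to yours:)

[client.rgw.iccluster012]
rgw dns name = test.iccluster.epfl.ch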

Thanks for your help

Best regards,

Yoann Moulin
PS : with config lines, it's better :)

>>> I have a fresh luminous cluster in test and I made a copy of a bucket (4TB 
>>> 1.5M files) with rclone, I'm able to list/copy files with rclone but
>>> s3cmd does not work at all, it is just able to give the bucket list but I 
>>> can't list files neither update ACL.
>>>
>>> does anyone already test this ?
>>>
>>> root@iccluster012:~# rclone --version
>>> rclone v1.37
>>>
>>> root@iccluster012:~# s3cmd --version
>>> s3cmd version 2.0.0
>>>
>>>
>>> ### rclone ls files ###
>>>
>>> root@iccluster012:~# rclone ls testadmin:image-net/LICENSE
>>>  1589 LICENSE
>>> root@iccluster012:~#
>>>
>>> nginx (as revers proxy) log :
>>>
 10.90.37.13 - - [15/Sep/2017:10:30:02 +0200] "HEAD /image-net/LICENSE 
 HTTP/1.1" 200 0 "-" "rclone/v1.37"
 10.90.37.13 - - [15/Sep/2017:10:30:02 +0200] "GET 
 /image-net?delimiter=%2F&max-keys=1024&prefix= HTTP/1.1" 200 779 "-" 
 "rclone/v1.37"
>>>
>>> rgw logs :
>>>
 2017-09-15 10:30:02.620266 7ff1f58f7700  1 == starting new request 
 req=0x7ff1f58f11f0 =
 2017-09-15 10:30:02.622245 7ff1f58f7700  1 == req done 
 req=0x7ff1f58f11f0 op status=0 http_status=200 ==
 2017-09-15 10:30:02.622324 7ff1f58f7700  1 civetweb: 0x56061584b000: 
 127.0.0.1 - - [15/Sep/2017:10:30:02 +0200] "HEAD /image-net/LICENSE 
 HTTP/1.0" 1 0 - rclone/v1.37
 2017-09-15 10:30:02.623361 7ff1f50f6700  1 == starting new request 
 req=0x7ff1f50f01f0 =
 2017-09-15 10:30:02.689632 7ff1f50f6700  1 == req done 
 req=0x7ff1f50f01f0 op status=0 http_status=200 ==
 2017-09-15 10:30:02.689719 7ff1f50f6700  1 civetweb: 0x56061585: 
 127.0.0.1 - - [15/Sep/2017:10:30:02 +0200] "GET 
 /image-net?delimiter=%2F&max-keys=1024&prefix= HTTP/1.0" 1 0 - rclone/v1.37
>>>
>>>
>>>
>>> ### s3cmds ls files ###
>>>
>>> root@iccluster012:~# s3cmd -v -c ~/.s3cfg-test-rgwadmin ls 
>>> s3://image-net/LICENSE
>>> root@iccluster012:~#
>>>
>>> nginx (as revers proxy) log :
>>>
 10.90.37.13 - - [15/Sep/2017:10:30:04 +0200] "GET 
 http://test.iccluster.epfl.ch/image-net/?location HTTP/1.1" 200 127 "-" "-"
 10.90.37.13 - - [15/Sep/2017:10:30:04 +0200] "GET 
 http://image-net.test.iccluster.epfl.ch/?delimiter=%2F&prefix=LICENSE 
 HTTP/1.1" 200 318 "-" "-"
>>>
>>> rgw logs :
>>>
 2017-09-15 10:30:04.295355 7ff1f48f5700  1 == starting new request 
 req=0x7ff1f48ef1f0 =
 2017-09-15 10:30:04.295913 7ff1f48f5700  1 == req done 
 req=0x7ff1f48ef1f0 op status=0 http_status=200 ==
 2017-09-15 10:30:04.295977 7ff1f48f5700  1 civetweb: 0x560615855000: 
 127.0.0.1 - - [15/Sep/2017:10:30:04 +0200] "GET /image-net/?location 
 HTTP/1.0" 1 0 - -
 2017-09-15 10:30:04.299303 7ff1f40f4700  1 == starting new request 
 req=0x7ff1f40ee1f0 =
 2017-09-15 10:30:04.300993 7ff1f40f4700  1 == req done 
 req=0x7ff1f40ee1f0 op status=0 http_status=200 ==
 2017-09-15 10:30:04.301070 7ff1f40f4700  1 civetweb: 0x56061585a000: 
 127.0.0.1 - - [15/Sep/2017:10:30:04 +0200] "GET 
 /?delimiter=%2F&prefix=LICENSE HTTP/1.0" 1 0 - 
>>>
>>>
>>>
>>> ### s3cmd : list bucket ###
>>>
>>> root@iccluster012:~# s3cmd -v -c ~/.s3cfg-test-rgwadmin ls s3://
>>> 2017-08-28 12:27  s3://image-net
>>> root@iccluster012:~#
>>>
>>> nginx (as revers proxy) log :
>>>
 ==> nginx/access.log <==
 10.90.37.13 - - [15/Sep/2017:10:36:10 +0200] "GET 
 http://test.iccluster.epfl.ch/ HTTP/1.1" 200 318 "-" "-"
>>>
>>> rgw logs :
>>>
 2017-09-15 10:36:10.645354 7ff1f38f3700  1 == starting new request 
 req=0x7ff1f38ed1f0 =
 2017-09-15 10:36:10.647419 7ff1f38f3700  1 == req done 
 req=0x7ff1f38ed1f0 op status=0 http_status=200 ==
 2017-09-15 10:36:10.647488 7ff1f38f3700  1 civetweb: 0x56061585f000: 
 127.0.0.1 - - [15/Sep/2017:10:36:10 +0200] "GET / HTTP/1.0" 1 0 - -
>>>
>>>
>>>
>>> ### rclone : list bucket ###
>>>
>>>
>>> root@iccluster012:~# rclone lsd testadmin:
>>>   -1 2017-08-28 12:27:33-1 image-net
>>> root@iccluster012:~#
>>>
>>> nginx (as revers proxy) l

Re: [ceph-users] s3cmd not working with luminous radosgw

2017-09-21 Thread Yoann Moulin
Hello,

>> Does anyone have tested s3cmd or other tools to manage ACL on luminous 
>> radosGW ?
> 
> Don't know about ACL, but s3cmd for other things works for me.  Version 1.6.1

Finally, I found out what happened; I had 2 issues. The first was in the s3cmd config 
file: radosgw with Luminous does not support signature v2 anymore, only
v4 is supported, so I had to add this to my .s3cfg file :

The second was in the rgw section of the ceph.conf file: the line "rgw dns name" 
was missing. I have deployed my cluster with ceph-ansible, and it
seems that I need a new option in the all.yml file :

I have added it manually and now it works (ansible-playbook didn't add it, I 
must figure out why).

Thanks for your help

Best regards,

Yoann Moulin

>>> I have a fresh luminous cluster in test and I made a copy of a bucket (4TB 
>>> 1.5M files) with rclone, I'm able to list/copy files with rclone but
>>> s3cmd does not work at all, it is just able to give the bucket list but I 
>>> can't list files neither update ACL.
>>>
>>> does anyone already test this ?
>>>
>>> root@iccluster012:~# rclone --version
>>> rclone v1.37
>>>
>>> root@iccluster012:~# s3cmd --version
>>> s3cmd version 2.0.0
>>>
>>>
>>> ### rclone ls files ###
>>>
>>> root@iccluster012:~# rclone ls testadmin:image-net/LICENSE
>>>  1589 LICENSE
>>> root@iccluster012:~#
>>>
>>> nginx (as reverse proxy) log :
>>>
 10.90.37.13 - - [15/Sep/2017:10:30:02 +0200] "HEAD /image-net/LICENSE 
 HTTP/1.1" 200 0 "-" "rclone/v1.37"
 10.90.37.13 - - [15/Sep/2017:10:30:02 +0200] "GET 
 /image-net?delimiter=%2F&max-keys=1024&prefix= HTTP/1.1" 200 779 "-" 
 "rclone/v1.37"
>>>
>>> rgw logs :
>>>
 2017-09-15 10:30:02.620266 7ff1f58f7700  1 == starting new request 
 req=0x7ff1f58f11f0 =
 2017-09-15 10:30:02.622245 7ff1f58f7700  1 == req done 
 req=0x7ff1f58f11f0 op status=0 http_status=200 ==
 2017-09-15 10:30:02.622324 7ff1f58f7700  1 civetweb: 0x56061584b000: 
 127.0.0.1 - - [15/Sep/2017:10:30:02 +0200] "HEAD /image-net/LICENSE 
 HTTP/1.0" 1 0 - rclone/v1.37
 2017-09-15 10:30:02.623361 7ff1f50f6700  1 == starting new request 
 req=0x7ff1f50f01f0 =
 2017-09-15 10:30:02.689632 7ff1f50f6700  1 == req done 
 req=0x7ff1f50f01f0 op status=0 http_status=200 ==
 2017-09-15 10:30:02.689719 7ff1f50f6700  1 civetweb: 0x56061585: 
 127.0.0.1 - - [15/Sep/2017:10:30:02 +0200] "GET 
 /image-net?delimiter=%2F&max-keys=1024&prefix= HTTP/1.0" 1 0 - rclone/v1.37
>>>
>>>
>>>
>>> ### s3cmd ls files ###
>>>
>>> root@iccluster012:~# s3cmd -v -c ~/.s3cfg-test-rgwadmin ls 
>>> s3://image-net/LICENSE
>>> root@iccluster012:~#
>>>
>>> nginx (as reverse proxy) log :
>>>
 10.90.37.13 - - [15/Sep/2017:10:30:04 +0200] "GET 
 http://test.iccluster.epfl.ch/image-net/?location HTTP/1.1" 200 127 "-" "-"
 10.90.37.13 - - [15/Sep/2017:10:30:04 +0200] "GET 
 http://image-net.test.iccluster.epfl.ch/?delimiter=%2F&prefix=LICENSE 
 HTTP/1.1" 200 318 "-" "-"
>>>
>>> rgw logs :
>>>
 2017-09-15 10:30:04.295355 7ff1f48f5700  1 == starting new request 
 req=0x7ff1f48ef1f0 =
 2017-09-15 10:30:04.295913 7ff1f48f5700  1 == req done 
 req=0x7ff1f48ef1f0 op status=0 http_status=200 ==
 2017-09-15 10:30:04.295977 7ff1f48f5700  1 civetweb: 0x560615855000: 
 127.0.0.1 - - [15/Sep/2017:10:30:04 +0200] "GET /image-net/?location 
 HTTP/1.0" 1 0 - -
 2017-09-15 10:30:04.299303 7ff1f40f4700  1 == starting new request 
 req=0x7ff1f40ee1f0 =
 2017-09-15 10:30:04.300993 7ff1f40f4700  1 == req done 
 req=0x7ff1f40ee1f0 op status=0 http_status=200 ==
 2017-09-15 10:30:04.301070 7ff1f40f4700  1 civetweb: 0x56061585a000: 
 127.0.0.1 - - [15/Sep/2017:10:30:04 +0200] "GET 
 /?delimiter=%2F&prefix=LICENSE HTTP/1.0" 1 0 - 
>>>
>>>
>>>
>>> ### s3cmd : list bucket ###
>>>
>>> root@iccluster012:~# s3cmd -v -c ~/.s3cfg-test-rgwadmin ls s3://
>>> 2017-08-28 12:27  s3://image-net
>>> root@iccluster012:~#
>>>
>>> nginx (as reverse proxy) log :
>>>
 ==> nginx/access.log <==
 10.90.37.13 - - [15/Sep/2017:10:36:10 +0200] "GET 
 http://test.iccluster.epfl.ch/ HTTP/1.1" 200 318 "-" "-"
>>>
>>> rgw logs :
>>>
 2017-09-15 10:36:10.645354 7ff1f38f3700  1 == starting new request 
 req=0x7ff1f38ed1f0 =
 2017-09-15 10:36:10.647419 7ff1f38f3700  1 == req done 
 req=0x7ff1f38ed1f0 op status=0 http_status=200 ==
 2017-09-15 10:36:10.647488 7ff1f38f3700  1 civetweb: 0x56061585f000: 
 127.0.0.1 - - [15/Sep/2017:10:36:10 +0200] "GET / HTTP/1.0" 1 0 - -
>>>
>>>
>>>
>>> ### rclone : list bucket ###
>>>
>>>
>>> root@iccluster012:~# rclone lsd testadmin:
>>>   -1 2017-08-28 12:27:33-1 image-net
>>> root@iccluster012:~#
>>>
>>> nginx (as reverse proxy) log :
>>>
 ==> nginx/access.log <==
 10.90.37.13 - - [15/Sep/2017:10:37:53 +0200] "GET / HTTP/1.1" 200 318 "-" 
 "rclone/v1.37"
>>>
>>> rgw logs :

Re: [ceph-users] monitor takes long time to join quorum: STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH got BADAUTHORIZER

2017-09-21 Thread Sean Purdy
On Wed, 20 Sep 2017, Gregory Farnum said:
> That definitely sounds like a time sync issue. Are you *sure* they matched
> each other?

NTP looked OK at the time.  But see below.


> Is it reproducible on restart?

Today I did a straight reboot - and it was fine, no issues.


The issue occurs after the machine has been off for a number of hours, or has 
been worked on in the BIOS for a number of hours and then booted, and then 
perhaps has sat waiting at the disk decrypt key prompt.

So I'd suspect hardware clock drift at those times.  (Using Dell R720xd 
machines)


Logs show a time change a few seconds after boot.  After boot it's running NTP 
and within that 45 minute period the NTP state looks the same as the other 
nodes in the (small) cluster.

How much drift is allowed between monitors?
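
(For reference, I believe the relevant knob is mon_clock_drift_allowed, which
defaults to 0.05 s; monitors start warning about clock skew beyond that. A
sketch in case anyone needs to loosen it temporarily, e.g. in ceph.conf:

    [mon]
    mon clock drift allowed = 0.1

though fixing the clock source itself is obviously the better answer.)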


Logs say:

Sep 20 09:45:21 store03 ntp[2329]: Starting NTP server: ntpd.
Sep 20 09:45:21 store03 ntpd[2462]: proto: precision = 0.075 usec (-24)
...
Sep 20 09:46:44 store03 systemd[1]: Time has been changed
Sep 20 09:46:44 store03 ntpd[2462]: receive: Unexpected origin timestamp 
0xdd6ca972.c694801d does not match aorg 00. from 
server@172.16.0.16 xmt 0xdd6ca974.0c5c18f

So system time was changed about 6 seconds after disks were unlocked/boot 
proceeded.  But there was still 45 minutes of monitor messages after that.  
Surely the time should have converged sooner than 45 minutes?



NTP from today, post-problem.  But ntpq at the time of the problem looked just 
as OK:

store01:~$ ntpstat
synchronised to NTP server (172.16.0.19) at stratum 3
   time correct to within 47 ms

store02$ ntpstat
synchronised to NTP server (172.16.0.19) at stratum 3
   time correct to within 63 ms

store03:~$ sudo ntpstat
synchronised to NTP server (172.16.0.19) at stratum 3
   time correct to within 63 ms

store03:~$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+172.16.0.16     85.91.1.164      3 u  561 1024  377    0.287    0.554   0.914
+172.16.0.18     94.125.129.7     3 u  411 1024  377    0.388   -0.331   0.139
*172.16.0.19     158.43.128.33    2 u  289 1024  377    0.282   -0.005   0.103


Sean

 
> On Wed, Sep 20, 2017 at 2:50 AM Sean Purdy  wrote:
> 
> >
> > Hi,
> >
> >
> > Luminous 12.2.0
> >
> > Three node cluster, 18 OSD, debian stretch.
> >
> >
> > One node is down for maintenance for several hours.  When bringing it back
> > up, OSDs rejoin after 5 minutes, but health is still warning.  monitor has
> > not joined quorum after 40 minutes and logs show BADAUTHORIZER message
> > every time the monitor tries to connect to the leader.
> >
> > 2017-09-20 09:46:05.581590 7f49e2b29700  0 -- 172.16.0.45:0/2243 >>
> > 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1
> > s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
> > l=0).handle_connect_reply connect got BADAUTHORIZER
> >
> > Then after ~45 minutes monitor *does* join quorum.
> >
> > I'm presuming this isn't normal behaviour?  Or if it is, let me know and I
> > won't worry.
> >
> > All three nodes are using ntp and look OK timewise.
> >
> >
> > ceph-mon log:
> >
> > (.43 is leader, .45 is rebooted node, .44 is other live node in quorum)
> >
> > Boot:
> >
> > 2017-09-20 09:45:21.874152 7f49efeb8f80  0 ceph version 12.2.0
> > (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process
> > (unknown), pid 2243
> >
> > 2017-09-20 09:46:01.824708 7f49e1b27700  0 -- 172.16.0.45:6789/0 >>
> > 172.16.0.44:6789/0 conn(0x56007244d000 :6789
> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
> > accept connect_seq 3 vs existing csq=0 existing_state=STATE_CONNECTING
> > 2017-09-20 09:46:01.824723 7f49e1b27700  0 -- 172.16.0.45:6789/0 >>
> > 172.16.0.44:6789/0 conn(0x56007244d000 :6789
> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
> > accept we reset (peer sent cseq 3, 0x5600722c.cseq = 0), sending
> > RESETSESSION
> > 2017-09-20 09:46:01.825247 7f49e1b27700  0 -- 172.16.0.45:6789/0 >>
> > 172.16.0.44:6789/0 conn(0x56007244d000 :6789
> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
> > accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
> > 2017-09-20 09:46:01.828053 7f49e1b27700  0 -- 172.16.0.45:6789/0 >>
> > 172.16.0.44:6789/0 conn(0x5600722c :-1
> > s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=21872 cs=1 l=0).process
> > missed message?  skipped from seq 0 to 552717734
> >
> > 2017-09-20 09:46:05.580342 7f49e1b27700  0 -- 172.16.0.45:6789/0 >>
> > 172.16.0.43:6789/0 conn(0x5600720fe800 :-1
> > s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=49261 cs=1 l=0).process
> > missed message?  skipped from seq 0 to 1151972199
> > 2017-09-20 09:46:05.581097 7f49e2b29700  0 -- 172.16.0.45:0/2243 >>
> > 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1
> > s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
> > l=0).handle_connect_reply connect got BADAU

Re: [ceph-users] Bluestore disk colocation using NVRAM, SSD and SATA

2017-09-21 Thread Дробышевский , Владимир
I believe you should use only one big partition in case of one device per
OSD.

And in case of using an additional device(s) for wal/db, the block.db size is
set to 1% of the main partition by default (at least according to the ceph-disk
sources; it simply takes bluestore_block_size, divides it by 100 and uses that
as bluestore_block_db_size if the latter is not set in the config).
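
As a rough illustration of that default (the numbers below are made-up example
values, not anything taken from this thread):

    # ceph.conf (illustrative only)
    [osd]
    bluestore_block_size = 6001175126016    # ~6 TB data partition
    # with bluestore_block_db_size left unset, ceph-disk derives
    #   block.db = bluestore_block_size / 100  -> ~60 GB
    # and falls back to 1 GB if bluestore_block_size is not set either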

The actual size will vary depending on the workload/number of objects. I haven't
found any info on how to check the current db size yet, but I didn't dig into
that much.

Best regards,
Vladimir

2017-09-21 13:01 GMT+05:00 Dietmar Rieder :

> Hi,
>
> I'm in the same situation (NVMEs, SSDs, SAS HDDs). I asked the same
> questions to myself.
> For now I decided to use the NVMEs as wal and db devices for the SAS
> HDDs and on the SSDs I colocate wal and  db.
>
> However, I'm still wondering how (to what size) and if I should change
> the default sizes of wal and db.
>
> Dietmar
>
> On 09/21/2017 01:18 AM, Alejandro Comisario wrote:
> > But for example, on the same server i have 3 disks technologies to
> > deploy pools, SSD, SAS and SATA.
> > The NVME were bought just thinking on the journal for SATA and SAS,
> > since journals for SSD were colocated.
> >
> > But now, exactly the same scenario, should i trust the NVME for the SSD
> > pool ? are there that much of a  gain ? against colocating block.* on
> > the same SSD?
> >
> > best.
> >
> > On Wed, Sep 20, 2017 at 6:36 PM, Nigel Williams
> > mailto:nigel.willi...@tpac.org.au>> wrote:
> >
> > On 21 September 2017 at 04:53, Maximiliano Venesio
> > mailto:mass...@nubeliu.com>> wrote:
> >
> > Hi guys i'm reading different documents about bluestore, and it
> > never recommends to use NVRAM to store the bluefs db,
> > nevertheless the official documentation says that, is better to
> > use the faster device to put the block.db in.
> >
> >
> > ​Likely not mentioned since no one yet has had the opportunity to
> > test it.​
> >
> > So how do i have to deploy using bluestore, regarding where i
> > should put block.wal and block.db ?
> >
> >
> > ​block.* would be best on your NVRAM device, like this:
> >
> > ​ceph-deploy osd create --bluestore c0osd-136:/dev/sda --block-wal
> > /dev/nvme0n1 --block-db /dev/nvme0n1
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> >
> >
> >
> >
> > --
> > *Alejandro Comisario*
> > *CTO | NUBELIU*
> > E-mail: alejan...@nubeliu.com Cell: +54 9
> > 11 3770 1857
> > _
> > www.nubeliu.com 
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> --
> _
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
> Innrain 80, 6020 Innsbruck
> Phone: +43 512 9003 71402
> Fax: +43 512 9003 73100
> Email: dietmar.rie...@i-med.ac.at
> Web:   http://www.icbi.at
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] librmb: Mail storage on RADOS with Dovecot

2017-09-21 Thread Wido den Hollander
Hi,

A tracker issue has been out there for a while: 
http://tracker.ceph.com/issues/12430

Storing e-mail in RADOS with Dovecot, the IMAP/POP3/LDA server with a huge 
marketshare.

It took a while, but last year Deutsche Telekom took on the heavy work and 
started a project to develop librmb: LibRadosMailBox

Together with Deutsche Telekom and Tallence GmbH (DE) this project came to life.

First, the Github link: https://github.com/ceph-dovecot/dovecot-ceph-plugin

I am not going to repeat everything which is on Github, but here is a short summary:

- CephFS is used for storing Mailbox Indexes
- E-Mails are stored directly as RADOS objects
- It's a Dovecot plugin

We would like everybody to test librmb and report back issues on Github so that 
further development can be done.

It's not finalized yet, but all the help is welcome to make librmb the best 
solution for storing your e-mails on Ceph with Dovecot.

Danny Al-Gaaf has written a small blogpost about it and a presentation:

- https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/
- http://blog.bisect.de/2017/09/ceph-meetup-berlin-followup-librmb.html

To get an idea of the scale: 4.7PB of RAW storage over 1,200 OSDs is the final 
goal (last slide in the presentation). That will provide roughly 1.2PB of usable 
storage capacity for storing e-mail, a lot of e-mail.

To see this project finally go into the Open Source world excites me a lot :-)

A very, very big thanks to Deutsche Telekom for funding this awesome project!

A big thanks as well to Tallence as they did an awesome job in developing 
librmb in such a short time.

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD: How many snapshots is too many?

2017-09-21 Thread Florian Haas
On Thu, Sep 21, 2017 at 9:53 AM, Gregory Farnum  wrote:
>> > The other reason we maintain the full set of deleted snaps is to prevent
>> > client operations from re-creating deleted snapshots — we filter all
>> > client IO which includes snaps against the deleted_snaps set in the PG.
>> > Apparently this is also big enough in RAM to be a real (but much
>> > smaller) problem.
>> >
>> > Unfortunately eliminating that is a lot harder
>>
>> Just checking here, for clarification: what is "that" here? Are you
>> saying that eliminating the full set of deleted snaps is harder than
>> introducing a deleting_snaps member, or that both are harder than
>> potential mitigation strategies that were previously discussed in this
>> thread?
>
>
> Eliminating the full set we store on the OSD node is much harder than
> converting the OSDMap to specify deleting_ rather than deleted_snaps — the
> former at minimum requires changes to the client protocol and we’re not
> actually sure how to do it; the latter can be done internally to the cluster
> and has a well-understood algorithm to implement.

Got it. Thanks for the clarification.

>> > This is why I was so insistent on numbers, formulae or even
>> > rules-of-thumb to predict what works and what does not. Greg's "one
>> > snapshot per RBD per day is probably OK" from a few months ago seemed
>> > promising, but looking at your situation it's probably not that useful
>> > a rule.
>>
>> Is there something that you can suggest here, perhaps taking into
>> account the discussion you had with Patrick last week?
>
>
> I think I’ve already shared everything I have on this. Try to treat
> sequential snaps the same way and don’t create a bunch of holes in the
> interval set.

Right. But that's not something the regular Ceph cluster operator has
much influence over.

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore disk colocation using NVRAM, SSD and SATA

2017-09-21 Thread Maged Mokhtar
On 2017-09-21 10:01, Dietmar Rieder wrote:

> Hi,
> 
> I'm in the same situation (NVMEs, SSDs, SAS HDDs). I asked the same
> questions to myself.
> For now I decided to use the NVMEs as wal and db devices for the SAS
> HDDs and on the SSDs I colocate wal and  db.
> 
> However, I'm still wondering how (to what size) and if I should change
> the default sizes of wal and db.
> 
> Dietmar
> 
> On 09/21/2017 01:18 AM, Alejandro Comisario wrote: 
> 
>> But for example, on the same server i have 3 disks technologies to
>> deploy pools, SSD, SAS and SATA.
>> The NVME were bought just thinking on the journal for SATA and SAS,
>> since journals for SSD were colocated.
>> 
>> But now, exactly the same scenario, should i trust the NVME for the SSD
>> pool ? are there that much of a  gain ? against colocating block.* on
>> the same SSD? 
>> 
>> best.
>> 
>> On Wed, Sep 20, 2017 at 6:36 PM, Nigel Williams
>> mailto:nigel.willi...@tpac.org.au>> wrote:
>> 
>> On 21 September 2017 at 04:53, Maximiliano Venesio
>> mailto:mass...@nubeliu.com>> wrote:
>> 
>> Hi guys i'm reading different documents about bluestore, and it
>> never recommends to use NVRAM to store the bluefs db,
>> nevertheless the official documentation says that, is better to
>> use the faster device to put the block.db in.
>> 
>> ​Likely not mentioned since no one yet has had the opportunity to
>> test it.​
>> 
>> So how do i have to deploy using bluestore, regarding where i
>> should put block.wal and block.db ? 
>> 
>> ​block.* would be best on your NVRAM device, like this:
>> 
>> ​ceph-deploy osd create --bluestore c0osd-136:/dev/sda --block-wal
>> /dev/nvme0n1 --block-db /dev/nvme0n1
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> -- 
>> *Alejandro Comisario*
>> *CTO | NUBELIU*
>> E-mail: alejan...@nubeliu.com Cell: +54 9
>> 11 3770 1857
>> _
>> www.nubeliu.com [1] 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

My guess, for the wal: you are dealing with a 2-step io operation, so if it is
colocated on your SSDs your iops for small writes will be halved. The decision
is: if you add a small NVMe as wal for 4 or 5 (large) SSDs, you will double
their iops for small io sizes. This is not the case for the db. 

For the wal size: 512 MB is recommended (the ceph-disk default). 

For the db size: a "few" GB; probably 10 GB is a good number. I guess we will
hear more in the future. 
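
If you want to pin those values explicitly, a minimal ceph.conf sketch (the
10 GB figure is just the guess above, not a tested recommendation):

    [osd]
    bluestore_block_wal_size = 536870912      # 512 MB
    bluestore_block_db_size  = 10737418240    # 10 GB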

Maged Mokhtar 

  

Links:
--
[1] http://www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Dietmar Rieder
On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
> On 2017-09-21 07:56, Lazuardi Nasution wrote:
> 
>> Hi,
>>  
>> I'm still looking for the answer of these questions. Maybe someone can
>> share their thought on these. Any comment will be helpful too.
>>  
>> Best regards,
>>
>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>> mailto:mrxlazuar...@gmail.com>> wrote:
>>
>> Hi,
>>  
>> 1. Is it possible configure use osd_data not as small partition on
>> OSD but a folder (ex. on root disk)? If yes, how to do that with
>> ceph-disk and any pros/cons of doing that?
>> 2. Is WAL & DB size calculated based on OSD size or expected
>> throughput like on journal device of filestore? If no, what is the
>> default value and pro/cons of adjusting that?
>> 3. Is partition alignment matter on Bluestore, including WAL & DB
>> if using separate device for them?
>>  
>> Best regards,
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
>  
> 
> I am also looking for recommendations on wal/db partition sizes. Some hints:
> 
> ceph-disk defaults used in case it does not find
> bluestore_block_wal_size or bluestore_block_db_size in config file:
> 
> wal =  512MB
> 
> db = if bluestore_block_size (data size) is in config file it uses 1/100
> of it else it uses 1G.
> 
> There is also a presentation by Sage back in March, see page 16:
> 
> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
> 
> wal: 512 MB
> 
> db: "a few" GB 
> 
> the wal size is probably not debatable, it will be like a journal for
> small block sizes which are constrained by iops hence 512 MB is more
> than enough. Probably we will see more on the db size in the future.

This is what I understood so far.
I wonder if it makes sense to set the db size as big as possible and to
divide the entire db device by the number of OSDs it will serve.

E.g. 10 OSDs / 1 NVME (800GB)

 (800GB - 10x1GB wal) / 10 = ~79GB db size per OSD

Is this smart/stupid?

Dietmar
 --
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore disk colocation using NVRAM, SSD and SATA

2017-09-21 Thread Graeme Seaton

Hi,

This is the approach I've also taken.

As for sizing, I simply divided the nvme into a partition per HDD and 
colocated the WAL/DB in that partition.  My understanding is that 
Bluestore will simply use the extra space for smaller reads/writes until 
it reaches capacity, at which point it spills over to the HDD.
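
For what it's worth, a minimal sketch of that layout (device name, partition 
size and OSD count are assumptions, adjust to your own hardware):

    # one ~70 GB partition per HDD-backed OSD on a single NVMe (example: 10 OSDs)
    for i in $(seq 1 10); do
        sgdisk --new=0:0:+70G /dev/nvme0n1
    done
    # then point each OSD at its partition, e.g.
    # ceph-deploy osd create --bluestore <host>:/dev/sdb --block-db /dev/nvme0n1p1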


Graeme

On 21/09/17 09:01, Dietmar Rieder wrote:

Hi,

I'm in the same situation (NVMEs, SSDs, SAS HDDs). I asked the same
questions to myself.
For now I decided to use the NVMEs as wal and db devices for the SAS
HDDs and on the SSDs I colocate wal and  db.

However, I'm still wondering how (to what size) and if I should change
the default sizes of wal and db.

Dietmar

On 09/21/2017 01:18 AM, Alejandro Comisario wrote:

But for example, on the same server i have 3 disks technologies to
deploy pools, SSD, SAS and SATA.
The NVME were bought just thinking on the journal for SATA and SAS,
since journals for SSD were colocated.

But now, exactly the same scenario, should i trust the NVME for the SSD
pool ? are there that much of a  gain ? against colocating block.* on
the same SSD?

best.

On Wed, Sep 20, 2017 at 6:36 PM, Nigel Williams
mailto:nigel.willi...@tpac.org.au>> wrote:

 On 21 September 2017 at 04:53, Maximiliano Venesio
 mailto:mass...@nubeliu.com>> wrote:

 Hi guys i'm reading different documents about bluestore, and it
 never recommends to use NVRAM to store the bluefs db,
 nevertheless the official documentation says that, is better to
 use the faster device to put the block.db in.


 ​Likely not mentioned since no one yet has had the opportunity to
 test it.​

 So how do i have to deploy using bluestore, regarding where i
 should put block.wal and block.db ?


 ​block.* would be best on your NVRAM device, like this:

 ​ceph-deploy osd create --bluestore c0osd-136:/dev/sda --block-wal
 /dev/nvme0n1 --block-db /dev/nvme0n1



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com 
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 




--
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejan...@nubeliu.com Cell: +54 9
11 3770 1857
_
www.nubeliu.com 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore disk colocation using NVRAM, SSD and SATA

2017-09-21 Thread Dietmar Rieder
Hi,

I'm in the same situation (NVMEs, SSDs, SAS HDDs). I asked the same
questions to myself.
For now I decided to use the NVMEs as wal and db devices for the SAS
HDDs and on the SSDs I colocate wal and  db.

However, I'm still wondering how (to what size) and if I should change
the default sizes of wal and db.

Dietmar

On 09/21/2017 01:18 AM, Alejandro Comisario wrote:
> But for example, on the same server i have 3 disks technologies to
> deploy pools, SSD, SAS and SATA.
> The NVME were bought just thinking on the journal for SATA and SAS,
> since journals for SSD were colocated.
> 
> But now, exactly the same scenario, should i trust the NVME for the SSD
> pool ? are there that much of a  gain ? against colocating block.* on
> the same SSD? 
> 
> best.
> 
> On Wed, Sep 20, 2017 at 6:36 PM, Nigel Williams
> mailto:nigel.willi...@tpac.org.au>> wrote:
> 
> On 21 September 2017 at 04:53, Maximiliano Venesio
> mailto:mass...@nubeliu.com>> wrote:
> 
> Hi guys i'm reading different documents about bluestore, and it
> never recommends to use NVRAM to store the bluefs db,
> nevertheless the official documentation says that, is better to
> use the faster device to put the block.db in.
> 
> 
> ​Likely not mentioned since no one yet has had the opportunity to
> test it.​
> 
> So how do i have to deploy using bluestore, regarding where i
> should put block.wal and block.db ? 
> 
> 
> ​block.* would be best on your NVRAM device, like this:
> 
> ​ceph-deploy osd create --bluestore c0osd-136:/dev/sda --block-wal
> /dev/nvme0n1 --block-db /dev/nvme0n1
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 
> 
> -- 
> *Alejandro Comisario*
> *CTO | NUBELIU*
> E-mail: alejan...@nubeliu.com Cell: +54 9
> 11 3770 1857
> _
> www.nubeliu.com 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD: How many snapshots is too many?

2017-09-21 Thread Gregory Farnum
On Mon, Sep 18, 2017 at 4:11 AM Florian Haas  wrote:

> On 09/16/2017 01:36 AM, Gregory Farnum wrote:
> > On Mon, Sep 11, 2017 at 1:10 PM Florian Haas  > > wrote:
> >
> > On Mon, Sep 11, 2017 at 8:27 PM, Mclean, Patrick
> > mailto:patrick.mcl...@sony.com>> wrote:
> > >
> > > On 2017-09-08 06:06 PM, Gregory Farnum wrote:
> > > > On Fri, Sep 8, 2017 at 5:47 PM, Mclean, Patrick
> > mailto:patrick.mcl...@sony.com>> wrote:
> > > >
> > > >> On a related note, we are very curious why the snapshot id is
> > > >> incremented when a snapshot is deleted, this creates lots
> > > >> phantom entries in the deleted snapshots set. Interleaved
> > > >> deletions and creations will cause massive fragmentation in
> > > >> the interval set. The only reason we can come up for this
> > > >> is to track if anything changed, but I suspect a different
> > > >> value that doesn't inject entries in to the interval set might
> > > >> be better for this purpose.
> > > > Yes, it's because having a sequence number tied in with the
> > snapshots
> > > > is convenient for doing comparisons. Those aren't leaked snapids
> > that
> > > > will make holes; when we increment the snapid to delete
> something we
> > > > also stick it in the removed_snaps set. (I suppose if you
> alternate
> > > > deleting a snapshot with adding one that does increase the size
> > until
> > > > you delete those snapshots; hrmmm. Another thing to avoid doing I
> > > > guess.)
> > > >
> > >
> > >
> > > Fair enough, though it seems like these limitations of the
> > > snapshot system should be documented.
> >
> > This is why I was so insistent on numbers, formulae or even
> > rules-of-thumb to predict what works and what does not. Greg's "one
> > snapshot per RBD per day is probably OK" from a few months ago seemed
> > promising, but looking at your situation it's probably not that
> useful
> > a rule.
> >
> >
> > > We most likely would
> > > have used a completely different strategy if it was documented
> > > that certain snapshot creation and removal patterns could
> > > cause the cluster to fall over over time.
> >
> > I think right now there are probably very few people, if any, who
> > could *describe* the pattern that causes this. That complicates
> > matters of documentation. :)
> >
> >
> > > >>> It might really just be the osdmap update processing -- that
> would
> > > >>> make me happy as it's a much easier problem to resolve. But
> > I'm also
> > > >>> surprised it's *that* expensive, even at the scales you've
> > described.
> >
> > ^^ This is what I mean. It's kind of tough to document things if
> we're
> > still in "surprised that this is causing harm" territory.
> >
> >
> > > >> That would be nice, but unfortunately all the data is pointing
> > > >> to PGPool::Update(),
> > > > Yes, that's the OSDMap update processing I referred to. This is
> good
> > > > in terms of our ability to remove it without changing client
> > > > interfaces and things.
> > >
> > > That is good to hear, hopefully this stuff can be improved soon
> > > then.
> >
> > Greg, can you comment on just how much potential improvement you see
> > here? Is it more like "oh we know we're doing this one thing horribly
> > inefficiently, but we never thought this would be an issue so we
> shied
> > away from premature optimization, but we can easily reduce 70% CPU
> > utilization to 1%" or rather like "we might be able to improve this
> by
> > perhaps 5%, but 100,000 RBDs is too many if you want to be using
> > snapshotting at all, for the foreseeable future"?
> >
> >
> > I got the chance to discuss this a bit with Patrick at the Open Source
> > Summit Wednesday (good to see you!).
> >
> > So the idea in the previously-referenced CDM talk essentially involves
> > changing the way we distribute snap deletion instructions from a
> > "deleted_snaps" member in the OSDMap to a "deleting_snaps" member that
> > gets trimmed once the OSDs report to the manager that they've finished
> > removing that snapid. This should entirely resolve the CPU burn they're
> > seeing during OSDMap processing on the nodes, as it shrinks the
> > intersection operation down from "all the snaps" to merely "the snaps
> > not-done-deleting".
> >
> > The other reason we maintain the full set of deleted snaps is to prevent
> > client operations from re-creating deleted snapshots — we filter all
> > client IO which includes snaps against the deleted_snaps set in the PG.
> > Apparently this is also big enough in RAM to be a real (but much
> > smaller) problem.
> >
> > Unfortunately eliminating that is a lot harder
>
> Just checking here, for clarification: what is "that" here? Are you
> saying that eliminating the full set of deleted snaps is harder 

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Maged Mokhtar
On 2017-09-21 07:56, Lazuardi Nasution wrote:

> Hi, 
> 
> I'm still looking for the answer of these questions. Maybe someone can share 
> their thought on these. Any comment will be helpful too. 
> 
> Best regards, 
> 
> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution  
> wrote:
> 
>> Hi, 
>> 
>> 1. Is it possible configure use osd_data not as small partition on OSD but a 
>> folder (ex. on root disk)? If yes, how to do that with ceph-disk and any 
>> pros/cons of doing that? 
>> 2. Is WAL & DB size calculated based on OSD size or expected throughput like 
>> on journal device of filestore? If no, what is the default value and 
>> pro/cons of adjusting that? 
>> 3. Is partition alignment matter on Bluestore, including WAL & DB if using 
>> separate device for them? 
>> 
>> Best regards,
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

I am also looking for recommendations on wal/db partition sizes. Some
hints: 

ceph-disk defaults used in case it does not find
bluestore_block_wal_size or bluestore_block_db_size in config file: 

wal =  512MB 

db = if bluestore_block_size (the data size) is in the config file it uses
1/100 of it, else it uses 1 GB. 

There is also a presentation by Sage back in March, see page 16: 

https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in


wal: 512 MB 

db: "a few" GB  

The wal size is probably not debatable: it will act like a journal for
small block sizes, which are constrained by iops, hence 512 MB is more
than enough. Probably we will see more on the db size in the future. 

Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com