Re: [ceph-users] Deep scrub distribution

2017-10-25 Thread Alejandro Comisario
Any comment on this one?
It would be interesting to know what to do in this situation.

On Wed, Jul 5, 2017 at 10:51 PM, Adrian Saul 
wrote:

>
>
> During a recent snafu with a production cluster I disabled scrubbing and
> deep scrubbing in order to reduce load on the cluster while things
> backfilled and settled down.  The PTSD caused by the incident meant I was
> not keen to re-enable it until I was confident we had fixed the root cause
> of the issues (driver issues with a new NIC type introduced with new
> hardware that did not show up until production load hit them).   My cluster
> is using Jewel 10.2.1, and is a mix of SSD and SATA over 20 hosts, 352 OSDs
> in total.
>
>
>
> Fast forward a few weeks and I was ready to re-enable it.  On some reading
> I was concerned the cluster might kick off excessive scrubbing once I unset
> the flags, so I tried increasing the deep scrub interval from 7 days to 60
> days – with most of the last deep scrubs being from over a month before I
> was hoping it would distribute them over the next 30 days.  Having unset
> the flag and carefully watched the cluster it seems to have just run a
> steady catch up without significant impact.  What I am noticing though is
> that the scrubbing is seeming to just run through the full set of PGs, so
> it did some 2280 PGs last night over 6 hours, and so far today in 12 hours
> another 4000 odd.  With 13408 PGs, I am guessing that all this will stop
> some time early tomorrow.
>
>
>
> ceph-glb-fec-01[/var/log]$ sudo ceph pg dump|awk '{print $20}'|grep
> 2017|sort|uniq -c
>
> dumped all in format plain
>
>   5 2017-05-23
>
>  18 2017-05-24
>
>  33 2017-05-25
>
>  52 2017-05-26
>
>  89 2017-05-27
>
> 114 2017-05-28
>
> 144 2017-05-29
>
> 172 2017-05-30
>
> 256 2017-05-31
>
> 191 2017-06-01
>
> 230 2017-06-02
>
> 369 2017-06-03
>
> 606 2017-06-04
>
> 680 2017-06-05
>
> 919 2017-06-06
>
>1261 2017-06-07
>
>1876 2017-06-08
>
>  15 2017-06-09
>
>2280 2017-07-05
>
>4098 2017-07-06
>
>
>
> My concern is: am I now set up to have all 13408 PGs deep scrub again in 60
> days, serially compressed into about 3 days?  I would much rather they were
> distributed over that period.
>
>
>
> Will the OSDs do this distribution themselves now that they have caught up,
> or do I need to, say, create a script that triggers batches of PGs to deep
> scrub over time to push the distribution back out again?
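>
> For reference, a minimal sketch of what such a batch script could look like
> (hypothetical and untested; the awk column numbers assume the same Jewel
> "ceph pg dump" layout used above, and the batch size and scheduling interval
> are arbitrary):
>
> #!/bin/bash
> # Deep scrub the N PGs with the oldest deep-scrub stamps; run from cron
> # every hour or so to spread the load back out over the scrub interval.
> N=20
> ceph pg dump 2>/dev/null \
>   | awk '/^[0-9]+\./ {print $1, $20, $21}' \
>   | sort -k2,3 | head -n "$N" | awk '{print $1}' \
>   | while read pg; do
>       ceph pg deep-scrub "$pg"
>     done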
>
>
>
>
>
>
> *Adrian Saul* | Infrastructure Projects Team Lead
> IT
> T 02 9009 9041 | M +61 402 075 760
> 30 Ross St, Glebe NSW 2037
> adrian.s...@tpgtelecom.com.au | www.tpg.com.au
>
> *TPG Telecom (ASX: TPM)*
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to stop using (unmount) a failed OSD with BlueStore ?

2017-10-17 Thread Alejandro Comisario
I believe you are absolutely right.
It was my fault for not checking the dates before posting, my bad.

Thanks for your help.
best.

On Tue, Oct 17, 2017 at 8:14 PM, Jamie Fargen  wrote:

> Alejandro-
>
> Those are kernel messages indicating that an error was encountered
> when data was sent to the storage device; they are not related directly to
> the operation of Ceph. The messages you sent also appear to have happened 4
> days ago on Friday and if they have subsided then it probably means nothing
> further has tried to read/write to the disk, but the messages will be
> present in dmesg until the kernel ring buffer is overwritten or the system
> is restarted.
>
> -Jamie
>
>
> On Tue, Oct 17, 2017 at 6:47 PM, Alejandro Comisario <
> alejan...@nubeliu.com> wrote:
>
>> Jamie, thanks for replying, info is as follows:
>>
>> 1)
>>
>> [Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 FAILED Result:
>> hostbyte=DID_OK driverbyte=DRIVER_SENSE
>> [Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 Sense Key : Medium
>> Error [current]
>> [Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 Add. Sense: No
>> additional sense information
>> [Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 CDB: Read(10) 28 00
>> 00 00 09 10 00 00 f0 00
>> [Fri Oct 13 10:21:24 2017] blk_update_request: I/O error, dev sdx, sector
>> 2320
>>
>> 2)
>>
>> ndc-cl-mon1:~# ceph status
>>   cluster:
>> id: 48158350-ba8a-420b-9c09-68da57205924
>> health: HEALTH_OK
>>
>>   services:
>> mon: 3 daemons, quorum ndc-cl-mon1,ndc-cl-mon2,ndc-cl-mon3
>> mgr: ndc-cl-mon1(active), standbys: ndc-cl-mon3, ndc-cl-mon2
>> osd: 161 osds: 160 up, 160 in
>>
>>   data:
>> pools:   4 pools, 12288 pgs
>> objects: 663k objects, 2650 GB
>> usage:   9695 GB used, 258 TB / 267 TB avail
>> pgs: 12288 active+clean
>>
>>   io:
>> client:   0 B/s rd, 1248 kB/s wr, 49 op/s rd, 106 op/s wr
>>
>> 3)
>>
>> https://pastebin.com/MeCKqvp1
>>
>>
>> On Tue, Oct 17, 2017 at 5:59 PM, Jamie Fargen  wrote:
>>
>>> Alejandro-
>>> Please provide the following information:
>>> 1) Include an example of an actual message you are seeing in dmesg.
>>> 2) Provide the output of # ceph status
>>> 3) Provide the output of # ceph osd tree
>>>
>>> Regards,
>>> Jamie Fargen
>>>
>>>
>>>
>>> On Tue, Oct 17, 2017 at 4:34 PM, Alejandro Comisario <
>>> alejan...@nubeliu.com> wrote:
>>>
>>>> hi guys, any tip or help ?
>>>>
>>>> On Mon, Oct 16, 2017 at 1:50 PM, Alejandro Comisario <
>>>> alejan...@nubeliu.com> wrote:
>>>>
>>>>> Hi all, i have to hot-swap a failed osd on a Luminous Cluster with
>>>>> Blue store (the disk is SATA, WAL and DB are on NVME).
>>>>>
>>>>> I've issued a:
>>>>> * ceph osd crush reweight osd_id 0
>>>>> * systemctl stop (osd id daemon)
>>>>> * umount /var/lib/ceph/osd/osd_id
>>>>> * ceph osd destroy osd_id
>>>>>
>>>>> everything seems OK, but if I leave everything as is (until I get the
>>>>> replacement disk) I can see that dmesg errors on writes to the device
>>>>> are still appearing.
>>>>>
>>>>> The osd is of course down and out of the crushmap.
>>>>> Am I missing something, like a step to execute, or something else?
>>>>>
>>>>> hoping to get help.
>>>>> best.
>>>>>
>>>>> ​alejandrito
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> *Alejandro Comisario*
>>>> *CTO | NUBELIU*
>>>> E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
>>>> _
>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>
>>>
>>> --
>>> Jamie Fargen
>>> Consultant
>>> jfar...@redhat.com
>>> 813-817-4430 <(813)%20817-4430>
>>>
>>
>>
>>
>> --
>> *Alejandro Comisario*
>> *CTO | NUBELIU*
>> E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
>> _
>>
>
>
>
> --
> Jamie Fargen
> Consultant
> jfar...@redhat.com
> 813-817-4430
>



-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to stop using (unmount) a failed OSD with BlueStore ?

2017-10-17 Thread Alejandro Comisario
Jamie, thanks for replying, info is as follows:

1)

[Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 Sense Key : Medium
Error [current]
[Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 Add. Sense: No
additional sense information
[Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 CDB: Read(10) 28 00 00
00 09 10 00 00 f0 00
[Fri Oct 13 10:21:24 2017] blk_update_request: I/O error, dev sdx, sector
2320

2)

ndc-cl-mon1:~# ceph status
  cluster:
id: 48158350-ba8a-420b-9c09-68da57205924
health: HEALTH_OK

  services:
mon: 3 daemons, quorum ndc-cl-mon1,ndc-cl-mon2,ndc-cl-mon3
mgr: ndc-cl-mon1(active), standbys: ndc-cl-mon3, ndc-cl-mon2
osd: 161 osds: 160 up, 160 in

  data:
pools:   4 pools, 12288 pgs
objects: 663k objects, 2650 GB
usage:   9695 GB used, 258 TB / 267 TB avail
pgs: 12288 active+clean

  io:
client:   0 B/s rd, 1248 kB/s wr, 49 op/s rd, 106 op/s wr

3)

https://pastebin.com/MeCKqvp1


On Tue, Oct 17, 2017 at 5:59 PM, Jamie Fargen  wrote:

> Alejandro-
> Please provide the following information:
> 1) Include an example of an actual message you are seeing in dmesg.
> 2) Provide the output of # ceph status
> 3) Provide the output of # ceph osd tree
>
> Regards,
> Jamie Fargen
>
>
>
> On Tue, Oct 17, 2017 at 4:34 PM, Alejandro Comisario <
> alejan...@nubeliu.com> wrote:
>
>> hi guys, any tip or help ?
>>
>> On Mon, Oct 16, 2017 at 1:50 PM, Alejandro Comisario <
>> alejan...@nubeliu.com> wrote:
>>
>>> Hi all, i have to hot-swap a failed osd on a Luminous Cluster with Blue
>>> store (the disk is SATA, WAL and DB are on NVME).
>>>
>>> I've issued a:
>>> * ceph osd crush reweight osd_id 0
>>> * systemctl stop (osd id daemon)
>>> * umount /var/lib/ceph/osd/osd_id
>>> * ceph osd destroy osd_id
>>>
>>> everything seems OK, but if I leave everything as is (until I get the
>>> replacement disk) I can see that dmesg errors on writes to the device
>>> are still appearing.
>>>
>>> The osd is of course down and out of the crushmap.
>>> Am I missing something, like a step to execute, or something else?
>>>
>>> hoping to get help.
>>> best.
>>>
>>> ​alejandrito
>>>
>>
>>
>>
>> --
>> *Alejandro Comisario*
>> *CTO | NUBELIU*
>> E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
>> _
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Jamie Fargen
> Consultant
> jfar...@redhat.com
> 813-817-4430
>



-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to stop using (unmount) a failed OSD with BlueStore ?

2017-10-17 Thread Alejandro Comisario
hi guys, any tip or help ?

On Mon, Oct 16, 2017 at 1:50 PM, Alejandro Comisario 
wrote:

> Hi all, i have to hot-swap a failed osd on a Luminous Cluster with Blue
> store (the disk is SATA, WAL and DB are on NVME).
>
> I've issued a:
> * ceph osd crush reweight osd_id 0
> * systemctl stop (osd id daemon)
> * umount /var/lib/ceph/osd/osd_id
> * ceph osd destroy osd_id
>
> everything seems OK, but if I leave everything as is (until I get the
> replacement disk) I can see that dmesg errors on writes to the device are
> still appearing.
>
> The osd is of course down and out of the crushmap.
> Am I missing something, like a step to execute, or something else?
>
> hoping to get help.
> best.
>
> ​alejandrito
>



-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to stop using (unmount) a failed OSD with BlueStore ?

2017-10-16 Thread Alejandro Comisario
Hi all, I have to hot-swap a failed OSD on a Luminous cluster with
BlueStore (the disk is SATA; WAL and DB are on NVMe).

I've issued the following:
* ceph osd crush reweight osd_id 0
* systemctl stop (osd id daemon)
* umount /var/lib/ceph/osd/osd_id
* ceph osd destroy osd_id

Everything seems OK, but if I leave everything as is (until I get the
replacement disk) I can see that dmesg errors on writes to the device are
still appearing.

The OSD is of course down and out of the crushmap.
Am I missing something, like a step to execute, or something else?
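
A quick way to double-check that nothing still has the device open, and
whether the dmesg errors are actually new, would be something like the
following (/dev/sdx is just a placeholder for the failed disk):

# recent kernel errors with human-readable timestamps
dmesg -T | grep -i sdx | tail -n 20
# confirm the OSD directory is no longer mounted and nothing holds the device open
grep sdx /proc/mounts
lsof /dev/sdx* 2>/dev/null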

hoping to get help.
best.

​alejandrito
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] right way to recover a failed OSD (disk) when using BlueStore ?

2017-10-11 Thread Alejandro Comisario
David, thanks.
I've switched the branch to Luminous and the doc is the same (thankfully).

No worries, I'll wait until someone who has hopefully done it already can
give me a hint.
thanks!

On Wed, Oct 11, 2017 at 11:00 AM, David Turner 
wrote:

> Careful when you're looking at documentation.  You're looking at the
> master branch which might have unreleased features or changes that your
> release doesn't have.  You'll want to change master in the url to luminous
> to make sure that you're looking at the documentation for your version of
> Ceph.
>
> I haven't personally used bluestore yet so I can't say what the proper
> commands are there without just looking online for the answer.  I do know
> that there is no reason to have your DB and WAL devices on separate
> partitions if they're on the same device.  What's been mentioned on the ML
> is that you want to create a partition for the DB and the WAL will use it.
> A partition for the WAL is only if it is planned to be on a different
> device than the DB.
>
> On Tue, Oct 10, 2017 at 5:59 PM Alejandro Comisario 
> wrote:
>
>> Hi, i see some notes there that didn't exist on jewel :
>>
>> http://docs.ceph.com/docs/master/rados/operations/add-
>> or-rm-osds/#replacing-an-osd
>>
>> In my case what im using right now on that OSD is this :
>>
>> root@ndc-cl-osd4:~# ls -lsah /var/lib/ceph/osd/ceph-104
>> total 64K
>>0 drwxr-xr-x  2 ceph ceph  310 Sep 21 10:56 .
>> 4.0K drwxr-xr-x 25 ceph ceph 4.0K Sep 21 10:56 ..
>>0 lrwxrwxrwx  1 ceph ceph   58 Sep 21 10:30 block ->
>> /dev/disk/by-partuuid/0ffa3ed7-169f-485c-9170-648ce656e9b1
>>0 lrwxrwxrwx  1 ceph ceph   58 Sep 21 10:30 block.db ->
>> /dev/disk/by-partuuid/5873e2cb-3c26-4a7d-8ff1-1bc3e2d62e5a
>>0 lrwxrwxrwx  1 ceph ceph   58 Sep 21 10:30 block.wal ->
>> /dev/disk/by-partuuid/aed9e5e4-c798-46b5-8243-e462e74f6485
>>
>> block.db and block.wal are on two different NVMe partitions, which are
>> nvme1n1p17 and nvme1n1p18, so assuming that after hot-swapping the device
>> the drive letter is "sdx", according to the link above what would be the
>> right command to re-use the two NVMe partitions for block.db and block.wal?
>>
>> I presume that everything else is the same.
>> best.
>>
>>
>> On Sat, Sep 30, 2017 at 9:00 PM, David Turner 
>> wrote:
>>
>>> I'm pretty sure that the process is the same as with filestore. The
>>> cluster doesn't really know if an osd is filestore or bluestore... It's
>>> just an osd running a daemon.
>>>
>>> If there are any differences, they would be in the release notes for
>>> Luminous as changes from Jewel.
>>>
>>> On Sat, Sep 30, 2017, 6:28 PM Alejandro Comisario 
>>> wrote:
>>>
>>>> Hi all.
>>>> Independently of having deployed a ceph Luminous cluster with Bluestore
>>>> using ceph-ansible (https://github.com/ceph/ceph-ansible) what is the
>>>> right way to replace a disk when using Bluestore ?
>>>>
>>>> I will try to forget everything i know on how to recover things with
>>>> filestore and start fresh.
>>>>
>>>> Any how-to's ? experiences ? i dont seem to find an official way of
>>>> doing it.
>>>> best.
>>>>
>>>> --
>>>> *Alejandro Comisario*
>>>> *CTO | NUBELIU*
>>>> E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
>>>> <+54%209%2011%203770-1857>
>>>> _
>>>> www.nubeliu.com
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>
>>
>> --
>> *Alejandro Comisario*
>> *CTO | NUBELIU*
>> E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
>> _
>>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] right way to recover a failed OSD (disk) when using BlueStore ?

2017-10-10 Thread Alejandro Comisario
Hi, I see some notes there that didn't exist on Jewel:

http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd

In my case, what I'm using right now on that OSD is this:

root@ndc-cl-osd4:~# ls -lsah /var/lib/ceph/osd/ceph-104
total 64K
   0 drwxr-xr-x  2 ceph ceph  310 Sep 21 10:56 .
4.0K drwxr-xr-x 25 ceph ceph 4.0K Sep 21 10:56 ..
   0 lrwxrwxrwx  1 ceph ceph   58 Sep 21 10:30 block ->
/dev/disk/by-partuuid/0ffa3ed7-169f-485c-9170-648ce656e9b1
   0 lrwxrwxrwx  1 ceph ceph   58 Sep 21 10:30 block.db ->
/dev/disk/by-partuuid/5873e2cb-3c26-4a7d-8ff1-1bc3e2d62e5a
   0 lrwxrwxrwx  1 ceph ceph   58 Sep 21 10:30 block.wal ->
/dev/disk/by-partuuid/aed9e5e4-c798-46b5-8243-e462e74f6485

block.db and block.wal are on two different NVMe partitions, which are
nvme1n1p17 and nvme1n1p18. So, assuming that after hot-swapping the device
the drive letter is "sdx", according to the link above what would be the
right command to re-use the two NVMe partitions for block.db and block.wal?

I presume that everything else is the same.
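
For reference, a hedged sketch of the kind of command usually suggested for
this, re-using the existing NVMe partitions for block.db and block.wal
("sdx" is just the example drive letter from above; please check the exact
flags against the ceph-disk man page shipped with your Luminous release):

ceph-disk prepare --bluestore /dev/sdx \
    --block.db /dev/nvme1n1p17 \
    --block.wal /dev/nvme1n1p18
ceph-disk activate /dev/sdx1
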
best.


On Sat, Sep 30, 2017 at 9:00 PM, David Turner  wrote:

> I'm pretty sure that the process is the same as with filestore. The
> cluster doesn't really know if an osd is filestore or bluestore... It's
> just an osd running a daemon.
>
> If there are any differences, they would be in the release notes for
> Luminous as changes from Jewel.
>
> On Sat, Sep 30, 2017, 6:28 PM Alejandro Comisario 
> wrote:
>
>> Hi all.
>> Independently of having deployed a ceph Luminous cluster with Bluestore
>> using ceph-ansible (https://github.com/ceph/ceph-ansible) what is the
>> right way to replace a disk when using Bluestore ?
>>
>> I will try to forget everything i know on how to recover things with
>> filestore and start fresh.
>>
>> Any how-to's ? experiences ? i dont seem to find an official way of doing
>> it.
>> best.
>>
>> --
>> *Alejandro Comisario*
>> *CTO | NUBELIU*
>> E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
>> _
>> www.nubeliu.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] right way to recover a failed OSD (disk) when using BlueStore ?

2017-09-30 Thread Alejandro Comisario
Hi all.
Independently of the fact that I've deployed a Ceph Luminous cluster with
BlueStore using ceph-ansible (https://github.com/ceph/ceph-ansible), what is
the right way to replace a disk when using BlueStore?

I will try to forget everything I know about how to recover things with
Filestore and start fresh.

Any how-tos? Experiences? I don't seem to find an official way of doing it.
best.

-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about the Ceph's performance with spdk

2017-09-21 Thread Alejandro Comisario
Bump! I saw this in the documentation for BlueStore as well:
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#spdk-usage

Does anyone have any experience?

On Thu, Jun 8, 2017 at 2:27 AM, Li,Datong  wrote:

> Hi all,
>
> I'm new to Ceph, and I would like to know exactly what the performance
> numbers are for Ceph with SPDK, but I couldn't find a report. The main thing
> I want to know is the performance improvement before SPDK and after.
>
> Thanks,
> Datong Li
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore disk colocation using NVRAM, SSD and SATA

2017-09-20 Thread Alejandro Comisario
But for example, on the same server I have three disk technologies to deploy
pools on: SSD, SAS and SATA.
The NVMe devices were bought just with the journals for SATA and SAS in mind,
since the journals for the SSDs were colocated.

But now, in exactly the same scenario, should I trust the NVMe for the SSD
pool? Is there that much of a gain versus colocating block.* on the
same SSD?
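
For concreteness, the two layouts being weighed look roughly like this with
ceph-deploy (syntax borrowed from Nigel's example below; host and device
names are placeholders):

# SSD OSD with block, DB and WAL all colocated on the SSD itself
ceph-deploy osd create --bluestore node1:/dev/sdb

# SSD OSD with DB and WAL offloaded to the NVMe device
ceph-deploy osd create --bluestore node1:/dev/sdb --block-db /dev/nvme0n1 --block-wal /dev/nvme0n1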

best.

On Wed, Sep 20, 2017 at 6:36 PM, Nigel Williams 
wrote:

> On 21 September 2017 at 04:53, Maximiliano Venesio 
> wrote:
>
>> Hi guys i'm reading different documents about bluestore, and it never
>> recommends to use NVRAM to store the bluefs db, nevertheless the official
>> documentation says that, is better to use the faster device to put the
>> block.db in.
>>
>
> ​Likely not mentioned since no one yet has had the opportunity to test it.​
>
> So how do i have to deploy using bluestore, regarding where i should put
>> block.wal and block.db ?
>>
>
> ​block.* would be best on your NVRAM device, like this:
>
> ​ceph-deploy osd create --bluestore c0osd-136:/dev/sda --block-wal
> /dev/nvme0n1 --block-db /dev/nvme0n1
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore "separate" WAL and DB

2017-09-20 Thread Alejandro Comisario
Bump! I would love to hear thoughts about this!

On Fri, Sep 8, 2017 at 7:44 AM, Richard Hesketh <
richard.hesk...@rd.bbc.co.uk> wrote:

> Hi,
>
> Reading the ceph-users list I'm obviously seeing a lot of people talking
> about using bluestore now that Luminous has been released. I note that many
> users seem to be under the impression that they need separate block devices
> for the bluestore data block, the DB, and the WAL... even when they are
> going to put the DB and the WAL on the same device!
>
> As per the docs at http://docs.ceph.com/docs/master/rados/configuration/
> bluestore-config-ref/ this is nonsense:
>
> > If there is only a small amount of fast storage available (e.g., less
> than a gigabyte), we recommend using it as a WAL device. If there is more,
> provisioning a DB
> > device makes more sense. The BlueStore journal will always be placed on
> the fastest device available, so using a DB device will provide the same
> benefit that the WAL
> > device would while also allowing additional metadata to be stored there
> (if it will fix). [sic, I assume that should be "fit"]
>
> I understand that if you've got three speeds of storage available, there
> may be some sense to dividing these. For instance, if you've got lots of
> HDD, a bit of SSD, and a tiny NVMe available in the same host, data on HDD,
> DB on SSD and WAL on NVMe may be a sensible division of data. That's not
> the case for most of the examples I'm reading; they're talking about
> putting DB and WAL on the same block device, but in different partitions.
> There's even one example of someone suggesting to try partitioning a single
> SSD to put data/DB/WAL all in separate partitions!
>
> Are the docs wrong and/or I am missing something about optimal bluestore
> setup, or do people simply have the wrong end of the stick? I ask because
> I'm just going through switching all my OSDs over to Bluestore now and I've
> just been reusing the partitions I set up for journals on my SSDs as DB
> devices for Bluestore HDDs without specifying anything to do with the WAL,
> and I'd like to know sooner rather than later if I'm making some sort of
> horrible mistake.
>
> Rich
> --
> Richard Hesketh
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Errors connecting cinder-volume to ceph

2017-06-19 Thread Alejandro Comisario
You might want to configure cinder.conf with:

verbose = true
debug = true

and then check /var/log/cinder/cinder-volume.log after a "systemctl restart
cinder-volume" to see the real cause.

best.
alejandrito

On Mon, Jun 19, 2017 at 6:25 PM, T. Nichole Williams 
wrote:

> Hello,
>
> I’m having trouble connecting Ceph to OpenStack Cinder following the guide
> in docs.ceph.com & I can’t figure out what’s wrong. I’ve confirmed auth
> connectivity for both root & ceph users on my openstack controller node,
> but the RBDDriver is not initializing. I’ve dug through every related
> Google article I can find with no results. Any one have any tips?
>
> Here’s output of a few sample errors, auth list from controller, &
> cinder-manage config list. Please let me know if you need any further info
> from my side.
> https://gist.githubusercontent.com/OGtrilliams/
> ed7642358a113ab7d908f4240427ad2e/raw/282de9ce1756670fe8c3071d1613e3
> c64d6e5b2f/cinder-conf
>
> T. Nichole Williams
> tribe...@tribecc.us
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG that should not be on undersized+degraded on multi datacenter Ceph cluster

2017-06-08 Thread Alejandro Comisario
Hi Brad.
Taking into consideration the unlikely possibility that someone
figures out what the problem is in this specific case, that would be
highly appreciated.

I presume that, since I'm running Jewel, if you can somehow remediate this it
will be something that I will not be able to get on this deployment, right?

best.

On Thu, Jun 8, 2017 at 2:20 AM, Brad Hubbard  wrote:
> On Thu, Jun 8, 2017 at 2:59 PM, Alejandro Comisario
>  wrote:
>> ha!
>> is there ANY way of knowing when this peering maximum has been reached for a
>> PG?
>
> Not currently AFAICT.
>
> It takes place deep in this c code that is shared between the kernel
> and userspace implementations.
>
> https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L444
>
> Whilst the kernel implementation generates some output the userspace
> code does not. I'm looking at how that situation can be improved.
>
>>
>> On Jun 7, 2017 20:21, "Brad Hubbard"  wrote:
>>>
>>> On Wed, Jun 7, 2017 at 5:13 PM, Peter Maloney
>>>  wrote:
>>>
>>> >
>>> > Now if only there was a log or warning seen in ceph -s that said the
>>> > tries was exceeded,
>>>
>>> Challenge accepted.
>>>
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> --
>>> Cheers,
>>> Brad
>
>
>
> --
> Cheers,
> Brad



-- 
Alejandro Comisario
CTO | NUBELIU
E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG that should not be on undersized+degraded on multi datacenter Ceph cluster

2017-06-07 Thread Alejandro Comisario
ha!
is there ANY way of knowing when this peering maximum has been reached for
a PG?

On Jun 7, 2017 20:21, "Brad Hubbard"  wrote:

> On Wed, Jun 7, 2017 at 5:13 PM, Peter Maloney
>  wrote:
>
> >
> > Now if only there was a log or warning seen in ceph -s that said the
> > tries was exceeded,
>
> Challenge accepted.
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Cheers,
> Brad
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG that should not be on undersized+degraded on multi datacenter Ceph cluster

2017-06-07 Thread Alejandro Comisario
Peter, hi ... what happened to me is exactly what happened to you,
thanks so much for pointing that out!

I'm amazed at how you realized that was the problem!
Maybe that will help me troubleshoot a little more like a pro.
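
For the archive, a condensed version of the fix Peter describes below (raise
the "choose_total_tries" tunable in the CRUSH map):

ceph osd getcrushmap -o crushmap
crushtool -d crushmap -o crushmap.txt
# edit crushmap.txt: raise "tunable choose_total_tries" (default 50) to e.g. 100-200
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
# if anything goes wrong, the old map can be restored with:
# ceph osd setcrushmap -i crushmap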

best.

On Wed, Jun 7, 2017 at 5:06 PM, Alejandro Comisario
 wrote:
> Peter, hi.
> thanks for the reply, let me check that out, and get back to you
>
> On Wed, Jun 7, 2017 at 4:13 AM, Peter Maloney
>  wrote:
>> On 06/06/17 19:23, Alejandro Comisario wrote:
>>> Hi all, i have a multi datacenter 6 nodes (6 osd) ceph jewel cluster.
>>> There are 3 pools in the cluster, all three with size 3 and min_size 2.
>>>
>>> Today, i shut down all three nodes (controlled and in order) on
>>> datacenter "CPD2" just to validate that everything keeps working on
>>> "CPD1", whitch did (including rebalance of the infromation).
>>>
>>> After everything was off on CPD2, the "osd tree" looks like this,
>>> whitch seems ok.
>>>
>>> root@oskceph01:~# ceph osd tree
>>> ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>> -1 30.0 root default
>>> -8 15.0 datacenter CPD1
>>> -2  5.0 host oskceph01
>>>  0  5.0 osd.0   up  1.0  1.0
>>> -6  5.0 host oskceph05
>>>  4  5.0 osd.4   up  1.0  1.0
>>> -4  5.0 host oskceph03
>>>  2  5.0 osd.2   up  1.0  1.0
>>> -9 15.0 datacenter CPD2
>>> -3  5.0 host oskceph02
>>>  1  5.0 osd.1 down0  1.0
>>> -5  5.0 host oskceph04
>>>  3  5.0 osd.3 down0  1.0
>>> -7  5.0 host oskceph06
>>>  5  5.0 osd.5 down0  1.0
>>>
>>> ...
>>>
>>> root@oskceph01:~# ceph pg dump | egrep degrad
>>> dumped all in format plain
>>> 8.1b3 178 0 178 0 0 1786814 3078 3078 active+undersized+degraded
>>> 2017-06-06 13:11:46.130567 2361'250952 2361:248472 [0,2] 0 [0,2] 0
>>> 1889'249956 2017-06-06 04:11:52.736214 1889'242115 2017-06-03
>>> 19:07:06.615674
>>>
>>> For some extrange reason, i see that the acting set is [0,2] i dont
>>> see osd.4 on the acting set, and honestly, i dont know why.
>>>
>>> ...
>> I'm assuming you have failure domain as host, not datacenter? (otherwise
>> you'd never get 0,2 ... and size 3 could never work either)
>>
>> So then it looks like a problem I had and solved this week... I had 60
>> osds with 19 down to be replaced, and one pg out of 1152 wouldn't peer.
>> Randomly I realized what was wrong... there's a "tunable
>> choose_total_tries" you can increase so the pgs that tried to find an
>> osd that many times and failed will try more:
>>
>>> ceph osd getcrushmap -o crushmap
>>> crushtool -d crushmap -o crushmap.txt
>>> vim crushmap.txt
>>> here you change tunable choose_total_tries higher... default is
>>> 50. 100 worked for me the first time, and then later I changed it
>>> again to 200.
>>> crushtool -c crushmap.txt -o crushmap.new
>>> ceph osd setcrushmap -i crushmap.new
>>
>> if anything goes wrong with the new crushmap, you can always set the old
>> again:
>>> ceph osd setcrushmap -i crushmap
>>
>> Then you have to wait some time, maybe 30s before you have pgs peering.
>>
>> Now if only there was a log or warning seen in ceph -s that said the
>> tries was exceeded, then this solution would be more obvious (and we
>> would know whether it applies to you)
>>
>
>
>
> --
> Alejandro Comisario
> CTO | NUBELIU
> E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
> _
> www.nubeliu.com



-- 
Alejandro Comisario
CTO | NUBELIU
E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG that should not be on undersized+degraded on multi datacenter Ceph cluster

2017-06-07 Thread Alejandro Comisario
Peter, hi.
thanks for the reply, let me check that out, and get back to you

On Wed, Jun 7, 2017 at 4:13 AM, Peter Maloney
 wrote:
> On 06/06/17 19:23, Alejandro Comisario wrote:
>> Hi all, i have a multi datacenter 6 nodes (6 osd) ceph jewel cluster.
>> There are 3 pools in the cluster, all three with size 3 and min_size 2.
>>
>> Today, i shut down all three nodes (controlled and in order) on
>> datacenter "CPD2" just to validate that everything keeps working on
>> "CPD1", whitch did (including rebalance of the infromation).
>>
>> After everything was off on CPD2, the "osd tree" looks like this,
>> whitch seems ok.
>>
>> root@oskceph01:~# ceph osd tree
>> ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 30.0 root default
>> -8 15.0 datacenter CPD1
>> -2  5.0 host oskceph01
>>  0  5.0 osd.0   up  1.0  1.0
>> -6  5.0 host oskceph05
>>  4  5.0 osd.4   up  1.0  1.0
>> -4  5.0 host oskceph03
>>  2  5.0 osd.2   up  1.0  1.0
>> -9 15.0 datacenter CPD2
>> -3  5.0 host oskceph02
>>  1  5.0 osd.1 down0  1.0
>> -5  5.0 host oskceph04
>>  3  5.0 osd.3 down0  1.0
>> -7  5.0 host oskceph06
>>  5  5.0 osd.5 down0  1.0
>>
>> ...
>>
>> root@oskceph01:~# ceph pg dump | egrep degrad
>> dumped all in format plain
>> 8.1b3 178 0 178 0 0 1786814 3078 3078 active+undersized+degraded
>> 2017-06-06 13:11:46.130567 2361'250952 2361:248472 [0,2] 0 [0,2] 0
>> 1889'249956 2017-06-06 04:11:52.736214 1889'242115 2017-06-03
>> 19:07:06.615674
>>
>> For some extrange reason, i see that the acting set is [0,2] i dont
>> see osd.4 on the acting set, and honestly, i dont know why.
>>
>> ...
> I'm assuming you have failure domain as host, not datacenter? (otherwise
> you'd never get 0,2 ... and size 3 could never work either)
>
> So then it looks like a problem I had and solved this week... I had 60
> osds with 19 down to be replaced, and one pg out of 1152 wouldn't peer.
> Randomly I realized what was wrong... there's a "tunable
> choose_total_tries" you can increase so the pgs that tried to find an
> osd that many times and failed will try more:
>
>> ceph osd getcrushmap -o crushmap
>> crushtool -d crushmap -o crushmap.txt
>> vim crushmap.txt
>> here you change tunable choose_total_tries higher... default is
>> 50. 100 worked for me the first time, and then later I changed it
>> again to 200.
>> crushtool -c crushmap.txt -o crushmap.new
>> ceph osd setcrushmap -i crushmap.new
>
> if anything goes wrong with the new crushmap, you can always set the old
> again:
>> ceph osd setcrushmap -i crushmap
>
> Then you have to wait some time, maybe 30s before you have pgs peering.
>
> Now if only there was a log or warning seen in ceph -s that said the
> tries was exceeded, then this solution would be more obvious (and we
> would know whether it applies to you)
>



-- 
Alejandro Comisario
CTO | NUBELIU
E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG that should not be on undersized+degraded on multi datacenter Ceph cluster

2017-06-06 Thread Alejandro Comisario
Hi all, I have a multi-datacenter, 6-node (6 OSD) Ceph Jewel cluster.
There are 3 pools in the cluster, all three with size 3 and min_size 2.

Today, I shut down all three nodes (controlled and in order) in
datacenter "CPD2" just to validate that everything keeps working on
"CPD1", which it did (including rebalancing of the information).

After everything was off on CPD2, the "osd tree" looks like this,
which seems OK.

root@oskceph01:~# ceph osd tree
ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 30.0 root default
-8 15.0 datacenter CPD1
-2  5.0 host oskceph01
 0  5.0 osd.0   up  1.0  1.0
-6  5.0 host oskceph05
 4  5.0 osd.4   up  1.0  1.0
-4  5.0 host oskceph03
 2  5.0 osd.2   up  1.0  1.0
-9 15.0 datacenter CPD2
-3  5.0 host oskceph02
 1  5.0 osd.1 down0  1.0
-5  5.0 host oskceph04
 3  5.0 osd.3 down0  1.0
-7  5.0 host oskceph06
 5  5.0 osd.5 down0  1.0

Meaning that all PGs should have OSDs 0, 2 and 4 as their acting set.
But "ceph health detail" shows me this weird PG in undersized+degraded
state, as follows:

root@oskceph01:~# ceph health detail
HEALTH_WARN 1 pgs degraded; 1 pgs stuck unclean; 1 pgs undersized;
recovery 178/310287 objects degraded (0.057%); too many PGs per OSD
(1835 > max 300)
pg 8.1b3 is stuck unclean for 7735.364142, current state
active+undersized+degraded, last acting [0,2]
pg 8.1b3 is active+undersized+degraded, acting [0,2]
recovery 178/310287 objects degraded (0.057%)

the "pg dump" command shows as follow.

root@oskceph01:~# ceph pg dump | egrep degrad
dumped all in format plain
8.1b3 178 0 178 0 0 1786814 3078 3078 active+undersized+degraded
2017-06-06 13:11:46.130567 2361'250952 2361:248472 [0,2] 0 [0,2] 0
1889'249956 2017-06-06 04:11:52.736214 1889'242115 2017-06-03
19:07:06.615674

For some strange reason, I see that the acting set is [0,2]; I don't
see osd.4 in the acting set, and honestly, I don't know why.

I tried "pg repair" with no luck, and I don't know what the right way is
to fix/understand what's going on.
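
For anyone debugging something similar, a hedged starting point is to query
the PG directly and look at its peering/recovery state, and to dump the CRUSH
rules the pool uses:

ceph pg 8.1b3 query
ceph osd crush rule dump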


thanks!
-- 
Alejandrito
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing replica size of a running pool

2017-05-05 Thread Alejandro Comisario
Thanks David!
Anyone else? More thoughts?
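
For reference, the size change itself is a single command per pool; the
discussion below is about whether to do it and about the data movement it
triggers ("volumes" is a placeholder pool name):

ceph osd pool set volumes size 3
ceph osd pool set volumes min_size 2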

On Wed, May 3, 2017 at 3:38 PM, David Turner  wrote:

> Those are both things that people have done and both work.  Neither is
> optimal, but both options work fine.  The best option is to definitely just
> get a third node now as you aren't going to be getting it for additional
> space from it later.  Your usable space between a 2 node size 2 cluster and
> a 3 node size 3 cluster is identical.
>
> If getting a third node is not possible, I would recommend a size 2
> min_size 2 configuration.  You will block writes if either of your nodes or
> any copy of your data is down, but you will not get into an inconsistent
> state that can happen with min_size of 1 (and you can always set the
> min_size of a pool to 1 on the fly to perform maintenance).  If you go with
> the option to use the failure domain of OSDs instead of hosts and have size
> 3, then a single node going down will block writes into your cluster.  The
> only you gain from this is having 3 physical copies of the data until you
> get a third node, but a lot of backfilling when you change the crush rule.
>
> A more complex option that I think would be a better solution than your 2
> options would be to create 2 hosts in your crush map for each physical host
> and split the OSDs in each host evenly between them.  That way you can have
> 2 copies of data in a given node, but never all 3 copies.  You have your 3
> copies of data and guaranteed that not all 3 are on the same host.
> Assuming min_size of 2, you will still block writes if you restart either
> node.
>
> If modifying the hosts in your crush map doesn't sound daunting, then I
> would recommend going that route... For most people that is more complex
> than they'd like to go and I would say size 2 min_size 2 would be the way
> to go until you get a third node.  #my2cents
>
> On Wed, May 3, 2017 at 12:41 PM Maximiliano Venesio 
> wrote:
>
>> Guys hi.
>>
>> I have a Jewel Cluster composed by two storage servers which are
>> configured on
>> the crush map as different buckets to store data.
>>
>> I've to configure two new pools on this cluster with the certainty
>> that i'll have to add more servers in a short term.
>>
>> Taking into account that the recommended replication size for every
>> pool is 3, i'm thinking in two possible scenarios.
>>
>> 1) Set the replica size in 2 now, and in the future change the replica
>> size to 3 on a running pool.
>> Is that possible? Can i have serious issues with the rebalance of the
>> pgs, changing the pool size on the fly ?
>>
>> 2) Set the replica size to 3, and change the ruleset to replicate by
>> OSD instead of HOST now, and in the future change this rule in the
>> ruleset to replicate again by host in a running pool.
>> Is that possible? Can i have serious issues with the rebalance of the
>> pgs, changing the ruleset in a running pool ?
>>
>> Which do you think is the best option ?
>>
>>
>> Thanks in advanced.
>>
>>
>> Maximiliano Venesio
>> Chief Cloud Architect | NUBELIU
>> E-mail: massimo@nubeliu.comCell: +54 9 11 3770 1853
>> <+54%209%2011%203770-1853>
>> _
>> www.nubeliu.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Client's read affinity

2017-04-05 Thread Alejandro Comisario
Another thing that I would love to ask and clarify: would this work
for OpenStack VMs that use Cinder, instead of VMs that use direct
integration between Nova and Ceph?
We use Cinder bootable volumes and normal Cinder-attached volumes on the VMs.
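
For context, the client-side options Jason and Wes discuss below would live
in the [client] section of ceph.conf on the compute / cinder-volume hosts; a
hedged sketch, with the CRUSH location values as placeholders:

[client]
rbd localize parent reads = true
crush location = rack=rack1 host=compute01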

thx

On Wed, Apr 5, 2017 at 10:36 AM, Wes Dillingham
 wrote:
> This is a big development for us. I have not heard of this option either. I
> am excited to play with this feature and the implications it may have in
> improving RBD reads in our multi-datacenter RBD pools.
>
> Just to clarify the following options:
> "rbd localize parent reads = true" and "crush location = foo=bar" are
> configuration options for the client's ceph.conf and are not needed for OSD
> hosts as their locations are already encoded in the CRUSH map.
>
> It looks like this is a pretty old option (
> http://narkive.com/ZkTahBVu:5.455.67 )
>
> so I am assuming it is relatively tried and true? but I have never heard of
> it before... is anyone out there using this in a production RBD environment?
>
>
>
>
> On Tue, Apr 4, 2017 at 7:36 PM, Jason Dillaman  wrote:
>>
>> AFAIK, the OSDs should discover their location in the CRUSH map
>> automatically -- therefore, this "crush location" config override
>> would be used for librbd client configuration ("i.e. [client]
>> section") to describe their location in the CRUSH map relative to
>> racks, hosts, etc.
>>
>> On Tue, Apr 4, 2017 at 3:12 PM, Brian Andrus 
>> wrote:
>> > Jason, I haven't heard much about this feature.
>> >
>> > Will the localization have effect if the crush location configuration is
>> > set
>> > in the [osd] section, or does it need to apply globally for clients as
>> > well?
>> >
>> > On Fri, Mar 31, 2017 at 6:38 AM, Jason Dillaman 
>> > wrote:
>> >>
>> >> Assuming you are asking about RBD-back VMs, it is not possible to
>> >> localize the all reads to the VM image. You can, however, enable
>> >> localization of the parent image since that is a read-only data set.
>> >> To enable that feature, set "rbd localize parent reads = true" and
>> >> populate the "crush location = host=X rack=Y etc=Z" in your ceph.conf.
>> >>
>> >> On Fri, Mar 31, 2017 at 9:00 AM, Alejandro Comisario
>> >>  wrote:
>> >> > any experiences ?
>> >> >
>> >> > On Wed, Mar 29, 2017 at 2:02 PM, Alejandro Comisario
>> >> >  wrote:
>> >> >> Guys hi.
>> >> >> I have a Jewel Cluster divided into two racks which is configured on
>> >> >> the crush map.
>> >> >> I have clients (openstack compute nodes) that are closer from one
>> >> >> rack
>> >> >> than to another.
>> >> >>
>> >> >> I would love to (if is possible) to specify in some way the clients
>> >> >> to
>> >> >> read first from the nodes on a specific rack then try the other one
>> >> >> if
>> >> >> is not possible.
>> >> >>
>> >> >> Is that doable ? can somebody explain me how to do it ?
>> >> >> best.
>> >> >>
>> >> >> --
>> >> >> Alejandrito
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Alejandro Comisario
>> >> > CTO | NUBELIU
>> >> > E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
>> >> > _
>> >> > www.nubeliu.com
>> >> > ___
>> >> > ceph-users mailing list
>> >> > ceph-users@lists.ceph.com
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >>
>> >>
>> >> --
>> >> Jason
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> >
>> >
>> >
>> > --
>> > Brian Andrus | Cloud Systems Engineer | DreamHost
>> > brian.and...@dreamhost.com | www.dreamhost.com
>>
>>
>>
>> --
>> Jason
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
> Respectfully,
>
> Wes Dillingham
> wes_dilling...@harvard.edu
> Research Computing | Infrastructure Engineer
> Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Alejandrito
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Client's read affinity

2017-03-31 Thread Alejandro Comisario
any experiences ?

On Wed, Mar 29, 2017 at 2:02 PM, Alejandro Comisario
 wrote:
> Guys hi.
> I have a Jewel Cluster divided into two racks which is configured on
> the crush map.
> I have clients (openstack compute nodes) that are closer from one rack
> than to another.
>
> I would love to (if is possible) to specify in some way the clients to
> read first from the nodes on a specific rack then try the other one if
> is not possible.
>
> Is that doable ? can somebody explain me how to do it ?
> best.
>
> --
> Alejandrito



-- 
Alejandro Comisario
CTO | NUBELIU
E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Client's read affinity

2017-03-29 Thread Alejandro Comisario
Guys, hi.
I have a Jewel cluster divided into two racks, which is configured in
the CRUSH map.
I have clients (OpenStack compute nodes) that are closer to one rack
than to the other.

I would love to (if it is possible) specify in some way that the clients
read first from the nodes in a specific rack, and then try the other one if
that is not possible.

Is that doable? Can somebody explain to me how to do it?
best.

-- 
Alejandrito
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to think a two different disk's technologies architecture

2017-03-24 Thread Alejandro Comisario
Thanks for the recommendations so far.
Anyone with more experiences or thoughts?
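
For the archive, a minimal sketch of the kind of "crush location hook" Maxime
mentions below (untested; the hook interface and the hardcoded OSD-id mapping
are assumptions to adapt per node):

#!/bin/sh
# Hypothetical "osd crush location hook" (pointed to from ceph.conf) that
# reports each OSD into a per-media host bucket such as "server1-ssd" or
# "server1-hdd", so SSD and HDD pools can use separate CRUSH roots/rules.
OSD_ID=""
while [ $# -ge 1 ]; do
    case "$1" in
        --id) shift; OSD_ID="$1" ;;
    esac
    shift
done
HOST=$(hostname -s)
case "$OSD_ID" in
    [0-9]|1[0-1]) echo "host=${HOST}-ssd root=default" ;;   # OSDs 0-11: SSDs
    *)            echo "host=${HOST}-hdd root=default" ;;   # everything else: HDDs
esac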

best

On Mar 23, 2017 16:36, "Maxime Guyot"  wrote:

> Hi Alexandro,
>
> As I understand you are planning NVMe for Journal for SATA HDD and
> collocated journal for SATA SSD?
>
> Option 1:
> - 24x SATA SSDs per server, will have a bottleneck with the storage
> bus/controller.  Also, I would consider the network capacity 24xSSDs will
> deliver more performance than 24xHDD with journal, but you have the same
> network capacity on both types of nodes.
> - This option is a little easier to implement: just move nodes in
> different CRUSHmap root
> - Failure of a server (assuming size = 3) will impact all PGs
> Option 2:
> - You may have noisy neighbors effect between HDDs and SSDs, if HDDs are
> able to saturate your NICs or storage controller. So be mindful of this
> with the hardware design
> - To configure the CRUSHmap for this you need to split each server in 2, I
> usually use “server1-hdd” and “server1-ssd” and map the right OSD in the
> right bucket, so a little extra work here but you can easily fix a “crush
> location hook” script for it (see example http://www.root314.com/2017/
> 01/15/Ceph-storage-tiers/)
> - In case of a server failure recovery will be faster than option 1 and
> will impact less PGs
>
> Some general notes:
> - SSD pools perform better with higher frequency CPUs
> - the 1GB of RAM per TB is a little outdated, the current consensus for
> HDD OSDs is around 2GB/OSD (see https://www.redhat.com/cms/
> managed-files/st-rhcs-config-guide-technology-detail-
> inc0387897-201604-en.pdf)
> - Network wise, if the SSD OSDs are rated for 500MB/s and use collocated
> journal you could generate up to 250MB/s of traffic per SSD OSD (24Gbps for
> 12x or 48Gbps for 24x) therefore I would consider doing 4x10G and
> consolidate both client and cluster network on that
>
> Cheers,
> Maxime
>
> On 23/03/17 18:55, "ceph-users on behalf of Alejandro Comisario" <
> ceph-users-boun...@lists.ceph.com on behalf of alejan...@nubeliu.com>
> wrote:
>
> Hi everyone!
> I have to install a ceph cluster (6 nodes) with two "flavors" of
> disks, 3 servers with SSD and 3 servers with SATA.
>
    I will purchase 24-disk servers (the SATA ones with NVMe SSD for
    the SATA journal)
> Processors will be 2 x E5-2620v4 with HT, and ram will be 20GB for the
> OS, and 1.3GB of ram per storage TB.
>
> The servers will have 2 x 10Gb bonding for public network and 2 x 10Gb
> for cluster network.
> My doubts resides, ar want to ask the community about experiences and
> pains and gains of choosing between.
>
> Option 1
> 3 x servers just for SSD
    3 x servers just for SATA
>
> Option 2
> 6 x servers with 12 SSD and 12 SATA each
>
> Regarding crushmap configuration and rules everything is clear to make
> sure that two pools (poolSSD and poolSATA) uses the right disks.
>
> But, what about performance, maintenance, architecture scalability,
> etc ?
>
> thank you very much !
>
> --
> Alejandrito
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to think a two different disk's technologies architecture

2017-03-23 Thread Alejandro Comisario
Hi everyone!
I have to install a Ceph cluster (6 nodes) with two "flavors" of
disks: 3 servers with SSD and 3 servers with SATA.

I will purchase 24-disk servers (the SATA ones with NVMe SSD for
the SATA journals).
Processors will be 2 x E5-2620v4 with HT, and RAM will be 20GB for the
OS plus 1.3GB of RAM per TB of storage.

The servers will have 2 x 10Gb bonded for the public network and 2 x 10Gb
for the cluster network.
My doubt is, and I want to ask the community about, the experiences,
pains and gains of choosing between:

Option 1
3 x servers just for SSD
3 x servers just for SATA

Option 2
6 x servers with 12 SSD and 12 SATA each

Regarding crushmap configuration and rules, everything is clear on how to
make sure that the two pools (poolSSD and poolSATA) use the right disks.

But what about performance, maintenance, architecture scalability, etc.?

thank you very much !

-- 
Alejandrito
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can a OSD affect performance from pool X when blocking/slow requests PGs from pool Y ?

2017-03-23 Thread Alejandro Comisario
Definitely in our case the OSDs were not the guilty ones, since all the OSDs
that were blocking requests (always from the same pool) kept working
flawlessly (and still do) after we deleted the pool where we always saw the
blocked PGs.

Since the pool was accessed by just one client, and had almost no ops to it,
I really don't know how to reproduce the issue, but it surely scares me that
it could happen again, above all considering that blocked IOPS on
one OSD can cascade through the whole cluster and block all other pools.

It was technically hard to explain to management that one 1.5GB pool
locked up almost 250 VMs from different TB-sized pools and, worst of all,
without a root cause (meaning, why only that pool generated blocked IOPS).

I hope to hear some more technical insights, or from someone else who went
through the same.
best.

On Thu, Mar 23, 2017 at 5:47 AM, Peter Maloney <
peter.malo...@brockmann-consult.de> wrote:

> I think Greg (who appears to be a ceph committer) basically said he was
> interested in looking at it, if only you had the pool that failed this way.
>
> Why not try to reproduce it, and make a log of your procedure so he can
> reproduce it too? What caused the slow requests... copy on write from
> snapshots? A bad disk? exclusive-lock with 2 clients writing at the same
> time maybe?
>
> I'd be interested in a solution too... like why can't idle disks (non-full
> disk queue) mean that the osd op or whatever queue can still fill with
> requests not related to the blocked pg/objects? I would love for ceph to
> handle this better. I suspect some issues I have are related to this (slow
> requests on one VM can freeze others [likely blame the osd], even requiring
> kill -9 [likely blame client librbd]).
>
> On 03/22/17 16:18, Alejandro Comisario wrote:
>
> any thoughts ?
>
> On Tue, Mar 14, 2017 at 10:22 PM, Alejandro Comisario <
> alejan...@nubeliu.com> wrote:
>
>> Greg, thanks for the reply.
>> True that i cant provide enough information to know what happened since
>> the pool is gone.
>>
>> But based on your experience, can i please take some of your time, and
>> give me the TOP 5 fo what could happen / would be the reason to happen what
>> hapened to that pool (or any pool) that makes Ceph (maybe hapened
>> specifically in Hammer ) to behave like that ?
>>
>> Information that i think will be of value, is that the cluster was 5
>> nodes large, running "0.94.6-1trusty" i added two nodes running the latest
>> "0.94.9-1trusty" and replication into those new disks never ended, since i
>> saw WEIRD errors on the new OSDs, so i thought that packages needed to be
>> the same, so i "apt-get upgraded" the 5 old nodes without restrting
>> nothing, so rebalancing started to happen without errors (WEIRD).
>>
>> after these two nodes reached 100% of the disks weight, the cluster
>> worked perfectly for about two weeks, till this happened.
>> After the resolution from my first email, everything has been working
>> perfect.
>>
>> thanks for the responses.
>>
>>
>> On Fri, Mar 10, 2017 at 4:23 PM, Gregory Farnum 
>> wrote:
>>
>>>
>>>
>>> On Tue, Mar 7, 2017 at 10:18 AM Alejandro Comisario <
>>> alejan...@nubeliu.com> wrote:
>>>
>>>> Gregory, thanks for the response, what you've said is by far, the most
>>>> enlightneen thing i know about ceph in a long time.
>>>>
>>>> What brings even greater doubt, which is, this "non-functional" pool,
>>>> was only 1.5GB large, vs 50-150GB on the other effected pools, the tiny
>>>> pool was still being used, and just because that pool was blovking
>>>> requests, the whole cluster was unresponsive.
>>>>
>>>> So , what do you mean by "non-functional" pool ? how a pool can become
>>>> non-functional ? and what asures me that tomorrow (just becaue i deleted
>>>> the 1.5GB pool to fix the whole problem) another pool doesnt becomes
>>>> non-functional ?
>>>>
>>>
>>> Well, you said there were a bunch of slow requests. That can happen any
>>> number of ways, if you're overloading the OSDs or something.
>>> When there are slow requests, those ops take up OSD memory and throttle,
>>> and so they don't let in new messages until the old ones are serviced. This
>>> can cascade across a cluster -- because everything is interconnected,
>>> clients and OSDs end up with all their requests targeted at the slow OSDs
>>> which aren't letting in new IO quickly enough. It's one of the weaknesses
>>> of the standard deployment patterns, but it usually doesn't come up unless
>>> something else has gone pretty wrong first.
>>> As for what actually went wrong here, you haven't provided near enough
>>> information and probably can't now that the pool has been deleted. *shrug*
>>> -Greg
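Greg's throttling point can be illustrated with a toy model (purely
illustrative, not Ceph code): each OSD admits client messages into a bounded
intake before it knows which pool or PG they target, so once ops for a stuck
pool occupy every slot, requests for perfectly healthy pools are refused
admission too. A minimal sketch, assuming a single shared per-OSD queue:

from collections import deque

# Toy model of an OSD's bounded client-message intake (not real Ceph code).
# Requests are admitted before the OSD knows which pool/PG they target,
# so ops for a stuck pool can occupy every slot and starve other pools.

QUEUE_DEPTH = 8
intake = deque()

def try_admit(op):
    if len(intake) >= QUEUE_DEPTH:
        return False          # throttled: client waits, backpressure builds
    intake.append(op)
    return True

# A stuck pool whose ops never complete keeps submitting...
for i in range(QUEUE_DEPTH):
    assert try_admit(("stuck-pool", i))

# ...and now a perfectly healthy pool is locked out of this OSD as well.
print(try_admit(("healthy-pool", 0)))   # False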
>>>
>>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can a OSD affect performance from pool X when blocking/slow requests PGs from pool Y ?

2017-03-22 Thread Alejandro Comisario
any thoughts ?

On Tue, Mar 14, 2017 at 10:22 PM, Alejandro Comisario  wrote:

> Greg, thanks for the reply.
> True that i cant provide enough information to know what happened since
> the pool is gone.
>
> But based on your experience, can i please take some of your time, and
> give me the TOP 5 fo what could happen / would be the reason to happen what
> hapened to that pool (or any pool) that makes Ceph (maybe hapened
> specifically in Hammer ) to behave like that ?
>
> Information that i think will be of value, is that the cluster was 5 nodes
> large, running "0.94.6-1trusty" i added two nodes running the latest
> "0.94.9-1trusty" and replication into those new disks never ended, since i
> saw WEIRD errors on the new OSDs, so i thought that packages needed to be
> the same, so i "apt-get upgraded" the 5 old nodes without restrting
> nothing, so rebalancing started to happen without errors (WEIRD).
>
> after these two nodes reached 100% of the disks weight, the cluster worked
> perfectly for about two weeks, till this happened.
> After the resolution from my first email, everything has been working
> perfect.
>
> thanks for the responses.
>
>
> On Fri, Mar 10, 2017 at 4:23 PM, Gregory Farnum 
> wrote:
>
>>
>>
>> On Tue, Mar 7, 2017 at 10:18 AM Alejandro Comisario <
>> alejan...@nubeliu.com> wrote:
>>
>>> Gregory, thanks for the response, what you've said is by far, the most
>>> enlightneen thing i know about ceph in a long time.
>>>
>>> What brings even greater doubt, which is, this "non-functional" pool,
>>> was only 1.5GB large, vs 50-150GB on the other effected pools, the tiny
>>> pool was still being used, and just because that pool was blovking
>>> requests, the whole cluster was unresponsive.
>>>
>>> So , what do you mean by "non-functional" pool ? how a pool can become
>>> non-functional ? and what asures me that tomorrow (just becaue i deleted
>>> the 1.5GB pool to fix the whole problem) another pool doesnt becomes
>>> non-functional ?
>>>
>>
>> Well, you said there were a bunch of slow requests. That can happen any
>> number of ways, if you're overloading the OSDs or something.
>> When there are slow requests, those ops take up OSD memory and throttle,
>> and so they don't let in new messages until the old ones are serviced. This
>> can cascade across a cluster -- because everything is interconnected,
>> clients and OSDs end up with all their requests targeted at the slow OSDs
>> which aren't letting in new IO quickly enough. It's one of the weaknesses
>> of the standard deployment patterns, but it usually doesn't come up unless
>> something else has gone pretty wrong first.
>> As for what actually went wrong here, you haven't provided near enough
>> information and probably can't now that the pool has been deleted. *shrug*
>> -Greg
>>
>>
>>
>>
>>> Ceph Bug ?
>>> Another Bug ?
>>> Something than can be avoided ?
>>>
>>>
>>> On Tue, Mar 7, 2017 at 2:11 PM, Gregory Farnum 
>>> wrote:
>>>
>>> Some facts:
>>> The OSDs use a lot of gossip protocols to distribute information.
>>> The OSDs limit how many client messages they let in to the system at a
>>> time.
>>> The OSDs do not distinguish between client ops for different pools (the
>>> blocking happens before they have any idea what the target is).
>>>
>>> So, yes: if you have a non-functional pool and clients keep trying to
>>> access it, those requests can fill up the OSD memory queues and block
>>> access to other pools as it cascades across the system.
>>>
>>> On Sun, Mar 5, 2017 at 6:22 PM Alejandro Comisario <
>>> alejan...@nubeliu.com> wrote:
>>>
>>> Hi, we have a 7 nodes ubuntu ceph hammer pool (78 OSD to be exact).
>>> This weekend we'be experienced a huge outage from our customers vms
>>> (located on pool CUSTOMERS, replica size 3 ) when lots of OSD's
>>> started to slow request/block PG's on pool PRIVATE ( replica size 1 )
>>> basically all PG's blocked where just one OSD in the acting set, but
>>> all customers on the other pool got their vms almost freezed.
>>>
>>> while trying to do basic troubleshooting like doing noout and then
>>> bringing down the OSD that slowed/blocked the most, inmediatelly
>>> another OSD slowed/locked iops on pgs from the same PRIVATE pool, so
>>> we rolled back that cha

Re: [ceph-users] question about block sizes, rados objects and file striping (and maybe more)

2017-03-20 Thread Alejandro Comisario
Jason, thanks for the reply, you really got my question right.
So, here are some doubts that may show I lack some general knowledge.

When I read that someone is benchmarking a Ceph cluster with sequential
4K block writes, could that be happening inside a VM that runs on an
RBD-backed disk?
In that case, should the VM's filesystem be formatted for 4K writes,
so that the VM's block layer sends 4K writes down to the hypervisor?

And in that case, assuming I have a 9K MTU between the compute node
and the Ceph cluster:
what is the default RADOS object size into which the data is divided?


On Mon, Mar 20, 2017 at 7:06 PM, Jason Dillaman  wrote:
> It's a very broad question -- are you trying to determine something
> more specific?
>
> Notionally, your DB engine will safely journal the changes to disk,
> commit the changes to the backing table structures, and prune the
> journal. Your mileage my vary depending on the specific DB engine and
> its configuration settings.
>
> The VM's OS will send write requests addressed by block offset and
> block counts (e.g. 512 blocks) through the block device hardware
> (either a slower emulated block device or a faster paravirtualized
> block device like virtio-blk/virtio-scsi). Within the internals of
> QEMU, these block-addressed write requests will be delivered to librbd
> in byte-addressed format (the blocks are converted to absolute byte
> ranges).
>
> librbd will take the provided byte offset and length and quickly
> calculate which backing RADOS objects are associated with the provided
> range [1]. If the extent intersects multiple backing objects, the
> sub-operation is sent to each affected object in parallel. These
> operations will be sent to the OSDs responsible for handling the
> object (as per the CRUSH map) -- by default via TCP/IP. The MTU is the
> maximum size of each IP packet -- larger MTUs allow you to send more
> data within a single packet [2].
>
> [1] http://docs.ceph.com/docs/master/architecture/#data-striping
> [2] https://en.wikipedia.org/wiki/Maximum_transmission_unit
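A minimal sketch of the byte-extent-to-object calculation Jason describes,
assuming the default RBD layout (order 22, i.e. 4 MiB objects, stripe count 1,
so the stripe unit equals the object size); the object name prefix below is
made up:

# Map a byte-addressed write to the backing RADOS objects it touches,
# assuming the default RBD layout (4 MiB objects, no fancy striping).
# The object name prefix is a made-up placeholder.

OBJECT_SIZE = 4 * 1024 * 1024   # 4 MiB

def touched_objects(offset, length, prefix="rbd_data.abc123"):
    extents = []
    end = offset + length
    while offset < end:
        obj_no = offset // OBJECT_SIZE
        in_obj = offset % OBJECT_SIZE
        chunk = min(OBJECT_SIZE - in_obj, end - offset)
        extents.append((f"{prefix}.{obj_no:016x}", in_obj, chunk))
        offset += chunk
    return extents

# A 20 MiB write starting at a 2 MiB offset spans 6 objects:
for name, off, length in touched_objects(2 * 1024 * 1024, 20 * 1024 * 1024):
    print(name, off, length)

As described above, librbd then issues one sub-operation per affected object,
in parallel, to whichever OSDs CRUSH maps those objects to.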
>
>
>
> On Mon, Mar 20, 2017 at 5:24 PM, Alejandro Comisario
>  wrote:
>> anyone ?
>>
>> On Fri, Mar 17, 2017 at 5:40 PM, Alejandro Comisario
>>  wrote:
>>> Hi, it's been a while since im using Ceph, and still im a little
>>> ashamed that when certain situation happens, i dont have the knowledge
>>> to explain or plan things.
>>>
>>> Basically what i dont know is, and i will do an exercise.
>>>
>>> EXCERCISE:
>>> a virtual machine running on KVM has an extra block device where the
>>> datafiles of a database runs (this block device is exposed to the vm
>>> using libvirt)
>>>
>>> facts.
>>> * the db writes to disk in 8K blocks
>>> * the connection between the phisical compute node and Ceph has an MTU of 
>>> 1500
>>> * the QEMU RBD driver uses a stipe unit of 2048 kB and a stripe count of 4.
>>> * everything else is default
>>>
>>> So conceptually, if someone can explain me, what happens from the
>>> momment the DB contained on the VM commits to disk a query of
>>> 20MBytes, what happens on the compute node, what happens on the
>>> client's file striping, what happens on the network (regarding
>>> packages, if other than creating 1500 bytes packages), what happens
>>> with rados objects, block sizes, etc.
>>>
>>> I would love to read this from the bests, mainly because as i said i
>>> dont understand all the workflow of blocks, objects, etc.
>>>
>>> thanks to everyone !
>>>
>>> --
>>> Alejandrito
>>
>>
>>
>> --
>> Alejandro Comisario
>> CTO | NUBELIU
>> E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
>> _
>> www.nubeliu.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason



-- 
Alejandro Comisario
CTO | NUBELIU
E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about block sizes, rados objects and file striping (and maybe more)

2017-03-20 Thread Alejandro Comisario
anyone ?

On Fri, Mar 17, 2017 at 5:40 PM, Alejandro Comisario
 wrote:
> Hi, it's been a while since im using Ceph, and still im a little
> ashamed that when certain situation happens, i dont have the knowledge
> to explain or plan things.
>
> Basically what i dont know is, and i will do an exercise.
>
> EXCERCISE:
> a virtual machine running on KVM has an extra block device where the
> datafiles of a database runs (this block device is exposed to the vm
> using libvirt)
>
> facts.
> * the db writes to disk in 8K blocks
> * the connection between the phisical compute node and Ceph has an MTU of 1500
> * the QEMU RBD driver uses a stipe unit of 2048 kB and a stripe count of 4.
> * everything else is default
>
> So conceptually, if someone can explain me, what happens from the
> momment the DB contained on the VM commits to disk a query of
> 20MBytes, what happens on the compute node, what happens on the
> client's file striping, what happens on the network (regarding
> packages, if other than creating 1500 bytes packages), what happens
> with rados objects, block sizes, etc.
>
> I would love to read this from the bests, mainly because as i said i
> dont understand all the workflow of blocks, objects, etc.
>
> thanks to everyone !
>
> --
> Alejandrito



-- 
Alejandro Comisario
CTO | NUBELIU
E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] question about block sizes, rados objects and file striping (and maybe more)

2017-03-17 Thread Alejandro Comisario
Hi, I have been using Ceph for a while now, and I am still a little
ashamed that when certain situations happen, I don't have the knowledge
to explain or plan things.

Basically, here is what I don't know, laid out as an exercise.

EXERCISE:
A virtual machine running on KVM has an extra block device where the
datafiles of a database live (this block device is exposed to the VM
using libvirt).

Facts:
* the DB writes to disk in 8K blocks
* the connection between the physical compute node and Ceph has an MTU of 1500
* the QEMU RBD driver uses a stripe unit of 2048 kB and a stripe count of 4
* everything else is default

So, conceptually, can someone explain what happens from the moment the
DB inside the VM commits a 20MByte query to disk: what happens on the
compute node, what happens with the client's file striping, what
happens on the network (regarding packets, beyond creating 1500-byte
packets), and what happens with RADOS objects, block sizes, etc.?

I would love to read this from the best, mainly because, as I said, I
don't understand the whole workflow of blocks, objects, etc.

Thanks to everyone!

-- 
Alejandrito
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
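Taking the facts of the exercise above at face value, a rough, hedged
back-of-envelope for the 20MB commit with a 2048 kB stripe unit, a stripe
count of 4 and an MTU of 1500, ignoring TCP/IP header overhead and any write
coalescing along the way:

# Back-of-envelope numbers for the exercise: how many stripe units and,
# very roughly, how many Ethernet frames a 20 MiB flush turns into.
# Header overhead and coalescing are deliberately ignored.

MiB = 1024 * 1024
write_bytes  = 20 * MiB
stripe_unit  = 2048 * 1024   # 2 MiB, from the exercise
stripe_count = 4             # from the exercise

stripe_units = write_bytes // stripe_unit        # 10 units of 2 MiB
frames_1500  = -(-write_bytes // 1500)           # ceil division: ~14k frames
frames_9000  = -(-write_bytes // 9000)           # ~2.3k frames with jumbo MTU

print(f"{stripe_units} stripe units of 2 MiB, dealt round-robin across groups "
      f"of {stripe_count} objects; roughly {frames_1500} frames at MTU 1500 "
      f"vs {frames_9000} at MTU 9000 (headers ignored)")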


Re: [ceph-users] can a OSD affect performance from pool X when blocking/slow requests PGs from pool Y ?

2017-03-14 Thread Alejandro Comisario
Greg, thanks for the reply.
It's true that I can't provide enough information to know what happened,
since the pool is gone.

But based on your experience, may I take some of your time and ask for
the top 5 things that could happen / would be the reasons for what
happened to that pool (or any pool) that make Ceph (maybe specifically
Hammer) behave like that?

Information that I think will be of value: the cluster was 5 nodes
large, running "0.94.6-1trusty". I added two nodes running the latest
"0.94.9-1trusty", and replication onto those new disks never finished,
since I saw WEIRD errors on the new OSDs. I thought the packages needed
to be the same, so I "apt-get upgraded" the 5 old nodes without
restarting anything, and rebalancing then proceeded without errors
(WEIRD).

After these two nodes reached 100% of the disks' weight, the cluster
worked perfectly for about two weeks, until this happened.
After the resolution from my first email, everything has been working
perfectly.

Thanks for the responses.


On Fri, Mar 10, 2017 at 4:23 PM, Gregory Farnum  wrote:

>
>
> On Tue, Mar 7, 2017 at 10:18 AM Alejandro Comisario 
> wrote:
>
>> Gregory, thanks for the response, what you've said is by far, the most
>> enlightneen thing i know about ceph in a long time.
>>
>> What brings even greater doubt, which is, this "non-functional" pool, was
>> only 1.5GB large, vs 50-150GB on the other effected pools, the tiny pool
>> was still being used, and just because that pool was blovking requests, the
>> whole cluster was unresponsive.
>>
>> So , what do you mean by "non-functional" pool ? how a pool can become
>> non-functional ? and what asures me that tomorrow (just becaue i deleted
>> the 1.5GB pool to fix the whole problem) another pool doesnt becomes
>> non-functional ?
>>
>
> Well, you said there were a bunch of slow requests. That can happen any
> number of ways, if you're overloading the OSDs or something.
> When there are slow requests, those ops take up OSD memory and throttle,
> and so they don't let in new messages until the old ones are serviced. This
> can cascade across a cluster -- because everything is interconnected,
> clients and OSDs end up with all their requests targeted at the slow OSDs
> which aren't letting in new IO quickly enough. It's one of the weaknesses
> of the standard deployment patterns, but it usually doesn't come up unless
> something else has gone pretty wrong first.
> As for what actually went wrong here, you haven't provided near enough
> information and probably can't now that the pool has been deleted. *shrug*
> -Greg
>
>
>
>
>> Ceph Bug ?
>> Another Bug ?
>> Something than can be avoided ?
>>
>>
>> On Tue, Mar 7, 2017 at 2:11 PM, Gregory Farnum 
>> wrote:
>>
>> Some facts:
>> The OSDs use a lot of gossip protocols to distribute information.
>> The OSDs limit how many client messages they let in to the system at a
>> time.
>> The OSDs do not distinguish between client ops for different pools (the
>> blocking happens before they have any idea what the target is).
>>
>> So, yes: if you have a non-functional pool and clients keep trying to
>> access it, those requests can fill up the OSD memory queues and block
>> access to other pools as it cascades across the system.
>>
>> On Sun, Mar 5, 2017 at 6:22 PM Alejandro Comisario 
>> wrote:
>>
>> Hi, we have a 7 nodes ubuntu ceph hammer pool (78 OSD to be exact).
>> This weekend we'be experienced a huge outage from our customers vms
>> (located on pool CUSTOMERS, replica size 3 ) when lots of OSD's
>> started to slow request/block PG's on pool PRIVATE ( replica size 1 )
>> basically all PG's blocked where just one OSD in the acting set, but
>> all customers on the other pool got their vms almost freezed.
>>
>> while trying to do basic troubleshooting like doing noout and then
>> bringing down the OSD that slowed/blocked the most, inmediatelly
>> another OSD slowed/locked iops on pgs from the same PRIVATE pool, so
>> we rolled back that change and started to move data around with the
>> same logic (reweighting down those OSD) with exactly the same result.
>>
>> So, me made a decition, we decided to delete the pool where all PGS
>> where slowed/locked allways despite the osd.
>>
>> Not even 10 secconds passes after the pool deletion, where not only
>> there were no more degraded PGs, bit also ALL slow iops dissapeared
>> for ever, and performance from hundreds of vms came to normal
>> immediately.
>>
>> I mus

Re: [ceph-users] can a OSD affect performance from pool X when blocking/slow requests PGs from pool Y ?

2017-03-10 Thread Alejandro Comisario
Any thoughts ?

On Tue, Mar 7, 2017 at 3:17 PM, Alejandro Comisario 
wrote:

> Gregory, thanks for the response, what you've said is by far, the most
> enlightneen thing i know about ceph in a long time.
>
> What brings even greater doubt, which is, this "non-functional" pool, was
> only 1.5GB large, vs 50-150GB on the other effected pools, the tiny pool
> was still being used, and just because that pool was blovking requests, the
> whole cluster was unresponsive.
>
> So , what do you mean by "non-functional" pool ? how a pool can become
> non-functional ? and what asures me that tomorrow (just becaue i deleted
> the 1.5GB pool to fix the whole problem) another pool doesnt becomes
> non-functional ?
>
> Ceph Bug ?
> Another Bug ?
> Something than can be avoided ?
>
>
> On Tue, Mar 7, 2017 at 2:11 PM, Gregory Farnum  wrote:
>
>> Some facts:
>> The OSDs use a lot of gossip protocols to distribute information.
>> The OSDs limit how many client messages they let in to the system at a
>> time.
>> The OSDs do not distinguish between client ops for different pools (the
>> blocking happens before they have any idea what the target is).
>>
>> So, yes: if you have a non-functional pool and clients keep trying to
>> access it, those requests can fill up the OSD memory queues and block
>> access to other pools as it cascades across the system.
>>
>> On Sun, Mar 5, 2017 at 6:22 PM Alejandro Comisario 
>> wrote:
>>
>>> Hi, we have a 7 nodes ubuntu ceph hammer pool (78 OSD to be exact).
>>> This weekend we'be experienced a huge outage from our customers vms
>>> (located on pool CUSTOMERS, replica size 3 ) when lots of OSD's
>>> started to slow request/block PG's on pool PRIVATE ( replica size 1 )
>>> basically all PG's blocked where just one OSD in the acting set, but
>>> all customers on the other pool got their vms almost freezed.
>>>
>>> while trying to do basic troubleshooting like doing noout and then
>>> bringing down the OSD that slowed/blocked the most, inmediatelly
>>> another OSD slowed/locked iops on pgs from the same PRIVATE pool, so
>>> we rolled back that change and started to move data around with the
>>> same logic (reweighting down those OSD) with exactly the same result.
>>>
>>> So, me made a decition, we decided to delete the pool where all PGS
>>> where slowed/locked allways despite the osd.
>>>
>>> Not even 10 secconds passes after the pool deletion, where not only
>>> there were no more degraded PGs, bit also ALL slow iops dissapeared
>>> for ever, and performance from hundreds of vms came to normal
>>> immediately.
>>>
>>> I must say that i was kinda scared to see that happen, bascally
>>> because there was only ONE POOL's PGS always slowed, but performance
>>> hit the another pool, so ... did not the PGS that exists on one pool
>>> are not shared by the other ?
>>> If my assertion is true, why OSD's locking iops from one pool's pg
>>> slowed down all other pgs from other pools ?
>>>
>>> again, i just deleted a pool that has almost no traffic, because its
>>> pgs were locked and affected pgs on another pool, and as soon as that
>>> happened, the whole cluster came back to normal (and of course,
>>> HEALTH_OK and no slow transaction whatsoever)
>>>
>>> please, someone help me understand the gap where i miss something,
>>> since this , as long as my ceph knowledge is concerned, makes no
>>> sense.
>>>
>>> PS: i have found someone that , looks like went through the same here:
>>> https://forum.proxmox.com/threads/ceph-osd-failure-causing-
>>> proxmox-node-to-crash.20781/
>>> but i still dont understand what happened.
>>>
>>> hoping to get the help from the community.
>>>
>>> --
>>> Alejandrito.
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
>
> --
> *Alejandro Comisario*
> *CTO | NUBELIU*
> E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
> _
> www.nubeliu.com
>



-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can a OSD affect performance from pool X when blocking/slow requests PGs from pool Y ?

2017-03-07 Thread Alejandro Comisario
Gregory, thanks for the response; what you've said is by far the most
enlightening thing I have learned about Ceph in a long time.

What raises even greater doubt is that this "non-functional" pool was
only 1.5GB large, versus 50-150GB on the other affected pools; the tiny
pool was still being used, and just because that pool was blocking
requests, the whole cluster was unresponsive.

So, what do you mean by a "non-functional" pool? How can a pool become
non-functional? And what assures me that tomorrow (just because I
deleted the 1.5GB pool to fix the whole problem) another pool doesn't
become non-functional?

A Ceph bug?
Another bug?
Something that can be avoided?


On Tue, Mar 7, 2017 at 2:11 PM, Gregory Farnum  wrote:

> Some facts:
> The OSDs use a lot of gossip protocols to distribute information.
> The OSDs limit how many client messages they let in to the system at a
> time.
> The OSDs do not distinguish between client ops for different pools (the
> blocking happens before they have any idea what the target is).
>
> So, yes: if you have a non-functional pool and clients keep trying to
> access it, those requests can fill up the OSD memory queues and block
> access to other pools as it cascades across the system.
>
> On Sun, Mar 5, 2017 at 6:22 PM Alejandro Comisario 
> wrote:
>
>> Hi, we have a 7 nodes ubuntu ceph hammer pool (78 OSD to be exact).
>> This weekend we'be experienced a huge outage from our customers vms
>> (located on pool CUSTOMERS, replica size 3 ) when lots of OSD's
>> started to slow request/block PG's on pool PRIVATE ( replica size 1 )
>> basically all PG's blocked where just one OSD in the acting set, but
>> all customers on the other pool got their vms almost freezed.
>>
>> while trying to do basic troubleshooting like doing noout and then
>> bringing down the OSD that slowed/blocked the most, inmediatelly
>> another OSD slowed/locked iops on pgs from the same PRIVATE pool, so
>> we rolled back that change and started to move data around with the
>> same logic (reweighting down those OSD) with exactly the same result.
>>
>> So, me made a decition, we decided to delete the pool where all PGS
>> where slowed/locked allways despite the osd.
>>
>> Not even 10 secconds passes after the pool deletion, where not only
>> there were no more degraded PGs, bit also ALL slow iops dissapeared
>> for ever, and performance from hundreds of vms came to normal
>> immediately.
>>
>> I must say that i was kinda scared to see that happen, bascally
>> because there was only ONE POOL's PGS always slowed, but performance
>> hit the another pool, so ... did not the PGS that exists on one pool
>> are not shared by the other ?
>> If my assertion is true, why OSD's locking iops from one pool's pg
>> slowed down all other pgs from other pools ?
>>
>> again, i just deleted a pool that has almost no traffic, because its
>> pgs were locked and affected pgs on another pool, and as soon as that
>> happened, the whole cluster came back to normal (and of course,
>> HEALTH_OK and no slow transaction whatsoever)
>>
>> please, someone help me understand the gap where i miss something,
>> since this , as long as my ceph knowledge is concerned, makes no
>> sense.
>>
>> PS: i have found someone that , looks like went through the same here:
>> https://forum.proxmox.com/threads/ceph-osd-failure-
>> causing-proxmox-node-to-crash.20781/
>> but i still dont understand what happened.
>>
>> hoping to get the help from the community.
>>
>> --
>> Alejandrito.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] can a OSD affect performance from pool X when blocking/slow requests PGs from pool Y ?

2017-03-06 Thread Alejandro Comisario
Hi, we have a 7-node Ubuntu Ceph Hammer cluster (78 OSDs to be exact).
This weekend we experienced a huge outage for our customers' VMs
(located on pool CUSTOMERS, replica size 3) when lots of OSDs started
to slow-request/block PGs on pool PRIVATE (replica size 1); basically
all the blocked PGs had just one OSD in the acting set, but all the
customers on the other pool got their VMs almost frozen.

While trying to do basic troubleshooting, like setting noout and then
bringing down the OSD that slowed/blocked the most, immediately another
OSD slowed/blocked IOPS on PGs from the same PRIVATE pool, so we rolled
back that change and started to move data around with the same logic
(reweighting down those OSDs), with exactly the same result.

So we made a decision: we decided to delete the pool whose PGs were
always the slowed/blocked ones, regardless of the OSD.

Not even 10 seconds after the pool deletion, not only were there no
more degraded PGs, but ALL the slow IOPS disappeared for good, and
performance for hundreds of VMs came back to normal immediately.

I must say I was kind of scared to see that happen, basically because
only ONE pool's PGs were ever slowed, yet performance hit the other
pool, so... aren't the PGs that exist in one pool separate from those
of the other?
If my assertion is true, why did OSDs blocking IOPS on one pool's PGs
slow down all the other PGs from other pools?

Again, I just deleted a pool that had almost no traffic, because its
PGs were blocked and were affecting PGs on another pool, and as soon as
that happened the whole cluster came back to normal (and of course,
HEALTH_OK and no slow requests whatsoever).

Please, someone help me understand the gap where I am missing
something, since this, as far as my Ceph knowledge is concerned, makes
no sense.

PS: I have found someone who, it looks like, went through the same here:
https://forum.proxmox.com/threads/ceph-osd-failure-causing-proxmox-node-to-crash.20781/
but I still don't understand what happened.

Hoping to get help from the community.

-- 
Alejandrito
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
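Not an answer to the root cause, but a small helper sketch for the next time
something like this happens: a PG id has the form "<pool_id>.<pg_seq>", so a
pasted list of blocked/slow PG ids (for example from `ceph health detail`)
can be tallied per pool to show at a glance which pool is absorbing the slow
requests. The pool ids, names and PG ids below are hypothetical; the name
mapping would come from `ceph osd lspools`:

from collections import Counter

# Tally blocked/slow PGs per pool. A PG id is "<pool_id>.<pg_seq>", so the
# pool is recoverable from the id alone; names come from `ceph osd lspools`.
# Both example lists below are hypothetical.

pool_names = {3: "CUSTOMERS", 7: "PRIVATE"}             # from `ceph osd lspools`
blocked_pgs = ["7.1a", "7.2f", "7.30", "3.04", "7.1a"]  # pasted from health output

per_pool = Counter()
for pgid in set(blocked_pgs):                           # de-duplicate repeats
    pool_id = int(pgid.split(".")[0])
    per_pool[pool_names.get(pool_id, f"pool {pool_id}")] += 1

for pool, count in per_pool.most_common():
    print(f"{pool}: {count} blocked PG(s)")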


[ceph-users] can a OSD affect performance from pool X when blocking/slow requests PGs from pool Y ?

2017-03-05 Thread Alejandro Comisario
Hi, we have a 7-node Ubuntu Ceph Hammer cluster (78 OSDs to be exact).
This weekend we experienced a huge outage for our customers' VMs
(located on pool CUSTOMERS, replica size 3) when lots of OSDs started
to slow-request/block PGs on pool PRIVATE (replica size 1); basically
all the blocked PGs had just one OSD in the acting set, but all the
customers on the other pool got their VMs almost frozen.

While trying to do basic troubleshooting, like setting noout and then
bringing down the OSD that slowed/blocked the most, immediately another
OSD slowed/blocked IOPS on PGs from the same PRIVATE pool, so we rolled
back that change and started to move data around with the same logic
(reweighting down those OSDs), with exactly the same result.

So we made a decision: we decided to delete the pool whose PGs were
always the slowed/blocked ones, regardless of the OSD.

Not even 10 seconds after the pool deletion, not only were there no
more degraded PGs, but ALL the slow IOPS disappeared for good, and
performance for hundreds of VMs came back to normal immediately.

I must say I was kind of scared to see that happen, basically because
only ONE pool's PGs were ever slowed, yet performance hit the other
pool, so... aren't the PGs that exist in one pool separate from those
of the other?
If my assertion is true, why did OSDs blocking IOPS on one pool's PGs
slow down all the other PGs from other pools?

Again, I just deleted a pool that had almost no traffic, because its
PGs were blocked and were affecting PGs on another pool, and as soon as
that happened the whole cluster came back to normal (and of course,
HEALTH_OK and no slow requests whatsoever).

Please, someone help me understand the gap where I am missing
something, since this, as far as my Ceph knowledge is concerned, makes
no sense.

PS: I have found someone who, it looks like, went through the same here:
https://forum.proxmox.com/threads/ceph-osd-failure-causing-proxmox-node-to-crash.20781/
but I still don't understand what happened.

Hoping to get help from the community.

-- 
Alejandrito.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com