Re: [ceph-users] ceph pg backfill_toofull

2018-12-11 Thread Maged Mokhtar


There are two relevant params:

mon_osd_full_ratio      0.95

osd_backfill_full_ratio 0.85

You are probably hitting both of them.

As a short-term/temporary fix you can increase these values, and adjust
the weights on the OSDs if you have to.
However, you really need to fix this by adding more OSDs to your cluster,
else it will happen again and again. Also, when planning the required
storage capacity, you should plan for 1 or 2 hosts failing and their PGs
being redistributed onto the remaining nodes, else you will hit the same issue.
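
(For reference, a rough sketch of the temporary workaround - the option name
and injectargs syntax should be double-checked against the docs for your
Hammer release before applying anything:)

  # raise the backfill threshold on all OSDs so the toofull PGs can backfill
  ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.90'

  # once more OSDs have been added and backfill has finished, revert it
  ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.85'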


/Maged




On 12/12/2018 07:52, Klimenko, Roman wrote:


Hi everyone. Yesterday I found that on our overcrowded Hammer ceph
cluster (83% used in the HDD pool) several OSDs were in the danger zone -
near 95%.


I reweighted them, and after a short while I got PGs stuck in
backfill_toofull.


After that, I reapplied the reweight to the OSDs - no luck.

Currently, all reweights are equal to 1.0, and ceph does nothing - no
rebalancing and no recovery.


How can I make ceph recover these PGs?

ceph -s

     health HEALTH_WARN
            47 pgs backfill_toofull
            47 pgs stuck unclean
            recovery 16/9422472 objects degraded (0.000%)
            recovery 365332/9422472 objects misplaced (3.877%)
            7 near full osd(s)

ceph osd df tree
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL %USE  VAR  TYPE NAME
-1 30.65996        - 37970G 29370G 8599G 77.35 1.00 root default
-6 18.65996        - 20100G 16681G 3419G 82.99 1.07  region HDD
-3  6.09000        -  6700G  5539G 1160G 82.68 1.07    host ceph03.HDD
 1  1.0  1.0  1116G   841G  274G 75.39 0.97        osd.1
 5  1.0  1.0  1116G   916G  200G 82.07 1.06        osd.5
 3  1.0  1.0  1116G   939G  177G 84.14 1.09        osd.3
 8  1.09000  1.0  1116G   952G  164G 85.29 1.10        osd.8
 7  1.0  1.0  1116G   972G  143G 87.11 1.13        osd.7
11  1.0  1.0  1116G   916G  200G 82.08 1.06        osd.11
-4  6.16998        -  6700G  5612G 1088G 83.76 1.08    host ceph02.HDD
14  1.09000  1.0  1116G   950G  165G 85.16 1.10        osd.14
13  0.8  1.0  1116G   949G  167G 85.03 1.10        osd.13
16  1.09000  1.0  1116G   921G  195G 82.50 1.07        osd.16
17  1.0  1.0  1116G   899G  216G 80.59 1.04        osd.17
10  1.09000  1.0  1116G   952G  164G 85.28 1.10        osd.10
15  1.0  1.0  1116G   938G  178G 84.02 1.09        osd.15
-2  6.39998        -  6700G  5529G 1170G 82.53 1.07    host ceph01.HDD
12  1.09000  1.0  1116G   953G  163G 85.39 1.10        osd.12
 9  0.95000  1.0  1116G   939G  177G 84.14 1.09        osd.9
 2  1.09000  1.0  1116G   911G  204G 81.64 1.06        osd.2
 0  1.09000  1.0  1116G   951G  165G 85.22 1.10        osd.0
 6  1.09000  1.0  1116G   917G  199G 82.12 1.06        osd.6
 4  1.09000  1.0  1116G   856G  260G 76.67 0.99        osd.4








___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Maged Mokhtar
CEO PetaSAN
4 Emad El Deen Kamel
Cairo 11371, Egypt
www.petasan.org
+201006979931
skype: maged.mokhtar

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] move directories in cephfs

2018-12-11 Thread Zhenshi Zhou
Hi

That means the 'mv' operation should only be done if src and dst
are in the same pool, and the client should have the same permissions
on both src and dst.

Do I have the right understanding?
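
(For reference, a quick way to see which data pool a directory maps to is the
CephFS layout vattrs - the mount point, path and pool name below are only
placeholders:)

  # show the data pool that new files under this directory will use
  # (if the directory has no explicit layout it inherits the nearest
  #  ancestor's / the filesystem default, and no attribute is reported)
  getfattr -n ceph.dir.layout.pool /mnt/cephfs/parent/b

  # point the directory at another pool (only affects files created
  # afterwards; existing file data is NOT moved by this)
  setfattr -n ceph.dir.layout.pool -v cephfs_data_other /mnt/cephfs/parent/b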

Marc Roos wrote on Tue, Dec 11, 2018 at 4:53 PM:

> >Moving data between pools when a file is moved to a different directory
>
> >is most likely problematic - for example an inode can be hard linked to
>
> >two different directories that are in two different pools - then what
> >happens to the file?  Unix/posix semantics don't really specify a
> parent
> >directory to a regular file.
> >
> >That being said - it would be really nice if there were a way to move
> an
> >inode from one pool to another transparently (with some explicit
> >command).  Perhaps locking the inode up for the duration of the move,
> >and releasing it when the move is complete (so that clients that have
> >the file open don't notice any disruptions).  Are there any plans in
> >this direction?
>
> I also hope so, because this would be the expected behavior for me. I ran
> into this issue accidentally because I had different permissions on the
> pools. How can I explain to a user that if they move files between two
> specific folders they should not mv but cp? Now I have to work around
> this by applying separate mounts.
>
>
> -Original Message-
> From: Andras Pataki [mailto:apat...@flatironinstitute.org]
> Sent: 11 December 2018 00:34
> To: Marc Roos; ceph; ceph-users
> Subject: Re: [ceph-users] move directories in cephfs
>
> Moving data between pools when a file is moved to a different directory
> is most likely problematic - for example an inode can be hard linked to
> two different directories that are in two different pools - then what
> happens to the file?  Unix/posix semantics don't really specify a parent
>
> directory to a regular file.
>
> That being said - it would be really nice if there were a way to move an
>
> inode from one pool to another transparently (with some explicit
> command).  Perhaps locking the inode up for the duration of the move,
> and releasing it when the move is complete (so that clients that have
> the file open don't notice any disruptions).  Are there any plans in
> this direction?
>
> Andras
>
> On 12/10/18 10:55 AM, Marc Roos wrote:
> >
> >
> > Except if you have different pools on these directories. Then the data
> > is not moved(copied), which I think should be done. This should be
> > changed, because no one will expect a symlink to the old pool.
> >
> >
> >
> >
> > -Original Message-
> > From: Jack [mailto:c...@jack.fr.eu.org]
> > Sent: 10 December 2018 15:14
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] move directories in cephfs
> >
> > Having the / mounted somewhere, you can simply "mv" directories around
> >
> > On 12/10/2018 02:59 PM, Zhenshi Zhou wrote:
> >> Hi,
> >>
> >> Is there a way I can move sub-directories outside the directory.
> >> For instance, a directory /parent contains 3 sub-directories
> >> /parent/a, /parent/b, /parent/c. All these directories have huge data
> >> in it. I'm gonna move /parent/b to /b. I don't want to copy the whole
> >> directory outside cause it will be so slow.
> >>
> >> Besides, I heard about cephfs-shell early today. I'm wondering which
> >> version will ceph have this command tool. My cluster is luminous
> >> 12.2.5.
> >>
> >> Thanks
> >>
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph pg backfill_toofull

2018-12-11 Thread Klimenko, Roman
Hi everyone. Yesterday I found that on our overcrowded Hammer ceph cluster (83%
used in the HDD pool) several OSDs were in the danger zone - near 95%.

I reweighted them, and after a short while I got PGs stuck in
backfill_toofull.

After that, I reapplied the reweight to the OSDs - no luck.

Currently, all reweights are equal to 1.0, and ceph does nothing - no rebalancing
and no recovery.

How can I make ceph recover these PGs?

ceph -s

 health HEALTH_WARN
47 pgs backfill_toofull
47 pgs stuck unclean
recovery 16/9422472 objects degraded (0.000%)
recovery 365332/9422472 objects misplaced (3.877%)
7 near full osd(s)

ceph osd df tree
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL %USE  VAR  TYPE NAME
-1 30.65996        - 37970G 29370G 8599G 77.35 1.00 root default
-6 18.65996        - 20100G 16681G 3419G 82.99 1.07  region HDD
-3  6.09000        -  6700G  5539G 1160G 82.68 1.07    host ceph03.HDD
 1  1.0  1.0  1116G   841G  274G 75.39 0.97        osd.1
 5  1.0  1.0  1116G   916G  200G 82.07 1.06        osd.5
 3  1.0  1.0  1116G   939G  177G 84.14 1.09        osd.3
 8  1.09000  1.0  1116G   952G  164G 85.29 1.10        osd.8
 7  1.0  1.0  1116G   972G  143G 87.11 1.13        osd.7
11  1.0  1.0  1116G   916G  200G 82.08 1.06        osd.11
-4  6.16998        -  6700G  5612G 1088G 83.76 1.08    host ceph02.HDD
14  1.09000  1.0  1116G   950G  165G 85.16 1.10        osd.14
13  0.8  1.0  1116G   949G  167G 85.03 1.10        osd.13
16  1.09000  1.0  1116G   921G  195G 82.50 1.07        osd.16
17  1.0  1.0  1116G   899G  216G 80.59 1.04        osd.17
10  1.09000  1.0  1116G   952G  164G 85.28 1.10        osd.10
15  1.0  1.0  1116G   938G  178G 84.02 1.09        osd.15
-2  6.39998        -  6700G  5529G 1170G 82.53 1.07    host ceph01.HDD
12  1.09000  1.0  1116G   953G  163G 85.39 1.10        osd.12
 9  0.95000  1.0  1116G   939G  177G 84.14 1.09        osd.9
 2  1.09000  1.0  1116G   911G  204G 81.64 1.06        osd.2
 0  1.09000  1.0  1116G   951G  165G 85.22 1.10        osd.0
 6  1.09000  1.0  1116G   917G  199G 82.12 1.06        osd.6
 4  1.09000  1.0  1116G   856G  260G 76.67 0.99        osd.4






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lost 1/40 OSDs at EC 4+1, now PGs are incomplete

2018-12-11 Thread Adam Tygart
AFAIR, there is a feature request in the works to allow rebuild with K
chunks, but not allow normal read/write until min_size is met. Not
that I think running with m=1 is a good idea. I'm not seeing the
tracker issue for it at the moment, though.

--
Adam
On Tue, Dec 11, 2018 at 9:50 PM Ashley Merrick  wrote:
>
> Yes, if you set it back to 5, then every time you lose an OSD you will have to
> set it to 4 and let the rebuild take place before putting it back to 5.
>
> I guess it all comes down to how important 100% uptime is versus you manually
> monitoring the backfill / fixing the OSD / replacing the OSD by dropping to 4,
> vs letting it happen automatically and risking a further OSD loss.
>
> If you have the space I'd suggest going to 4+2 and migrating your data; this
> would remove the ongoing issue and give you some extra data protection from
> OSD loss.
>
> On Wed, Dec 12, 2018 at 11:43 AM David Young  
> wrote:
>>
>> (accidentally forgot to reply to the list)
>>
>> Thank you, setting min_size to 4 allowed I/O again, and the 39 incomplete 
>> PGs are now:
>>
>> 39  active+undersized+degraded+remapped+backfilling
>>
>> Once backfilling is done, I'll increase min_size to 5 again.
>>
>> Am I likely to encounter this issue whenever I lose an OSD (I/O freezes and 
>> manually reducing min_size is required), and is there anything I should be doing 
>> differently?
>>
>> Thanks again!
>> D
>>
>>
>>
>> Sent with ProtonMail Secure Email.
>>
>> ‐‐‐ Original Message ‐‐‐
>> On Wednesday, December 12, 2018 3:31 PM, Ashley Merrick 
>>  wrote:
>>
>> With EC the min size is set to K + 1.
>>
>> Generally EC is used with an m of 2 or more; the reason min_size is set to K + 1 
>> is that you are now in a state where a further OSD loss would leave some PGs 
>> without at least K chunks available, as you only have 1 extra m.
>>
>> As per the error you can get your pool back online by setting min_size to 4.
>>
>> However this would only be a temp fix while you get the OSD back online / 
>> rebuilt so you can go back to your 4 + 1 state.
>>
>> ,Ash
>>
>> On Wed, 12 Dec 2018 at 10:27 AM, David Young  
>> wrote:
>>>
>>> Hi all,
>>>
>>> I have a small 2-node cluster with 40 OSDs, using erasure coding 4+1
>>>
>>> I lost osd38, and now I have 39 incomplete PGs.
>>>
>>> ---
>>> PG_AVAILABILITY Reduced data availability: 39 pgs inactive, 39 pgs 
>>> incomplete
>>> pg 22.2 is incomplete, acting [19,33,10,8,29] (reducing pool media 
>>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>> pg 22.f is incomplete, acting [17,9,23,14,15] (reducing pool media min_size 
>>> from 5 may help; search ceph.com/docs for 'incomplete')
>>> pg 22.12 is incomplete, acting [7,33,10,31,29] (reducing pool media 
>>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>> pg 22.13 is incomplete, acting [23,0,15,33,13] (reducing pool media 
>>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>> pg 22.23 is incomplete, acting [29,17,18,15,12] (reducing pool media 
>>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>> 
>>> ---
>>>
>>> My EC profile is below:
>>>
>>> ---
>>> root@prod1:~# ceph osd erasure-code-profile get ec-41-profile
>>> crush-device-class=
>>> crush-failure-domain=osd
>>> crush-root=default
>>> jerasure-per-chunk-alignment=false
>>> k=4
>>> m=1
>>> plugin=jerasure
>>> technique=reed_sol_van
>>> w=8
>>> ---
>>>
>>> When I query one of the incomplete PGs, I see this:
>>>
>>> ---
>>> "recovery_state": [
>>> {
>>> "name": "Started/Primary/Peering/Incomplete",
>>> "enter_time": "2018-12-11 20:46:11.645796",
>>> "comment": "not enough complete instances of this PG"
>>> },
>>> ---
>>>
>>> And this:
>>>
>>> ---
>>> "probing_osds": [
>>> "0(4)",
>>> "7(2)",
>>> "9(1)",
>>> "11(4)",
>>> "22(3)",
>>> "29(2)",
>>> "36(0)"
>>> ],
>>> "down_osds_we_would_probe": [
>>> 38
>>> ],
>>> "peering_blocked_by": []
>>> },
>>> ---
>>>
>>> I have set this in /etc/ceph/ceph.conf to no effect:
>>>osd_find_best_info_ignore_history_les = true
>>>
>>>
>>> As a result of the incomplete PGs, I/O is currently frozen to at least part 
>>> of my cephfs.
>>>
>>> I expected to be able to tolerate the loss of an OSD without issue, is 
>>> there anything I can do to restore these incomplete PGs?
>>>
>>> When I bring back a new osd38, I see:
>>> ---
>>> "probing_osds": [
>>> "4(2)",
>>> "11(3)",
>>> "22(1)",
>>> "24(1)",
>>> "26(2)",
>>> "36(4)",
>>> "38(1)",
>>> "39(0)"
>>> ],
>>> "down_osds_we_would_probe": [],
>>> "peering_blocked_by": []
>>> },
>>> {
>>> "name": "Started",
>>> 

Re: [ceph-users] Lost 1/40 OSDs at EC 4+1, now PGs are incomplete

2018-12-11 Thread Ashley Merrick
Yes, if you set it back to 5, then every time you lose an OSD you will have to
set it to 4 and let the rebuild take place before putting it back to 5.

I guess it all comes down to how important 100% uptime is versus you manually
monitoring the backfill / fixing the OSD / replacing the OSD by dropping to 4,
vs letting it happen automatically and risking a further OSD loss.

If you have the space I'd suggest going to 4+2 and migrating your data;
this would remove the ongoing issue and give you some extra data protection
from OSD loss.
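
(For reference, a minimal sketch of the min_size dance - the pool name 'media'
is taken from the health output earlier in the thread; adjust to your own pool:)

  # check the current min_size of the EC pool
  ceph osd pool get media min_size

  # temporarily allow I/O and recovery with only K chunks available
  ceph osd pool set media min_size 4

  # once backfill has finished and the PGs are active+clean again, restore it
  ceph osd pool set media min_size 5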

On Wed, Dec 12, 2018 at 11:43 AM David Young 
wrote:

> (accidentally forgot to reply to the list)
>
> Thank you, setting min_size to 4 allowed I/O again, and the 39 incomplete
> PGs are now:
>
> 39  active+undersized+degraded+remapped+backfilling
>
> Once backfilling is done, I'll increase min_size to 5 again.
>
> Am I likely to encounter this issue whenever I lose an OSD (I/O freezes
> and manually reducing min_size is required), and is there anything I should be
> doing differently?
>
> Thanks again!
> D
>
>
>
> Sent with ProtonMail  Secure Email.
>
> ‐‐‐ Original Message ‐‐‐
> On Wednesday, December 12, 2018 3:31 PM, Ashley Merrick <
> singap...@amerrick.co.uk> wrote:
>
> With EC the min size is set to K + 1.
>
> Generally EC is used with an m of 2 or more; the reason min_size is set to K + 1
> is that you are now in a state where a further OSD loss would leave some PGs
> without at least K chunks available, as you only have 1 extra m.
>
> As per the error you can get your pool back online by setting min_size to
> 4.
>
> However this would only be a temp fix while you get the OSD back online /
> rebuilt so you can go back to your 4 + 1 state.
>
> ,Ash
>
> On Wed, 12 Dec 2018 at 10:27 AM, David Young 
> wrote:
>
>> Hi all,
>>
>> I have a small 2-node cluster with 40 OSDs, using erasure coding 4+1
>>
>> I lost osd38, and now I have 39 incomplete PGs.
>>
>> ---
>> PG_AVAILABILITY Reduced data availability: 39 pgs inactive, 39 pgs
>> incomplete
>> pg 22.2 is incomplete, acting [19,33,10,8,29] (reducing pool media
>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>> pg 22.f is incomplete, acting [17,9,23,14,15] (reducing pool media
>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>> pg 22.12 is incomplete, acting [7,33,10,31,29] (reducing pool media
>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>> pg 22.13 is incomplete, acting [23,0,15,33,13] (reducing pool media
>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>> pg 22.23 is incomplete, acting [29,17,18,15,12] (reducing pool media
>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>> 
>> ---
>>
>> My EC profile is below:
>>
>> ---
>> root@prod1:~# ceph osd erasure-code-profile get ec-41-profile
>> crush-device-class=
>> crush-failure-domain=osd
>> crush-root=default
>> jerasure-per-chunk-alignment=false
>> k=4
>> m=1
>> plugin=jerasure
>> technique=reed_sol_van
>> w=8
>> ---
>>
>> When I query one of the incomplete PGs, I see this:
>>
>> ---
>> "recovery_state": [
>> {
>> "name": "Started/Primary/Peering/Incomplete",
>> "enter_time": "2018-12-11 20:46:11.645796",
>> "comment": "not enough complete instances of this PG"
>> },
>> ---
>>
>> And this:
>>
>> ---
>> "probing_osds": [
>> "0(4)",
>> "7(2)",
>> "9(1)",
>> "11(4)",
>> "22(3)",
>> "29(2)",
>> "36(0)"
>> ],
>> "down_osds_we_would_probe": [
>> 38
>> ],
>> "peering_blocked_by": []
>> },
>> ---
>>
>> I have set this in /etc/ceph/ceph.conf to no effect:
>>osd_find_best_info_ignore_history_les = true
>>
>>
>> As a result of the incomplete PGs, I/O is currently frozen to at least
>> part of my cephfs.
>>
>> I expected to be able to tolerate the loss of an OSD without issue, is
>> there anything I can do to restore these incomplete PGs?
>>
>> When I bring back a new osd38, I see:
>> ---
>> "probing_osds": [
>> "4(2)",
>> "11(3)",
>> "22(1)",
>> "24(1)",
>> "26(2)",
>> "36(4)",
>> "38(1)",
>> "39(0)"
>> ],
>> "down_osds_we_would_probe": [],
>> "peering_blocked_by": []
>> },
>> {
>> "name": "Started",
>> "enter_time": "2018-12-11 21:06:35.307379"
>> }
>> ---
>>
>> But my recovery state is still:
>>
>> ---
>> "recovery_state": [
>> {
>> "name": "Started/Primary/Peering/Incomplete",
>> "enter_time": "2018-12-11 21:06:35.320292",
>> "comment": "not enough complete instances of this PG"
>> },
>> ---
>>
>> Any ideas?
>>
>> Thanks!
>> D
>>
>> ___

Re: [ceph-users] Lost 1/40 OSDs at EC 4+1, now PGs are incomplete

2018-12-11 Thread David Young
(accidentally forgot to reply to the list)

> Thank you, setting min_size to 4 allowed I/O again, and the 39 incomplete PGs 
> are now:
>
> 39  active+undersized+degraded+remapped+backfilling
>
> Once backfilling is done, I'll increase min_size to 5 again.
>
> Am I likely to encounter this issue whenever I lose an OSD (I/O freezes and 
> manually reducing min_size is required), and is there anything I should be doing 
> differently?
>
> Thanks again!
> D
>
> Sent with [ProtonMail](https://protonmail.com) Secure Email.
>
> ‐‐‐ Original Message ‐‐‐
> On Wednesday, December 12, 2018 3:31 PM, Ashley Merrick 
>  wrote:
>
>> With EC the min size is set to K + 1.
>>
>> Generally EC is used with an m of 2 or more; the reason min_size is set to K + 1 
>> is that you are now in a state where a further OSD loss would leave some PGs 
>> without at least K chunks available, as you only have 1 extra m.
>>
>> As per the error you can get your pool back online by setting min_size to 4.
>>
>> However this would only be a temp fix while you get the OSD back online / 
>> rebuilt so you can go back to your 4 + 1 state.
>>
>> ,Ash
>>
>> On Wed, 12 Dec 2018 at 10:27 AM, David Young  
>> wrote:
>>
>>> Hi all,
>>>
>>> I have a small 2-node cluster with 40 OSDs, using erasure coding 4+1
>>>
>>> I lost osd38, and now I have 39 incomplete PGs.
>>>
>>> ---
>>> PG_AVAILABILITY Reduced data availability: 39 pgs inactive, 39 pgs 
>>> incomplete
>>> pg 22.2 is incomplete, acting [19,33,10,8,29] (reducing pool media 
>>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>> pg 22.f is incomplete, acting [17,9,23,14,15] (reducing pool media min_size 
>>> from 5 may help; search ceph.com/docs for 'incomplete')
>>> pg 22.12 is incomplete, acting [7,33,10,31,29] (reducing pool media 
>>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>> pg 22.13 is incomplete, acting [23,0,15,33,13] (reducing pool media 
>>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>> pg 22.23 is incomplete, acting [29,17,18,15,12] (reducing pool media 
>>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>> 
>>> ---
>>>
>>> My EC profile is below:
>>>
>>> ---
>>> root@prod1:~# ceph osd erasure-code-profile get ec-41-profile
>>> crush-device-class=
>>> crush-failure-domain=osd
>>> crush-root=default
>>> jerasure-per-chunk-alignment=false
>>> k=4
>>> m=1
>>> plugin=jerasure
>>> technique=reed_sol_van
>>> w=8
>>> ---
>>>
>>> When I query one of the incomplete PGs, I see this:
>>>
>>> ---
>>> "recovery_state": [
>>> {
>>> "name": "Started/Primary/Peering/Incomplete",
>>> "enter_time": "2018-12-11 20:46:11.645796",
>>> "comment": "not enough complete instances of this PG"
>>> },
>>> ---
>>>
>>> And this:
>>>
>>> ---
>>> "probing_osds": [
>>> "0(4)",
>>> "7(2)",
>>> "9(1)",
>>> "11(4)",
>>> "22(3)",
>>> "29(2)",
>>> "36(0)"
>>> ],
>>> "down_osds_we_would_probe": [
>>> 38
>>> ],
>>> "peering_blocked_by": []
>>> },
>>> ---
>>>
>>> I have set this in /etc/ceph/ceph.conf to no effect:
>>>osd_find_best_info_ignore_history_les = true
>>>
>>> As a result of the incomplete PGs, I/O is currently frozen to at least part 
>>> of my cephfs.
>>>
>>> I expected to be able to tolerate the loss of an OSD without issue, is 
>>> there anything I can do to restore these incomplete PGs?
>>>
>>> When I bring back a new osd38, I see:
>>> ---
>>> "probing_osds": [
>>> "4(2)",
>>> "11(3)",
>>> "22(1)",
>>> "24(1)",
>>> "26(2)",
>>> "36(4)",
>>> "38(1)",
>>> "39(0)"
>>> ],
>>> "down_osds_we_would_probe": [],
>>> "peering_blocked_by": []
>>> },
>>> {
>>> "name": "Started",
>>> "enter_time": "2018-12-11 21:06:35.307379"
>>> }
>>> ---
>>>
>>> But my recovery state is still:
>>>
>>> ---
>>> "recovery_state": [
>>> {
>>> "name": "Started/Primary/Peering/Incomplete",
>>> "enter_time": "2018-12-11 21:06:35.320292",
>>> "comment": "not enough complete instances of this PG"
>>> },
>>> ---
>>>
>>> Any ideas?
>>>
>>> Thanks!
>>> D
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lost 1/40 OSDs at EC 4+1, now PGs are incomplete

2018-12-11 Thread Ashley Merrick
With EC the min size is set to K + 1.

Generally EC is used with an m of 2 or more; the reason min_size is set to K + 1
is that you are now in a state where a further OSD loss would leave some PGs
without at least K chunks available, as you only have 1 extra m.

As per the error you can get your pool back online by setting min_size to 4.

However this would only be a temp fix while you get the OSD back online /
rebuilt so you can go back to your 4 + 1 state.

,Ash

On Wed, 12 Dec 2018 at 10:27 AM, David Young 
wrote:

> Hi all,
>
> I have a small 2-node cluster with 40 OSDs, using erasure coding 4+1
>
> I lost osd38, and now I have 39 incomplete PGs.
>
> ---
> PG_AVAILABILITY Reduced data availability: 39 pgs inactive, 39 pgs
> incomplete
> pg 22.2 is incomplete, acting [19,33,10,8,29] (reducing pool media
> min_size from 5 may help; search ceph.com/docs for 'incomplete')
> pg 22.f is incomplete, acting [17,9,23,14,15] (reducing pool media
> min_size from 5 may help; search ceph.com/docs for 'incomplete')
> pg 22.12 is incomplete, acting [7,33,10,31,29] (reducing pool media
> min_size from 5 may help; search ceph.com/docs for 'incomplete')
> pg 22.13 is incomplete, acting [23,0,15,33,13] (reducing pool media
> min_size from 5 may help; search ceph.com/docs for 'incomplete')
> pg 22.23 is incomplete, acting [29,17,18,15,12] (reducing pool media
> min_size from 5 may help; search ceph.com/docs for 'incomplete')
> 
> ---
>
> My EC profile is below:
>
> ---
> root@prod1:~# ceph osd erasure-code-profile get ec-41-profile
> crush-device-class=
> crush-failure-domain=osd
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=1
> plugin=jerasure
> technique=reed_sol_van
> w=8
> ---
>
> When I query one of the incomplete PGs, I see this:
>
> ---
> "recovery_state": [
> {
> "name": "Started/Primary/Peering/Incomplete",
> "enter_time": "2018-12-11 20:46:11.645796",
> "comment": "not enough complete instances of this PG"
> },
> ---
>
> And this:
>
> ---
> "probing_osds": [
> "0(4)",
> "7(2)",
> "9(1)",
> "11(4)",
> "22(3)",
> "29(2)",
> "36(0)"
> ],
> "down_osds_we_would_probe": [
> 38
> ],
> "peering_blocked_by": []
> },
> ---
>
> I have set this in /etc/ceph/ceph.conf to no effect:
>osd_find_best_info_ignore_history_les = true
>
>
> As a result of the incomplete PGs, I/O is currently frozen to at least
> part of my cephfs.
>
> I expected to be able to tolerate the loss of an OSD without issue, is
> there anything I can do to restore these incomplete PGs?
>
> When I bring back a new osd38, I see:
> ---
> "probing_osds": [
> "4(2)",
> "11(3)",
> "22(1)",
> "24(1)",
> "26(2)",
> "36(4)",
> "38(1)",
> "39(0)"
> ],
> "down_osds_we_would_probe": [],
> "peering_blocked_by": []
> },
> {
> "name": "Started",
> "enter_time": "2018-12-11 21:06:35.307379"
> }
> ---
>
> But my recovery state is still:
>
> ---
> "recovery_state": [
> {
> "name": "Started/Primary/Peering/Incomplete",
> "enter_time": "2018-12-11 21:06:35.320292",
> "comment": "not enough complete instances of this PG"
> },
> ---
>
> Any ideas?
>
> Thanks!
> D
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Lost 1/40 OSDs at EC 4+1, now PGs are incomplete

2018-12-11 Thread David Young
Hi all,

I have a small 2-node cluster with 40 OSDs, using erasure coding 4+1

I lost osd38, and now I have 39 incomplete PGs.

---
PG_AVAILABILITY Reduced data availability: 39 pgs inactive, 39 pgs incomplete
pg 22.2 is incomplete, acting [19,33,10,8,29] (reducing pool media min_size 
from 5 may help; search ceph.com/docs for 'incomplete')
pg 22.f is incomplete, acting [17,9,23,14,15] (reducing pool media min_size 
from 5 may help; search ceph.com/docs for 'incomplete')
pg 22.12 is incomplete, acting [7,33,10,31,29] (reducing pool media 
min_size from 5 may help; search ceph.com/docs for 'incomplete')
pg 22.13 is incomplete, acting [23,0,15,33,13] (reducing pool media 
min_size from 5 may help; search ceph.com/docs for 'incomplete')
pg 22.23 is incomplete, acting [29,17,18,15,12] (reducing pool media 
min_size from 5 may help; search ceph.com/docs for 'incomplete')

---

My EC profile is below:

---
root@prod1:~# ceph osd erasure-code-profile get ec-41-profile
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=1
plugin=jerasure
technique=reed_sol_van
w=8
---

When I query one of the incomplete PGs, I see this:

---
"recovery_state": [
{
"name": "Started/Primary/Peering/Incomplete",
"enter_time": "2018-12-11 20:46:11.645796",
"comment": "not enough complete instances of this PG"
},
---

And this:

---
"probing_osds": [
"0(4)",
"7(2)",
"9(1)",
"11(4)",
"22(3)",
"29(2)",
"36(0)"
],
"down_osds_we_would_probe": [
38
],
"peering_blocked_by": []
},
---

I have set this in /etc/ceph/ceph.conf to no effect:
   osd_find_best_info_ignore_history_les = true

As a result of the incomplete PGs, I/O is currently frozen to at least part of 
my cephfs.

I expected to be able to tolerate the loss of an OSD without issue, is there 
anything I can do to restore these incomplete PGs?

When I bring back a new osd38, I see:
---
"probing_osds": [
"4(2)",
"11(3)",
"22(1)",
"24(1)",
"26(2)",
"36(4)",
"38(1)",
"39(0)"
],
"down_osds_we_would_probe": [],
"peering_blocked_by": []
},
{
"name": "Started",
"enter_time": "2018-12-11 21:06:35.307379"
}
---

But my recovery state is still:

---
"recovery_state": [
{
"name": "Started/Primary/Peering/Incomplete",
"enter_time": "2018-12-11 21:06:35.320292",
"comment": "not enough complete instances of this PG"
},
---

Any ideas?

Thanks!
D___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SLOW SSD's after moving to Bluestore

2018-12-11 Thread Mark Kirkwood
Looks like the 'delaylog' option for xfs is the problem - no longer
supported in later kernels. See
https://github.com/torvalds/linux/commit/444a702231412e82fb1c09679adc159301e9242c

Offhand I'm not sure where that option is being added (whether
ceph-deploy or ceph-volume), but you could just do surgery on whichever
one is adding it...
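
(One way to sidestep it without patching the tooling - assuming your version
still honours the filestore mount-option setting; please verify that before
relying on it - is to set the XFS mount options explicitly in ceph.conf on
the OSD host:)

  [osd]
  # same options the tool was passing, minus the removed 'delaylog'
  osd mount options xfs = rw,noatime,noquota,logbsize=256k,logbufs=8,inode64,allocsize=4M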

regards

Mark 


On 12/12/18 1:33 PM, Tyler Bishop wrote:
>
>
> [osci-1001][DEBUG ] Running command: mount -t xfs -o
> 
> "rw,noatime,noquota,logbsize=256k,logbufs=8,inode64,allocsize=4M,delaylog"
> 
> /dev/ceph-7b308a5a-a8e9-48aa-86a9-39957dcbd1eb/osd-data-81522145-e31b-4325-83fd-6cfefc1b761f
> /var/lib/ceph/osd/ceph-1
>
> [osci-1001][DEBUG ]  stderr: mount: unsupported option format:
> 
> "rw,noatime,noquota,logbsize=256k,logbufs=8,inode64,allocsize=4M,delaylog"
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SLOW SSD's after moving to Bluestore

2018-12-11 Thread Tyler Bishop
>
> [root] osci-1001.infra.cin1.corp:~/cephdeploy # ceph-deploy osd create
>> --filestore --fs-type xfs --data /dev/sdb2 --journal /dev/sdb1 osci-1001
>
> [ceph_deploy.conf][DEBUG ] found configuration file at:
>> /root/.cephdeploy.conf
>
> [ceph_deploy.cli][INFO  ] Invoked (2.0.1): /usr/bin/ceph-deploy osd create
>> --filestore --fs-type xfs --data /dev/sdb2 --journal /dev/sdb1 osci-1001
>
> [ceph_deploy.cli][INFO  ] ceph-deploy options:
>
> [ceph_deploy.cli][INFO  ]  verbose   : False
>
> [ceph_deploy.cli][INFO  ]  bluestore : None
>
> [ceph_deploy.cli][INFO  ]  cd_conf   :
>> 
>
> [ceph_deploy.cli][INFO  ]  cluster   : ceph
>
> [ceph_deploy.cli][INFO  ]  fs_type   : xfs
>
> [ceph_deploy.cli][INFO  ]  block_wal : None
>
> [ceph_deploy.cli][INFO  ]  default_release   : False
>
> [ceph_deploy.cli][INFO  ]  username  : None
>
> [ceph_deploy.cli][INFO  ]  journal   : /dev/sdb1
>
> [ceph_deploy.cli][INFO  ]  subcommand: create
>
> [ceph_deploy.cli][INFO  ]  host  : osci-1001
>
> [ceph_deploy.cli][INFO  ]  filestore : True
>
> [ceph_deploy.cli][INFO  ]  func  : <function osd at 0x7fde72db0578>
>
> [ceph_deploy.cli][INFO  ]  ceph_conf : None
>
> [ceph_deploy.cli][INFO  ]  zap_disk  : False
>
> [ceph_deploy.cli][INFO  ]  data  : /dev/sdb2
>
> [ceph_deploy.cli][INFO  ]  block_db  : None
>
> [ceph_deploy.cli][INFO  ]  dmcrypt   : False
>
> [ceph_deploy.cli][INFO  ]  overwrite_conf: False
>
> [ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   :
>> /etc/ceph/dmcrypt-keys
>
> [ceph_deploy.cli][INFO  ]  quiet : False
>
> [ceph_deploy.cli][INFO  ]  debug : False
>
> [ceph_deploy.osd][DEBUG ] Creating OSD on cluster ceph with data device
>> /dev/sdb2
>
> [osci-1001][DEBUG ] connected to host: osci-1001
>
> [osci-1001][DEBUG ] detect platform information from remote host
>
> [osci-1001][DEBUG ] detect machine type
>
> [osci-1001][DEBUG ] find the location of an executable
>
> [ceph_deploy.osd][INFO  ] Distro info: CentOS Linux 7.5.1804 Core
>
> [ceph_deploy.osd][DEBUG ] Deploying osd to osci-1001
>
> [osci-1001][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
>
> [osci-1001][DEBUG ] find the location of an executable
>
> [osci-1001][INFO  ] Running command: /usr/sbin/ceph-volume --cluster ceph
>> lvm create --filestore --data /dev/sdb2 --journal /dev/sdb1
>
> [osci-1001][WARNIN] -->  RuntimeError: command returned non-zero exit
>> status: 1
>
> [osci-1001][DEBUG ] Running command: /bin/ceph-authtool --gen-print-key
>
> [osci-1001][DEBUG ] Running command: /bin/ceph --cluster ceph --name
>> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i
>> - osd new 81522145-e31b-4325-83fd-6cfefc1b761f
>
> [osci-1001][DEBUG ] Running command: vgcreate --force --yes
>> ceph-7b308a5a-a8e9-48aa-86a9-39957dcbd1eb /dev/sdb2
>
> [osci-1001][DEBUG ]  stdout: Physical volume "/dev/sdb2" successfully
>> created.
>
> [osci-1001][DEBUG ]  stdout: Volume group
>> "ceph-7b308a5a-a8e9-48aa-86a9-39957dcbd1eb" successfully created
>
> [osci-1001][DEBUG ] Running command: lvcreate --yes -l 100%FREE -n
>> osd-data-81522145-e31b-4325-83fd-6cfefc1b761f
>> ceph-7b308a5a-a8e9-48aa-86a9-39957dcbd1eb
>
> [osci-1001][DEBUG ]  stdout: Logical volume
>> "osd-data-81522145-e31b-4325-83fd-6cfefc1b761f" created.
>
> [osci-1001][DEBUG ] Running command: /bin/ceph-authtool --gen-print-key
>
> [osci-1001][DEBUG ] Running command: mkfs -t xfs -f -i size=2048
>> /dev/ceph-7b308a5a-a8e9-48aa-86a9-39957dcbd1eb/osd-data-81522145-e31b-4325-83fd-6cfefc1b761f
>
> [osci-1001][DEBUG ]  stdout:
>> meta-data=/dev/ceph-7b308a5a-a8e9-48aa-86a9-39957dcbd1eb/osd-data-81522145-e31b-4325-83fd-6cfefc1b761f
>> isize=2048   agcount=4, agsize=58239488 blks
>
> [osci-1001][DEBUG ]  =   sectsz=4096  attr=2,
>> projid32bit=1
>
> [osci-1001][DEBUG ]  =   crc=1
>> finobt=0, sparse=0
>
> [osci-1001][DEBUG ] data =   bsize=4096
>>  blocks=232957952, imaxpct=25
>
> [osci-1001][DEBUG ]  =   sunit=0  swidth=0
>> blks
>
> [osci-1001][DEBUG ] naming   =version 2  bsize=4096
>>  ascii-ci=0 ftype=1
>
> [osci-1001][DEBUG ] log  =internal log   bsize=4096
>>  blocks=113749, version=2
>
> [osci-1001][DEBUG ]  =   sectsz=4096  sunit=1
>> blks, lazy-count=1
>
> [osci-1001][DEBUG ] realtime =none   extsz=4096
>>  blocks=0, rtextents=0
>
> [osci-1001][DEBUG ] Running command: mount -t xfs -o
>> "rw,noatime,noquota,logbsize=256k,logbufs=8,inode64,allocsize=4M,delaylog"
>> /dev/cep

Re: [ceph-users] SLOW SSD's after moving to Bluestore

2018-12-11 Thread Tyler Bishop
Now I'm just trying to figure out how to create a filestore OSD in Luminous.
I've read every doc and tried every flag, but I keep ending up with
either a data LV taking 100% of the VG or a bunch of random errors about
unsupported flags...

# ceph-disk prepare --filestore --fs-type xfs --data-dev /dev/sdb1
--journal-dev /dev/sdb2 --osd-id 3
usage: ceph-disk [-h] [-v] [--log-stdout] [--prepend-to-path PATH]
 [--statedir PATH] [--sysconfdir PATH] [--setuser USER]
 [--setgroup GROUP]


{prepare,activate,activate-lockbox,activate-block,activate-journal,activate-all,list,suppress-activate,unsuppress-activate,deactivate,destroy,zap,trigger,fix}
 ...
ceph-disk: error: unrecognized arguments: /dev/sdb1
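
(For what it's worth, a sketch of the two forms that should work on Luminous -
the device paths mirror the attempt above and are placeholders. ceph-disk takes
the data and journal devices as positional arguments rather than
--data-dev/--journal-dev, and is already deprecated on Luminous in favour of
ceph-volume:)

  # ceph-disk: data and journal devices are positional
  ceph-disk prepare --filestore --fs-type xfs /dev/sdb1 /dev/sdb2

  # ceph-volume equivalent (this is what ceph-deploy invokes under the hood)
  ceph-volume lvm create --filestore --data /dev/sdb1 --journal /dev/sdb2
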
On Tue, Dec 11, 2018 at 7:22 PM Christian Balzer  wrote:
>
>
> Hello,
>
> On Tue, 11 Dec 2018 23:22:40 +0300 Igor Fedotov wrote:
>
> > Hi Tyler,
> >
> > I suspect you have BlueStore DB/WAL at these drives as well, don't you?
> >
> > Then perhaps you have performance issues with f[data]sync requests which
> > DB/WAL invoke pretty frequently.
> >
> Since he explicitly mentioned using these SSDs with filestore AND the
> journals on the same SSD I'd expect a similar impact aka piss-poor
> performance in his existing setup (the 300 other OSDs).
>
> Unless of course some bluestore is significantly more sync happy than the
> filestore journal and/or other bluestore particulars (reduced caching
> space, not caching in some situations) are rearing their ugly heads.
>
> Christian
>
> > See the following links for details:
> >
> > https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/
> >
> > https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> >
> > The latter link shows pretty poor numbers for M500DC drives.
> >
> >
> > Thanks,
> >
> > Igor
> >
> >
> > On 12/11/2018 4:58 AM, Tyler Bishop wrote:
> >
> > > Older Crucial/Micron M500/M600
> > > _
> > >
> > > *Tyler Bishop*
> > > EST 2007
> > >
> > >
> > > O:513-299-7108 x1000
> > > M:513-646-5809
> > > http://BeyondHosting.net 
> > >
> > >
> > > This email is intended only for the recipient(s) above and/or
> > > otherwise authorized personnel. The information contained herein and
> > > attached is confidential and the property of Beyond Hosting. Any
> > > unauthorized copying, forwarding, printing, and/or disclosing
> > > any information related to this email is prohibited. If you received
> > > this message in error, please contact the sender and destroy all
> > > copies of this email and any attachment(s).
> > >
> > >
> > > On Mon, Dec 10, 2018 at 8:57 PM Christian Balzer  > > > wrote:
> > >
> > > Hello,
> > >
> > > On Mon, 10 Dec 2018 20:43:40 -0500 Tyler Bishop wrote:
> > >
> > > > I don't think thats my issue here because I don't see any IO to
> > > justify the
> > > > latency.  Unless the IO is minimal and its ceph issuing a bunch
> > > of discards
> > > > to the ssd and its causing it to slow down while doing that.
> > > >
> > >
> > > What does atop have to say?
> > >
> > > Discards/Trims are usually visible in it, this is during a fstrim of a
> > > RAID1 / :
> > > ---
> > > DSK |  sdb  | busy 81% |  read   0 | write  8587
> > > | MBw/s 2323.4 |  avio 0.47 ms |
> > > DSK |  sda  | busy 70% |  read   2 | write  8587
> > > | MBw/s 2323.4 |  avio 0.41 ms |
> > > ---
> > >
> > > The numbers tend to be a lot higher than what the actual interface is
> > > capable of, clearly the SSD is reporting its internal activity.
> > >
> > > In any case, it should give a good insight of what is going on
> > > activity
> > > wise.
> > > Also for posterity and curiosity, what kind of SSDs?
> > >
> > > Christian
> > >
> > > > Log isn't showing anything useful and I have most debugging
> > > disabled.
> > > >
> > > >
> > > >
> > > > On Mon, Dec 10, 2018 at 7:43 PM Mark Nelson  > > > wrote:
> > > >
> > > > > Hi Tyler,
> > > > >
> > > > > I think we had a user a while back that reported they had
> > > background
> > > > > deletion work going on after upgrading their OSDs from
> > > filestore to
> > > > > bluestore due to PGs having been moved around.  Is it possible
> > > that your
> > > > > cluster is doing a bunch of work (deletion or otherwise)
> > > beyond the
> > > > > regular client load?  I don't remember how to check for this
> > > off the top
> > > > > of my head, but it might be something to investigate.  If
> > > that's what it
> > > > > is, we just recently added the ability to throttle background
> > > deletes:
> > > > >
> > > > > https://github.com/ceph/ceph/pull/24749
> > > > >
> > > > >
> > > > > If the logs/admin 

Re: [ceph-users] SLOW SSD's after moving to Bluestore

2018-12-11 Thread Christian Balzer

Hello,

On Tue, 11 Dec 2018 23:22:40 +0300 Igor Fedotov wrote:

> Hi Tyler,
> 
> I suspect you have BlueStore DB/WAL at these drives as well, don't you?
> 
> Then perhaps you have performance issues with f[data]sync requests which 
> DB/WAL invoke pretty frequently.
>
Since he explicitly mentioned using these SSDs with filestore AND the
journals on the same SSD I'd expect a similar impact aka piss-poor
performance in his existing setup (the 300 other OSDs).

Unless of course some bluestore is significantly more sync happy than the
filestore journal and/or other bluestore particulars (reduced caching
space, not caching in some situations) are rearing their ugly heads.
 
Christian

> See the following links for details:
> 
> https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/
> 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> 
> The latter link shows pretty poor numbers for M500DC drives.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 12/11/2018 4:58 AM, Tyler Bishop wrote:
> 
> > Older Crucial/Micron M500/M600
> > _
> >
> > *Tyler Bishop*
> > EST 2007
> >
> >
> > O:513-299-7108 x1000
> > M:513-646-5809
> > http://BeyondHosting.net 
> >
> >
> > This email is intended only for the recipient(s) above and/or 
> > otherwise authorized personnel. The information contained herein and 
> > attached is confidential and the property of Beyond Hosting. Any 
> > unauthorized copying, forwarding, printing, and/or disclosing 
> > any information related to this email is prohibited. If you received 
> > this message in error, please contact the sender and destroy all 
> > copies of this email and any attachment(s).
> >
> >
> > On Mon, Dec 10, 2018 at 8:57 PM Christian Balzer  > > wrote:
> >
> > Hello,
> >
> > On Mon, 10 Dec 2018 20:43:40 -0500 Tyler Bishop wrote:
> >  
> > > I don't think thats my issue here because I don't see any IO to  
> > justify the  
> > > latency.  Unless the IO is minimal and its ceph issuing a bunch  
> > of discards  
> > > to the ssd and its causing it to slow down while doing that.
> > >  
> >
> > What does atop have to say?
> >
> > Discards/Trims are usually visible in it, this is during a fstrim of a
> > RAID1 / :
> > ---
> > DSK |          sdb  | busy     81% |  read       0 | write  8587 
> > | MBw/s 2323.4 |  avio 0.47 ms |
> > DSK |          sda  | busy     70% |  read       2 | write  8587 
> > | MBw/s 2323.4 |  avio 0.41 ms |
> > ---
> >
> > The numbers tend to be a lot higher than what the actual interface is
> > capable of, clearly the SSD is reporting its internal activity.
> >
> > In any case, it should give a good insight of what is going on
> > activity
> > wise.
> > Also for posterity and curiosity, what kind of SSDs?
> >
> > Christian
> >  
> > > Log isn't showing anything useful and I have most debugging  
> > disabled.  
> > >
> > >
> > >
> > > On Mon, Dec 10, 2018 at 7:43 PM Mark Nelson  > > wrote:  
> > >  
> > > > Hi Tyler,
> > > >
> > > > I think we had a user a while back that reported they had  
> > background  
> > > > deletion work going on after upgrading their OSDs from  
> > filestore to  
> > > > bluestore due to PGs having been moved around.  Is it possible  
> > that your  
> > > > cluster is doing a bunch of work (deletion or otherwise)  
> > beyond the  
> > > > regular client load?  I don't remember how to check for this  
> > off the top  
> > > > of my head, but it might be something to investigate.  If  
> > that's what it  
> > > > is, we just recently added the ability to throttle background  
> > deletes:  
> > > >
> > > > https://github.com/ceph/ceph/pull/24749
> > > >
> > > >
> > > > If the logs/admin socket don't tell you anything, you could  
> > also try  
> > > > using our wallclock profiler to see what the OSD is spending  
> > it's time  
> > > > doing:
> > > >
> > > > https://github.com/markhpc/gdbpmp/
> > > >
> > > >
> > > > ./gdbpmp -t 1000 -p`pidof ceph-osd` -o foo.gdbpmp
> > > >
> > > > ./gdbpmp -i foo.gdbpmp -t 1
> > > >
> > > >
> > > > Mark
> > > >
> > > > On 12/10/18 6:09 PM, Tyler Bishop wrote:  
> > > > > Hi,
> > > > >
> > > > > I have an SSD only cluster that I recently converted from  
> > filestore to  
> > > > > bluestore and performance has totally tanked. It was fairly  
> > decent  
> > > > > before, only having a little additional latency than  
> > expected.  Now  
> > > > > since converting to bluestore the latency is extremely high,  
> > SECONDS.  
> > > > > I am trying to determine if it an issue with the SSD's or 

Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM

2018-12-11 Thread Kevin Olbrich
> > Assuming everything is on LVM including the root filesystem, only moving
> > the boot partition will have to be done outside of LVM.
>
> Since the OP mentioned MS Exchange, I assume the VM is running windows.
> You can do the same LVM-like trick in Windows Server via Disk Manager
> though; add the new ceph RBD disk to the existing data volume as a
> mirror; wait for it to sync, then break the mirror and remove the
> original disk.

Mirrors only work on dynamic disks, which are a pain to revert and
cause lots of problems with backup solutions.
I will keep this in mind, as it is still better than shutting down
the whole VM.

@all
Thank you very much for your inputs. I will try some less important
VMs and then start migration of the big one.

Kind regards
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SLOW SSD's after moving to Bluestore

2018-12-11 Thread Igor Fedotov

Hi Tyler,

I suspect you have BlueStore DB/WAL at these drives as well, don't you?

Then perhaps you have performance issues with f[data]sync requests which 
DB/WAL invoke pretty frequently.


See the following links for details:

https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

The latter link shows pretty poor numbers for M500DC drives.
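
(If you want to confirm this on your own drives, the usual sync-write test from
those posts is roughly the following - run it only against an unused device or
a scratch file, since writing to the raw device is destructive:)

  # 4k synchronous writes at queue depth 1, i.e. the journal/WAL pattern;
  # low IOPS here usually means the drive is a poor journal/DB device
  fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based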


Thanks,

Igor


On 12/11/2018 4:58 AM, Tyler Bishop wrote:


Older Crucial/Micron M500/M600
_

*Tyler Bishop*
EST 2007


O:513-299-7108 x1000
M:513-646-5809
http://BeyondHosting.net 


This email is intended only for the recipient(s) above and/or 
otherwise authorized personnel. The information contained herein and 
attached is confidential and the property of Beyond Hosting. Any 
unauthorized copying, forwarding, printing, and/or disclosing 
any information related to this email is prohibited. If you received 
this message in error, please contact the sender and destroy all 
copies of this email and any attachment(s).



On Mon, Dec 10, 2018 at 8:57 PM Christian Balzer > wrote:


Hello,

On Mon, 10 Dec 2018 20:43:40 -0500 Tyler Bishop wrote:

> I don't think thats my issue here because I don't see any IO to
justify the
> latency.  Unless the IO is minimal and its ceph issuing a bunch
of discards
> to the ssd and its causing it to slow down while doing that.
>

What does atop have to say?

Discards/Trims are usually visible in it, this is during a fstrim of a
RAID1 / :
---
DSK |          sdb  | busy     81% |  read       0 | write  8587 
| MBw/s 2323.4 |  avio 0.47 ms |
DSK |          sda  | busy     70% |  read       2 | write  8587 
| MBw/s 2323.4 |  avio 0.41 ms |
---

The numbers tend to be a lot higher than what the actual interface is
capable of, clearly the SSD is reporting its internal activity.

In any case, it should give a good insight of what is going on
activity
wise.
Also for posterity and curiosity, what kind of SSDs?

Christian

> Log isn't showing anything useful and I have most debugging
disabled.
>
>
>
> On Mon, Dec 10, 2018 at 7:43 PM Mark Nelson mailto:mnel...@redhat.com>> wrote:
>
> > Hi Tyler,
> >
> > I think we had a user a while back that reported they had
background
> > deletion work going on after upgrading their OSDs from
filestore to
> > bluestore due to PGs having been moved around.  Is it possible
that your
> > cluster is doing a bunch of work (deletion or otherwise)
beyond the
> > regular client load?  I don't remember how to check for this
off the top
> > of my head, but it might be something to investigate.  If
that's what it
> > is, we just recently added the ability to throttle background
deletes:
> >
> > https://github.com/ceph/ceph/pull/24749
> >
> >
> > If the logs/admin socket don't tell you anything, you could
also try
> > using our wallclock profiler to see what the OSD is spending
it's time
> > doing:
> >
> > https://github.com/markhpc/gdbpmp/
> >
> >
> > ./gdbpmp -t 1000 -p`pidof ceph-osd` -o foo.gdbpmp
> >
> > ./gdbpmp -i foo.gdbpmp -t 1
> >
> >
> > Mark
> >
> > On 12/10/18 6:09 PM, Tyler Bishop wrote:
> > > Hi,
> > >
> > > I have an SSD only cluster that I recently converted from
filestore to
> > > bluestore and performance has totally tanked. It was fairly
decent
> > > before, only having a little additional latency than
expected.  Now
> > > since converting to bluestore the latency is extremely high,
SECONDS.
> > > I am trying to determine if it an issue with the SSD's or
Bluestore
> > > treating them differently than filestore... potential garbage
> > > collection? 24+ hrs ???
> > >
> > > I am now seeing constant 100% IO utilization on ALL of the
devices and
> > > performance is terrible!
> > >
> > > IOSTAT
> > >
> > > avg-cpu:  %user   %nice %system %iowait %steal   %idle
> > >            1.37    0.00    0.34   18.59 0.00   79.70
> > >
> > > Device:         rrqm/s   wrqm/s     r/s     w/s rkB/s    wkB/s
> > > avgrq-sz avgqu-sz   await r_await w_await svctm  %util
> > > sda               0.00     0.00    0.00 9.50  0.00    64.00
> > > 13.47     0.01    1.16    0.00    1.16  1.11  1.05
> > > sdb               0.00    96.50    4.50   46.50 34.00 11776.00
> > >  463.14   132.68 1174.84  782.67 1212.80 19.61 100.00
> > > dm-0              0.00     0.00    5.50  128.00 44.00  8162.00
> > >  122.94   507.84 1704.93  674.09 1749.23  7.49 100.00
> > >
> > > avg-cpu:  %user   %nice %system %iowait %steal  

Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM

2018-12-11 Thread Ronny Aasen

On 11.12.2018 12:59, Kevin Olbrich wrote:

Hi!

Currently I plan a migration of a large VM (MS Exchange, 300 Mailboxes
and 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph luminous
cluster (which already holds lot's of images).
The server has access to both local and cluster-storage, I only need
to live migrate the storage, not machine.

I have never used live migration as it can cause more issues and the
VMs that are already migrated, had planned downtime.
Taking the VM offline and convert/import using qemu-img would take
some hours but I would like to still serve clients, even if it is
slower.

The VM is I/O-heavy in terms of the old storage (LSI/Adaptec with
BBU). There are two HDDs bound as RAID1 which are constantly under 30%
- 60% load (this goes up to 100% during reboot, updates or login
prime-time).

What happens when either the local compute node or the ceph cluster
fails (degraded)? Or network is unavailable?
Are all writes performed to both locations? Is this fail-safe? Or does
the VM crash in worst case, which can lead to dirty shutdown for MS-EX
DBs?


The disk stays on the source location until the migration is finalized. If 
the local compute node crashes and the VM dies with it before the 
migration is done, the disk is on the source location as expected. If 
nodes in the ceph cluster die but the cluster stays operational, ceph just 
self-heals and the migration finishes. If the cluster dies hard enough 
to actually break, the migration will time out and abort, and the disk 
remains on the source location. If the network is unavailable, the transfer 
will also time out.


good luck

Ronny Aasen




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM

2018-12-11 Thread Ronny Aasen

On 11.12.2018 17:39, Lionel Bouton wrote:

Le 11/12/2018 à 15:51, Konstantin Shalygin a écrit :



Currently I plan a migration of a large VM (MS Exchange, 300 Mailboxes
and 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph luminous
cluster (which already holds lot's of images).
The server has access to both local and cluster-storage, I only need
to live migrate the storage, not machine.

I have never used live migration as it can cause more issues and the
VMs that are already migrated, had planned downtime.
Taking the VM offline and convert/import using qemu-img would take
some hours but I would like to still serve clients, even if it is
slower.

The VM is I/O-heavy in terms of the old storage (LSI/Adaptec with
BBU). There are two HDDs bound as RAID1 which are constantly under 30%
- 60% load (this goes up to 100% during reboot, updates or login
prime-time).

What happens when either the local compute node or the ceph cluster
fails (degraded)? Or network is unavailable?
Are all writes performed to both locations? Is this fail-safe? Or does
the VM crash in worst case, which can lead to dirty shutdown for MS-EX
DBs?

The node currently has 4GB free RAM and 29GB listed as cache /
available. These numbers need caution because we have "tuned" enabled
which causes de-deplication on RAM and this host runs about 10 Windows
VMs.
During reboots or updates, RAM can get full again.

Maybe I am to cautious about live-storage-migration, maybe I am not.

What are your experiences or advices?

Thank you very much!


I was read your message two times and still can't figure out what is 
your question?


You need move your block image from some storage to Ceph? No, you 
can't do this without downtime because fs consistency.


You can easy migrate your filesystem via rsync for example, with 
small downtime for reboot VM.




I believe OP is trying to use the storage migration feature of QEMU. 
I've never tried it and I wouldn't recommend it (probably not very 
tested and there is a large window for failure).



I use the qemu storage migration feature via the proxmox webui several times a 
day, and have never had any issues.


I regularly migrate between ceph rbd, local directory, shared lvm over 
fibre channel, and nfs server backends. Super easy and convenient.
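
(For reference, the same thing from the Proxmox CLI is roughly the following -
the VM ID, disk name and target storage are placeholders, and the exact
subcommand/option names may differ between Proxmox versions, so check
'qm help move_disk' first:)

  # live-move the disk backing virtio0 of VM 100 onto the 'ceph-rbd' storage,
  # deleting the source image once the copy has converged
  qm move_disk 100 virtio0 ceph-rbd --delete 1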



Ronny Aasen


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM

2018-12-11 Thread Jack
We are using qemu storage migration regularly via proxmox

Works fine, you can go ahead.


On 12/11/2018 05:39 PM, Lionel Bouton wrote:
> 
> I believe OP is trying to use the storage migration feature of QEMU.
> I've never tried it and I wouldn't recommend it (probably not very
> tested and there is a large window for failure).
> 
> One tactic that can be used assuming OP is using LVM in the VM for
> storage is to add a Ceph volume to the VM (probably needs a reboot) add
> the corresponding virtual disk to the VM volume group and then migrate
> all data from the logical volume(s) to the new disk. LVM is using
> mirroring internally during the transfer so you get robustness by using
> it. It can be slow (especially with old kernels) but at least it is
> safe. I've done a DRBD to Ceph migration with this process 5 years ago.
> When all logical volumes are moved to the new disk you can remove the
> old disk from the volume group.
> 
> Assuming everything is on LVM including the root filesystem, only moving
> the boot partition will have to be done outside of LVM.
> 
> Best regards,
> 
> Lionel
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM

2018-12-11 Thread Graham Allan



On 12/11/2018 10:39 AM, Lionel Bouton wrote:

Le 11/12/2018 à 15:51, Konstantin Shalygin a écrit :



Currently I plan a migration of a large VM (MS Exchange, 300 Mailboxes
and 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph luminous
cluster (which already holds lot's of images).
The server has access to both local and cluster-storage, I only need
to live migrate the storage, not machine.

I have never used live migration as it can cause more issues and the
VMs that are already migrated, had planned downtime.
Taking the VM offline and convert/import using qemu-img would take
some hours but I would like to still serve clients, even if it is
slower.
I believe OP is trying to use the storage migration feature of QEMU. 
I've never tried it and I wouldn't recommend it (probably not very 
tested and there is a large window for failure).


One tactic that can be used assuming OP is using LVM in the VM for 
storage is to add a Ceph volume to the VM (probably needs a reboot) add 
the corresponding virtual disk to the VM volume group and then migrate 
all data from the logical volume(s) to the new disk. LVM is using 
mirroring internally during the transfer so you get robustness by using 
it. It can be slow (especially with old kernels) but at least it is 
safe. I've done a DRBD to Ceph migration with this process 5 years ago.
When all logical volumes are moved to the new disk you can remove the 
old disk from the volume group.


Assuming everything is on LVM including the root filesystem, only moving 
the boot partition will have to be done outside of LVM.


Since the OP mentioned MS Exchange, I assume the VM is running Windows. 
You can do the same LVM-like trick in Windows Server via Disk Manager 
though: add the new ceph RBD disk to the existing data volume as a 
mirror, wait for it to sync, then break the mirror and remove the 
original disk.


--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] civetweb segfaults

2018-12-11 Thread Casey Bodley

Hi Leon,

Are you running with a non-default value of rgw_gc_max_objs? I was able 
to reproduce this exact stack trace by setting rgw_gc_max_objs = 0; I 
can't think of any other way to get a 'Floating point exception' here.
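
If it helps to cross-check, the effective value on a running gateway can be
read back over the admin socket; a rough sketch (the socket path depends on
your rgw instance name, so treat it as a placeholder; the shipped default is
32, if I remember correctly):

ceph daemon /var/run/ceph/ceph-client.rgw.<name>.asok config get rgw_gc_max_objs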


On 12/11/18 10:31 AM, Leon Robinson wrote:

Hello, I have found a surefire way to bring down our swift gateways.

First, upload a bunch of large files, split into segments, e.g.

for i in {1..100}; do swift upload test_container -S 10485760 
CentOS-7-x86_64-GenericCloud.qcow2 --object-name 
CentOS-7-x86_64-GenericCloud.qcow2-$i; done


This creates 100 objects in test_container and 1000 or so objects in 
test_container_segments


Then, Delete them. Preferably in a ludicrous manner.

for i in $(swift list test_container); do swift delete test_container 
$i; done


What results is:

 -13> 2018-12-11 15:17:57.627655 7fc128b49700  1 -- 
172.28.196.121:0/464072497 <== osd.480 172.26.212.6:6802/2058882 1 
 osd_op_reply(11 .dir.default.1083413551.2.7 [call,call] 
v1423252'7548804 uv7548804 ondisk = 0) v8  213+0+0 (3895049453 0 
0) 0x55c98f45e9c0 con 0x55c98f4d7800
   -12> 2018-12-11 15:17:57.627827 7fc0e3ffe700  1 -- 
172.28.196.121:0/464072497 --> 172.26.221.7:6816/2366816 -- 
osd_op(unknown.0.0:12 14.110b 
14:d08c26b8:::default.1083413551.2_CentOS-7-x86_64-GenericCloud.qcow2-10%2f1532606905.440697%2f938016768%2f10485760%2f0037:head 
[cmpxattr user.rgw.idtag (25) op 1 mode 1,call rgw.obj_remove] snapc 
0=[] ondisk+write+known_if_redirected e1423252) v8 -- 0x55c98f4603c0 con 0
   -11> 2018-12-11 15:17:57.628582 7fc128348700  5 -- 
172.28.196.121:0/157062182 >> 172.26.225.9:6828/2257653 
conn(0x55c98f0eb000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH 
pgs=540 cs=1 l=1). rx osd.87 seq 2 0x55c98f4603c0 osd_op_reply(340 
obj_delete_at_hint.55 [call] v1423252'9217746 uv9217746 ondisk 
= 0) v8
   -10> 2018-12-11 15:17:57.628604 7fc128348700  1 -- 
172.28.196.121:0/157062182 <== osd.87 172.26.225.9:6828/2257653 2  
osd_op_reply(340 obj_delete_at_hint.55 [call] v1423252'9217746 
uv9217746 ondisk = 0) v8  173+0+0 (3971813511 0 0) 0x55c98f4603c0 
con 0x55c98f0eb000
-9> 2018-12-11 15:17:57.628760 7fc1017f9700  1 -- 
172.28.196.121:0/157062182 --> 172.26.225.9:6828/2257653 -- 
osd_op(unknown.0.0:341 13.4f 
13:f3db1134:::obj_delete_at_hint.55:head [call timeindex.list] 
snapc 0=[] ondisk+read+known_if_redirected e1423252) v8 -- 
0x55c98f45fa00 con 0
-8> 2018-12-11 15:17:57.629306 7fc128348700  5 -- 
172.28.196.121:0/157062182 >> 172.26.225.9:6828/2257653 
conn(0x55c98f0eb000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH 
pgs=540 cs=1 l=1). rx osd.87 seq 3 0x55c98f45fa00 osd_op_reply(341 
obj_delete_at_hint.55 [call] v0'0 uv9217746 ondisk = 0) v8
-7> 2018-12-11 15:17:57.629326 7fc128348700  1 -- 
172.28.196.121:0/157062182 <== osd.87 172.26.225.9:6828/2257653 3  
osd_op_reply(341 obj_delete_at_hint.55 [call] v0'0 uv9217746 
ondisk = 0) v8  173+0+15 (3272189389 0 2149983739) 0x55c98f45fa00 
con 0x55c98f0eb000
-6> 2018-12-11 15:17:57.629398 7fc128348700  5 -- 
172.28.196.121:0/464072497 >> 172.26.221.7:6816/2366816 
conn(0x55c98f4d6000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH 
pgs=181 cs=1 l=1). rx osd.58 seq 2 0x55c98f45fa00 osd_op_reply(12 
default.1083413551.2_CentOS-7-x86_64-GenericCloud.qcow2-10/1532606905.440697/938016768/10485760/0037 
[cmpxattr (25) op 1 mode 1,call] v1423252'743755 uv743755 ondisk = 0) v8
-5> 2018-12-11 15:17:57.629418 7fc128348700  1 -- 
172.28.196.121:0/464072497 <== osd.58 172.26.221.7:6816/2366816 2  
osd_op_reply(12 
default.1083413551.2_CentOS-7-x86_64-GenericCloud.qcow2-10/1532606905.440697/938016768/10485760/0037 
[cmpxattr (25) op 1 mode 1,call] v1423252'743755 uv743755 ondisk = 0) 
v8  290+0+0 (3763879162 0 0) 0x55c98f45fa00 con 0x55c98f4d6000
-4> 2018-12-11 15:17:57.629458 7fc1017f9700  1 -- 
172.28.196.121:0/157062182 --> 172.26.225.9:6828/2257653 -- 
osd_op(unknown.0.0:342 13.4f 
13:f3db1134:::obj_delete_at_hint.55:head [call lock.unlock] 
snapc 0=[] ondisk+write+known_if_redirected e1423252) v8 -- 
0x55c98f45fd40 con 0
-3> 2018-12-11 15:17:57.629603 7fc0e3ffe700  1 -- 
172.28.196.121:0/464072497 --> 172.26.212.6:6802/2058882 -- 
osd_op(unknown.0.0:13 15.1e0 
15:079bdcbb:::.dir.default.1083413551.2.7:head [call 
rgw.guard_bucket_resharding,call rgw.bucket_complete_op] snapc 0=[] 
ondisk+write+known_if_redirected e1423252) v8 -- 0x55c98f460700 con 0
-2> 2018-12-11 15:17:57.631312 7fc128b49700  5 -- 
172.28.196.121:0/464072497 >> 172.26.212.6:6802/2058882 
conn(0x55c98f4d7800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH 
pgs=202 cs=1 l=1). rx osd.480 seq 2 0x55c98f460700 osd_op_reply(13 
.dir.default.1083413551.2.7 [call,call] v1423252'7548805 uv7548805 
ondisk = 0) v8
-1> 2018-12-11 15:17:57.631329 7fc128b49700  1 -- 
172.28.196.121:0/464072497 <== osd.480 172.26.212.6:6802/2058882 2 
 osd_op_reply(13 .dir.d

Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM

2018-12-11 Thread Lionel Bouton
On 11/12/2018 at 15:51, Konstantin Shalygin wrote:
>
>> Currently I plan a migration of a large VM (MS Exchange, 300 Mailboxes
>> and 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph luminous
>> cluster (which already holds lot's of images).
>> The server has access to both local and cluster-storage, I only need
>> to live migrate the storage, not machine.
>>
>> I have never used live migration as it can cause more issues and the
>> VMs that are already migrated, had planned downtime.
>> Taking the VM offline and convert/import using qemu-img would take
>> some hours but I would like to still serve clients, even if it is
>> slower.
>>
>> The VM is I/O-heavy in terms of the old storage (LSI/Adaptec with
>> BBU). There are two HDDs bound as RAID1 which are constantly under 30%
>> - 60% load (this goes up to 100% during reboot, updates or login
>> prime-time).
>>
>> What happens when either the local compute node or the ceph cluster
>> fails (degraded)? Or network is unavailable?
>> Are all writes performed to both locations? Is this fail-safe? Or does
>> the VM crash in worst case, which can lead to dirty shutdown for MS-EX
>> DBs?
>>
>> The node currently has 4GB free RAM and 29GB listed as cache /
>> available. These numbers need caution because we have "tuned" enabled
>> which causes de-deplication on RAM and this host runs about 10 Windows
>> VMs.
>> During reboots or updates, RAM can get full again.
>>
>> Maybe I am to cautious about live-storage-migration, maybe I am not.
>>
>> What are your experiences or advices?
>>
>> Thank you very much!
>
> I was read your message two times and still can't figure out what is
> your question?
>
> You need move your block image from some storage to Ceph? No, you
> can't do this without downtime because fs consistency.
>
> You can easy migrate your filesystem via rsync for example, with small
> downtime for reboot VM.
>

I believe OP is trying to use the storage migration feature of QEMU.
I've never tried it and I wouldn't recommend it (probably not very
tested and there is a large window for failure).

One tactic that can be used assuming OP is using LVM in the VM for
storage is to add a Ceph volume to the VM (probably needs a reboot) add
the corresponding virtual disk to the VM volume group and then migrate
all data from the logical volume(s) to the new disk. LVM is using
mirroring internally during the transfer so you get robustness by using
it. It can be slow (especially with old kernels) but at least it is
safe. I've done a DRBD to Ceph migration with this process 5 years ago.
When all logical volumes are moved to the new disk you can remove the
old disk from the volume group.

Assuming everything is on LVM including the root filesystem, only moving
the boot partition will have to be done outside of LVM.
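
To make that a bit more concrete, here is a minimal sketch of the in-guest
LVM steps, assuming the new Ceph RBD shows up in the VM as /dev/vdb, the old
disk's LVM partition is /dev/sda2 and the volume group is called vg0 (all
three names are placeholders):

pvcreate /dev/vdb             # initialise the new virtual disk as a PV
vgextend vg0 /dev/vdb         # add it to the existing volume group
pvmove /dev/sda2 /dev/vdb     # move all extents off the old PV, online
vgreduce vg0 /dev/sda2        # drop the old PV from the volume group
pvremove /dev/sda2            # and clear its LVM label

pvmove checkpoints its progress and can be restarted after an interruption,
which is part of what makes this approach comparatively safe.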

Best regards,

Lionel


[ceph-users] civetweb segfaults

2018-12-11 Thread Leon Robinson
Hello, I have found a surefire way to bring down our swift gateways.

First, upload a bunch of large files, split into segments, e.g.

for i in {1..100}; do swift upload test_container -S 10485760 
CentOS-7-x86_64-GenericCloud.qcow2 --object-name 
CentOS-7-x86_64-GenericCloud.qcow2-$i; done

This creates 100 objects in test_container and 1000 or so objects in 
test_container_segments

Then, Delete them. Preferably in a ludicrous manner.

for i in $(swift list test_container); do swift delete test_container $i; done

What results is:

 -13> 2018-12-11 15:17:57.627655 7fc128b49700  1 -- 172.28.196.121:0/464072497 
<== osd.480 172.26.212.6:6802/2058882 1  osd_op_reply(11 
.dir.default.1083413551.2.7 [call,call] v1423252'7548804 uv7548804 ondisk = 0) 
v8  213+0+0 (3895049453 0 0) 0x55c98f45e9c0 con 0x55c98f4d7800
   -12> 2018-12-11 15:17:57.627827 7fc0e3ffe700  1 -- 
172.28.196.121:0/464072497 --> 172.26.221.7:6816/2366816 -- 
osd_op(unknown.0.0:12 14.110b 
14:d08c26b8:::default.1083413551.2_CentOS-7-x86_64-GenericCloud.qcow2-10%2f1532606905.440697%2f938016768%2f10485760%2f0037:head
 [cmpxattr user.rgw.idtag (25) op 1 mode 1,call rgw.obj_remove] snapc 0=[] 
ondisk+write+known_if_redirected e1423252) v8 -- 0x55c98f4603c0 con 0
   -11> 2018-12-11 15:17:57.628582 7fc128348700  5 -- 
172.28.196.121:0/157062182 >> 172.26.225.9:6828/2257653 conn(0x55c98f0eb000 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=540 cs=1 l=1). rx osd.87 seq 
2 0x55c98f4603c0 osd_op_reply(340 obj_delete_at_hint.55 [call] 
v1423252'9217746 uv9217746 ondisk = 0) v8
   -10> 2018-12-11 15:17:57.628604 7fc128348700  1 -- 
172.28.196.121:0/157062182 <== osd.87 172.26.225.9:6828/2257653 2  
osd_op_reply(340 obj_delete_at_hint.55 [call] v1423252'9217746 
uv9217746 ondisk = 0) v8  173+0+0 (3971813511 0 0) 0x55c98f4603c0 con 
0x55c98f0eb000
-9> 2018-12-11 15:17:57.628760 7fc1017f9700  1 -- 
172.28.196.121:0/157062182 --> 172.26.225.9:6828/2257653 -- 
osd_op(unknown.0.0:341 13.4f 13:f3db1134:::obj_delete_at_hint.55:head 
[call timeindex.list] snapc 0=[] ondisk+read+known_if_redirected e1423252) v8 
-- 0x55c98f45fa00 con 0
-8> 2018-12-11 15:17:57.629306 7fc128348700  5 -- 
172.28.196.121:0/157062182 >> 172.26.225.9:6828/2257653 conn(0x55c98f0eb000 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=540 cs=1 l=1). rx osd.87 seq 
3 0x55c98f45fa00 osd_op_reply(341 obj_delete_at_hint.55 [call] v0'0 
uv9217746 ondisk = 0) v8
-7> 2018-12-11 15:17:57.629326 7fc128348700  1 -- 
172.28.196.121:0/157062182 <== osd.87 172.26.225.9:6828/2257653 3  
osd_op_reply(341 obj_delete_at_hint.55 [call] v0'0 uv9217746 ondisk = 
0) v8  173+0+15 (3272189389 0 2149983739) 0x55c98f45fa00 con 0x55c98f0eb000
-6> 2018-12-11 15:17:57.629398 7fc128348700  5 -- 
172.28.196.121:0/464072497 >> 172.26.221.7:6816/2366816 conn(0x55c98f4d6000 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=181 cs=1 l=1). rx osd.58 seq 
2 0x55c98f45fa00 osd_op_reply(12 
default.1083413551.2_CentOS-7-x86_64-GenericCloud.qcow2-10/1532606905.440697/938016768/10485760/0037
 [cmpxattr (25) op 1 mode 1,call] v1423252'743755 uv743755 ondisk = 0) v8
-5> 2018-12-11 15:17:57.629418 7fc128348700  1 -- 
172.28.196.121:0/464072497 <== osd.58 172.26.221.7:6816/2366816 2  
osd_op_reply(12 
default.1083413551.2_CentOS-7-x86_64-GenericCloud.qcow2-10/1532606905.440697/938016768/10485760/0037
 [cmpxattr (25) op 1 mode 1,call] v1423252'743755 uv743755 ondisk = 0) v8  
290+0+0 (3763879162 0 0) 0x55c98f45fa00 con 0x55c98f4d6000
-4> 2018-12-11 15:17:57.629458 7fc1017f9700  1 -- 
172.28.196.121:0/157062182 --> 172.26.225.9:6828/2257653 -- 
osd_op(unknown.0.0:342 13.4f 13:f3db1134:::obj_delete_at_hint.55:head 
[call lock.unlock] snapc 0=[] ondisk+write+known_if_redirected e1423252) v8 -- 
0x55c98f45fd40 con 0
-3> 2018-12-11 15:17:57.629603 7fc0e3ffe700  1 -- 
172.28.196.121:0/464072497 --> 172.26.212.6:6802/2058882 -- 
osd_op(unknown.0.0:13 15.1e0 15:079bdcbb:::.dir.default.1083413551.2.7:head 
[call rgw.guard_bucket_resharding,call rgw.bucket_complete_op] snapc 0=[] 
ondisk+write+known_if_redirected e1423252) v8 -- 0x55c98f460700 con 0
-2> 2018-12-11 15:17:57.631312 7fc128b49700  5 -- 
172.28.196.121:0/464072497 >> 172.26.212.6:6802/2058882 conn(0x55c98f4d7800 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=202 cs=1 l=1). rx osd.480 seq 
2 0x55c98f460700 osd_op_reply(13 .dir.default.1083413551.2.7 [call,call] 
v1423252'7548805 uv7548805 ondisk = 0) v8
-1> 2018-12-11 15:17:57.631329 7fc128b49700  1 -- 
172.28.196.121:0/464072497 <== osd.480 172.26.212.6:6802/2058882 2  
osd_op_reply(13 .dir.default.1083413551.2.7 [call,call] v1423252'7548805 
uv7548805 ondisk = 0) v8  213+0+0 (4216487267 0 0) 0x55c98f460700 con 
0x55c98f4d7800
 0> 2018-12-11 15:17:57.631834 7fc0e3ffe700 -1 *** Caught signal (Floating 
point exception) **
 in thread 7fc0e3ffe700 thread_name:civet
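
As a side note (not a fix for the crash itself): swift can delete a whole
container, objects included, in one go, which avoids the per-object loop
above; a sketch using the container names from the example:

swift delete test_container
swift delete test_container_segments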

Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM

2018-12-11 Thread Konstantin Shalygin

Currently I plan a migration of a large VM (MS Exchange, 300 Mailboxes
and 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph luminous
cluster (which already holds lot's of images).
The server has access to both local and cluster-storage, I only need
to live migrate the storage, not machine.

I have never used live migration as it can cause more issues and the
VMs that are already migrated, had planned downtime.
Taking the VM offline and convert/import using qemu-img would take
some hours but I would like to still serve clients, even if it is
slower.

The VM is I/O-heavy in terms of the old storage (LSI/Adaptec with
BBU). There are two HDDs bound as RAID1 which are constantly under 30%
- 60% load (this goes up to 100% during reboot, updates or login
prime-time).

What happens when either the local compute node or the ceph cluster
fails (degraded)? Or network is unavailable?
Are all writes performed to both locations? Is this fail-safe? Or does
the VM crash in worst case, which can lead to dirty shutdown for MS-EX
DBs?

The node currently has 4GB free RAM and 29GB listed as cache /
available. These numbers need caution because we have "tuned" enabled
which causes de-deplication on RAM and this host runs about 10 Windows
VMs.
During reboots or updates, RAM can get full again.

Maybe I am to cautious about live-storage-migration, maybe I am not.

What are your experiences or advices?

Thank you very much!


I have read your message two times and still can't figure out what your 
question is.


You need to move your block image from some other storage to Ceph? No, you 
can't do this without downtime, because of filesystem consistency.


You can easily migrate your filesystem via rsync, for example, with a small 
downtime to reboot the VM.
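
As a rough sketch of that rsync approach (host and path are placeholders;
the final pass runs with the services stopped so the copy is consistent):

rsync -aHAX --numeric-ids /srv/ newvm:/srv/           # first pass while the old VM keeps running
# stop the application (or the old VM), then a short final pass:
rsync -aHAX --numeric-ids --delete /srv/ newvm:/srv/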





k



Re: [ceph-users] how to fix X is an unexpected clone

2018-12-11 Thread Achim Ledermüller
Hi Stefan,
Hi Everyone,

I am in a similar situation to the one you were in a year ago. During some
backfilling we removed an old snapshot, and with the next deep-scrub we
ended up with the same log message as you did.

> deep-scrub 2.61b 2:d8736536:::rbd_data.e22260238e1f29.0046d527:177f6 : is an unexpected clone

We run Luminous 12.2.10 and the snapshot 177f6 doesn't exist any more.
The unexpected clone is replicated correctly over three OSDs and is
still available in the file system.

Thanh Tran wrote[1] that moving the objects away fixes the problem. But
you wrote that deleting the objects in the file system crashes Ceph.

What exactly does "crashing" mean? Was the PG, the RBD or the whole cluster
unavailable for the clients? Or nothing at all?

I am not sure what is a good way to solve the problem:

1. Should I delete the objects in the file system with the OSDs running and
hope that Ceph will fix the rest (like Thanh Tran did)? Maybe afterwards I
have to do a remove-clone-metadata with the objectstore-tool? Will the PG
and RBD stay online, or should I plan for unavailability?

2. Should I use the ceph-objectstore-tool with remove and/or remove-
clone-metadata to delete the objects for each OSD, one after another,
so the PG can stay online?

3. Should I use the ceph-objectstore-tool with remove or remove-clone-
metadata to delete the objects with all OSDs (belonging to the PG) down?

Do you have any advice? Was your PG/RBD/Cluster unavailable during the
fix?
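
In case a concrete shape helps the discussion: the objectstore-tool
invocation from options 2/3 would look roughly like the following (all
values are placeholders, the OSD must be stopped while the tool runs, and
this is a sketch, not a recipe):

systemctl stop ceph-osd@<id>
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --pgid <pgid> <object> remove-clone-metadata <cloneid>
systemctl start ceph-osd@<id>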

Thanks,
Achim


[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023199.html




-- 
Achim Ledermüller, M. Sc.
Lead Senior Systems Engineer

NETWAYS Managed Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
Tel: +49 911 92885-0 | Fax: +49 911 92885-77
CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB25207
http://www.netways.de | achim.ledermuel...@netways.de

** Icinga Camp Berlin 2019 - March - icinga.com **
** OSDC 2019 - May - osdc.de **
** Icinga as a Service - nws.netways.de **


Re: [ceph-users] yet another deep-scrub performance topic

2018-12-11 Thread Janne Johansson
On Tue 11 Dec 2018 at 12:54, Caspar Smit wrote:
>
> On a Luminous 12.2.7 cluster these are the defaults:
> ceph daemon osd.x config show

thank you very much.


-- 
May the most significant bit of your life be positive.


[ceph-users] KVM+Ceph: Live migration of I/O-heavy VM

2018-12-11 Thread Kevin Olbrich
Hi!

Currently I plan a migration of a large VM (MS Exchange, 300 mailboxes
and a 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph Luminous
cluster (which already holds lots of images).
The server has access to both local and cluster storage; I only need
to live-migrate the storage, not the machine.

I have never used live migration, as it can cause more issues, and the
VMs that have already been migrated had planned downtime.
Taking the VM offline and converting/importing with qemu-img would take
some hours, but I would like to still serve clients, even if it is
slower.

The VM is I/O-heavy in terms of the old storage (LSI/Adaptec with
BBU). There are two HDDs bound as RAID1 which are constantly under 30%
- 60% load (this goes up to 100% during reboot, updates or login
prime-time).

What happens when either the local compute node or the ceph cluster
fails (degraded)? Or network is unavailable?
Are all writes performed to both locations? Is this fail-safe? Or does
the VM crash in the worst case, which can lead to a dirty shutdown of
the MS-EX DBs?

The node currently has 4GB free RAM and 29GB listed as cache /
available. These numbers should be taken with caution because we have
"tuned" enabled, which causes de-duplication of RAM, and this host runs
about 10 Windows VMs.
During reboots or updates, RAM can get full again.

Maybe I am too cautious about live storage migration, maybe I am not.

What are your experiences or advices?

Thank you very much!

Kind regards
Kevin


Re: [ceph-users] yet another deep-scrub performance topic

2018-12-11 Thread Caspar Smit
On a Luminous 12.2.7 cluster these are the defaults:

ceph daemon osd.x config show

   "osd_scrub_max_interval": "604800.00",
   "osd_scrub_min_interval": "86400.00",
   "osd_scrub_interval_randomize_ratio": "0.50",
   "osd_scrub_chunk_max": "25",
   "osd_scrub_chunk_min": "5",
   "osd_scrub_priority": "5",
   "osd_scrub_sleep": "0.00",
   "osd_deep_scrub_interval": "604800.00",
   "osd_deep_scrub_stride": "524288",
   "osd_disk_thread_ioprio_class": "",
   "osd_disk_thread_ioprio_priority": "-1",

You can check your differences with the defaults using:

ceph daemon osd.x config diff

Kind regards,
Caspar


On Tue 11 Dec 2018 at 12:36, Janne Johansson wrote:

> On Tue 11 Dec 2018 at 12:26, Caspar Smit wrote:
> >
> > Furthermore, presuming you are running Jewel or Luminous you can change
> some settings in ceph.conf to mitigate the deep-scrub impact:
> >
> > osd scrub max interval = 4838400
> > osd scrub min interval = 2419200
> > osd scrub interval randomize ratio = 1.0
> > osd scrub chunk max = 1
> > osd scrub chunk min = 1
> > osd scrub priority = 1
> > osd scrub sleep = 0.1
> > osd deep scrub interval = 2419200
> > osd deep scrub stride = 1048576
> > osd disk thread ioprio class = idle
> > osd disk thread ioprio priority = 7
> >
>
> It would be interesting to see what the defaults for those were, so
> one can see which go up and which go down.
>
> --
> May the most significant bit of your life be positive.
>


Re: [ceph-users] yet another deep-scrub performance topic

2018-12-11 Thread Janne Johansson
On Tue 11 Dec 2018 at 12:26, Caspar Smit wrote:
>
> Furthermore, presuming you are running Jewel or Luminous you can change some 
> settings in ceph.conf to mitigate the deep-scrub impact:
>
> osd scrub max interval = 4838400
> osd scrub min interval = 2419200
> osd scrub interval randomize ratio = 1.0
> osd scrub chunk max = 1
> osd scrub chunk min = 1
> osd scrub priority = 1
> osd scrub sleep = 0.1
> osd deep scrub interval = 2419200
> osd deep scrub stride = 1048576
> osd disk thread ioprio class = idle
> osd disk thread ioprio priority = 7
>

It would be interesting to see what the defaults for those were, so
one can see which go up and which go down.

-- 
May the most significant bit of your life be positive.


Re: [ceph-users] yet another deep-scrub performance topic

2018-12-11 Thread Caspar Smit
Furthermore, presuming you are running Jewel or Luminous you can change
some settings in ceph.conf to mitigate the deep-scrub impact:

osd scrub max interval = 4838400
osd scrub min interval = 2419200
osd scrub interval randomize ratio = 1.0
osd scrub chunk max = 1
osd scrub chunk min = 1
osd scrub priority = 1
osd scrub sleep = 0.1
osd deep scrub interval = 2419200
osd deep scrub stride = 1048576
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7
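
If you want to try these without restarting the OSDs, most of them can be
injected at runtime; a hedged sketch using the values above (note that the
ioprio settings only have an effect with the CFQ scheduler, as far as I know):

ceph tell osd.* injectargs '--osd_scrub_sleep 0.1 --osd_scrub_chunk_min 1 --osd_scrub_chunk_max 1 --osd_scrub_priority 1'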

Kind regards,
Caspar


On Mon 10 Dec 2018 at 12:06, Vladimir Prokofev wrote:

> Hello list.
>
> Deep scrub totally kills cluster performance.
> First of all, it takes several minutes to complete:
> 2018-12-09 01:39:53.857994 7f2d32fde700  0 log_channel(cluster) log [DBG]
> : 4.75 deep-scrub starts
> 2018-12-09 01:46:30.703473 7f2d32fde700  0 log_channel(cluster) log [DBG]
> : 4.75 deep-scrub ok
>
> Second, while it runs, it consumes 100% of OSD time[1]. This is on an
> ordinary 7200RPM spinner.
> While this happens, VMs cannot access their disks, and that leads to
> service interruptions.
>
> I disabled scrub and deep-scrub operations for now, and have 2 major
> questions:
>  - can I disable 'health warning' status for noscrub and nodeep-scrub? I
> thought there was a way to do this, but can't find it. I want my cluster to
> think it's healthy, so if any new 'slow requests' or anything else pops -
> it will change status to 'health warning' again;
>  - is there a way to limit deepscrub impact on disk performance, or do I
> just have to go and buy SSDs?
>
> [1] https://imgur.com/a/TKH3uda


Re: [ceph-users] yet another deep-scrub performance topic

2018-12-11 Thread Caspar Smit
Hi Vladimir,

While it is advisable to investigate why deep-scrub is killing your
performance (it's enabled for a reason) and find ways to fix that (separate
block.db SSDs for instance might help), here's a way to accommodate your
needs:

For all your 7200RPM Spinner based pools do:

ceph osd pool set <pool> noscrub true
ceph osd pool set <pool> nodeep-scrub true

Though, you might want to leave normal scrubbing on and see if that is
enough. Normal scrubbing has way less impact on performance than deep-scrub.

When you've set the above flag(s) on all your spinner pools, you can unset the
global noscrub and nodeep-scrub flags and your health warning goes away
(and scrubbing does not occur on those pools).
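
For completeness, a short sketch of clearing the global flags and checking
the per-pool ones afterwards (the pool name is a placeholder):

ceph osd unset noscrub
ceph osd unset nodeep-scrub
ceph osd pool get <pool> noscrub
ceph osd pool get <pool> nodeep-scrub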

Kind regards,
Caspar


On Mon 10 Dec 2018 at 12:06, Vladimir Prokofev wrote:

> Hello list.
>
> Deep scrub totally kills cluster performance.
> First of all, it takes several minutes to complete:
> 2018-12-09 01:39:53.857994 7f2d32fde700  0 log_channel(cluster) log [DBG]
> : 4.75 deep-scrub starts
> 2018-12-09 01:46:30.703473 7f2d32fde700  0 log_channel(cluster) log [DBG]
> : 4.75 deep-scrub ok
>
> Second, while it runs, it consumes 100% of OSD time[1]. This is on an
> ordinary 7200RPM spinner.
> While this happens, VMs cannot access their disks, and that leads to
> service interruptions.
>
> I disabled scrub and deep-scrub operations for now, and have 2 major
> questions:
>  - can I disable 'health warning' status for noscrub and nodeep-scrub? I
> thought there was a way to do this, but can't find it. I want my cluster to
> think it's healthy, so if any new 'slow requests' or anything else pops -
> it will change status to 'health warning' again;
>  - is there a way to limit deepscrub impact on disk performance, or do I
> just have to go and buy SSDs?
>
> [1] https://imgur.com/a/TKH3uda


Re: [ceph-users] move directories in cephfs

2018-12-11 Thread Marc Roos
>Moving data between pools when a file is moved to a different directory 
>is most likely problematic - for example an inode can be hard linked to 
>two different directories that are in two different pools - then what 
>happens to the file?  Unix/posix semantics don't really specify a parent 
>directory to a regular file.
>
>That being said - it would be really nice if there were a way to move an 
>inode from one pool to another transparently (with some explicit 
>command).  Perhaps locking the inode up for the duration of the move, 
>and releasing it when the move is complete (so that clients that have 
>the file open don't notice any disruptions).  Are there any plans in 
>this direction?

I do hope so too, because this would be the expected behavior for me. I ran 
into this issue accidentally because I had different permissions on the 
pools. How can I explain to a user that if they move files between two 
specific folders they should not mv but cp? Now I have to work around this 
by applying separate mounts. 
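
For readers hitting the same thing: the pool a file lands in comes from its
layout, and a plain mv only updates metadata, so the objects stay wherever
the layout pointed when the file was written. A minimal sketch of how such
per-directory layouts are usually set and checked (mount point and pool name
are placeholders):

setfattr -n ceph.dir.layout.pool -v <pool> /mnt/cephfs/somedir
getfattr -n ceph.file.layout.pool /mnt/cephfs/somedir/somefile

New files inherit the directory layout at creation time, while existing
files keep theirs - which is why cp (rewriting the data) lands it in the new
pool and mv does not.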


-Original Message-
From: Andras Pataki [mailto:apat...@flatironinstitute.org] 
Sent: 11 December 2018 00:34
To: Marc Roos; ceph; ceph-users
Subject: Re: [ceph-users] move directories in cephfs

Moving data between pools when a file is moved to a different directory 
is most likely problematic - for example an inode can be hard linked to 
two different directories that are in two different pools - then what 
happens to the file?  Unix/posix semantics don't really specify a parent 

directory to a regular file.

That being said - it would be really nice if there were a way to move an 

inode from one pool to another transparently (with some explicit 
command).  Perhaps locking the inode up for the duration of the move, 
and releasing it when the move is complete (so that clients that have 
the file open don't notice any disruptions).  Are there any plans in 
this direction?

Andras

On 12/10/18 10:55 AM, Marc Roos wrote:
>   
>
> Except if you have different pools on these directories. Then the data
> is not moved(copied), which I think should be done. This should be
> changed, because no one will expect a symlink to the old pool.
>
>
>
>
> -Original Message-
> From: Jack [mailto:c...@jack.fr.eu.org]
> Sent: 10 December 2018 15:14
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] move directories in cephfs
>
> Having the / mounted somewhere, you can simply "mv" directories around
>
> On 12/10/2018 02:59 PM, Zhenshi Zhou wrote:
>> Hi,
>>
>> Is there a way I can move sub-directories outside the directory.
>> For instance, a directory /parent contains 3 sub-directories
>> /parent/a, /parent/b, /parent/c. All these directories have huge data
>> in it. I'm gonna move /parent/b to /b. I don't want to copy the whole
>> directory outside cause it will be so slow.
>>
>> Besides, I heard about cephfs-shell early today. I'm wondering which
>> version will ceph have this command tool. My cluster is luminous
>> 12.2.5.
>>
>> Thanks
>>
>>
>>