Re: [ceph-users] Monitoring bluestore compression ratio

2017-12-04 Thread Rafał Wądołowski

Finally, I've found the command:


ceph daemon osd.1 perf dump | grep bluestore


And there you have the compressed data counters.
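
For reference, a rough sketch of how a ratio can be read off those counters
(counter names as they appear in Luminous-era BlueStore; treat this as a
sketch and check the exact names on your version):

ceph daemon osd.1 perf dump | grep bluestore_compressed
# bluestore_compressed_original  = logical bytes that went through the compressor
# bluestore_compressed_allocated = bytes actually allocated on disk for them
# compression ratio ~= bluestore_compressed_original / bluestore_compressed_allocated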

Regards,

Rafał Wądołowski


http://cloudferro.com/ 
On 04.12.2017 14:17, Rafał Wądołowski wrote:

Hi,

Is there any command or tool to show the effectiveness of bluestore
compression?

I see the difference (in ceph osd df tree) while uploading an object
to ceph, but maybe there is a friendlier method to do it.





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous, RGW bucket resharding

2017-12-04 Thread Andreas Calminder
Thanks!
Is there anything in the bug tracker about the resharding issues that
I can check, just to follow progress?

Regards,
Andreas

On 4 December 2017 at 18:57, Orit Wasserman  wrote:
> Hi Andreas,
>
> On Mon, Dec 4, 2017 at 11:26 AM, Andreas Calminder
>  wrote:
>> Hello,
>> With release 12.2.2 dynamic resharding bucket index has been disabled
>> when running a multisite environment
>> (http://tracker.ceph.com/issues/21725). Does this mean that resharding
>> of bucket indexes shouldn't be done at all, manually, while running
>> multisite as there's a risk of corruption?
>>
>
> You will need to stop the sync on the bucket before doing the
> resharding and start it again after the resharding completes.
> It will start a full sync on the bucket (it doesn't mean we copy the
> objects but we go over on all of them to check if the need to be
> synced).
> We will automate this as part of the reshard admin command in the next
> Luminous release.
>
>> Also, as dynamic bucket resharding was/is the main motivator moving to
>> Luminous (for me at least) is dynamic reshardning something that is
>> planned to be fixed for multisite environments later in the Luminous
>> life-cycle or will it be left disabled forever?
>>
>
> We are planning to enable it in Luminous time.
>
> Regards,
> Orit
>
>> Thanks!
>> /andreas
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: Question about BUG #11332

2017-12-04 Thread 许雪寒
Thanks for your reply, greg:-)

The monitor processes its requests in the main dispatch loop; however, the "PAXOS 
COMMIT" transaction is executed by another thread, MonitorDBStore::io_work, so I 
think it could be possible that they run concurrently. On the other hand, 
although the commit transaction is executed in another thread, the 
PaxosService's state, by which I mean "active" or not, is modified in the main 
dispatch loop. So we now think that it should work fine if we make the 
"send_incremental" method check whether the OSDMonitor is active and put a 
callback in the "wait_for_active" queue if it's not.

Is this right?

Thanks:-)



From: Gregory Farnum [mailto:gfar...@redhat.com]
Sent: December 5, 2017 5:48
To: 许雪寒
Cc: ceph-users@lists.ceph.com; 陈玉鹏
Subject: Re: [ceph-users] Question about BUG #11332

On Thu, Nov 23, 2017 at 1:55 AM 许雪寒  wrote:
Hi, everyone.

 We also encountered this problem: http://tracker.ceph.com/issues/11332. And we 
found that this seems to be caused by the lack of mutual exclusion between 
applying "trim" and handling subscriptions. Since "build_incremental" 
operations don't go through the "PAXOS" procedure, and applying "trim" 
contains two phases, which are modifying "mondbstore" and updating 
"cached_first_committed", there could be a chance for "send_incremental" 
operations to happen between them. What's more, "build_incremental" operations 
also contain two phases: getting "cached_first_committed" and getting the 
actual incrementals from the MonDBStore. So, if "build_incremental" does happen 
concurrently with applying "trim", it could get an outdated 
"cached_first_committed" and try to read a full map that has already been trimmed.

Is this right?

I don't think this is right. Keep in mind that the monitors are basically a 
single-threaded event-driven machine. Both trimming and building incrementals 
happen in direct response to receiving messages, in the main dispatch loop, and 
while trimming is happening the PaxosService is not readable. So it won't be 
invoking build_incremental() and they won't run concurrently.
-Greg
 

If it is, we think maybe all “READ” operations in monitor should be 
synchronized with paxos commit. Right? Should some kind of read-write locking 
mechanism be used here?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replaced a disk, first time. Quick question

2017-12-04 Thread Michael Kuriger
I've seen that before (over 100%) but I forget the cause.  At any rate, the way 
I replace disks is to first set the osd weight to 0, wait for data to 
rebalance, then down / out the osd.  I don't think ceph does any reads from a 
disk once you've marked it out so hopefully there are other copies.
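
Roughly, that sequence with the CLI looks like the sketch below (osd.17 as an
example, Luminous syntax; adjust to your own cluster):

ceph osd crush reweight osd.17 0            # drain the OSD, data rebalances away
ceph -s                                     # wait for backfill to finish
ceph osd out 17
systemctl stop ceph-osd@17
ceph osd purge 17 --yes-i-really-mean-it    # Luminous+: removes it from crush, auth and the osdmap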

Mike Kuriger
Sr. Unix Systems Engineer
T: 818-649-7235 M: 818-434-6195

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Drew 
Weaver
Sent: Monday, December 04, 2017 8:39 AM
To: 'ceph-us...@ceph.com'
Subject: [ceph-users] Replaced a disk, first time. Quick question

Howdy,

I replaced a disk today because it was marked as Predicted failure. These were 
the steps I took

ceph osd out osd17
ceph -w #waited for it to get done
systemctl stop ceph-osd@osd17
ceph osd purge osd17 --yes-i-really-mean-it
umount /var/lib/ceph/osd/ceph-osdX

I noticed that after I ran the 'osd out' command that it started moving data 
around.

19446/16764 objects degraded (115.999%) <-- I noticed that number seems odd

So then I replaced the disk
Created a new label on it
Ceph-deploy osd prepare OSD5:sdd

THIS time, it started rebuilding

40795/16764 objects degraded (243.349%) <-- Now I'm really concerned.

Perhaps I don't quite understand what the numbers are telling me, but is it 
normal for it to be rebuilding more objects than exist?

Thanks,
-Drew


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] injecting args output misleading

2017-12-04 Thread Brad Hubbard
On Tue, Dec 5, 2017 at 6:12 AM, Brady Deetz  wrote:
> I'm not sure if this is a bug where ceph incorrectly reports to the user or
> if this is just a matter of misleading language. Thought I might bring it up
> in any case.
>
> I understand that "may require restart" is fairly direct in its ambiguity,
> but this probably shouldn't be ambiguous without a good technical reason.
> But, I find "not observed" to be quite misleading. These arg injections are
> very clearly being observed.

Not in 12.2.1, at least not by the test applied here (which is the
definitive test).

https://github.com/badone/ceph/blob/8ad1a5b642f1c5db3b2d103dc9e64e3c8ad70a27/src/common/config.cc#L682-#L707

I guess "observed" is open to interpretation, but isn't just about everything?

In master "mon_allow_pool_delete" is observed courtesy of this PR and
you no longer see the warning.

https://github.com/ceph/ceph/pull/18125


> Maybe the output should be "not observed by
> 'component x', change may require restart." But, I'd still like a definitive
> yes or no for service restarts required by arg injects.
>
> I've run into this on osd args as well.
>
> Ceph Luminous 12.2.1 (CentOS 7.4.1708)
>
> [root@mon0 ceph-admin]# ceph --admin-daemon /var/run/ceph/ceph-mon.mon0.asok
> config show | grep "mon_allow_pool_delete"
> "mon_allow_pool_delete": "false",
>
> [root@mon0 ceph-admin]# ceph tell mon.0 injectargs
> '--mon_allow_pool_delete=true'
> injectargs:mon_allow_pool_delete = 'true' (not observed, change may require
> restart)
> [root@mon0 ceph-admin]# ceph tell mon.1 injectargs
> '--mon_allow_pool_delete=true'
> injectargs:mon_allow_pool_delete = 'true' (not observed, change may require
> restart)
> [root@mon0 ceph-admin]# ceph tell mon.2 injectargs
> '--mon_allow_pool_delete=true'
> injectargs:mon_allow_pool_delete = 'true' (not observed, change may require
> restart)
>
> [root@mon0 ceph-admin]# ceph --admin-daemon /var/run/ceph/ceph-mon.mon0.asok
> config show | grep "mon_allow_pool_delete"
> "mon_allow_pool_delete": "true",
>
> [root@mon0 ceph-admin]# ceph tell mon.0 injectargs
> '--mon_allow_pool_delete=false'
> injectargs:mon_allow_pool_delete = 'false' (not observed, change may require
> restart)
> [root@mon0 ceph-admin]# ceph tell mon.1 injectargs
> '--mon_allow_pool_delete=false'
> injectargs:mon_allow_pool_delete = 'false' (not observed, change may require
> restart)
> [root@mon0 ceph-admin]# ceph tell mon.2 injectargs
> '--mon_allow_pool_delete=false'
> injectargs:mon_allow_pool_delete = 'false' (not observed, change may require
> restart)
>
> [root@mon0 ceph-admin]# ceph --admin-daemon /var/run/ceph/ceph-mon.mon0.asok
> config show | grep "mon_allow_pool_delete"
> "mon_allow_pool_delete": "false",
>
> Thanks for the hard work, devs!
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] tcmu-runner failing during image creation

2017-12-04 Thread Brady Deetz
I thought I was good to go with tcmu-runner on Kernel 4.14, but I guess
not? Any thoughts on the output below?

2017-12-04 17:44:09,631ERROR [rbd-target-api:665:_disk()] - LUN alloc
problem - Could not set LIO device attribute cmd_time_out/qfull_time_out
for device: iscsi-primary.primary00. Kernel not supported. - error(Cannot
find attribute: qfull_time_out)


[root@dc1srviscsi01 ~]# uname -a
Linux dc1srviscsi01.ceph.xxx.xxx 4.14.3-1.el7.elrepo.x86_64 #1 SMP Thu Nov
30 09:35:20 EST 2017 x86_64 x86_64 x86_64 GNU/Linux

ceph-iscsi-cli/
[root@dc1srviscsi01 ceph-iscsi-cli]# git branch
* (detached from 2.5)
  master

ceph-iscsi-config/
[root@dc1srviscsi01 ceph-iscsi-config]# git branch
* (detached from 2.3)
  master

rtslib-fb/
[root@dc1srviscsi01 rtslib-fb]# git branch
* (detached from v2.1.fb64)
  master

targetcli-fb/
[root@dc1srviscsi01 targetcli-fb]# git branch
* (detached from v2.1.fb47)
  master

tcmu-runner/
[root@dc1srviscsi01 tcmu-runner]# git branch
* (detached from v1.3.0-rc4)
  master
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding multiple OSD

2017-12-04 Thread Karun Josy
Thank you for detailed explanation!

Got another doubt,

This is the total space available in the cluster :

TOTAL : 23490G
Used  : 10170G
Avail : 13320G


But ecpool shows max avail as just 3 TB. What am I missing?

==


$ ceph df
GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
23490G 13338G   10151G 43.22
POOLS:
NAME        ID USED  %USED MAX AVAIL OBJECTS
ostemplates 1   162G  2.79 1134G   42084
imagepool   34  122G  2.11 1891G   34196
cvm154  8058 0 1891G 950
ecpool1 55 4246G 42.77 3546G 1232590


$ ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
 0   ssd 1.86469  1.0  1909G   625G  1284G 32.76 0.76 201
 1   ssd 1.86469  1.0  1909G   691G  1217G 36.23 0.84 208
 2   ssd 0.87320  1.0   894G   587G   306G 65.67 1.52 156
11   ssd 0.87320  1.0   894G   631G   262G 70.68 1.63 186
 3   ssd 0.87320  1.0   894G   605G   288G 67.73 1.56 165
14   ssd 0.87320  1.0   894G   635G   258G 71.07 1.64 177
 4   ssd 0.87320  1.0   894G   419G   474G 46.93 1.08 127
15   ssd 0.87320  1.0   894G   373G   521G 41.73 0.96 114
16   ssd 0.87320  1.0   894G   492G   401G 55.10 1.27 149
 5   ssd 0.87320  1.0   894G   288G   605G 32.25 0.74  87
 6   ssd 0.87320  1.0   894G   342G   551G 38.28 0.88 102
 7   ssd 0.87320  1.0   894G   300G   593G 33.61 0.78  93
22   ssd 0.87320  1.0   894G   343G   550G 38.43 0.89 104
 8   ssd 0.87320  1.0   894G   267G   626G 29.90 0.69  77
 9   ssd 0.87320  1.0   894G   376G   518G 42.06 0.97 118
10   ssd 0.87320  1.0   894G   322G   571G 36.12 0.83 102
19   ssd 0.87320  1.0   894G   339G   554G 37.95 0.88 109
12   ssd 0.87320  1.0   894G   360G   534G 40.26 0.93 112
13   ssd 0.87320  1.0   894G   404G   489G 45.21 1.04 120
20   ssd 0.87320  1.0   894G   342G   551G 38.29 0.88 103
23   ssd 0.87320  1.0   894G   148G   745G 16.65 0.38  61
17   ssd 0.87320  1.0   894G   423G   470G 47.34 1.09 117
18   ssd 0.87320  1.0   894G   403G   490G 45.18 1.04 120
21   ssd 0.87320  1.0   894G   444G   450G 49.67 1.15 130
TOTAL 23490G 10170G 13320G 43.30



Karun Josy

On Tue, Dec 5, 2017 at 4:42 AM, Karun Josy  wrote:

> Thank you for detailed explanation!
>
> Got one another doubt,
>
> This is the total space available in the cluster :
>
> TOTAL 23490G
> Use 10170G
> Avail : 13320G
>
>
> But ecpool shows max avail as just 3 TB.
>
>
>
> Karun Josy
>
> On Tue, Dec 5, 2017 at 1:06 AM, David Turner 
> wrote:
>
>> No, I would only add disks to 1 failure domain at a time.  So in your
>> situation where you're adding 2 more disks to each node, I would recommend
>> adding the 2 disks into 1 node at a time.  Your failure domain is the
>> crush-failure-domain=host.  So you can lose a host and only lose 1 copy of
>> the data.  If all of your pools are using the k=5 m=3 profile, then I would
>> say it's fine to add the disks into 2 nodes at a time.  If you have any
>> replica pools for RGW metadata or anything, then I would stick with the 1
>> host at a time.
>>
>> On Mon, Dec 4, 2017 at 2:29 PM Karun Josy  wrote:
>>
>>> Thanks for your reply!
>>>
>>> I am using erasure coded profile with k=5, m=3 settings
>>>
>>> $ ceph osd erasure-code-profile get profile5by3
>>> crush-device-class=
>>> crush-failure-domain=host
>>> crush-root=default
>>> jerasure-per-chunk-alignment=false
>>> k=5
>>> m=3
>>> plugin=jerasure
>>> technique=reed_sol_van
>>> w=8
>>>
>>>
>>> Cluster has 8 nodes, with 3 disks each. We are planning to add 2 more on
>>> each nodes.
>>>
>>> If I understand correctly, then I can add 3 disks at once right ,
>>> assuming 3 disks can fail at a time as per the ec code profile.
>>>
>>> Karun Josy
>>>
>>> On Tue, Dec 5, 2017 at 12:06 AM, David Turner 
>>> wrote:
>>>
 Depending on how well you burn-in/test your new disks, I like to only
 add 1 failure domain of disks at a time in case you have bad disks that
 you're adding.  If you are confident that your disks aren't likely to fail
 during the backfilling, then you can go with more.  I just added 8 servers
 (16 OSDs each) to a cluster with 15 servers (16 OSDs each) all at the same
 time, but we spent 2 weeks testing the hardware before adding the new nodes
 to the cluster.

 If you add 1 failure domain at a time, then any DoA disks in the new
 nodes will only be able to fail with 1 copy of your data instead of across
 multiple nodes.

 On Mon, Dec 4, 2017 at 12:54 PM Karun Josy 
 wrote:

> Hi,
>
> Is it recommended to add OSD disks one by one or can I add couple of
> disks at a time ?
>
> Current cluster size is about 4 TB.
>
>
>

Re: [ceph-users] Adding multiple OSD

2017-12-04 Thread Karun Josy
Thank you for detailed explanation!

Got another doubt,

This is the total space available in the cluster :

TOTAL 23490G
Use 10170G
Avail : 13320G


But ecpool shows max avail as just 3 TB.



Karun Josy

On Tue, Dec 5, 2017 at 1:06 AM, David Turner  wrote:

> No, I would only add disks to 1 failure domain at a time.  So in your
> situation where you're adding 2 more disks to each node, I would recommend
> adding the 2 disks into 1 node at a time.  Your failure domain is the
> crush-failure-domain=host.  So you can lose a host and only lose 1 copy of
> the data.  If all of your pools are using the k=5 m=3 profile, then I would
> say it's fine to add the disks into 2 nodes at a time.  If you have any
> replica pools for RGW metadata or anything, then I would stick with the 1
> host at a time.
>
> On Mon, Dec 4, 2017 at 2:29 PM Karun Josy  wrote:
>
>> Thanks for your reply!
>>
>> I am using erasure coded profile with k=5, m=3 settings
>>
>> $ ceph osd erasure-code-profile get profile5by3
>> crush-device-class=
>> crush-failure-domain=host
>> crush-root=default
>> jerasure-per-chunk-alignment=false
>> k=5
>> m=3
>> plugin=jerasure
>> technique=reed_sol_van
>> w=8
>>
>>
>> Cluster has 8 nodes, with 3 disks each. We are planning to add 2 more on
>> each nodes.
>>
>> If I understand correctly, then I can add 3 disks at once right ,
>> assuming 3 disks can fail at a time as per the ec code profile.
>>
>> Karun Josy
>>
>> On Tue, Dec 5, 2017 at 12:06 AM, David Turner 
>> wrote:
>>
>>> Depending on how well you burn-in/test your new disks, I like to only
>>> add 1 failure domain of disks at a time in case you have bad disks that
>>> you're adding.  If you are confident that your disks aren't likely to fail
>>> during the backfilling, then you can go with more.  I just added 8 servers
>>> (16 OSDs each) to a cluster with 15 servers (16 OSDs each) all at the same
>>> time, but we spent 2 weeks testing the hardware before adding the new nodes
>>> to the cluster.
>>>
>>> If you add 1 failure domain at a time, then any DoA disks in the new
>>> nodes will only be able to fail with 1 copy of your data instead of across
>>> multiple nodes.
>>>
>>> On Mon, Dec 4, 2017 at 12:54 PM Karun Josy  wrote:
>>>
 Hi,

 Is it recommended to add OSD disks one by one or can I add couple of
 disks at a time ?

 Current cluster size is about 4 TB.



 Karun
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about BUG #11332

2017-12-04 Thread Gregory Farnum
On Thu, Nov 23, 2017 at 1:55 AM 许雪寒  wrote:

> Hi, everyone.
>
>  We also encountered this problem: http://tracker.ceph.com/issues/11332.
> And we found that this seems to be caused by the lack of mutual exclusion
> between applying "trim" and handling subscriptions. Since
> "build_incremental" operations doesn't go through the "PAXOS" procedure,
> and applying "trim" contains two phases, which are modifying "mondbstore"
> and updating "cached_first_committed", there could be a chance for
> "send_incremental" operations to happen between them. What's more,
> "build_incremental" operations also contain two phases, getting
> "cached_first_committed" and getting actual incrementals for MonDBStore.
> So, if "build_incremental" do happens concurrently with applying "trim", it
> could get an out-dated "cached_first_committed" and try to read a full map
> whose already trimmed.
>
> Is this right?
>

I don't think this is right. Keep in mind that the monitors are basically a
single-threaded event-driven machine. Both trimming and building
incrementals happen in direct response to receiving messages, in the main
dispatch loop, and while trimming is happening the PaxosService is not
readable. So it won't be invoking build_incremental() and they won't run
concurrently.
-Greg


>
> If it is, we think maybe all “READ” operations in monitor should be
> synchronized with paxos commit. Right? Should some kind of read-write
> locking mechanism be used here?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] injecting args output misleading

2017-12-04 Thread Gregory Farnum
On Mon, Dec 4, 2017 at 12:12 PM Brady Deetz  wrote:

> I'm not sure if this is a bug where ceph incorrectly reports to the user
> or if this is just a matter of misleading language. Thought I might bring
> it up in any case.
>
> I understand that "may require restart" is fairly direct in its
> ambiguity, but this probably shouldn't be ambiguous without a good
> technical reason. But, I find "not observed" to be quite misleading. These
> arg injections are very clearly being observed. Maybe the output should be
> "not observed by 'component x', change may require restart." But, I'd still
> like a definitive yes or no for service restarts required by arg injects.
>

You're right, but unfortunately there is a good technical reason. Or at
least one of effort versus payoff.

We are slowly (or, lately, somewhat rapidly) changing things in
configuration management within Ceph, but the historical pattern is that we
have a giant struct of variables, and when we want a new config option we
add a new variable. These are initialized on startup.

Initially, whenever we wanted to access the variables, we just read them
directly out of the config struct. Unfortunately, this could be racy in the
case where we changed configs while running, and some configs can't be
changed live. So we added an "observer" framework, which lets us call
specific functions whenever specified config options get changed, and a set
of thread-safe accessor functions.

Then we saw people injecting options we know don't take live effect, so we
wanted to add a warning. (For instance, you can't change the number of
worker shards in the OSD once it's running.) Obviously, we can assume
programmatically that any configuration option with an observer will take
live effect. But what about places where we still read the value directly
out of the config struct while running? Those *do* take effect immediately,
but there's no programmatic way to identify them.

We've talked in the past about adding an extra flag which indicates if the
config is a live-changed one or not, but it's run up against issues of
difficulty and the effort required to track down every option. (Adding yet
another "this might not be registered" statement won't help us much, so it
basically needs to be done monolithically.) John recently reworked the
config options framework and it may be easier to make these changes now,
but the effort of auditing every single option remains. :/

Although it could be that even if we don't know for sure, being able to say
with certainty that the ten most-common unobserved options *will* take live
effect is useful enough on its own. Patches welcome! ;)
-Greg


>
> I've run into this on osd args as well.
>
> Ceph Luminous 12.2.1 (CentOS 7.4.1708)
>
> [root@mon0 ceph-admin]# ceph --admin-daemon
> /var/run/ceph/ceph-mon.mon0.asok config show | grep "mon_allow_pool_delete"
> "mon_allow_pool_delete": "false",
>
> [root@mon0 ceph-admin]# ceph tell mon.0 injectargs
> '--mon_allow_pool_delete=true'
> injectargs:mon_allow_pool_delete = 'true' (not observed, change may
> require restart)
> [root@mon0 ceph-admin]# ceph tell mon.1 injectargs
> '--mon_allow_pool_delete=true'
> injectargs:mon_allow_pool_delete = 'true' (not observed, change may
> require restart)
> [root@mon0 ceph-admin]# ceph tell mon.2 injectargs
> '--mon_allow_pool_delete=true'
> injectargs:mon_allow_pool_delete = 'true' (not observed, change may
> require restart)
>
> [root@mon0 ceph-admin]# ceph --admin-daemon
> /var/run/ceph/ceph-mon.mon0.asok config show | grep "mon_allow_pool_delete"
> "mon_allow_pool_delete": "true",
>
> [root@mon0 ceph-admin]# ceph tell mon.0 injectargs
> '--mon_allow_pool_delete=false'
> injectargs:mon_allow_pool_delete = 'false' (not observed, change may
> require restart)
> [root@mon0 ceph-admin]# ceph tell mon.1 injectargs
> '--mon_allow_pool_delete=false'
> injectargs:mon_allow_pool_delete = 'false' (not observed, change may
> require restart)
> [root@mon0 ceph-admin]# ceph tell mon.2 injectargs
> '--mon_allow_pool_delete=false'
> injectargs:mon_allow_pool_delete = 'false' (not observed, change may
> require restart)
>
> [root@mon0 ceph-admin]# ceph --admin-daemon
> /var/run/ceph/ceph-mon.mon0.asok config show | grep "mon_allow_pool_delete"
> "mon_allow_pool_delete": "false",
>
> Thanks for the hard work, devs!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] injecting args output misleading

2017-12-04 Thread Brady Deetz
I'm not sure if this is a bug where ceph incorrectly reports to the user or
if this is just a matter of misleading language. Thought I might bring it
up in any case.

I understand that "may require restart" is fairly direct in its ambiguity,
but this probably shouldn't be ambiguous without a good technical reason.
But, I find "not observed" to be quite misleading. These arg injections are
very clearly being observed. Maybe the output should be "not observed by
'component x', change may require restart." But, I'd still like a
definitive yes or no for service restarts required by arg injects.

I've run into this on osd args as well.

Ceph Luminous 12.2.1 (CentOS 7.4.1708)

[root@mon0 ceph-admin]# ceph --admin-daemon
/var/run/ceph/ceph-mon.mon0.asok config show | grep "mon_allow_pool_delete"
"mon_allow_pool_delete": "false",

[root@mon0 ceph-admin]# ceph tell mon.0 injectargs
'--mon_allow_pool_delete=true'
injectargs:mon_allow_pool_delete = 'true' (not observed, change may require
restart)
[root@mon0 ceph-admin]# ceph tell mon.1 injectargs
'--mon_allow_pool_delete=true'
injectargs:mon_allow_pool_delete = 'true' (not observed, change may require
restart)
[root@mon0 ceph-admin]# ceph tell mon.2 injectargs
'--mon_allow_pool_delete=true'
injectargs:mon_allow_pool_delete = 'true' (not observed, change may require
restart)

[root@mon0 ceph-admin]# ceph --admin-daemon
/var/run/ceph/ceph-mon.mon0.asok config show | grep "mon_allow_pool_delete"
"mon_allow_pool_delete": "true",

[root@mon0 ceph-admin]# ceph tell mon.0 injectargs
'--mon_allow_pool_delete=false'
injectargs:mon_allow_pool_delete = 'false' (not observed, change may
require restart)
[root@mon0 ceph-admin]# ceph tell mon.1 injectargs
'--mon_allow_pool_delete=false'
injectargs:mon_allow_pool_delete = 'false' (not observed, change may
require restart)
[root@mon0 ceph-admin]# ceph tell mon.2 injectargs
'--mon_allow_pool_delete=false'
injectargs:mon_allow_pool_delete = 'false' (not observed, change may
require restart)

[root@mon0 ceph-admin]# ceph --admin-daemon
/var/run/ceph/ceph-mon.mon0.asok config show | grep "mon_allow_pool_delete"
"mon_allow_pool_delete": "false",

Thanks for the hard work, devs!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP with some basics please

2017-12-04 Thread Denes Dolhay

Yep, you are correct, thanks!


On 12/04/2017 07:31 PM, David Turner wrote:
"The journals can only be moved back by a complete rebuild of that osd 
as to my knowledge."


I'm assuming that since this is a cluster that he's inherited and that 
it's configured like this that it's probably not running luminous or 
bluestore OSDs. Again more information needed about your cluster and 
hardware.  On filestore OSDs you can easily replace/migrate a journal.


On Mon, Dec 4, 2017 at 1:18 PM tim taler wrote:


    > In size=2 losing any 2 discs on different hosts would probably cause data to
    > be unavailable / lost, as the pg copys are randomly distribbuted across the
    > osds. Chances are, that you can find a pg which's acting group is the two
    > failed osd (you lost all your replicas)

okay I see, getting clearer at least ;-)
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP with some basics please

2017-12-04 Thread Ronny Aasen

On 04.12.2017 19:18, tim taler wrote:

In size=2 losing any 2 discs on different hosts would probably cause data to
be unavailable / lost, as the pg copys are randomly distribbuted across the
osds. Chances are, that you can find a pg which's acting group is the two
failed osd (you lost all your replicas)

okay I see, getting clearer at least ;-)



You can also consider running size=2, min_size=2 while restructuring.
It will block your problematic PGs if there is a failure, until the 
rebuild/rebalance is done. But it should be a bit more resistant to full 
cluster loss and/or corruption.


Basically it means: if there are fewer than 2 copies, do not accept writes.
Whether you want to do this depends on your requirements: is it a bigger 
disaster to be unavailable for a while, or to have to restore from backup?
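
For example, a sketch of the pool setting involved (replace <pool> with the
pool name):

ceph osd pool set <pool> min_size 2    # block I/O when only one replica is left
ceph osd pool set <pool> min_size 1    # revert once the restructuring is done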


kind regards
Ronny Aasen


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding multiple OSD

2017-12-04 Thread David Turner
No, I would only add disks to 1 failure domain at a time.  So in your
situation where you're adding 2 more disks to each node, I would recommend
adding the 2 disks into 1 node at a time.  Your failure domain is the
crush-failure-domain=host.  So you can lose a host and only lose 1 copy of
the data.  If all of your pools are using the k=5 m=3 profile, then I would
say it's fine to add the disks into 2 nodes at a time.  If you have any
replica pools for RGW metadata or anything, then I would stick with the 1
host at a time.

On Mon, Dec 4, 2017 at 2:29 PM Karun Josy  wrote:

> Thanks for your reply!
>
> I am using erasure coded profile with k=5, m=3 settings
>
> $ ceph osd erasure-code-profile get profile5by3
> crush-device-class=
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=5
> m=3
> plugin=jerasure
> technique=reed_sol_van
> w=8
>
>
> Cluster has 8 nodes, with 3 disks each. We are planning to add 2 more on
> each nodes.
>
> If I understand correctly, then I can add 3 disks at once right , assuming
> 3 disks can fail at a time as per the ec code profile.
>
> Karun Josy
>
> On Tue, Dec 5, 2017 at 12:06 AM, David Turner 
> wrote:
>
>> Depending on how well you burn-in/test your new disks, I like to only add
>> 1 failure domain of disks at a time in case you have bad disks that you're
>> adding.  If you are confident that your disks aren't likely to fail during
>> the backfilling, then you can go with more.  I just added 8 servers (16
>> OSDs each) to a cluster with 15 servers (16 OSDs each) all at the same
>> time, but we spent 2 weeks testing the hardware before adding the new nodes
>> to the cluster.
>>
>> If you add 1 failure domain at a time, then any DoA disks in the new
>> nodes will only be able to fail with 1 copy of your data instead of across
>> multiple nodes.
>>
>> On Mon, Dec 4, 2017 at 12:54 PM Karun Josy  wrote:
>>
>>> Hi,
>>>
>>> Is it recommended to add OSD disks one by one or can I add couple of
>>> disks at a time ?
>>>
>>> Current cluster size is about 4 TB.
>>>
>>>
>>>
>>> Karun
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding multiple OSD

2017-12-04 Thread Karun Josy
Thanks for your reply!

I am using erasure coded profile with k=5, m=3 settings

$ ceph osd erasure-code-profile get profile5by3
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=3
plugin=jerasure
technique=reed_sol_van
w=8


Cluster has 8 nodes, with 3 disks each. We are planning to add 2 more on
each node.

If I understand correctly, then I can add 3 disks at once, right, assuming
3 disks can fail at a time as per the EC profile.

Karun Josy

On Tue, Dec 5, 2017 at 12:06 AM, David Turner  wrote:

> Depending on how well you burn-in/test your new disks, I like to only add
> 1 failure domain of disks at a time in case you have bad disks that you're
> adding.  If you are confident that your disks aren't likely to fail during
> the backfilling, then you can go with more.  I just added 8 servers (16
> OSDs each) to a cluster with 15 servers (16 OSDs each) all at the same
> time, but we spent 2 weeks testing the hardware before adding the new nodes
> to the cluster.
>
> If you add 1 failure domain at a time, then any DoA disks in the new nodes
> will only be able to fail with 1 copy of your data instead of across
> multiple nodes.
>
> On Mon, Dec 4, 2017 at 12:54 PM Karun Josy  wrote:
>
>> Hi,
>>
>> Is it recommended to add OSD disks one by one or can I add couple of
>> disks at a time ?
>>
>> Current cluster size is about 4 TB.
>>
>>
>>
>> Karun
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] luminous 12.2.2 traceback (ceph fs status)

2017-12-04 Thread German Anders
Hi,

I just upgraded a ceph cluster from version 12.2.0 (rc) to 12.2.2 (stable),
and I'm getting a traceback while trying to run:

*# ceph fs status*

Error EINVAL: Traceback (most recent call last):
  File "/usr/lib/ceph/mgr/status/module.py", line 301, in handle_command
return self.handle_fs_status(cmd)
  File "/usr/lib/ceph/mgr/status/module.py", line 219, in handle_fs_status
stats = pool_stats[pool_id]
KeyError: (15L,)


*# ceph fs ls*
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]


Any ideas?

Thanks in advance,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding multiple OSD

2017-12-04 Thread David Turner
Depending on how well you burn-in/test your new disks, I like to only add 1
failure domain of disks at a time in case you have bad disks that you're
adding.  If you are confident that your disks aren't likely to fail during
the backfilling, then you can go with more.  I just added 8 servers (16
OSDs each) to a cluster with 15 servers (16 OSDs each) all at the same
time, but we spent 2 weeks testing the hardware before adding the new nodes
to the cluster.

If you add 1 failure domain at a time, then any DoA disks in the new nodes
will only be able to fail with 1 copy of your data instead of across
multiple nodes.
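
For what it's worth, a sketch of one common burn-in (destructive, only on
empty disks; /dev/sdX is a placeholder):

smartctl -t long /dev/sdX     # extended SMART self-test; check results with smartctl -a
badblocks -wsv /dev/sdX       # full write/read pass over the whole disk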

On Mon, Dec 4, 2017 at 12:54 PM Karun Josy  wrote:

> Hi,
>
> Is it recommended to add OSD disks one by one or can I add couple of disks
> at a time ?
>
> Current cluster size is about 4 TB.
>
>
>
> Karun
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP with some basics please

2017-12-04 Thread David Turner
"The journals can only be moved back by a complete rebuild of that osd as to
my knowledge."

I'm assuming that since this is a cluster that he's inherited and that it's
configured like this that it's probably not running luminous or bluestore
OSDs. Again more information needed about your cluster and hardware.  On
filestore OSDs you can easily replace/migrate a journal.
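
For reference, a sketch of a filestore journal move (osd.2 and the paths are
placeholders; flush only while the OSD is stopped):

systemctl stop ceph-osd@2
ceph-osd -i 2 --flush-journal
# repoint the journal symlink at the new device/partition, e.g.:
ln -sf /dev/disk/by-partuuid/<new-part-uuid> /var/lib/ceph/osd/ceph-2/journal
ceph-osd -i 2 --mkjournal
systemctl start ceph-osd@2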

On Mon, Dec 4, 2017 at 1:18 PM tim taler  wrote:

> > In size=2 losing any 2 discs on different hosts would probably cause data to
> > be unavailable / lost, as the pg copys are randomly distribbuted across the
> > osds. Chances are, that you can find a pg which's acting group is the two
> > failed osd (you lost all your replicas)
>
> okay I see, getting clearer at least ;-)
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Any way to get around selinux-policy-base dependency

2017-12-04 Thread Bryan Banister
Hi all,

I would like to upgrade to the latest Luminous release but found that it 
requires the absolute latest selinux-policy-base.  We aren't using selinux, so 
was wondering if there is a way around this dependency requirement?

[carf-ceph-osd15][WARNIN] Error: Package: 2:ceph-selinux-12.2.2-0.el7.x86_64 
(ceph)
[carf-ceph-osd15][WARNIN]Requires: selinux-policy-base >= 
3.13.1-166.el7_4.5
[carf-ceph-osd15][WARNIN]Installed: 
selinux-policy-targeted-3.13.1-102.el7_3.13.noarch 
(@rhel7.3-rhn-server-production/7.3)

Thanks for any help!
-Bryan



Note: This email is for the confidential use of the named addressee(s) only and 
may contain proprietary, confidential or privileged information. If you are not 
the intended recipient, you are hereby notified that any review, dissemination 
or copying of this email is strictly prohibited, and to please notify the 
sender immediately and destroy this email and any attachments. Email 
transmission cannot be guaranteed to be secure or error-free. The Company, 
therefore, does not make any guarantees as to the completeness or accuracy of 
this email or any attachments. This email is for informational purposes only 
and does not constitute a recommendation, offer, request or solicitation of any 
kind to buy, sell, subscribe, redeem or perform any type of transaction of a 
financial product.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP with some basics please

2017-12-04 Thread tim taler
> In size=2 losing any 2 discs on different hosts would probably cause data to
> be unavailable / lost, as the pg copys are randomly distribbuted across the
> osds. Chances are, that you can find a pg which's acting group is the two
> failed osd (you lost all your replicas)

okay I see, getting clearer at least ;-)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous, RGW bucket resharding

2017-12-04 Thread Orit Wasserman
Hi Andreas,

On Mon, Dec 4, 2017 at 11:26 AM, Andreas Calminder
 wrote:
> Hello,
> With release 12.2.2 dynamic resharding bucket index has been disabled
> when running a multisite environment
> (http://tracker.ceph.com/issues/21725). Does this mean that resharding
> of bucket indexes shouldn't be done at all, manually, while running
> multisite as there's a risk of corruption?
>

You will need to stop the sync on the bucket before doing the
resharding and start it again after the resharding completes.
It will start a full sync on the bucket (it doesn't mean we copy the
objects, but we go over all of them to check if they need to be
synced).
We will automate this as part of the reshard admin command in the next
Luminous release.
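
A rough sketch of that manual sequence with radosgw-admin (subcommand names as
in Luminous; please verify against your version):

radosgw-admin bucket sync disable --bucket=<bucket>
radosgw-admin bucket reshard --bucket=<bucket> --num-shards=<new-shard-count>
radosgw-admin bucket sync enable --bucket=<bucket>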

> Also, as dynamic bucket resharding was/is the main motivator moving to
> Luminous (for me at least) is dynamic reshardning something that is
> planned to be fixed for multisite environments later in the Luminous
> life-cycle or will it be left disabled forever?
>

We are planning to enable it in Luminous time.

Regards,
Orit

> Thanks!
> /andreas
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Adding multiple OSD

2017-12-04 Thread Karun Josy
Hi,

Is it recommended to add OSD disks one by one or can I add couple of disks
at a time ?

Current cluster size is about 4 TB.



Karun
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP with some basics please

2017-12-04 Thread Denes Dolhay

Hi,

I would not rip out the discs, but I would reweight the osd to 0, wait 
for the cluster to reconfigure, and when it is done, you can remove the 
disc / raid pair without ever going down to 1 copy only.


The journals can only be moved back by a complete rebuild of that osd, as 
far as I know.


In size=2, losing any 2 discs on different hosts would probably cause 
data to be unavailable / lost, as the PG copies are randomly distributed 
across the osds. Chances are that you can find a PG whose acting 
group is the two failed OSDs (you lost all your replicas).


Ceph by default will not put both acting group members of a pg onto the 
same host. This is determined by the failure domain setting.


Denes.


On 12/04/2017 06:27 PM, tim taler wrote:

thnx a lot again,
makes sense to me.

We have all journals of the HDD-OSDs on partitions on an extra
SSD-raid1 (each OSD got it's own journal partition on that raid1)
but as I understand they could be moved back to the OSD, at least for
the time of the restructuring.

What makes my tommy turn though, is the thought of ripping out a raid0
pair and plug it into another machine, (it's hwraid not zfs!)
in the hope of keeping the data on it, even if I can get the same sort
of controller (which might be possible, although the machines are a
couple of years old and
Machine C is not the same as A and B) .

And I'm still puzzled about the implication of the cluster size on the
amount of OSD failures.
With size=2 min_size=1 one host could die and (if by chance there is
NO read error on any bit on the living host) I could (theoretically)
recover, is that right?
OR is it that if any two disks in the cluster fail at the same time
(or while one is still being rebuild) all my data would be gone?



On Mon, Dec 4, 2017 at 4:42 PM, David Turner  wrote:

Your current node configuration cannot do size=3 for any pools.  You only
have 2 hosts with HDDs and 2 hosts with SSDs in each root.  You cannot put 3
copies of data for an HDD pool on 3 separate nodes when you only have 2
nodes with HDDs...  In this configuration, size=2 is putting a copy of the
data on every available node.  That is why you need to have the space
available on the host with the failed OSD to be able to recover; there is no
other way for the cluster to keep 2 copies of the data on different nodes.
The same will be true if you only have 3 available nodes and size=3; any
failed disk can only backfill onto the same node.

I would start by recommending that you restructure your nodes quite heavily.
You want as close to the same number of disks in each node as you can get.
A balanced setup might look like...  This is of course assuming that the
CPU, RAM, and disk controllers are similar between the 3 nodes.

machine A:
2x 3.6TB
2x 3.6TB RAID0
1x 1.8TB
2x .7TB SSD (1 each from machines B & C)

machine B:
2x 3.6TB
2x 3.6TB RAID0
1x 1.8TB
2x .7TB SSD

machine C:
1x 3.6TB (from machine B)
2x 3.6TB RAID0 (1 each from machines A & B)
2x 1.8TB  (from machine A)
2x .7TB SSD

After all of that is configured and backfilled (a lot of backfilling). The
next step is to remove the RAID0 OSDs and add them back in as individual
1.8TB OSDs.  You can also consider size=3 min_size=2 for some of your pools
in this configuration.  Both rebuilding the RAID0 OSDs and increasing the
size of a pool will require that you have enough space in your
cluster/nodes.  Depending on how you have your journals configured moving
the OSDs between hosts is usually fairly trivial (except for the
backfilling).

Your % used is going to be a problem throughout this as an inherent issue
with Ceph not being perfect at balancing data which is a trade-off for data
integrity in the CRUSH algorithm.  There are ways to change the weights of
the OSDs to help fix the balance issue, but it is not indicative of a
problem in your configuration... just something that you need to be aware of
to be able to prevent it from being a major problem.

There is a lot of material on why size=2 min_size=1 is bad.  Read back
through the ML archives to find some.  My biggest take-away is... if you
lose all but 1 copy of your data... do you really want to make changes to
it?  I've also noticed that the majority of clusters on the ML that have
irreparably lost data were running with size=2 min_size=1.

On Mon, Dec 4, 2017 at 6:12 AM tim taler  wrote:

Hi,

thnx a lot for the quick response
and for laying out some of the issues


I'm also new, but I'll try to help. IMHO most of the pros here would be
quite worried about this cluster if it is production:

thought so ;-/


-A prod ceph cluster should not be run with size=2 min_size=1, because:
--In case of a down'ed osd / host the cluster could have problems
determining which data is the correct when the osd/host came back up

Uhm  I thought at least THAT wouldn't be the case here since we hace
three mons??
don't THEY keep track of which osd has the latest data
isn't the size set on the pool 

Re: [ceph-users] HELP with some basics please

2017-12-04 Thread David Turner
Flushing a journal, and creating a new journal device before turning the
OSD on is viable and simple enough to do.  Moving a raid0 while the new
host doesn't have the same controller wouldn't be recommended for obvious
reasons.  That would change my recommendation for how to distribute the
OSDs, but now you haven't given enough information about the hosts to
indicate where the OSDs should go.

size=2 min_size=1 is not a guarantee for data loss if you lose a host.
It's just a bad idea to run in production.  You can lose a host, run for 20
years, and never lose any data... as long as nothing else goes wrong
(obviously you know the likelihood of that or you wouldn't have been
worried when you saw the configuration).  Then replace the node and be back
up to 2 copies of the data.  It's how it's designed to work, but additional
hardware failures can cause problems.

On Mon, Dec 4, 2017 at 12:27 PM tim taler  wrote:

> thnx a lot again,
> makes sense to me.
>
> We have all journals of the HDD-OSDs on partitions on an extra
> SSD-raid1 (each OSD got it's own journal partition on that raid1)
> but as I understand they could be moved back to the OSD, at least for
> the time of the restructuring.
>
> What makes my tommy turn though, is the thought of ripping out a raid0
> pair and plug it into another machine, (it's hwraid not zfs!)
> in the hope of keeping the data on it, even if I can get the same sort
> of controller (which might be possible, although the machines are a
> couple of years old and
> Machine C is not the same as A and B) .
>
> And I'm still puzzled about the implication of the cluster size on the
> amount of OSD failures.
> With size=2 min_size=1 one host could die and (if by chance there is
> NO read error on any bit on the living host) I could (theoretically)
> recover, is that right?
> OR is it that if any two disks in the cluster fail at the same time
> (or while one is still being rebuild) all my data would be gone?
>
>
>
> On Mon, Dec 4, 2017 at 4:42 PM, David Turner 
> wrote:
> > Your current node configuration cannot do size=3 for any pools.  You only
> > have 2 hosts with HDDs and 2 hosts with SSDs in each root.  You cannot
> put 3
> > copies of data for an HDD pool on 3 separate nodes when you only have 2
> > nodes with HDDs...  In this configuration, size=2 is putting a copy of
> the
> > data on every available node.  That is why you need to have the space
> > available on the host with the failed OSD to be able to recover; there
> is no
> > other way for the cluster to keep 2 copies of the data on different
> nodes.
> > The same will be true if you only have 3 available nodes and size=3; any
> > failed disk can only backfill onto the same node.
> >
> > I would start by recommending that you restructure your nodes quite
> heavily.
> > You want as close to the same number of disks in each node as you can
> get.
> > A balanced setup might look like...  This is of course assuming that the
> > CPU, RAM, and disk controllers are similar between the 3 nodes.
> >
> > machine A:
> > 2x 3.6TB
> > 2x 3.6TB RAID0
> > 1x 1.8TB
> > 2x .7TB SSD (1 each from machines B & C)
> >
> > machine B:
> > 2x 3.6TB
> > 2x 3.6TB RAID0
> > 1x 1.8TB
> > 2x .7TB SSD
> >
> > machine C:
> > 1x 3.6TB (from machine B)
> > 2x 3.6TB RAID0 (1 each from machines A & B)
> > 2x 1.8TB  (from machine A)
> > 2x .7TB SSD
> >
> > After all of that is configured and backfilled (a lot of backfilling).
> The
> > next step is to remove the RAID0 OSDs and add them back in as individual
> > 1.8TB OSDs.  You can also consider size=3 min_size=2 for some of your
> pools
> > in this configuration.  Both rebuilding the RAID0 OSDs and increasing the
> > size of a pool will require that you have enough space in your
> > cluster/nodes.  Depending on how you have your journals configured moving
> > the OSDs between hosts is usually fairly trivial (except for the
> > backfilling).
> >
> > Your % used is going to be a problem throughout this as an inherent issue
> > with Ceph not being perfect at balancing data which is a trade-off for
> data
> > integrity in the CRUSH algorithm.  There are ways to change the weights
> of
> > the OSDs to help fix the balance issue, but it is not indicative of a
> > problem in your configuration... just something that you need to be
> aware of
> > to be able to prevent it from being a major problem.
> >
> > There is a lot of material on why size=2 min_size=1 is bad.  Read back
> > through the ML archives to find some.  My biggest take-away is... if you
> > lose all but 1 copy of your data... do you really want to make changes to
> > it?  I've also noticed that the majority of clusters on the ML that have
> > irreparably lost data were running with size=2 min_size=1.
> >
> > On Mon, Dec 4, 2017 at 6:12 AM tim taler  wrote:
> >>
> >> Hi,
> >>
> >> thnx a lot for the quick response
> >> and for laying out some of the issues
> >>
> >> > I'm also new, but 

Re: [ceph-users] Replaced a disk, first time. Quick question

2017-12-04 Thread Drew Weaver


19446/16764 objects degraded (115.999%) <-- I noticed that number seems odd
I don't think that's normal!

40795/16764 objects degraded (243.349%) <-- Now I’m really concerned.

I'd recommend providing more info, Ceph version, bluestore or filestore, 
crushmap etc.

Hi, thanks for the reply.

12.2.2 bluestore

ceph osd crush tree

ID  CLASS WEIGHT    TYPE NAME
 -1       174.62842 root default
 -3        29.10474     host OSD0
  0   hdd   3.63809         osd.0
  6   hdd   3.63809         osd.6
 12   hdd   3.63809         osd.12
 18   hdd   3.63809         osd.18
 24   hdd   3.63809         osd.24
 30   hdd   3.63809         osd.30
 46   hdd   3.63809         osd.46
 47   hdd   3.63809         osd.47
 -5        29.10474     host OSD1
  1   hdd   3.63809         osd.1
  7   hdd   3.63809         osd.7
 13   hdd   3.63809         osd.13
 19   hdd   3.63809         osd.19
 25   hdd   3.63809         osd.25
 31   hdd   3.63809         osd.31
 36   hdd   3.63809         osd.36
 41   hdd   3.63809         osd.41
 -7        29.10474     host OSD2
  2   hdd   3.63809         osd.2
  8   hdd   3.63809         osd.8
 14   hdd   3.63809         osd.14
 20   hdd   3.63809         osd.20
 26   hdd   3.63809         osd.26
 32   hdd   3.63809         osd.32
 37   hdd   3.63809         osd.37
 42   hdd   3.63809         osd.42
 -9        29.10474     host OSD3
  3   hdd   3.63809         osd.3
  9   hdd   3.63809         osd.9
 15   hdd   3.63809         osd.15
 21   hdd   3.63809         osd.21
 27   hdd   3.63809         osd.27
 33   hdd   3.63809         osd.33
 38   hdd   3.63809         osd.38
 43   hdd   3.63809         osd.43
-11        29.10474     host OSD4
  4   hdd   3.63809         osd.4
 10   hdd   3.63809         osd.10
 16   hdd   3.63809         osd.16
 22   hdd   3.63809         osd.22
 28   hdd   3.63809         osd.28
 34   hdd   3.63809         osd.34
 39   hdd   3.63809         osd.39
 44   hdd   3.63809         osd.44
-13        29.10474     host OSD5
  5   hdd   3.63809         osd.5
 11   hdd   3.63809         osd.11
 17   hdd   3.63809         osd.17
 23   hdd   3.63809         osd.23
 29   hdd   3.63809         osd.29
 35   hdd   3.63809         osd.35
 40   hdd   3.63809         osd.40
 45   hdd   3.63809         osd.45

ceph osd crush rule ls

replicated_rule

thanks.

-Drew


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP with some basics please

2017-12-04 Thread tim taler
thnx a lot again,
makes sense to me.

We have all journals of the HDD-OSDs on partitions on an extra
SSD-raid1 (each OSD got its own journal partition on that raid1)
but as I understand they could be moved back to the OSD, at least for
the time of the restructuring.

What makes my tummy turn, though, is the thought of ripping out a raid0
pair and plugging it into another machine (it's hwraid, not zfs!)
in the hope of keeping the data on it, even if I can get the same sort
of controller (which might be possible, although the machines are a
couple of years old and
Machine C is not the same as A and B).

And I'm still puzzled about the implication of the cluster size on the
number of OSD failures it can tolerate.
With size=2 min_size=1 one host could die and (if by chance there is
NO read error on any bit on the living host) I could (theoretically)
recover, is that right?
OR is it that if any two disks in the cluster fail at the same time
(or while one is still being rebuilt) all my data would be gone?



On Mon, Dec 4, 2017 at 4:42 PM, David Turner  wrote:
> Your current node configuration cannot do size=3 for any pools.  You only
> have 2 hosts with HDDs and 2 hosts with SSDs in each root.  You cannot put 3
> copies of data for an HDD pool on 3 separate nodes when you only have 2
> nodes with HDDs...  In this configuration, size=2 is putting a copy of the
> data on every available node.  That is why you need to have the space
> available on the host with the failed OSD to be able to recover; there is no
> other way for the cluster to keep 2 copies of the data on different nodes.
> The same will be true if you only have 3 available nodes and size=3; any
> failed disk can only backfill onto the same node.
>
> I would start by recommending that you restructure your nodes quite heavily.
> You want as close to the same number of disks in each node as you can get.
> A balanced setup might look like...  This is of course assuming that the
> CPU, RAM, and disk controllers are similar between the 3 nodes.
>
> machine A:
> 2x 3.6TB
> 2x 3.6TB RAID0
> 1x 1.8TB
> 2x .7TB SSD (1 each from machines B & C)
>
> machine B:
> 2x 3.6TB
> 2x 3.6TB RAID0
> 1x 1.8TB
> 2x .7TB SSD
>
> machine C:
> 1x 3.6TB (from machine B)
> 2x 3.6TB RAID0 (1 each from machines A & B)
> 2x 1.8TB  (from machine A)
> 2x .7TB SSD
>
> After all of that is configured and backfilled (a lot of backfilling). The
> next step is to remove the RAID0 OSDs and add them back in as individual
> 1.8TB OSDs.  You can also consider size=3 min_size=2 for some of your pools
> in this configuration.  Both rebuilding the RAID0 OSDs and increasing the
> size of a pool will require that you have enough space in your
> cluster/nodes.  Depending on how you have your journals configured moving
> the OSDs between hosts is usually fairly trivial (except for the
> backfilling).
>
> Your % used is going to be a problem throughout this as an inherent issue
> with Ceph not being perfect at balancing data which is a trade-off for data
> integrity in the CRUSH algorithm.  There are ways to change the weights of
> the OSDs to help fix the balance issue, but it is not indicative of a
> problem in your configuration... just something that you need to be aware of
> to be able to prevent it from being a major problem.
>
> There is a lot of material on why size=2 min_size=1 is bad.  Read back
> through the ML archives to find some.  My biggest take-away is... if you
> lose all but 1 copy of your data... do you really want to make changes to
> it?  I've also noticed that the majority of clusters on the ML that have
> irreparably lost data were running with size=2 min_size=1.
>
> On Mon, Dec 4, 2017 at 6:12 AM tim taler  wrote:
>>
>> Hi,
>>
>> thnx a lot for the quick response
>> and for laying out some of the issues
>>
>> > I'm also new, but I'll try to help. IMHO most of the pros here would be
>> > quite worried about this cluster if it is production:
>>
>> thought so ;-/
>>
>> > -A prod ceph cluster should not be run with size=2 min_size=1, because:
>> > --In case of a down'ed osd / host the cluster could have problems
>> > determining which data is the correct when the osd/host came back up
>>
>> Uhm  I thought at least THAT wouldn't be the case here since we hace
>> three mons??
>> don't THEY keep track of which osd has the latest data
>> isn't the size set on the pool level not on the cluster level??
>>
>> > --If an osd dies, the others get more io (has to compensate the lost io
>> > capacity and the rebuilding too) which can instantly kill another close to
>> > death disc (not with ceph, but with raid i have been there)
>> > --If an osd dies ANY other osd serving that pool has well placed
>> > inconsistency, like bitrot you'll lose data
>>
>> good point, with scrubbing the checksums of the the objects are checked,
>> right?
>> can I get somewhere the report how much errors where found by the last
>> scrub run (like in zpool status)
>> to estimate how well a 

Re: [ceph-users] HELP with some basics please

2017-12-04 Thread Denes Dolhay

Hi,


On 12/04/2017 12:12 PM, tim taler wrote:

Hi,

thnx a lot for the quick response
and for laying out some of the issues


I'm also new, but I'll try to help. IMHO most of the pros here would be quite 
worried about this cluster if it is production:

thought so ;-/


-A prod ceph cluster should not be run with size=2 min_size=1, because:
--In case of a down'ed osd / host the cluster could have problems determining 
which data is the correct when the osd/host came back up

Uhm  I thought at least THAT wouldn't be the case here since we hace
three mons??
don't THEY keep track of which osd has the latest data
isn't the size set on the pool level not on the cluster level??
It is not a mon-related issue, I do not really understand the cause 
either, but either way, the problem exists. You can read back the thread 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-November/022108.html



--If an osd dies, the others get more io (has to compensate the lost io 
capacity and the rebuilding too) which can instantly kill another close to 
death disc (not with ceph, but with raid i have been there)
--If an osd dies ANY other osd serving that pool has well placed inconsistency, 
like bitrot you'll lose data

good point, with scrubbing the checksums of the objects are checked, right?
can I get a report somewhere of how many errors were found by the last
scrub run (like in zpool status)
to estimate how well a disk is doing (right now the raid controller
won't let me read the smart data from the disks)

The deep scrub does check for these, I do not know how to check the stats.
You probably can access the underlying disk SMART data somehow, for example LSI 
HBAs expose it at /dev/sgX
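For what it's worth, a minimal sketch of both checks (assuming smartmontools 
is installed and the controller exposes SCSI generic devices; device names and 
the pg id are placeholders):

smartctl -a /dev/sg2                      # SMART via the controller's pass-through device
smartctl -d megaraid,0 -a /dev/sda        # the usual form for LSI MegaRAID controllers

ceph health detail | grep inconsistent    # which PGs have scrub errors
rados list-inconsistent-obj <pgid> --format=json-pretty   # what exactly was found in a PG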




-There are not enough hosts in your setup, or rather the discs are not 
distributed well:
--If an osd / host dies, the cluster trys to repair itself and relocate the 
data onto another host. In your config there is no other host to reallocate 
data to if ANY of the hosts fail (I guess that hdds and ssds are separated)

Yupp, HDD and SDD form seperate pools.
Good point, not in my list of arguments yet


-The disks should nod be placed in raid arrays if it can be avoided especially 
raid0:
--You multiply the possibility of an un-recoverable disc error (and since the 
data is striped) the other disks data is unrecoverable too
--When an osd dies, the cluster should relocate the data onto another osd. When this 
happens now there is double the data that need to be moved, this causes 2 problems: 
Recovery time / io, and free space. The cluster should have enough free space to reallocate 
data to, in this setup you cannot do that in case of a host dies (see above), but in case 
an osd dies, ceph would try to replicate the data onto other osds in the machine. So you 
have to have enough free space on >>the same host<< in this setup to replicate 
data to.

ON THE SAME MACHINE ?
is that so?
So than there should be at the BARE MINIMUM always be more free space
on each machine than the biggest OSD it hosts, right?
This applies to your current config. Since by default ceph will not 
replicate the data onto the same host twice (it is sensible!) and you 
have only 2 hosts for each of your pools, ceph does not have any other 
choice but to replicate the data onto another osd on the same host. If 
you had more hosts, the additional load would be divided between 
them.


There is a max usage limit for the cluster's healing process (maybe 90%?).
The rule of thumb is that you should have enough space in the cluster 
to accommodate the replica placement caused by the loss of a host. So 
you should have "size"+1 hosts, and free space on the cluster, so if you 
subtract the size of the biggest host from the cluster sum size, then 
the resulting usage should be < 90%
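A rough worked example (numbers assumed, not taken from this cluster): with 
three 10 TB hosts you have 30 TB raw; losing one host leaves 20 TB, so with 
the ~90% ceiling you would want to stay below about 0.9 * 20 TB = 18 TB of 
raw usage, i.e. roughly 60% of the original 30 TB.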



In your case, I would recommend:
-Introducing (and activating) a fourth osd host
-setting size=3 min_size=2

that will be difficult, can't I run size=3 min_size=2 with three hosts?
Yes you could, but then in case of a failed host ceph cannot repair 
itself (there is no host to put the third copy onto).
Furthermore, the more hosts you have, the less impact it makes if you 
lose one of them, see the calculation above.



-After data migration is done, one-by-one separating the raid0 arrays: (remove, 
split) -> (zap, init, add) separately, in such a manner that hdds and ssds are 
evenly distributed across the servers

from what I understand the sizes of OSDs can vary
and the weight setting in our setup seems plausible to me (it's
directly derived from the size of the osd)
why than are the not filled on the same level nor even tending to
being filled the same?
does ceph by itself include other measurements like latency of the
OSD? that would explain why the raid0 OSDs have so much more data
than the single disks, but I haven't seen anything about that in the
docus (so far?)
The osd weight can be set (by hand) according to disk size, IO 
capability, ssd DWPD etc. depending on your 

Re: [ceph-users] Replaced a disk, first time. Quick question

2017-12-04 Thread David C
On Mon, Dec 4, 2017 at 4:39 PM, Drew Weaver  wrote:

> Howdy,
>
>
>
> I replaced a disk today because it was marked as Predicted failure. These
> were the steps I took
>
>
>
> ceph osd out osd17
>
> ceph -w #waited for it to get done
>
> systemctl stop ceph-osd@osd17
>
> ceph osd purge osd17 --yes-i-really-mean-it
>
> umount /var/lib/ceph/osd/ceph-osdX
>
>
>
> I noticed that after I ran the ‘osd out’ command that it started moving
> data around.
>

That's normal

>
>
> 19446/16764 objects degraded (115.999%) <-- I noticed that number seems odd
>

I don't think that's normal!

>
>
> So then I replaced the disk
>
> Created a new label on it
>
> Ceph-deploy osd prepare OSD5:sdd
>
>
>
> THIS time, it started rebuilding
>
>
>
> 40795/16764 objects degraded (243.349%) <-- Now I’m really concerned.
>
>
>
> Perhaps I don’t quite understand what the numbers are telling me but is it
> normal for it to rebuilding more objects than exist?
>
See:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020682.html,
seems to be similar issue to yours.

I'd recommend providing more info, Ceph version, bluestore or filestore,
crushmap etc.

>
>
> Thanks,
>
> -Drew
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Replaced a disk, first time. Quick question

2017-12-04 Thread Drew Weaver
Howdy,

I replaced a disk today because it was marked as Predicted failure. These were 
the steps I took

ceph osd out osd17
ceph -w #waited for it to get done
systemctl stop ceph-osd@osd17
ceph osd purge osd17 --yes-i-really-mean-it
umount /var/lib/ceph/osd/ceph-osdX

I noticed that after I ran the 'osd out' command that it started moving data 
around.

19446/16764 objects degraded (115.999%) <-- I noticed that number seems odd

So then I replaced the disk
Created a new label on it
Ceph-deploy osd prepare OSD5:sdd

THIS time, it started rebuilding

40795/16764 objects degraded (243.349%) <-- Now I'm really concerned.

Perhaps I don't quite understand what the numbers are telling me but is it 
normal for it to rebuilding more objects than exist?

Thanks,
-Drew


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dropping trusty

2017-12-04 Thread kefu chai
On Mon, Dec 4, 2017 at 11:48 PM, David Galloway  wrote:
>
> On 12/04/2017 01:12 AM, kefu chai wrote:
>> On Fri, Dec 1, 2017 at 1:55 AM, David Galloway  wrote:
>>> On 11/30/2017 12:21 PM, Sage Weil wrote:
 We're talking about dropping trusty support for mimic due to the old
 compiler (incomplete C++11), hassle of using an updated toolchain, general
 desire to stop supporting old stuff, and lack of user objections to
 dropping it in the next release.

 We would continue to build trusty packages for luminous and older
 releases, just not mimic going forward.

 My question is whether we should drop all of the trusty installs on smithi
 and focus testing on xenial and centos.  I haven't seen any trusty related
 failures in half a year.  There were some kernel-related issues 6+ months
 ago that are resolved, and there is a valgrind issue with xenial that is
 making us do valgrind only on centos, but otherwise I don't recall any
 other problems.  I think the likelihood of a trusty-specific regression on
 luminous/jewel is low.  Note that we can still do install and smoke
 testing on VMs to ensure the packages work; we just wouldn't stress test.

 Does this seem reasonable?  If so, we could reimage the trusty hosts
 immediately, right?

 Am I missing anything?

>>>
>>> Someone would need to prune through the qa dir and make sure nothing
>>> relies on trusty for tests.  We've gotten into a bind recently with the
>>
>> David, thanks for pointing out the direction. i removed the references to trusty
>> and updated related bits in https://github.com/ceph/ceph/pull/19307.
>>
>>> testing of FOG [1] where jobs are stuck in Waiting for a long time
>>> (tying up workers) because jobs are requesting Trusty.  We got close to
>>> having zero Trusty testnodes since the wip-fog branch has been reimaging
>>> baremetal testnodes on every job.
>>>
>>> But other than that, yes, I can reimage the Trusty testnodes.  Once FOG
>>> is merged into teuthology master, we won't have to worry about this
>>> anymore since jobs will automatically reimage machines based on what
>>> distro they require.
>>
>> since https://github.com/ceph/teuthology/pull/1126 is merged, could you help
>> reimage the trusty test nodes?
>>
>
> No need.  Since that's merged, all testnodes are automatically reimaged
> on every job now :).  An e-mail to Sepia user list is forthcoming with
> more details.

awesome, thank you! love this!

>
>>>
>>> [1] https://github.com/ceph/teuthology/compare/wip-fog
>



-- 
Regards
Kefu Chai
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dropping trusty

2017-12-04 Thread David Galloway

On 12/04/2017 01:12 AM, kefu chai wrote:
> On Fri, Dec 1, 2017 at 1:55 AM, David Galloway  wrote:
>> On 11/30/2017 12:21 PM, Sage Weil wrote:
>>> We're talking about dropping trusty support for mimic due to the old
>>> compiler (incomplete C++11), hassle of using an updated toolchain, general
>>> desire to stop supporting old stuff, and lack of user objections to
>>> dropping it in the next release.
>>>
>>> We would continue to build trusty packages for luminous and older
>>> releases, just not mimic going forward.
>>>
>>> My question is whether we should drop all of the trusty installs on smithi
>>> and focus testing on xenial and centos.  I haven't seen any trusty related
>>> failures in half a year.  There were some kernel-related issues 6+ months
>>> ago that are resolved, and there is a valgrind issue with xenial that is
>>> making us do valgrind only on centos, but otherwise I don't recall any
>>> other problems.  I think the likelihood of a trusty-specific regression on
>>> luminous/jewel is low.  Note that we can still do install and smoke
>>> testing on VMs to ensure the packages work; we just wouldn't stress test.
>>>
>>> Does this seem reasonable?  If so, we could reimage the trusty hosts
>>> immediately, right?
>>>
>>> Am I missing anything?
>>>
>>
>> Someone would need to prune through the qa dir and make sure nothing
>> relies on trusty for tests.  We've gotten into a bind recently with the
> 
> David, thanks for pointing out the direction. i removed the references to trusty
> and updated related bits in https://github.com/ceph/ceph/pull/19307.
> 
>> testing of FOG [1] where jobs are stuck in Waiting for a long time
>> (tying up workers) because jobs are requesting Trusty.  We got close to
>> having zero Trusty testnodes since the wip-fog branch has been reimaging
>> baremetal testnodes on every job.
>>
>> But other than that, yes, I can reimage the Trusty testnodes.  Once FOG
>> is merged into teuthology master, we won't have to worry about this
>> anymore since jobs will automatically reimage machines based on what
>> distro they require.
> 
> since https://github.com/ceph/teuthology/pull/1126 is merged, could you help
> reimage the trusty test nodes?
> 

No need.  Since that's merged, all testnodes are automatically reimaged
on every job now :).  An e-mail to Sepia user list is forthcoming with
more details.

>>
>> [1] https://github.com/ceph/teuthology/compare/wip-fog

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP with some basics please

2017-12-04 Thread David Turner
Your current node configuration cannot do size=3 for any pools.  You only
have 2 hosts with HDDs and 2 hosts with SSDs in each root.  You cannot put
3 copies of data for an HDD pool on 3 separate nodes when you only have 2
nodes with HDDs...  In this configuration, size=2 is putting a copy of the
data on every available node.  That is why you need to have the space
available on the host with the failed OSD to be able to recover; there is
no other way for the cluster to keep 2 copies of the data on different
nodes.  The same will be true if you only have 3 available nodes and
size=3; any failed disk can only backfill onto the same node.
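
The host-level restriction above comes from the CRUSH rule.  For reference, a
decompiled default replicated rule on Luminous typically looks like the sketch
below (stock names assumed, not necessarily matching your customized roots):

rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

The "chooseleaf firstn 0 type host" step is what forces every replica onto a
different host, which is why 2 HDD hosts cap those pools at size=2.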

I would start by recommending that you restructure your nodes quite
heavily.  You want as close to the same number of disks in each node as you
can get.  A balanced setup might look like...  This is of course assuming
that the CPU, RAM, and disk controllers are similar between the 3 nodes.

machine A:
2x 3.6TB
2x 3.6TB RAID0
1x 1.8TB
2x .7TB SSD (1 each from machines B & C)

machine B:
2x 3.6TB
2x 3.6TB RAID0
1x 1.8TB
2x .7TB SSD

machine C:
1x 3.6TB (from machine B)
2x 3.6TB RAID0 (1 each from machines A & B)
2x 1.8TB  (from machine A)
2x .7TB SSD

After all of that is configured and backfilled (a lot of backfilling). The
next step is to remove the RAID0 OSDs and add them back in as individual
1.8TB OSDs.  You can also consider size=3 min_size=2 for some of your pools
in this configuration.  Both rebuilding the RAID0 OSDs and increasing the
size of a pool will require that you have enough space in your
cluster/nodes.  Depending on how you have your journals configured moving
the OSDs between hosts is usually fairly trivial (except for the
backfilling).

Your % used is going to be a problem throughout this as an inherent issue
with Ceph not being perfect at balancing data which is a trade-off for data
integrity in the CRUSH algorithm.  There are ways to change the weights of
the OSDs to help fix the balance issue, but it is not indicative of a
problem in your configuration... just something that you need to be aware
of to be able to prevent it from being a major problem.

There is a lot of material on why size=2 min_size=1 is bad.  Read back
through the ML archives to find some.  My biggest take-away is... if you
lose all but 1 copy of your data... do you really want to make changes to
it?  I've also noticed that the majority of clusters on the ML that have
irreparably lost data were running with size=2 min_size=1.

On Mon, Dec 4, 2017 at 6:12 AM tim taler  wrote:

> Hi,
>
> thnx a lot for the quick response
> and for laying out some of the issues
>
> > I'm also new, but I'll try to help. IMHO most of the pros here would be
> quite worried about this cluster if it is production:
>
> thought so ;-/
>
> > -A prod ceph cluster should not be run with size=2 min_size=1, because:
> > --In case of a down'ed osd / host the cluster could have problems
> determining which data is the correct when the osd/host came back up
>
> Uhm  I thought at least THAT wouldn't be the case here since we hace
> three mons??
> don't THEY keep track of which osd has the latest data
> isn't the size set on the pool level not on the cluster level??
>
> > --If an osd dies, the others get more io (has to compensate the lost io
> capacity and the rebuilding too) which can instantly kill another close to
> death disc (not with ceph, but with raid i have been there)
> > --If an osd dies ANY other osd serving that pool has well placed
> inconsistency, like bitrot you'll lose data
>
> good point, with scrubbing the checksums of the the objects are checked,
> right?
> can I get somewhere the report how much errors where found by the last
> scrub run (like in zpool status)
> to estimate how well a disk is doing (right now the raid controller
> won't let me read the smart data from the disks)
>
>
> > -There are not enough hosts in your setup, or rather the discs are not
> distributed well:
> > --If an osd / host dies, the cluster trys to repair itself and relocate
> the data onto another host. In your config there is no other host to
> reallocate data to if ANY of the hosts fail (I guess that hdds and ssds are
> separated)
> Yupp, HDD and SDD form seperate pools.
> Good point, not in my list of arguments yet
>
> > -The disks should nod be placed in raid arrays if it can be avoided
> especially raid0:
> > --You multiply the possibility of an un-recoverable disc error (and
> since the data is striped) the other disks data is unrecoverable too
> > --When an osd dies, the cluster should relocate the data onto another
> osd. When this happens now there is double the data that need to be moved,
> this causes 2 problems: Recovery time / io, and free space. The cluster
> should have enough free space to reallocate data to, in this setup you
> cannot do that in case of a host dies (see above), but in case an osd dies,
> ceph would try to replicate the data onto other osds in the machine. So you
> have 

Re: [ceph-users] Luminous 12.2.2 rpm's not signed?

2017-12-04 Thread Konstantin Shalygin

Total size: 51 M
Is this ok [y/d/N]: y
Downloading packages:


Package ceph-common-12.2.2-0.el7.x86_64.rpm is not signed


http://tracker.ceph.com/issues/22311

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous 12.2.2 rpm's not signed?

2017-12-04 Thread Marc Roos
 


Total size: 51 M
Is this ok [y/d/N]: y
Downloading packages:


Package ceph-common-12.2.2-0.el7.x86_64.rpm is not signed



-Original Message-
From: Rafał Wądołowski [mailto:rwadolow...@cloudferro.com] 
Sent: maandag 4 december 2017 14:18
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Monitoring bluestore compression ratio

Hi,

Is there any command or tool to show effectiveness of bluestore 
compression?

I see the difference (in ceph osd df tree), while uploading a object to 
ceph, but maybe there are more friendly method to do it.


-- 
Regards,

Rafał Wądołowski
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Docs] s/ceph-disk/ceph-volume/g ?

2017-12-04 Thread Alfredo Deza
On Mon, Dec 4, 2017 at 3:34 AM, Yoann Moulin  wrote:
> Hello,
>
> Since ceph-disk is now deprecated, it would be great to update the 
> documentation to also cover the corresponding processes with ceph-volume.
>
> for example :
>
> add-or-rm-osds => 
> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/
>
> bluestore-migration => 
> http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/
>

Yes, there are also a few other places as well, and a PR is being worked on:

https://github.com/ceph/ceph/pull/19241

> In my opinion, documentation for luminous branch should keep both options 
> (ceph-disk and ceph-volume) but with a warning message to
> encourage people to use ceph-volume instead of ceph-disk.
>
> I guess, there is plenty of reference to ceph-disk that need to be updated.
>
> --
> Yoann Moulin
> EPFL IC-IT
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Monitoring bluestore compression ratio

2017-12-04 Thread Rafał Wądołowski

Hi,

Is there any command or tool to show effectiveness of bluestore 
compression?


I see the difference (in ceph osd df tree), while uploading a object to 
ceph, but maybe there are more friendly method to do it.



--
Regards,

Rafał Wądołowski
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-12-04 Thread Gerhard W. Recher
I got an error on this:
 sysbench
--test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua
--mysql-host=127.0.0.1 --mysql-port=33033 --mysql-user=sysbench
--mysql-password=password --mysql-db=sysbench
--mysql-table-engine=innodb --db-driver=mysql --oltp_tables_count=10
--oltp-test-mode=complex --oltp-read-only=off --oltp-table-size=20
--threads=10 --rand-type=uniform --rand-init=on cleanup
Unknown option: --oltp_tables_count.
Usage:
  sysbench [general-options]... --test=<test-name> [test-options]... command

General options:
  --num-threads=N    number of threads to use [1]
  --max-requests=N   limit for total number of requests [1]
  --max-time=N   limit for total execution time in seconds [0]
  --forced-shutdown=STRING   amount of time to wait after --max-time
before forcing shutdown [off]
  --thread-stack-size=SIZE   size of stack per thread [32K]
  --init-rng=[on|off]    initialize random number generator [off]
  --test=STRING  test to run
  --debug=[on|off]   print more debugging info [off]
  --validate=[on|off]    perform validation checks where possible [off]
  --help=[on|off]    print help and exit
  --version=[on|off] print version and exit

Compiled-in tests:
  fileio - File I/O test
  cpu - CPU performance test
  memory - Memory functions speed test
  threads - Threads subsystem performance test
  mutex - Mutex performance test
  oltp - OLTP test

Commands: prepare run cleanup help version

See 'sysbench --test=<name> help' for a list of options for each test.



but i have these:
echo "Performing test SQ-${thread}T-${run}"
sysbench --test=oltp --db-driver=mysql --oltp-table-size=4000
--mysql-db=sysbench --mysql-user=sysbench --mysql-password=password
--max-time=60 --max-requests=0 --num-threads=${thread} run >
/root/SQ-${thread}T-${run}


[client]
port    = 3306
socket  = /var/run/mysqld/mysqld.sock
[mysqld_safe]
socket  = /var/run/mysqld/mysqld.sock
nice    = 0
[mysqld]
user    = mysql
pid-file    = /var/run/mysqld/mysqld.pid
socket  = /var/run/mysqld/mysqld.sock
port    = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir  = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
bind-address    = 127.0.0.1
key_buffer  = 16M
max_allowed_packet  = 16M
thread_stack    = 192K
thread_cache_size   = 8
myisam-recover = BACKUP
query_cache_limit   = 1M
query_cache_size    = 16M
log_error = /var/log/mysql/error.log
expire_logs_days    = 10
max_binlog_size = 100M
[mysqldump]
quick
quote-names
max_allowed_packet  = 16M
[mysql]
[isamchk]
key_buffer  = 16M
!includedir /etc/mysql/conf.d/



sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing OLTP test.
Running mixed OLTP test
Using Special distribution (12 iterations,  1 pct of values are returned
in 75 pct cases)
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Threads started!
Time limit exceeded, exiting...
Done.

OLTP test statistics:
    queries performed:
    read:    84126
    write:   30045
    other:   12018
    total:   126189
    transactions:    6009   (100.14 per sec.)
    deadlocks:   0  (0.00 per sec.)
    read/write requests: 114171 (1902.71 per sec.)
    other operations:    12018  (200.28 per sec.)

Test execution summary:
    total time:  60.0045s
    total number of events:  6009
    total time taken by event execution: 59.9812
    per-request statistics:
 min:  4.47ms
 avg:  9.98ms
 max: 91.38ms
 approx.  95 percentile:  19.44ms

Threads fairness:
    events (avg/stddev):   6009./0.00
    execution time (avg/stddev):   59.9812/0.00

sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 4

Doing OLTP test.
Running mixed OLTP test
Using Special distribution (12 iterations,  1 pct of values are returned
in 75 pct cases)
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Threads started!
Time limit exceeded, exiting...
(last message repeated 3 times)
Done.

OLTP test statistics:
    queries performed:
    read:    372036
    write:   132870
    other:   53148
    total:   558054
    transactions:    26574  (442.84 per sec.)
    deadlocks:   0  (0.00 per sec.)
    read/write 

Re: [ceph-users] osd/bluestore: Get block.db usage

2017-12-04 Thread Wido den Hollander

> Op 4 december 2017 om 13:10 schreef Hans van den Bogert 
> :
> 
> 
> Hi all,
> 
> Is there a way to get the current usage of the bluestore's block.db?
> I'd really like to monitor this as we have a relatively high number of
> objects per OSD.
> 

Yes, using 'perf dump':

root@bravo:~# ceph daemon osd.1 perf dump|jq '.bluefs'
{
  "gift_bytes": 0,
  "reclaim_bytes": 0,
  "db_total_bytes": 1073741824,
  "db_used_bytes": 33554432,
  "wal_total_bytes": 0,
  "wal_used_bytes": 0,
  "slow_total_bytes": 0,
  "slow_used_bytes": 0,
  "num_files": 9,
  "log_bytes": 15745024,
  "log_compactions": 0,
  "logged_bytes": 6279168,
  "files_written_wal": 1,
  "files_written_sst": 2,
  "bytes_written_wal": 5535337,
  "bytes_written_sst": 792836
}
root@bravo:~#

This cluster is just a test (you might remember it ;) ) and not running with 
any data at the moment.

'db_used_bytes' tells you how many bytes there are in the DB of BlueStore.
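
For monitoring you can turn that into a ratio; a one-liner sketch (assuming jq 
is available and the counter names shown above):

ceph daemon osd.1 perf dump | jq '.bluefs | 100 * .db_used_bytes / .db_total_bytes'

A non-zero 'slow_used_bytes' is also worth alerting on, since it means the DB 
has started spilling over onto the slow (data) device.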

> A second question related to the above, are there mechanisms to
> influence which objects' metadata gets spilled once the block.db is
> full? -- For instance, I could not care for the extra latency when
> object metadata gets spilled to the backing disk if it for RGW-related
> data, in contrast to RBD objects metadata, which should remain on the
> faster SSD-based block.db.
> 

Not that I'm aware of.

Wido

> Regards,
> 
> Hans
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osd/bluestore: Get block.db usage

2017-12-04 Thread Hans van den Bogert
Hi all,

Is there a way to get the current usage of the bluestore's block.db?
I'd really like to monitor this as we have a relatively high number of
objects per OSD.

A second question related to the above, are there mechanisms to
influence which objects' metadata gets spilled once the block.db is
full? -- For instance, I would not care about the extra latency when
object metadata gets spilled to the backing disk if it is for RGW-related
data, in contrast to RBD object metadata, which should remain on the
faster SSD-based block.db.

Regards,

Hans
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-12-04 Thread German Anders
Could anyone run the tests and share some results?

Thanks in advance,

Best,


*German*

2017-11-30 14:25 GMT-03:00 German Anders :

> That's correct, IPoIB for the backend (already configured the irq
> affinity),  and 10GbE on the frontend. I would love to try rdma but like
> you said is not stable for production, so I think I'll have to wait for
> that. Yeah, the thing is that it's not my decision to go for 50GbE or
> 100GbE... :( so.. 10GbE for the front-end will be...
>
> Would be really helpful if someone could run the following sysbench test
> on a mysql db so I could make some compares:
>
> *my.cnf *configuration file:
>
> [mysqld_safe]
> nice= 0
> pid-file= /home/test_db/mysql/mysql.pid
>
> [client]
> port= 33033
> socket  = /home/test_db/mysql/mysql.sock
>
> [mysqld]
> user= test_db
> port= 33033
> socket  = /home/test_db/mysql/mysql.sock
> pid-file= /home/test_db/mysql/mysql.pid
> log-error   = /home/test_db/mysql/mysql.err
> datadir = /home/test_db/mysql/data
> tmpdir  = /tmp
> server-id   = 1
>
> # ** Binlogging **
> #log-bin= /home/test_db/mysql/binlog/
> mysql-bin
> #log_bin_index  = /home/test_db/mysql/binlog/
> mysql-bin.index
> expire_logs_days= 1
> max_binlog_size = 512MB
>
> thread_handling = pool-of-threads
> thread_pool_max_threads = 300
>
>
> # ** Slow query log **
> slow_query_log  = 1
> slow_query_log_file = /home/test_db/mysql/mysql-
> slow.log
> long_query_time = 10
> log_output  = FILE
> log_slow_slave_statements   = 1
> log_slow_verbosity  = query_plan,innodb,explain
>
> # ** INNODB Specific options **
> transaction_isolation   = READ-COMMITTED
> innodb_buffer_pool_size = 12G
> innodb_data_file_path   = ibdata1:256M:autoextend
> innodb_thread_concurrency   = 16
> innodb_log_file_size= 256M
> innodb_log_files_in_group   = 3
> innodb_file_per_table
> innodb_log_buffer_size  = 16M
> innodb_stats_on_metadata= 0
> innodb_lock_wait_timeout= 30
> # innodb_flush_method   = O_DSYNC
> innodb_flush_method = O_DIRECT
> max_connections = 1
> max_connect_errors  = 99
> max_allowed_packet  = 128M
> skip-host-cache
> skip-name-resolve
> explicit_defaults_for_timestamp = 1
> performance_schema  = OFF
> log_warnings= 2
> event_scheduler = ON
>
> # ** Specific Galera Cluster Settings **
> binlog_format   = ROW
> default-storage-engine  = innodb
> query_cache_size= 0
> query_cache_type= 0
>
>
> Volume is just an RBD (on a RF=3 pool) with the default 22 bit order
> mounted on */home/test_db/mysql/data*
>
> commands for the test:
>
> sysbench 
> --test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua
> --mysql-host= --mysql-port=33033 --mysql-user=sysbench
> --mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb
> --db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex
> --oltp-read-only=off --oltp-table-size=20 --threads=10
> --rand-type=uniform --rand-init=on cleanup > /dev/null 2>/dev/null
>
> sysbench 
> --test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua
> --mysql-host= --mysql-port=33033 --mysql-user=sysbench
> --mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb
> --db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex
> --oltp-read-only=off --oltp-table-size=20 --threads=10
> --rand-type=uniform --rand-init=on prepare > /dev/null 2>/dev/null
>
> sysbench --test=/usr/share/sysbench/tests/include/oltp_legacy/oltp.lua
> --mysql-host= --mysql-port=33033 --mysql-user=sysbench
> --mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb
> --db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex
> --oltp-read-only=off --oltp-table-size=20 --threads=20
> --rand-type=uniform --rand-init=on --time=120 run >
> result_sysbench_perf_test.out 2>/dev/null
>
> Im looking for tps, qps and 95th perc, could anyone with a all-nvme
> cluster run the test and share the results? I would really 

Re: [ceph-users] Another OSD broken today. How can I recover it?

2017-12-04 Thread Ronny Aasen

On 04. des. 2017 10:22, Gonzalo Aguilar Delgado wrote:

Hello,

Things are going worse every day.


ceph -w
     cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
  health HEALTH_ERR
     1 pgs are stuck inactive for more than 300 seconds
     8 pgs inconsistent
     1 pgs repair
     1 pgs stale
     1 pgs stuck stale
     recovery 20266198323167232/288980 objects degraded 
(7013010700798.405%)

     37154696925806624 scrub errors
     no legacy OSD present but 'sortbitwise' flag is not set


But I'm finally finding time to recover. The disk seems to be correct, 
no smart errors and everything looks fine just ceph not starting. Today 
I started to look for the ceph-objectstore-tool. That I don't really 
know much.


It just works nice. No crash as expected like on the OSD.

So I'm lost. Since both the OSD and ceph-objectstore-tool use the same backend 
how is this possible?


Can someone help me on fixing this, please?




this line seems quite insane:
recovery 20266198323167232/288980 objects degraded (7013010700798.405%)

there is obviously something wrong in your cluster. once the defective osd 
is down/out does the cluster eventually heal to HEALTH_OK ?


you should start by reading and understanding this page.
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

also in order to get assistance you need to provide a lot more detail.
how many nodes, how many osd's per node. what kind of nodes cpu/ram. 
what kind of networking setup.


show the output from
ceph -s
ceph osd tree
ceph osd pool ls detail
ceph health detail




since you are systematically losing osd's i would start by checking the 
timestamp in the defective osd for when it died.
doublecheck your clock sync settings so that all servers are time 
synchronized and then check all logs for the time in question.


especialy dmesg, did OOM killer do something ? was networking flaky ?
mon logs ?  did they complain about the osd in some fashion ?


also since you fail to start the osd again there is probably some 
corruption going on. bump the log level for that osd in the node's ceph.conf, 
something like


[osd.XX]
debug osd = 20

rename the log for the osd so you have a fresh file. and try to start 
the osd once. put the log on some pastebin and send the url.
read 
http://ceph.com/planet/how-to-increase-debug-levels-and-harvest-a-detailed-osd-log/ 
for details.
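
a sketch of that sequence (assuming a systemd based host and the osd id 4 
used elsewhere in this thread):

mv /var/log/ceph/ceph-osd.4.log /var/log/ceph/ceph-osd.4.log.old
systemctl start ceph-osd@4
# let it crash once, then paste /var/log/ceph/ceph-osd.4.log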




generally: try to make it easy for people to help you without having to 
drag details out of you. If you can collect all of the above on a 
pastebin like http://paste.debian.net/ instead of piecing it together 
from 3-4 different email threads, you will find a lot more eyeballs 
willing to give it a look.




good luck and kind regards
Ronny Aasen



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP with some basics please

2017-12-04 Thread tim taler
Hi,

thnx a lot for the quick response
and for laying out some of the issues

> I'm also new, but I'll try to help. IMHO most of the pros here would be quite 
> worried about this cluster if it is production:

thought so ;-/

> -A prod ceph cluster should not be run with size=2 min_size=1, because:
> --In case of a down'ed osd / host the cluster could have problems determining 
> which data is the correct when the osd/host came back up

Uhm  I thought at least THAT wouldn't be the case here since we have
three mons??
don't THEY keep track of which osd has the latest data
isn't the size set on the pool level not on the cluster level??

> --If an osd dies, the others get more io (has to compensate the lost io 
> capacity and the rebuilding too) which can instantly kill another close to 
> death disc (not with ceph, but with raid i have been there)
> --If an osd dies ANY other osd serving that pool has well placed 
> inconsistency, like bitrot you'll lose data

good point, with scrubbing the checksums of the objects are checked, right?
can I get a report somewhere of how many errors were found by the last
scrub run (like in zpool status)
to estimate how well a disk is doing (right now the raid controller
won't let me read the smart data from the disks)


> -There are not enough hosts in your setup, or rather the discs are not 
> distributed well:
> --If an osd / host dies, the cluster trys to repair itself and relocate the 
> data onto another host. In your config there is no other host to reallocate 
> data to if ANY of the hosts fail (I guess that hdds and ssds are separated)
Yupp, HDD and SDD form seperate pools.
Good point, not in my list of arguments yet

> -The disks should nod be placed in raid arrays if it can be avoided 
> especially raid0:
> --You multiply the possibility of an un-recoverable disc error (and since the 
> data is striped) the other disks data is unrecoverable too
> --When an osd dies, the cluster should relocate the data onto another osd. 
> When this happens now there is double the data that need to be moved, this 
> causes 2 problems: Recovery time / io, and free space. The cluster should 
> have enough free space to reallocate data to, in this setup you cannot do 
> that in case of a host dies (see above), but in case an osd dies, ceph would 
> try to replicate the data onto other osds in the machine. So you have to have 
> enough free space on >>the same host<< in this setup to replicate data to.

ON THE SAME MACHINE ?
is that so?
So then there should at the BARE MINIMUM always be more free space
on each machine than the biggest OSD it hosts, right?

> In your case, I would recommend:
> -Introducing (and activating) a fourth osd host
> -setting size=3 min_size=2

that will be difficult, can't I run size=3 min_size=2 with three hosts?

> -After data migration is done, one-by-one separating the raid0 arrays: 
> (remove, split) -> (zap, init, add) separately, in such a manner that hdds 
> and ssds are evenly distributed across the servers

from what I understand the sizes of OSDs can vary
and the weight setting in our setup seems plausible to me (it's
directly derived from the size of the osd)
why then are they not filled to the same level nor even tending to
being filled the same?
does ceph by itself include other measurements like latency of the
OSD? that would explain why the raid0 OSDs have so much more data
than the single disks, but I haven't seen anything about that in the
docus (so far?)

> -Always keeping that much free space, so the cluster could lose a host and 
> still has space to repair (calculating with the repair max usage % setting).

thnx again!
yupp that was helpful

> I hope this helps, and please keep in mind that I'm a noob too :)
>
> Denes.
>
>
> On 12/04/2017 10:07 AM, tim taler wrote:
>
> Hi
> I'm new to ceph but have to honor to look after a cluster that I haven't set 
> up by myself.
> Rushing to the ceph docs and having a first glimpse on our cluster I start 
> worrying about our setup,
> so I need some advice and guidance here.
>
> The set up is:
> 3 machines, each running a ceph-monitor.
> all of them are also hosting OSDs
>
> machine A:
> 2 OSDs, each 3.6 TB - consisitng of 1 disk each (spinning disk)
> 3 OSDs, each 3.6 TB - consisting each of a 2 disk hardware-raid 0 (spinning 
> disk)
> 3 OSDs, each 1.8 TB - consisting each of a 2 disk hardware-raid 0 (spinning 
> disk)
>
> machine B:
> 3 OSDs, each 3.6 TB - consisitng of 1 disk each (spinning disk)
> 3 OSDs, each 3.6 TB - consisting each of a 2 disk hardware-raid 0 (spinning 
> disk)
> 1 OSDs, each 1.8 TB - consisting each of a 2 disk hardware-raid 0 (spinning 
> disk)
>
> 3 OSDs, each, 0.7 TB - consisitng of 1 disk each (SSD)
>
> machine C:
> 3 OSDs, each, 0.7 TB - consisitng of 1 disk each (SSD)
>
> the spinning disks and the SSD disks are forming two seperate pools.
>
> Now what I'm worrying about is that I read "don't use raid together with ceph"
> in combination with our poolsize
> :~ 

Re: [ceph-users] Ceph+RBD+ISCSI = ESXI issue

2017-12-04 Thread David Disseldorp
Hi Nigel,

On Fri, 1 Dec 2017 13:32:43 +, nigel davies wrote:

> Ceph version 10.2.5
> 
> i have had a Ceph cluster going for a few months, with iscsi servers that
> are linked to Ceph by RBD.
> 
> All of a sudden i am finding the ESXi server will lose the iSCSI data
> store (disk space goes to 0 B) and i can only fix this by rebooting the iSCSI
> server
> 
> When checking syslogs on the iscsi server i get loads of errors like
> 
> SENDING TMR_TASK_DOES_NOT_EXIST for ref_tag: 
> like 100+ lines
> 
> i looked at the logs and can't see anything saying hung io or an OSD going
> out and coming back in.
> 
> does anyone have any suggestions on what's going on??

The TMR_TASK_DOES_NOT_EXIST error indicates that your initiator (ESXi)
is attempting to abort outstanding I/Os. ESXi is pretty latency
sensitive, so I'd guess that the abort-task requests are being sent by
the initiator after tripping a local I/O timeout. Your vmkernel logs
should shed a bit more light on this.
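
For example (a sketch; log path assumed for a standard ESXi install), on the
ESXi host around the time the datastore disappears:

grep -iE 'abort|performance has deteriorated' /var/log/vmkernel.log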

Cheers, David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing mon_pg_warn_max_per_osd in v12.2.2

2017-12-04 Thread SOLTECSIS - Victor Rodriguez Cortes

> the option is now called 'mon_max_pg_per_osd'.
>
> this was originally slated for v12.2.1 where it was erroneously
> mentioned in the release notes[1] despite not being part of the
> release (I remember asking for updated/fixed release notes after 12.2.1,
> seems like that never happened?). now it was applied as part of v12.2.2,
> but is not mentioned at all in the release notes[2]...
>
> 1: http://docs.ceph.com/docs/master/release-notes/#v12-2-1-luminous
> 2: http://docs.ceph.com/docs/master/release-notes/#v12-2-2-luminous
>
That explains why I found nothing in the release notes of 12.2.2 :) 
Thanks a lot for pointing that out.

Using mon_max_pg_per_osd = 300 does work and now HEALTH is ok in this
cluster. Anyway I will move data to pools with fewer PGs asap, because
this cluster was supposed to have a few more OSDs than it will finally
have and the current PGs are suboptimal.

Thanks a lot to everyone.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing mon_pg_warn_max_per_osd in v12.2.2

2017-12-04 Thread Fabian Grünbichler
On Mon, Dec 04, 2017 at 11:21:42AM +0100, SOLTECSIS - Victor Rodriguez Cortes 
wrote:
> 
> > Why are you OK with this? A high amount of PGs can cause serious peering 
> > issues. OSDs might eat up a lot of memory and CPU after a reboot or such.
> >
> > Wido
> 
> Mainly because there was no warning at all in v12.2.1 and it just
> appeared after upgrading to v12.2.2. Besides,its not a "too high" number
> of PGs for this environment and no CPU/peering issues have been detected
> yet.
> 
> I'll plan a way to create new OSD's/new CephFS and move files to it, but
> in the mean time I would like to just increase that variable, which is
> supposed to be supported and easy.
> 
> Thanks

the option is now called 'mon_max_pg_per_osd'.
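
e.g. a sketch of setting the renamed option (assuming the mons accept runtime
injection of it; otherwise set it in ceph.conf under [global] and restart the
mons):

ceph tell mon.* injectargs '--mon_max_pg_per_osd=300'

[global]
mon_max_pg_per_osd = 300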

this was originally slated for v12.2.1 where it was erroneously
mentioned in the release notes[1] despite not being part of the
release (I remember asking for updated/fixed release notes after 12.2.1,
seems like that never happened?). now it was applied as part of v12.2.2,
but is not mentioned at all in the release notes[2]...

1: http://docs.ceph.com/docs/master/release-notes/#v12-2-1-luminous
2: http://docs.ceph.com/docs/master/release-notes/#v12-2-2-luminous

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing mon_pg_warn_max_per_osd in v12.2.2

2017-12-04 Thread SOLTECSIS - Victor Rodriguez Cortes

> Why are you OK with this? A high amount of PGs can cause serious peering 
> issues. OSDs might eat up a lot of memory and CPU after a reboot or such.
>
> Wido

Mainly because there was no warning at all in v12.2.1 and it just
appeared after upgrading to v12.2.2. Besides,its not a "too high" number
of PGs for this environment and no CPU/peering issues have been detected
yet.

I'll plan a way to create new OSD's/new CephFS and move files to it, but
in the mean time I would like to just increase that variable, which is
supposed to be supported and easy.

Thanks

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing mon_pg_warn_max_per_osd in v12.2.2

2017-12-04 Thread Wido den Hollander

> Op 4 december 2017 om 10:59 schreef SOLTECSIS - Victor Rodriguez Cortes 
> :
> 
> 
> Hello,
> 
> I have upgraded from v12.2.1 to v12.2.2 and now a warning shows using
> "ceph status":
> 
> ---
> # ceph status
>   cluster:
> id:
> health: HEALTH_WARN
> too many PGs per OSD (208 > max 200)
> ---
> 
> I'm ok with the amount of PGs, so I'm trying to increase the max PGs.

Why are you OK with this? A high amount of PGs can cause serious peering 
issues. OSDs might eat up a lot of memory and CPU after a reboot or such.

Wido

> I've tried adding this to /etc/ceph/ceph.conf and restarting
> services/servers:
> 
> ---
> [global]
> mon_pg_warn_max_per_osd = 300
> ---
> 
> 
> I've also tried to inject the config  to running daemons with:
> 
> ---
> ceph tell mon.* injectargs  "-mon_pg_warn_max_per_osd 0"
> ---
> 
> But I'm getting "Error EINVAL: injectargs: failed to parse arguments:
> --mon_pg_warn_max_per_osd,300" messages and I'm still getting the
> HEALTH_WARN message in the status command.
> 
> How can I increase mon_pg_warn_max_per_osd?
> 
> Thank you!
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Increasing mon_pg_warn_max_per_osd in v12.2.2

2017-12-04 Thread SOLTECSIS - Victor Rodriguez Cortes
Hello,

I have upgraded from v12.2.1 to v12.2.2 and now a warning shows using
"ceph status":

---
# ceph status
  cluster:
    id:
    health: HEALTH_WARN
    too many PGs per OSD (208 > max 200)
---

I'm ok with the amount of PGs, so I'm trying to increase the max PGs.
I've tried adding this to /etc/ceph/ceph.conf and restarting
services/servers:

---
[global]
mon_pg_warn_max_per_osd = 300
---


I've also tried to inject the config  to running daemons with:

---
ceph tell mon.* injectargs  "-mon_pg_warn_max_per_osd 0"
---

But I'm getting "Error EINVAL: injectargs: failed to parse arguments:
--mon_pg_warn_max_per_osd,300" messages and I'm still getting the
HEALTH_WARN message in the status command.

How can I increase mon_pg_warn_max_per_osd?

Thank you!



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP with some basics please

2017-12-04 Thread Denes Dolhay

Hi,

I'm also new, but I'll try to help. IMHO most of the pros here would be 
quite worried about this cluster if it is production:


-A prod ceph cluster should not be run with size=2 min_size=1, because:

--In case of a down'ed osd / host the cluster could have problems 
determining which data is correct when the osd/host comes back up


--If an osd dies, the others get more io (they have to compensate for the lost io 
capacity and the rebuilding too) which can instantly kill another close 
to death disk (not with ceph, but with raid i have been there)


--If an osd dies and ANY other osd serving that pool has a well placed 
inconsistency, like bitrot, you'll lose data



-There are not enough hosts in your setup, or rather the disks are not 
distributed well:


--If an osd / host dies, the cluster tries to repair itself and relocate 
the data onto another host. In your config there is no other host to 
reallocate data to if ANY of the hosts fail (I guess that hdds and ssds 
are separated)



-The disks should not be placed in raid arrays if it can be avoided, 
especially raid0:


--You multiply the probability of an unrecoverable disk error (and 
since the data is striped) the other disk's data is unrecoverable too


--When an osd dies, the cluster should relocate the data onto another 
osd. When this happens now there is double the data that needs to be 
moved, which causes 2 problems: recovery time / io, and free space. The 
cluster should have enough free space to reallocate data to; in this 
setup you cannot do that in case a host dies (see above), but in case 
an osd dies, ceph would try to replicate the data onto other osds in the 
machine. So you have to have enough free space on >>the same host<< in 
this setup to replicate data to.



In your case, I would recommend:

-Introducing (and activating) a fourth osd host

-setting size=3 min_size=2

-After data migration is done, one-by-one separating the raid0 arrays: 
(remove, split) -> (zap, init, add) separately, in such a manner that 
hdds and ssds are evenly distributed across the servers


-Always keep enough free space that the cluster could lose a host 
and still have space to repair (calculating with the repair max usage % 
setting).



I hope this helps, and please keep in mind that I'm a noob too :)

Denes.


On 12/04/2017 10:07 AM, tim taler wrote:

Hi
I'm new to ceph but have to honor to look after a cluster that I 
haven't set up by myself.
Rushing to the ceph docs and having a first glimpse on our cluster I 
start worrying about our setup,

so I need some advice and guidance here.

The set up is:
3 machines, each running a ceph-monitor.
all of them are also hosting OSDs

machine A:
2 OSDs, each 3.6 TB - consisitng of 1 disk each (spinning disk)
3 OSDs, each 3.6 TB - consisting each of a 2 disk hardware-raid 0 
(spinning disk)
3 OSDs, each 1.8 TB - consisting each of a 2 disk hardware-raid 0 
(spinning disk)


machine B:
3 OSDs, each 3.6 TB - consisitng of 1 disk each (spinning disk)
3 OSDs, each 3.6 TB - consisting each of a 2 disk hardware-raid 0 
(spinning disk)
1 OSDs, each 1.8 TB - consisting each of a 2 disk hardware-raid 0 
(spinning disk)


3 OSDs, each, 0.7 TB - consisitng of 1 disk each (SSD)

machine C:
3 OSDs, each, 0.7 TB - consisitng of 1 disk each (SSD)

the spinning disks and the SSD disks are forming two seperate pools.

Now what I'm worrying about is that I read "don't use raid together 
with ceph"

in combination with our poolsize
:~ ceph osd pool get  size
size: 2

From what I understand from the ceph docu the size tell me "how many 
disks may fail" without loosing the data of the whole pool.
Is that right? or can HALF the OSDs fail (since all objects are 
duplicated)?


Unfortunately I'm not very good in stochastic but given a probability 
of 1% disk failure per year
I'm not feeling very secure with this set up (How do I calculate the 
value that two disks fail "at the same time"? - or ahs anybody a rough 
number about that?)
although looking at our OSD tree it seems we try to spread the objects 
always between two peers:


ID  CLASS WEIGHT   TYPE NAME                      STATUS REWEIGHT PRI-AFF
-19        4.76700 root here_ssd
-15        2.38350     room 2_ssd
-14        2.38350         rack 2_ssd
 -4        2.38350             host B_ssd
  4   hdd  0.79449                 osd.4              up  1.0 1.0
  5   hdd  0.79449                 osd.5              up  1.0 1.0
 13   hdd  0.79449                 osd.13             up  1.0 1.0
-18        2.38350     room 1_ssd
-17        2.38350         rack 1_ssd
 -5        2.38350             host C_ssd
  0   hdd  0.79449                 osd.0              up  1.0 1.0
  1   hdd  0.79449                 osd.1              up  1.0 1.0
  2   hdd  0.79449                 osd.2              up  1.0 1.0
 -1       51.96059 root here_spinning
-12       25.98090     room 2_spinning
-11       25.98090         rack 2_spinning
 -2       25.98090 

[ceph-users] Luminous, RGW bucket resharding

2017-12-04 Thread Andreas Calminder
Hello,
With release 12.2.2 dynamic resharding bucket index has been disabled
when running a multisite environment
(http://tracker.ceph.com/issues/21725). Does this mean that resharding
of bucket indexes shouldn't be done at all, manually, while running
multisite as there's a risk of corruption?

Also, as dynamic bucket resharding was/is the main motivator moving to
Luminous (for me at least) is dynamic reshardning something that is
planned to be fixed for multisite environments later in the Luminous
life-cycle or will it be left disabled forever?

Thanks!
/andreas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Another OSD broken today. How can I recover it?

2017-12-04 Thread Gonzalo Aguilar Delgado
Hello,

Things are getting worse every day.


ceph -w
    cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
 health HEALTH_ERR
    1 pgs are stuck inactive for more than 300 seconds
    8 pgs inconsistent
    1 pgs repair
    1 pgs stale
    1 pgs stuck stale
    recovery 20266198323167232/288980 objects degraded
(7013010700798.405%)
    37154696925806624 scrub errors
    no legacy OSD present but 'sortbitwise' flag is not set


But I'm finally finding time to recover. The disk seems to be correct,
no smart errors and everything looks fine, just ceph not starting. Today
I started to look at the ceph-objectstore-tool, which I don't really
know much about.

It just works nice. No crash as expected like on the OSD.

So I'm lost. Since both the OSD and ceph-objectstore-tool use the same backend
how is this possible?

Can someone help me on fixing this, please?



--

ceph-objectstore-tool --debug --op list-pgs --data-path
/var/lib/ceph/osd/ceph-4 --journal-path /dev/sdf3
2017-12-03 13:27:58.206069 7f02c203aa40  0
filestore(/var/lib/ceph/osd/ceph-4) backend xfs (magic 0x58465342)
2017-12-03 13:27:58.206528 7f02c203aa40  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features:
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-12-03 13:27:58.206546 7f02c203aa40  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features:
SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-12-03 13:27:58.206569 7f02c203aa40  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features:
splice is supported
2017-12-03 13:27:58.251393 7f02c203aa40  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2017-12-03 13:27:58.251459 7f02c203aa40  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_feature: extsize is
disabled by conf
2017-12-03 13:27:58.978809 7f02c203aa40  0
filestore(/var/lib/ceph/osd/ceph-4) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
2017-12-03 13:27:58.990051 7f02c203aa40  1 journal _open /dev/sdf3 fd
11: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-12-03 13:27:59.002345 7f02c203aa40  1 journal _open /dev/sdf3 fd
11: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-12-03 13:27:59.004846 7f02c203aa40  1
filestore(/var/lib/ceph/osd/ceph-4) upgrade
Cluster fsid=9028f4da-0d77-462b-be9b-dbdf7fa57771
Supported features: compat={},rocompat={},incompat={1=initial feature
set(~v.18),2=pginfo object,3=object
locator,4=last_epoch_clean,5=categories,6=hobjectpool,7=biginfo,8=leveldbinfo,9=leveldblog,10=snapmapper,11=sharded
objects,12=transaction hints,13=pg meta object}
On-disk features: compat={},rocompat={},incompat={1=initial feature
set(~v.18),2=pginfo object,3=object
locator,4=last_epoch_clean,5=categories,6=hobjectpool,7=biginfo,8=leveldbinfo,9=leveldblog,10=snapmapper,11=sharded
objects,12=transaction hints,13=pg meta object}
Performing list-pgs operation
11.7f
10.4b

10.8d
2017-12-03 13:27:59.009327 7f02c203aa40  1 journal close /dev/sdf3




It looks like the problem has something to do with the map, because there's an
assertion that's failing on size.

Can this have something to do with the fact I got this from map?

  pgmap v71223952: 764 pgs, 6 pools, 561 GB data, 141 kobjects
    1124 GB used, 1514 GB / 2639 GB avail
    *20266198323167232*/288980 objects degraded (7013010700798.405%)

This is the current crash from the command line.

starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4
/var/lib/ceph/osd/ceph-4/journal
osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*,
spg_t, epoch_t*, ceph::bufferlist*)' thread 7f467ba0b8c0 time 2017-12-03
13:39:29.495311
osd/PG.cc: 3025: FAILED assert(values.size() == 2)
 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x80) [0x5556eab28790]
 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
ceph::buffer::list*)+0x661) [0x5556ea4e6601]
 3: (OSD::load_pgs()+0x75a) [0x5556ea43a8aa]
 4: (OSD::init()+0x2026) [0x5556ea445ca6]
 5: (main()+0x2ef1) [0x5556ea3b7301]
 6: (__libc_start_main()+0xf0) [0x7f467886b830]
 7: (_start()+0x29) [0x5556ea3f8b09]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.
2017-12-03 13:39:29.497091 7f467ba0b8c0 -1 osd/PG.cc: In function
'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*,
ceph::bufferlist*)' thread 7f467ba0b8c0 time 2017-12-03 13:39:29.495311
osd/PG.cc: 3025: FAILED assert(values.size() == 2)


So it looks like the offending code is this one:

  int r = store->omap_get_values(coll, pgmeta_oid, keys, &values);
  if (r == 0) {
    assert(values.size() == 2); <-- Here

    // sanity check version

How can values.size() end up as anything other than 2? Can 
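
If I read ObjectStore::omap_get_values() right, it returns 0 even when some of
the requested keys don't exist and only fills in the ones it actually found, so
values.size() < 2 would mean the pg's meta object has lost one of the two omap
entries requested just above this snippet. My next step (untested sketch; I'm
not even sure the pgmeta object shows up in the listing on jewel) would be to
look at which omap keys are really there:

# osd.4 stopped; paths and pgid taken from the output above
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
    --journal-path /dev/sdf3 --op list --pgid 11.7f

# '<pgmeta-object-json>' is a placeholder for the JSON line of the object
# with the empty "oid" from the listing above
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
    --journal-path /dev/sdf3 '<pgmeta-object-json>' list-omap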

[ceph-users] HELP with some basics please

2017-12-04 Thread tim taler
Hi
I'm new to ceph but have the honor of looking after a cluster that I haven't
set up myself.
After rushing through the ceph docs and taking a first glimpse at our cluster
I'm starting to worry about our setup,
so I need some advice and guidance here.

The setup is:
3 machines, each running a ceph-monitor;
all of them also host OSDs.

machine A:
2 OSDs, each 3.6 TB - consisting of 1 disk each (spinning disk)
3 OSDs, each 3.6 TB - each consisting of a 2-disk hardware RAID 0 (spinning disk)
3 OSDs, each 1.8 TB - each consisting of a 2-disk hardware RAID 0 (spinning disk)

machine B:
3 OSDs, each 3.6 TB - consisting of 1 disk each (spinning disk)
3 OSDs, each 3.6 TB - each consisting of a 2-disk hardware RAID 0 (spinning disk)
1 OSD, 1.8 TB - consisting of a 2-disk hardware RAID 0 (spinning disk)

3 OSDs, each 0.7 TB - consisting of 1 disk each (SSD)

machine C:
3 OSDs, each 0.7 TB - consisting of 1 disk each (SSD)

the spinning disks and the SSD disks form two separate pools.

Now what worries me is that I read "don't use RAID together with ceph",
in combination with our pool size:
:~ ceph osd pool get  size
size: 2

From what I understand from the ceph docs, the size tells me "how many disks
may fail" without losing the data of the whole pool.
Is that right? Or can HALF the OSDs fail (since all objects are duplicated)?
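
(For completeness, this is how I'm checking the related settings -- <poolname>
is just a placeholder, and the comments reflect my current understanding, so
please correct me if it's wrong:)

ceph osd pool get <poolname> min_size    # replicas needed for the pool to keep accepting IO
ceph osd dump | grep "replicated size"   # size and min_size of every pool in one go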

Unfortunately I'm not very good at probability, but given a disk failure
probability of 1% per year
I don't feel very secure with this setup. (How do I calculate the probability
that two disks fail "at the same time"? Or has anybody a rough number
for that?)
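
The back-of-envelope I've come up with so far is below -- it assumes
independent failures, the 1% annual rate from above, roughly one day of rebuild
time, 15 spinning OSDs in total and ~8 possible peers in the other room, and it
ignores that a 2-disk RAID 0 fails about twice as often as a single disk -- so
take it as a sketch, not a result:

   P(a given peer dies during the ~1 day rebuild)  ~ 0.01 * (1/365) ~ 2.7e-5
   P(any of ~8 peers dies during that rebuild)     ~ 8 * 2.7e-5     ~ 2.2e-4
   first failures per year                         ~ 15 * 0.01      = 0.15
   expected double failures per year               ~ 0.15 * 2.2e-4  ~ 3e-5

Whether those assumptions (especially the one-day rebuild) are reasonable is
exactly the kind of feedback I'm after.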
Looking at our OSD tree, though, it seems we always try to spread the objects
between two peers:

ID  CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
-19   4.76700 root here_ssd
-15   2.38350 room 2_ssd
-14   2.38350 rack 2_ssd
 -4   2.38350 host B_ssd
  4   hdd  0.79449 osd.4  up  1.0 1.0
  5   hdd  0.79449 osd.5  up  1.0 1.0
 13   hdd  0.79449 osd.13 up  1.0 1.0
-18   2.38350 room 1_ssd
-17   2.38350 rack 1_ssd
 -5   2.38350 host C_ssd
  0   hdd  0.79449 osd.0  up  1.0 1.0
  1   hdd  0.79449 osd.1  up  1.0 1.0
  2   hdd  0.79449 osd.2  up  1.0 1.0
 -1   51.96059 root here_spinning
-12   25.98090 room 2_spinning
-11   25.98090 rack 2_spinning
 -2   25.98090 host B_spinning
  3   hdd  3.99959 osd.3  up  1.0 1.0
  8   hdd  3.99429 osd.8  up  1.0 1.0
  9   hdd  3.99429 osd.9  up  1.0 1.0
 10   hdd  3.99429 osd.10 up  1.0 1.0
 11   hdd  1.99919 osd.11 up  1.0 1.0
 12   hdd  3.99959 osd.12 up  1.0 1.0
 20   hdd  3.99959 osd.20 up  1.0 1.0
-10   25.97969 room 1_spinning
 -8   25.97969 rack l1_spinning
 -3   25.97969 host A_spinning
  6   hdd  3.99959 osd.6  up  1.0 1.0
  7   hdd  3.99959 osd.7  up  1.0 1.0
 14   hdd  3.99429 osd.14 up  1.0 1.0
 15   hdd  3.99429 osd.15 up  1.0 1.0
 16   hdd  3.99429 osd.16 up  1.0 1.0
 17   hdd  1.99919 osd.17 up  1.0 1.0
 18   hdd  1.99919 osd.18 up  1.0 1.0
 19   hdd  1.99919 osd.19 up  1.0 1.0



And the second question:
I tracked the disk usage of our OSDs over the last two weeks and it looks
somewhat strange too:
while osd.14 and osd.20 are filled to well below 60%,
osd.9, osd.16 and osd.18 are well above 80%.
Graphing that shows pretty stable parallel lines, with no hint of
convergence.
That's true for both the HDD and the SSD pool.
Why is that, is it normal and okay, or is there a(nother)
glitch in our config?
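
In case it matters, this is roughly how I've been pulling the numbers (the JSON
field names are what I see on our cluster, so treat the second command as a
sketch):

ceph osd df tree                        # per-OSD %USE next to the crush tree
ceph osd df --format json-pretty \
    | grep -E '"(name|utilization)"'    # same numbers, easier to feed into graphing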

any hints and comments are welcome

TIA
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [Docs] s/ceph-disk/ceph-volume/g ?

2017-12-04 Thread Yoann Moulin
Hello,

Since ceph-disk is now deprecated, it would be great to update the
documentation so the procedures also cover ceph-volume.

For example:

add-or-rm-osds => 
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/

bluestore-migration => 
http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/

In my opinion, the documentation for the luminous branch should keep both
options (ceph-disk and ceph-volume) but with a warning message to
encourage people to use ceph-volume instead of ceph-disk.

I guess there are plenty of references to ceph-disk that need to be updated.
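
For example, the add-or-rm-osds page could show the two tools side by side,
roughly like this (flags written from memory of the luminous man pages, to be
double-checked before they go into the docs):

# old way (deprecated)
ceph-disk prepare --bluestore /dev/sdb
ceph-disk activate /dev/sdb1

# new way, in one shot
ceph-volume lvm create --bluestore --data /dev/sdb

# or split into the prepare/activate steps the docs currently describe
ceph-volume lvm prepare --bluestore --data /dev/sdb
ceph-volume lvm activate <osd-id> <osd-fsid>   # both ids come from "ceph-volume lvm list"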

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com