Re: [ceph-users] Right way to delete OSD from cluster?

2019-02-28 Thread Fyodor Ustinov
Hi!

Yes. But I am a little surprised by what is written in the documentation:
http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-osds/

---
Before you remove an OSD, it is usually up and in. You need to take it out of 
the cluster so that Ceph can begin rebalancing and copying its data to other 
OSDs.
ceph osd out {osd-num}
[...]
---

That is, it is argued that this is the most correct way (otherwise it would not 
have been written in the documentation).



- Original Message -
From: "David Turner" 
To: "Fyodor Ustinov" 
Cc: "Scottix" , "ceph-users" 
Sent: Friday, 1 March, 2019 05:13:27
Subject: Re: [ceph-users] Right way to delete OSD from cluster?

The reason is that an osd still contributes to the host weight in the crush
map even while it is marked out. When you out and then purge, the purging
operation removes the osd from the map and changes the weight of the host,
which changes the crush map and data moves. By weighting the osd to 0.0,
the host's weight is already the same as it will be when you purge the osd.
Weighting to 0.0 is definitely the best option for removing storage if you
can trust the data on the osd being removed.
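
For the archive, the sequence being described would look roughly like this (a
sketch based on this thread; "N" / "osd.N" stand in for the OSD being removed):

# drain the OSD first, so the host weight only changes once
ceph osd crush reweight osd.N 0.0
# wait for the rebalance to finish (watch "ceph -s" until the cluster is healthy)
ceph osd out N
systemctl stop ceph-osd@N          # on the host that carries the OSD
ceph osd purge N --yes-i-really-mean-it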

On Tue, Feb 26, 2019, 3:19 AM Fyodor Ustinov  wrote:

> Hi!
>
> Thank you so much!
>
> I do not understand why, but your variant really causes only one rebalance
> compared to the "osd out".
>
> - Original Message -
> From: "Scottix" 
> To: "Fyodor Ustinov" 
> Cc: "ceph-users" 
> Sent: Wednesday, 30 January, 2019 20:31:32
> Subject: Re: [ceph-users] Right way to delete OSD from cluster?
>
> I generally have gone the crush reweight 0 route
> This way the drive can participate in the rebalance, and the rebalance
> only happens once. Then you can take it out and purge.
>
> If I am not mistaken this is the safest.
>
> ceph osd crush reweight  0
>
> On Wed, Jan 30, 2019 at 7:45 AM Fyodor Ustinov  wrote:
> >
> > Hi!
> >
> > But won't I get undersized objects after "ceph osd crush remove"? That is,
> isn't that effectively the same as simply turning the OSD off and waiting
> for the cluster to recover?
> >
> > - Original Message -
> > From: "Wido den Hollander" 
> > To: "Fyodor Ustinov" , "ceph-users" <
> ceph-users@lists.ceph.com>
> > Sent: Wednesday, 30 January, 2019 15:05:35
> > Subject: Re: [ceph-users] Right way to delete OSD from cluster?
> >
> > On 1/30/19 2:00 PM, Fyodor Ustinov wrote:
> > > Hi!
> > >
> > > I thought I should first do "ceph osd out", wait for the relocation of
> the misplaced objects to finish, and after that do "ceph osd purge".
> > > But after "purge" the cluster starts relocation again.
> > >
> > > Maybe I'm doing something wrong? Then what is the correct way to
> delete the OSD from the cluster?
> > >
> >
> > You are not doing anything wrong, this is the expected behavior. There
> > are two CRUSH changes:
> >
> > - Marking it out
> > - Purging it
> >
> > You could do:
> >
> > $ ceph osd crush remove osd.X
> >
> > Wait for all good
> >
> > $ ceph osd purge X
> >
> > The last step should then not initiate any data movement.
> >
> > Wido
> >
> > > WBR,
> > > Fyodor.
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> T: @Thaumion
> IG: Thaumion
> scot...@gmail.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd pg-upmap-items not working

2019-02-28 Thread Dan van der Ster
It looks like that somewhat unusual crush rule is confusing the new
upmap cleaning.
(debug_mon 10 on the active mon should show those cleanups).
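
If it helps, raising that debug level on the fly would look something like
this (a sketch; substitute the id of your active mon):

ceph tell mon.<id> injectargs '--debug_mon 10/10'
# re-run the pg-upmap-items command, then check /var/log/ceph/ceph-mon.<id>.log
ceph tell mon.<id> injectargs '--debug_mon 1/5'    # back to the default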

I'm copying Xie Xingguo, and probably you should create a tracker for this.

-- dan




On Fri, Mar 1, 2019 at 3:12 AM Kári Bertilsson  wrote:
>
> This is the pool
> pool 41 'ec82_pool' erasure size 10 min_size 8 crush_rule 1 object_hash 
> rjenkins pg_num 512 pgp_num 512 last_change 63794 lfor 21731/21731 flags 
> hashpspool,ec_overwrites stripe_width 32768 application cephfs
>removed_snaps [1~5]
>
> Here is the relevant crush rule:
> rule ec_pool {
>     id 1
>     type erasure
>     min_size 3
>     max_size 10
>     step set_chooseleaf_tries 5
>     step set_choose_tries 100
>     step take default class hdd
>     step choose indep 5 type host
>     step choose indep 2 type osd
>     step emit
> }
>
> Both OSD 23 and 123 are in the same host. So this change should be perfectly 
> acceptable by the rule set.
> Something must be blocking the change, but i can't find anything about it in 
> any logs.
>
> - Kári
>
> On Thu, Feb 28, 2019 at 8:07 AM Dan van der Ster  wrote:
>>
>> Hi,
>>
>> pg-upmap-items became more strict in v12.2.11 when validating upmaps.
>> E.g., it now won't let you put two PGs in the same rack if the crush
>> rule doesn't allow it.
>>
>> Where are OSDs 23 and 123 in your cluster? What is the relevant crush rule?
>>
>> -- dan
>>
>>
>> On Wed, Feb 27, 2019 at 9:17 PM Kári Bertilsson  
>> wrote:
>> >
>> > Hello
>> >
>> > I am trying to diagnose why upmap stopped working where it was previously 
>> > working fine.
>> >
>> > Trying to move pg 41.1 to 123 has no effect and seems to be ignored.
>> >
>> > # ceph osd pg-upmap-items 41.1 23 123
>> > set 41.1 pg_upmap_items mapping to [23->123]
>> >
>> > No rebalancing happens and if I run it again it shows the same output every
>> > time.
>> >
>> > I have in config
>> > debug mgr = 4/5
>> > debug mon = 4/5
>> >
>> > Paste from mon & mgr logs. Also output from "ceph osd dump"
>> > https://pastebin.com/9VrT4YcU
>> >
>> >
>> > I have run "ceph osd set-require-min-compat-client luminous" long time 
>> > ago. And all servers running ceph have been rebooted numerous times since 
>> > then.
>> > But somehow i am still seeing "min_compat_client jewel". I believe that 
>> > upmap was previously working anyway with that "jewel" line present.
>> >
>> > I see no indication in any logs why the upmap commands are being ignored.
>> >
>> > Any suggestions on how to debug further or what could be the issue ?
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic 13.2.4 rbd du slowness

2019-02-28 Thread Glen Baars
Here is the strace result.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.94    0.236170         790       299         5 futex
  0.06    0.000136           0       365           brk
  0.00    0.000000           0        41         2 read
  0.00    0.000000           0        48           write
  0.00    0.000000           0        72        27 open
  0.00    0.000000           0        43           close
  0.00    0.000000           0        10         5 stat
  0.00    0.000000           0        36           fstat
  0.00    0.000000           0         1           lseek
  0.00    0.000000           0       103           mmap
  0.00    0.000000           0        70           mprotect
  0.00    0.000000           0        19           munmap
  0.00    0.000000           0        11           rt_sigaction
  0.00    0.000000           0        32           rt_sigprocmask
  0.00    0.000000           0        26        26 access
  0.00    0.000000           0         3           pipe
  0.00    0.000000           0        19           clone
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         7           uname
  0.00    0.000000           0        12           fcntl
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         2           sysinfo
  0.00    0.000000           0         1           getuid
  0.00    0.000000           0         1           prctl
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           gettid
  0.00    0.000000           0         3           epoll_create
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         1           membarrier
------ ----------- ----------- --------- --------- ----------------
100.00    0.236306                  1231        65 total
From: David Turner 
Sent: Friday, 1 March 2019 11:46 AM
To: Glen Baars 
Cc: Wido den Hollander ; ceph-users 
Subject: Re: [ceph-users] Mimic 13.2.4 rbd du slowness

Have you used strace on the du command to see what it's spending its time doing?

On Thu, Feb 28, 2019, 8:45 PM Glen Baars 
mailto:g...@onsitecomputers.com.au>> wrote:
Hello Wido,

The cluster layout is as follows:

3 x Monitor hosts ( 2 x 10Gbit bonded )
9 x OSD hosts (
2 x 10Gbit bonded,
LSI cachecade and write cache drives set to single,
All HDD in this pool,
no separate DB / WAL. With the write cache and the SSD read cache on the LSI 
card it seems to perform well.
168 OSD disks

No major increase in OSD disk usage or CPU usage. The RBD DU process uses 100% 
of a single 2.4Ghz core while running - I think that is the limiting factor.

I have just tried removing most of the snapshots for that volume ( from 14 
snapshots down to 1 snapshot ) and the rbd du command now takes around 2-3 
minutes.
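
For completeness, the feature/flag state that matters for fast "rbd du" can be
checked per image and per snapshot, roughly like this (a sketch; pool/image
names are placeholders):

rbd info <pool>/<image> | egrep 'features|flags'
# if the object map ever shows up as invalid, rebuilding it should bring the
# fast-diff path back:
rbd object-map rebuild <pool>/<image>
rbd object-map rebuild <pool>/<image>@<snap>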

Kind regards,
Glen Baars

-Original Message-
From: Wido den Hollander mailto:w...@42on.com>>
Sent: Thursday, 28 February 2019 5:05 PM
To: Glen Baars 
mailto:g...@onsitecomputers.com.au>>; 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Mimic 13.2.4 rbd du slowness



On 2/28/19 9:41 AM, Glen Baars wrote:
> Hello Wido,
>
> I have looked at the libvirt code and there is a check to ensure that 
> fast-diff is enabled on the image and only then does it try to get the real 
> disk usage. The issue for me is that even with fast-diff enabled it takes 
> 25min to get the space usage for a 50TB image.
>
> I had considered turning off fast-diff on the large images to get
> around the issue, but I think that will hurt my snapshot removal times
> ( untested )
>

Can you tell a bit more about the Ceph cluster? HDD? SSD? DB and WAL on SSD?

Do you see OSDs spike in CPU or Disk I/O when you do a 'rbd du' on these images?

Wido

> I can't see in the code any other way of bypassing the disk usage check but I 
> am not that familiar with the code.
>
> ---
> if (volStorageBackendRBDUseFastDiff(features)) {
> VIR_DEBUG("RBD image %s/%s has fast-diff feature enabled. "
>   "Querying for actual allocation",
>   def->source.name, vol->name);
>
> if (virStorageBackendRBDSetAllocation(vol, image, &info) < 0)
> goto cleanup;
> } else {
> vol->target.allocation = info.obj_size * info.num_objs; }
> --
>
> Kind regards,
> Glen Baars
>
> -Original Message-
> From: Wido den Hollander mailto:w...@42on.com>>
> Sent: Thursday, 28 February 2019 3:49 PM
> To: Glen Baars 
> mailto:g...@onsitecomputers.com.au>>;
> ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Mimic 13.2.4 rbd du slowness
>
>
>
> On 2/28/19 2:59 AM, Glen Baars wrote:
>> Hello Ceph Us

Re: [ceph-users] Mimic 13.2.4 rbd du slowness

2019-02-28 Thread David Turner
Have you used strace on the du command to see what it's spending its time
doing?

On Thu, Feb 28, 2019, 8:45 PM Glen Baars 
wrote:

> Hello Wido,
>
> The cluster layout is as follows:
>
> 3 x Monitor hosts ( 2 x 10Gbit bonded )
> 9 x OSD hosts (
> 2 x 10Gbit bonded,
> LSI cachecade and write cache drives set to single,
> All HDD in this pool,
> no separate DB / WAL. With the write cache and the SSD read cache on the
> LSI card it seems to perform well.
> 168 OSD disks
>
> No major increase in OSD disk usage or CPU usage. The RBD DU process uses
> 100% of a single 2.4Ghz core while running - I think that is the limiting
> factor.
>
> I have just tried removing most of the snapshots for that volume ( from 14
> snapshots down to 1 snapshot ) and the rbd du command now takes around 2-3
> minutes.
>
> Kind regards,
> Glen Baars
>
> -Original Message-
> From: Wido den Hollander 
> Sent: Thursday, 28 February 2019 5:05 PM
> To: Glen Baars ; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Mimic 13.2.4 rbd du slowness
>
>
>
> On 2/28/19 9:41 AM, Glen Baars wrote:
> > Hello Wido,
> >
> > I have looked at the libvirt code and there is a check to ensure that
> fast-diff is enabled on the image and only then does it try to get the real
> disk usage. The issue for me is that even with fast-diff enabled it takes
> 25min to get the space usage for a 50TB image.
> >
> > I had considered turning off fast-diff on the large images to get
> > around the issue, but I think that will hurt my snapshot removal times
> > ( untested )
> >
>
> Can you tell a bit more about the Ceph cluster? HDD? SSD? DB and WAL on
> SSD?
>
> Do you see OSDs spike in CPU or Disk I/O when you do a 'rbd du' on these
> images?
>
> Wido
>
> > I can't see in the code any other way of bypassing the disk usage check
> but I am not that familiar with the code.
> >
> > ---
> > if (volStorageBackendRBDUseFastDiff(features)) {
> > VIR_DEBUG("RBD image %s/%s has fast-diff feature enabled. "
> >   "Querying for actual allocation",
> >   def->source.name, vol->name);
> >
> > if (virStorageBackendRBDSetAllocation(vol, image, &info) < 0)
> > goto cleanup;
> > } else {
> > vol->target.allocation = info.obj_size * info.num_objs; }
> > --
> >
> > Kind regards,
> > Glen Baars
> >
> > -Original Message-
> > From: Wido den Hollander 
> > Sent: Thursday, 28 February 2019 3:49 PM
> > To: Glen Baars ;
> > ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Mimic 13.2.4 rbd du slowness
> >
> >
> >
> > On 2/28/19 2:59 AM, Glen Baars wrote:
> >> Hello Ceph Users,
> >>
> >> Has anyone found a way to improve the speed of the rbd du command on
> large rbd images? I have object map and fast diff enabled - no invalid
> flags on the image or its snapshots.
> >>
> >> We recently upgraded our Ubuntu 16.04 KVM servers for Cloudstack to
> Ubuntu 18.04. That upgrades libvirt to version 4. When libvirt 4 adds an rbd
> pool it discovers all images in the pool and tries to get their disk usage.
> We are seeing a 50TB image take 25min. The pool has over 300TB of images in
> it and takes hours for libvirt to start.
> >>
> >
> > This is actually a pretty bad thing imho. As a lot of images people will
> be using do not have fast-diff enabled (images from the past) and that will
> kill their performance.
> >
> > Isn't there a way to turn this off in libvirt?
> >
> > Wido
> >
> >> We can replicate the issue without libvirt by just running a rbd du on
> the large images. The limiting factor is the cpu on the rbd du command, it
> uses 100% of a single core.
> >>
> >> Our cluster is completely bluestore/mimic 13.2.4. 168 OSDs, 12 Ubuntu
> 16.04 hosts.
> >>
> >> Kind regards,
> >> Glen Baars
> >> This e-mail is intended solely for the benefit of the addressee(s) and
> any other named recipient. It is confidential and may contain legally
> privileged or confidential information. If you are not the recipient, any
> use, distribution, disclosure or copying of this e-mail is prohibited. The
> confidentiality and legal privilege attached to this communication is not
> waived or lost by reason of the mistaken transmission or delivery to you.
> If you have received this e-mail in error, please notify us immediately.
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > This e-mail is intended solely for the benefit of the addressee(s) and
> any other named recipient. It is confidential and may contain legally
> privileged or confidential information. If you are not the recipient, any
> use, distribution, disclosure or copying of this e-mail is prohibited. The
> confidentiality and legal privilege attached to this communication is not
> waived or lost by reason of the mistaken transmission or delivery to you.
> If you have received this e-mail

Re: [ceph-users] rbd unmap fails with error: rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy

2019-02-28 Thread David Turner
Why are you mapping the same rbd to multiple servers?
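
The checks Ilya suggests below translate to roughly the following on the client
that refuses to unmap (a sketch; /dev/rbd0 assumed):

lsof /dev/rbd0
fuser -vm /dev/rbd0
lsblk /dev/rbd0                    # LVM/device-mapper holders show up as children
multipath -ll                      # is multipath claiming the device?
# newer kernels/clients also have a force option, but only use it once the
# holders above are gone:
rbd unmap -o force /dev/rbd0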

On Wed, Feb 27, 2019, 9:50 AM Ilya Dryomov  wrote:

> On Wed, Feb 27, 2019 at 12:00 PM Thomas <74cmo...@gmail.com> wrote:
> >
> > Hi,
> > I have noticed an error when writing to a mapped RBD.
> > Therefore I unmounted the block device.
> > Then I tried to unmap it w/o success:
> > ld2110:~ # rbd unmap /dev/rbd0
> > rbd: sysfs write failed
> > rbd: unmap failed: (16) Device or resource busy
> >
> > The same block device is mapped on another client and there are no
> issues:
> > root@ld4257:~# rbd info hdb-backup/ld2110
> > rbd image 'ld2110':
> > size 7.81TiB in 2048000 objects
> > order 22 (4MiB objects)
> > block_name_prefix: rbd_data.3cda0d6b8b4567
> > format: 2
> > features: layering
> > flags:
> > create_timestamp: Fri Feb 15 10:53:50 2019
> > root@ld4257:~# rados -p hdb-backup  listwatchers rbd_data.3cda0d6b8b4567
> > error listing watchers hdb-backup/rbd_data.3cda0d6b8b4567: (2) No such
> > file or directory
> > root@ld4257:~# rados -p hdb-backup  listwatchers
> rbd_header.3cda0d6b8b4567
> > watcher=10.76.177.185:0/1144812735 client.21865052 cookie=1
> > watcher=10.97.206.97:0/4023931980 client.18484780
> > cookie=18446462598732841027
> >
> >
> > Question:
> > How can I force to unmap the RBD on client ld2110 (= 10.76.177.185)?
>
> Hi Thomas,
>
> It appears that /dev/rbd0 is still open on that node.
>
> Was the unmount successful?  Which filesystem (ext4, xfs, etc)?
>
> What is the output of "ps aux | grep rbd" on that node?
>
> Try lsof, fuser, check for LVM volumes and multipath -- these have been
> reported to cause this issue previously:
>
>   http://tracker.ceph.com/issues/12763
>
> Thanks,
>
> Ilya
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG Calculations Issue

2019-02-28 Thread David Turner
Those numbers look right for a pool only containing 10% of your data. Now
continue to calculate the pg counts for the remaining 90% of your data.

On Wed, Feb 27, 2019, 12:17 PM Krishna Venkata 
wrote:

> Greetings,
>
>
> I am having issues in the way PGs are calculated in
> https://ceph.com/pgcalc/ [Ceph PGs per Pool Calculator ] and the formulae
> mentioned in the site.
>
> Below are my findings
>
> The formula to calculate PGs as mentioned in the https://ceph.com/pgcalc/
>  :
>
> 1.  Need to pick the highest value from either of the formulas
>
> *(( Target PGs per OSD ) x ( OSD # ) x ( %Data ))/(size)*
>
> Or
>
> *( OSD# ) / ( Size )*
>
> 2.  The output value is then rounded to the nearest power of 2
>
>1. If the nearest power of 2 is more than 25% below the original
>value, the next higher power of 2 is used.
>
>
>
> Based on the above procedure, we calculated PGs for 25, 32 and 64 OSDs
>
> *Our Dataset:*
>
> *%Data:* 0.10
>
> *Target PGs per OSD:* 100
>
> *OSDs* 25, 32 and 64
>
>
>
> *For 25 OSDs*
>
>
>
> (100*25* (0.10/100))/(3) = 0.833
>
>
>
> ( 25 ) / ( 3 ) = 8.33
>
>
>
> 1. Raw pg num 8.33  ( Since we need to pick the highest of (0.833, 8.33))
>
> 2. max pg 16 ( For, 8.33 the nearest power of 2 is 16)
>
> 3. 16 > 2.08  ( 25 % of 8.33 is 2.08 which is more than 25% the power of 2)
>
>
>
> So 16 PGs
>
> ✓  GUI Calculator gives the same value and matches with Formula.
>
>
>
> *For 32 OSD*
>
>
>
> (100*32*(0.10/100))/3 = 1.066
>
> ( 32 ) / ( 3 ) = 10.66
>
>
>
> 1. Raw pg num 10.66 ( Since we need to pick the highest of (1.066, 10.66))
>
> 2. max pg 16 ( For, 10.66 the nearest power of 2 is 16)
>
> 3.  16 > 2.655 ( 25 % of 10.66 is 2.655 which is more than 25% the power
> of 2)
>
>
>
> So 16 PGs
>
> ✗  GUI Calculator gives different value (32 PGs) which doesn’t match with
> Formula.
>
>
>
> *For 64 OSD*
>
>
>
> (100 * 64 * (0.10/100))/3 = 2.133
>
> ( 64 ) / ( 3 ) 21.33
>
>
>
> 1. Raw pg num 21.33 ( Since we need to pick the highest of (2.133, 21.33))
>
> 2. max pg 32 ( For, 21.33 the nearest power of 2 is 32)
>
> 3. 32 > 5.3325 ( 25 % of 21.33 is 5.3325 which is more than 25% the power
> of 2)
>
>
>
> So 32 PGs
>
> ✗  GUI Calculator gives different value (64 PGs) which doesn’t match with
> Formula.
>
>
>
> We checked the PG calculator logic from [
> https://ceph.com/pgcalc_assets/pgcalc.js ] which is not matching from
> above formulae.
>
>
>
> Can someone Guide/reference us to correct formulae to calculate PGs.
>
>
>
> Thanks in advance.
>
>
>
> Regards,
>
> Krishna Venkata
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] redirect log to syslog and disable log to stderr

2019-02-28 Thread David Turner
You can always set it in your ceph.conf file and restart the mgr daemon.
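
Concretely, something like this in ceph.conf on the mgr host (a sketch; these
are the generic log options, put them under [mgr] or [global]), followed by a
daemon restart:

[mgr]
log to stderr = false
log to syslog = true

# then restart the daemon:
systemctl restart ceph-mgr@<id>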

On Tue, Feb 26, 2019, 1:30 PM Alex Litvak 
wrote:

> Dear Cephers,
>
> In mimic 13.2.2
> ceph tell mgr.* injectargs --log-to-stderr=false
> Returns an error (no valid command found ...).  What is the correct way to
> inject mgr configuration values?
>
> The same command works on mon
>
> ceph tell mon.* injectargs --log-to-stderr=false
>
>
> Thank you in advance,
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Right way to delete OSD from cluster?

2019-02-28 Thread David Turner
The reason is that an osd still contributes to the host weight in the crush
map even while it is marked out. When you out and then purge, the purging
operation removes the osd from the map and changes the weight of the host,
which changes the crush map and data moves. By weighting the osd to 0.0,
the host's weight is already the same as it will be when you purge the osd.
Weighting to 0.0 is definitely the best option for removing storage if you
can trust the data on the osd being removed.

On Tue, Feb 26, 2019, 3:19 AM Fyodor Ustinov  wrote:

> Hi!
>
> Thank you so much!
>
> I do not understand why, but your variant really causes only one rebalance
> compared to the "osd out".
>
> - Original Message -
> From: "Scottix" 
> To: "Fyodor Ustinov" 
> Cc: "ceph-users" 
> Sent: Wednesday, 30 January, 2019 20:31:32
> Subject: Re: [ceph-users] Right way to delete OSD from cluster?
>
> I generally have gone the crush reweight 0 route
> This way the drive can participate in the rebalance, and the rebalance
> only happens once. Then you can take it out and purge.
>
> If I am not mistaken this is the safest.
>
> ceph osd crush reweight  0
>
> On Wed, Jan 30, 2019 at 7:45 AM Fyodor Ustinov  wrote:
> >
> > Hi!
> >
> > But won't I get undersized objects after "ceph osd crush remove"? That is,
> isn't that effectively the same as simply turning the OSD off and waiting
> for the cluster to recover?
> >
> > - Original Message -
> > From: "Wido den Hollander" 
> > To: "Fyodor Ustinov" , "ceph-users" <
> ceph-users@lists.ceph.com>
> > Sent: Wednesday, 30 January, 2019 15:05:35
> > Subject: Re: [ceph-users] Right way to delete OSD from cluster?
> >
> > On 1/30/19 2:00 PM, Fyodor Ustinov wrote:
> > > Hi!
> > >
> > > I thought I should first do "ceph osd out", wait for the relocation of
> the misplaced objects to finish, and after that do "ceph osd purge".
> > > But after "purge" the cluster starts relocation again.
> > >
> > > Maybe I'm doing something wrong? Then what is the correct way to
> delete the OSD from the cluster?
> > >
> >
> > You are not doing anything wrong, this is the expected behavior. There
> > are two CRUSH changes:
> >
> > - Marking it out
> > - Purging it
> >
> > You could do:
> >
> > $ ceph osd crush remove osd.X
> >
> > Wait for all good
> >
> > $ ceph osd purge X
> >
> > The last step should then not initiate any data movement.
> >
> > Wido
> >
> > > WBR,
> > > Fyodor.
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> T: @Thaumion
> IG: Thaumion
> scot...@gmail.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Configuration about using nvme SSD

2019-02-28 Thread 韦皓诚
I have tried to divide an nvme disk into four partitions. However, no
significant improvement was found in performance by rados bench.
nvme with partition: 1 node 3 nvme 12 osd, 166066 iops in 4K read
nvme without partition: 1 node 3 nvme 3 osd 163336 iops in 4K read
My ceph version is 12.2.4.
What's wrong with my test?
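
For reference, the tuning Wido suggests below would look roughly like this (a
sketch; the values are his suggestions, not measured optima):

# ceph.conf on the OSD hosts, then restart the OSDs
[osd]
osd op num threads per shard = 4

# pin the CPU to performance / shallow C-states (one common way, via cpupower)
cpupower frequency-set -g performance
cpupower idle-set -D 0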

Wido den Hollander wrote on Monday, 25 February 2019 at 7:02 PM:
>
>
>
> On 2/24/19 4:34 PM, David Turner wrote:
> > One thing that's worked for me to get more out of nvmes with Ceph is to
> > create multiple partitions on the nvme with an osd on each partition.
> > That way you get more osd processes and CPU per nvme device. I've heard
> > of people using up to 4 partitions like this.
> >
>
> Increasing the amount of Placement Groups also works. In addition you
> should also increase osd_op_num_threads_per_shard to something like 4.
>
> This will increase CPU usage, but you should also be able to get more
> out of the NVMe devices.
>
> In addition, make sure you pin the CPU C-States to 1 and disable
> powersaving for the CPU.
>
> Wido
>
> > On Sun, Feb 24, 2019, 10:25 AM Vitaliy Filippov  > > wrote:
> >
> > > We can get 513558 IOPS in 4K read per nvme by fio but only 45146 IOPS
> > > per OSD by rados.
> >
> > Don't expect Ceph to fully utilize NVMe's, it's software and it's
> > slow :)
> > some colleagues tell that SPDK works out of the box, but almost
> > doesn't
> > increase performance, because the userland-kernel interaction isn't
> > the
> > bottleneck currently, it's Ceph code itself. I also tried once, but I
> > couldn't make it work. When I have some spare NVMe's I'll make another
> > attempt.
> >
> > So... try it and share your results here :) we're all interested.
> >
> > --
> > With best regards,
> >Vitaliy Filippov
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd pg-upmap-items not working

2019-02-28 Thread Kári Bertilsson
This is the pool
pool 41 'ec82_pool' erasure size 10 min_size 8 crush_rule 1 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 63794 lfor 21731/21731 flags
hashpspool,ec_overwrites stripe_width 32768 application cephfs
   removed_snaps [1~5]

Here is the relevant crush rule:
rule ec_pool {
    id 1
    type erasure
    min_size 3
    max_size 10
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step choose indep 5 type host
    step choose indep 2 type osd
    step emit
}

Both OSD 23 and 123 are in the same host. So this change should be
perfectly acceptable by the rule set.
Something must be blocking the change, but i can't find anything about it
in any logs.

- Kári

On Thu, Feb 28, 2019 at 8:07 AM Dan van der Ster  wrote:

> Hi,
>
> pg-upmap-items became more strict in v12.2.11 when validating upmaps.
> E.g., it now won't let you put two PGs in the same rack if the crush
> rule doesn't allow it.
>
> Where are OSDs 23 and 123 in your cluster? What is the relevant crush rule?
>
> -- dan
>
>
> On Wed, Feb 27, 2019 at 9:17 PM Kári Bertilsson 
> wrote:
> >
> > Hello
> >
> > I am trying to diagnose why upmap stopped working where it was
> previously working fine.
> >
> > Trying to move pg 41.1 to 123 has no effect and seems to be ignored.
> >
> > # ceph osd pg-upmap-items 41.1 23 123
> > set 41.1 pg_upmap_items mapping to [23->123]
> >
> > No rebalancing happens and if I run it again it shows the same output
> every time.
> >
> > I have in config
> > debug mgr = 4/5
> > debug mon = 4/5
> >
> > Paste from mon & mgr logs. Also output from "ceph osd dump"
> > https://pastebin.com/9VrT4YcU
> >
> >
> > I have run "ceph osd set-require-min-compat-client luminous" long time
> ago. And all servers running ceph have been rebooted numerous times since
> then.
> > But somehow i am still seeing "min_compat_client jewel". I believe that
> upmap was previously working anyway with that "jewel" line present.
> >
> > I see no indication in any logs why the upmap commands are being ignored.
> >
> > Any suggestions on how to debug further or what could be the issue ?
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic 13.2.4 rbd du slowness

2019-02-28 Thread Glen Baars
Hello Wido,

The cluster layout is as follows:

3 x Monitor hosts ( 2 x 10Gbit bonded )
9 x OSD hosts (
2 x 10Gbit bonded,
LSI cachecade and write cache drives set to single,
All HDD in this pool,
no separate DB / WAL. With the write cache and the SSD read cache on the LSI 
card it seems to perform well.
168 OSD disks

No major increase in OSD disk usage or CPU usage. The RBD DU process uses 100% 
of a single 2.4Ghz core while running - I think that is the limiting factor.

I have just tried removing most of the snapshots for that volume ( from 14 
snapshots down to 1 snapshot ) and the rbd du command now takes around 2-3 
minutes.

Kind regards,
Glen Baars

-Original Message-
From: Wido den Hollander 
Sent: Thursday, 28 February 2019 5:05 PM
To: Glen Baars ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Mimic 13.2.4 rbd du slowness



On 2/28/19 9:41 AM, Glen Baars wrote:
> Hello Wido,
>
> I have looked at the libvirt code and there is a check to ensure that 
> fast-diff is enabled on the image and only then does it try to get the real 
> disk usage. The issue for me is that even with fast-diff enabled it takes 
> 25min to get the space usage for a 50TB image.
>
> I had considered turning off fast-diff on the large images to get
> around the issue, but I think that will hurt my snapshot removal times
> ( untested )
>

Can you tell a bit more about the Ceph cluster? HDD? SSD? DB and WAL on SSD?

Do you see OSDs spike in CPU or Disk I/O when you do a 'rbd du' on these images?

Wido

> I can't see in the code any other way of bypassing the disk usage check but I 
> am not that familiar with the code.
>
> ---
> if (volStorageBackendRBDUseFastDiff(features)) {
> VIR_DEBUG("RBD image %s/%s has fast-diff feature enabled. "
>   "Querying for actual allocation",
>   def->source.name, vol->name);
>
> if (virStorageBackendRBDSetAllocation(vol, image, &info) < 0)
> goto cleanup;
> } else {
> vol->target.allocation = info.obj_size * info.num_objs; }
> --
>
> Kind regards,
> Glen Baars
>
> -Original Message-
> From: Wido den Hollander 
> Sent: Thursday, 28 February 2019 3:49 PM
> To: Glen Baars ;
> ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Mimic 13.2.4 rbd du slowness
>
>
>
> On 2/28/19 2:59 AM, Glen Baars wrote:
>> Hello Ceph Users,
>>
>> Has anyone found a way to improve the speed of the rbd du command on large 
>> rbd images? I have object map and fast diff enabled - no invalid flags on 
>> the image or its snapshots.
>>
>> We recently upgraded our Ubuntu 16.04 KVM servers for Cloudstack to Ubuntu 
>> 18.04. That upgrades libvirt to version 4. When libvirt 4 adds an rbd pool it 
>> discovers all images in the pool and tries to get their disk usage. We are 
>> seeing a 50TB image take 25min. The pool has over 300TB of images in it and 
>> takes hours for libvirt to start.
>>
>
> This is actually a pretty bad thing imho. As a lot of images people will be 
> using do not have fast-diff enabled (images from the past) and that will kill 
> their performance.
>
> Isn't there a way to turn this off in libvirt?
>
> Wido
>
>> We can replicate the issue without libvirt by just running a rbd du on the 
>> large images. The limiting factor is the cpu on the rbd du command, it uses 
>> 100% of a single core.
>>
>> Our cluster is completely bluestore/mimic 13.2.4. 168 OSDs, 12 Ubuntu 16.04 
>> hosts.
>>
>> Kind regards,
>> Glen Baars
>> This e-mail is intended solely for the benefit of the addressee(s) and any 
>> other named recipient. It is confidential and may contain legally privileged 
>> or confidential information. If you are not the recipient, any use, 
>> distribution, disclosure or copying of this e-mail is prohibited. The 
>> confidentiality and legal privilege attached to this communication is not 
>> waived or lost by reason of the mistaken transmission or delivery to you. If 
>> you have received this e-mail in error, please notify us immediately.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> This e-mail is intended solely for the benefit of the addressee(s) and any 
> other named recipient. It is confidential and may contain legally privileged 
> or confidential information. If you are not the recipient, any use, 
> distribution, disclosure or copying of this e-mail is prohibited. The 
> confidentiality and legal privilege attached to this communication is not 
> waived or lost by reason of the mistaken transmission or delivery to you. If 
> you have received this e-mail in error, please notify us immediately.
>
This e-mail is intended solely for the benefit of the addressee(s) and any 
other named recipient. It is confidential and may contain legally privileged or 
confidential information. If you are not the recipient, any use, distribution, 

Re: [ceph-users] rbd space usage

2019-02-28 Thread Matthew H
It looks like he used 'rbd map' to map his volume. If so, then yes just run 
fstrim on the device.

If it's an instance with a Cinder volume, or a Nova ephemeral disk (on Ceph), then you 
have to use virtio-scsi to run discard in your instance.
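
In the krbd case that boils down to something like this (a sketch; mount point
taken from the thread, filesystem assumed to support discard):

fstrim -v /mnt/nfsroot/rbd0                      # one-off space reclaim
mount -o remount,discard /mnt/nfsroot/rbd0       # or use a periodic fstrim.timer instead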


From: ceph-users  on behalf of Jack 

Sent: Thursday, February 28, 2019 5:39 PM
To: solarflow99
Cc: Ceph Users
Subject: Re: [ceph-users] rbd space usage

Ha, that was your issue

RBD does not know that your space (on the filesystem level) is now free
to use

You have to trim your filesystem, see fstrim(8) as well as the discard
mount option

The related scsi commands have to be passed down the stack, so you may
need to check at other levels (for instance, your hypervisor's configuration)

Regards,

On 02/28/2019 11:31 PM, solarflow99 wrote:
> yes, but:
>
> # rbd showmapped
> id pool image snap device
> 0  rbd  nfs1  -/dev/rbd0
> 1  rbd  nfs2  -/dev/rbd1
>
>
> # df -h
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/rbd0   8.0T  4.8T  3.3T  60% /mnt/nfsroot/rbd0
> /dev/rbd1   9.8T   34M  9.8T   1% /mnt/nfsroot/rbd1
>
>
> only 5T is taken up
>
>
> On Thu, Feb 28, 2019 at 2:26 PM Jack  wrote:
>
>> Are not you using 3-replicas pool ?
>>
>> (15745GB + 955GB + 1595M) * 3 ~= 51157G (there is overhead involved)
>>
>> Best regards,
>>
>> On 02/28/2019 11:09 PM, solarflow99 wrote:
>>> thanks, I still can't understand whats taking up all the space 27.75
>>>
>>> On Thu, Feb 28, 2019 at 7:18 AM Mohamad Gebai  wrote:
>>>
 On 2/27/19 4:57 PM, Marc Roos wrote:
> They are 'thin provisioned' meaning if you create a 10GB rbd, it does
> not use 10GB at the start. (afaik)

 You can use 'rbd -p rbd du' to see how much of these devices is
 provisioned and see if it's coherent.

 Mohamad

>
>
> -Original Message-
> From: solarflow99 [mailto:solarflo...@gmail.com]
> Sent: 27 February 2019 22:55
> To: Ceph Users
> Subject: [ceph-users] rbd space usage
>
> using ceph df it looks as if RBD images can use the total free space
> available of the pool it belongs to, 8.54% yet I know they are created
> with a --size parameter and thats what determines the actual space.  I
> can't understand the difference i'm seeing, only 5T is being used but
> ceph df shows 51T:
>
>
> /dev/rbd0   8.0T  4.8T  3.3T  60% /mnt/nfsroot/rbd0
> /dev/rbd1   9.8T   34M  9.8T   1% /mnt/nfsroot/rbd1
>
>
>
> # ceph df
> GLOBAL:
> SIZE AVAIL RAW USED %RAW USED
> 180T  130T   51157G 27.75
> POOLS:
> NAMEID USED   %USED MAX AVAIL
> OBJECTS
> rbd 0  15745G  8.543G
> 4043495
> cephfs_data 1   0 03G
> 0
> cephfs_metadata 21962 03G
>20
> spider_stage 9   1595M 03G47835
> spider   10   955G  0.523G
> 42541237
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd space usage

2019-02-28 Thread Jack
Ha, that was your issue

RBD does not know that your space (on the filesystem level) is now free
to use

You have to trim your filesystem, see fstrim(8) as well as the discard
mount option

The related scsi commands have to be passed down the stack, so you may
need to check at other levels (for instance, your hypervisor's configuration)

Regards,

On 02/28/2019 11:31 PM, solarflow99 wrote:
> yes, but:
> 
> # rbd showmapped
> id pool image snap device
> 0  rbd  nfs1  -/dev/rbd0
> 1  rbd  nfs2  -/dev/rbd1
> 
> 
> # df -h
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/rbd0   8.0T  4.8T  3.3T  60% /mnt/nfsroot/rbd0
> /dev/rbd1   9.8T   34M  9.8T   1% /mnt/nfsroot/rbd1
> 
> 
> only 5T is taken up
> 
> 
> On Thu, Feb 28, 2019 at 2:26 PM Jack  wrote:
> 
>> Are not you using 3-replicas pool ?
>>
>> (15745GB + 955GB + 1595M) * 3 ~= 51157G (there is overhead involved)
>>
>> Best regards,
>>
>> On 02/28/2019 11:09 PM, solarflow99 wrote:
>>> thanks, I still can't understand whats taking up all the space 27.75
>>>
>>> On Thu, Feb 28, 2019 at 7:18 AM Mohamad Gebai  wrote:
>>>
 On 2/27/19 4:57 PM, Marc Roos wrote:
> They are 'thin provisioned' meaning if you create a 10GB rbd, it does
> not use 10GB at the start. (afaik)

 You can use 'rbd -p rbd du' to see how much of these devices is
 provisioned and see if it's coherent.

 Mohamad

>
>
> -Original Message-
> From: solarflow99 [mailto:solarflo...@gmail.com]
> Sent: 27 February 2019 22:55
> To: Ceph Users
> Subject: [ceph-users] rbd space usage
>
> using ceph df it looks as if RBD images can use the total free space
> available of the pool it belongs to, 8.54% yet I know they are created
> with a --size parameter and thats what determines the actual space.  I
> can't understand the difference i'm seeing, only 5T is being used but
> ceph df shows 51T:
>
>
> /dev/rbd0   8.0T  4.8T  3.3T  60% /mnt/nfsroot/rbd0
> /dev/rbd1   9.8T   34M  9.8T   1% /mnt/nfsroot/rbd1
>
>
>
> # ceph df
> GLOBAL:
> SIZE AVAIL RAW USED %RAW USED
> 180T  130T   51157G 27.75
> POOLS:
> NAMEID USED   %USED MAX AVAIL
> OBJECTS
> rbd 0  15745G  8.543G
> 4043495
> cephfs_data 1   0 03G
> 0
> cephfs_metadata 21962 03G
>20
> spider_stage 9   1595M 03G47835
> spider   10   955G  0.523G
> 42541237
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd space usage

2019-02-28 Thread Matthew H
I think the command you are looking for is 'rbd du'

example

rbd du rbd/myimagename


From: ceph-users  on behalf of solarflow99 

Sent: Thursday, February 28, 2019 5:31 PM
To: Jack
Cc: Ceph Users
Subject: Re: [ceph-users] rbd space usage

yes, but:

# rbd showmapped
id pool image snap device
0  rbd  nfs1  -/dev/rbd0
1  rbd  nfs2  -/dev/rbd1


# df -h
Filesystem  Size  Used Avail Use% Mounted on
/dev/rbd0   8.0T  4.8T  3.3T  60% /mnt/nfsroot/rbd0
/dev/rbd1   9.8T   34M  9.8T   1% /mnt/nfsroot/rbd1


only 5T is taken up


On Thu, Feb 28, 2019 at 2:26 PM Jack 
mailto:c...@jack.fr.eu.org>> wrote:
Are not you using 3-replicas pool ?

(15745GB + 955GB + 1595M) * 3 ~= 51157G (there is overhead involved)

Best regards,

On 02/28/2019 11:09 PM, solarflow99 wrote:
> thanks, I still can't understand whats taking up all the space 27.75
>
> On Thu, Feb 28, 2019 at 7:18 AM Mohamad Gebai 
> mailto:mge...@suse.de>> wrote:
>
>> On 2/27/19 4:57 PM, Marc Roos wrote:
>>> They are 'thin provisioned' meaning if you create a 10GB rbd, it does
>>> not use 10GB at the start. (afaik)
>>
>> You can use 'rbd -p rbd du' to see how much of these devices is
>> provisioned and see if it's coherent.
>>
>> Mohamad
>>
>>>
>>>
>>> -Original Message-
>>> From: solarflow99 
>>> [mailto:solarflo...@gmail.com]
>>> Sent: 27 February 2019 22:55
>>> To: Ceph Users
>>> Subject: [ceph-users] rbd space usage
>>>
>>> using ceph df it looks as if RBD images can use the total free space
>>> available of the pool it belongs to, 8.54% yet I know they are created
>>> with a --size parameter and thats what determines the actual space.  I
>>> can't understand the difference i'm seeing, only 5T is being used but
>>> ceph df shows 51T:
>>>
>>>
>>> /dev/rbd0   8.0T  4.8T  3.3T  60% /mnt/nfsroot/rbd0
>>> /dev/rbd1   9.8T   34M  9.8T   1% /mnt/nfsroot/rbd1
>>>
>>>
>>>
>>> # ceph df
>>> GLOBAL:
>>> SIZE AVAIL RAW USED %RAW USED
>>> 180T  130T   51157G 27.75
>>> POOLS:
>>> NAMEID USED   %USED MAX AVAIL
>>> OBJECTS
>>> rbd 0  15745G  8.543G
>>> 4043495
>>> cephfs_data 1   0 03G
>>> 0
>>> cephfs_metadata 21962 03G
>>>20
>>> spider_stage 9   1595M 03G47835
>>> spider   10   955G  0.523G
>>> 42541237
>>>
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd space usage

2019-02-28 Thread solarflow99
yes, but:

# rbd showmapped
id pool image snap device
0  rbd  nfs1  -/dev/rbd0
1  rbd  nfs2  -/dev/rbd1


# df -h
Filesystem  Size  Used Avail Use% Mounted on
/dev/rbd0   8.0T  4.8T  3.3T  60% /mnt/nfsroot/rbd0
/dev/rbd1   9.8T   34M  9.8T   1% /mnt/nfsroot/rbd1


only 5T is taken up


On Thu, Feb 28, 2019 at 2:26 PM Jack  wrote:

> Are not you using 3-replicas pool ?
>
> (15745GB + 955GB + 1595M) * 3 ~= 51157G (there is overhead involved)
>
> Best regards,
>
> On 02/28/2019 11:09 PM, solarflow99 wrote:
> > thanks, I still can't understand whats taking up all the space 27.75
> >
> > On Thu, Feb 28, 2019 at 7:18 AM Mohamad Gebai  wrote:
> >
> >> On 2/27/19 4:57 PM, Marc Roos wrote:
> >>> They are 'thin provisioned' meaning if you create a 10GB rbd, it does
> >>> not use 10GB at the start. (afaik)
> >>
> >> You can use 'rbd -p rbd du' to see how much of these devices is
> >> provisioned and see if it's coherent.
> >>
> >> Mohamad
> >>
> >>>
> >>>
> >>> -Original Message-
> >>> From: solarflow99 [mailto:solarflo...@gmail.com]
> >>> Sent: 27 February 2019 22:55
> >>> To: Ceph Users
> >>> Subject: [ceph-users] rbd space usage
> >>>
> >>> using ceph df it looks as if RBD images can use the total free space
> >>> available of the pool it belongs to, 8.54% yet I know they are created
> >>> with a --size parameter and thats what determines the actual space.  I
> >>> can't understand the difference i'm seeing, only 5T is being used but
> >>> ceph df shows 51T:
> >>>
> >>>
> >>> /dev/rbd0   8.0T  4.8T  3.3T  60% /mnt/nfsroot/rbd0
> >>> /dev/rbd1   9.8T   34M  9.8T   1% /mnt/nfsroot/rbd1
> >>>
> >>>
> >>>
> >>> # ceph df
> >>> GLOBAL:
> >>> SIZE AVAIL RAW USED %RAW USED
> >>> 180T  130T   51157G 27.75
> >>> POOLS:
> >>> NAMEID USED   %USED MAX AVAIL
> >>> OBJECTS
> >>> rbd 0  15745G  8.543G
> >>> 4043495
> >>> cephfs_data 1   0 03G
> >>> 0
> >>> cephfs_metadata 21962 03G
> >>>20
> >>> spider_stage 9   1595M 03G47835
> >>> spider   10   955G  0.523G
> >>> 42541237
> >>>
> >>>
> >>>
> >>>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd space usage

2019-02-28 Thread Jack
Are not you using 3-replicas pool ?

(15745GB + 955GB + 1595M) * 3 ~= 51157G (there is overhead involved)

Best regards,

On 02/28/2019 11:09 PM, solarflow99 wrote:
> thanks, I still can't understand whats taking up all the space 27.75
> 
> On Thu, Feb 28, 2019 at 7:18 AM Mohamad Gebai  wrote:
> 
>> On 2/27/19 4:57 PM, Marc Roos wrote:
>>> They are 'thin provisioned' meaning if you create a 10GB rbd, it does
>>> not use 10GB at the start. (afaik)
>>
>> You can use 'rbd -p rbd du' to see how much of these devices is
>> provisioned and see if it's coherent.
>>
>> Mohamad
>>
>>>
>>>
>>> -Original Message-
>>> From: solarflow99 [mailto:solarflo...@gmail.com]
>>> Sent: 27 February 2019 22:55
>>> To: Ceph Users
>>> Subject: [ceph-users] rbd space usage
>>>
>>> using ceph df it looks as if RBD images can use the total free space
>>> available of the pool it belongs to, 8.54% yet I know they are created
>>> with a --size parameter and thats what determines the actual space.  I
>>> can't understand the difference i'm seeing, only 5T is being used but
>>> ceph df shows 51T:
>>>
>>>
>>> /dev/rbd0   8.0T  4.8T  3.3T  60% /mnt/nfsroot/rbd0
>>> /dev/rbd1   9.8T   34M  9.8T   1% /mnt/nfsroot/rbd1
>>>
>>>
>>>
>>> # ceph df
>>> GLOBAL:
>>> SIZE AVAIL RAW USED %RAW USED
>>> 180T  130T   51157G 27.75
>>> POOLS:
>>> NAMEID USED   %USED MAX AVAIL
>>> OBJECTS
>>> rbd 0  15745G  8.543G
>>> 4043495
>>> cephfs_data 1   0 03G
>>> 0
>>> cephfs_metadata 21962 03G
>>>20
>>> spider_stage 9   1595M 03G47835
>>> spider   10   955G  0.523G
>>> 42541237
>>>
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd space usage

2019-02-28 Thread solarflow99
thanks, I still can't understand what's taking up all the space (27.75% raw used)

On Thu, Feb 28, 2019 at 7:18 AM Mohamad Gebai  wrote:

> On 2/27/19 4:57 PM, Marc Roos wrote:
> > They are 'thin provisioned' meaning if you create a 10GB rbd, it does
> > not use 10GB at the start. (afaik)
>
> You can use 'rbd -p rbd du' to see how much of these devices is
> provisioned and see if it's coherent.
>
> Mohamad
>
> >
> >
> > -Original Message-
> > From: solarflow99 [mailto:solarflo...@gmail.com]
> > Sent: 27 February 2019 22:55
> > To: Ceph Users
> > Subject: [ceph-users] rbd space usage
> >
> > using ceph df it looks as if RBD images can use the total free space
> > available of the pool it belongs to, 8.54% yet I know they are created
> > with a --size parameter and thats what determines the actual space.  I
> > can't understand the difference i'm seeing, only 5T is being used but
> > ceph df shows 51T:
> >
> >
> > /dev/rbd0   8.0T  4.8T  3.3T  60% /mnt/nfsroot/rbd0
> > /dev/rbd1   9.8T   34M  9.8T   1% /mnt/nfsroot/rbd1
> >
> >
> >
> > # ceph df
> > GLOBAL:
> > SIZE AVAIL RAW USED %RAW USED
> > 180T  130T   51157G 27.75
> > POOLS:
> > NAMEID USED   %USED MAX AVAIL
> > OBJECTS
> > rbd 0  15745G  8.543G
> > 4043495
> > cephfs_data 1   0 03G
> > 0
> > cephfs_metadata 21962 03G
> >20
> > spider_stage 9   1595M 03G47835
> > spider   10   955G  0.523G
> > 42541237
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-28 Thread Igor Fedotov
Also I think it makes sense to create a ticket at this point. Any 
volunteers?


On 3/1/2019 1:00 AM, Igor Fedotov wrote:
Wondering if somebody would be able to apply a simple patch that 
periodically resets StupidAllocator?


Just to verify/disprove the hypothesis that it's allocator related.

On 2/28/2019 11:57 PM, Stefan Kooman wrote:

Quoting Wido den Hollander (w...@42on.com):

Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
OSDs as well. Over time their latency increased until we started to
notice I/O-wait inside VMs.

On a Luminous 12.2.8 cluster with only SSDs we also hit this issue I
guess. After restarting the OSD servers the latency would drop to normal
values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj

Reboots were finished at ~ 19:00.

Gr. Stefan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-28 Thread Igor Fedotov
Wondering if somebody would be able to apply a simple patch that 
periodically resets StupidAllocator?


Just to verify/disprove the hypothesis that it's allocator related.

On 2/28/2019 11:57 PM, Stefan Kooman wrote:

Quoting Wido den Hollander (w...@42on.com):
  

Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
OSDs as well. Over time their latency increased until we started to
notice I/O-wait inside VMs.

On a Luminous 12.2.8 cluster with only SSDs we also hit this issue I
guess. After restarting the OSD servers the latency would drop to normal
values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj

Reboots were finished at ~ 19:00.

Gr. Stefan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS_SLOW_METADATA_IO

2019-02-28 Thread Patrick Donnelly
On Thu, Feb 28, 2019 at 12:49 PM Stefan Kooman  wrote:
>
> Dear list,
>
> After upgrading to 12.2.11 the MDSes are reporting slow metadata IOs
> (MDS_SLOW_METADATA_IO). The metadata IOs would have been blocked for
> more than 5 seconds. We have one active, and one active standby MDS. All
> storage on SSD (Samsung PM863a / Intel DC4500). No other (OSD) slow ops
> reported. The MDSes are underutilized, only a handful of active clients
> and almost no load (fast hexacore CPU, 256 GB RAM, 20 Gb/s network). The
> cluster is also far from busy.
>
> I've dumped ops in flight on the MDSes but all ops that are printed are
> finished in a split second (duration: 0.000152), flag_point": "acquired
> locks".

I believe you're looking at the wrong "ops" dump. You want to check
"objecter_requests".

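That is, on the host running the active MDS, something like (a sketch;
substitute your MDS name):

ceph daemon mds.<name> objecter_requests
ceph daemon mds.<name> dump_historic_ops     # recently completed slow ops, with durations
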
-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-28 Thread Stefan Kooman
Quoting Wido den Hollander (w...@42on.com):
 
> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
> OSDs as well. Over time their latency increased until we started to
> notice I/O-wait inside VMs.

On a Luminous 12.2.8 cluster with only SSDs we also hit this issue I
guess. After restarting the OSD servers the latency would drop to normal
values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj

Reboots were finished at ~ 19:00.
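
A simple way to watch that drift building up, without waiting for VMs to
notice (a sketch):

ceph osd perf | sort -n -k2 | tail -20    # the 20 OSDs with the highest commit latency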

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS_SLOW_METADATA_IO

2019-02-28 Thread Stefan Kooman
Dear list,

After upgrading to 12.2.11 the MDSes are reporting slow metadata IOs
(MDS_SLOW_METADATA_IO). The metadata IOs would have been blocked for
more than 5 seconds. We have one active, and one active standby MDS. All
storage on SSD (Samsung PM863a / Intel DC4500). No other (OSD) slow ops
reported. The MDSes are underutilized, only a handful of active clients
and almost no load (fast hexacore CPU, 256 GB RAM, 20 Gb/s network). The
cluster is also far from busy.

I've dumped ops in flight on the MDSes but all ops that are printed are
finished in a split second (duration: 0.000152), flag_point": "acquired
locks".

I've googled for "MDS_SLOW_METADATA_IO" but no useful info whatsoever.
Are we the only ones getting these slow metadata IOs?

Any hints on how to proceed to debug this are welcome.

Thanks,

Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fuse-Ceph mount timeout

2019-02-28 Thread Doug Bell
I am having trouble where all of the clients attached to a Ceph cluster are
timing out when trying to perform a fuse mount of the cephfs volume.

# ceph-fuse -f -m 10.1.2.157,10.1.2.194,10.0.2.191 /v --keyring
/etc/ceph/ceph.client.admin.keyring --name client.admin -o debug
2019-02-21 20:13:46.707225 7fe1eb04e080 -1 init, newargv = 0x561593d14220 newargc=11
ceph-fuse[22559]: starting ceph client
ceph-fuse[22559]: ceph mount failed with (110) Connection timed out


I have confirmed that MDS was running on the host in question.

# ceph -s
  cluster:
id: c14e77f1-9898-48d8-8a52-cd1f1c5bf689
health: HEALTH_OK

  services:
mon: 3 daemons, quorum cm1,cm3,cm2
mgr: cm3(active), standbys: cm1
mds: cephfs-1/1/1 up  {0=cm3=up:active}, 1 up:standby-replay
osd: 7 osds: 6 up, 6 in

  data:
pools:   2 pools, 256 pgs
objects: 2365k objects, 1307 GB
usage:   3979 GB used, 1609 GB / 5589 GB avail
pgs: 256 active+clean

  io:
client:   1278 B/s rd, 2 op/s rd, 0 op/s wr

I have looked in all of the logs and cannot find any error messages that
would seem to indicate what might be happening.

Where should I look next?

-- 
Doug Bell
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] collectd problems with pools

2019-02-28 Thread Reed Dier
I've been collecting with collectd since Jewel, and experienced the growing 
pains when moving to Luminous and collectd-ceph needing to be reworked to 
support Luminous.

It is also worth mentioning that in Luminous+ there is an Influx plugin for 
ceph-mgr that has some per pool statistics.
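
Enabling it is roughly the following (a sketch; the config-key names follow the
Luminous docs, so double-check them against your release):

ceph mgr module enable influx
ceph config-key set mgr/influx/hostname <influx-host>
ceph config-key set mgr/influx/database ceph
ceph config-key set mgr/influx/username <user>
ceph config-key set mgr/influx/password <password>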

Reed

> On Feb 28, 2019, at 11:04 AM, Matthew Vernon  wrote:
> 
> Hi,
> 
> On 28/02/2019 17:00, Marc Roos wrote:
> 
>> Should you not be pasting that as an issue on github collectd-ceph? I
>> hope you don't mind me asking, I am also using collectd and dumping the
>> data to influx. Are you downsampling with influx? ( I am not :/ [0])
> 
> It might be "ask collectd-ceph authors nicely" is the answer, but I figured 
> I'd ask here first, since there might be a solution available already.
> 
> Also, given collectd-ceph works currently by asking the various daemons about 
> their perf data, there's not an obvious analogue for pool-related metrics, 
> since there isn't a daemon socket to poke in the same manner.
> 
> We use graphite/carbon as our data store, so no, nothing influx-related 
> (we're trying to get rid of our last few uses of influxdb here).
> 
> Regards,
> 
> Matthew
> 
> 
> 
> -- 
> The Wellcome Sanger Institute is operated by Genome Research Limited, a 
> charity registered in England with number 1021457 and a company registered in 
> England with number 2742969, whose registered office is 215 Euston Road, 
> London, NW1 2BE. ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw sync falling behind regularly

2019-02-28 Thread Christian Rice
Yeah my bad on the typo, not running 12.8.8 ☺  It’s 12.2.8.  We can upgrade and 
will attempt to do so asap.  Thanks for that, I need to read my release notes 
more carefully, I guess!
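
In the meantime, when a shard sticks like this it can be worth capturing the
following before restarting the gateway (a sketch; the source zone name is
taken from the status output quoted below):

radosgw-admin sync error list
radosgw-admin metadata sync status
radosgw-admin data sync status --source-zone=sv5-corp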

From: Matthew H 
Date: Wednesday, February 27, 2019 at 8:33 PM
To: Christian Rice , ceph-users 
Subject: Re: radosgw sync falling behind regularly

Hey Christian,

I'm making a wild guess, but assuming this is 12.2.8: if so, is it possible 
for you to upgrade to 12.2.11? There have been rgw multisite bug fixes for 
metadata syncing and data syncing (both separate issues) that you could be 
hitting.

Thanks,

From: ceph-users  on behalf of Christian 
Rice 
Sent: Wednesday, February 27, 2019 7:05 PM
To: ceph-users
Subject: [ceph-users] radosgw sync falling behind regularly


Debian 9; ceph 12.8.8-bpo90+1; no rbd or cephfs, just radosgw; three clusters 
in one zonegroup.



Often we find either metadata or data sync behind, and it doesn’t look to ever 
recover until…we restart the endpoint radosgw target service.



eg at 15:45:40:



dc11-ceph-rgw1:/var/log/ceph# radosgw-admin sync status
          realm b3e2afe7-2254-494a-9a34-ce50358779fd (savagebucket)
      zonegroup de6af748-1a2f-44a1-9d44-30799cf1313e (us)
           zone 107d29a0-b732-4bf1-a26e-1f64f820e839 (dc11-prod)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is behind on 2 shards
                behind shards: [19,41]
                oldest incremental change not applied: 2019-02-27 14:42:24.0.408263s
      data sync source: 1e27bf9c-3a2f-4845-85b6-33a24bbe1c04 (sv5-corp)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
                source: 331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8 (sv3-prod)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source





so at 15:46:07:



dc11-ceph-rgw1:/var/log/ceph# sudo systemctl restart ceph-radosgw@rgw.dc11-ceph-rgw1.service



and by the time I checked at 15:48:08:



dc11-ceph-rgw1:/var/log/ceph# radosgw-admin sync status
          realm b3e2afe7-2254-494a-9a34-ce50358779fd (savagebucket)
      zonegroup de6af748-1a2f-44a1-9d44-30799cf1313e (us)
           zone 107d29a0-b732-4bf1-a26e-1f64f820e839 (dc11-prod)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 1e27bf9c-3a2f-4845-85b6-33a24bbe1c04 (sv5-corp)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
                source: 331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8 (sv3-prod)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source





There’s no way this is “lag.”  It’s stuck, and happens frequently, though 
perhaps not daily.  Any suggestions?  Our cluster isn’t heavily used yet, but 
it’s production.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] collectd problems with pools

2019-02-28 Thread Matthew Vernon

Hi,

On 28/02/2019 17:00, Marc Roos wrote:


Should you not be pasting that as an issue on github collectd-ceph? I
hope you don't mind me asking, I am also using collectd and dumping the
data to influx. Are you downsampling with influx? ( I am not :/ [0])


It might be "ask collectd-ceph authors nicely" is the answer, but I 
figured I'd ask here first, since there might be a solution available 
already.


Also, given collectd-ceph works currently by asking the various daemons 
about their perf data, there's not an obvious analogue for pool-related 
metrics, since there isn't a daemon socket to poke in the same manner.


We use graphite/carbon as our data store, so no, nothing influx-related 
(we're trying to get rid of our last few uses of influxdb here).


Regards,

Matthew



--
The Wellcome Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD poor performance

2019-02-28 Thread Maged Mokhtar

Hi Mark,

The 38K iops for single OSD is quite good. For the 4 OSDs, I think the 
55K iops may start to be impacted by network latency on the server node.


It will be interesting to know, when using something more common like a 3x 
replica pool, what additional amplification factor we see over the replica count.


Maged


On 28/02/2019 01:22, Mark Nelson wrote:
FWIW, I've got recent tests of a fairly recent master build 
(14.0.1-3118-gd239c2a) showing a single OSD hitting ~33-38K 4k 
randwrite IOPS with 3 client nodes running fio (io_depth = 32) both 
with RBD and with CephFS.  The OSD node had older gen CPUs (Xeon 
E5-2650 v3) and NVMe drives (Intel P3700).  The OSD process and 
threads were pinned to run on the first socket.  It took between 5-7 
cores to pull off that throughput though.



Jumping up to 4 OSDs in the node (no replication) improved aggregate 
throughput to ~54-55K IOPS with ~15 cores used, so 13-14K IOPS per OSD 
with around 3.5-4 cores each on average.  IE with more OSDs running on 
the same socket competing for cores, the throughput per OSD went down 
and the IOPS/core rate went down too.  With NVMe, you are likely best 
off when multiple OSD processes aren't competing with each other for 
cores and can mostly just run on a specific set of cores without 
contention. I'd expect that numa pinning each OSD process to specific 
cores with enough cores to satisfy the OSD might help.  (Nick Fisk 
also showed a while back that forcing the CPU to not drop into 
low-power C/P states can help dramatically as well).



Mark


On 2/27/19 4:30 PM, Vitaliy Filippov wrote:
By "maximum write iops of an osd" I mean total iops divided by the 
number of OSDs. For example, an expensive setup from Micron 
(https://www.micron.com/about/blog/2018/april/micron-9200-max-red-hat-ceph-storage-30-reference-architecture-block-performance) 
has got only 8750 peak write iops per NVMe. These exact NVMes they 
used are rated for 260k+ iops when connected directly :). CPU is a 
real bottleneck. The need for a Seastar-based rewrite is not a joke! :)


Total iops is the number coming from a test like:

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -pool=<pool> -runtime=60 -rbdname=testimg


...or from several such jobs run in parallel each over a separate RBD 
image.


This is a "random write bandwidth" test, and, in fact, it's not the 
most useful one - the single-thread latency usually does matter more 
than just total bandwidth. To test for it, run:


fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -pool=<pool> -runtime=60 -rbdname=testimg


You'll get a pretty low number (< 100 for HDD clusters, 500-1000 for 
SSD clusters). It's as expected that it's low. Everything above 1000 
iops (< 1ms latency, single-thread iops = 1 / avg latency) is hard to 
achieve with Ceph no matter what disks you're using. Also 
single-thread latency does not depend on the number of OSDs in the 
cluster, because the workload is not parallel.
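
To make the arithmetic concrete, a tiny sketch with hypothetical numbers (not 
measurements from any real cluster):

# single-thread iops is just the inverse of the average latency
avg_latency_ms = 0.8                 # hypothetical 4k sync write latency
print(1000.0 / avg_latency_ms)       # -> 1250.0 single-thread iops

# "iops per OSD" in the sense used above: total client iops / number of OSDs
total_client_iops = 100000.0         # hypothetical aggregate randwrite iops
num_osds = 20
print(total_client_iops / num_osds)  # -> 5000.0 iops per OSD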


However you can also test iops of single OSDs by creating a pool with 
size=1 and using a custom benchmark tool we've made with our 
colleagues from a russian Ceph chat... we can publish it here a short 
time later if you want :).



At some point I would expect the cpu to be the bottleneck. They have
always been saying this here for better latency get fast cpu's.
Would be nice to know what GHz you are testing, and how that scales. Rep 1-3,
erasure probably also takes a hit.
How do you test maximum iops of the osd? (Just curious, so I can test
mine)

A while ago I posted here a cephfs test on ssd rep 1 that was performing
nowhere near native, asking if this was normal, but never got a response to
it. I can remember that they sent everyone a questionnaire and asked if they
should focus on performance more; now I wish I had checked that box ;)



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Maged Mokhtar
CEO PetaSAN
4 Emad El Deen Kamel
Cairo 11371, Egypt
www.petasan.org
+201006979931
skype: maged.mokhtar

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] collectd problems with pools

2019-02-28 Thread Marc Roos
 

Should you not be pasting that as an issue on github collectd-ceph? I 
hope you don't mind me asking, I am also using collectd and dumping the 
data to influx. Are you downsampling with influx? ( I am not :/ [0])


[0] 
https://community.influxdata.com/t/how-does-grouping-work-does-it-work/7936



-Original Message-
From: Matthew Vernon [mailto:m...@sanger.ac.uk] 
Sent: 28 February 2019 17:11
To: ceph-users
Subject: [ceph-users] collectd problems with pools

Hi,

We monitor our Ceph clusters (production is Jewel, test clusters are on
Luminous) with collectd and its official ceph plugin.

The one thing that's missing is per-pool outputs - the collectd plugin 
just talks to the individual daemons, none of which have pool details in
- those are available via

ceph osd pool stats -f json

...which I could wrap to emit collectd metrics, but surely this is an 
already-invented wheel?

Regards,

Matthew


--
 The Wellcome Sanger Institute is operated by Genome Research  Limited, 
a charity registered in England with number 1021457 and a  company 
registered in England with number 2742969, whose registered  office is 
215 Euston Road, London, NW1 2BE. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] collectd problems with pools

2019-02-28 Thread Matthew Vernon

Hi,

We monitor our Ceph clusters (production is Jewel, test clusters are on 
Luminous) with collectd and its official ceph plugin.


The one thing that's missing is per-pool outputs - the collectd plugin 
just talks to the individual daemons, none of which have pool details in 
- those are available via


ceph osd pool stats -f json

...which I could wrap to emit collectd metrics, but surely this is an 
already-invented wheel?
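
In case it isn't, this is roughly the sort of wrapper I have in mind - a
minimal sketch for collectd's exec plugin, with made-up metric names and the
assumption that a working ceph CLI and keyring are available on the host:

#!/usr/bin/env python
# Minimal sketch of a collectd exec-plugin wrapper around "ceph osd pool stats".
# Assumptions: a working "ceph" CLI + keyring on the host, and the luminous-era
# JSON layout (a list of {"pool_name": ..., "client_io_rate": {...}} objects).
# The plugin/type naming below is made up; adjust to your schema.
import json
import os
import socket
import subprocess
import sys
import time

HOSTNAME = os.environ.get("COLLECTD_HOSTNAME", socket.getfqdn())
INTERVAL = int(float(os.environ.get("COLLECTD_INTERVAL", 60)))

while True:
    stats = json.loads(subprocess.check_output(
        ["ceph", "osd", "pool", "stats", "-f", "json"]))
    for pool in stats:
        rates = pool.get("client_io_rate", {})
        for key in ("read_bytes_sec", "write_bytes_sec",
                    "read_op_per_sec", "write_op_per_sec"):
            # keys are omitted from the JSON when the rate is zero
            print("PUTVAL %s/ceph_pool-%s/gauge-%s interval=%d N:%s"
                  % (HOSTNAME, pool["pool_name"], key, INTERVAL, rates.get(key, 0)))
    sys.stdout.flush()
    time.sleep(INTERVAL)

Run from an Exec line in collectd's exec plugin (as a user that can read the
client keyring) it would emit one gauge per pool and counter, which our
graphite/carbon setup could then store as usual.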


Regards,

Matthew


--
The Wellcome Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [EXTERNAL] Re: Multi-Site Cluster RGW Sync issues

2019-02-28 Thread Benjamin . Zieglmeier
The output has 57000 lines (and growing). I’ve uploaded the output to:

https://gist.github.com/zieg8301/7e6952e9964c1e0964fb63f61e7b7be7

Thanks,
Ben

From: Matthew H 
Date: Wednesday, February 27, 2019 at 11:02 PM
To: "Benjamin. Zieglmeier" 
Cc: "ceph-users@lists.ceph.com" 
Subject: [EXTERNAL] Re: Multi-Site Cluster RGW Sync issues

Hey Ben,

Could you include the following?


radosgw-admin mdlog list

Thanks,


From: ceph-users  on behalf of 
Benjamin.Zieglmeier 
Sent: Tuesday, February 26, 2019 9:33 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Multi-Site Cluster RGW Sync issues


Hello,



We have a two zone multisite configured Luminous 12.2.5 cluster. Cluster has 
been running for about 1 year, and has only ~140G of data (~350k objects). We 
recently added a third zone to the zonegroup to facilitate a migration out of 
an existing site. Sync appears to be working, and running `radosgw-admin sync 
status` and `radosgw-admin sync status --rgw-zone=<zone>` reflects the same. 
The problem we are having is that once the data replication completes, one of 
the rgws serving the new zone has the radosgw process consuming all the CPU, 
and the rgw log is flooded with “ERROR: failed to read mdlog info with (2) No 
such file or directory”, at a rate of 1000 log entries/sec.



This has been happening for days on end now, and we are concerned about what is 
going on between these two zones. Logs are constantly filling up on the rgws 
and we are out of ideas. Are they trying to catch up on metadata? After 
extensive searching and racking our brains, we are unable to figure out what is 
causing all these requests (and errors) between the two zones.



Thanks,

Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd space usage

2019-02-28 Thread Mohamad Gebai
On 2/27/19 4:57 PM, Marc Roos wrote:
> They are 'thin provisioned' meaning if you create a 10GB rbd, it does 
> not use 10GB at the start. (afaik)

You can use 'rbd -p rbd du' to see how much of these devices is
provisioned and see if it's coherent.

Mohamad

>
>
> -Original Message-
> From: solarflow99 [mailto:solarflo...@gmail.com] 
> Sent: 27 February 2019 22:55
> To: Ceph Users
> Subject: [ceph-users] rbd space usage
>
> using ceph df it looks as if RBD images can use the total free space 
> available of the pool they belong to (8.54%), yet I know they are created 
> with a --size parameter and that's what determines the actual space.  I 
> can't understand the difference I'm seeing: only 5T is being used but 
> ceph df shows 51T:
>
>
> /dev/rbd0   8.0T  4.8T  3.3T  60% /mnt/nfsroot/rbd0
> /dev/rbd1   9.8T   34M  9.8T   1% /mnt/nfsroot/rbd1
>
>
>
> # ceph df
> GLOBAL:
> SIZE AVAIL RAW USED %RAW USED
> 180T  130T   51157G 27.75
> POOLS:
> NAMEID USED   %USED MAX AVAIL 
> OBJECTS
> rbd 0  15745G  8.543G  
> 4043495
> cephfs_data 1   0 03G
> 0
> cephfs_metadata 21962 03G
>20
> spider_stage 9   1595M 03G47835
> spider   10   955G  0.523G 
> 42541237
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Uwe Sauter
olcDbShmKey only applies to BDB and HDB backends but I'm using the new MDB 
backend.


Am 28.02.19 um 14:47 schrieb Marc Roos:
> If you have every second disk io with your current settings, which I 
> also had with 'default' settings. There are some optimizations you can 
> do, bringing it down to every 50 seconds or so. Adding the olcDbShmKey 
> will allow for slapd to access the db cache. 
> I am getting an error of sharedmemory settings when rebooting (centos7), 
> but maintainers of slapd said that I can ignore that. Dont have any 
> problems since using this also.
> 
> 
> 
> -Original Message-
> From: Uwe Sauter [mailto:uwe.sauter...@gmail.com] 
> Sent: 28 February 2019 14:34
> To: Marc Roos; ceph-users; vitalif
> Subject: Re: [ceph-users] Fwd: Re: Blocked ops after change from 
> filestore on HDD to bluestore on SDD
> 
> Do you have anything particular in mind? I'm using mdb backend with 
> maxsize = 1GB but currently the files are only about 23MB.
> 
> 
>>
>> I am having quite a few openldap servers (slaves) running also, make 
>> sure to use proper caching that saves a lot of disk io.
>>
>>
>>
>>
>> -Original Message-
>> Sent: 28 February 2019 13:56
>> To: uwe.sauter...@gmail.com; Uwe Sauter; Ceph Users
>> Subject: *SPAM* Re: [ceph-users] Fwd: Re: Blocked ops after 
>> change from filestore on HDD to bluestore on SDD
>>
>> "Advanced power loss protection" is in fact a performance feature, not 
> 
>> a safety one.
>>
>>
>> On 28 February 2019 13:03:51 GMT+03:00, Uwe Sauter wrote:
>>
>>  Hi all,
>>  
>>  thanks for your insights.
>>  
>>  Eneko,
>>  
>>
>>  We tried to use a Samsung 840 Pro SSD as OSD some time ago 
> and it 
>> was a no-go; it wasn't that performance was bad, it
>>  just didn't work for the kind of use of OSD. Any HDD was 
> better than 
>> it (the disk was healthy and have been used in a
>>  software raid-1 for a pair of years).
>>  
>>  I suggest you check first that your Samsung 860 Pro disks 
> work well 
>> for Ceph. Also, how is your host's RAM?
>>
>>
>>  As already mentioned the hosts each have 64GB RAM. Each host has 
> 3 
>> SSDs for OSD usage. Each OSD is using about 1.3GB virtual
>>  memory / 400MB residual memory.
>>  
>>  
>>  
>>  Joachim,
>>  
>>
>>  I can only recommend the use of enterprise SSDs. We've 
> tested many 
>> consumer SSDs in the past, including your SSDs. Many
>>  of them are not suitable for long-term use and some weard 
> out within 
>> 6 months.
>>
>>
>>  Unfortunately I couldn't afford enterprise grade SSDs. But I 
> suspect 
>> that my workload (about 20 VMs for our infrastructure, the
>>  most IO demanding is probably LDAP) is light enough that wearout 
>> won't be a problem.
>>  
>>  The issue I'm seeing then is probably related to direct IO if 
> using 
>> bluestore. But with filestore, the file system cache probably
>>  hides the latency issues.
>>  
>>  
>>  Igor,
>>  
>>
>>  AFAIR Samsung 860 Pro isn't for enterprise market, you 
> shouldn't use 
>> consumer SSDs for Ceph.
>>  
>>  I had some experience with Samsung 960 Pro a while ago and 
> it turned 
>> out that it handled fsync-ed writes very slowly
>>  (comparing to the original/advertised performance). Which 
> one can 
>> probably explain by the lack of power loss protection
>>  for these drives. I suppose it's the same in your case.
>>  
>>  Here are a couple links on the topic:
>>  
>>  
>> https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devi
>> ces/
>>  
>>  
>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-
>> ssd-is-suitable-as-a-journal-device/
>>
>>
>>  Power loss protection wasn't a criteria for me as the cluster 
> hosts 
>> are distributed in two buildings with separate battery backed
>>  UPSs. As mentioned above I suspect the main difference for my 
> case 
>> between filestore and bluestore is file system cache vs. direct
>>  IO. Which means I will keep using filestore.
>>  
>>  Regards,
>>  
>>  Uwe
>> 
>>
>>  ceph-users mailing list
>>  ceph-users@lists.ceph.com
>>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> --
>> With best regards,
>> Vitaliy Filippov
>>
>>
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs recursive stats | rctime in the future

2019-02-28 Thread Yan, Zheng
On Thu, Feb 28, 2019 at 5:33 PM David C  wrote:
>
> On Wed, Feb 27, 2019 at 11:35 AM Hector Martin  wrote:
>>
>> On 27/02/2019 19:22, David C wrote:
>> > Hi All
>> >
>> > I'm seeing quite a few directories in my filesystem with rctime years in
>> > the future. E.g
>> >
>> > ]# getfattr -d -m ceph.dir.* /path/to/dir
>> > getfattr: Removing leading '/' from absolute path names
>> > # file:  path/to/dir
>> > ceph.dir.entries="357"
>> > ceph.dir.files="1"
>> > ceph.dir.rbytes="35606883904011"
>> > ceph.dir.rctime="1851480065.090"
>> > ceph.dir.rentries="12216551"
>> > ceph.dir.rfiles="10540827"
>> > ceph.dir.rsubdirs="1675724"
>> > ceph.dir.subdirs="356"
>> >
>> > That's showing a last modified time of 2 Sept 2028, the day and month
>> > are also wrong.
>>
>> Obvious question: are you sure the date/time on your cluster nodes and
>> your clients is correct? Can you track down which files (if any) have
>> the ctime in the future by following the rctime down the filesystem tree?
>
>
> Times are all correct on the nodes and CephFS clients however the fs is being 
> exported over NFS. It's possible some NFS clients have the wrong time 
> although I'm reasonably confident they are all correct as the machines are 
> synced to local time servers and they use AD for auth, things wouldn't work 
> if the time was that wildly out of sync.
>
> Good idea on checking down the tree. I've found the offending files but can't 
> find any explanation as to why they have a modified date so far in the future.
>
> For example one dir is "/.config/caja/" in a users home dir. The files in 
> this dir are all wildly different, the modified times are 1984, 1997, 2028...
>

The MDS takes ctime/mtime from client requests. It's likely that the client
node that operated on this dir had an incorrect date/time.
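
If it helps to narrow that down, a small sketch (Python 3 on a client mount;
the starting path is a placeholder) that walks a subtree and prints any
directory whose ceph.dir.rctime lies in the future:

import os
import time

# Walk a CephFS subtree and flag directories whose recursive ctime is in the
# future. The starting path is a placeholder; ceph.dir.rctime comes back as
# "<seconds>.<nanoseconds>", so the integer part is enough for the comparison.
ROOT = "/path/to/dir"
NOW = time.time()

for dirpath, _dirnames, _filenames in os.walk(ROOT):
    try:
        raw = os.getxattr(dirpath, "ceph.dir.rctime").decode()
    except OSError:
        continue
    secs = int(raw.split(".", 1)[0])
    if secs > NOW:
        print("%s rctime=%s (%s)" % (dirpath, raw, time.ctime(secs)))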


> It certainly feels like a MDS issue to me. I've used the recursive stats 
> since Jewel and I've never seen this before.
>
> Any ideas?
>
>
>>
>> --
>> Hector Martin (hec...@marcansoft.com)
>> Public Key: https://mrcn.st/pub
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore lvm wal and db in ssd disk with ceph-ansible

2019-02-28 Thread Andres Rojas Guerrero
Hi all, I have another newbie question: we are trying to deploy a Mimic ceph
cluster with bluestore, with the WAL and DB data on SSD disks.

For this we are using the ceph-ansible approach. We have seen that
ceph-ansible has a playbook to create the LVM structure (lv-create.yml), but
it seems to only work for the filestore case. Does anyone know if there is
anything similar for the bluestore case?


Thank you in advance and regards.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] How does ceph use the STS service?

2019-02-28 Thread Sage Weil
On Thu, 28 Feb 2019, Matthew H wrote:
> This feature is in the Nautilus release.
> 
> The first release (14.1.0) of Nautilus is available from 
> download.ceph.com as of last Friday.

Please keep in mind this is a release candidate.  The first official 
stable nautilus release will be 14.2.0 in a week or two.

sage

> 
> 
> From: ceph-users  on behalf of admin 
> 
> Sent: Thursday, February 28, 2019 4:22 AM
> To: Pritha Srivastava; Sage Weil; ceph-us...@ceph.com
> Subject: Re: [ceph-users] [Ceph-community] How does ceph use the STS service?
> 
> Hi, can you tell me the version that includes STS lite?
> Thanks,
> myxingkong
> 
> 
> From: Pritha Srivastava
> Sent: 2019-02-27 23:53:58
> To: Sage Weil
> Cc: admin; ceph-us...@ceph.com
> Subject: Re: [ceph-users] [Ceph-community] How does ceph use the STS service?
> Sorry I overlooked the ceph versions in the email.
> 
> STS Lite is not a part of ceph version 12.2.11 or ceph version 13.2.2.
> 
> Thanks,
> Pritha
> 
> On Wed, Feb 27, 2019 at 9:09 PM Pritha Srivastava <prsri...@redhat.com> wrote:
> You need to attach a policy to be able to invoke GetSessionToken. Please read 
> the documentation below at:
> 
> https://github.com/ceph/ceph/pull/24818/commits/512b6d8bd951239d44685b25dccaf904f19872b2
> 
> Thanks,
> Pritha
> 
> On Wed, Feb 27, 2019 at 8:59 PM Sage Weil <s...@newdream.net> wrote:
> Moving this to ceph-users.
> 
> On Wed, 27 Feb 2019, admin wrote:
> 
> > I want to use the STS service to generate temporary credentials for use by 
> > third-party clients.
> >
> > I configured STS lite based on the documentation.
> > http://docs.ceph.com/docs/master/radosgw/STSLite/
> >
> > This is my configuration file:
> >
> > [global]
> > fsid = 42a7cae1-84d1-423e-93f4-04b0736c14aa
> > mon_initial_members = admin, node1, node2, node3
> > mon_host = 192.168.199.81,192.168.199.82,192.168.199.83,192.168.199.84
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> >
> > osd pool default size = 2
> >
> > [client.rgw.admin]
> > rgw sts key = "1234567890"
> > rgw s3 auth use sts = true
> >
> > When I execute the getSessionToken method, return a 405 error:
> >
> > 
> > MethodNotAllowed
> > tx3-005c73aed8-5e48-default
> > 5e48-default-default
> > 
> >
> > This is my test code:
> >
> > import os
> > import sys
> > import traceback
> >
> > import boto3
> > from boto.s3.connection import S3Connection
> > from boto.sts import STSConnection
> >
> > try:
> > host = 'http://192.168.199.81:7480'
> > access_key = '2324YFZ7QDEOSRL18QHR'
> > secret_key = 'rL9FabxCOw5LDbrHtmykiGSCjzpKLmEs9WPiNjVJ'
> >
> > client = boto3.client('sts',
> >   aws_access_key_id = access_key,
> >   aws_secret_access_key = secret_key,
> >   endpoint_url = host)
> > response = client.get_session_token(DurationSeconds=999)
> > print response
> > except:
> > print traceback.format_exc()
> >
> > Who can tell me if my configuration is incorrect or if the version I tested 
> > does not provide STS service?
> >
> > This is the version I tested:
> >
> > ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous 
> > (stable)
> >
> > ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic 
> > (stable)___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Uwe Sauter
I already sent my configuration to the list about 3,5h ago but here it is again:


[global]
  auth client required = cephx
  auth cluster required = cephx
  auth service required = cephx
  cluster network = 169.254.42.0/24
  fsid = 753c9bbd-74bd-4fea-8c1e-88da775c5ad4
  keyring = /etc/pve/priv/$cluster.$name.keyring
  public network = 169.254.42.0/24

[mon]
  mon allow pool delete = true
  mon data avail crit = 5
  mon data avail warn = 15

[osd]
  keyring = /var/lib/ceph/osd/ceph-$id/keyring
  osd journal size = 5120
  osd pool default min size = 2
  osd pool default size = 3
  osd max backfills = 6
  osd recovery max active = 12

[mon.px-golf-cluster]
  host = px-golf-cluster
  mon addr = 169.254.42.54:6789

[mon.px-hotel-cluster]
  host = px-hotel-cluster
  mon addr = 169.254.42.55:6789

[mon.px-india-cluster]
  host = px-india-cluster
  mon addr = 169.254.42.56:6789



Am 28.02.19 um 14:44 schrieb Matthew H:
> Could you send your ceph.conf file over please? Are you setting any tunables 
> for OSD or Bluestore currently?
> 
> --
> *From:* ceph-users  on behalf of Uwe 
> Sauter 
> *Sent:* Thursday, February 28, 2019 8:33 AM
> *To:* Marc Roos; ceph-users; vitalif
> *Subject:* Re: [ceph-users] Fwd: Re: Blocked ops after change from filestore 
> on HDD to bluestore on SDD
>  
> Do you have anything particular in mind? I'm using mdb backend with maxsize = 
> 1GB but currently the files are only about 23MB.
> 
> 
>> 
>> I am having quite a few openldap servers (slaves) running also, make 
>> sure to use proper caching that saves a lot of disk io.  
>> 
>> 
>> 
>> 
>> -Original Message-
>> Sent: 28 February 2019 13:56
>> To: uwe.sauter...@gmail.com; Uwe Sauter; Ceph Users
>> Subject: *SPAM* Re: [ceph-users] Fwd: Re: Blocked ops after 
>> change from filestore on HDD to bluestore on SDD
>> 
>> "Advanced power loss protection" is in fact a performance feature, not a 
>> safety one.
>> 
>> 
>> On 28 February 2019 13:03:51 GMT+03:00, Uwe Sauter wrote:
>> 
>>    Hi all,
>>    
>>    thanks for your insights.
>>    
>>    Eneko,
>>    
>> 
>>    We tried to use a Samsung 840 Pro SSD as OSD some time ago 
>>and 
>> it was a no-go; it wasn't that performance was bad, it 
>>    just didn't work for the kind of use of OSD. Any HDD was 
>> better than it (the disk was healthy and have been used in a 
>>    software raid-1 for a pair of years).
>>    
>>    I suggest you check first that your Samsung 860 Pro disks 
>>work 
>> well for Ceph. Also, how is your host's RAM?
>> 
>> 
>>    As already mentioned the hosts each have 64GB RAM. Each host has 3 
>> SSDs for OSD usage. Each OSD is using about 1.3GB virtual
>>    memory / 400MB residual memory.
>>    
>>    
>>    
>>    Joachim,
>>    
>> 
>>    I can only recommend the use of enterprise SSDs. We've tested 
>> many consumer SSDs in the past, including your SSDs. Many 
>>    of them are not suitable for long-term use and some weard out 
>> within 6 months.
>> 
>> 
>>    Unfortunately I couldn't afford enterprise grade SSDs. But I 
>> suspect that my workload (about 20 VMs for our infrastructure, the
>>    most IO demanding is probably LDAP) is light enough that wearout 
>> won't be a problem.
>>    
>>    The issue I'm seeing then is probably related to direct IO if using 
>> bluestore. But with filestore, the file system cache probably
>>    hides the latency issues.
>>    
>>    
>>    Igor,
>>    
>> 
>>    AFAIR Samsung 860 Pro isn't for enterprise market, you 
>> shouldn't use consumer SSDs for Ceph.
>>    
>>    I had some experience with Samsung 960 Pro a while ago and it 
>> turned out that it handled fsync-ed writes very slowly 
>>    (comparing to the original/advertised performance). Which one 
>> can probably explain by the lack of power loss protection 
>>    for these drives. I suppose it's the same in your case.
>>    
>>    Here are a couple links on the topic:
>>    
>>    
>> https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/
>>    
>>    
>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>> 
>> 
>>    Power loss protection wasn't a criteria for me as the cluster hosts 
>> are distributed in two buildings with separate battery backed
>>    UPSs. As mentioned above I suspect the main difference for my case 
>> between filestore and bluestore is file system cache vs. direct
>>    IO. Which means I will keep using filestore.
>>    
>>    Regards,
>>    
>>    Uwe
>> ___

Re: [ceph-users] Fwd: Re: Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Marc Roos
If you are seeing disk io every second with your current settings (which I 
also had with 'default' settings), there are some optimizations you can do, 
bringing it down to once every 50 seconds or so. Adding the olcDbShmKey 
will allow slapd to access the db cache. 
I am getting an error about shared memory settings when rebooting (CentOS 7), 
but the maintainers of slapd said that I can ignore that. I have not had any 
problems since using this either.



-Original Message-
From: Uwe Sauter [mailto:uwe.sauter...@gmail.com] 
Sent: 28 February 2019 14:34
To: Marc Roos; ceph-users; vitalif
Subject: Re: [ceph-users] Fwd: Re: Blocked ops after change from 
filestore on HDD to bluestore on SDD

Do you have anything particular in mind? I'm using mdb backend with 
maxsize = 1GB but currently the files are only about 23MB.


> 
> I am having quite a few openldap servers (slaves) running also, make 
> sure to use proper caching that saves a lot of disk io.
> 
> 
> 
> 
> -Original Message-
> Sent: 28 February 2019 13:56
> To: uwe.sauter...@gmail.com; Uwe Sauter; Ceph Users
> Subject: *SPAM* Re: [ceph-users] Fwd: Re: Blocked ops after 
> change from filestore on HDD to bluestore on SDD
> 
> "Advanced power loss protection" is in fact a performance feature, not 

> a safety one.
> 
> 
> On 28 February 2019 13:03:51 GMT+03:00, Uwe Sauter wrote:
> 
>   Hi all,
>   
>   thanks for your insights.
>   
>   Eneko,
>   
> 
>   We tried to use a Samsung 840 Pro SSD as OSD some time ago 
and it 
> was a no-go; it wasn't that performance was bad, it
>   just didn't work for the kind of use of OSD. Any HDD was 
better than 
> it (the disk was healthy and have been used in a
>   software raid-1 for a pair of years).
>   
>   I suggest you check first that your Samsung 860 Pro disks 
work well 
> for Ceph. Also, how is your host's RAM?
> 
> 
>   As already mentioned the hosts each have 64GB RAM. Each host has 
3 
> SSDs for OSD usage. Each OSD is using about 1.3GB virtual
>   memory / 400MB residual memory.
>   
>   
>   
>   Joachim,
>   
> 
>   I can only recommend the use of enterprise SSDs. We've 
tested many 
> consumer SSDs in the past, including your SSDs. Many
>   of them are not suitable for long-term use and some weard 
out within 
> 6 months.
> 
> 
>   Unfortunately I couldn't afford enterprise grade SSDs. But I 
suspect 
> that my workload (about 20 VMs for our infrastructure, the
>   most IO demanding is probably LDAP) is light enough that wearout 
> won't be a problem.
>   
>   The issue I'm seeing then is probably related to direct IO if 
using 
> bluestore. But with filestore, the file system cache probably
>   hides the latency issues.
>   
>   
>   Igor,
>   
> 
>   AFAIR Samsung 860 Pro isn't for enterprise market, you 
shouldn't use 
> consumer SSDs for Ceph.
>   
>   I had some experience with Samsung 960 Pro a while ago and 
it turned 
> out that it handled fsync-ed writes very slowly
>   (comparing to the original/advertised performance). Which 
one can 
> probably explain by the lack of power loss protection
>   for these drives. I suppose it's the same in your case.
>   
>   Here are a couple links on the topic:
>   
>   
> https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devi
> ces/
>   
>   
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-
> ssd-is-suitable-as-a-journal-device/
> 
> 
>   Power loss protection wasn't a criteria for me as the cluster 
hosts 
> are distributed in two buildings with separate battery backed
>   UPSs. As mentioned above I suspect the main difference for my 
case 
> between filestore and bluestore is file system cache vs. direct
>   IO. Which means I will keep using filestore.
>   
>   Regards,
>   
>   Uwe
> 
> 
>   ceph-users mailing list
>   ceph-users@lists.ceph.com
>   http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> --
> With best regards,
> Vitaliy Filippov
> 
> 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Matthew H
Could you send your ceph.conf file over please? Are you setting any tunables 
for OSD or Bluestore currently?


From: ceph-users  on behalf of Uwe Sauter 

Sent: Thursday, February 28, 2019 8:33 AM
To: Marc Roos; ceph-users; vitalif
Subject: Re: [ceph-users] Fwd: Re: Blocked ops after change from filestore on 
HDD to bluestore on SDD

Do you have anything particular in mind? I'm using mdb backend with maxsize = 
1GB but currently the files are only about 23MB.


>
> I am having quite a few openldap servers (slaves) running also, make
> sure to use proper caching that saves a lot of disk io.
>
>
>
>
> -Original Message-
> Sent: 28 February 2019 13:56
> To: uwe.sauter...@gmail.com; Uwe Sauter; Ceph Users
> Subject: *SPAM* Re: [ceph-users] Fwd: Re: Blocked ops after
> change from filestore on HDD to bluestore on SDD
>
> "Advanced power loss protection" is in fact a performance feature, not a
> safety one.
>
>
> On 28 February 2019 13:03:51 GMT+03:00, Uwe Sauter wrote:
>
>Hi all,
>
>thanks for your insights.
>
>Eneko,
>
>
>We tried to use a Samsung 840 Pro SSD as OSD some time ago and
> it was a no-go; it wasn't that performance was bad, it
>just didn't work for the kind of use of OSD. Any HDD was
> better than it (the disk was healthy and have been used in a
>software raid-1 for a pair of years).
>
>I suggest you check first that your Samsung 860 Pro disks work
> well for Ceph. Also, how is your host's RAM?
>
>
>As already mentioned the hosts each have 64GB RAM. Each host has 3
> SSDs for OSD usage. Each OSD is using about 1.3GB virtual
>memory / 400MB residual memory.
>
>
>
>Joachim,
>
>
>I can only recommend the use of enterprise SSDs. We've tested
> many consumer SSDs in the past, including your SSDs. Many
>of them are not suitable for long-term use and some weard out
> within 6 months.
>
>
>Unfortunately I couldn't afford enterprise grade SSDs. But I
> suspect that my workload (about 20 VMs for our infrastructure, the
>most IO demanding is probably LDAP) is light enough that wearout
> won't be a problem.
>
>The issue I'm seeing then is probably related to direct IO if using
> bluestore. But with filestore, the file system cache probably
>hides the latency issues.
>
>
>Igor,
>
>
>AFAIR Samsung 860 Pro isn't for enterprise market, you
> shouldn't use consumer SSDs for Ceph.
>
>I had some experience with Samsung 960 Pro a while ago and it
> turned out that it handled fsync-ed writes very slowly
>(comparing to the original/advertised performance). Which one
> can probably explain by the lack of power loss protection
>for these drives. I suppose it's the same in your case.
>
>Here are a couple links on the topic:
>
>
> https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/
>
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
>
>Power loss protection wasn't a criteria for me as the cluster hosts
> are distributed in two buildings with separate battery backed
>UPSs. As mentioned above I suspect the main difference for my case
> between filestore and bluestore is file system cache vs. direct
>IO. Which means I will keep using filestore.
>
>Regards,
>
>Uwe
> 
>
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> With best regards,
> Vitaliy Filippov
>
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph tracker login failed

2019-02-28 Thread M Ranga Swami Reddy
I tried to log in to the ceph tracker - it is failing with the OpenID URL.

I tried with my OpenID:
http://tracker.ceph.com/login

my id: https://code.launchpad.net/~swamireddy
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Uwe Sauter
Do you have anything particular in mind? I'm using mdb backend with maxsize = 
1GB but currently the files are only about 23MB.


> 
> I am having quite a few openldap servers (slaves) running also, make 
> sure to use proper caching that saves a lot of disk io.  
> 
> 
> 
> 
> -Original Message-
> Sent: 28 February 2019 13:56
> To: uwe.sauter...@gmail.com; Uwe Sauter; Ceph Users
> Subject: *SPAM* Re: [ceph-users] Fwd: Re: Blocked ops after 
> change from filestore on HDD to bluestore on SDD
> 
> "Advanced power loss protection" is in fact a performance feature, not a 
> safety one.
> 
> 
> On 28 February 2019 13:03:51 GMT+03:00, Uwe Sauter wrote:
> 
>   Hi all,
>   
>   thanks for your insights.
>   
>   Eneko,
>   
> 
>   We tried to use a Samsung 840 Pro SSD as OSD some time ago and 
> it was a no-go; it wasn't that performance was bad, it 
>   just didn't work for the kind of use of OSD. Any HDD was 
> better than it (the disk was healthy and have been used in a 
>   software raid-1 for a pair of years).
>   
>   I suggest you check first that your Samsung 860 Pro disks work 
> well for Ceph. Also, how is your host's RAM?
> 
> 
>   As already mentioned the hosts each have 64GB RAM. Each host has 3 
> SSDs for OSD usage. Each OSD is using about 1.3GB virtual
>   memory / 400MB residual memory.
>   
>   
>   
>   Joachim,
>   
> 
>   I can only recommend the use of enterprise SSDs. We've tested 
> many consumer SSDs in the past, including your SSDs. Many 
>   of them are not suitable for long-term use and some weard out 
> within 6 months.
> 
> 
>   Unfortunately I couldn't afford enterprise grade SSDs. But I 
> suspect that my workload (about 20 VMs for our infrastructure, the
>   most IO demanding is probably LDAP) is light enough that wearout 
> won't be a problem.
>   
>   The issue I'm seeing then is probably related to direct IO if using 
> bluestore. But with filestore, the file system cache probably
>   hides the latency issues.
>   
>   
>   Igor,
>   
> 
>   AFAIR Samsung 860 Pro isn't for enterprise market, you 
> shouldn't use consumer SSDs for Ceph.
>   
>   I had some experience with Samsung 960 Pro a while ago and it 
> turned out that it handled fsync-ed writes very slowly 
>   (comparing to the original/advertised performance). Which one 
> can probably explain by the lack of power loss protection 
>   for these drives. I suppose it's the same in your case.
>   
>   Here are a couple links on the topic:
>   
>   
> https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/
>   
>   
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> 
> 
>   Power loss protection wasn't a criteria for me as the cluster hosts 
> are distributed in two buildings with separate battery backed
>   UPSs. As mentioned above I suspect the main difference for my case 
> between filestore and bluestore is file system cache vs. direct
>   IO. Which means I will keep using filestore.
>   
>   Regards,
>   
>   Uwe
> 
> 
>   ceph-users mailing list
>   ceph-users@lists.ceph.com
>   http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> --
> With best regards,
> Vitaliy Filippov
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Marc Roos

I am also running quite a few openldap servers (slaves); make sure to use 
proper caching, that saves a lot of disk io.  




-Original Message-
Sent: 28 February 2019 13:56
To: uwe.sauter...@gmail.com; Uwe Sauter; Ceph Users
Subject: *SPAM* Re: [ceph-users] Fwd: Re: Blocked ops after 
change from filestore on HDD to bluestore on SDD

"Advanced power loss protection" is in fact a performance feature, not a 
safety one.


On 28 February 2019 13:03:51 GMT+03:00, Uwe Sauter wrote:

Hi all,

thanks for your insights.

Eneko,


We tried to use a Samsung 840 Pro SSD as OSD some time ago and 
it was a no-go; it wasn't that performance was bad, it 
just didn't work for the kind of use of OSD. Any HDD was 
better than it (the disk was healthy and have been used in a 
software raid-1 for a pair of years).

I suggest you check first that your Samsung 860 Pro disks work 
well for Ceph. Also, how is your host's RAM?


As already mentioned the hosts each have 64GB RAM. Each host has 3 
SSDs for OSD usage. Each OSD is using about 1.3GB virtual
memory / 400MB residual memory.



Joachim,


I can only recommend the use of enterprise SSDs. We've tested 
many consumer SSDs in the past, including your SSDs. Many 
of them are not suitable for long-term use and some weard out 
within 6 months.


Unfortunately I couldn't afford enterprise grade SSDs. But I 
suspect that my workload (about 20 VMs for our infrastructure, the
most IO demanding is probably LDAP) is light enough that wearout 
won't be a problem.

The issue I'm seeing then is probably related to direct IO if using 
bluestore. But with filestore, the file system cache probably
hides the latency issues.


Igor,


AFAIR Samsung 860 Pro isn't for enterprise market, you 
shouldn't use consumer SSDs for Ceph.

I had some experience with Samsung 960 Pro a while ago and it 
turned out that it handled fsync-ed writes very slowly 
(comparing to the original/advertised performance). Which one 
can probably explain by the lack of power loss protection 
for these drives. I suppose it's the same in your case.

Here are a couple links on the topic:


https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/


https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/


Power loss protection wasn't a criteria for me as the cluster hosts 
are distributed in two buildings with separate battery backed
UPSs. As mentioned above I suspect the main difference for my case 
between filestore and bluestore is file system cache vs. direct
IO. Which means I will keep using filestore.

Regards,

Uwe


ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
With best regards,
Vitaliy Filippov


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Виталий Филиппов
"Advanced power loss protection" is in fact a performance feature, not a safety 
one.

On 28 February 2019 13:03:51 GMT+03:00, Uwe Sauter wrote:
>Hi all,
>
>thanks for your insights.
>
>Eneko,
>
>> We tried to use a Samsung 840 Pro SSD as OSD some time ago and it was
>a no-go; it wasn't that performance was bad, it 
>> just didn't work for the kind of use of OSD. Any HDD was better than
>it (the disk was healthy and have been used in a 
>> software raid-1 for a pair of years).
>> 
>> I suggest you check first that your Samsung 860 Pro disks work well
>for Ceph. Also, how is your host's RAM?
>
>As already mentioned the hosts each have 64GB RAM. Each host has 3 SSDs
>for OSD usage. Each OSD is using about 1.3GB virtual
>memory / 400MB residual memory.
>
>
>
>Joachim,
>
>> I can only recommend the use of enterprise SSDs. We've tested many
>consumer SSDs in the past, including your SSDs. Many 
>> of them are not suitable for long-term use and some weard out within
>6 months.
>
>Unfortunately I couldn't afford enterprise grade SSDs. But I suspect
>that my workload (about 20 VMs for our infrastructure, the
>most IO demanding is probably LDAP) is light enough that wearout won't
>be a problem.
>
>The issue I'm seeing then is probably related to direct IO if using
>bluestore. But with filestore, the file system cache probably
>hides the latency issues.
>
>
>Igor,
>
>> AFAIR Samsung 860 Pro isn't for enterprise market, you shouldn't use
>consumer SSDs for Ceph.
>> 
>> I had some experience with Samsung 960 Pro a while ago and it turned
>out that it handled fsync-ed writes very slowly 
>> (comparing to the original/advertised performance). Which one can
>probably explain by the lack of power loss protection 
>> for these drives. I suppose it's the same in your case.
>> 
>> Here are a couple links on the topic:
>> 
>>
>https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/
>> 
>>
>https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
>Power loss protection wasn't a criteria for me as the cluster hosts are
>distributed in two buildings with separate battery backed
>UPSs. As mentioned above I suspect the main difference for my case
>between filestore and bluestore is file system cache vs. direct
>IO. Which means I will keep using filestore.
>
>Regards,
>
>   Uwe
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
With best regards,
  Vitaliy Filippov___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd exit common/Thread.cc: 160: FAILED assert(ret == 0)--10.2.10

2019-02-28 Thread lin zhou
Thanks Greg. I found the limit: it is /proc/sys/kernel/threads-max.
I count thread numbers using:
ps -eo nlwp | tail -n +2 | awk '{ num_threads += $1 } END { print num_threads }'
which currently reports 97981. Raising kernel.threads-max (and kernel.pid_max
if needed) via sysctl should give the OSD threads enough headroom.

On Thu, Feb 28, 2019 at 10:33 AM, lin zhou wrote:

> Thanks, Greg. Your reply always so fast.
>
> I check my system these limits.
> # ulimit -a
> core file size  (blocks, -c) unlimited
> data seg size   (kbytes, -d) unlimited
> scheduling priority (-e) 0
> file size   (blocks, -f) unlimited
> pending signals (-i) 257219
> max locked memory   (kbytes, -l) 64
> max memory size (kbytes, -m) unlimited
> open files  (-n) 65535
> pipe size(512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> real-time priority  (-r) 0
> stack size  (kbytes, -s) 8192
> cpu time   (seconds, -t) unlimited
> max user processes  (-u) 257219
> virtual memory  (kbytes, -v) unlimited
> file locks  (-x) unlimited
>
> # cat /proc/sys/kernel/pid_max
> 196608
> # cat /proc/sys/kernel/threads-max
> 10
> # cat /proc/sys/fs/file-max
> 6584236
> # sysctl fs.file-nr
> fs.file-nr = 55520 0 6584236
> # sysctl fs.file-max
> fs.file-max = 6584236
>
> I try to count all fd:
> # total=0;for pid in `ls /proc/`;do num=`ls  /proc/$pid/fd 2>/dev/null|wc
> -l`;total=$((total+num));done;echo ${total}
> 53727
>
> I check every osd service open files limit is all 32768
> # for pid in `ps aux|grep osd|grep -v grep|awk '{print $2}'`;do cat
> /proc/$pid/limits |grep open;done
> Max open files3276832768files
> Max open files3276832768files
> Max open files3276832768files
> Max open files3276832768files
>
> free -g
>  total   used   free sharedbuffers cached
> Mem:62 46 16  0  0  5
> -/+ buffers/cache: 41 21
> Swap:3  0  3
>
> another situation is 14 osds in five hosts appeared this problem and all
> in the same failure domain so far.
>
> On Thu, Feb 28, 2019 at 1:59 AM, Gregory Farnum wrote:
>
>> The OSD tried to create a new thread, and the kernel told it no. You
>> probably need to turn up the limits on threads and/or file descriptors.
>> -Greg
>>
>> On Wed, Feb 27, 2019 at 2:36 AM hnuzhoulin2 
>> wrote:
>>
>>> Hi, guys
>>>
>>> So far, there have been 10 osd service exit because of this error.
>>> the error messages are all the same.
>>>
>>> 2019-02-27 17:14:59.757146 7f89925ff700 0 -- 10.191.175.15:6886/192803
>>> >> 10.191.175.49:6833/188731 pipe(0x55ebba819400 sd=741 :6886 s=0 pgs=0
>>> cs=0 l=0 c=0x55ebbb8ba900).accept connect_seq 3912 vs existing 3911 state
>>> standby
>>> 2019-02-27 17:15:05.858802 7f89d9856700 -1 common/Thread.cc: In function
>>> 'void Thread::create(const char*, size_t)' thread 7f89d9856700 time
>>> 2019-02-27 17:15:05.806607
>>> common/Thread.cc: 160: FAILED assert(ret == 0)
>>>
>>> ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x82) [0x55eb7a849e12]
>>>  2: (Thread::create(char const*, unsigned long)+0xba) [0x55eb7a82c14a]
>>>  3: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x55eb7a8203ef]
>>>  4: (Accepter::entry()+0x379) [0x55eb7a8f3ee9]
>>>  5: (()+0x8064) [0x7f89ecf76064]
>>>  6: (clone()+0x6d) [0x7f89eb07762d]
>>>  NOTE: a copy of the executable, or `objdump -rdS ` is
>>> needed to interpret this.
>>> --- begin dump of recent events ---
>>> 1> 2019-02-27 17:14:50.999276 7f893e811700 1 -
>>> 10.191.175.15:0/192803 < osd.850 10.191.175.46:6837/190855 6953447 
>>> osd_ping(ping_reply e17846 stamp 2019-02-27 17:14:50.995043) v3 
>>> 2004+0+0 (3980167553 0 0) 0x55eba12b7400 con 0x55eb96ada600
>>>
>>> detail logs see:
>>> https://drive.google.com/file/d/1fZyhTj06CJlcRjmllaPQMNJknI9gAg6J/view
>>>
>>> when I restart these osd services, it looks works well. But I do not
>>> know if it will happen in the other osds.
>>> And I can not find any error log in the system except the following
>>> dmesg info:
>>>
>>> [三 1月 30 08:14:11 2019] megasas: Command pool (fusion) empty!
>>> [三 1月 30 08:14:11 2019] Couldn't build MFI pass thru cmd
>>> [三 1月 30 08:14:11 2019] Couldn't issue MFI pass thru cmd
>>> [三 1月 30 08:14:11 2019] megasas: Command pool empty!
>>> [三 1月 30 08:14:11 2019] megasas: Failed to get a cmd packet
>>> [三 1月 30 08:14:11 2019] megasas: Command pool empty!
>>> [三 1月 30 08:14:11 2019] megasas: Failed to get a cmd packet
>>> [三 1月 30 08:14:11 2019] megasas: Command pool empty!
>>> [三 1月 30 08:14:11 2019] megasas: Failed to get a cmd packet
>>> [三 1月 30 08:14:11 2019] megasas: Command pool (fusion) empty!
>>> [三 1月 30 08:14:11 2019] megasas: Err returned from build_and_

[ceph-users] Fwd: Re: Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Uwe Sauter
Hi all,

thanks for your insights.

Eneko,

> We tried to use a Samsung 840 Pro SSD as OSD some time ago and it was a 
> no-go; it wasn't that performance was bad, it 
> just didn't work for the kind of use of OSD. Any HDD was better than it (the 
> disk was healthy and have been used in a 
> software raid-1 for a pair of years).
> 
> I suggest you check first that your Samsung 860 Pro disks work well for Ceph. 
> Also, how is your host's RAM?

As already mentioned, the hosts each have 64GB RAM. Each host has 3 SSDs for OSD 
usage. Each OSD is using about 1.3GB virtual memory / 400MB resident memory.



Joachim,

> I can only recommend the use of enterprise SSDs. We've tested many consumer 
> SSDs in the past, including your SSDs. Many 
> of them are not suitable for long-term use and some wore out within 6 months.

Unfortunately I couldn't afford enterprise grade SSDs. But I suspect that my 
workload (about 20 VMs for our infrastructure, the
most IO demanding is probably LDAP) is light enough that wearout won't be a 
problem.

The issue I'm seeing then is probably related to direct IO if using bluestore. 
But with filestore, the file system cache probably
hides the latency issues.
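
For reference, a crude way to check how these SSDs handle sync-ed 4k writes
without setting up fio (Python 3; the path below is a placeholder for a file
system on the SSD under test):

import os
import time

# Write 4k blocks with O_DSYNC and report the average latency, similar in
# spirit to the fsync tests in the percona / sebastien-han articles cited in
# this thread. PATH is a placeholder for a scratch file on the SSD under test.
PATH = "/path/on/the/ssd/latency-probe"
SAMPLES = 500
buf = os.urandom(4096)

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
start = time.time()
for _ in range(SAMPLES):
    os.write(fd, buf)
elapsed = time.time() - start
os.close(fd)
os.remove(PATH)

print("avg sync write latency: %.2f ms (~%d iops)"
      % (elapsed / SAMPLES * 1000, SAMPLES / elapsed))

On a consumer SSD without power loss protection I would expect this to report
a few milliseconds, on an enterprise SSD with capacitors well under one.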


Igor,

> AFAIR Samsung 860 Pro isn't for enterprise market, you shouldn't use consumer 
> SSDs for Ceph.
> 
> I had some experience with Samsung 960 Pro a while ago and it turned out that 
> it handled fsync-ed writes very slowly 
> (comparing to the original/advertised performance). Which one can probably 
> explain by the lack of power loss protection 
> for these drives. I suppose it's the same in your case.
> 
> Here are a couple links on the topic:
> 
> https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/
> 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

Power loss protection wasn't a criteria for me as the cluster hosts are 
distributed in two buildings with separate battery backed
UPSs. As mentioned above I suspect the main difference for my case between 
filestore and bluestore is file system cache vs. direct
IO. Which means I will keep using filestore.

Regards,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Uwe Sauter
Am 28.02.19 um 10:42 schrieb Matthew H:
> Have you made any changes to your ceph.conf? If so, would you mind copying 
> them into this thread?

No, I just deleted an OSD, replaced HDD with SDD and created a new OSD (with 
bluestore). Once the cluster was healty again, I
repeated with the next OSD.


[global]
  auth client required = cephx
  auth cluster required = cephx
  auth service required = cephx
  cluster network = 169.254.42.0/24
  fsid = 753c9bbd-74bd-4fea-8c1e-88da775c5ad4
  keyring = /etc/pve/priv/$cluster.$name.keyring
  public network = 169.254.42.0/24

[mon]
  mon allow pool delete = true
  mon data avail crit = 5
  mon data avail warn = 15

[osd]
  keyring = /var/lib/ceph/osd/ceph-$id/keyring
  osd journal size = 5120
  osd pool default min size = 2
  osd pool default size = 3
  osd max backfills = 6
  osd recovery max active = 12

[mon.px-golf-cluster]
  host = px-golf-cluster
  mon addr = 169.254.42.54:6789

[mon.px-hotel-cluster]
  host = px-hotel-cluster
  mon addr = 169.254.42.55:6789

[mon.px-india-cluster]
  host = px-india-cluster
  mon addr = 169.254.42.56:6789




> 
> --
> *From:* ceph-users  on behalf of Vitaliy 
> Filippov 
> *Sent:* Wednesday, February 27, 2019 4:21 PM
> *To:* Ceph Users
> *Subject:* Re: [ceph-users] Blocked ops after change from filestore on HDD to 
> bluestore on SDD
>  
> I think this should not lead to blocked ops in any case, even if the 
> performance is low...
> 
> -- 
> With best regards,
>    Vitaliy Filippov
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] REQUEST_SLOW across many OSDs at the same time

2019-02-28 Thread Matthew H
Is fstrim or discard enabled for these SSDs? If so, how did you enable it?

I've seen similar issues with poor controllers on SSDs. They tend to block I/O 
when trim kicks off.

Thanks,


From: ceph-users  on behalf of Paul Emmerich 

Sent: Friday, February 22, 2019 9:04 AM
To: Massimo Sgaravatto
Cc: Ceph Users
Subject: Re: [ceph-users] REQUEST_SLOW across many OSDs at the same time

Bad SSDs can also cause this. Which SSD are you using?

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Feb 22, 2019 at 2:53 PM Massimo Sgaravatto
 wrote:
>
> A couple of hints to debug the issue (since I had to recently debug a problem 
> with the same symptoms):
>
> - As far as I understand, the reported 'implicated osds' are only the primary 
> ones. In the log of the OSDs you should also find the relevant pg number, and 
> with this information you can get all the involved OSDs. This might be useful 
> e.g. to see if a specific OSD node is always involved. This was my case (the 
> problem was with the patch cable connecting the node)
>
> - You can use the "ceph daemon osd.x dump_historic_ops" command to debug some 
> of these slow requests (to see which events take the most time)
>
> Cheers, Massimo
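
(A rough sketch of those two steps, with osd id, pg id and log path as 
placeholders:)

# find the pg ids behind the slow requests on an implicated OSD
grep "slow request" /var/log/ceph/ceph-osd.10.log
# list all OSDs serving one of those pgs
ceph pg map 5.1f
# on the OSD's host, see where the recent slow ops spent their time
ceph daemon osd.10 dump_historic_ops | less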
>
> On Fri, Feb 22, 2019 at 10:28 AM mart.v  wrote:
>>
>> Hello everyone,
>>
>> I'm experiencing a strange behaviour. My cluster is relatively small (43 
>> OSDs, 11 nodes), running Ceph 12.2.10 (and Proxmox 5). Nodes are connected 
>> via a 10 Gbit network (Nexus 6000). The cluster is mixed (SSD and HDD), but 
>> with different pools. The described error occurs only on the SSD part of the cluster.
>>
>> I noticed that a few times a day the cluster slows down a bit and I have 
>> discovered this in the logs:
>>
>> 2019-02-22 08:21:20.064396 mon.node1 mon.0 172.16.254.101:6789/0 1794159 : 
>> cluster [WRN] Health check failed: 27 slow requests are blocked > 32 sec. 
>> Implicated osds 10,22,33 (REQUEST_SLOW)
>> 2019-02-22 08:21:26.589202 mon.node1 mon.0 172.16.254.101:6789/0 1794169 : 
>> cluster [WRN] Health check update: 199 slow requests are blocked > 32 sec. 
>> Implicated osds 0,4,5,6,7,8,9,10,12,16,17,19,20,21,22,25,26,33,41 
>> (REQUEST_SLOW)
>> 2019-02-22 08:21:32.655671 mon.node1 mon.0 172.16.254.101:6789/0 1794183 : 
>> cluster [WRN] Health check update: 448 slow requests are blocked > 32 sec. 
>> Implicated osds 0,3,4,5,6,7,8,9,10,12,15,16,17,19,20,21,22,24,25,26,33,41 
>> (REQUEST_SLOW)
>> 2019-02-22 08:21:38.744210 mon.node1 mon.0 172.16.254.101:6789/0 1794210 : 
>> cluster [WRN] Health check update: 388 slow requests are blocked > 32 sec. 
>> Implicated osds 4,8,10,16,24,33 (REQUEST_SLOW)
>> 2019-02-22 08:21:42.790346 mon.node1 mon.0 172.16.254.101:6789/0 1794214 : 
>> cluster [INF] Health check cleared: REQUEST_SLOW (was: 18 slow requests are 
>> blocked > 32 sec. Implicated osds 8,16)
>>
>> "ceph health detail" shows nothing more
>>
>> It is happening throughout the whole day and the times can't be linked to any 
>> read- or write-intensive task (e.g. backup). I also tried to disable 
>> scrubbing, but the slow requests kept coming. These errors were not there from 
>> the beginning, but unfortunately I cannot track down the day they started (it 
>> is beyond my logs).
>>
>> Any ideas?
>>
>> Thank you!
>> Martin
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Matthew H
Have you made any changes to your ceph.conf? If so, would you mind copying them 
into this thread?


From: ceph-users  on behalf of Vitaliy 
Filippov 
Sent: Wednesday, February 27, 2019 4:21 PM
To: Ceph Users
Subject: Re: [ceph-users] Blocked ops after change from filestore on HDD to 
bluestore on SDD

I think this should not lead to blocked ops in any case, even if the
performance is low...

--
With best regards,
   Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs recursive stats | rctime in the future

2019-02-28 Thread David C
On Wed, Feb 27, 2019 at 11:35 AM Hector Martin 
wrote:

> On 27/02/2019 19:22, David C wrote:
> > Hi All
> >
> > I'm seeing quite a few directories in my filesystem with rctime years in
> > the future. E.g
> >
> > ]# getfattr -d -m ceph.dir.* /path/to/dir
> > getfattr: Removing leading '/' from absolute path names
> > # file:  path/to/dir
> > ceph.dir.entries="357"
> > ceph.dir.files="1"
> > ceph.dir.rbytes="35606883904011"
> > ceph.dir.rctime="1851480065.090"
> > ceph.dir.rentries="12216551"
> > ceph.dir.rfiles="10540827"
> > ceph.dir.rsubdirs="1675724"
> > ceph.dir.subdirs="356"
> >
> > That's showing a last modified time of 2 Sept 2028, the day and month
> > are also wrong.
>
> Obvious question: are you sure the date/time on your cluster nodes and
> your clients is correct? Can you track down which files (if any) have
> the ctime in the future by following the rctime down the filesystem tree?
>

Times are all correct on the nodes and CephFS clients; however, the fs is
being exported over NFS. It's possible some NFS clients have the wrong time,
although I'm reasonably confident they are all correct: the machines are
synced to local time servers and use AD for auth, and things wouldn't work
if the time was that wildly out of sync.

Good idea on checking down the tree. I've found the offending files but
can't find any explanation as to why they have a modified date so far in
the future.
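
A minimal sketch of such a walk with the recursive-stats xattrs, in case it 
helps someone else (the mount path is a placeholder):

# print the rctime of every directory below a starting point,
# with the entries furthest in the future at the end
find /mnt/cephfs/path -type d -print0 | while IFS= read -r -d '' d; do
    printf '%s %s\n' "$(getfattr --only-values -n ceph.dir.rctime "$d" 2>/dev/null)" "$d"
done | sort -n | tail -20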

For example, one dir is "/.config/caja/" in a user's home dir. The modified
times of the files in this dir are all wildly different: 1984, 1997, 2028...

It certainly feels like an MDS issue to me. I've used the recursive stats
since Jewel and I've never seen this before.

Any ideas?



> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] How does ceph use the STS service?

2019-02-28 Thread Matthew H
This feature is in the Nautilus release.

The first release (14.1.0) of Nautilus is available from download.ceph.com as 
of last Friday.


From: ceph-users  on behalf of admin 

Sent: Thursday, February 28, 2019 4:22 AM
To: Pritha Srivastava; Sage Weil; ceph-us...@ceph.com
Subject: Re: [ceph-users] [Ceph-community] How does ceph use the STS service?

Hi, can you tell me the version that includes STS lite?
Thanks,
myxingkong


From: Pritha Srivastava
Sent: 2019-02-27 23:53:58
To: Sage Weil
Cc: admin; 
ceph-us...@ceph.com
Subject: Re: [ceph-users] [Ceph-community] How does ceph use the STS service?
Sorry I overlooked the ceph versions in the email.

STS Lite is not a part of ceph version 12.2.11 or ceph version 13.2.2.

Thanks,
Pritha

On Wed, Feb 27, 2019 at 9:09 PM Pritha Srivastava 
mailto:prsri...@redhat.com>> wrote:
You need to attach a policy to be able to invoke GetSessionToken. Please read 
the documentation below at:

https://github.com/ceph/ceph/pull/24818/commits/512b6d8bd951239d44685b25dccaf904f19872b2

Thanks,
Pritha

On Wed, Feb 27, 2019 at 8:59 PM Sage Weil 
mailto:s...@newdream.net>> wrote:
Moving this to ceph-users.

On Wed, 27 Feb 2019, admin wrote:

> I want to use the STS service to generate temporary credentials for use by 
> third-party clients.
>
> I configured STS lite based on the documentation.
> http://docs.ceph.com/docs/master/radosgw/STSLite/
>
> This is my configuration file:
>
> [global]
> fsid = 42a7cae1-84d1-423e-93f4-04b0736c14aa
> mon_initial_members = admin, node1, node2, node3
> mon_host = 192.168.199.81,192.168.199.82,192.168.199.83,192.168.199.84
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
>
> osd pool default size = 2
>
> [client.rgw.admin]
> rgw sts key = "1234567890"
> rgw s3 auth use sts = true
>
> When I execute the getSessionToken method, it returns a 405 error:
>
> <Error>
> <Code>MethodNotAllowed</Code>
> <RequestId>tx3-005c73aed8-5e48-default</RequestId>
> <HostId>5e48-default-default</HostId>
> </Error>
>
> This is my test code:
>
> import os
> import sys
> import traceback
>
> import boto3
> from boto.s3.connection import S3Connection
> from boto.sts import STSConnection
>
> try:
> host = 'http://192.168.199.81:7480'
> access_key = '2324YFZ7QDEOSRL18QHR'
> secret_key = 'rL9FabxCOw5LDbrHtmykiGSCjzpKLmEs9WPiNjVJ'
>
> client = boto3.client('sts',
>   aws_access_key_id = access_key,
>   aws_secret_access_key = secret_key,
>   endpoint_url = host)
> response = client.get_session_token(DurationSeconds=999)
> print response
> except:
> print traceback.format_exc()
>
> Who can tell me if my configuration is incorrect or if the version I tested 
> does not provide STS service?
>
> This is the version I tested:
>
> ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous 
> (stable)
>
> ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic 
> (stable)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] How does ceph use the STS service?

2019-02-28 Thread admin
Hi, can you tell me the version that includes STS lite?
Thanks,
myxingkong
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic 13.2.4 rbd du slowness

2019-02-28 Thread Wido den Hollander



On 2/28/19 9:41 AM, Glen Baars wrote:
> Hello Wido,
> 
> I have looked at the libvirt code and there is a check to ensure that 
> fast-diff is enabled on the image and only then does it try to get the real 
> disk usage. The issue for me is that even with fast-diff enabled it takes 
> 25min to get the space usage for a 50TB image.
> 
> I had considered turning off fast-diff on the large images to get around the 
> issue but I think that will hurt my snapshot removal times (untested)
> 

Can you tell a bit more about the Ceph cluster? HDD? SSD? DB and WAL on SSD?

Do you see OSDs spike in CPU or Disk I/O when you do a 'rbd du' on these
images?
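
(One way to watch that while the command runs, on an OSD host - the osd id is 
a placeholder:)

# per-second perf counters from a running OSD, via its admin socket
ceph daemonperf osd.12
# disk utilisation on the same host
iostat -x 1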

Wido

> I can't see in the code any other way of bypassing the disk usage check but I 
> am not that familiar with the code.
> 
> ---
> if (volStorageBackendRBDUseFastDiff(features)) {
> VIR_DEBUG("RBD image %s/%s has fast-diff feature enabled. "
>   "Querying for actual allocation",
>   def->source.name, vol->name);
> 
> if (virStorageBackendRBDSetAllocation(vol, image, &info) < 0)
> goto cleanup;
> } else {
> vol->target.allocation = info.obj_size * info.num_objs;
> }
> --
> 
> Kind regards,
> Glen Baars
> 
> -Original Message-
> From: Wido den Hollander 
> Sent: Thursday, 28 February 2019 3:49 PM
> To: Glen Baars ; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Mimic 13.2.4 rbd du slowness
> 
> 
> 
> On 2/28/19 2:59 AM, Glen Baars wrote:
>> Hello Ceph Users,
>>
>> Has anyone found a way to improve the speed of the rbd du command on large 
>> rbd images? I have object map and fast diff enabled - no invalid flags on 
>> the image or its snapshots.
>>
>> We recently upgraded our Ubuntu 16.04 KVM servers for Cloudstack to Ubuntu 
>> 18.04. The upgrade brings libvirt to version 4. When libvirt 4 adds an rbd pool 
>> it discovers all images in the pool and tries to get their disk usage. We are 
>> seeing a 50TB image take 25min. The pool has over 300TB of images in it, and 
>> it takes hours for libvirt to start.
>>
> 
> This is actually a pretty bad thing imho, as a lot of images people will be 
> using do not have fast-diff enabled (images from the past) and that will kill 
> their performance.
> 
> Isn't there a way to turn this off in libvirt?
> 
> Wido
> 
>> We can replicate the issue without libvirt by just running an rbd du on the 
>> large images. The limiting factor is the CPU used by the rbd du command; it 
>> uses 100% of a single core.
>>
>> Our cluster is completely bluestore/mimic 13.2.4. 168 OSDs, 12 Ubuntu 16.04 
>> hosts.
>>
>> Kind regards,
>> Glen Baars
>> This e-mail is intended solely for the benefit of the addressee(s) and any 
>> other named recipient. It is confidential and may contain legally privileged 
>> or confidential information. If you are not the recipient, any use, 
>> distribution, disclosure or copying of this e-mail is prohibited. The 
>> confidentiality and legal privilege attached to this communication is not 
>> waived or lost by reason of the mistaken transmission or delivery to you. If 
>> you have received this e-mail in error, please notify us immediately.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> This e-mail is intended solely for the benefit of the addressee(s) and any 
> other named recipient. It is confidential and may contain legally privileged 
> or confidential information. If you are not the recipient, any use, 
> distribution, disclosure or copying of this e-mail is prohibited. The 
> confidentiality and legal privilege attached to this communication is not 
> waived or lost by reason of the mistaken transmission or delivery to you. If 
> you have received this e-mail in error, please notify us immediately.
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic 13.2.4 rbd du slowness

2019-02-28 Thread Glen Baars
Hello Wido,

I have looked at the libvirt code and there is a check to ensure that fast-diff 
is enabled on the image and only then does it try to get the real disk usage. 
The issue for me is that even with fast-diff enabled it takes 25min to get the 
space usage for a 50TB image.
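
(For anyone who wants to reproduce the timing outside libvirt, a minimal 
sketch - pool and image names are placeholders:)

# confirm fast-diff/object-map are enabled and the object map isn't flagged invalid
rbd info rbd/vm-100-disk-1 | grep -E 'features|flags'
# time the usage calculation itself
time rbd du rbd/vm-100-disk-1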

I had considered turning off fast-diff on the large images to get around the 
issue but I think that will hurt my snapshot removal times (untested)

I can't see in the code any other way of bypassing the disk usage check but I 
am not that familiar with the code.

---
if (volStorageBackendRBDUseFastDiff(features)) {
VIR_DEBUG("RBD image %s/%s has fast-diff feature enabled. "
  "Querying for actual allocation",
  def->source.name, vol->name);

if (virStorageBackendRBDSetAllocation(vol, image, &info) < 0)
goto cleanup;
} else {
vol->target.allocation = info.obj_size * info.num_objs;
}
--

Kind regards,
Glen Baars

-Original Message-
From: Wido den Hollander 
Sent: Thursday, 28 February 2019 3:49 PM
To: Glen Baars ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Mimic 13.2.4 rbd du slowness



On 2/28/19 2:59 AM, Glen Baars wrote:
> Hello Ceph Users,
>
> Has anyone found a way to improve the speed of the rbd du command on large 
> rbd images? I have object map and fast diff enabled - no invalid flags on the 
> image or it's snapshots.
>
> We recently upgraded our Ubuntu 16.04 KVM servers for Cloudstack to Ubuntu 
> 18.04. The upgrade brings libvirt to version 4. When libvirt 4 adds an rbd pool 
> it discovers all images in the pool and tries to get their disk usage. We are 
> seeing a 50TB image take 25min. The pool has over 300TB of images in it, and 
> it takes hours for libvirt to start.
>

This is actually a pretty bad thing imho, as a lot of images people will be 
using do not have fast-diff enabled (images from the past) and that will kill 
their performance.

Isn't there a way to turn this off in libvirt?

Wido

> We can replicate the issue without libvirt by just running an rbd du on the 
> large images. The limiting factor is the CPU used by the rbd du command; it 
> uses 100% of a single core.
>
> Our cluster is completely bluestore/mimic 13.2.4. 168 OSDs, 12 Ubuntu 16.04 
> hosts.
>
> Kind regards,
> Glen Baars
> This e-mail is intended solely for the benefit of the addressee(s) and any 
> other named recipient. It is confidential and may contain legally privileged 
> or confidential information. If you are not the recipient, any use, 
> distribution, disclosure or copying of this e-mail is prohibited. The 
> confidentiality and legal privilege attached to this communication is not 
> waived or lost by reason of the mistaken transmission or delivery to you. If 
> you have received this e-mail in error, please notify us immediately.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
This e-mail is intended solely for the benefit of the addressee(s) and any 
other named recipient. It is confidential and may contain legally privileged or 
confidential information. If you are not the recipient, any use, distribution, 
disclosure or copying of this e-mail is prohibited. The confidentiality and 
legal privilege attached to this communication is not waived or lost by reason 
of the mistaken transmission or delivery to you. If you have received this 
e-mail in error, please notify us immediately.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd pg-upmap-items not working

2019-02-28 Thread Dan van der Ster
Hi,

pg-upmap-items became more strict in v12.2.11 when validating upmaps.
E.g., it now won't let you map two replicas of a PG into the same rack if
the crush rule doesn't allow it.

Where are OSDs 23 and 123 in your cluster? What is the relevant crush rule?
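
A few commands that can help answer that, using the ids from your post 
(output will of course differ per cluster):

# where do the two OSDs sit in the crush tree?
ceph osd find 23
ceph osd find 123
# which rules exist and what failure domain do they require?
ceph osd crush rule dump
# current up/acting mapping of the pg in question
ceph pg map 41.1
# confirm the cluster actually requires luminous clients (needed for upmap)
ceph osd dump | grep require_min_compat_client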

-- dan


On Wed, Feb 27, 2019 at 9:17 PM Kári Bertilsson  wrote:
>
> Hello
>
> I am trying to diagnose why upmap stopped working where it was previously 
> working fine.
>
> Trying to move pg 41.1 from osd 23 to osd 123 has no effect and seems to be ignored.
>
> # ceph osd pg-upmap-items 41.1 23 123
> set 41.1 pg_upmap_items mapping to [23->123]
>
> No rebalancing happens, and if I run it again it shows the same output every 
> time.
>
> I have in config
> debug mgr = 4/5
> debug mon = 4/5
>
> Paste from mon & mgr logs. Also output from "ceph osd dump"
> https://pastebin.com/9VrT4YcU
>
>
> I have run "ceph osd set-require-min-compat-client luminous" a long time ago, 
> and all servers running ceph have been rebooted numerous times since then.
> But somehow I am still seeing "min_compat_client jewel". I believe that upmap 
> was previously working anyway with that "jewel" line present.
>
> I see no indication in any logs why the upmap commands are being ignored.
>
> Any suggestions on how to debug further or what could be the issue ?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com