Re: [ceph-users] Slow Performance - Sequential IO

2020-01-17 Thread Christian Balzer

Hello,

I had very odd results in the past with the fio rbd engine and would
suggest testing things in the environment you're going to deploy in, end
to end.

That said, without any caching and coalescing of writes, sequential 4k
writes will keep hitting the same set of OSDs (the acting set of a single
4MB RBD object) for 4MB worth of data, thus limiting things to whatever
the overall per-op latency (network, 3x write) is here.
With random writes you will engage more or less all OSDs that hold your
fio file, thus spreading things out.
This becomes more and more visible with increasing number of OSDs and
nodes.
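
(As a rough illustration of the object-to-OSD mapping, assuming the pool and
image names from the fio job quoted below and the default 4MB object size;
the block_name_prefix here is only a placeholder:)

rbd info rbd_af1/image1   # note the block_name_prefix, e.g. rbd_data.abc123
ceph osd map rbd_af1 rbd_data.abc123.0000000000000000
# -> one PG and its acting set of 3 OSDs; all 1024 sequential 4k writes
#    within this 4MB object land on that same set
ceph osd map rbd_af1 rbd_data.abc123.0000000000000001
# the next 4MB object usually maps to a different PG / different OSDs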

Regards,

Christian
On Fri, 17 Jan 2020 23:01:09 + Anthony Brandelli (abrandel) wrote:

> Not been able to make any headway on this after some significant effort.
> 
> -Tested all 48 SSDs with FIO directly, all tested within 10% of each other for 
> 4k IOPS in rand|seq read|write.
> -Disabled all CPU power save.
> -Tested with both rbd cache enabled and disabled on the client.
> -Tested with drive caches enabled and disabled (hdparm)
> -Minimal TCP retransmissions under load (<10 for a 2 minute duration).
> -No drops/pause frames noted on upstream switches.
> -CPU load on OSD nodes peaks at ~6.
> -iostat shows a peak of 15ms under read/write workloads, %util peaks at about 
> 10%.
> -Swapped out the RBD client for a bigger box, since the load was peaking at 
> 16. Now a 24 core box, load still peaks at 16.
> -Disabled cephx signatures
> -Verified hardware health (nothing in dmesg, nothing in CIMC fault logs, 
> storage controller logs)
> -Tested multiple SSDs at once to find the controller's IOPS limit, which is 
> apparently 650k @ 4k.
> 
> Nothing has made a noticeable difference here. I'm pretty baffled as to what 
> would be causing the awful sequential read and write performance, but 
> allowing good random r/w speeds.
> 
> I switched up fio testing methodologies to use more threads, but this didn't 
> seem to help either:
> 
> [global]
> bs=4k
> ioengine=rbd
> iodepth=32
> size=5g
> runtime=120
> numjobs=4
> group_reporting=1
> pool=rbd_af1
> rbdname=image1
> 
> [seq-read]
> rw=read
> stonewall
> 
> [rand-read]
> rw=randread
> stonewall
> 
> [seq-write]
> rw=write
> stonewall
> 
> [rand-write]
> rw=randwrite
> stonewall
> 
> Any pointers are appreciated at this point. I've been following other threads 
> on the mailing list, and looked at the archives, related to RBD performance 
> but none of the solutions that worked for others seem to have helped this 
> setup.
> 
> Thanks,
> Anthony
> 
> 
> From: Anthony Brandelli (abrandel) 
> Sent: Tuesday, January 14, 2020 12:43 AM
> To: ceph-users@lists.ceph.com 
> Subject: Slow Performance - Sequential IO
> 
> 
> I have a newly setup test cluster that is giving some surprising numbers when 
> running fio against an RBD. The end goal here is to see how viable a Ceph 
> based iSCSI SAN of sorts is for VMware clusters, which require a bunch of 
> random IO.
> 
> 
> 
> Hardware:
> 
> 2x E5-2630L v2 (2.4GHz, 6 core)
> 
> 256GB RAM
> 
> 2x 10gbps bonded network, Intel X520
> 
> LSI 9271-8i, SSDs used for OSDs in JBOD mode
> 
> Mons: 2x 1.2TB 10K SAS in RAID1
> 
> OSDs: 12x Samsung MZ6ER800HAGL-3 800GB SAS SSDs, super cap/power loss 
> protection
> 
> 
> 
> Cluster setup:
> 
> Three mon nodes, four OSD nodes
> 
> Two OSDs per SSD
> 
> Replica 3 pool
> 
> Ceph 14.2.5
> 
> 
> 
> Ceph status:
> 
>   cluster:
> 
> id: e3d93b4a-520c-4d82-a135-97d0bda3e69d
> 
> health: HEALTH_WARN
> 
> application not enabled on 1 pool(s)
> 
>   services:
> 
> mon: 3 daemons, quorum mon1,mon2,mon3 (age 6d)
> 
> mgr: mon2(active, since 6d), standbys: mon3, mon1
> 
> osd: 96 osds: 96 up (since 3d), 96 in (since 3d)
> 
>   data:
> 
> pools:   1 pools, 3072 pgs
> 
> objects: 857.00k objects, 1.8 TiB
> 
> usage:   432 GiB used, 34 TiB / 35 TiB avail
> 
> pgs: 3072 active+clean
> 
> 
> 
> Network between nodes tests at 9.88gbps. Direct testing of the SSDs using a 
> 4K block in fio shows 127k seq read, 86k random read, 107k seq write, 52k 
> random write. No high CPU load/interface saturation is noted when running 
> tests against the rbd.
> 
> 
> 
> When testing with a 4K block size against an RBD on a dedicated metal test 
> host (same specs as other cluster nodes noted above) I get the following 
> (command similar to fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=32 
> -rw= -pool=scbench -runtime=60 -rbdname=datatest):
> 
> 
> 
> 10k sequential read iops
> 
> 69k random read iops
> 
> 13k sequential write iops
> 
> 22k random write iops
> 
> 
> 
> I’m not clear why the random ops, especially read, would be so much quicker 
> compared to the sequential ops.
> 
> 
> 
> Any pointers appreciated.
> 
> 
> 
> Thanks,
> 
> Anthony


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Mobile Inc.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Default Pools

2020-01-17 Thread Daniele Riccucci

Hello,
I'm still a bit confused by the .rgw.root and the 
default.rgw.{control,meta,log} pools.
I recently removed the RGW daemon I had running and the aforementioned 
pools, however after a rebalance I suddenly find them again in the 
output of:


$ ceph osd pool ls
cephfs_data
cephfs_metadata
.rgw.root
default.rgw.control
default.rgw.meta
default.rgw.log

Each has 8 pgs but zero usage.
I was unable to find logs or indications as to which daemon or action 
recreated them or whether it is safe to remove them again. Where should 
I look?

I'm on Nautilus 14.2.5, container deployment.
Thank you.

Regards,
Daniele

On 23/04/19 22:14, David Turner wrote:
You should be able to see all pools in use in a RGW zone from the 
radosgw-admin command. This [1] is probably overkill for most, but I 
deal with multi-realm clusters so I generally think like this when 
dealing with RGW.  Running this as is will create a file in your current 
directory for each zone in your deployment (likely to be just one 
file).  My rough guess for what you would find in that file based on 
your pool names would be this [2].


If you identify any pools not listed from the zone get command, then you 
can rename [3] the pool to see if it is being created and/or used by rgw 
currently.  The process here would be to stop all RGW daemons, rename 
the pools, start a RGW daemon, stop it again, and see which pools were 
recreated.  Clean up the pools that were freshly made and rename the 
original pools back into place before starting your RGW daemons again.  
Please note that .rgw.root is a required pool in every RGW deployment 
and will not be listed in the zones themselves.
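
(A rough sketch of that rename-and-watch test for a single pool; the systemd
unit name and the pool chosen here are placeholders and vary by deployment:)

systemctl stop ceph-radosgw@rgw.$(hostname -s)
ceph osd pool rename default.rgw.meta default.rgw.meta.keep
systemctl start ceph-radosgw@rgw.$(hostname -s)
ceph osd pool ls | grep rgw   # a freshly recreated default.rgw.meta means rgw still uses it
# afterwards: stop rgw again, delete the freshly created pool and rename .keep back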



[1]
for realm in $(radosgw-admin realm list --format=json | jq '.realms[]' -r); do
  for zonegroup in $(radosgw-admin --rgw-realm=$realm zonegroup list --format=json | jq '.zonegroups[]' -r); do
    for zone in $(radosgw-admin --rgw-realm=$realm --rgw-zonegroup=$zonegroup zone list --format=json | jq '.zones[]' -r); do
      echo $realm.$zonegroup.$zone.json
      radosgw-admin --rgw-realm=$realm --rgw-zonegroup=$zonegroup --rgw-zone=$zone zone get > $realm.$zonegroup.$zone.json
    done
  done
done

[2] default.default.default.json
{
     "id": "{{ UUID }}",
     "name": "default",
     "domain_root": "default.rgw.meta",
     "control_pool": "default.rgw.control",
     "gc_pool": ".rgw.gc",
     "log_pool": "default.rgw.log",
     "user_email_pool": ".users.email",
     "user_uid_pool": ".users.uid",
     "system_key": {
     },
     "placement_pools": [
         {
             "key": "default-placement",
             "val": {
                 "index_pool": "default.rgw.buckets.index",
                 "data_pool": "default.rgw.buckets.data",
                 "data_extra_pool": "default.rgw.buckets.non-ec",
                 "index_type": 0,
                 "compression": ""
             }
         }
     ],
     "metadata_heap": "",
     "tier_config": [],
     "realm_id": "{{ UUID }}"
}

[3] ceph osd pool rename <old-pool-name> <new-pool-name>

On Thu, Apr 18, 2019 at 10:46 AM Brent Kennedy <bkenn...@cfl.rr.com> wrote:


Yea, that was a cluster created during firefly...

Wish there was a good article on the naming and use of these, or
perhaps a way I could make sure they are not used before deleting
them.  I know RGW will recreate anything it uses, but I don’t want
to lose data because I wanted a clean system.

-Brent

-Original Message-
From: Gregory Farnum <gfar...@redhat.com>
Sent: Monday, April 15, 2019 5:37 PM
To: Brent Kennedy <bkenn...@cfl.rr.com>
Cc: Ceph Users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Default Pools

On Mon, Apr 15, 2019 at 1:52 PM Brent Kennedy <bkenn...@cfl.rr.com> wrote:
 >
 > I was looking around the web for the reason for some of the
default pools in Ceph and I can't find anything concrete.  Here is
our list, some show no use at all.  Can any of these be deleted (or
is there an article my googlefu failed to find that covers the
default pools)?
 >
 > We only use buckets, so I took out .rgw.buckets, .users and
 > .rgw.buckets.index…
 >
 > Name
 > .log
 > .rgw.root
 > .rgw.gc
 > .rgw.control
 > .rgw
 > .users.uid
 > .users.email
 > .rgw.buckets.extra
 > default.rgw.control
 > default.rgw.meta
 > default.rgw.log
 > default.rgw.buckets.non-ec

All of these are created by RGW when you run it, not by the core
Ceph system. I think they're all used (although they may report
sizes of 0, as they mostly make use of omap).

 > metadata

Except this one used to be created-by-default for CephFS metadata,
but that hasn't been true in many releases. So I guess you're
looking at an old cluster? (In which case it's *possible* some of
those RGW pools are also unused now but were needed in the past; I
haven't kept good track of them.) -Greg

[ceph-users] Monitor handle_auth_bad_method

2020-01-17 Thread Justin Engwer
Hi,
I'm a home user of ceph. Most of the time I can look at the email lists and
articles and figure things out on my own. I've unfortunately run into an
issue I can't troubleshoot myself.

Starting one of my monitors yields this error:

2020-01-17 15:34:13.497 7fca3d006040  0 mon.kvm2@-1(probing) e11  my rank
is now 2 (was -1)
2020-01-17 15:34:13.696 7fca2909b700 -1 mon.kvm2@2(probing) e11
handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2020-01-17 15:34:14.098 7fca2909b700 -1 mon.kvm2@2(probing) e11
handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2020-01-17 15:34:14.899 7fca2909b700 -1 mon.kvm2@2(probing) e11
handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied


I've grabbed a good monmap from other monitors and double checked
permissions on /var/lib/ceph/mon/ceph-kvm2/ to make sure that it's not a
filesystem error and everything looks good to me.

/var/lib/ceph/mon/ceph-kvm2/:
total 32
drwxr-xr-x  4 ceph ceph  4096 May  7  2018 .
drwxr-x---. 4 ceph ceph    44 Jan  8 11:44 ..
-rw-r--r--. 1 ceph ceph     0 Apr 25  2018 done
-rw-------. 1 ceph ceph    77 Apr 25  2018 keyring
-rw-r--r--. 1 ceph ceph     8 Apr 25  2018 kv_backend
drwx------  2 ceph ceph 16384 May  7  2018 lost+found
drwxr-xr-x. 2 ceph ceph  4096 Jan 17 15:27 store.db
-rw-r--r--. 1 ceph ceph     0 Apr 25  2018 systemd

/var/lib/ceph/mon/ceph-kvm2/lost+found:
total 20
drwx------ 2 ceph ceph 16384 May  7  2018 .
drwxr-xr-x 4 ceph ceph  4096 May  7  2018 ..

/var/lib/ceph/mon/ceph-kvm2/store.db:
total 68424
drwxr-xr-x. 2 ceph ceph     4096 Jan 17 15:27 .
drwxr-xr-x  4 ceph ceph     4096 May  7  2018 ..
-rw-------  1 ceph ceph 65834705 Jan 17 14:57 1557088.sst
-rw-------  1 ceph ceph     1833 Jan 17 15:27 1557090.sst
-rw-------  1 ceph ceph        0 Jan 17 15:27 1557092.log
-rw-------  1 ceph ceph       17 Jan 17 15:27 CURRENT
-rw-r--r--. 1 ceph ceph       37 Apr 25  2018 IDENTITY
-rw-r--r--. 1 ceph ceph        0 Apr 25  2018 LOCK
-rw-------  1 ceph ceph      185 Jan 17 15:27 MANIFEST-1557091
-rw-------  1 ceph ceph     4941 Jan 17 14:57 OPTIONS-1557087
-rw-------  1 ceph ceph     4941 Jan 17 15:27 OPTIONS-1557094


Any help would be appreciated.



-- 

*Justin Engwer*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow Performance - Sequential IO

2020-01-17 Thread Anthony Brandelli (abrandel)
Not been able to make any headway on this after some significant effort.

-Tested all 48 SSDs with FIO directly, all tested within 10% of each other for 4k 
IOPS in rand|seq read|write.
-Disabled all CPU power save.
-Tested with both rbd cache enabled and disabled on the client.
-Tested with drive caches enabled and disabled (hdparm)
-Minimal TCP retransmissions under load (<10 for a 2 minute duration).
-No drops/pause frames noted on upstream switches.
-CPU load on OSD nodes peaks at ~6.
-iostat shows a peak of 15ms under read/write workloads, %util peaks at about 
10%.
-Swapped out the RBD client for a bigger box, since the load was peaking at 16. 
Now a 24 core box, load still peaks at 16.
-Disabled cephx signatures
-Verified hardware health (nothing in dmesg, nothing in CIMC fault logs, 
storage controller logs)
-Tested multiple SSDs at once to find the controller's IOPS limit, which is 
apparently 650k @ 4k.

Nothing has made a noticeable difference here. I'm pretty baffled as to what 
would be causing the awful sequential read and write performance, but allowing 
good random r/w speeds.

I switched up fio testing methodologies to use more threads, but this didn't 
seem to help either:

[global]
bs=4k
ioengine=rbd
iodepth=32
size=5g
runtime=120
numjobs=4
group_reporting=1
pool=rbd_af1
rbdname=image1

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall

Any pointers are appreciated at this point. I've been following other threads 
on the mailing list, and looked at the archives, related to RBD performance but 
none of the solutions that worked for others seem to have helped this setup.

Thanks,
Anthony


From: Anthony Brandelli (abrandel) 
Sent: Tuesday, January 14, 2020 12:43 AM
To: ceph-users@lists.ceph.com 
Subject: Slow Performance - Sequential IO


I have a newly setup test cluster that is giving some surprising numbers when 
running fio against an RBD. The end goal here is to see how viable a Ceph based 
iSCSI SAN of sorts is for VMware clusters, which require a bunch of random IO.



Hardware:

2x E5-2630L v2 (2.4GHz, 6 core)

256GB RAM

2x 10gbps bonded network, Intel X520

LSI 9271-8i, SSDs used for OSDs in JBOD mode

Mons: 2x 1.2TB 10K SAS in RAID1

OSDs: 12x Samsung MZ6ER800HAGL-3 800GB SAS SSDs, super cap/power loss 
protection



Cluster setup:

Three mon nodes, four OSD nodes

Two OSDs per SSD

Replica 3 pool

Ceph 14.2.5



Ceph status:

  cluster:

id: e3d93b4a-520c-4d82-a135-97d0bda3e69d

health: HEALTH_WARN

application not enabled on 1 pool(s)

  services:

mon: 3 daemons, quorum mon1,mon2,mon3 (age 6d)

mgr: mon2(active, since 6d), standbys: mon3, mon1

osd: 96 osds: 96 up (since 3d), 96 in (since 3d)

  data:

pools:   1 pools, 3072 pgs

objects: 857.00k objects, 1.8 TiB

usage:   432 GiB used, 34 TiB / 35 TiB avail

pgs: 3072 active+clean



Network between nodes tests at 9.88gbps. Direct testing of the SSDs using a 4K 
block in fio shows 127k seq read, 86k random read, 107k seq write, 52k random 
write. No high CPU load/interface saturation is noted when running tests 
against the rbd.



When testing with a 4K block size against an RBD on a dedicated metal test host 
(same specs as other cluster nodes noted above) I get the following (command 
similar to fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=32 -rw= 
-pool=scbench -runtime=60 -rbdname=datatest):



10k sequential read iops

69k random read iops

13k sequential write iops

22k random write iops



I’m not clear why the random ops, especially read, would be so much quicker 
compared to the sequential ops.



Any pointers appreciated.



Thanks,

Anthony
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Beginner questions

2020-01-17 Thread Dave Hall

Frank,

Thank you for your input.  It is good to know that the cluster will go 
read-only if a node goes down.  Our circumstance is probably a bit 
unusual, which is why I'm considering the 2+1 solution.  We have a 
researcher who will be collecting extremely large amounts of data in 
real time, requiring both high write and high read bandwidth, but it's 
pretty much going to be a single user or a small research group.  Right 
now we have 3 physical storage hosts and we need to get into 
production.  We also need to maximize the available storage on these 3 
nodes.


We chose Ceph due to scalability.  As the research (and the funding) 
progresses we expect to add many more Ceph nodes, and to move the 
MONs/MGRs/MDSs off on to dedicated systems.  At that time I'd likely lay 
out more rational pools and be more thoughtful about resiliency, 
understanding, of course, that I'd have to play games and migrate data 
around.


But for now we have to make the most of the hardware we have. I'm 
thinking 2+1 because that gives me more usable storage than keeping 2 
copies, and much more than keeping 3 copies.


-Dave

Dave Hall
Binghamton University
kdh...@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)


On 1/17/2020 3:50 AM, Frank Schilder wrote:

I would strongly advise against 2+1 EC pools for production if stability is 
your main concern. There was a discussion towards the end of last year 
addressing this in more detail. Short story, if you don't have at least 8-10 
nodes (in the short run), EC is not suitable. You cannot maintain a cluster 
with such EC-pools.

Reasoning: k+1 is a no-go in production. You can set min_size to k, but 
whenever a node is down (maintenance or whatever), new writes are 
non-redundant. Losing just one more disk means data loss. This is not a 
problem with replication x3 and min_size=2. Be aware that maintenance more 
often than not takes more than a day. Parts may need to be shipped. An upgrade 
goes wrong and requires lengthy support for fixing. Etc.

In addition, admins make mistakes. You need to build your cluster such that it 
can survive mistakes (shut down wrong host, etc.) in degraded state. Redundancy 
m=1 means zero tolerance for errors. Often the recommendation therefore is m=3, 
while m=2 is the bare minimum. Note that EC 1+2 is equal in redundancy to 
replication x3, but will use more compute (hence, it's useless). In your 
situation, I would start with replicated pools and move to EC once enough nodes 
are at hand.

If you want to use the benefits of EC, you need to build large clusters. 
Starting with 3 nodes and failure domain disk will be a horrible experience. 
You will not be able to maintain, upgrade or fix anything without downtime.

Plan for sleeping well in worst-case situations.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Bastiaan Visser 

Sent: 17 January 2020 06:55:25
To: Dave Hall
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Beginner questions

There is no difference in allocation between replication or EC. If failure 
domain is host, one OSD per host is used for a PG. So if you use a 2+1 EC 
profile with a host failure domain, you need 3 hosts for a healthy cluster. The 
pool will go read-only when you have a failure (host or disk), or are doing 
maintenance on a node (reboot). On a node failure there will be no rebuilding, 
since there is no place to find a 3rd osd for a pg, so you'll have to 
fix/replace the node before any writes will be accepted.
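
(For reference, a 2+1 profile with host failure domain as described above would
be created roughly like this; profile, pool name and PG count are placeholders:)

ceph osd erasure-code-profile set ec-2-1 k=2 m=1 crush-failure-domain=host
ceph osd pool create ec_pool 64 64 erasure ec-2-1
# min_size defaults to k+1 = 3 here, which is why the pool stops accepting
# writes as soon as one of the three hosts is unavailable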

So yes, you can do a 2+1 EC pool on 3 nodes, you are paying the price in 
reliability, flexibility and maybe performance. Only way to really know the 
latter is benchmarking with your setup.

I think you will be fine on the hardware side. Memory recommendations swing 
around between 512 MB and 1 GB per TB of storage. I usually go with 1 GB. But I never 
use disks larger than 4 TB. On the CPU I always try to have a few more cores 
than I have OSDs in a machine. So 16 is fine in your case.


On Fri, Jan 17, 2020, 03:29 Dave Hall 
<kdh...@binghamton.edu> wrote:

Bastiaan,

Regarding EC pools:   Our concern at 3 nodes is that 2-way replication seems 
risky - if the two copies don't match, which one is corrupted?  However, 3-way 
replication on a 3 node cluster triples the price per TB.   Doing EC pools that 
are the equivalent of RAID-5 2+1 seems like the right place to start as far as 
maximizing capacity is concerned, although I do understand the potential time 
involved in rebuilding a 12 TB drive.  Early on I'd be more concerned about a 
drive failure than about a node failure.

Regarding the hardware, our nodes are single socket EPYC 7302 (16 core, 32 
thread) with 128GB RAM.  From what I recall reading I think the RAM, at least, 
is a bit higher than recommended.

Question:  Does a PG (EC or replicated) span multiple drives per node?  I 
haven't got to the point of understanding this part yet, so pardon the 
totally naive question.

Re: [ceph-users] Weird mount issue (Ubuntu 18.04, Ceph 14.2.5 & 14.2.6)

2020-01-17 Thread Jeff Layton
On Fri, 2020-01-17 at 17:10 +0100, Ilya Dryomov wrote:
> On Fri, Jan 17, 2020 at 2:21 AM Aaron  wrote:
> > No worries, can definitely do that.
> > 
> > Cheers
> > Aaron
> > 
> > On Thu, Jan 16, 2020 at 8:08 PM Jeff Layton  wrote:
> > > On Thu, 2020-01-16 at 18:42 -0500, Jeff Layton wrote:
> > > > On Wed, 2020-01-15 at 08:05 -0500, Aaron wrote:
> > > > > Seeing a weird mount issue.  Some info:
> > > > > 
> > > > > No LSB modules are available.
> > > > > Distributor ID: Ubuntu
> > > > > Description: Ubuntu 18.04.3 LTS
> > > > > Release: 18.04
> > > > > Codename: bionic
> > > > > 
> > > > > Ubuntu 18.04.3 with kernel 4.15.0-74-generic
> > > > > Ceph 14.2.5 & 14.2.6
> > > > > 
> > > > > With ceph-common, ceph-base, etc installed:
> > > > > 
> > > > > ceph/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > ceph-base/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > ceph-common/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > ceph-mds/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > ceph-mgr/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > ceph-mgr-dashboard/stable,stable,now 14.2.6-1bionic all [installed]
> > > > > ceph-mon/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > ceph-osd/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > libcephfs2/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > python-ceph-argparse/stable,stable,now 14.2.6-1bionic all 
> > > > > [installed,automatic]
> > > > > python-cephfs/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > 
> > > > > I create a user via get-or-create cmd, and I have a users/secret now.
> > > > > When I try to mount on these Ubuntu nodes,
> > > > > 
> > > > > The mount cmd I run for testing is:
> > > > > sudo mount -t ceph -o
> > > > > name=user-20c5338c-34db-11ea-b27a-de7033e905f6,secret=AQC6dhpeyczkDxAAhRcr7oERUY4BcD2NCUkuNg==
> > > > > 10.10.10.10:6789:/work/20c5332d-34db-11ea-b27a-de7033e905f6 /tmp/test
> > > > > 
> > > > > I get the error:
> > > > > couldn't finalize options: -34
> > > > > 
> > > > > From some tracking down, it's part of the get_secret_option() in
> > > > > common/secrets.c and the Linux System Error:
> > > > > 
> > > > > #define ERANGE  34  /* Math result not representable */
> > > > > 
> > > > > Now the weird part...when I remove all the libs above, the mount
> > > > > command works. I know that there are ceph.ko modules in the Ubuntu
> > > > > filesystems DIR, and that Ubuntu comes with some understanding of how
> > > > > to mount a cephfs system.  So, that explains how it can mount
> > > > > cephfs...but, what I don't understand is why I'm getting that -34
> > > > > error with the 14.2.5 and 14.2.6 libs installed. I didn't have this
> > > > > issue with 14.2.3 or 14.2.4.
> > > > 
> > > > This sounds like a regression in mount.ceph, probably due to something
> > > > that went in for v14.2.5. I can reproduce the problem on Fedora, and I
> > > > think it has something to do with the very long username you're using.
> > > > 
> > > > I'll take a closer look and let you know. Stay tuned.
> > > > 
> > > 
> > > I think I see the issue. The SECRET_OPTION_BUFSIZE is just too small for
> > > your use case. We need to make that a little larger than the largest
> > > name= parameter can be. Prior to v14.2.5, it was ~1000 bytes, but I made
> > > it smaller in that set thinking that was too large. Mea culpa.
> > > 
> > > The problem is determining how big that size can be. AFAICT EntityName
> > > is basically a std::string in the ceph code, which can be an arbitrary
> > > size (up to 4g or so).
> 
> It's just that you made SECRET_OPTION_BUFSIZE account precisely for
> "secret=", but it can also be "key=".
> 
> I don't think there is much of a problem.  Defining it back to ~1000 is
> guaranteed to work.  Or we could remove it and just compute the size of
> secret_option exactly the same way as get_secret_option() does it:
> 
>   strlen(cmi->cmi_secret) + strlen(cmi->cmi_name) + 7 + 1
> 

Yeah, it's not hard to do a simple fix like that, but I opted to rework
the code to just safe_cat the secret option string(s) directly into the 
options buffer.

That eliminates some extra copies of this info and the need for an
arbitrary limit altogether. It also removes a chunk of code that doesn't
really need to be in the common lib.

See:

https://github.com/ceph/ceph/pull/32706

Aaron, if you have a way to build and test this, it'd be good if you
could confirm that it fixes the problem for you.
-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird mount issue (Ubuntu 18.04, Ceph 14.2.5 & 14.2.6)

2020-01-17 Thread Ilya Dryomov
On Fri, Jan 17, 2020 at 2:21 AM Aaron  wrote:
>
> No worries, can definitely do that.
>
> Cheers
> Aaron
>
> On Thu, Jan 16, 2020 at 8:08 PM Jeff Layton  wrote:
>>
>> On Thu, 2020-01-16 at 18:42 -0500, Jeff Layton wrote:
>> > On Wed, 2020-01-15 at 08:05 -0500, Aaron wrote:
>> > > Seeing a weird mount issue.  Some info:
>> > >
>> > > No LSB modules are available.
>> > > Distributor ID: Ubuntu
>> > > Description: Ubuntu 18.04.3 LTS
>> > > Release: 18.04
>> > > Codename: bionic
>> > >
>> > > Ubuntu 18.04.3 with kernel 4.15.0-74-generic
>> > > Ceph 14.2.5 & 14.2.6
>> > >
>> > > With ceph-common, ceph-base, etc installed:
>> > >
>> > > ceph/stable,now 14.2.6-1bionic amd64 [installed]
>> > > ceph-base/stable,now 14.2.6-1bionic amd64 [installed]
>> > > ceph-common/stable,now 14.2.6-1bionic amd64 [installed,automatic]
>> > > ceph-mds/stable,now 14.2.6-1bionic amd64 [installed]
>> > > ceph-mgr/stable,now 14.2.6-1bionic amd64 [installed,automatic]
>> > > ceph-mgr-dashboard/stable,stable,now 14.2.6-1bionic all [installed]
>> > > ceph-mon/stable,now 14.2.6-1bionic amd64 [installed]
>> > > ceph-osd/stable,now 14.2.6-1bionic amd64 [installed]
>> > > libcephfs2/stable,now 14.2.6-1bionic amd64 [installed,automatic]
>> > > python-ceph-argparse/stable,stable,now 14.2.6-1bionic all 
>> > > [installed,automatic]
>> > > python-cephfs/stable,now 14.2.6-1bionic amd64 [installed,automatic]
>> > >
>> > > I create a user via get-or-create cmd, and I have a users/secret now.
>> > > When I try to mount on these Ubuntu nodes,
>> > >
>> > > The mount cmd I run for testing is:
>> > > sudo mount -t ceph -o
>> > > name=user-20c5338c-34db-11ea-b27a-de7033e905f6,secret=AQC6dhpeyczkDxAAhRcr7oERUY4BcD2NCUkuNg==
>> > > 10.10.10.10:6789:/work/20c5332d-34db-11ea-b27a-de7033e905f6 /tmp/test
>> > >
>> > > I get the error:
>> > > couldn't finalize options: -34
>> > >
>> > > From some tracking down, it's part of the get_secret_option() in
>> > > common/secrets.c and the Linux System Error:
>> > >
>> > > #define ERANGE  34  /* Math result not representable */
>> > >
>> > > Now the weird part...when I remove all the libs above, the mount
>> > > command works. I know that there are ceph.ko modules in the Ubuntu
>> > > filesystems DIR, and that Ubuntu comes with some understanding of how
>> > > to mount a cephfs system.  So, that explains how it can mount
>> > > cephfs...but, what I don't understand is why I'm getting that -34
>> > > error with the 14.2.5 and 14.2.6 libs installed. I didn't have this
>> > > issue with 14.2.3 or 14.2.4.
>> >
>> > This sounds like a regression in mount.ceph, probably due to something
>> > that went in for v14.2.5. I can reproduce the problem on Fedora, and I
>> > think it has something to do with the very long username you're using.
>> >
>> > I'll take a closer look and let you know. Stay tuned.
>> >
>>
>> I think I see the issue. The SECRET_OPTION_BUFSIZE is just too small for
>> your use case. We need to make that a little larger than the largest
>> name= parameter can be. Prior to v14.2.5, it was ~1000 bytes, but I made
>> it smaller in that set thinking that was too large. Mea culpa.
>>
>> The problem is determining how big that size can be. AFAICT EntityName
>> is basically a std::string in the ceph code, which can be an arbitrary
>> size (up to 4g or so).

It's just that you made SECRET_OPTION_BUFSIZE account precisely for
"secret=", but it can also be "key=".

I don't think there is much of a problem.  Defining it back to ~1000 is
guaranteed to work.  Or we could remove it and just compute the size of
secret_option exactly the same way as get_secret_option() does it:

  strlen(cmi->cmi_secret) + strlen(cmi->cmi_name) + 7 + 1

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird mount issue (Ubuntu 18.04, Ceph 14.2.5 & 14.2.6)

2020-01-17 Thread Jeff Layton
Actually, scratch that. I went ahead and opened this:

https://tracker.ceph.com/issues/43649

Feel free to watch that one for updates.

On Fri, 2020-01-17 at 07:43 -0500, Jeff Layton wrote:
> No problem. Can you let me know the tracker bug number once you've
> opened it?
> 
> Thanks,
> Jeff
> 
> On Thu, 2020-01-16 at 20:24 -0500, Aaron wrote:
> > This debugging started because the ceph-provisioner from k8s was making 
> > those users...but what we found was that doing something similar by hand caused 
> > the same issue. Just surprised no one else using k8s and Ceph-backed 
> > PVC/PVs ran into this issue. 
> > 
> > Thanks again for all your help!
> > 
> > Cheers
> > Aaron
> > 
> > On Thu, Jan 16, 2020 at 8:21 PM Aaron  wrote:
> > > No worries, can definitely do that. 
> > > 
> > > Cheers
> > > Aaron
> > > 
> > > On Thu, Jan 16, 2020 at 8:08 PM Jeff Layton  wrote:
> > > > On Thu, 2020-01-16 at 18:42 -0500, Jeff Layton wrote:
> > > > > On Wed, 2020-01-15 at 08:05 -0500, Aaron wrote:
> > > > > > Seeing a weird mount issue.  Some info:
> > > > > > 
> > > > > > No LSB modules are available.
> > > > > > Distributor ID: Ubuntu
> > > > > > Description: Ubuntu 18.04.3 LTS
> > > > > > Release: 18.04
> > > > > > Codename: bionic
> > > > > > 
> > > > > > Ubuntu 18.04.3 with kernel 4.15.0-74-generic
> > > > > > Ceph 14.2.5 & 14.2.6
> > > > > > 
> > > > > > With ceph-common, ceph-base, etc installed:
> > > > > > 
> > > > > > ceph/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > > ceph-base/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > > ceph-common/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > > ceph-mds/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > > ceph-mgr/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > > ceph-mgr-dashboard/stable,stable,now 14.2.6-1bionic all [installed]
> > > > > > ceph-mon/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > > ceph-osd/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > > libcephfs2/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > > python-ceph-argparse/stable,stable,now 14.2.6-1bionic all 
> > > > > > [installed,automatic]
> > > > > > python-cephfs/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > > 
> > > > > > I create a user via get-or-create cmd, and I have a users/secret 
> > > > > > now.
> > > > > > When I try to mount on these Ubuntu nodes,
> > > > > > 
> > > > > > The mount cmd I run for testing is:
> > > > > > sudo mount -t ceph -o
> > > > > > name=user-20c5338c-34db-11ea-b27a-de7033e905f6,secret=AQC6dhpeyczkDxAAhRcr7oERUY4BcD2NCUkuNg==
> > > > > > 10.10.10.10:6789:/work/20c5332d-34db-11ea-b27a-de7033e905f6 
> > > > > > /tmp/test
> > > > > > 
> > > > > > I get the error:
> > > > > > couldn't finalize options: -34
> > > > > > 
> > > > > > From some tracking down, it's part of the get_secret_option() in
> > > > > > common/secrets.c and the Linux System Error:
> > > > > > 
> > > > > > #define ERANGE  34  /* Math result not representable */
> > > > > > 
> > > > > > Now the weird part...when I remove all the libs above, the 
> > > > > > mount
> > > > > > command works. I know that there are ceph.ko modules in the Ubuntu
> > > > > > filesystems DIR, and that Ubuntu comes with some understanding of 
> > > > > > how
> > > > > > to mount a cephfs system.  So, that explains how it can mount
> > > > > > cephfs...but, what I don't understand is why I'm getting that -34
> > > > > > error with the 14.2.5 and 14.2.6 libs installed. I didn't have this
> > > > > > issue with 14.2.3 or 14.2.4.
> > > > > 
> > > > > This sounds like a regression in mount.ceph, probably due to something
> > > > > that went in for v14.2.5. I can reproduce the problem on Fedora, and I
> > > > > think it has something to do with the very long username you're using.
> > > > > 
> > > > > I'll take a closer look and let you know. Stay tuned.
> > > > > 
> > > > 
> > > > I think I see the issue. The SECRET_OPTION_BUFSIZE is just too small for
> > > > your use case. We need to make that a little larger than the largest
> > > > name= parameter can be. Prior to v14.2.5, it was ~1000 bytes, but I made
> > > > it smaller in that set thinking that was too large. Mea culpa.
> > > > 
> > > > The problem is determining how big that size can be. AFAICT EntityName
> > > > is basically a std::string in the ceph code, which can be an arbitrary
> > > > size (up to 4g or so).
> > > > 
> > > > Aaron, would you mind opening a bug for this at tracker.ceph.com? We
> > > > should be able to get it fixed up, once I do a bit more research to
> > > > figure out how big to make this buffer.

-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS randomly hangs with no useful error message

2020-01-17 Thread Janek Bevendorff
Thanks. I will do that. Right now, we have quite a few lags when listing 
folders, which is probably due to another client heavily using the 
system. Unfortunately, it's rather hard to debug at the moment, since 
the suspected client has to use our Ganesha bridge instead of connecting 
to the Ceph directly. The FS is overall operable and generally usable, 
but the increased latency is quite annoying, still. I wonder if that can 
be mitigated with optimised cluster settings.


I will send you a GDB trace once we encounter another potential MDS loop.


On 17/01/2020 13:07, Yan, Zheng wrote:

On Fri, Jan 17, 2020 at 4:47 PM Janek Bevendorff
 wrote:

Hi,

We have a CephFS in our cluster with 3 MDS to which > 300 clients
connect at any given time. The FS contains about 80 TB of data and many
million files, so it is important that meta data operations work
smoothly even when listing large directories.

Previously, we had massive stability problems causing the MDS nodes to
crash or time out regularly as a result of failing to recall caps fast
enough and weren't able to rejoin afterwards without resetting the
mds*_openfiles objects (see
https://lists.ceph.io/hyperkitty/list/ceph-us...@ceph.io/thread/AOYWQSONTFROPB4DXVYADWW7V25C3G6Z/
for details).

We have managed to adjust our configuration to avoid this problem. This
comes down mostly to adjusting the recall decay rate (which still isn't
documented), massively reducing any scrubbing activities, allowing for
no more than 10G for mds_cache_memory_limit (the default of 1G is way
too low, but more than 10G seems to cause trouble during replay),
increasing osd_map_message_max to 100, and osd_map_cache_size to 150. We
haven't seen crashes since. But what we do see is that one of the MDS
nodes will randomly lock up and the ceph_mds_reply_latency metric goes
up and then stays at a higher level than any other MDS. The result is
not that the FS is completely down, but everything lags massively to the
point where it's not usable.

Unfortunately, all the hung MDS is reporting is:

 -77> 2020-01-17 09:29:17.891 7f34c967b700  0 mds.beacon.XXX Skipping
beacon heartbeat to monitors (last acked 320.587s ago); MDS internal
heartbeat is not healthy!
 -76> 2020-01-17 09:29:18.391 7f34c967b700  1 heartbeat_map
is_healthy 'MDSRank' had timed out after 15

and ceph fs status reports only single-digit ops/s for all three MDSs
(mostly flat 0). I ran ceph mds fail 1 to fail the MDS and force a
standby to take over, which went without problems. Almost immediately
after, all three now-active MDSs started reporting > 900 ops/s and the
FS started working properly again. For some strange reason, the failed
MDS didn't restart, though. It kept reporting the log message above
until I manually restarted the daemon process.


Looks like the MDS entered some long (or infinite) loop. If this happens
again, could you use gdb to attach to it and run the command 'thread apply
all bt' inside gdb?


Is anybody else experiencing such issues or are there any configuration
parameters that I can tweak to avoid this behaviour?

Thanks
Janek

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS randomly hangs with no useful error message

2020-01-17 Thread Yan, Zheng
On Fri, Jan 17, 2020 at 4:47 PM Janek Bevendorff
 wrote:
>
> Hi,
>
> We have a CephFS in our cluster with 3 MDS to which > 300 clients
> connect at any given time. The FS contains about 80 TB of data and many
> million files, so it is important that meta data operations work
> smoothly even when listing large directories.
>
> Previously, we had massive stability problems causing the MDS nodes to
> crash or time out regularly as a result of failing to recall caps fast
> enough and weren't able to rejoin afterwards without resetting the
> mds*_openfiles objects (see
> https://lists.ceph.io/hyperkitty/list/ceph-us...@ceph.io/thread/AOYWQSONTFROPB4DXVYADWW7V25C3G6Z/
> for details).
>
> We have managed to adjust our configuration to avoid this problem. This
> comes down mostly to adjusting the recall decay rate (which still isn't
> documented), massively reducing any scrubbing activities, allowing for
> no more than 10G for mds_cache_memory_limit (the default of 1G is way
> too low, but more than 10G seems to cause trouble during replay),
> increasing osd_map_message_max to 100, and osd_map_cache_size to 150. We
> haven't seen crashes since. But what we do see is that one of the MDS
> nodes will randomly lock up and the ceph_mds_reply_latency metric goes
> up and then stays at a higher level than any other MDS. The result is
> not that the FS is completely down, but everything lags massively to the
> point where it's not usable.
>
> Unfortunately, all the hung MDS is reporting is:
>
> -77> 2020-01-17 09:29:17.891 7f34c967b700  0 mds.beacon.XXX Skipping
> beacon heartbeat to monitors (last acked 320.587s ago); MDS internal
> heartbeat is not healthy!
> -76> 2020-01-17 09:29:18.391 7f34c967b700  1 heartbeat_map
> is_healthy 'MDSRank' had timed out after 15
>
> and ceph fs status reports only single-digit ops/s for all three MDSs
> (mostly flat 0). I ran ceph mds fail 1 to fail the MDS and force a
> standby to take over, which went without problems. Almost immediately
> after, all three now-active MDSs started reporting > 900 ops/s and the
> FS started working properly again. For some strange reason, the failed
> MDS didn't restart, though. It kept reporting the log message above
> until I manually restarted the daemon process.
>

Looks like the MDS entered some long (or infinite) loop. If this happens
again, could you use gdb to attach to it and run the command 'thread apply
all bt' inside gdb?
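
(Something like the following should capture those backtraces; the output file
name is just an example and the ceph-mds debug symbols need to be installed for
the traces to be readable:)

gdb --batch -p $(pidof ceph-mds) -ex 'thread apply all bt' > mds-backtraces.txt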

> Is anybody else experiencing such issues or are there any configuration
> parameters that I can tweak to avoid this behaviour?
>
> Thanks
> Janek
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous Bluestore OSDs crashing with ASSERT

2020-01-17 Thread Stefan Priebe - Profihost AG
HI Igor,

Am 17.01.20 um 12:10 schrieb Igor Fedotov:
> hmmm..
> 
> Just in case - suggest to check H/W errors with dmesg.

this happens on around 80 nodes - I don't expect all of those to have
unidentified HW errors. Also, all of them are monitored - no dmesg output
contains any errors.

> Also there are some (not very much though) chances this is another
> incarnation of the following bug:
> https://tracker.ceph.com/issues/22464
> https://github.com/ceph/ceph/pull/24649
> 
> The corresponding PR works around it for main device reads (user data
> only!) but theoretically it might still happen
> 
> either for DB device or DB data at main device.
>
> Can you observe any bluefs spillovers? Is there any correlation between
> failing OSDs and spillover presence, e.g. failing OSDs always
> have a spillover, while OSDs without spillovers never face the issue...
>
> To validate this hypothesis one can try to monitor/check (e.g. once a
> day for a week or something) "bluestore_reads_with_retries" counter over
> OSDs to learn if the issue is happening
> 
> in the system.  Non-zero values mean it's there for user data/main
> device and hence is likely to happen for DB ones as well (which doesn't
> have any workaround yet).

OK, I checked bluestore_reads_with_retries on 360 OSDs but all of them say 0.


> Additionally you might want to monitor memory usage as the above
> mentioned PR denotes high memory pressure as potential trigger for these
> read errors. So if such pressure happens the hypothesis becomes more valid.

We already do this heavily and have around 10GB of memory per OSD. Also,
none of those machines show any IO pressure at all.

All hosts show a constant rate of around 38GB to 45GB mem available in
/proc/meminfo.

Stefan

> Thanks,
> 
> Igor
> 
> PS. Everything above is rather speculation for now. Available
> information is definitely not enough for extensive troubleshooting of
> cases which happen this rarely.
> 
> You might want to start collecting failure-related information
> (including but not limited to failure logs, perf counter dumps, system
> resource reports etc) for future analysis.
> 
> 
> 
> On 1/16/2020 11:58 PM, Stefan Priebe - Profihost AG wrote:
>> Hi Igor,
>>
>> answers inline.
>>
>> Am 16.01.20 um 21:34 schrieb Igor Fedotov:
>>> you may want to run fsck against failing OSDs. Hopefully it will shed
>>> some light.
>> fsck just says everything fine:
>>
>> # ceph-bluestore-tool --command fsck --path /var/lib/ceph/osd/ceph-27/
>> fsck success
>>
>>
>>> Also wondering if OSD is able to recover (startup and proceed working)
>>> after facing the issue?
>> no recover needed. It just runs forever after restarting.
>>
>>> If so do you have any one which failed multiple times? Do you have logs
>>> for these occurrences?
>> Maybe, but there are most probably weeks or months between those failures
>> - most probably the logs are already deleted.
>>
>>> Also please note that patch you mentioned doesn't fix previous issues
>>> (i.e. duplicate allocations), it prevents from new ones only.
>>>
>>> But fsck should show them if any...
>> None showed.
>>
>> Stefan
>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>>
>>> On 1/16/2020 10:04 PM, Stefan Priebe - Profihost AG wrote:
 Hi Igor,

 ouch sorry. Here we go:

  -1> 2020-01-16 01:10:13.404090 7f3350a14700 -1 rocksdb:
 submit_transaction error: Corruption: block checksum mismatch code = 2
 Rocksdb transaction:
 Put( Prefix = M key =
 0x0402'.OBJ_0002.953BFD0A.bb85c.rbd%udata%e3e8eac6b8b4567%e1f2e..'


 Value size = 97)
 Put( Prefix = M key =
 0x0402'.MAP_000BB85C_0002.953BFD0A.bb85c.rbd%udata%e3e8eac6b8b4567%e1f2e..'


 Value size = 93)
 Put( Prefix = M key =
 0x0916'.823257.73922044' Value size = 196)
 Put( Prefix = M key =
 0x0916'.823257.73922045' Value size = 184)
 Put( Prefix = M key = 0x0916'._info' Value size = 899)
 Put( Prefix = O key =
 0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f'x'


 Value size = 418)
 Put( Prefix = O key =
 0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f0003'x'


 Value size = 474)
 Put( Prefix = O key =
 0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f0007c000'x'


 Value size = 392)
 Put( Prefix = O key =
 0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f0009'x'


 Value size = 317)
 Put( Prefix = O key =
 0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f000a'x'


Re: [ceph-users] Luminous Bluestore OSDs crashing with ASSERT

2020-01-17 Thread Igor Fedotov

hmmm..

Just in case - suggest to check H/W errors with dmesg.

Also there are some (not very much though) chances this is another 
incarnation of the following bug:


https://tracker.ceph.com/issues/22464

https://github.com/ceph/ceph/pull/24649

The corresponding PR works around it for main device reads (user data 
only!) but theoretically it might still happen


either for DB device or DB data at main device.

Can you observe any bluefs spillovers? Is there any correlation between 
failing OSDs and spillover presence, e.g. failing OSDs always 
have a spillover, while OSDs without spillovers never face the issue...


To validate this hypothesis one can try to monitor/check (e.g. once a 
day for a week or something) "bluestore_reads_with_retries" counter over 
OSDs to learn if the issue is happening


in the system.  Non-zero values mean it's there for user data/main 
device and hence is likely to happen for DB ones as well (which doesn't 
have any workaround yet).
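
(E.g. something along these lines on each OSD node; the admin socket path is
the default one and the exact jq path may differ slightly between releases:)

for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo -n "$sock: "
    ceph daemon $sock perf dump | jq '.bluestore.bluestore_reads_with_retries'
done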


Additionally you might want to monitor memory usage as the above 
mentioned PR denotes high memory pressure as potential trigger for these 
read errors. So if such pressure happens the hypothesis becomes more valid.



Thanks,

Igor

PS. Everything above is rather speculation for now. Available 
information is definitely not enough for extensive troubleshooting of 
cases which happen this rarely.


You might want to start collecting failure-related information 
(including but not limited to failure logs, perf counter dumps, system 
resource reports etc) for future analysis.




On 1/16/2020 11:58 PM, Stefan Priebe - Profihost AG wrote:

Hi Igor,

answers inline.

Am 16.01.20 um 21:34 schrieb Igor Fedotov:

you may want to run fsck against failing OSDs. Hopefully it will shed
some light.

fsck just says everything fine:

# ceph-bluestore-tool --command fsck --path /var/lib/ceph/osd/ceph-27/
fsck success



Also wondering if OSD is able to recover (startup and proceed working)
after facing the issue?

no recover needed. It just runs forever after restarting.


If so do you have any one which failed multiple times? Do you have logs
for these occurrences?

Maybe, but there are most probably weeks or months between those failures
- most probably the logs are already deleted.


Also please note that patch you mentioned doesn't fix previous issues
(i.e. duplicate allocations), it prevents from new ones only.

But fsck should show them if any...

None showed.

Stefan


Thanks,

Igor



On 1/16/2020 10:04 PM, Stefan Priebe - Profihost AG wrote:

Hi Igor,

ouch sorry. Here we go:

     -1> 2020-01-16 01:10:13.404090 7f3350a14700 -1 rocksdb:
submit_transaction error: Corruption: block checksum mismatch code = 2
Rocksdb transaction:
Put( Prefix = M key =
0x0402'.OBJ_0002.953BFD0A.bb85c.rbd%udata%e3e8eac6b8b4567%e1f2e..'

Value size = 97)
Put( Prefix = M key =
0x0402'.MAP_000BB85C_0002.953BFD0A.bb85c.rbd%udata%e3e8eac6b8b4567%e1f2e..'

Value size = 93)
Put( Prefix = M key =
0x0916'.823257.73922044' Value size = 196)
Put( Prefix = M key =
0x0916'.823257.73922045' Value size = 184)
Put( Prefix = M key = 0x0916'._info' Value size = 899)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f'x'

Value size = 418)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f0003'x'

Value size = 474)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f0007c000'x'

Value size = 392)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f0009'x'

Value size = 317)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f000a'x'

Value size = 521)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f000f4000'x'

Value size = 558)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f0013'x'

Value size = 649)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f00194000'x'

Value size = 449)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f001cc000'x'

Value size = 580)
Put( Prefix = O key =
0x7f80029acdfb05217262'd_data.3e8eac6b8b4567.1f2e!='0x000bb85c6f0020'x'

Value size = 435)
Put( Prefix = O key =
0x7f80029acdfb052

Re: [ceph-users] Beginner questions

2020-01-17 Thread Frank Schilder
I would strongly advise against 2+1 EC pools for production if stability is 
your main concern. There was a discussion towards the end of last year 
addressing this in more detail. Short story, if you don't have at least 8-10 
nodes (in the short run), EC is not suitable. You cannot maintain a cluster 
with such EC-pools.

Reasoning: k+1 is a no-go in production. You can set min_size to k, but 
whenever a node is down (maintenance or whatever), new writes are 
non-redundant. Losing just one more disk means data loss. This is not a 
problem with replication x3 and min_size=2. Be aware that maintenance more 
often than not takes more than a day. Parts may need to be shipped. An upgrade 
goes wrong and requires lengthy support for fixing. Etc.

In addition, admins make mistakes. You need to build your cluster such that it 
can survive mistakes (shut down wrong host, etc.) in degraded state. Redundancy 
m=1 means zero tolerance for errors. Often the recommendation therefore is m=3, 
while m=2 is the bare minimum. Note that EC 1+2 is equal in redundancy to 
replication x3, but will use more compute (hence, it's useless). In your 
situation, I would start with replicated pools and move to EC once enough nodes 
are at hand.
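
(For a 3-node start, a replicated pool along those lines would look something
like this; pool name and PG count are placeholders:)

ceph osd pool create rbd_pool 128 128 replicated
ceph osd pool set rbd_pool size 3
ceph osd pool set rbd_pool min_size 2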

If you want to use the benefits of EC, you need to build large clusters. 
Starting with 3 nodes and failure domain disk will be a horrible experience. 
You will not be able to maintain, upgrade or fix anything without downtime.

Plan for sleeping well in worst-case situations.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Bastiaan 
Visser 
Sent: 17 January 2020 06:55:25
To: Dave Hall
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Beginner questions

There is no difference in allocation between replication or EC. If failure 
domain is host, one OSD per host is used for a PG. So if you use a 2+1 EC 
profile with a host failure domain, you need 3 hosts for a healthy cluster. The 
pool will go read-only when you have a failure (host or disk), or are doing 
maintenance on a node (reboot). On a node failure there will be no rebuilding, 
since there is no place to find a 3rd osd for a pg, so you'll have to 
fix/replace the node before any writes will be accepted.

So yes, you can do a 2+1 EC pool on 3 nodes, you are paying the price in 
reliability, flexibility and maybe performance. Only way to really know the 
latter is benchmarking with your setup.

I think you will be fine on the hardware side. Memory recommendations swing 
around between 512 MB and 1 GB per TB of storage. I usually go with 1 GB. But I never 
use disks larger than 4 TB. On the CPU I always try to have a few more cores 
than I have OSDs in a machine. So 16 is fine in your case.


On Fri, Jan 17, 2020, 03:29 Dave Hall 
<kdh...@binghamton.edu> wrote:

Bastiaan,

Regarding EC pools:   Our concern at 3 nodes is that 2-way replication seems 
risky - if the two copies don't match, which one is corrupted?  However, 3-way 
replication on a 3 node cluster triples the price per TB.   Doing EC pools that 
are the equivalent of RAID-5 2+1 seems like the right place to start as far as 
maximizing capacity is concerned, although I do understand the potential time 
involved in rebuilding a 12 TB drive.  Early on I'd be more concerned about a 
drive failure than about a node failure.

Regarding the hardware, our nodes are single socket EPYC 7302 (16 core, 32 
thread) with 128GB RAM.  From what I recall reading I think the RAM, at least, 
is a bit higher than recommended.

Question:  Does a PG (EC or replicated) span multiple drives per node?  I 
haven't got to the point of understanding this part yet, so pardon the totally 
naive question.  I'll probably be conversant on this by Monday.

-Dave

Dave Hall
Binghamton University
kdh...@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)




On 1/16/2020 4:27 PM, Bastiaan Visser wrote:
Dave made a good point: WAL + DB might end up a little over 60G, so I would 
probably go with ~70GB partitions/LVs per OSD in your case (if the NVMe 
drive is smart enough to spread the writes over all available capacity; most 
recent NVMes are). I have not yet seen a WAL larger than or even close to a 
gigabyte.

We don't even think about erasure-coded pools on clusters with less than 6 nodes 
(spindles; full SSD is another story).
EC pools need more processing resources. We usually settle with 1 GB per TB of 
storage on replicated-only clusters, but when EC pools are involved, we add at 
least 50% to that. Also make sure your processors are up for it.

Do not base your calculations on a healthy cluster -> build to fail.
How long are you willing to be in a degraded state on node failure? Especially 
when using many large spindles, recovery time might be way longer than you 
think. 12 * 12TB is 144TB storage; on a 4+2 EC pool you might end up with over 
200 TB of 

[ceph-users] Ceph MDS randomly hangs with no useful error message

2020-01-17 Thread Janek Bevendorff

Hi,

We have a CephFS in our cluster with 3 MDS to which > 300 clients 
connect at any given time. The FS contains about 80 TB of data and many 
million files, so it is important that meta data operations work 
smoothly even when listing large directories.


Previously, we had massive stability problems causing the MDS nodes to 
crash or time out regularly as a result of failing to recall caps fast 
enough and weren't able to rejoin afterwards without resetting the 
mds*_openfiles objects (see 
https://lists.ceph.io/hyperkitty/list/ceph-us...@ceph.io/thread/AOYWQSONTFROPB4DXVYADWW7V25C3G6Z/ 
for details).


We have managed to adjust our configuration to avoid this problem. This 
comes down mostly to adjusting the recall decay rate (which still isn't 
documented), massively reducing any scrubbing activities, allowing for 
no more than 10G for mds_cache_memory_limit (the default of 1G is way 
too low, but more than 10G seems to cause trouble during replay), 
increasing osd_map_message_max to 100, and osd_map_cache_size to 150. We 
haven't seen crashes since. But what we do see is that one of the MDS 
nodes will randomly lock up and the ceph_mds_reply_latency metric goes 
up and then stays at a higher level than any other MDS. The result is 
not that the FS is completely down, but everything lags massively to the 
point where it's not usable.
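
(For reference, the settings mentioned above as they would be applied with
'ceph config set'; the exact name of the recall decay rate option
(mds_recall_max_decay_rate) is an assumption, and the value shown for it is
only an example:)

ceph config set mds mds_cache_memory_limit 10737418240    # 10G
ceph config set mds mds_recall_max_decay_rate 1.5          # example value only
ceph config set global osd_map_message_max 100
ceph config set osd osd_map_cache_size 150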


Unfortunately, all the hung MDS is reporting is:

   -77> 2020-01-17 09:29:17.891 7f34c967b700  0 mds.beacon.XXX Skipping 
beacon heartbeat to monitors (last acked 320.587s ago); MDS internal 
heartbeat is not healthy!
   -76> 2020-01-17 09:29:18.391 7f34c967b700  1 heartbeat_map 
is_healthy 'MDSRank' had timed out after 15


and ceph fs status reports only single-digit ops/s for all three MDSs 
(mostly flat 0). I ran ceph mds fail 1 to fail the MDS and force a 
standby to take over, which went without problems. Almost immediately 
after, all three now-active MDSs started reporting > 900 ops/s and the 
FS started working properly again. For some strange reason, the failed 
MDS didn't restart, though. It kept reporting the log message above 
until I manually restarted the daemon process.


Is anybody else experiencing such issues or are there any configuration 
parameters that I can tweak to avoid this behaviour?


Thanks
Janek

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com