[ceph-users] Mimic - cephfs scrub errors

2019-11-15 Thread Andras Pataki

Dear cephers,

We've had a few (dozen or so) rather odd scrub errors in our Mimic 
(13.2.6) cephfs:


2019-11-15 07:52:52.614 7fffcc41f700  0 log_channel(cluster) log [DBG] : 
2.b5b scrub starts
2019-11-15 07:52:55.190 7fffcc41f700 -1 log_channel(cluster) log [ERR] : 
2.b5b shard 599 soid 2:dad01506:::100314224ad.0160:head : candidate 
size 4158512 info size 0 mismatch
2019-11-15 07:52:55.190 7fffcc41f700 -1 log_channel(cluster) log [ERR] : 
2.b5b shard 2768 soid 2:dad01506:::100314224ad.0160:head : candidate 
size 4158512 info size 0 mismatch
2019-11-15 07:52:55.190 7fffcc41f700 -1 log_channel(cluster) log [ERR] : 
2.b5b shard 3512 soid 2:dad01506:::100314224ad.0160:head : candidate 
size 4158512 info size 0 mismatch
2019-11-15 07:52:55.190 7fffcc41f700 -1 log_channel(cluster) log [ERR] : 
2.b5b soid 2:dad01506:::100314224ad.0160:head : failed to pick 
suitable object info
2019-11-15 07:52:55.198 7fffcc41f700 -1 log_channel(cluster) log [ERR] : 
scrub 2.b5b 2:dad01506:::100314224ad.0160:head : on disk size 
(4158512) does not match object info size (0) adjusted for ondisk to (0)
2019-11-15 07:53:55.441 7fffcc41f700 -1 log_channel(cluster) log [ERR] : 
2.b5b scrub 4 errors


Finding the file - it turns out to be a very small file:

1100338046125 -rw-r- 1 schen schen 41237 Nov 14 17:18 
/mnt/ceph/users/schen/main/Jellium/3u3d_3D/fort.4321169


We use 4MB stripe size - and it looks like the scrub complains about 
object 0x160, which is way beyond the end of the file (since the file 
should fit in just one object).  Retrieving the object gets an empty one 
- and it looks like all the objects between object 1 and 0x160 also 
exist as empty objects (and object 0 contains the whole, correct file 
contents).  Any ideas why so many empty objects get created beyond the 
end of the file?  Would this be the result of the file being 
overwritten/truncated?  Just for my understanding - if it is truncation, 
is that done by the client, or the MDS?
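
For anyone reproducing the check, a hedged sketch of inspecting the backing objects directly with rados (the data pool name is an assumption, and the log above truncates the object names - the full suffix is normally 8 hex digits):

# list all backing objects of inode 0x100314224ad (assumes the data pool is 'cephfs_data')
rados -p cephfs_data ls | grep '^100314224ad\.'

# stat the object the scrub complains about (full suffix assumed to be 00000160)
rados -p cephfs_data stat 100314224ad.00000160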


Any ideas how the inconsistencies could have come about?  Possibly 
something failed during the file truncation?


Thanks,

Andras



Re: [ceph-users] clust recovery stuck

2019-10-22 Thread Andras Pataki

Hi Philipp,

Given 256 PGs triple-replicated onto 4 OSDs, you might be running into 
the "PG overdose protection" of the OSDs.  Take a look at 'ceph osd df' 
and check the number of PGs mapped to each OSD (the last column or near 
it).  The default limit is 200, so if any OSD exceeds that, it would 
explain the freeze, since the OSD will simply ignore the excess PGs.  
In that case, try increasing mon_max_pg_per_osd to, say, 400 and see if 
that helps.  This would allow the recovery to proceed, but you should 
consider adding OSDs (or at least increasing the memory allocated to 
the OSDs above the defaults).
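
As a hedged illustration (mon_max_pg_per_osd exists in Luminous and later; adjust the mechanics to your release):

# the PGS column (the last one) shows how many PGs each OSD carries
ceph osd df

# then raise the limit in ceph.conf on the mon nodes and restart them, e.g.:
[global]
mon_max_pg_per_osd = 400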


Andras

On 10/22/19 3:02 PM, Philipp Schwaha wrote:

hi,

On 2019-10-22 08:05, Eugen Block wrote:

Hi,

can you share `ceph osd tree`? What crush rules are in use in your
cluster? I assume that the two failed OSDs prevent the remapping because
the rules can't be applied.


ceph osd tree gives:

ID WEIGHT   TYPE NAME             UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 27.94199 root default
-2  9.31400     host alpha.local
 0  4.65700         osd.0            down        0              1.0
 3  4.65700         osd.3              up      1.0              1.0
-3  9.31400     host beta.local
 1  4.65700         osd.1              up      1.0              1.0
 6  4.65700         osd.6            down        0              1.0
-4  9.31400     host gamma.local
 2  4.65700         osd.2              up      1.0              1.0
 4  4.65700         osd.4              up      1.0              1.0


the crush rules should be fairly simple, nothing particularly customized
as far as I can tell:
'ceph osd crush tree' gives:
[
 {
 "id": -1,
 "name": "default",
 "type": "root",
 "type_id": 10,
 "items": [
 {
 "id": -2,
 "name": "alpha.local",
 "type": "host",
 "type_id": 1,
 "items": [
 {
 "id": 0,
 "name": "osd.0",
 "type": "osd",
 "type_id": 0,
 "crush_weight": 4.656998,
 "depth": 2
 },
 {
 "id": 3,
 "name": "osd.3",
 "type": "osd",
 "type_id": 0,
 "crush_weight": 4.656998,
 "depth": 2
 }
 ]
 },
 {
 "id": -3,
 "name": "beta.local",
 "type": "host",
 "type_id": 1,
 "items": [
 {
 "id": 1,
 "name": "osd.1",
 "type": "osd",
 "type_id": 0,
 "crush_weight": 4.656998,
 "depth": 2
 },
 {
 "id": 6,
 "name": "osd.6",
 "type": "osd",
 "type_id": 0,
 "crush_weight": 4.656998,
 "depth": 2
 }
 ]
 },
 {
 "id": -4,
 "name": "gamma.local",
 "type": "host",
 "type_id": 1,
 "items": [
 {
 "id": 2,
 "name": "osd.2",
 "type": "osd",
 "type_id": 0,
 "crush_weight": 4.656998,
 "depth": 2
 },
 {
 "id": 4,
 "name": "osd.4",
 "type": "osd",
 "type_id": 0,
 "crush_weight": 4.656998,
 "depth": 2
 }
 ]
 }
 ]
 }
]

and 'ceph osd crush rule dump' gives:
[
 {
 "rule_id": 0,
 "rule_name": "replicated_ruleset",
 "ruleset": 0,
 "type": 1,
 "min_size": 1,
 "max_size": 10,
 "steps": [
 {
 "op": "take",
 "item": -1,
 "item_name": "default"
 },
 {
 "op": "chooseleaf_firstn",
 "num": 0,
 "type": "host"
 },
 {
 "op": "emit"
 }
 ]
 }
]

the cluster actually reached health ok after osd.0 went down, but when
osd.6 went down it did not recover. the cluster is running ceph version
10.2.2.

any help is greatly appreciated!

thanks & cheers
Philipp


Quoting Philipp Schwaha:


hi,

I have a problem with a cluster 

[ceph-users] Nautilus - inconsistent PGs - stat mismatch

2019-10-21 Thread Andras Pataki

We have a new ceph Nautilus setup (Nautilus from scratch - not upgraded):

# ceph versions
{
    "mon": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 3
    },
    "osd": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 169
    },
    "mds": {},
    "overall": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 175
    }
}

We only have CephFS on it with the two required triple replicated 
pools.  After creating the pools, we've increased the PG numbers on 
them, but did not turn autoscaling on:


# ceph osd pool ls detail
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 1 
object_hash rjenkins pg_num 8192 pgp_num 8192 autoscale_mode warn 
last_change 2017 lfor 0/0/886 flags hashpspool stripe_width 0 
application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 2 
object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode warn 
last_change 1995 flags hashpspool stripe_width 0 pg_autoscale_bias 4 
pg_num_min 16 recovery_priority 5 application cephfs
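
(For reference, increasing the PG counts as described above would have been something along these lines - a sketch using the pool names from the listing:)

ceph osd pool set cephfs_data pg_num 8192
ceph osd pool set cephfs_data pgp_num 8192
ceph osd pool set cephfs_metadata pg_num 1024
ceph osd pool set cephfs_metadata pgp_num 1024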


After a few days of running, we started seeing inconsistent placement 
groups:


# ceph pg dump | grep incons
dumped all
1.bb4    21  0  0  0  0  67108864  0  0  3059  3059  active+clean+inconsistent  2019-10-20 11:13:43.270022  5830'5655  5831:18901  [346,426,373]  346  [346,426,373]  346  5830'5655  2019-10-20 11:13:43.269992  1763'5424  2019-10-18 08:14:37.582180  0
1.795    29  0  0  0  0  96468992  0  0  3081  3081  active+clean+inconsistent  2019-10-20 17:06:45.876483  5830'5472  5831:17921  [468,384,403]  468  [468,384,403]  468  5830'5472  2019-10-20 17:06:45.876455  1763'5235  2019-10-18 08:16:07.166754  0
1.1fa    18  0  0  0  0  33554432  0  0  3065  3065  active+clean+inconsistent  2019-10-20 15:35:29.755622  5830'5268  5831:17139  [337,401,455]  337  [337,401,455]  337  5830'5268  2019-10-20 15:35:29.755588  1763'5084  2019-10-18 08:17:17.962888  0
1.579    26  0  0  0  0  75497472  0  0  3068  3068  active+clean+inconsistent  2019-10-20 21:45:42.914200  5830'5218  5831:15405  [477,364,332]  477  [477,364,332]  477  5830'5218  2019-10-20 21:45:42.914173  5830'5218  2019-10-19 12:13:53.259686  0
1.11c5   21  0  0  0  0  71303168  0  0  3010  3010  active+clean+inconsistent  2019-10-20 23:31:36.183053  5831'5183  5831:16214  [458,370,416]  458  [458,370,416]  458  5831'5183  2019-10-20 23:31:36.183030  5831'5183  2019-10-19 16:35:17.195721  0
1.128d   17  0  0  0  0  46137344  0  0  3073  3073  active+clean+inconsistent  2019-10-20 19:14:55.459236  5830'5368  5831:17584  [441,422,377]  441  [441,422,377]  441  5830'5368  2019-10-20 19:14:55.459209  1763'5110  2019-10-18 08:12:51.062548  0
1.19ef   16  0  0  0  0  41943040  0  0  3076  3076  active+clean+inconsistent  2019-10-20 23:33:02.020050  5830'5502  5831:18244  [323,431,439]  323  [323,431,439]  323  5830'5502  2019-10-20 23:33:02.020025  1763'5220  2019-10-18 08:12:51.117020  0


The logs look like this (the 1.bb4 PG for example):

2019-10-20 11:13:43.261 7fffd3633700  0 log_channel(cluster) log [DBG] : 
1.bb4 scrub starts
2019-10-20 11:13:43.265 7fffd3633700 -1 log_channel(cluster) log [ERR] : 
1.bb4 scrub : stat mismatch, got 21/21 objects, 0/0 clones, 21/21 dirty, 
0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 
88080384/67108864 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2019-10-20 11:13:43.265 7fffd3633700 -1 log_channel(cluster) log [ERR] : 
1.bb4 scrub 1 errors


It looks like doing a pg repair fixes the issue:

2019-10-21 09:17:50.125 7fffd3633700  0 log_channel(cluster) log [DBG] : 
1.bb4 repair starts
2019-10-21 09:17:50.653 7fffd3633700 -1 log_channel(cluster) log [ERR] : 
1.bb4 repair : stat mismatch, got 21/21 objects, 0/0 clones, 21/21 
dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 
88080384/67108864 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2019-10-21 09:17:50.653 7fffd3633700 -1 log_channel(cluster) log [ERR] : 
1.bb4 repair 1 errors, 1 fixed
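
As a hedged aside on the mechanics (commands as in current Nautilus; the PG id is taken from the output above):

# list PGs with scrub inconsistencies in a pool
rados list-inconsistent-pg cephfs_data

# details of what the scrub recorded (may be empty for a pure stat mismatch)
rados list-inconsistent-obj 1.bb4 --format=json-pretty

# trigger the repair
ceph pg repair 1.bb4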


Is this a known issue with Nautilus?  We have other Luminous/Mimic 
clusters, where I haven't seen this come up.


Thanks,

Andras


[ceph-users] ceph-fuse segfaults in 14.2.2

2019-09-04 Thread Andras Pataki

Dear ceph users,

After upgrading our ceph-fuse clients to 14.2.2, we've been seeing 
sporadic segfaults with not super revealing stack traces:


   in thread 7fff5a7fc700 thread_name:ceph-fuse

 ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
   nautilus (stable)
 1: (()+0xf5d0) [0x760b85d0]
 2: (()+0x255a0c) [0x557a9a0c]
 3: (()+0x16b6b) [0x77bb3b6b]
 4: (()+0x13401) [0x77bb0401]
 5: (()+0x7dd5) [0x760b0dd5]
 6: (clone()+0x6d) [0x74b5cead]
 NOTE: a copy of the executable, or `objdump -rdS ` is
   needed to interpret this.


Prior to 14.2.2, we ran 12.2.11 and 13.2.5 and did not see this 
issue.  Has anyone encountered this?  If it isn't a known problem, I can 
file a tracker issue for it.


Andras



[ceph-users] Poor cephfs (ceph_fuse) write performance in Mimic

2019-04-04 Thread Andras Pataki

Hi cephers,

I'm working through our testing cycle to upgrade our main ceph cluster 
from Luminous to Mimic, and I ran into a problem with ceph-fuse.  With 
the Luminous ceph-fuse, a single client can pretty much max out a 10Gbps 
network connection writing sequentially on our cluster.  With the Mimic 
ceph-fuse, I could only get about 150MB/sec through a single connection.  
After a bit of debugging and instrumenting the ceph-fuse code, I saw 
that the Luminous ceph-fuse gets 128kB writes (about 8000 calls/sec, 
leading to about 1000MB/sec write throughput), while the Mimic client 
sees 4kB writes (about 38k calls/sec, around 150MB/sec throughput).  
There seems to be an undocumented config option, "fuse_big_writes", 
which controls the "big_writes" fuse mount option.  In Luminous it 
defaults to true, while in Mimic it is false.  Once I set this option 
in ceph.conf, the performance of the Mimic ceph-fuse returned to the 
Luminous level.
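
A minimal ceph.conf sketch of the workaround described above (the section placement is an assumption; the option is read by the client):

[client]
fuse_big_writes = true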


It would perhaps be helpful to say something about this option in the 
docs.  The git commit seems to be 
f37f2ea10c8298fc08777c74c8264b4b0cb6; the pull request mentions that 
the reason for removing this option was its removal in libfuse 3.0 
(where it defaults to true).  Unfortunately CentOS packages fuse 2.9.2, 
so an upgrade to Mimic on CentOS with ceph-fuse (without knowing about 
this undocumented option) leads to 4kB writes in the fuse layer with 
very poor performance ...


Andras



[ceph-users] Mimic and cephfs

2019-02-25 Thread Andras Pataki

Hi ceph users,

As I understand, cephfs in Mimic had significant issues up to and 
including version 13.2.2.  With some critical patches in Mimic 13.2.4, 
is cephfs now production quality in Mimic?  Are there folks out there 
using it in a production setting?  If so, could you share your 
experience with it (as compared to Luminous)?


Thanks,

Andras



Re: [ceph-users] cephfs kernel client instability

2019-01-24 Thread Andras Pataki

Hi Ilya,

Thanks for the clarification - very helpful.
I've lowered osd_map_message_max to 10, and this resolves the issue 
of the kernel being unhappy about large messages when the OSDMap 
changes.  One comment here though: you mentioned that Luminous uses 40 
as the default, which is indeed the case.  The documentation for 
Luminous (and master), however, says that the default is 100.
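
For reference, a hedged way to check what a monitor is actually running with, and to persist the lower value (assumes the mon id matches the short hostname; the ceph.conf placement is an assumption):

# on a mon host
ceph daemon mon.$(hostname -s) config get osd_map_message_max

# ceph.conf on the mons, e.g.:
[global]
osd_map_message_max = 10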


One other follow-up question on the kernel client, about something I've 
been seeing while testing: does the kernel client release caps when the 
MDS asks it to due to cache pressure?  On one machine I ran a job that 
touches a lot of files, so the kernel client accumulated over 4 million 
caps.  Many hours after all the activity finished (i.e. many hours after 
anything last accessed ceph on that node), the kernel client still holds 
millions of caps, and the MDS periodically complains about clients not 
responding to cache pressure.  How is this supposed to be handled?  
Asking the kernel to drop caches via /proc/sys/vm/drop_caches obviously 
does a very thorough cleanup, but something in between would be better.
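
To watch this from both sides, a hedged sketch (the MDS name is a placeholder, and the client-side file requires debugfs to be mounted):

# on the client: caps currently held by each kernel cephfs mount
cat /sys/kernel/debug/ceph/*/caps

# on the MDS side: per-session cap counts as the MDS sees them
ceph daemon mds.<name> session ls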


Andras


On 1/16/19 1:45 PM, Ilya Dryomov wrote:

On Wed, Jan 16, 2019 at 7:12 PM Andras Pataki
 wrote:

Hi Ilya/Kjetil,

I've done some debugging and tcpdump-ing to see what the interaction
between the kernel client and the mon looks like.  Indeed -
CEPH_MSG_MAX_FRONT defined as 16Mb seems low for the default mon
messages for our cluster (with osd_map_message_max at 100).  We have
about 3500 osd's, and the kernel advertises itself as older than

This is too big, especially for a fairly large cluster such as yours.
The default was reduced to 40 in luminous.  Given about 3500 OSDs, you
might want to set it to 20 or even 10.


Luminous, so it gets full map updates.  The FRONT message size on the
wire I saw was over 24Mb.  I'll try setting osd_map_message_max to 30
and do some more testing, but from the debugging it definitely seems
like the issue.

Is the kernel driver really not up to date to be considered at least a
Luminous client by the mon (i.e. it has some feature really missing)?  I
looked at the bits, and what the MON seems to want is bit 59 of the ceph
features, shared by FS_BTIME, FS_CHANGE_ATTR and MSG_ADDR2.  Can the
kernel client be used when setting require-min-compat-client to luminous
(either with the 4.19.x kernel or the RedHat/CentOS 7.6 kernel)?  Some
background here would be helpful.

Yes, the kernel client is missing support for that feature bit, however
4.13+ and RHEL 7.5+ _can_ be used with require-min-compat-client set to
luminous.  See

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/027002.html

Thanks,

 Ilya



Re: [ceph-users] cephfs kernel client instability

2019-01-16 Thread Andras Pataki

Hi Ilya/Kjetil,

I've done some debugging and tcpdump-ing to see what the interaction 
between the kernel client and the mon looks like.  Indeed - 
CEPH_MSG_MAX_FRONT defined as 16Mb seems low for the default mon 
messages for our cluster (with osd_map_message_max at 100).  We have 
about 3500 osd's, and the kernel advertises itself as older than 
Luminous, so it gets full map updates.  The FRONT message size on the 
wire I saw was over 24Mb.  I'll try setting osd_map_message_max to 30 
and do some more testing, but from the debugging it definitely seems 
like the issue.


Is the kernel driver really not up to date to be considered at least a 
Luminous client by the mon (i.e. it has some feature really missing)?  I 
looked at the bits, and what the MON seems to want is bit 59 of the ceph 
features, shared by FS_BTIME, FS_CHANGE_ATTR and MSG_ADDR2.  Can the 
kernel client be used when setting require-min-compat-client to luminous 
(either with the 4.19.x kernel or the RedHat/CentOS 7.6 kernel)?  Some 
background here would be helpful.


Thanks for your help and suggestions!

Andras


On 1/16/19 5:20 AM, Ilya Dryomov wrote:

On Wed, Jan 16, 2019 at 1:27 AM Kjetil Joergensen  wrote:

Hi,

you could try reducing "osd map message max"; some code paths that end up as 
-EIO (kernel: libceph: mon1 *** io error) are those exceeding 
include/linux/ceph/libceph.h:CEPH_MSG_MAX_{FRONT,MIDDLE,DATA}_LEN.

This "worked for us" - YMMV.

Kjetil, how big is your cluster?  Do you remember the circumstances
of when you started seeing these errors?

Andras, please let us know if this resolves the issue.  Decreasing
"osd map message max" for large clusters can help with the overall
memory consumption and is probably a good idea in general, but then
these kernel client limits are pretty arbitrary, so we can look at
bumping them.

Thanks,

 Ilya



Re: [ceph-users] cephfs kernel client instability

2019-01-15 Thread Andras Pataki
An update on our cephfs kernel client troubles.  After doing some 
heavier testing with a newer kernel 4.19.13, it seems like it also gets 
into a bad state when it can't connect to monitors (all back end 
processes are on 12.2.8):


Jan 15 08:49:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
established

Jan 15 08:49:01 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
Jan 15 08:49:01 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
lost, hunting for new mon
Jan 15 08:49:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session 
established

Jan 15 08:49:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
Jan 15 08:49:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session 
lost, hunting for new mon
Jan 15 08:49:04 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
established

Jan 15 08:49:04 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
Jan 15 08:49:04 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
lost, hunting for new mon
Jan 15 08:49:06 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
established

Jan 15 08:49:07 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Jan 15 08:49:07 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
lost, hunting for new mon

... repeating forever ...

# uname -r
4.19.13

and on the mon node (10.128.150.10) at log level 20, I see that it is 
building/encoding a lot of maps (10.128.36.18 is the client in question):
2019-01-15 08:49:01.355017 7fffee40c700 10 mon.cephmon00@0(leader) e40 
_ms_dispatch new session 0x62dc6c00 MonSession(client.36800361 
10.128.36.18:0/2911716500 is open , features 0x27018fb86aa42ada (jewel)) 
features 0x27018fb86aa42ada
2019-01-15 08:49:01.355021 7fffee40c700 20 mon.cephmon00@0(leader) e40  
caps
2019-01-15 08:49:01.355026 7fffee40c700 10 mon.cephmon00@0(leader).auth 
v58457 preprocess_query auth(proto 0 34 bytes epoch 0) from 
client.36800361 10.128.36.18:0/2911716500


2019-01-15 08:49:01.355817 7fffee40c700 10 mon.cephmon00@0(leader).osd 
e1254390 check_osdmap_sub 0x65373340 next 1254102 (onetime)
2019-01-15 08:49:01.355819 7fffee40c700  5 mon.cephmon00@0(leader).osd 
e1254390 send_incremental [1254102..1254390] to client.36800361 
10.128.36.18:0/2911716500
2019-01-15 08:49:01.355821 7fffee40c700 10 mon.cephmon00@0(leader).osd 
e1254390 build_incremental [1254102..1254141] with features 27018fb86aa42ada
2019-01-15 08:49:01.364859 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 reencode_incremental_map 1254141 with features 504412504116439552
2019-01-15 08:49:01.372131 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 build_incremental    inc 1254141 1237271 bytes
2019-01-15 08:49:01.372180 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 reencode_incremental_map 1254140 with features 504412504116439552
2019-01-15 08:49:01.372187 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 build_incremental    inc 1254140 260 bytes
2019-01-15 08:49:01.380981 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 reencode_incremental_map 1254139 with features 504412504116439552
2019-01-15 08:49:01.387983 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 build_incremental    inc 1254139 1237351 bytes
2019-01-15 08:49:01.388043 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 reencode_incremental_map 1254138 with features 504412504116439552
2019-01-15 08:49:01.388049 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 build_incremental    inc 1254138 232 bytes
2019-01-15 08:49:01.396781 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 reencode_incremental_map 1254137 with features 504412504116439552

 ... a lot more of similar messages
2019-01-15 08:49:04.210936 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 reencode_incremental_map 1254382 with features 504412504116439552
2019-01-15 08:49:04.211032 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 build_incremental    inc 1254382 232 bytes
2019-01-15 08:49:04.211066 7fffee40c700 10 mon.cephmon00@0(leader) e40 
ms_handle_reset 0x6450f800 10.128.36.18:0/2911716500
2019-01-15 08:49:04.211070 7fffee40c700 10 mon.cephmon00@0(leader) e40 
reset/close on session client.36800361 10.128.36.18:0/2911716500
2019-01-15 08:49:04.211073 7fffee40c700 10 mon.cephmon00@0(leader) e40 
remove_session 0x62dc6c00 client.36800361 10.128.36.18:0/2911716500 
features 0x27018fb86aa42ada


It looks like the client disconnects (either because it waits too long, 
or because of some protocol error?).  Any hints on why so many maps need 
to be re-encoded (to jewel), or on how to improve this behavior, would be 
much appreciated.  We would really be interested in using the kernel 
client instead of fuse, but this seems to be a stumbling block.


Thanks,

Andras


On 1/3/19 6:49 AM, Andras Pataki wrote:
I wonder if anyone could offer any insight on the issue below, 
regarding the CentOS 7.6 kernel cephfs client connecting to a Luminous 
cluster.  I have since tried a much newer 4.19.13 kernel, which did 
not show the same issue

Re: [ceph-users] cephfs kernel client instability

2019-01-03 Thread Andras Pataki
I wonder if anyone could offer any insight on the issue below, regarding 
the CentOS 7.6 kernel cephfs client connecting to a Luminous cluster.  I 
have since tried a much newer 4.19.13 kernel, which did not show the 
same issue (but unfortunately for various reasons unrelated to ceph, we 
can't go to such a new kernel).


Am I reading it right that the monitor somehow thinks this kernel is old 
and needs to prepare special maps in an older format for it, that this 
takes too long and the kernel just gives up, or that perhaps there is 
some other communication protocol error?  It seems like each of these 
mon communication sessions only lasts about half a second.  Then the 
client reconnects to another mon, gets the same result, and so on.  Any 
way around this?


Andras


On 12/26/18 7:55 PM, Andras Pataki wrote:
We've been using ceph-fuse with a pretty good stability record 
(against the Luminous 12.2.8 back end).  Unfortunately ceph-fuse has 
extremely poor small file performance (understandably), so we've been 
testing the kernel client.  The latest RedHat kernel 
3.10.0-957.1.3.el7.x86_64 seems to work pretty well, as long as the 
cluster is running in a completely clean state.  However, it seems 
that as soon as there is something happening to the cluster, the 
kernel client crashes pretty badly.


Today's example: I've reweighted some OSDs to balance the disk usage a 
bit (set nobackfill, reweight the OSDs, check the new hypothetical 
space usage, then unset nobackfill).   As soon as the reweighting 
procedure started, the kernel client went into an infinite loop trying 
to unsuccessfully connect to mons:


Dec 26 19:28:53 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
Dec 26 19:28:53 mon5 kernel: libceph: mon0 10.128.150.10:6789 session 
lost, hunting for new mon
Dec 26 19:28:53 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
established

Dec 26 19:28:58 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:28:58 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
lost, hunting for new mon
Dec 26 19:28:58 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
established

Dec 26 19:28:59 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
Dec 26 19:28:59 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
lost, hunting for new mon
Dec 26 19:28:59 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
established

Dec 26 19:28:59 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:28:59 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
lost, hunting for new mon
Dec 26 19:28:59 mon5 kernel: libceph: mon0 10.128.150.10:6789 session 
established

Dec 26 19:29:00 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
Dec 26 19:29:00 mon5 kernel: libceph: mon0 10.128.150.10:6789 session 
lost, hunting for new mon
Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
established

Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
lost, hunting for new mon
Dec 26 19:29:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
established

Dec 26 19:29:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
Dec 26 19:29:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
lost, hunting for new mon
Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
established

Dec 26 19:29:01 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:29:01 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
lost, hunting for new mon
Dec 26 19:29:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session 
established

Dec 26 19:29:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
Dec 26 19:29:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session 
lost, hunting for new mon
Dec 26 19:29:01 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
established

Dec 26 19:29:02 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:29:02 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
lost, hunting for new mon
Dec 26 19:29:02 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
established

Dec 26 19:29:02 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
Dec 26 19:29:02 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
lost, hunting for new mon

... etc ...

seemingly never recovering.  The cluster is healthy, all other clients 
are successfully doing I/O:


[root@cephmon00 ceph]# ceph -s
  cluster:
    id: d7b33135-0940-4e48-8aa6-1d2026597c2f
    health: HEALTH_WARN
    noout flag(s) set
    1 backfillfull osd(s)
    4 pool(s) backfillfull
    239119058/12419244975 objects misplaced (1.925%)

  services:
    mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
    mgr: cephmon00(active)
    mds: cephfs-1/1/1 up  {0=cephmds00=up:active}, 1 up:standby
    osd: 3534 osds: 3534 up, 3534 in; 5040 remapped pgs
 flags noout

[ceph-users] cephfs kernel client instability

2018-12-26 Thread Andras Pataki
We've been using ceph-fuse with a pretty good stability record (against 
the Luminous 12.2.8 back end).  Unfortunately ceph-fuse has extremely 
poor small file performance (understandably), so we've been testing the 
kernel client.  The latest RedHat kernel 3.10.0-957.1.3.el7.x86_64 seems 
to work pretty well, as long as the cluster is running in a completely 
clean state.  However, it seems that as soon as there is something 
happening to the cluster, the kernel client crashes pretty badly.


Today's example: I've reweighted some OSDs to balance the disk usage a 
bit (set nobackfill, reweight the OSDs, check the new hypothetical space 
usage, then unset nobackfill).   As soon as the reweighting procedure 
started, the kernel client went into an infinite loop trying to 
unsuccessfully connect to mons:


Dec 26 19:28:53 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
Dec 26 19:28:53 mon5 kernel: libceph: mon0 10.128.150.10:6789 session 
lost, hunting for new mon
Dec 26 19:28:53 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
established

Dec 26 19:28:58 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:28:58 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
lost, hunting for new mon
Dec 26 19:28:58 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
established

Dec 26 19:28:59 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
Dec 26 19:28:59 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
lost, hunting for new mon
Dec 26 19:28:59 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
established

Dec 26 19:28:59 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:28:59 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
lost, hunting for new mon
Dec 26 19:28:59 mon5 kernel: libceph: mon0 10.128.150.10:6789 session 
established

Dec 26 19:29:00 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
Dec 26 19:29:00 mon5 kernel: libceph: mon0 10.128.150.10:6789 session 
lost, hunting for new mon
Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
established

Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
lost, hunting for new mon
Dec 26 19:29:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
established

Dec 26 19:29:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
Dec 26 19:29:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
lost, hunting for new mon
Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
established

Dec 26 19:29:01 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:29:01 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
lost, hunting for new mon
Dec 26 19:29:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session 
established

Dec 26 19:29:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
Dec 26 19:29:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session 
lost, hunting for new mon
Dec 26 19:29:01 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
established

Dec 26 19:29:02 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:29:02 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
lost, hunting for new mon
Dec 26 19:29:02 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
established

Dec 26 19:29:02 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
Dec 26 19:29:02 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
lost, hunting for new mon

... etc ...

seemingly never recovering.  The cluster is healthy, all other clients 
are successfully doing I/O:


   [root@cephmon00 ceph]# ceph -s
  cluster:
    id: d7b33135-0940-4e48-8aa6-1d2026597c2f
    health: HEALTH_WARN
    noout flag(s) set
    1 backfillfull osd(s)
    4 pool(s) backfillfull
    239119058/12419244975 objects misplaced (1.925%)

  services:
    mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
    mgr: cephmon00(active)
    mds: cephfs-1/1/1 up  {0=cephmds00=up:active}, 1 up:standby
    osd: 3534 osds: 3534 up, 3534 in; 5040 remapped pgs
 flags noout

  data:
    pools:   5 pools, 50688 pgs
    objects: 2.51G objects, 7.77PiB
    usage:   19.1PiB used, 5.90PiB / 25.0PiB avail
    pgs: 239119058/12419244975 objects misplaced (1.925%)
 45639 active+clean
 2914  active+remapped+backfilling
 2126  active+remapped+backfill_wait
 9 active+clean+scrubbing+deep

  io:
    client:   10.3MiB/s rd, 2.23GiB/s wr, 32op/s rd, 3.77kop/s wr
    recovery: 114GiB/s, 33.12kobjects/s


The client machine in question is otherwise healthy also (not out of 
memory, etc.).  I've checked the osd blacklist, nothing on that. Nothing 
suspicious in the mon logs regarding this client either. Setting one of 
the mons to log level 20, the relevant section when this client 

Re: [ceph-users] Ceph monitors overloaded on large cluster restart

2018-12-19 Thread Andras Pataki

Hi Dan,

'noup' now makes a lot of sense - that's probably the major help that 
our cluster start would have needed.  Essentially this way only one map 
change occurs in the cluster when all the OSDs are marked 'in' and that 
gets distributed, vs hundreds or thousands of map changes as various 
OSDs boot at slightly different times.  I have a smaller cluster that I 
can test it with and measure how the network traffic changes to the 
mons, and will plan this in the next shutdown/restart/upgrade.


Thanks for the quick response and the tip!

Andras


On 12/19/18 6:47 PM, Dan van der Ster wrote:

Hey Andras,

Three mons is possibly too few for such a large cluster. We've had 
lots of good stable experience with 5-mon clusters. I've never tried 
7, so I can't say if that would lead to other problems (e.g. 
leader/peon sync scalability).


That said, our 1-osd bigbang tests managed with only 3 mons, and I 
assume that outside of this full system reboot scenario your 3 cope 
well enough. You should probably add 2 more, but I wouldn't expect 
that alone to solve this problem in the future.


Instead, with a slightly tuned procedure and a bit of osd log 
grepping, I think you could've booted this cluster more quickly than 4 
hours with those mere 3 mons.


As you know, each osds boot process requires the downloading of all 
known osdmaps. If all osds are booting together, and the mons are 
saturated, the osds can become sluggish when responding to their 
peers, which could lead to the flapping scenario you saw. Flapping 
leads to new osdmap epochs that then need to be distributed, worsening 
the issue. It's good that you used nodown and noout, because without 
these the boot time would've been even longer. Next time also set noup 
and noin to further reduce the osdmap churn.


One other thing: there's a debug_osd level -- 10 or 20, I forget 
exactly -- that you can set to watch the maps sync up on each osd. 
Grep the osd logs for some variations on "map" and "epoch".


In short, here's what I would've done:

0. boot the mons, waiting until they have a full quorum.
1. set nodown, noup, noin, noout   <-- with these, there should be 
zero new osdmaps generated while the osds boot.
2. start booting osds. set the necessary debug_osd level to see the 
osdmap sync progress in the ceph-osd logs.
3. if the mons are over saturated, boot progressively -- one rack at a 
time, for example.
4. once all osds have caught up to the current osdmap, unset noup. The 
osds should then all "boot" (as far as the mons are concerned) and be 
marked up. (this might be sluggish on a 3400 osd cluster, perhaps 
taking a few 10s of seconds). the pgs should be active+clean at this 
point.
5. unset nodown, noin, noout. which should change nothing provided all 
went well.
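
In concrete terms, the flag handling in the procedure above is the standard set/unset commands:

ceph osd set nodown
ceph osd set noup
ceph osd set noin
ceph osd set noout
# ... start the OSDs and wait for them to catch up on osdmaps ...
ceph osd unset noup    # step 4: the mons now mark the caught-up OSDs up
ceph osd unset nodown
ceph osd unset noin
ceph osd unset noout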


Hope that helps for next time!

Dan


On Wed, Dec 19, 2018 at 11:39 PM Andras Pataki <apat...@flatironinstitute.org> wrote:


Forgot to mention: all nodes are on Luminous 12.2.8 currently on
    CentOS 7.5.

On 12/19/18 5:34 PM, Andras Pataki wrote:
> Dear ceph users,
>
> We have a large-ish ceph cluster with about 3500 osds. We run 3
mons
> on dedicated hosts, and the mons typically use a few percent of a
> core, and generate about 50Mbits/sec network traffic. They are
> connected at 20Gbits/sec (bonded dual 10Gbit) and are running on
2x14
> core servers.
>
> We recently had to shut ceph down completely for maintenance
(which we
> rarely do), and had significant difficulties starting it up.  The
> symptoms included OSDs hanging on startup, being marked down,
flapping
> and all that bad stuff.  After some investigation we found that the
> 20Gbit/sec network interfaces of the monitors were completely
> saturated as the OSDs were starting, while the monitor processes
were
> using about 3 cores (300% CPU).  We ended up having to start the
OSDs
> up super slow to make sure that the monitors could keep up - it
took
> about 4 hours to start 3500 OSDs (at a rate about 4 seconds per
OSD).
> We've tried setting noout and nodown, but that didn't really help
> either.  A few questions that would be good to understand in
order to
> move to a better configuration.
>
> 1. How does the monitor traffic scale with the number of OSDs?
> Presumably the traffic comes from distributing cluster maps as the
> cluster changes on OSD starts.  The cluster map is perhaps O(N)
for N
> OSDs, and each OSD needs an update on a cluster change so that
would
> make one change an O(N^2) traffic.  As OSDs start, the cluster
changes
> quite a lot (N times?), so would that make the startup traffic
> O(N^3)?  If so, that sounds pretty scary for scalability.
>
> 2. Would adding more monitors help

Re: [ceph-users] Ceph monitors overloaded on large cluster restart

2018-12-19 Thread Andras Pataki

Forgot to mention: all nodes are on Luminous 12.2.8 currently on CentOS 7.5.

On 12/19/18 5:34 PM, Andras Pataki wrote:

Dear ceph users,

We have a large-ish ceph cluster with about 3500 osds.  We run 3 mons 
on dedicated hosts, and the mons typically use a few percent of a 
core, and generate about 50Mbits/sec network traffic.  They are 
connected at 20Gbits/sec (bonded dual 10Gbit) and are running on 2x14 
core servers.


We recently had to shut ceph down completely for maintenance (which we 
rarely do), and had significant difficulties starting it up.  The 
symptoms included OSDs hanging on startup, being marked down, flapping 
and all that bad stuff.  After some investigation we found that the 
20Gbit/sec network interfaces of the monitors were completely 
saturated as the OSDs were starting, while the monitor processes were 
using about 3 cores (300% CPU).  We ended up having to start the OSDs 
up super slow to make sure that the monitors could keep up - it took 
about 4 hours to start 3500 OSDs (at a rate about 4 seconds per OSD).  
We've tried setting noout and nodown, but that didn't really help 
either.  A few questions that would be good to understand in order to 
move to a better configuration.


1. How does the monitor traffic scale with the number of OSDs? 
Presumably the traffic comes from distributing cluster maps as the 
cluster changes on OSD starts.  The cluster map is perhaps O(N) for N 
OSDs, and each OSD needs an update on a cluster change so that would 
make one change an O(N^2) traffic.  As OSDs start, the cluster changes 
quite a lot (N times?), so would that make the startup traffic 
O(N^3)?  If so, that sounds pretty scary for scalability.


2. Would adding more monitors help here?  I.e. presumably each OSD 
gets its maps from one monitor, so they would share the traffic. Would 
the inter-monitor communication/elections/etc. be problematic for more 
monitors (5, 7 or even more)?  Would more monitors be recommended?  If 
so, how many is practical?


3. Are there any config parameters useful for tuning the traffic 
(perhaps send mon updates less frequently, or something along those 
lines)?


Any other advice on this topic would also be helpful.

Thanks,

Andras




[ceph-users] Ceph monitors overloaded on large cluster restart

2018-12-19 Thread Andras Pataki

Dear ceph users,

We have a large-ish ceph cluster with about 3500 osds.  We run 3 mons on 
dedicated hosts, and the mons typically use a few percent of a core, and 
generate about 50Mbits/sec network traffic.  They are connected at 
20Gbits/sec (bonded dual 10Gbit) and are running on 2x14 core servers.


We recently had to shut ceph down completely for maintenance (which we 
rarely do), and had significant difficulties starting it up.  The 
symptoms included OSDs hanging on startup, being marked down, flapping 
and all that bad stuff.  After some investigation we found that the 
20Gbit/sec network interfaces of the monitors were completely saturated 
as the OSDs were starting, while the monitor processes were using about 
3 cores (300% CPU).  We ended up having to start the OSDs up super slow 
to make sure that the monitors could keep up - it took about 4 hours to 
start 3500 OSDs (at a rate about 4 seconds per OSD).  We've tried 
setting noout and nodown, but that didn't really help either.  A few 
questions that would be good to understand in order to move to a better 
configuration.


1. How does the monitor traffic scale with the number of OSDs? 
Presumably the traffic comes from distributing cluster maps as the 
cluster changes on OSD starts.  The cluster map is perhaps O(N) for N 
OSDs, and each OSD needs an update on a cluster change so that would 
make one change an O(N^2) traffic.  As OSDs start, the cluster changes 
quite a lot (N times?), so would that make the startup traffic O(N^3)?  
If so, that sounds pretty scary for scalability.


2. Would adding more monitors help here?  I.e. presumably each OSD gets 
its maps from one monitor, so they would share the traffic. Would the 
inter-monitor communication/elections/etc. be problematic for more 
monitors (5, 7 or even more)?  Would more monitors be recommended?  If 
so, how many is practical?


3. Are there any config parameters useful for tuning the traffic 
(perhaps send mon updates less frequently, or something along those lines)?


Any other advice on this topic would also be helpful.

Thanks,

Andras



Re: [ceph-users] move directories in cephfs

2018-12-10 Thread Andras Pataki
Moving data between pools when a file is moved to a different directory 
is most likely problematic.  For example, an inode can be hard-linked 
into two different directories that are in two different pools - then 
what happens to the file?  Unix/POSIX semantics don't really assign a 
parent directory to a regular file.


That being said - it would be really nice if there were a way to move an 
inode from one pool to another transparently (with some explicit 
command).  Perhaps locking the inode up for the duration of the move, 
and releasing it when the move is complete (so that clients that have 
the file open don't notice any disruptions).  Are there any plans in 
this direction?
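
Some background that may help frame this: the data pool is part of the file layout, which is fixed when a file is created; it can be set per directory (affecting files created there afterwards) or on an empty file via the layout vxattrs, which is why a plain mv never relocates existing data.  A hedged sketch (pool name and paths are made up):

setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/ceph/some/dir
getfattr -n ceph.dir.layout /mnt/ceph/some/dir
getfattr -n ceph.file.layout /mnt/ceph/some/dir/somefile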


Andras

On 12/10/18 10:55 AM, Marc Roos wrote:
  


Except if you have different pools on these directories.  Then the data 
is not moved (copied), which I think it should be.  This should be 
changed, because no one will expect a symlink to the old pool.




-Original Message-
From: Jack [mailto:c...@jack.fr.eu.org]
Sent: 10 December 2018 15:14
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] move directories in cephfs

Having the / mounted somewhere, you can simply "mv" directories around

On 12/10/2018 02:59 PM, Zhenshi Zhou wrote:

Hi,

Is there a way I can move sub-directories outside the directory.
For instance, a directory /parent contains 3 sub-directories
/parent/a, /parent/b, /parent/c. All these directories have huge data
in it. I'm gonna move /parent/b to /b. I don't want to copy the whole
directory outside cause it will be so slow.

Besides, I heard about cephfs-shell early today. I'm wondering which
version will ceph have this command tool. My cluster is luminous
12.2.5.

Thanks





Re: [ceph-users] ceph-volume: recreate OSD with same ID after drive replacement

2018-10-03 Thread Andras Pataki

Ok, understood (for next time).

But just as an update/closure to my investigation - it seems this is a 
limitation of ceph-volume (it can't just create an OSD from scratch 
with a given ID), not of base ceph.  The underlying ceph command (ceph 
osd new) very happily accepts an osd-id as an extra optional argument 
(after the fsid), and creates an OSD with the given ID.  In fact, a 
quick change to ceph-volume (the create_id function in prepare.py) will 
make ceph-volume recreate the OSD with a given ID.  I'm not a 
ceph-volume expert, but a feature to create an OSD with a given ID from 
scratch would be nice (given that the underlying raw ceph commands 
already support it).
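
For reference, a hedged sketch of that underlying monitor command (the uuid is a placeholder; ceph-volume additionally feeds a cephx secrets JSON via -i):

ceph osd new <new-osd-uuid> 747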


Andras

On 10/3/18 11:41 AM, Alfredo Deza wrote:

On Wed, Oct 3, 2018 at 11:23 AM Andras Pataki
 wrote:

Thanks - I didn't realize that was such a recent fix.

I've now tried 12.2.8, and perhaps I'm not clear on what I should have
done to the OSD that I'm replacing, since I'm getting the error "The osd
ID 747 is already in use or does not exist.".  The case is clearly the
latter, since I've completely removed the old OSD (osd crush remove,
auth del, osd rm, wipe disk).  Should I have done something different
(i.e. not remove the OSD completely)?

Yeah, you completely removed it so now it can't be re-used. This is
the proper way if wanting to re-use the ID:

http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#rados-replacing-an-osd

Basically:

 ceph osd destroy {id} --yes-i-really-mean-it
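
Putting the two pieces together, the replacement flow would look roughly like this (device names copied from the original command; adjust to your hardware):

ceph osd destroy 747 --yes-i-really-mean-it
ceph-volume lvm prepare --bluestore --osd-id 747 --data H901D44/H901D44 --block.db /dev/disk/by-partlabel/H901J44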


Searching the docs I see a command 'ceph osd destroy'.  What does that
do (compared to my removal procedure, osd crush remove, auth del, osd rm)?

Thanks,

Andras


On 10/3/18 10:36 AM, Alfredo Deza wrote:

On Wed, Oct 3, 2018 at 9:57 AM Andras Pataki
 wrote:

After replacing failing drive I'd like to recreate the OSD with the same
osd-id using ceph-volume (now that we've moved to ceph-volume from
ceph-disk).  However, I seem to not be successful.  The command I'm using:

ceph-volume lvm prepare --bluestore --osd-id 747 --data H901D44/H901D44
--block.db /dev/disk/by-partlabel/H901J44

But it created an OSD with the ID 601, which was the lowest it could
allocate, and apparently ignored the 747.  This is with ceph 12.2.7.  Any ideas?

Yeah, this was a problem that was fixed and released as part of 12.2.8

The tracker issue is: http://tracker.ceph.com/issues/24044

The Luminous PR is https://github.com/ceph/ceph/pull/23102

Sorry for the trouble!

Andras



Re: [ceph-users] ceph-volume: recreate OSD with same ID after drive replacement

2018-10-03 Thread Andras Pataki

Thanks - I didn't realize that was such a recent fix.

I've now tried 12.2.8, and perhaps I'm not clear on what I should have 
done to the OSD that I'm replacing, since I'm getting the error "The osd 
ID 747 is already in use or does not exist.".  The case is clearly the 
latter, since I've completely removed the old OSD (osd crush remove, 
auth del, osd rm, wipe disk).  Should I have done something different 
(i.e. not remove the OSD completely)?
Searching the docs I see a command 'ceph osd destroy'.  What does that 
do (compared to my removal procedure, osd crush remove, auth del, osd rm)?


Thanks,

Andras


On 10/3/18 10:36 AM, Alfredo Deza wrote:

On Wed, Oct 3, 2018 at 9:57 AM Andras Pataki
 wrote:

After replacing failing drive I'd like to recreate the OSD with the same
osd-id using ceph-volume (now that we've moved to ceph-volume from
ceph-disk).  However, I seem to not be successful.  The command I'm using:

ceph-volume lvm prepare --bluestore --osd-id 747 --data H901D44/H901D44
--block.db /dev/disk/by-partlabel/H901J44

But it created an OSD with the ID 601, which was the lowest it could
allocate, and apparently ignored the 747.  This is with ceph 12.2.7.  Any ideas?

Yeah, this was a problem that was fixed and released as part of 12.2.8

The tracker issue is: http://tracker.ceph.com/issues/24044

The Luminous PR is https://github.com/ceph/ceph/pull/23102

Sorry for the trouble!

Andras



[ceph-users] ceph-volume: recreate OSD with same ID after drive replacement

2018-10-03 Thread Andras Pataki
After replacing failing drive I'd like to recreate the OSD with the same 
osd-id using ceph-volume (now that we've moved to ceph-volume from 
ceph-disk).  However, I seem to not be successful.  The command I'm using:


ceph-volume lvm prepare --bluestore --osd-id 747 --data H901D44/H901D44 
--block.db /dev/disk/by-partlabel/H901J44


But it created an OSD with the ID 601, which was the lowest it could allocate, 
and apparently ignored the 747.  This is with ceph 12.2.7.  Any ideas?


Andras



Re: [ceph-users] cephfs kernel client stability

2018-10-01 Thread Andras Pataki
dropping dirty+flushing Fw 
state for 997ff051a430 1099516450606
Oct  1 15:04:08 worker1004 kernel: ceph:  dropping dirty+flushing Fw 
state for 997ff0519eb0 1099516450608
Oct  1 15:04:08 worker1004 kernel: libceph: osd356 10.128.150.31:6844 
socket closed (con state OPEN)
Oct  1 15:04:08 worker1004 kernel: libceph: osd120 10.128.150.154:6885 
socket closed (con state OPEN)
Oct  1 15:04:08 worker1004 kernel: libceph: mds0 10.130.134.4:6800 
socket closed (con state NEGOTIATING)
Oct  1 15:04:08 worker1004 kernel: libceph: osd151 10.128.150.155:6808 
socket closed (con state OPEN)
Oct  1 15:04:08 worker1004 kernel: libceph: osd129 10.128.150.154:6845 
socket closed (con state OPEN)
Oct  1 15:04:08 worker1004 kernel: libceph: osd363 10.128.150.31:6872 
socket closed (con state OPEN)
Oct  1 15:04:08 worker1004 kernel: libceph: osd712 10.128.150.177:6908 
socket closed (con state OPEN)
Oct  1 15:04:08 worker1004 kernel: libceph: osd592 10.128.150.174:6917 
socket closed (con state OPEN)
Oct  1 15:04:08 worker1004 kernel: libceph: osd866 10.128.150.42:6976 
socket error on write
Oct  1 15:04:08 worker1004 kernel: libceph: osd395 10.128.150.158:6868 
socket closed (con state OPEN)
Oct  1 15:04:08 worker1004 kernel: libceph: osd159 10.128.150.155:6952 
socket closed (con state OPEN)
Oct  1 15:04:08 worker1004 kernel: libceph: osd901 10.128.150.43:6936 
socket closed (con state OPEN)
Oct  1 15:04:08 worker1004 kernel: libceph: osd658 10.128.150.176:6946 
socket closed (con state OPEN)
Oct  1 15:04:08 worker1004 kernel: libceph: osd422 10.128.150.159:6859 
socket closed (con state OPEN)
Oct  1 15:04:08 worker1004 kernel: libceph: osd718 10.128.150.177:6928 
socket error on read

and similar.

These nodes have been rock solid stable with ceph-fuse 12.2.7.  They 
definitely have no known networking issues, and the ceph cluster was 
healthy during this time.  In the MDS logs there are no entries around 
the problematic times either.  And after all this, the mount is unusable:


[root@worker1004 ~]# ls -l /mnt/cephtest
ls: cannot access /mnt/cephtest: Permission denied

Andras


On 10/1/18 3:02 PM, Andras Pataki wrote:
These hangs happen during random I/O fio benchmark loads.  Something 
like 4 or 8 fio processes doing random reads/writes to distinct large 
files (to ensure there is no caching possible).  This is all on CentOS 
7.4 nodes.  Same (and even tougher) tests run without any problems 
with ceph-fuse.  We do have jobs that do heavy parallel I/O (MPI-IO, 
HDF5 via MPI-IO, etc.) - so running 8 parallel random I/O generating 
processes on nodes with 28 cores and plenty of RAM (256GB - 512GB) 
should not be excessive.


I am going to test the latest CentOS kernel next (the one you are 
referencing).  The RedHat/CentOS kernels are not "old kernel clients" - 
they contain backports of hundreds of patches to all kinds of Linux 
subsystems.  What is unclear is exactly which ceph client changes 
RedHat is backporting to their kernels.  Any pointers there would be 
helpful.


Andras


On 10/1/18 2:26 PM, Marc Roos wrote:

  How do you test this? I have had no issues under "normal load" with an
old kernel client and a stable os.

CentOS Linux release 7.5.1804 (Core)
Linux c04 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018
x86_64 x86_64 x86_64 GNU/Linux





-Original Message-
From: Andras Pataki [mailto:apat...@flatironinstitute.org]
Sent: maandag 1 oktober 2018 20:10
To: ceph-users
Subject: [ceph-users] cephfs kernel client stability

We have so far been using ceph-fuse for mounting cephfs, but the small
file performance of ceph-fuse is often problematic.  We've been testing
the kernel client, and have seen some pretty bad crashes/hangs.

What is the policy on fixes to the kernel client?  Is only the latest
stable kernel updated (4.18.x nowadays), or are fixes backported to LTS
kernels also (like 4.14.x or 4.9.x for example)? I've seen various
threads that certain newer features require pretty new kernels - but I'm
wondering whether newer kernels are also required for better stability -
or - in general, where the kernel client stability stands nowadays.

Here is an example of kernel hang with 4.14.67.  On heavy loads the
machine isn't even pingable.

Sep 29 21:10:16 worker1004 kernel: INFO: rcu_sched self-detected stall
on CPU Sep 29 21:10:16 worker1004 kernel: #0111-...: (1 GPs behind)
idle=bee/141/0 softirq=21319/21319 fqs=7499 Sep 29 21:10:16
worker1004 kernel: #011 (t=15000 jiffies g=13989 c=13988
q=8334)
Sep 29 21:10:16 worker1004 kernel: NMI backtrace for cpu 1 Sep 29
21:10:16 worker1004 kernel: CPU: 1 PID: 19436 Comm: kworker/1:42
Tainted: P    W  O    4.14.67 #1
Sep 29 21:10:16 worker1004 kernel: Hardware name: Dell Inc. PowerEdge
C6320/082F9M, BIOS 2.6.0 10/27/2017 Sep 29 21:10:16 worker1004 kernel:
Workqueue: ceph-msgr ceph_con_workfn [libceph] Sep 29 21:10:16
worker1004 kernel: Call Trace:
Sep 29 21:10:16 worker1004 kernel:  Sep 29 21

Re: [ceph-users] cephfs kernel client stability

2018-10-01 Thread Andras Pataki
These hangs happen during random I/O fio benchmark loads.  Something 
like 4 or 8 fio processes doing random reads/writes to distinct large 
files (to ensure there is no caching possible).  This is all on CentOS 
7.4 nodes.  Same (and even tougher) tests run without any problems with 
ceph-fuse.  We do have jobs that do heavy parallel I/O (MPI-IO, HDF5 via 
MPI-IO, etc.) - so running 8 parallel random I/O generating processes on 
nodes with 28 cores and plenty of RAM (256GB - 512GB) should not be 
excessive.
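
The exact fio jobs aren't included here; an illustrative sketch of this kind of load (all parameters are assumptions):

fio --name=randrw --directory=/mnt/cephtest --ioengine=libaio --direct=1 \
    --rw=randrw --bs=4k --size=20G --numjobs=8 --runtime=300 --time_based \
    --group_reporting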


I am going to test the latest CentOS kernel next (the one you are 
referencing).  The RedHat/CentOS kernels are not "old kernel clients" - 
they contain backports of hundreds of patches to all kinds of Linux 
subsystems.  What is unclear is exactly which ceph client changes 
RedHat is backporting to their kernels.  Any pointers there would be 
helpful.


Andras


On 10/1/18 2:26 PM, Marc Roos wrote:
  
How do you test this? I have had no issues under "normal load" with an
old kernel client and a stable os.

CentOS Linux release 7.5.1804 (Core)
Linux c04 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018
x86_64 x86_64 x86_64 GNU/Linux





-Original Message-
From: Andras Pataki [mailto:apat...@flatironinstitute.org]
Sent: maandag 1 oktober 2018 20:10
To: ceph-users
Subject: [ceph-users] cephfs kernel client stability

We have so far been using ceph-fuse for mounting cephfs, but the small
file performance of ceph-fuse is often problematic.  We've been testing
the kernel client, and have seen some pretty bad crashes/hangs.

What is the policy on fixes to the kernel client?  Is only the latest
stable kernel updated (4.18.x nowadays), or are fixes backported to LTS
kernels also (like 4.14.x or 4.9.x for example)? I've seen various
threads that certain newer features require pretty new kernels - but I'm
wondering whether newer kernels are also required for better stability -
or - in general, where the kernel client stability stands nowadays.

Here is an example of kernel hang with 4.14.67.  On heavy loads the
machine isn't even pingable.

Sep 29 21:10:16 worker1004 kernel: INFO: rcu_sched self-detected stall on CPU
Sep 29 21:10:16 worker1004 kernel: #0111-...: (1 GPs behind) idle=bee/141/0 softirq=21319/21319 fqs=7499
Sep 29 21:10:16 worker1004 kernel: #011 (t=15000 jiffies g=13989 c=13988 q=8334)
Sep 29 21:10:16 worker1004 kernel: NMI backtrace for cpu 1
Sep 29 21:10:16 worker1004 kernel: CPU: 1 PID: 19436 Comm: kworker/1:42 Tainted: P    W  O    4.14.67 #1
Sep 29 21:10:16 worker1004 kernel: Hardware name: Dell Inc. PowerEdge C6320/082F9M, BIOS 2.6.0 10/27/2017
Sep 29 21:10:16 worker1004 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph]
Sep 29 21:10:16 worker1004 kernel: Call Trace:
Sep 29 21:10:16 worker1004 kernel: 
Sep 29 21:10:16 worker1004 kernel: dump_stack+0x46/0x5f
Sep 29 21:10:16 worker1004 kernel: nmi_cpu_backtrace+0xba/0xc0
Sep 29 21:10:16 worker1004 kernel: ? irq_force_complete_move+0xd0/0xd0
Sep 29 21:10:16 worker1004 kernel: nmi_trigger_cpumask_backtrace+0x8a/0xc0
Sep 29 21:10:16 worker1004 kernel: rcu_dump_cpu_stacks+0x81/0xb1
Sep 29 21:10:16 worker1004 kernel: rcu_check_callbacks+0x642/0x790
Sep 29 21:10:16 worker1004 kernel: ? update_wall_time+0x26d/0x6e0
Sep 29 21:10:16 worker1004 kernel: update_process_times+0x23/0x50
Sep 29 21:10:16 worker1004 kernel: tick_sched_timer+0x2f/0x60
Sep 29 21:10:16 worker1004 kernel: __hrtimer_run_queues+0xa3/0xf0
Sep 29 21:10:16 worker1004 kernel: hrtimer_interrupt+0x94/0x170
Sep 29 21:10:16 worker1004 kernel: smp_apic_timer_interrupt+0x4c/0x90
Sep 29 21:10:16 worker1004 kernel: apic_timer_interrupt+0x84/0x90
Sep 29 21:10:16 worker1004 kernel: 
Sep 29 21:10:16 worker1004 kernel: RIP: 0010:crush_hash32_3+0x1e5/0x270 [libceph]
Sep 29 21:10:16 worker1004 kernel: RSP: 0018:c9000fdff5d8 EFLAGS: 0a97 ORIG_RAX: ff10
Sep 29 21:10:16 worker1004 kernel: RAX: 06962033 RBX: 883f6e7173c0 RCX: dcdcc373
Sep 29 21:10:16 worker1004 kernel: RDX: bd5425ca RSI: 8a8b0b56 RDI: b1983b87
Sep 29 21:10:16 worker1004 kernel: RBP: 0023 R08: bd5425ca R09: 137904e9
Sep 29 21:10:16 worker1004 kernel: R10:  R11: 0002 R12: b0f29f21
Sep 29 21:10:16 worker1004 kernel: R13: 000c R14: f0ae R15: 0023
Sep 29 21:10:16 worker1004 kernel: crush_bucket_choose+0x2ad/0x340 [libceph]
Sep 29 21:10:16 worker1004 kernel: crush_choose_firstn+0x1b0/0x4c0 [libceph]
Sep 29 21:10:16 worker1004 kernel: crush_choose_firstn+0x48d/0x4c0 [libceph]
Sep 29 21:10:16 worker1004 kernel: crush_do_rule+0x28c/0x5a0 [libceph]
Sep 29 21:10:16 worker1004 kernel: ceph_pg_to_up_acting_osds+0x459/0x850 [libceph]
Sep 29 21:10:16 worker1004 kernel: calc_target+0x213/0x520 [libceph]
Sep 29 21:10:16 worker1004 kernel: ? ixgbe_xmit_frame_ring+0x362/0xe80 [ixgbe]
Sep 29 21:10:16 worker1004 kernel: ? 

[ceph-users] cephfs kernel client stability

2018-10-01 Thread Andras Pataki
We have so far been using ceph-fuse for mounting cephfs, but the small 
file performance of ceph-fuse is often problematic.  We've been testing 
the kernel client, and have seen some pretty bad crashes/hangs.


What is the policy on fixes to the kernel client?  Is only the latest 
stable kernel updated (4.18.x nowadays), or are fixes backported to LTS 
kernels also (like 4.14.x or 4.9.x for example)? I've seen various 
threads that certain newer features require pretty new kernels - but I'm 
wondering whether newer kernels are also required for better stability - 
or - in general, where the kernel client stability stands nowadays.


Here is an example of kernel hang with 4.14.67.  On heavy loads the 
machine isn't even pingable.


Sep 29 21:10:16 worker1004 kernel: INFO: rcu_sched self-detected stall 
on CPU
Sep 29 21:10:16 worker1004 kernel: #0111-...: (1 GPs behind) 
idle=bee/141/0 softirq=21319/21319 fqs=7499
Sep 29 21:10:16 worker1004 kernel: #011 (t=15000 jiffies g=13989 c=13988 
q=8334)

Sep 29 21:10:16 worker1004 kernel: NMI backtrace for cpu 1
Sep 29 21:10:16 worker1004 kernel: CPU: 1 PID: 19436 Comm: kworker/1:42 
Tainted: P    W  O    4.14.67 #1
Sep 29 21:10:16 worker1004 kernel: Hardware name: Dell Inc. PowerEdge 
C6320/082F9M, BIOS 2.6.0 10/27/2017
Sep 29 21:10:16 worker1004 kernel: Workqueue: ceph-msgr ceph_con_workfn 
[libceph]

Sep 29 21:10:16 worker1004 kernel: Call Trace:
Sep 29 21:10:16 worker1004 kernel: 
Sep 29 21:10:16 worker1004 kernel: dump_stack+0x46/0x5f
Sep 29 21:10:16 worker1004 kernel: nmi_cpu_backtrace+0xba/0xc0
Sep 29 21:10:16 worker1004 kernel: ? irq_force_complete_move+0xd0/0xd0
Sep 29 21:10:16 worker1004 kernel: nmi_trigger_cpumask_backtrace+0x8a/0xc0
Sep 29 21:10:16 worker1004 kernel: rcu_dump_cpu_stacks+0x81/0xb1
Sep 29 21:10:16 worker1004 kernel: rcu_check_callbacks+0x642/0x790
Sep 29 21:10:16 worker1004 kernel: ? update_wall_time+0x26d/0x6e0
Sep 29 21:10:16 worker1004 kernel: update_process_times+0x23/0x50
Sep 29 21:10:16 worker1004 kernel: tick_sched_timer+0x2f/0x60
Sep 29 21:10:16 worker1004 kernel: __hrtimer_run_queues+0xa3/0xf0
Sep 29 21:10:16 worker1004 kernel: hrtimer_interrupt+0x94/0x170
Sep 29 21:10:16 worker1004 kernel: smp_apic_timer_interrupt+0x4c/0x90
Sep 29 21:10:16 worker1004 kernel: apic_timer_interrupt+0x84/0x90
Sep 29 21:10:16 worker1004 kernel: 
Sep 29 21:10:16 worker1004 kernel: RIP: 0010:crush_hash32_3+0x1e5/0x270 
[libceph]
Sep 29 21:10:16 worker1004 kernel: RSP: 0018:c9000fdff5d8 EFLAGS: 
0a97 ORIG_RAX: ff10
Sep 29 21:10:16 worker1004 kernel: RAX: 06962033 RBX: 
883f6e7173c0 RCX: dcdcc373
Sep 29 21:10:16 worker1004 kernel: RDX: bd5425ca RSI: 
8a8b0b56 RDI: b1983b87
Sep 29 21:10:16 worker1004 kernel: RBP: 0023 R08: 
bd5425ca R09: 137904e9
Sep 29 21:10:16 worker1004 kernel: R10:  R11: 
0002 R12: b0f29f21
Sep 29 21:10:16 worker1004 kernel: R13: 000c R14: 
f0ae R15: 0023

Sep 29 21:10:16 worker1004 kernel: crush_bucket_choose+0x2ad/0x340 [libceph]
Sep 29 21:10:16 worker1004 kernel: crush_choose_firstn+0x1b0/0x4c0 [libceph]
Sep 29 21:10:16 worker1004 kernel: crush_choose_firstn+0x48d/0x4c0 [libceph]
Sep 29 21:10:16 worker1004 kernel: crush_do_rule+0x28c/0x5a0 [libceph]
Sep 29 21:10:16 worker1004 kernel: ceph_pg_to_up_acting_osds+0x459/0x850 
[libceph]

Sep 29 21:10:16 worker1004 kernel: calc_target+0x213/0x520 [libceph]
Sep 29 21:10:16 worker1004 kernel: ? ixgbe_xmit_frame_ring+0x362/0xe80 
[ixgbe]

Sep 29 21:10:16 worker1004 kernel: ? put_prev_entity+0x27/0x620
Sep 29 21:10:16 worker1004 kernel: ? pick_next_task_fair+0x1c7/0x520
Sep 29 21:10:16 worker1004 kernel: 
scan_requests.constprop.55+0x16f/0x280 [libceph]

Sep 29 21:10:16 worker1004 kernel: handle_one_map+0x175/0x200 [libceph]
Sep 29 21:10:16 worker1004 kernel: ceph_osdc_handle_map+0x390/0x850 
[libceph]

Sep 29 21:10:16 worker1004 kernel: ? ceph_x_encrypt+0x46/0x70 [libceph]
Sep 29 21:10:16 worker1004 kernel: dispatch+0x2ef/0xba0 [libceph]
Sep 29 21:10:16 worker1004 kernel: ? read_partial_message+0x215/0x880 
[libceph]

Sep 29 21:10:16 worker1004 kernel: ? inet_recvmsg+0x45/0xb0
Sep 29 21:10:16 worker1004 kernel: try_read+0x6f8/0x11b0 [libceph]
Sep 29 21:10:16 worker1004 kernel: ? sched_clock_cpu+0xc/0xa0
Sep 29 21:10:16 worker1004 kernel: ? put_prev_entity+0x27/0x620
Sep 29 21:10:16 worker1004 kernel: ? pick_next_task_fair+0x415/0x520
Sep 29 21:10:16 worker1004 kernel: ceph_con_workfn+0x9d/0x5a0 [libceph]
Sep 29 21:10:16 worker1004 kernel: process_one_work+0x127/0x290
Sep 29 21:10:16 worker1004 kernel: worker_thread+0x3f/0x3b0
Sep 29 21:10:16 worker1004 kernel: kthread+0xf2/0x130
Sep 29 21:10:16 worker1004 kernel: ? process_one_work+0x290/0x290
Sep 29 21:10:16 worker1004 kernel: ? __kthread_parkme+0x90/0x90
Sep 29 21:10:16 worker1004 kernel: ret_from_fork+0x1f/0x30

Andras


Re: [ceph-users] ceph-fuse using excessive memory

2018-09-25 Thread Andras Pataki

Hi Zheng,

Here is a debug dump: 
https://users.flatironinstitute.org/apataki/public_www/7f0011f676112cd4/
I have also included some other corresponding information (cache dump, 
mempool dump, perf dump and ceph.conf).  This corresponds to a 100GB 
ceph-fuse process while the client code is running.  I can reproduce 
this issue at will in about 6 to 8 hours of running one of our 
scientific jobs - and I can also run more instrumented/patched code to 
try things out.


Andras


On 9/24/18 10:06 PM, Yan, Zheng wrote:

On Tue, Sep 25, 2018 at 2:23 AM Andras Pataki
 wrote:

The whole cluster, including ceph-fuse is version 12.2.7.


If this issue happens again, please set the "debug_objectcacher" option of
ceph-fuse to 15 (for 30 seconds) and send the ceph-fuse log to us.

Regards
Yan, Zheng



Andras

On 9/24/18 6:27 AM, Yan, Zheng wrote:

On Fri, Sep 21, 2018 at 5:40 AM Andras Pataki
 wrote:

I've done some more experiments playing with client config parameters,
and it seems like the client_oc_size parameter is strongly correlated
with how big ceph-fuse grows.  With its default value of 200MB, ceph-fuse
gets to about 22GB of RSS; with our previous client_oc_size value of
2GB, the ceph-fuse process grows to 211GB.  After this size is reached,
its memory usage levels out.  So it seems like there is an issue with
memory accounting for the client cache - whatever client_oc_size is
set to, about 100 times that much memory gets used, at least in our case.


ceph-fuse version ?


Andras

On 9/19/18 6:06 PM, Andras Pataki wrote:

Hi Zheng,

It looks like the memory growth happens even with the simple messenger:

[root@worker1032 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
config get ms_type
{
  "ms_type": "simple"
}
[root@worker1032 ~]# ps -auxwww | grep ceph-fuse
root  179133 82.2 13.5 77281896 71644120 ?   Sl   12:48 258:09
ceph-fuse --id=admin --conf=/etc/ceph/ceph.conf /mnt/ceph -o
rw,fsname=ceph,dev,suid
[root@worker1032 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
dump_mempools
{
... snip ...
  "buffer_anon": {
  "items": 16753337,
  "bytes": 68782648777
  },
  "buffer_meta": {
  "items": 771,
  "bytes": 67848
  },
... snip ...
  "osdmap": {
  "items": 28582,
  "bytes": 431840
  },
... snip ...

  "total": {
  "items": 16782690,
  "bytes": 68783148465
  }
}
Andras


On 9/6/18 11:58 PM, Yan, Zheng wrote:

Could you please try making ceph-fuse use the simple messenger (add "ms
type = simple" to the client section of ceph.conf).

Regards
Yan, Zheng



On Wed, Sep 5, 2018 at 10:09 PM Sage Weil  wrote:

On Wed, 5 Sep 2018, Andras Pataki wrote:

Hi cephers,

Every so often we have a ceph-fuse process that grows to rather
large size (up
to eating up the whole memory of the machine).  Here is an example
of a 200GB
RSS size ceph-fuse instance:

# ceph daemon /var/run/ceph/ceph-client.admin.asok dump_mempools
{
   "bloom_filter": {
   "items": 0,
   "bytes": 0
   },
   "bluestore_alloc": {
   "items": 0,
   "bytes": 0
   },
   "bluestore_cache_data": {
   "items": 0,
   "bytes": 0
   },
   "bluestore_cache_onode": {
   "items": 0,
   "bytes": 0
   },
   "bluestore_cache_other": {
   "items": 0,
   "bytes": 0
   },
   "bluestore_fsck": {
   "items": 0,
   "bytes": 0
   },
   "bluestore_txc": {
   "items": 0,
   "bytes": 0
   },
   "bluestore_writing_deferred": {
   "items": 0,
   "bytes": 0
   },
   "bluestore_writing": {
   "items": 0,
   "bytes": 0
   },
   "bluefs": {
   "items": 0,
   "bytes": 0
   },
   "buffer_anon": {
   "items": 51534897,
   "bytes": 207321872398
   },
   "buffer_meta": {
   "items": 64,
   "bytes": 5632
   },
   "osd": {
   "items": 0,
   "bytes": 0
   },
   "osd_mapbl": {
   "items": 0,
   "bytes": 0
   },
   "osd_pglog": {
   "items": 0,
   "bytes": 0
   },
   "osdmap": {
   "items": 28593,
   "bytes": 431872
   },
   "osdmap_mapping": {
   "items": 0,
   "

Re: [ceph-users] ceph-fuse using excessive memory

2018-09-24 Thread Andras Pataki

The whole cluster, including ceph-fuse is version 12.2.7.

Andras

On 9/24/18 6:27 AM, Yan, Zheng wrote:

On Fri, Sep 21, 2018 at 5:40 AM Andras Pataki
 wrote:

I've done some more experiments playing with client config parameters,
and it seems like the client_oc_size parameter is strongly correlated
with how big ceph-fuse grows.  With its default value of 200MB, ceph-fuse
gets to about 22GB of RSS; with our previous client_oc_size value of
2GB, the ceph-fuse process grows to 211GB.  After this size is reached,
its memory usage levels out.  So it seems like there is an issue with
memory accounting for the client cache - whatever client_oc_size is
set to, about 100 times that much memory gets used, at least in our case.


ceph-fuse version ?


Andras

On 9/19/18 6:06 PM, Andras Pataki wrote:

Hi Zheng,

It looks like the memory growth happens even with the simple messenger:

[root@worker1032 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
config get ms_type
{
 "ms_type": "simple"
}
[root@worker1032 ~]# ps -auxwww | grep ceph-fuse
root  179133 82.2 13.5 77281896 71644120 ?   Sl   12:48 258:09
ceph-fuse --id=admin --conf=/etc/ceph/ceph.conf /mnt/ceph -o
rw,fsname=ceph,dev,suid
[root@worker1032 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
dump_mempools
{
... snip ...
 "buffer_anon": {
 "items": 16753337,
 "bytes": 68782648777
 },
 "buffer_meta": {
 "items": 771,
 "bytes": 67848
 },
... snip ...
 "osdmap": {
 "items": 28582,
 "bytes": 431840
 },
... snip ...

 "total": {
 "items": 16782690,
 "bytes": 68783148465
 }
}
Andras


On 9/6/18 11:58 PM, Yan, Zheng wrote:

Could you please try making ceph-fuse use the simple messenger (add "ms
type = simple" to the client section of ceph.conf).

Regards
Yan, Zheng



On Wed, Sep 5, 2018 at 10:09 PM Sage Weil  wrote:

On Wed, 5 Sep 2018, Andras Pataki wrote:

Hi cephers,

Every so often we have a ceph-fuse process that grows to rather
large size (up
to eating up the whole memory of the machine).  Here is an example
of a 200GB
RSS size ceph-fuse instance:

# ceph daemon /var/run/ceph/ceph-client.admin.asok dump_mempools
{
  "bloom_filter": {
  "items": 0,
  "bytes": 0
  },
  "bluestore_alloc": {
  "items": 0,
  "bytes": 0
  },
  "bluestore_cache_data": {
  "items": 0,
  "bytes": 0
  },
  "bluestore_cache_onode": {
  "items": 0,
  "bytes": 0
  },
  "bluestore_cache_other": {
  "items": 0,
  "bytes": 0
  },
  "bluestore_fsck": {
  "items": 0,
  "bytes": 0
  },
  "bluestore_txc": {
  "items": 0,
  "bytes": 0
  },
  "bluestore_writing_deferred": {
  "items": 0,
  "bytes": 0
  },
  "bluestore_writing": {
  "items": 0,
  "bytes": 0
  },
  "bluefs": {
  "items": 0,
  "bytes": 0
  },
  "buffer_anon": {
  "items": 51534897,
  "bytes": 207321872398
  },
  "buffer_meta": {
  "items": 64,
  "bytes": 5632
  },
  "osd": {
  "items": 0,
  "bytes": 0
  },
  "osd_mapbl": {
  "items": 0,
  "bytes": 0
  },
  "osd_pglog": {
  "items": 0,
  "bytes": 0
  },
  "osdmap": {
  "items": 28593,
  "bytes": 431872
  },
  "osdmap_mapping": {
  "items": 0,
  "bytes": 0
  },
  "pgmap": {
  "items": 0,
  "bytes": 0
  },
  "mds_co": {
  "items": 0,
  "bytes": 0
  },
  "unittest_1": {
  "items": 0,
  "bytes": 0
  },
  "unittest_2": {
  "items": 0,
  "bytes": 0
  },
  "total": {
  "items": 51563554,
  "bytes": 207322309902
  }
}

The general cache size looks like this (if it is helpful I can put
a whole
cache dump somewhere):

# ceph daemon /var/run/ceph/ceph-client.admin.asok dump_cache |
grep path | wc
-l
84085
# ceph daemon /var/run/ceph/ceph-client.admin.asok dump_cache |
grep name | wc
-l
168

Re: [ceph-users] ceph-fuse using excessive memory

2018-09-20 Thread Andras Pataki
I've done some more experiments playing with client config parameters,
and it seems like the client_oc_size parameter is strongly correlated
with how big ceph-fuse grows.  With its default value of 200MB, ceph-fuse
gets to about 22GB of RSS; with our previous client_oc_size value of
2GB, the ceph-fuse process grows to 211GB.  After this size is reached,
its memory usage levels out.  So it seems like there is an issue with
memory accounting for the client cache - whatever client_oc_size is
set to, about 100 times that much memory gets used, at least in our case.


Andras

On 9/19/18 6:06 PM, Andras Pataki wrote:

Hi Zheng,

It looks like the memory growth happens even with the simple messenger:

[root@worker1032 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok 
config get ms_type

{
    "ms_type": "simple"
}
[root@worker1032 ~]# ps -auxwww | grep ceph-fuse
root  179133 82.2 13.5 77281896 71644120 ?   Sl   12:48 258:09 
ceph-fuse --id=admin --conf=/etc/ceph/ceph.conf /mnt/ceph -o 
rw,fsname=ceph,dev,suid
[root@worker1032 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok 
dump_mempools

{
... snip ...
    "buffer_anon": {
    "items": 16753337,
    "bytes": 68782648777
    },
    "buffer_meta": {
    "items": 771,
    "bytes": 67848
    },
... snip ...
    "osdmap": {
    "items": 28582,
    "bytes": 431840
    },
... snip ...

    "total": {
    "items": 16782690,
    "bytes": 68783148465
    }
}
Andras


On 9/6/18 11:58 PM, Yan, Zheng wrote:

Could you please try making ceph-fuse use the simple messenger (add "ms
type = simple" to the client section of ceph.conf).

Regards
Yan, Zheng



On Wed, Sep 5, 2018 at 10:09 PM Sage Weil  wrote:

On Wed, 5 Sep 2018, Andras Pataki wrote:

Hi cephers,

Every so often we have a ceph-fuse process that grows to rather 
large size (up
to eating up the whole memory of the machine).  Here is an example 
of a 200GB

RSS size ceph-fuse instance:

# ceph daemon /var/run/ceph/ceph-client.admin.asok dump_mempools
{
 "bloom_filter": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_alloc": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_cache_data": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_cache_onode": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_cache_other": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_fsck": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_txc": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_writing_deferred": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_writing": {
 "items": 0,
 "bytes": 0
 },
 "bluefs": {
 "items": 0,
 "bytes": 0
 },
 "buffer_anon": {
 "items": 51534897,
 "bytes": 207321872398
 },
 "buffer_meta": {
 "items": 64,
 "bytes": 5632
 },
 "osd": {
 "items": 0,
 "bytes": 0
 },
 "osd_mapbl": {
 "items": 0,
 "bytes": 0
 },
 "osd_pglog": {
 "items": 0,
 "bytes": 0
 },
 "osdmap": {
 "items": 28593,
 "bytes": 431872
 },
 "osdmap_mapping": {
 "items": 0,
 "bytes": 0
 },
 "pgmap": {
 "items": 0,
 "bytes": 0
 },
 "mds_co": {
 "items": 0,
 "bytes": 0
 },
 "unittest_1": {
 "items": 0,
 "bytes": 0
 },
 "unittest_2": {
 "items": 0,
 "bytes": 0
 },
 "total": {
 "items": 51563554,
 "bytes": 207322309902
 }
}

The general cache size looks like this (if it is helpful I can put 
a whole

cache dump somewhere):

# ceph daemon /var/run/ceph/ceph-client.admin.asok dump_cache | 
grep path | wc

-l
84085
# ceph daemon /var/run/ceph/ceph-client.admin.asok dump_cache | 
grep name | wc

-l
168186

Any ideas what 'buffer_anon' is and what could be eating up the 
200GB of

RAM?

buffer_anon is memory consumed by the bufferlist class that hasn't been
explicitly put into a separate mempool category.  The question is
where/why are buffers getting pinned in memo

Re: [ceph-users] ceph-fuse using excessive memory

2018-09-19 Thread Andras Pataki

Hi Zheng,

It looks like the memory growth happens even with the simple messenger:

[root@worker1032 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok 
config get ms_type

{
    "ms_type": "simple"
}
[root@worker1032 ~]# ps -auxwww | grep ceph-fuse
root  179133 82.2 13.5 77281896 71644120 ?   Sl   12:48 258:09 
ceph-fuse --id=admin --conf=/etc/ceph/ceph.conf /mnt/ceph -o 
rw,fsname=ceph,dev,suid
[root@worker1032 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok 
dump_mempools

{
... snip ...
    "buffer_anon": {
    "items": 16753337,
    "bytes": 68782648777
    },
    "buffer_meta": {
    "items": 771,
    "bytes": 67848
    },
... snip ...
    "osdmap": {
    "items": 28582,
    "bytes": 431840
    },
... snip ...

    "total": {
    "items": 16782690,
    "bytes": 68783148465
    }
}
Andras


On 9/6/18 11:58 PM, Yan, Zheng wrote:

Could you please try making ceph-fuse use the simple messenger (add "ms
type = simple" to the client section of ceph.conf).

Regards
Yan, Zheng



On Wed, Sep 5, 2018 at 10:09 PM Sage Weil  wrote:

On Wed, 5 Sep 2018, Andras Pataki wrote:

Hi cephers,

Every so often we have a ceph-fuse process that grows to rather large size (up
to eating up the whole memory of the machine).  Here is an example of a 200GB
RSS size ceph-fuse instance:

# ceph daemon /var/run/ceph/ceph-client.admin.asok dump_mempools
{
 "bloom_filter": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_alloc": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_cache_data": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_cache_onode": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_cache_other": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_fsck": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_txc": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_writing_deferred": {
 "items": 0,
 "bytes": 0
 },
 "bluestore_writing": {
 "items": 0,
 "bytes": 0
 },
 "bluefs": {
 "items": 0,
 "bytes": 0
 },
 "buffer_anon": {
 "items": 51534897,
 "bytes": 207321872398
 },
 "buffer_meta": {
 "items": 64,
 "bytes": 5632
 },
 "osd": {
 "items": 0,
 "bytes": 0
 },
 "osd_mapbl": {
 "items": 0,
 "bytes": 0
 },
 "osd_pglog": {
 "items": 0,
 "bytes": 0
 },
 "osdmap": {
 "items": 28593,
 "bytes": 431872
 },
 "osdmap_mapping": {
 "items": 0,
 "bytes": 0
 },
 "pgmap": {
 "items": 0,
 "bytes": 0
 },
 "mds_co": {
 "items": 0,
 "bytes": 0
 },
 "unittest_1": {
 "items": 0,
 "bytes": 0
 },
 "unittest_2": {
 "items": 0,
 "bytes": 0
 },
 "total": {
 "items": 51563554,
 "bytes": 207322309902
 }
}

The general cache size looks like this (if it is helpful I can put a whole
cache dump somewhere):

# ceph daemon /var/run/ceph/ceph-client.admin.asok dump_cache | grep path | wc
-l
84085
# ceph daemon /var/run/ceph/ceph-client.admin.asok dump_cache | grep name | wc
-l
168186

Any ideas what 'buffer_anon' is and what could be eating up the 200GB of
RAM?

buffer_anon is memory consumed by the bufferlist class that hasn't been
explicitly put into a separate mempool category.  The question is
where/why are buffers getting pinned in memory.  Can you dump the
perfcounters?  That might give some hint.

My guess is a leak, or a problem with the ObjectCacher code that is
preventing it from trimming older buffers.

How reproducible is the situation?  Any idea what workloads trigger it?

Thanks!
sage


We are running with a few ceph-fuse specific parameters increased in
ceph.conf:

# Description:  Set the number of inodes that the client keeps in
the metadata cache.
# Default:  16384
client_cache_size = 262144

# Description:  Set the maximum number of dirty bytes in the object
cache.
# Default:  104857600 (100MB)
client

Re: [ceph-users] ceph-fuse using excessive memory

2018-09-06 Thread Andras Pataki
It looks like I have a process that can reproduce the problem at will.  
Attached is a quick plot of the RSS memory usage of ceph-fuse over a 
period of 13-14 hours or so (the x axis is minutes, the y axis is 
bytes).  It looks like the process steadily grows up to about 200GB and 
then its memory usage stabilizes.  So something comes to equilibrium at 
the 200GB size.  What would be a good way to further understand where 
the memory goes?  I could even run some instrumented binary if needed to 
further pin down what is happening.  As I mentioned below, we are 
running with somewhat increased memory related settings in 
/etc/ceph.conf, but based on my understanding of the parameters, they 
shouldn't amount to such high memory usage.
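
For anyone who wants to reproduce the plot, sampling the process with
something like the following is enough (a quick sketch, assuming the
default admin socket path used elsewhere in this thread and a single
ceph-fuse instance on the node):

    # sample ceph-fuse RSS (KB) and the buffer_anon mempool bytes once a minute
    while sleep 60; do
        rss=$(ps -o rss= -C ceph-fuse)
        anon=$(ceph daemon /var/run/ceph/ceph-client.admin.asok dump_mempools \
                 | grep -A2 '"buffer_anon"' | grep '"bytes"' | tr -dc '0-9')
        echo "$(date +%s) $rss $anon"
    done >> ceph-fuse-mem.log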


Andras


On 09/05/2018 10:15 AM, Andras Pataki wrote:
Below are the performance counters.  Some scientific workflows trigger 
this - some parts of them are quite data intensive - they process 
thousands of files over many hours to days.  The 200GB ceph-fuse got 
there in about 3 days.  I'm keeping the node alive for now in case we 
can extract some more definitive info on what is happening there.


Andras


# ceph daemon /var/run/ceph/ceph-client.admin.asok perf dump
{
    "AsyncMessenger::Worker-0": {
    "msgr_recv_messages": 37730,
    "msgr_send_messages": 37731,
    "msgr_recv_bytes": 1121379127,
    "msgr_send_bytes": 11913693154,
    "msgr_created_connections": 75333,
    "msgr_active_connections": 730,
    "msgr_running_total_time": 642.152166956,
    "msgr_running_send_time": 536.723862752,
    "msgr_running_recv_time": 25.429112242,
    "msgr_running_fast_dispatch_time": 63.814291954
    },
    "AsyncMessenger::Worker-1": {
    "msgr_recv_messages": 38507,
    "msgr_send_messages": 38467,
    "msgr_recv_bytes": 1240174043,
    "msgr_send_bytes": 11673685736,
    "msgr_created_connections": 75479,
    "msgr_active_connections": 729,
    "msgr_running_total_time": 628.670562086,
    "msgr_running_send_time": 523.772820969,
    "msgr_running_recv_time": 25.902871268,
    "msgr_running_fast_dispatch_time": 62.375965165
    },
    "AsyncMessenger::Worker-2": {
    "msgr_recv_messages": 597697,
    "msgr_send_messages": 504640,
    "msgr_recv_bytes": 1314713236,
    "msgr_send_bytes": 11880445442,
    "msgr_created_connections": 75338,
    "msgr_active_connections": 728,
    "msgr_running_total_time": 711.909282325,
    "msgr_running_send_time": 556.195748166,
    "msgr_running_recv_time": 127.267332682,
    "msgr_running_fast_dispatch_time": 62.209721085
    },
    "client": {
    "reply": {
    "avgcount": 236795,
    "sum": 6177.205536940,
    "avgtime": 0.026086722
    },
    "lat": {
    "avgcount": 236795,
    "sum": 6177.205536940,
    "avgtime": 0.026086722
    },
    "wrlat": {
    "avgcount": 857828153,
    "sum": 8413.835066735,
    "avgtime": 0.09808
    }
    },
    "objectcacher-libcephfs": {
    "cache_ops_hit": 4160412,
    "cache_ops_miss": 4887,
    "cache_bytes_hit": 3247294145494,
    "cache_bytes_miss": 12914144260,
    "data_read": 48923557765,
    "data_written": 35292875783,
    "data_flushed": 35292681606,
    "data_overwritten_while_flushing": 0,
    "write_ops_blocked": 0,
    "write_bytes_blocked": 0,
    "write_time_blocked": 0.0
    },
    "objecter": {
    "op_active": 0,
    "op_laggy": 0,
    "op_send": 111268,
    "op_send_bytes": 35292681606,
    "op_resend": 0,
    "op_reply": 111268,
    "op": 111268,
    "op_r": 2193,
    "op_w": 109075,
    "op_rmw": 0,
    "op_pg": 0,
    "osdop_stat": 2,
    "osdop_create": 2,
    "osdop_read": 2193,
    "osdop_write": 109071,
    "osdop_writefull": 0,
    "osdop_writesame": 0,
    "osdop_append": 0,
    "osdop_zero": 0,
    "osdop_truncate": 0,
    "osdop_delete": 0,
    "osdop_mapext": 0,
    "osdop_sparse_read": 0,

Re: [ceph-users] ceph-fuse using excessive memory

2018-09-05 Thread Andras Pataki
quot;poolop_send": 0,
    "poolop_resend": 0,
    "poolstat_active": 0,
    "poolstat_send": 0,
    "poolstat_resend": 0,
    "statfs_active": 0,
    "statfs_send": 1348,
    "statfs_resend": 0,
    "command_active": 0,
    "command_send": 0,
    "command_resend": 0,
    "map_epoch": 1079783,
    "map_full": 0,
    "map_inc": 632,
    "osd_sessions": 2160,
    "osd_session_open": 226144,
    "osd_session_close": 223984,
    "osd_laggy": 0,
    "omap_wr": 0,
    "omap_rd": 0,
    "omap_del": 0
    },
    "throttle-msgr_dispatch_throttler-client": {
    "val": 0,
    "max": 104857600,
    "get_started": 0,
    "get": 673934,
    "get_sum": 3626395290,
    "get_or_fail_fail": 0,
    "get_or_fail_success": 673934,
    "take": 0,
    "take_sum": 0,
    "put": 673934,
    "put_sum": 3626395290,
    "wait": {
    "avgcount": 0,
    "sum": 0.0,
    "avgtime": 0.0
    }
    },
    "throttle-objecter_bytes": {
    "val": 0,
    "max": 104857600,
    "get_started": 0,
    "get": 0,
    "get_sum": 0,
    "get_or_fail_fail": 0,
    "get_or_fail_success": 0,
    "take": 111268,
    "take_sum": 38456168409,
    "put": 111264,
    "put_sum": 38456168409,
    "wait": {
    "avgcount": 0,
    "sum": 0.0,
    "avgtime": 0.0
    }
    },
    "throttle-objecter_ops": {
    "val": 0,
    "max": 1024,
    "get_started": 0,
    "get": 0,
    "get_sum": 0,
    "get_or_fail_fail": 0,
    "get_or_fail_success": 0,
    "take": 111268,
    "take_sum": 111268,
    "put": 111268,
    "put_sum": 111268,
    "wait": {
    "avgcount": 0,
    "sum": 0.0,
    "avgtime": 0.0
    }
    }
}


On 09/05/2018 10:00 AM, Sage Weil wrote:

On Wed, 5 Sep 2018, Andras Pataki wrote:

Hi cephers,

Every so often we have a ceph-fuse process that grows to rather large size (up
to eating up the whole memory of the machine).  Here is an example of a 200GB
RSS size ceph-fuse instance:

# ceph daemon /var/run/ceph/ceph-client.admin.asok dump_mempools
{
     "bloom_filter": {
     "items": 0,
     "bytes": 0
     },
     "bluestore_alloc": {
     "items": 0,
     "bytes": 0
     },
     "bluestore_cache_data": {
     "items": 0,
     "bytes": 0
     },
     "bluestore_cache_onode": {
     "items": 0,
     "bytes": 0
     },
     "bluestore_cache_other": {
     "items": 0,
     "bytes": 0
     },
     "bluestore_fsck": {
     "items": 0,
     "bytes": 0
     },
     "bluestore_txc": {
     "items": 0,
     "bytes": 0
     },
     "bluestore_writing_deferred": {
     "items": 0,
     "bytes": 0
     },
     "bluestore_writing": {
     "items": 0,
     "bytes": 0
     },
     "bluefs": {
     "items": 0,
     "bytes": 0
     },
     "buffer_anon": {
     "items": 51534897,
     "bytes": 207321872398
     },
     "buffer_meta": {
     "items": 64,
     "bytes": 5632
     },
     "osd": {
     "items": 0,
     "bytes": 0
     },
     "osd_mapbl": {
     "items": 0,
     "bytes": 0
     },
     "osd_pglog": {
     "items": 0,
     "bytes": 0
     },
     "osdmap": {
     "items": 28593,
     "bytes": 431872
     },
     "osdmap_mapping": {
     "items": 0,
     "bytes": 0
     },
     "pgmap": {
     "items": 0,
     "bytes": 0
     },
     "mds_co": {
     "items": 0,
     "bytes": 0
     },
     "unittest_1": {
     "item

[ceph-users] ceph-fuse using excessive memory

2018-09-05 Thread Andras Pataki

Hi cephers,

Every so often we have a ceph-fuse process that grows to rather large 
size (up to eating up the whole memory of the machine).  Here is an 
example of a 200GB RSS size ceph-fuse instance:


# ceph daemon /var/run/ceph/ceph-client.admin.asok dump_mempools
{
    "bloom_filter": {
    "items": 0,
    "bytes": 0
    },
    "bluestore_alloc": {
    "items": 0,
    "bytes": 0
    },
    "bluestore_cache_data": {
    "items": 0,
    "bytes": 0
    },
    "bluestore_cache_onode": {
    "items": 0,
    "bytes": 0
    },
    "bluestore_cache_other": {
    "items": 0,
    "bytes": 0
    },
    "bluestore_fsck": {
    "items": 0,
    "bytes": 0
    },
    "bluestore_txc": {
    "items": 0,
    "bytes": 0
    },
    "bluestore_writing_deferred": {
    "items": 0,
    "bytes": 0
    },
    "bluestore_writing": {
    "items": 0,
    "bytes": 0
    },
    "bluefs": {
    "items": 0,
    "bytes": 0
    },
    "buffer_anon": {
    "items": 51534897,
    "bytes": 207321872398
    },
    "buffer_meta": {
    "items": 64,
    "bytes": 5632
    },
    "osd": {
    "items": 0,
    "bytes": 0
    },
    "osd_mapbl": {
    "items": 0,
    "bytes": 0
    },
    "osd_pglog": {
    "items": 0,
    "bytes": 0
    },
    "osdmap": {
    "items": 28593,
    "bytes": 431872
    },
    "osdmap_mapping": {
    "items": 0,
    "bytes": 0
    },
    "pgmap": {
    "items": 0,
    "bytes": 0
    },
    "mds_co": {
    "items": 0,
    "bytes": 0
    },
    "unittest_1": {
    "items": 0,
    "bytes": 0
    },
    "unittest_2": {
    "items": 0,
    "bytes": 0
    },
    "total": {
    "items": 51563554,
    "bytes": 207322309902
    }
}

The general cache size looks like this (if it is helpful I can put a 
whole cache dump somewhere):


# ceph daemon /var/run/ceph/ceph-client.admin.asok dump_cache | grep 
path | wc -l

84085
# ceph daemon /var/run/ceph/ceph-client.admin.asok dump_cache | grep 
name | wc -l

168186

Any ideas what 'buffer_anon' is and what could be eating up the 200GB of 
RAM?


We are running with a few ceph-fuse specific parameters increased in 
ceph.conf:


   # Description:  Set the number of inodes that the client keeps in
   the metadata cache.
   # Default:  16384
   client_cache_size = 262144

   # Description:  Set the maximum number of dirty bytes in the object
   cache.
   # Default:  104857600 (100MB)
   client_oc_max_dirty = 536870912

   # Description:  Set the maximum number of objects in the object cache.
   # Default:  1000
   client_oc_max_objects = 8192

   # Description:  Set how many bytes of data will the client cache.
   # Default:  209715200 (200 MB)
   client_oc_size = 2147483640

   # Description:  Set the maximum number of bytes that the kernel
   reads ahead for future read operations. Overridden by the
   client_readahead_max_periods setting.
   # Default:  0 (unlimited)
   #client_readahead_max_bytes = 67108864

   # Description:  Set the number of file layout periods (object size *
   number of stripes) that the kernel reads ahead. Overrides the
   client_readahead_max_bytes setting.
   # Default:  4
   client_readahead_max_periods = 64

   # Description:  Set the minimum number bytes that the kernel reads
   ahead.
   # Default:  131072 (128KB)
   client_readahead_min = 4194304


We are running a 12.2.7 ceph cluster, and the cluster is otherwise healthy.

Any hints would be appreciated.  Thanks,

Andras



Re: [ceph-users] Stability Issue with 52 OSD hosts

2018-08-24 Thread Andras Pataki
We pin half the OSDs to each socket (and to the corresponding memory).  
Since the disk controller and the network card are connected only to one 
socket, this still probably produces quite a bit of QPI traffic.
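
As a sketch of how the pinning can be done (the OSD id, drop-in path and
CPU node below are examples, not our exact setup), a per-OSD systemd
drop-in that wraps the daemon in numactl works:

    # /etc/systemd/system/ceph-osd@12.service.d/numa.conf  (an OSD assigned to socket 0)
    [Service]
    ExecStart=
    ExecStart=/usr/bin/numactl --cpunodebind=0 --membind=0 \
        /usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph

The second ExecStart should mirror whatever the stock ceph-osd@.service
ships, just prefixed with numactl; then daemon-reload and restart the OSD.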
It is also worth investigating how the network does under high load.  We 
did run into problems where 40Gbps cards dropped packets heavily under load.


Andras


On 08/24/2018 05:16 AM, Marc Roos wrote:
  
Can this be related to NUMA issues? I also have dual-processor nodes,
and was wondering if there is some guide on how to optimize for NUMA.




-Original Message-
From: Tyler Bishop [mailto:tyler.bis...@beyondhosting.net]
Sent: vrijdag 24 augustus 2018 3:11
To: Andras Pataki
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Stability Issue with 52 OSD hosts

Thanks for the info. I was investigating bluestore as well.  My hosts
don't go unresponsive, but I do see parallel I/O slow down.

On Thu, Aug 23, 2018, 8:02 PM Andras Pataki
 wrote:


We are also running some fairly dense nodes with CentOS 7.4 and ran
into
similar problems.  The nodes ran filestore OSDs (Jewel, then
Luminous).
Sometimes a node would be so unresponsive that one couldn't even
ssh to
it (even though the root disk was a physically separate drive on a
separate controller from the OSD drives).  Often these would
coincide
with kernel stack traces about hung tasks. Initially we did blame
high
load, etc. from all the OSDs.

But then we benchmarked the nodes independently of ceph (with
iozone and
such) and noticed problems there too.  When we started a few dozen
iozone processes on separate JBOD drives with xfs, some didn't even

start and write a single byte for minutes.  The conclusion we came
to
was that there is some interference among a lot of mounted xfs file

systems in the Red Hat 3.10 kernels.  Some kind of central lock
that
prevents dozens of xfs file systems from running in parallel.  When
we
do I/O directly to raw devices in parallel, we saw no problems (no
high
loads, etc.).  So we built a newer kernel, and the situation got
better.  4.4 is already much better, nowadays we are testing moving
to 4.14.

Also, migrating to bluestore significantly reduced the load on
these
nodes too.  At busy times, the filestore host loads were 20-30,
even
higher (on a 28 core node), while the bluestore nodes hummed along
at a
load of perhaps 6 or 8.  This also confirms that somehow lots of xfs

mounts don't work in parallel.

Andras


On 08/23/2018 03:24 PM, Tyler Bishop wrote:
> Yes I've reviewed all the logs from monitor and host.   I am not
> getting useful errors (or any) in dmesg or general messages.
>
> I have 2 ceph clusters, the other cluster is 300 SSD and i never
have
> issues like this.   That's why Im looking for help.
>
> On Thu, Aug 23, 2018 at 3:22 PM Alex Gorbachev
 wrote:
>> On Wed, Aug 22, 2018 at 11:39 PM Tyler Bishop
>>  wrote:
>>> During high load testing I'm only seeing user and sys cpu load
around 60%... my load doesn't seem to be anything crazy on the host and
iowait stays between 6 and 10%.  I have very good `ceph osd perf`
numbers too.
>>>
>>> I am using 10.2.11 Jewel.
>>>
>>>
>>> On Wed, Aug 22, 2018 at 11:30 PM Christian Balzer
 wrote:
>>>> Hello,
>>>>
>>>> On Wed, 22 Aug 2018 23:00:24 -0400 Tyler Bishop wrote:
>>>>
>>>>> Hi,   I've been fighting to get good stability on my cluster
for about
>>>>> 3 weeks now.  I am running into intermittent issues with OSD
flapping
>>>>> marking other OSD down then going back to a stable state for
hours and
>>>>> days.
>>>>>
>>>>> The cluster is 4x Cisco UCS S3260 with dual E5-2660, 256GB
ram, 40G
>>>>> Network to 40G Brocade VDX Switches.  The OSD are 6TB HGST
SAS drives
>>>>> with 400GB HGST SAS 12G SSDs.   My configuration is 4
journals per
>>>>> host with 12 disk per journal for a total of 56 disk per
system and 52
>>>>> OSD.
>>>>>
>>>> Any denser and you'd have a storage black hole.
>>>>
>>>> You already pointed your finger in the (or at least one) right
direction
>>>> and everybody will agree that this setup is woefully
underpowered in the
>>>> CPU department.
>>>>
   

Re: [ceph-users] Stability Issue with 52 OSD hosts

2018-08-23 Thread Andras Pataki
We are also running some fairly dense nodes with CentOS 7.4 and ran into 
similar problems.  The nodes ran filestore OSDs (Jewel, then Luminous).  
Sometimes a node would be so unresponsive that one couldn't even ssh to 
it (even though the root disk was a physically separate drive on a 
separate controller from the OSD drives).  Often these would coincide 
with kernel stack traces about hung tasks. Initially we did blame high 
load, etc. from all the OSDs.


But then we benchmarked the nodes independently of ceph (with iozone and 
such) and noticed problems there too.  When we started a few dozen 
iozone processes on separate JBOD drives with xfs, some didn't even 
start and write a single byte for minutes.  The conclusion we came to 
was that there is some interference among a lot of mounted xfs file 
systems in the Red Hat 3.10 kernels.  Some kind of central lock that 
prevents dozens of xfs file systems from running in parallel.  When we 
do I/O directly to raw devices in parallel, we saw no problems (no high 
loads, etc.).  So we built a newer kernel, and the situation got 
better.  4.4 is already much better, nowadays we are testing moving to 4.14.
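
The benchmark itself was nothing fancy - something along these lines, run
with the OSD daemons stopped (a rough sketch; mount points and sizes are
placeholders):

    # one iozone sequential write test per OSD mount, all started in parallel
    for d in /var/lib/ceph/osd/ceph-*; do
        iozone -i 0 -r 1m -s 8g -f "$d/iozone.tmp" > "/tmp/iozone.$(basename "$d").log" 2>&1 &
    done
    wait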


Also, migrating to bluestore significantly reduced the load on these 
nodes too.  At busy times, the filestore host loads were 20-30, even 
higher (on a 28 core node), while the bluestore nodes hummed along at a 
load of perhaps 6 or 8.  This also confirms that somehow lots of xfs 
mounts don't work in parallel.


Andras


On 08/23/2018 03:24 PM, Tyler Bishop wrote:

Yes I've reviewed all the logs from monitor and host.   I am not
getting useful errors (or any) in dmesg or general messages.

I have 2 ceph clusters, the other cluster is 300 SSD and i never have
issues like this.   That's why Im looking for help.

On Thu, Aug 23, 2018 at 3:22 PM Alex Gorbachev  wrote:

On Wed, Aug 22, 2018 at 11:39 PM Tyler Bishop
 wrote:

During high load testing I'm only seeing user and sys cpu load around 60%... my 
load doesn't seem to be anything crazy on the host and iowait stays between 6 
and 10%.  I have very good `ceph osd perf` numbers too.

I am using 10.2.11 Jewel.


On Wed, Aug 22, 2018 at 11:30 PM Christian Balzer  wrote:

Hello,

On Wed, 22 Aug 2018 23:00:24 -0400 Tyler Bishop wrote:


Hi,   I've been fighting to get good stability on my cluster for about
3 weeks now.  I am running into intermittent issues with OSD flapping
marking other OSD down then going back to a stable state for hours and
days.

The cluster is 4x Cisco UCS S3260 with dual E5-2660, 256GB ram, 40G
Network to 40G Brocade VDX Switches.  The OSD are 6TB HGST SAS drives
with 400GB HGST SAS 12G SSDs.   My configuration is 4 journals per
host with 12 disk per journal for a total of 56 disk per system and 52
OSD.


Any denser and you'd have a storage black hole.

You already pointed your finger in the (or at least one) right direction
and everybody will agree that this setup is woefully underpowered in the
CPU department.


I am using CentOS 7 with kernel 3.10 and the redhat tuned-adm profile
for throughput-performance enabled.


Ceph version would be interesting as well...


I have these sysctls set:

kernel.pid_max = 4194303
fs.file-max = 6553600
vm.swappiness = 0
vm.vfs_cache_pressure = 50
vm.min_free_kbytes = 3145728

I feel like my issue is directly related to the high number of OSD per
host but I'm not sure what issue I'm really running into.   I believe
that I have ruled out network issues: I am able to get 38Gbit
consistently via iperf testing, and jumbo-frame pings succeed with the
do-not-fragment flag set and an 8972-byte packet size.


The fact that it all works for days at a time suggests this as well, but
you need to verify these things when they're happening.


 From FIO testing I seem to be able to get 150-200k iops write from my
rbd clients on 1gbit networking... This is about what I expected due
to the write penalty and my underpowered CPU for the number of OSD.

I get these messages which I believe are normal?
2018-08-22 10:33:12.754722 7f7d009f5700  0 -- 10.20.136.8:6894/718902

10.20.136.10:6876/490574 pipe(0x55aed77fd400 sd=192 :40502 s=2

pgs=1084 cs=53 l=0 c=0x55aed805bc80).fault with nothing to send, going
to standby


Ignore.


Then randomly I'll get a storm of this every few days for 20 minutes or so:
2018-08-22 15:48:32.631186 7f44b7514700 -1 osd.127 37333
heartbeat_check: no reply from 10.20.142.11:6861 osd.198 since back
2018-08-22 15:48:08.052762 front 2018-08-22 15:48:31.282890 (cutoff
2018-08-22 15:48:12.630773)


Randomly is unlikely.
Again, catch it in the act, atop in huge terminal windows (showing all
CPUs and disks) for all nodes should be very telling, collecting and
graphing this data might work, too.

My suspects would be deep scrubs and/or high IOPS spikes when this is
happening, starving out OSD processes (CPU wise, RAM should be fine one
supposes).

Christian


Please help!!!

Have you looked at the OSD logs on the OSD nodes by chance?  I found
that correlating the 

[ceph-users] Testing a hypothetical crush map

2018-08-06 Thread Andras Pataki

Hi cephers,

Is there a way to see what a crush map change does to the PG mappings 
(i.e. what placement groups end up on what OSDs) without actually 
setting the crush map (and have the map take effect)?  I'm looking for 
some way I could test hypothetical crush map changes without any effect 
on the running system.
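
The kind of offline workflow I'm hoping for would look roughly like this
(the option names are my best guess from the man pages, and the pool id is
just an example, so corrections are welcome):

    # grab the current osdmap and extract/decompile its crush map
    ceph osd getmap -o /tmp/osdmap
    osdmaptool /tmp/osdmap --export-crush /tmp/crush.bin
    crushtool -d /tmp/crush.bin -o /tmp/crush.txt

    # edit /tmp/crush.txt, recompile, and inject it into the offline copy
    crushtool -c /tmp/crush.txt -o /tmp/crush.new
    osdmaptool /tmp/osdmap --import-crush /tmp/crush.new

    # show the resulting PG -> OSD mappings for a pool, without touching the cluster
    osdmaptool /tmp/osdmap --test-map-pgs-dump --pool 2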


Andras



Re: [ceph-users] CephFS/ceph-fuse performance

2018-06-06 Thread Andras Pataki
Staring at the logs a bit more it seems like the following lines might 
be the clue:


2018-06-06 08:14:17.615359 7fffefa45700 10 objectcacher trim  start: 
bytes: max 2147483640  clean 2145935360, objects: max 8192 current 8192
2018-06-06 08:14:17.615361 7fffefa45700 10 objectcacher trim finish:  
max 2147483640  clean 2145935360, objects: max 8192 current 8192


Perhaps the object cacher could not free objects to make room for new 
ones (it was caching 8192 objects, which is the maximum in our config)?  
Not sure why that would be, though.  Unfortunately the job has since 
terminated, so I can no longer look at the client's caches.
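
A quick way to poke at this on a live client would have been the standard
admin-socket config commands (just a sketch of the experiment, not a
verified fix):

    # check the current cap, then raise it at runtime and re-run the workload
    ceph daemon /var/run/ceph/ceph-client.admin.asok config get client_oc_max_objects
    ceph daemon /var/run/ceph/ceph-client.admin.asok config set client_oc_max_objects 16384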


Andras


On 06/06/2018 12:22 PM, Andras Pataki wrote:

Hi Greg,

The docs say that client_cache_size is the number of inodes that are 
cached, not bytes of data.  Is that incorrect?


Andras


On 06/06/2018 11:25 AM, Gregory Farnum wrote:

On Wed, Jun 6, 2018 at 5:52 AM, Andras Pataki
 wrote:
We're using CephFS with Luminous 12.2.5 and the fuse client (on 
CentOS 7.4,

kernel 3.10.0-693.5.2.el7.x86_64).  Performance has been very good
generally, but we're currently running into some strange performance 
issues

with one of our applications.  The client in this case is on a higher
latency link - it is about 2.5ms away from all the ceph server nodes 
(all
ceph server nodes are near each other on 10/40Gbps local ethernet, 
only the

client is "away").

The application is reading contiguous data at 64k chunks, the strace 
(-tt -T

flags) looks something like:

06:37:04.152667 read(3, ".:.:.\t./.:.:.:.:.\t./.:.:.:.:.\t./"..., 
65536) =

65536 <0.024052>
06:37:04.178432 read(3, ",1523\t./.:.:.:.:.\t0/0:34,0:34:99"..., 
65536) =

65536 <0.023990>
06:37:04.204087 read(3, ":20:21:0,21,738\t0/0:8,0:8:0:0,0,"..., 
65536) =

65536 <0.024053>
06:37:04.229919 read(3, "665\t0/0:35,0:35:99:0,102,1530\t./"..., 
65536) =

65536 <0.024574>
06:37:04.255623 read(3, ":37:99:0,99,1485\t0/0:34,0:34:99:"..., 
65536) =

65536 <0.023795>
06:37:04.280914 read(3, ":.\t./.:.:.:.:.:.:.\t./.:.:.:.:.:."..., 
65536) =

65536 <0.023614>
06:37:04.306022 read(3, "0,0,0\t./.:0,0:0:.:0,0,0\t./.:0,0:"..., 
65536) =

65536 <0.024037>


so each 64k read takes about 23-24ms.  The client has the file open for
read, the machine is not busy (load of 0.2), neither are the ceph 
nodes.

The fuse client seems pretty idle also.

Increasing the log level to 20 for 'client' and 'objectcacher' on 
ceph-fuse,
it looks like ceph-fuse gets ll_read requests of 4k in size, and it 
looks
like it does an async read from the OSDs in 4k chunks (if I'm 
interpreting

the logs right).  Here is a trace of one ll_read:

2018-06-06 08:14:17.609495 7fffe7a35700  3 client.16794661 ll_read
0x5556dadfc1a0 0x1000d092e5f  238173646848~4096
2018-06-06 08:14:17.609506 7fffe7a35700 10 client.16794661 get_caps
0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 
objects

6769 dirty_or_tx 0] parents=0x5f187680 0x5f138680) have
pAsLsXsFsxcrwb need Fr want Fc revoking -
2018-06-06 08:14:17.609517 7fffe7a35700 10 client.16794661 _read_async
0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 
objects
6769 dirty_or_tx 0] parents=0x5f187680 0x5f138680) 
238173646848~4096

2018-06-06 08:14:17.609523 7fffe7a35700 10 client.16794661
min_bytes=4194304 max_bytes=268435456 max_periods=64
2018-06-06 08:14:17.609528 7fffe7a35700 10 objectcacher readx
extent(1000d092e5f.ddd1 (56785) in @6 94208~4096 -> [0,4096])
2018-06-06 08:14:17.609532 7fffe7a35700 10
objectcacher.object(1000d092e5f.ddd1/head) map_read 
1000d092e5f.ddd1

94208~4096
2018-06-06 08:14:17.609535 7fffe7a35700 20
objectcacher.object(1000d092e5f.ddd1/head) map_read miss 4096 
left, bh[

0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
2018-06-06 08:14:17.609537 7fffe7a35700  7 objectcacher bh_read on bh[
0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
outstanding reads 0
2018-06-06 08:14:17.609576 7fffe7a35700 10 objectcacher readx missed,
waiting on bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] 
waiters

= {} off 94208
2018-06-06 08:14:17.609579 7fffe7a35700 20 objectcacher readx defer
0x55570211ec00
2018-06-06 08:14:17.609580 7fffe7a35700  5 client.16794661 
get_cap_ref got

first FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=p

Re: [ceph-users] CephFS/ceph-fuse performance

2018-06-06 Thread Andras Pataki

Hi Greg,

The docs say that client_cache_size is the number of inodes that are 
cached, not bytes of data.  Is that incorrect?


Andras


On 06/06/2018 11:25 AM, Gregory Farnum wrote:

On Wed, Jun 6, 2018 at 5:52 AM, Andras Pataki
 wrote:

We're using CephFS with Luminous 12.2.5 and the fuse client (on CentOS 7.4,
kernel 3.10.0-693.5.2.el7.x86_64).  Performance has been very good
generally, but we're currently running into some strange performance issues
with one of our applications.  The client in this case is on a higher
latency link - it is about 2.5ms away from all the ceph server nodes (all
ceph server nodes are near each other on 10/40Gbps local ethernet, only the
client is "away").

The application is reading contiguous data at 64k chunks, the strace (-tt -T
flags) looks something like:

06:37:04.152667 read(3, ".:.:.\t./.:.:.:.:.\t./.:.:.:.:.\t./"..., 65536) =
65536 <0.024052>
06:37:04.178432 read(3, ",1523\t./.:.:.:.:.\t0/0:34,0:34:99"..., 65536) =
65536 <0.023990>
06:37:04.204087 read(3, ":20:21:0,21,738\t0/0:8,0:8:0:0,0,"..., 65536) =
65536 <0.024053>
06:37:04.229919 read(3, "665\t0/0:35,0:35:99:0,102,1530\t./"..., 65536) =
65536 <0.024574>
06:37:04.255623 read(3, ":37:99:0,99,1485\t0/0:34,0:34:99:"..., 65536) =
65536 <0.023795>
06:37:04.280914 read(3, ":.\t./.:.:.:.:.:.:.\t./.:.:.:.:.:."..., 65536) =
65536 <0.023614>
06:37:04.306022 read(3, "0,0,0\t./.:0,0:0:.:0,0,0\t./.:0,0:"..., 65536) =
65536 <0.024037>


so each 64k read takes about 23-24ms.  The client has the file open for
read, the machine is not busy (load of 0.2), neither are the ceph nodes.
The fuse client seems pretty idle also.

Increasing the log level to 20 for 'client' and 'objectcacher' on ceph-fuse,
it looks like ceph-fuse gets ll_read requests of 4k in size, and it looks
like it does an async read from the OSDs in 4k chunks (if I'm interpreting
the logs right).  Here is a trace of one ll_read:

2018-06-06 08:14:17.609495 7fffe7a35700  3 client.16794661 ll_read
0x5556dadfc1a0 0x1000d092e5f  238173646848~4096
2018-06-06 08:14:17.609506 7fffe7a35700 10 client.16794661 get_caps
0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x5f187680 0x5f138680) have
pAsLsXsFsxcrwb need Fr want Fc revoking -
2018-06-06 08:14:17.609517 7fffe7a35700 10 client.16794661 _read_async
0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x5f187680 0x5f138680) 238173646848~4096
2018-06-06 08:14:17.609523 7fffe7a35700 10 client.16794661
min_bytes=4194304 max_bytes=268435456 max_periods=64
2018-06-06 08:14:17.609528 7fffe7a35700 10 objectcacher readx
extent(1000d092e5f.ddd1 (56785) in @6 94208~4096 -> [0,4096])
2018-06-06 08:14:17.609532 7fffe7a35700 10
objectcacher.object(1000d092e5f.ddd1/head) map_read 1000d092e5f.ddd1
94208~4096
2018-06-06 08:14:17.609535 7fffe7a35700 20
objectcacher.object(1000d092e5f.ddd1/head) map_read miss 4096 left, bh[
0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
2018-06-06 08:14:17.609537 7fffe7a35700  7 objectcacher bh_read on bh[
0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing] waiters = {}
outstanding reads 0
2018-06-06 08:14:17.609576 7fffe7a35700 10 objectcacher readx missed,
waiting on bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] waiters
= {} off 94208
2018-06-06 08:14:17.609579 7fffe7a35700 20 objectcacher readx defer
0x55570211ec00
2018-06-06 08:14:17.609580 7fffe7a35700  5 client.16794661 get_cap_ref got
first FILE_CACHE ref on 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0} mode=100664
size=244712765330/249011634176 mtime=2018-06-05 00:33:31.332901
caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x1000d092e5f ts 0/0 objects
6769 dirty_or_tx 0] parents=0x5f187680 0x5f138680)
2018-06-06 08:14:17.609587 7fffe7a35700 15 inode.get on 0x5f138680
0x1000d092e5f.head now 4
2018-06-06 08:14:17.612318 7fffefa45700  7 objectcacher bh_read_finish
1000d092e5f.ddd1/head tid 29067611 94208~4096 (bl is 4096) returned 0
outstanding reads 1
2018-06-06 08:14:17.612338 7fffefa45700 20 objectcacher checking bh bh[
0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] waiters = {
94208->[0x5557007383a0, ]}
2018-06-06 08:14:17.612341 7fffefa45700 10 objectcacher bh_read_finish read
bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean firstbyte=46]
waiters = {}
2018-06-06 08:14:17.612344 7ff

[ceph-users] CephFS/ceph-fuse performance

2018-06-06 Thread Andras Pataki
We're using CephFS with Luminous 12.2.5 and the fuse client (on CentOS 
7.4, kernel 3.10.0-693.5.2.el7.x86_64).  Performance has been very good 
generally, but we're currently running into some strange performance 
issues with one of our applications.  The client in this case is on a 
higher latency link - it is about 2.5ms away from all the ceph server 
nodes (all ceph server nodes are near each other on 10/40Gbps local 
ethernet, only the client is "away").


The application is reading contiguous data at 64k chunks, the strace 
(-tt -T flags) looks something like:


   06:37:04.152667 read(3, ".:.:.\t./.:.:.:.:.\t./.:.:.:.:.\t./"...,
   65536) = 65536 <0.024052>
   06:37:04.178432 read(3, ",1523\t./.:.:.:.:.\t0/0:34,0:34:99"...,
   65536) = 65536 <0.023990>
   06:37:04.204087 read(3, ":20:21:0,21,738\t0/0:8,0:8:0:0,0,"...,
   65536) = 65536 <0.024053>
   06:37:04.229919 read(3, "665\t0/0:35,0:35:99:0,102,1530\t./"...,
   65536) = 65536 <0.024574>
   06:37:04.255623 read(3, ":37:99:0,99,1485\t0/0:34,0:34:99:"...,
   65536) = 65536 <0.023795>
   06:37:04.280914 read(3, ":.\t./.:.:.:.:.:.:.\t./.:.:.:.:.:."...,
   65536) = 65536 <0.023614>
   06:37:04.306022 read(3, "0,0,0\t./.:0,0:0:.:0,0,0\t./.:0,0:"...,
   65536) = 65536 <0.024037>


so each 64k read takes about 23-24ms.  The client has the file open for 
read, the machine is not busy (load of 0.2), neither are the ceph 
nodes.  The fuse client seems pretty idle also.


Increasing the log level to 20 for 'client' and 'objectcacher' on 
ceph-fuse, it looks like ceph-fuse gets ll_read requests of 4k in size, 
and it looks like it does an async read from the OSDs in 4k chunks (if 
I'm interpreting the logs right).  Here is a trace of one ll_read:


   2018-06-06 08:14:17.609495 7fffe7a35700  3 client.16794661 ll_read
   0x5556dadfc1a0 0x1000d092e5f 238173646848~4096
   2018-06-06 08:14:17.609506 7fffe7a35700 10 client.16794661 get_caps
   0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
   cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x5f187680 0x5f138680) have pAsLsXsFsxcrwb need Fr
   want Fc revoking -
   2018-06-06 08:14:17.609517 7fffe7a35700 10 client.16794661
   _read_async 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
   cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x5f187680 0x5f138680) 238173646848~4096
   2018-06-06 08:14:17.609523 7fffe7a35700 10 client.16794661
   min_bytes=4194304 max_bytes=268435456 max_periods=64
   2018-06-06 08:14:17.609528 7fffe7a35700 10 objectcacher readx
   extent(1000d092e5f.ddd1 (56785) in @6 94208~4096 -> [0,4096])
   2018-06-06 08:14:17.609532 7fffe7a35700 10
   objectcacher.object(1000d092e5f.ddd1/head) map_read
   1000d092e5f.ddd1 94208~4096
   2018-06-06 08:14:17.609535 7fffe7a35700 20
   objectcacher.object(1000d092e5f.ddd1/head) map_read miss 4096
   left, bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing]
   waiters = {}
   2018-06-06 08:14:17.609537 7fffe7a35700  7 objectcacher bh_read on
   bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing]
   waiters = {} outstanding reads 0
   2018-06-06 08:14:17.609576 7fffe7a35700 10 objectcacher readx
   missed, waiting on bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (0)
   v 0 rx] waiters = {} off 94208
   2018-06-06 08:14:17.609579 7fffe7a35700 20 objectcacher readx defer
   0x55570211ec00
   2018-06-06 08:14:17.609580 7fffe7a35700  5 client.16794661
   get_cap_ref got first FILE_CACHE ref on
   0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
   cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x5f187680 0x5f138680)
   2018-06-06 08:14:17.609587 7fffe7a35700 15 inode.get on
   0x5f138680 0x1000d092e5f.head now 4
   2018-06-06 08:14:17.612318 7fffefa45700  7 objectcacher
   bh_read_finish 1000d092e5f.ddd1/head tid 29067611 94208~4096 (bl
   is 4096) returned 0 outstanding reads 1
   2018-06-06 08:14:17.612338 7fffefa45700 20 objectcacher checking bh
   bh[ 0x5fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] waiters = {
   94208->[0x5557007383a0, ]}
   2018-06-06 08:14:17.612341 7fffefa45700 10 objectcacher
   bh_read_finish read bh[ 0x5fecdd40 94208~4096 0x5556226235c0
   (4096) v 0 clean firstbyte=46] waiters = {}
   2018-06-06 08:14:17.612344 7fffefa45700 10
   objectcacher.object(1000d092e5f.ddd1/head) try_merge_bh bh[
   0x5fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean
   

Re: [ceph-users] Help/advice with crush rules

2018-05-21 Thread Andras Pataki

Hi Greg,

Thanks for the detailed explanation - the examples make a lot of sense.

One followup question regarding a two level crush rule like:

step take default
step choose 3 type=rack
step chooseleaf 3 type=host
step emit

If the erasure code has 9 chunks, this lines up exactly without any 
problems.  What if the erasure code isn't a nice product of the racks 
and hosts/rack, for example 6+2 with the above example?  Will it just 
take 3 chunks in the first two racks and 2 from the last without any 
issues?  The other direction I presume can't work, i.e. on the above 
example I can't put any erasure code with more than 9 chunks.
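
For instance (same shorthand as above, and assuming we actually had at 
least four racks), a 6+2 code could be made to line up exactly with 
something like:

   step take default
   step choose 4 type=rack
   step chooseleaf 2 type=host
   step emit

i.e. 4 racks x 2 host-distinct chunks = 8 placements, with at most 2 
chunks per rack - though I understand a real erasure-coded rule would 
use the 'indep' variant of the choose steps.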


Andras


On 05/18/2018 06:30 PM, Gregory Farnum wrote:
On Thu, May 17, 2018 at 9:05 AM Andras Pataki 
<apat...@flatironinstitute.org <mailto:apat...@flatironinstitute.org>> 
wrote:


I've been trying to wrap my head around crush rules, and I need some
help/advice.  I'm thinking of using erasure coding instead of
replication, and trying to understand the possibilities for
planning for
failure cases.

For a simplified example, consider a 2 level topology, OSDs live on
hosts, and hosts live in racks.  I'd like to set up a rule for a 6+3
erasure code that would put at most 1 of the 9 chunks on a host,
and no
more than 3 chunks in a rack (so in case the rack is lost, we
still have
a way to recover).  Some racks may not have 3 hosts in them, so they
could potentially accept only 1 or 2 chunks then.  How can something
like this be implemented as a crush rule?  Or, if not exactly this,
something in this spirit?  I don't want to say that all chunks
need to
live in a separate rack because that is too restrictive (some
racks may
be much bigger than others, or there might not even be 9 racks).


Unfortunately what you describe here is a little too detailed in ways 
CRUSH can't easily specify. You should think of a CRUSH rule as a 
sequence of steps that start out at a root (the "take" step), and 
incrementally specify more detail about which piece of the CRUSH 
hierarchy they run on, but run the *same* rule on every piece they select.


So the simplest thing that comes close to what you suggest is:
(forgive me if my syntax is slightly off, I'm doing this from memory)
step take default
step chooseleaf n type=rack
step emit

That would start at the default root, select "n" racks (9, in your 
case) and then for each rack find an OSD within it. (chooseleaf is 
special and more flexibly than most of the CRUSH language; it's nice 
because if it can't find an OSD in one of the selected racks, it will 
pick another rack).

But a rule that's more illustrative of how things work is:
step take default
step choose 3 type=rack
step chooseleaf 3 type=host
step emit

That one selects three racks, then selects three OSDs within different 
hosts *in each rack*. (You'll note that it doesn't necessarily work 
out so well if you don't want 9 OSDs!) If one of the racks it selected 
doesn't have 3 separate hosts...well, tough, it tried to do what you 
told it. :/


If you were dedicated, you could split up your racks into 
equivalently-sized units — let's say rows. Then you could do

step take default
step choose 3 type=row
step chooseleaf 3 type=host
step emit

Assuming you have 3+ rows of good size, that'll get you 9 OSDs which 
are all on different hosts.

-Greg


Thanks,

Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Help/advice with crush rules

2018-05-17 Thread Andras Pataki
I've been trying to wrap my head around crush rules, and I need some 
help/advice.  I'm thinking of using erasure coding instead of 
replication, and trying to understand the possibilities for planning for 
failure cases.


For a simplified example, consider a 2 level topology, OSDs live on 
hosts, and hosts live in racks.  I'd like to set up a rule for a 6+3 
erasure code that would put at most 1 of the 9 chunks on a host, and no 
more than 3 chunks in a rack (so in case the rack is lost, we still have 
a way to recover).  Some racks may not have 3 hosts in them, so they 
could potentially accept only 1 or 2 chunks then.  How can something 
like this be implemented as a crush rule?  Or, if not exactly this, 
something in this spirit?  I don't want to say that all chunks need to 
live in a separate rack because that is too restrictive (some racks may 
be much bigger than others, or there might not even be 9 racks).


Thanks,

Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume and systemd troubles

2018-05-16 Thread Andras Pataki

Done: tracker #24152

Thanks,

Andras


On 05/16/2018 04:58 PM, Alfredo Deza wrote:

On Wed, May 16, 2018 at 4:50 PM, Andras Pataki
<apat...@flatironinstitute.org> wrote:

Dear ceph users,

I've been experimenting setting up a new node with ceph-volume and
bluestore.  Most of the setup works right, but I'm running into a strange
interaction between ceph-volume and systemd when starting OSDs.

After preparing/activating the OSD, a systemd unit instance is created with
a symlink in /etc/systemd/system/multi-user.target.wants
 ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb.service ->
/usr/lib/systemd/system/ceph-volume@.service

I've moved this dependency to ceph-osd.target.wants, since I'd like to be
able to start/stop all OSDs on the same node with one command (let me know
if there is a better way).  The stopping works without this, since
ceph-osd@.service is marked as part of ceph-osd.target, but starting does
not since these new ceph-volume units aren't together in a separate target.

However, when I run 'systemctl start ceph-osd.target' multiple times, the
systemctl command hangs, even though the OSD starts up fine.  Interestingly,
'systemctl start
ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb.service' does not
hang, however.
Troubleshooting further, I see that the ceph-volume@.target unit calls
'ceph-volume lvm trigger 121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb', which in
turn calls 'Activate', running a few systemd commands:

Running command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
/dev/H900D00/H900D00 --path /var/lib/ceph/osd/ceph-121
Running command: ln -snf /dev/H900D00/H900D00
/var/lib/ceph/osd/ceph-121/block
Running command: chown -R ceph:ceph /dev/dm-0
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-121
Running command: systemctl enable
ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb
Running command: systemctl start ceph-osd@121
--> ceph-volume lvm activate successful for osd ID: 121

The problem seems to be the 'systemctl enable' command, which essentially
tries to enable the unit that is currently being executed (for the case when
running systemctl start ceph-osd.target).  Somehow systemd (in CentOS) isn't
very happy with that.  If I edit the python scripts to check that the unit
is not enabled before enabling it - the hangs stop.
For example, replacing in
/usr/lib/python2.7/site-packages/ceph_volume/systemd/systemd.py

def enable(unit):
    process.run(['systemctl', 'enable', unit])


with

def enable(unit):
    stdout, stderr, retcode = process.call(['systemctl', 'is-enabled', unit],
                                           show_command=True)
    if retcode != 0:
        process.run(['systemctl', 'enable', unit])


fixes the issue.

Has anyone run into this, or has any ideas on how to proceed?

This looks like an oversight on our end. We don't run into this
because we haven't tried to start/stop all OSDs at once in our tests.

Can you create a ticket so that we can fix this? Your changes look
correct to me.

http://tracker.ceph.com/projects/ceph-volume/issues/new


Andras


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-volume and systemd troubles

2018-05-16 Thread Andras Pataki

Dear ceph users,

I've been experimenting setting up a new node with ceph-volume and 
bluestore.  Most of the setup works right, but I'm running into a 
strange interaction between ceph-volume and systemd when starting OSDs.


After preparing/activating the OSD, a systemd unit instance is created 
with a symlink in /etc/systemd/system/multi-user.target.wants
    ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb.service -> 
/usr/lib/systemd/system/ceph-volume@.service


I've moved this dependency to ceph-osd.target.wants, since I'd like to 
be able to start/stop all OSDs on the same node with one command (let me 
know if there is a better way).  The stopping works without this, since 
ceph-osd@.service is marked as part of ceph-osd.target, but starting 
does not since these new ceph-volume units aren't together in a separate 
target.
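
(For reference, the move itself was nothing fancy - roughly the 
following, with the unit name from the example below:

   mkdir -p /etc/systemd/system/ceph-osd.target.wants
   mv /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb.service \
      /etc/systemd/system/ceph-osd.target.wants/
   systemctl daemon-reload

so the ceph-volume units are now wanted by ceph-osd.target instead of 
multi-user.target.)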


However, when I run 'systemctl start ceph-osd.target' multiple times, 
the systemctl command hangs, even though the OSD starts up fine.  
Interestingly, 'systemctl start 
ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb.service' does 
not hang, however.
Troubleshooting further, I see that the ceph-volume@.target unit calls 
'ceph-volume lvm trigger 121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb', 
which in turn calls 'Activate', running a few systemd commands:


Running command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev 
/dev/H900D00/H900D00 --path /var/lib/ceph/osd/ceph-121
Running command: ln -snf /dev/H900D00/H900D00 
/var/lib/ceph/osd/ceph-121/block

Running command: chown -R ceph:ceph /dev/dm-0
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-121
Running command: systemctl enable 
ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb

Running command: systemctl start ceph-osd@121
--> ceph-volume lvm activate successful for osd ID: 121

The problem seems to be the 'systemctl enable' command, which 
essentially tries to enable the unit that is currently being executed 
(for the case when running systemctl start ceph-osd.target).  Somehow 
systemd (in CentOS) isn't very happy with that.  If I edit the python 
scripts to check that the unit is not enabled before enabling it - the 
hangs stop.
For example, replacing in 
/usr/lib/python2.7/site-packages/ceph_volume/systemd/systemd.py


   def enable(unit):
       process.run(['systemctl', 'enable', unit])


with

   def enable(unit):
       stdout, stderr, retcode = process.call(['systemctl', 'is-enabled', unit],
                                              show_command=True)
       if retcode != 0:
           process.run(['systemctl', 'enable', unit])


fixes the issue.

Has anyone run into this, or has any ideas on how to proceed?

Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph - incorrect output of ceph osd tree

2018-01-31 Thread Andras Pataki
There is a config option "mon osd min up ratio" (defaults to 0.3) - and 
if too many OSDs are down, the monitors will not mark further OSDs 
down.  Perhaps that's the culprit here?
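
(A quick way to check what the monitors are running with - the mon name 
is just a placeholder:

   ceph daemon mon.a config get mon_osd_min_up_ratio

and if that is indeed the limit being hit, it can be lowered at runtime 
with something like 'ceph tell mon.* injectargs --mon_osd_min_up_ratio 
0.1'.)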


Andras


On 01/31/2018 02:21 PM, Marc Roos wrote:
  
Maybe the process is still responding on an active session?

If you can't ping a host, that only means you cannot ping it.


-Original Message-
From: Steven Vacaroaia [mailto:ste...@gmail.com]
Sent: woensdag 31 januari 2018 19:47
To: ceph-users
Subject: [ceph-users] Ceph - incorrect output of ceph osd tree

Hi,

Why is ceph osd tree reports that osd.4 is up when the server on which
osd.4 is running is actually down ??

Any help will be appreciated

[root@osd01 ~]# ping -c 2 osd02
PING osd02 (10.10.30.182) 56(84) bytes of data.
 From osd01 (10.10.30.181) icmp_seq=1 Destination Host Unreachable From
osd01 (10.10.30.181) icmp_seq=2 Destination Host Unreachable


[root@osd01 ~]# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME  STATUS REWEIGHT PRI-AFF
  -9 0 root ssds
-10 0 host osd01-ssd
-11 0 host osd02-ssd
-12 0 host osd04-ssd
  -1   4.22031 root default
  -3   1.67967 host osd01
   0   hdd 0.55989 osd.0down0 1.0
   3   hdd 0.55989 osd.3down0 1.0
   6   hdd 0.55989 osd.6  up  1.0 1.0
  -5   1.67967 host osd02
   1   hdd 0.55989 osd.1down  1.0 1.0
   4   hdd 0.55989 osd.4  up  1.0 1.0
   7   hdd 0.55989 osd.7down  1.0 1.0
  -7   0.86096 host osd04
   2   hdd 0.28699 osd.2down0 1.0
   5   hdd 0.28699 osd.5down  1.0 1.0
   8   hdd 0.28699 osd.8down  1.0 1.0
[root@osd01 ~]# ceph tell osd.4 bench
^CError EINTR: problem getting command descriptions from osd.4
[root@osd01 ~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL %USE VAR  PGS
  0   hdd 0.559890 0  0 000   0
  3   hdd 0.559890 0  0 000   0
  6   hdd 0.55989  1.0  573G 16474M  557G 2.81 0.84   0
  1   hdd 0.55989  1.0  573G 16516M  557G 2.81 0.84   0
  4   hdd 0.55989  1.0  573G 16465M  557G 2.80 0.84   0
  7   hdd 0.55989  1.0  573G 16473M  557G 2.81 0.84   0
  2   hdd 0.286990 0  0 000   0
  5   hdd 0.28699  1.0  293G 16466M  277G 5.47 1.63   0
  8   hdd 0.28699  1.0  293G 16461M  277G 5.47 1.63   0
 TOTAL 2881G 98857M 2784G 3.35 MIN/MAX VAR: 0.84/1.63
  STDDEV: 1.30
[root@osd01 ~]# ceph osd df tree
ID  CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL %USE VAR  PGS TYPE NAME
  -9 0- 0  0 000   - root ssds
-10 0- 0  0 000   - host
osd01-ssd
-11 0- 0  0 000   - host
osd02-ssd
-12 0- 0  0 000   - host
osd04-ssd
  -1   4.22031- 2881G 98857M 2784G 3.35 1.00   - root default
  -3   1.67967-  573G 16474M  557G 2.81 0.84   - host
osd01
   0   hdd 0.559890 0  0 000   0
osd.0
   3   hdd 0.559890 0  0 000   0
osd.3
   6   hdd 0.55989  1.0  573G 16474M  557G 2.81 0.84   0
osd.6
  -5   1.67967- 1720G 49454M 1671G 2.81 0.84   - host
osd02
   1   hdd 0.55989  1.0  573G 16516M  557G 2.81 0.84   0
osd.1
   4   hdd 0.55989  1.0  573G 16465M  557G 2.80 0.84   0
osd.4
   7   hdd 0.55989  1.0  573G 16473M  557G 2.81 0.84   0
osd.7
  -7   0.86096-  587G 32928M  555G 5.47 1.63   - host
osd04
   2   hdd 0.286990 0  0 000   0
osd.2
   5   hdd 0.28699  1.0  293G 16466M  277G 5.47 1.63   0
osd.5
   8   hdd 0.28699  1.0  293G 16461M  277G 5.47 1.63   0
osd.8
  TOTAL 2881G 98857M 2784G 3.35 MIN/MAX VAR:
0.84/1.63  STDDEV: 1.30



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

2018-01-22 Thread Andras Pataki
Just to close this thread up - it looks like all the problems were 
related to setting the "mds cache size" option in Luminous instead of 
using "mds cache memory limit".  The "mds cache size" option 
documentation says that "it is recommended to use mds_cache_memory_limit 
...", but it looks more like "mds cache size" simply does not work in 
Luminous like it used to in Jewel (or does not work period).  As a 
result the MDS was trying to aggressively reduce caches in our setup.  
Since we switched all MDS's over to 'mds cache memory limit' of 16GB and 
bounced them, we have had no performance or cache pressure issues, and 
as expected they hover around 22-23GB of RSS.
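
For reference, expressed as a ceph.conf snippet, what we are running 
now is essentially:

   [mds]
   # 16GB cache target; 'mds cache size' is left unset
   mds cache memory limit = 17179869184
   mds cache reservation = 0.10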


Thanks everyone for the help,

Andras


On 01/18/2018 12:34 PM, Patrick Donnelly wrote:

Hi Andras,

On Thu, Jan 18, 2018 at 3:38 AM, Andras Pataki
<apat...@flatironinstitute.org> wrote:

Hi John,

Some other symptoms of the problem:  when the MDS has been running for a few
days, it starts looking really busy.  At this time, listing directories
becomes really slow.  An "ls -l" on a directory with about 250 entries takes
about 2.5 seconds.  All the metadata is on OSDs with NVMe backing stores.
Interestingly enough the memory usage seems pretty low (compared to the
allowed cache limit).


 PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
COMMAND
1604408 ceph  20   0 3710304 2.387g  18360 S 100.0  0.9 757:06.92
/usr/bin/ceph-mds -f --cluster ceph --id cephmon00 --setuser ceph --setgroup
ceph

Once I bounce it (fail it over), the CPU usage goes down to the 10-25%
range.  The same ls -l after the bounce takes about 0.5 seconds.  I
remounted the filesystem before each test to ensure there isn't anything
cached.

 PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
COMMAND
   00 ceph  20   0 6537052 5.864g  18500 S  17.6  2.3   9:23.55
/usr/bin/ceph-mds -f --cluster ceph --id cephmon02 --setuser ceph --setgroup
ceph

Also, I have a crawler that crawls the file system periodically.  Normally
the full crawl runs for about 24 hours, but with the slowing down MDS, now
it has been running for more than 2 days and isn't close to finishing.

The MDS related settings we are running with are:

mds_cache_memory_limit = 17179869184
mds_cache_reservation = 0.10

Debug logs from the MDS at that time would be helpful with `debug mds
= 20` and `debug ms = 1`. Feel free to create a tracker ticket and use
ceph-post-file [1] to share logs.

[1] http://docs.ceph.com/docs/hammer/man/8/ceph-post-file/



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

2018-01-18 Thread Andras Pataki

Hi John,

Some other symptoms of the problem:  when the MDS has been running for a 
few days, it starts looking really busy.  At this time, listing 
directories becomes really slow.  An "ls -l" on a directory with about 
250 entries takes about 2.5 seconds.  All the metadata is on OSDs with 
NVMe backing stores.  Interestingly enough the memory usage seems pretty 
low (compared to the allowed cache limit).



    PID USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM TIME+ COMMAND
1604408 ceph  20   0 3710304 2.387g  18360 S 100.0  0.9 757:06.92 
/usr/bin/ceph-mds -f --cluster ceph --id cephmon00 --setuser ceph 
--setgroup ceph


Once I bounce it (fail it over), the CPU usage goes down to the 10-25% 
range.  The same ls -l after the bounce takes about 0.5 seconds.  I 
remounted the filesystem before each test to ensure there isn't anything 
cached.


    PID USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM TIME+ COMMAND
00 ceph  20   0 6537052 5.864g  18500 S  17.6 2.3   9:23.55 
/usr/bin/ceph-mds -f --cluster ceph --id cephmon02 --setuser ceph 
--setgroup ceph


Also, I have a crawler that walks the file system periodically.  
Normally the full crawl takes about 24 hours, but with the MDS slowing 
down it has now been running for more than 2 days and isn't close to 
finishing.


The MDS related settings we are running with are:

   mds_cache_memory_limit = 17179869184
   mds_cache_reservation = 0.10


Andras


On 01/17/2018 01:11 PM, John Spray wrote:

On Wed, Jan 17, 2018 at 3:36 PM, Andras Pataki
<apat...@flatironinstitute.org> wrote:

Hi John,

All our hosts are CentOS 7 hosts, the majority are 7.4 with kernel
3.10.0-693.5.2.el7.x86_64, with fuse 2.9.2-8.el7.  We have some hosts that
have slight variations in kernel versions, the oldest one are a handful of
CentOS 7.3 hosts with kernel 3.10.0-514.21.1.el7.x86_64 and fuse
2.9.2-7.el7.  I know Redhat has been backporting lots of stuff so perhaps
these kernels fall into the category you are describing?

Quite possibly -- this issue was originally noticed on RHEL, so maybe
the relevant bits made it back to CentOS recently.

However, it looks like the fixes for that issue[1,2] are already in
12.2.2, so maybe this is something completely unrelated :-/

The ceph-fuse executable does create an admin command socket in
/var/run/ceph (named something ceph-client...) that you can drive with
"ceph daemon  dump_cache", but the output is extremely verbose
and low level and may not be informative.

John

1. http://tracker.ceph.com/issues/21423
2. http://tracker.ceph.com/issues/22269


When the cache pressure problem happens, is there a way to know exactly
which hosts are involved, and what items are in their caches easily?

Andras



On 01/17/2018 06:09 AM, John Spray wrote:

On Tue, Jan 16, 2018 at 8:50 PM, Andras Pataki
<apat...@flatironinstitute.org> wrote:

Dear Cephers,

We've upgraded the back end of our cluster from Jewel (10.2.10) to
Luminous
(12.2.2).  The upgrade went smoothly for the most part, except we seem to
be
hitting an issue with cephfs.  After about a day or two of use, the MDS
start complaining about clients failing to respond to cache pressure:

What's the OS, kernel version and fuse version on the hosts where the
clients are running?

There have been some issues with ceph-fuse losing the ability to
properly invalidate cached items when certain updated OS packages were
installed.

Specifically, ceph-fuse checks the kernel version against 3.18.0 to
decide which invalidation method to use, and if your OS has backported
new behaviour to a low-version-numbered kernel, that can confuse it.

John


[root@cephmon00 ~]# ceph -s
cluster:
  id: d7b33135-0940-4e48-8aa6-1d2026597c2f
  health: HEALTH_WARN
  1 MDSs have many clients failing to respond to cache
pressure
  noout flag(s) set
  1 osds down

services:
  mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
  mgr: cephmon00(active), standbys: cephmon01, cephmon02
  mds: cephfs-1/1/1 up  {0=cephmon00=up:active}, 2 up:standby
  osd: 2208 osds: 2207 up, 2208 in
   flags noout

data:
  pools:   6 pools, 42496 pgs
  objects: 919M objects, 3062 TB
  usage:   9203 TB used, 4618 TB / 13822 TB avail
  pgs: 42470 active+clean
   22active+clean+scrubbing+deep
   4 active+clean+scrubbing

io:
  client:   56122 kB/s rd, 18397 kB/s wr, 84 op/s rd, 101 op/s wr

[root@cephmon00 ~]# ceph health detail
HEALTH_WARN 1 MDSs have many clients failing to respond to cache
pressure;
noout flag(s) set; 1 osds down
MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to
cache
pressure
  mdscephmon00(mds.0): Many clients (103) failing to respond to cache
pressureclient_count: 103
OSDMAP_FLAGS noout flag(s) set
OSD_DOWN 1 osds down
  osd.1296 (root=root-disk,pod=pod0-disk,host=cephosd008-disk) is down


We are u

Re: [ceph-users] After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

2018-01-17 Thread Andras Pataki

Hi John,

All our hosts are CentOS 7 hosts, the majority are 7.4 with kernel 
3.10.0-693.5.2.el7.x86_64, with fuse 2.9.2-8.el7.  We have some hosts 
that have slight variations in kernel versions, the oldest one are a 
handful of CentOS 7.3 hosts with kernel 3.10.0-514.21.1.el7.x86_64 and 
fuse 2.9.2-7.el7.  I know Redhat has been backporting lots of stuff so 
perhaps these kernels fall into the category you are describing?


When the cache pressure problem happens, is there a way to know exactly 
which hosts are involved, and what items are in their caches easily?
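
(One thing I can try in the meantime, if I'm reading the docs right, is 
listing the sessions on the active MDS, e.g.:

   ceph daemon mds.cephmon00 session ls

which should show each client's hostname and num_caps, so the clients 
holding unusually many caps ought to stand out.)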


Andras


On 01/17/2018 06:09 AM, John Spray wrote:

On Tue, Jan 16, 2018 at 8:50 PM, Andras Pataki
<apat...@flatironinstitute.org> wrote:

Dear Cephers,

We've upgraded the back end of our cluster from Jewel (10.2.10) to Luminous
(12.2.2).  The upgrade went smoothly for the most part, except we seem to be
hitting an issue with cephfs.  After about a day or two of use, the MDS
start complaining about clients failing to respond to cache pressure:

What's the OS, kernel version and fuse version on the hosts where the
clients are running?

There have been some issues with ceph-fuse losing the ability to
properly invalidate cached items when certain updated OS packages were
installed.

Specifically, ceph-fuse checks the kernel version against 3.18.0 to
decide which invalidation method to use, and if your OS has backported
new behaviour to a low-version-numbered kernel, that can confuse it.

John


[root@cephmon00 ~]# ceph -s
   cluster:
 id: d7b33135-0940-4e48-8aa6-1d2026597c2f
 health: HEALTH_WARN
 1 MDSs have many clients failing to respond to cache pressure
 noout flag(s) set
 1 osds down

   services:
 mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
 mgr: cephmon00(active), standbys: cephmon01, cephmon02
 mds: cephfs-1/1/1 up  {0=cephmon00=up:active}, 2 up:standby
 osd: 2208 osds: 2207 up, 2208 in
  flags noout

   data:
 pools:   6 pools, 42496 pgs
 objects: 919M objects, 3062 TB
 usage:   9203 TB used, 4618 TB / 13822 TB avail
 pgs: 42470 active+clean
  22active+clean+scrubbing+deep
  4 active+clean+scrubbing

   io:
 client:   56122 kB/s rd, 18397 kB/s wr, 84 op/s rd, 101 op/s wr

[root@cephmon00 ~]# ceph health detail
HEALTH_WARN 1 MDSs have many clients failing to respond to cache pressure;
noout flag(s) set; 1 osds down
MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache
pressure
 mdscephmon00(mds.0): Many clients (103) failing to respond to cache
pressureclient_count: 103
OSDMAP_FLAGS noout flag(s) set
OSD_DOWN 1 osds down
 osd.1296 (root=root-disk,pod=pod0-disk,host=cephosd008-disk) is down


We are using exclusively the 12.2.2 fuse client on about 350 nodes or so
(out of which it seems 100 are not responding to cache pressure in this
log).  When this happens, clients appear pretty sluggish also (listing
directories, etc.).  After bouncing the MDS, everything returns on normal
after the failover for a while.  Ignore the message about 1 OSD down, that
corresponds to a failed drive and all data has been re-replicated since.

We were also using the 12.2.2 fuse client with the Jewel back end before the
upgrade, and have not seen this issue.

We are running with a larger MDS cache than usual, we have mds_cache_size
set to 4 million.  All other MDS configs are the defaults.

Is this a known issue?  If not, any hints on how to further diagnose the
problem?

Andras


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] After Luminous upgrade: ceph-fuse clients failingtorespond to cache pressure

2018-01-17 Thread Andras Pataki

Burkhard,

Thanks very much for info - I'll try the MDS with a 16GB 
mds_cache_memory_limit (which leaves some buffer for extra memory 
consumption on the machine), and report back if there are any issues 
remaining.


Andras


On 01/17/2018 02:40 AM, Burkhard Linke wrote:

Hi,


On 01/16/2018 09:50 PM, Andras Pataki wrote:

Dear Cephers,


*snipsnap*




We are running with a larger MDS cache than usual, we have 
mds_cache_size set to 4 million.  All other MDS configs are the 
defaults.


AFAIK the MDS cache management in luminous has changed, focusing on 
memory size instead of number of inodes/caps/whatever.


We had to replace mds_cache_size with mds_cache_memory_limit to get 
mds cache working as expected again. This may also be the cause for 
the issue, since the default configuration uses quite a small cache. 
You can check this with 'ceph daemonperf mds.XYZ' on the mds host.


If you change the memory limit you also need to consider a certain 
overhead of the memory allocation. There was a thread about this on 
the mailing list some weeks ago; you should expect at least 50% 
overhead. As with the previous releases this is not a hard limit. The 
process may consume more memory in certain situations. Given the fact 
that bluestore osds do not use kernel page cache anymore but their own 
memory cache, you need to plan memory consumption of all ceph daemons.


As an example, our mds is configured with mds_cache_memory_limit = 
80 and is consuming about 12 GB memory RSS.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

2018-01-16 Thread Andras Pataki

Dear Cephers,

We've upgraded the back end of our cluster from Jewel (10.2.10) to 
Luminous (12.2.2).  The upgrade went smoothly for the most part, except 
we seem to be hitting an issue with cephfs.  After about a day or two of 
use, the MDS start complaining about clients failing to respond to cache 
pressure:


   [root@cephmon00 ~]# ceph -s
  cluster:
    id: d7b33135-0940-4e48-8aa6-1d2026597c2f
    health: HEALTH_WARN
        1 MDSs have many clients failing to respond to cache
   pressure
    noout flag(s) set
    1 osds down

  services:
    mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
    mgr: cephmon00(active), standbys: cephmon01, cephmon02
    mds: cephfs-1/1/1 up  {0=cephmon00=up:active}, 2 up:standby
    osd: 2208 osds: 2207 up, 2208 in
 flags noout

  data:
    pools:   6 pools, 42496 pgs
    objects: 919M objects, 3062 TB
    usage:   9203 TB used, 4618 TB / 13822 TB avail
    pgs: 42470 active+clean
 22    active+clean+scrubbing+deep
 4 active+clean+scrubbing

  io:
    client:   56122 kB/s rd, 18397 kB/s wr, 84 op/s rd, 101 op/s wr

   [root@cephmon00 ~]# ceph health detail
   HEALTH_WARN 1 MDSs have many clients failing to respond to cache
   pressure; noout flag(s) set; 1 osds down
   MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond
   to cache pressure
       mdscephmon00(mds.0): Many clients (103) failing to respond to
   cache pressure client_count: 103
   OSDMAP_FLAGS noout flag(s) set
   OSD_DOWN 1 osds down
    osd.1296 (root=root-disk,pod=pod0-disk,host=cephosd008-disk) is
   down


We are using exclusively the 12.2.2 fuse client on about 350 nodes or so 
(out of which it seems 100 are not responding to cache pressure in this 
log).  When this happens, clients appear pretty sluggish also (listing 
directories, etc.).  After bouncing the MDS, everything returns on 
normal after the failover for a while.  Ignore the message about 1 OSD 
down, that corresponds to a failed drive and all data has been 
re-replicated since.


We were also using the 12.2.2 fuse client with the Jewel back end before 
the upgrade, and have not seen this issue.


We are running with a larger MDS cache than usual, we have 
mds_cache_size set to 4 million.  All other MDS configs are the defaults.


Is this a known issue?  If not, any hints on how to further diagnose the 
problem?


Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous ceph-fuse crashes with "failed to remount for kernel dentry trimming"

2017-11-27 Thread Andras Pataki

Hi Patrick,

We are using the CentOS 7.4 kernel (3.10.0-693.5.2.el7.x86_64).  The 
nodes in question have a fairly large amount of RAM (512GB), I don't see 
any evidence in any logs that the nodes ran out of memory (no OOM 
killer, and we have a small amount of swap that is used to catch memory 
pressure which is completely unused).  I do sometimes see the ceph-fuse 
processes grow in size up towards 20-30GB of RSS (due to the memory bug 
that has a fix on the way), but even then, the nodes are far from out of 
memory.


I'll set some closer memory monitoring up for the next crash to be 
definite about it.


Andras


On 11/27/2017 06:06 PM, Patrick Donnelly wrote:

Hello Andras,

On Mon, Nov 27, 2017 at 2:31 PM, Andras Pataki
<apat...@flatironinstitute.org> wrote:

After upgrading to the Luminous 12.2.1 ceph-fuse client, we've seen clients
on various nodes randomly crash at the assert
 FAILED assert(0 == "failed to remount for kernel dentry trimming")

with the stack:

  ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
(stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x110) [0x5584ad80]
  2: (C_Client_Remount::finish(int)+0xcf) [0x557e7fff]
  3: (Context::complete(int)+0x9) [0x557e3dc9]
  4: (Finisher::finisher_thread_entry()+0x198) [0x55849d18]
  5: (()+0x7e25) [0x760a3e25]
  6: (clone()+0x6d) [0x74f8234d]

What kernel version are you using? We have seen instances of this
error recently. It may be related to [1]. Are you running out of
memory on these machines?

[1] http://tracker.ceph.com/issues/17517



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] luminous ceph-fuse crashes with "failed to remount for kernel dentry trimming"

2017-11-27 Thread Andras Pataki

Dear ceph users,

After upgrading to the Luminous 12.2.1 ceph-fuse client, we've seen 
clients on various nodes randomly crash at the assert

    FAILED assert(0 == "failed to remount for kernel dentry trimming")

with the stack:

 ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
   luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
   const*)+0x110) [0x5584ad80]
 2: (C_Client_Remount::finish(int)+0xcf) [0x557e7fff]
 3: (Context::complete(int)+0x9) [0x557e3dc9]
 4: (Finisher::finisher_thread_entry()+0x198) [0x55849d18]
 5: (()+0x7e25) [0x760a3e25]
 6: (clone()+0x6d) [0x74f8234d]


Any ideas what causes this and what we could do about it?

Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-fuse memory usage

2017-11-22 Thread Andras Pataki

Hello ceph users,

I've seen threads about Luminous OSDs using more memory than they should 
due to some memory accounting bugs.  Does this issue apply to ceph-fuse 
also?


After upgrading to the latest ceph-fuse luminous client (12.2.1), we see 
some ceph-fuse processes using excessive memory.  We are using all the 
defaults for client memory sizes, except increased client_oc_size to 
2GB.  Yet some ceph-fuse processes have RSS growing to 16GB and even 
more.  Any ideas?
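
For reference, the only non-default client cache setting, as a 
ceph.conf snippet:

   [client]
   # 2GB object cache; everything else is at its default
   client oc size = 2147483648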


Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: clients hanging on write with ceph-fuse

2017-11-03 Thread Andras Pataki
I've tested  the 12.2.1 fuse client - and it also reproduces the problem 
unfortunately.  Investigating the code that accesses the file system, it 
looks like multiple processes from multiple nodes write to the same file 
concurrently, but to different byte ranges of it.  Unfortunately the 
problem happens some hours into the run of the code, so I can't really 
run the MDS or fuse with a very high debug level for that long.  Well, 
perhaps I could run fuse with a higher debug level on the nodes in 
question if that helps.


Andras


On 11/03/2017 12:29 AM, Gregory Farnum wrote:

Either ought to work fine.

On Thu, Nov 2, 2017 at 4:58 PM Andras Pataki 
<apat...@flatironinstitute.org <mailto:apat...@flatironinstitute.org>> 
wrote:


I'm planning to test the newer ceph-fuse tomorrow.  Would it be
better to stay with the Jewel 10.2.10 client, or would the 12.2.1
Luminous client be better (even though the back-end is Jewel for now)?


Andras



On 11/02/2017 05:54 PM, Gregory Farnum wrote:

Have you tested on the new ceph-fuse? This does sound vaguely
familiar and is an issue I'd generally expect to have the fix
backported for, once it was identified.

On Thu, Nov 2, 2017 at 11:40 AM Andras Pataki
<apat...@flatironinstitute.org
<mailto:apat...@flatironinstitute.org>> wrote:

We've been running into a strange problem with Ceph using
ceph-fuse and
the filesystem. All the back end nodes are on 10.2.10, the
fuse clients
are on 10.2.7.

After some hours of runs, some processes get stuck waiting
for fuse like:

[root@worker1144 ~]# cat /proc/58193/stack
[] wait_answer_interruptible+0x91/0xe0 [fuse]
[] __fuse_request_send+0x253/0x2c0 [fuse]
[] fuse_request_send+0x12/0x20 [fuse]
[] fuse_send_write+0xd6/0x110 [fuse]
[] fuse_perform_write+0x2f5/0x5a0 [fuse]
[] fuse_file_aio_write+0x2a1/0x340 [fuse]
[] do_sync_write+0x8d/0xd0
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x7f/0xe0
[] system_call_fastpath+0x16/0x1b
[] 0x

The cluster is healthy (all OSDs up, no slow requests,
etc.).  More
details of my investigation efforts are in the bug report I
just submitted:
http://tracker.ceph.com/issues/22008

It looks like the fuse client is asking for some caps that it
never
thinks it receives from the MDS, so the thread waiting for
those caps on
behalf of the writing client never wakes up.  The restart of
the MDS
fixes the problem (since ceph-fuse re-negotiates caps).

Any ideas/suggestions?

Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: clients hanging on write with ceph-fuse

2017-11-02 Thread Andras Pataki
I'm planning to test the newer ceph-fuse tomorrow.  Would it be better 
to stay with the Jewel 10.2.10 client, or would the 12.2.1 Luminous 
client be better (even though the back-end is Jewel for now)?


Andras


On 11/02/2017 05:54 PM, Gregory Farnum wrote:
Have you tested on the new ceph-fuse? This does sound vaguely familiar 
and is an issue I'd generally expect to have the fix backported for, 
once it was identified.


On Thu, Nov 2, 2017 at 11:40 AM Andras Pataki 
<apat...@flatironinstitute.org <mailto:apat...@flatironinstitute.org>> 
wrote:


We've been running into a strange problem with Ceph using
ceph-fuse and
the filesystem. All the back end nodes are on 10.2.10, the fuse
clients
are on 10.2.7.

After some hours of runs, some processes get stuck waiting for
fuse like:

[root@worker1144 ~]# cat /proc/58193/stack
[] wait_answer_interruptible+0x91/0xe0 [fuse]
[] __fuse_request_send+0x253/0x2c0 [fuse]
[] fuse_request_send+0x12/0x20 [fuse]
[] fuse_send_write+0xd6/0x110 [fuse]
[] fuse_perform_write+0x2f5/0x5a0 [fuse]
[] fuse_file_aio_write+0x2a1/0x340 [fuse]
[] do_sync_write+0x8d/0xd0
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x7f/0xe0
[] system_call_fastpath+0x16/0x1b
[] 0x

The cluster is healthy (all OSDs up, no slow requests, etc.). More
details of my investigation efforts are in the bug report I just
submitted:
http://tracker.ceph.com/issues/22008

It looks like the fuse client is asking for some caps that it never
thinks it receives from the MDS, so the thread waiting for those
caps on
behalf of the writing client never wakes up.  The restart of the MDS
fixes the problem (since ceph-fuse re-negotiates caps).

Any ideas/suggestions?

Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS: clients hanging on write with ceph-fuse

2017-11-02 Thread Andras Pataki
We've been running into a strange problem with Ceph using ceph-fuse and 
the filesystem. All the back end nodes are on 10.2.10, the fuse clients 
are on 10.2.7.


After some hours of runs, some processes get stuck waiting for fuse like:

[root@worker1144 ~]# cat /proc/58193/stack
[] wait_answer_interruptible+0x91/0xe0 [fuse]
[] __fuse_request_send+0x253/0x2c0 [fuse]
[] fuse_request_send+0x12/0x20 [fuse]
[] fuse_send_write+0xd6/0x110 [fuse]
[] fuse_perform_write+0x2f5/0x5a0 [fuse]
[] fuse_file_aio_write+0x2a1/0x340 [fuse]
[] do_sync_write+0x8d/0xd0
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x7f/0xe0
[] system_call_fastpath+0x16/0x1b
[] 0x

The cluster is healthy (all OSDs up, no slow requests, etc.).  More 
details of my investigation efforts are in the bug report I just submitted:

    http://tracker.ceph.com/issues/22008

It looks like the fuse client is asking for some caps that it never 
thinks it receives from the MDS, so the thread waiting for those caps on 
behalf of the writing client never wakes up.  The restart of the MDS 
fixes the problem (since ceph-fuse re-negotiates caps).
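
(For anyone wanting to poke at a hung client, the ceph-fuse admin 
socket can be queried while it is stuck - the socket path here is just 
an example:

   ceph daemon /var/run/ceph/ceph-client.admin.asok mds_sessions
   ceph daemon /var/run/ceph/ceph-client.admin.asok mds_requests

which shows the session state and any MDS requests the client still 
thinks are in flight.)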


Any ideas/suggestions?

Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore disk colocation using NVRAM, SSD and SATA

2017-09-20 Thread Andras Pataki
Is there any guidance on the sizes for the WAL and DB devices when they 
are separated to an SSD/NVMe?  I understand that probably there isn't a 
one size fits all number, but perhaps something as a function of 
cluster/usage parameters like OSD size and usage pattern (amount of 
writes, number/size of objects, etc.)?
Also, once numbers are chosen and the OSD is in use, is there a way to 
tell what portion of these spaces are used?
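
(Is looking at the bluefs counters on the admin socket the right way to 
do that, e.g. something like:

   ceph daemon osd.0 perf dump

and then reading db_total_bytes/db_used_bytes and 
wal_total_bytes/wal_used_bytes from the "bluefs" section - or is there 
a better tool for it?)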


Thanks,

Andras


On 09/20/2017 05:36 PM, Nigel Williams wrote:
On 21 September 2017 at 04:53, Maximiliano Venesio wrote:


Hi guys i'm reading different documents about bluestore, and it
never recommends to use NVRAM to store the bluefs db, nevertheless
the official documentation says that, is better to use the faster
device to put the block.db in.


Likely not mentioned since no one yet has had the opportunity to test 
it.


So how do i have to deploy using bluestore, regarding where i
should put block.wal and block.db ?


block.* would be best on your NVRAM device, like this:

ceph-deploy osd create --bluestore c0osd-136:/dev/sda --block-wal 
/dev/nvme0n1 --block-db /dev/nvme0n1





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous OSD startup errors

2017-08-15 Thread Andras Pataki
Thanks for the quick response and the pointer.  The dev build fixed the 
issue.


Andras


On 08/15/2017 09:19 AM, Jason Dillaman wrote:

I believe this is a known issue [1] and that there will potentially be
a new 12.1.4 RC released because of it. The tracker ticket has a link
to a set of development packages that should resolve the issue in the
meantime.


[1] http://tracker.ceph.com/issues/20985

On Tue, Aug 15, 2017 at 9:08 AM, Andras Pataki
<apat...@flatironinstitute.org> wrote:

After upgrading to the latest Luminous RC (12.1.3), all our OSD's are
crashing with the following assert:

  0> 2017-08-15 08:28:49.479238 7f9b7615cd00 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h:
In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t,
coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&,
bool, std::ostringstream&, bool, bool*, const DoutPrefixProvider*,
std::set<std::basic_string >*, bool) [with missing_type =
pg_missing_set; std::ostringstream = std::basic_ostringstream]'
thread 7f9b7615cd00 time 2017-08-15 08:28:49.477367
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h:
1301: FAILED assert(force_rebuild_missing)

  ceph version 12.1.3 (c56d9c07b342c08419bbc18dcf2a4c5fae62b9cf) luminous
(rc)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x110) [0x55d0f2be3b50]
  2: (void PGLog::read_log_and_missing<pg_missing_set >(ObjectStore*,
coll_t, coll_t, ghobject_t, pg_info_t const&, PGLog::IndexedLog&,
pg_missing_set&, bool, std::basic_ostringstream<char,
std::char_traits, std::allocator >&, bool, bool*,
DoutPrefixProvider const*, std::set<std::string, std::less,
std::allocator >*, bool)+0x773) [0x55d0f276f013]
  3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x52b)
[0x55d0f272739b]
  4: (OSD::load_pgs()+0x97a) [0x55d0f2673dea]
  5: (OSD::init()+0x2179) [0x55d0f268c319]
  6: (main()+0x2def) [0x55d0f2591ccf]
  7: (__libc_start_main()+0xf5) [0x7f9b727d6b35]
  8: (()+0x4ac006) [0x55d0f2630006]

Looking at the code in PGLog.h, the change from 12.1.2 to 12.1.3 (in
read_log_missing) was:

 if (p->key() == "divergent_priors") {
   ::decode(divergent_priors, bp);
   ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size()
                      << " divergent_priors" << dendl;
   has_divergent_priors = true;
   debug_verify_stored_missing = false;

to

 if (p->key() == "divergent_priors") {
   ::decode(divergent_priors, bp);
   ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size()
                      << " divergent_priors" << dendl;
   assert(force_rebuild_missing);
   debug_verify_stored_missing = false;

and it seems like force_rebuild_missing is not being set.

This cluster was upgraded from Jewel to 12.1.1, then 12.1.2 and now 12.1.3.
So it seems something didn't happen correctly during the upgrade.  Any ideas
how to fix it?

Andras


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous OSD startup errors

2017-08-15 Thread Andras Pataki
After upgrading to the latest Luminous RC (12.1.3), all our OSD's are 
crashing with the following assert:


 0> 2017-08-15 08:28:49.479238 7f9b7615cd00 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h: 
In function 'static void PGLog::read_log_and_missing(ObjectStore*, 
coll_t, coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, 
missing_type&, bool, std::ostringstream&, bool, bool*, const 
DoutPrefixProvider*, std::set*, bool) [with 
missing_type = pg_missing_set; std::ostringstream = 
std::basic_ostringstream]' thread 7f9b7615cd00 time 2017-08-15 
08:28:49.477367
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h: 
1301: FAILED assert(force_rebuild_missing)


 ceph version 12.1.3 (c56d9c07b342c08419bbc18dcf2a4c5fae62b9cf) 
luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x110) [0x55d0f2be3b50]
 2: (void PGLog::read_log_and_missing(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, 
PGLog::IndexedLog&, pg_missing_set&, bool, 
std::basic_ostringstream&, bool, bool*, DoutPrefixProvider const*, 
std::set*, bool)+0x773) [0x55d0f276f013]
 3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x52b) 
[0x55d0f272739b]

 4: (OSD::load_pgs()+0x97a) [0x55d0f2673dea]
 5: (OSD::init()+0x2179) [0x55d0f268c319]
 6: (main()+0x2def) [0x55d0f2591ccf]
 7: (__libc_start_main()+0xf5) [0x7f9b727d6b35]
 8: (()+0x4ac006) [0x55d0f2630006]

Looking at the code in PGLog.h, the change from 12.1.2 to 12.1.3 (in 
read_log_missing) was:


if (p->key() == "divergent_priors") {
  ::decode(divergent_priors, bp);
  ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size()
                     << " divergent_priors" << dendl;
  has_divergent_priors = true;
  debug_verify_stored_missing = false;

to

if (p->key() == "divergent_priors") {
  ::decode(divergent_priors, bp);
  ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size()
                     << " divergent_priors" << dendl;
  assert(force_rebuild_missing);
  debug_verify_stored_missing = false;

and it seems like force_rebuild_missing is not being set.

This cluster was upgraded from Jewel to 12.1.1, then 12.1.2 and now 
12.1.3.  So it seems something didn't happen correctly during the 
upgrade.  Any ideas how to fix it?


Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: concurrent access to the same file from multiple nodes

2017-08-07 Thread Andras Pataki

I've filed a tracker bug for this: http://tracker.ceph.com/issues/20938

Andras


On 08/01/2017 10:26 AM, Andras Pataki wrote:

Hi John,

Sorry for the delay, it took a bit of work to set up a luminous test 
environment.  I'm sorry to have to report that the 12.1.1 RC version 
also suffers from this problem - when two nodes open the same file for 
read/write, and read from it, the performance is awful (under 1 
operation/second).  The behavior is exactly the same as with the 
latest Jewel.


I'm running an all luminous setup (12.1.1 mon/mds/osds and fuse 
client).  My original mail has a small test program that easily 
reproduces the issue.  Let me know if there is anything I can help 
with for tracking the issue down further.


Andras


On 07/21/2017 05:41 AM, John Spray wrote:

On Thu, Jul 20, 2017 at 9:19 PM, Andras Pataki
<apat...@flatironinstitute.org> wrote:
We are having some difficulties with cephfs access to the same file 
from
multiple nodes concurrently.  After debugging some large-ish 
applications
with noticeable performance problems using CephFS (with the fuse 
client), I

have a small test program to reproduce the problem.

The core of the problem boils down to the following operation being 
run on

the same file on multiple nodes (in a loop in the test program):

 int fd = open(filename, mode);
 read(fd, buffer, 100);
 close(fd);

Here are some results on our cluster:

One node, mode=read-only: 7000 opens/second
One node, mode=read-write: 7000 opens/second
Two nodes, mode=read-only: 7000 opens/second/node
Two nodes, mode=read-write: around 0.5 opens/second/node (!!!)
Two nodes, one read-only, one read-write: around 0.5 
opens/second/node (!!!)
Two nodes, mode=read-write, but remove the 'read(fd, buffer,100)' 
line from

the code: 500 opens/second/node


So there seems to be some problems with opening the same file 
read/write and

reading from the file on multiple nodes.  That operation seems to be 3
orders of magnitude slower than other parallel access patterns to 
the same
file.  The 1 second time to open files almost seems like some 
timeout is

happening somewhere.  I have some suspicion that this has to do with
capability management between the fuse client and the MDS, but I 
don't know

enough about that protocol to make an educated assessment.

You're pretty much spot on.  Things happening at 0.5 per second is
characteristic of a particular class of bug where we are not flushing
the journal soon enough, and instead waiting for the next periodic
(every five second) flush.  Hence there is an average 2.5 second dely,
hence operations happening at approximately half an operation per
second.


[And an aside - how does this become a problem?  I.e. why open a file
read/write and read from it?  Well, it turns out gfortran compiled 
code does

this by default if the user doesn't explicitly says otherwise].

All the nodes in this test are very lightly loaded, so there does 
not seems
to be any noticeable performance bottleneck (network, CPU, etc.).  
The code
to reproduce the problem is attached.  Simply compile it, create a 
test file
with a few bytes of data in it, and run the test code on two 
separate nodes

on the same file.

We are running ceph 10.2.9 both on the server, and we use the 10.2.9 
fuse

client on the client nodes.

Any input/help would be greatly appreciated.

If you have a test/staging environment, it would be great if you could
re-test this on the 12.1.1 release candidate.  There have been MDS
fixes for similar slowdowns that were shown up in multi-mds testing,
so it's possible that the issue you're seeing here was fixed along the
way.

John


Andras


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: concurrent access to the same file from multiple nodes

2017-08-01 Thread Andras Pataki

Hi John,

Sorry for the delay, it took a bit of work to set up a luminous test 
environment.  I'm sorry to have to report that the 12.1.1 RC version 
also suffers from this problem - when two nodes open the same file for 
read/write, and read from it, the performance is awful (under 1 
operation/second).  The behavior is exactly the same as with the latest 
Jewel.


I'm running an all luminous setup (12.1.1 mon/mds/osds and fuse 
client).  My original mail has a small test program that easily 
reproduces the issue.  Let me know if there is anything I can help with 
for tracking the issue down further.


Andras


On 07/21/2017 05:41 AM, John Spray wrote:

On Thu, Jul 20, 2017 at 9:19 PM, Andras Pataki
<apat...@flatironinstitute.org> wrote:

We are having some difficulties with cephfs access to the same file from
multiple nodes concurrently.  After debugging some large-ish applications
with noticeable performance problems using CephFS (with the fuse client), I
have a small test program to reproduce the problem.

The core of the problem boils down to the following operation being run on
the same file on multiple nodes (in a loop in the test program):

 int fd = open(filename, mode);
 read(fd, buffer, 100);
 close(fd);

Here are some results on our cluster:

One node, mode=read-only: 7000 opens/second
One node, mode=read-write: 7000 opens/second
Two nodes, mode=read-only: 7000 opens/second/node
Two nodes, mode=read-write: around 0.5 opens/second/node (!!!)
Two nodes, one read-only, one read-write: around 0.5 opens/second/node (!!!)
Two nodes, mode=read-write, but remove the 'read(fd, buffer,100)' line from
the code: 500 opens/second/node


So there seems to be some problems with opening the same file read/write and
reading from the file on multiple nodes.  That operation seems to be 3
orders of magnitude slower than other parallel access patterns to the same
file.  The 1 second time to open files almost seems like some timeout is
happening somewhere.  I have some suspicion that this has to do with
capability management between the fuse client and the MDS, but I don't know
enough about that protocol to make an educated assessment.

You're pretty much spot on.  Things happening at 0.5 per second is
characteristic of a particular class of bug where we are not flushing
the journal soon enough, and instead waiting for the next periodic
(every five second) flush.  Hence there is an average 2.5 second dely,
hence operations happening at approximately half an operation per
second.


[And an aside - how does this become a problem?  I.e. why open a file
read/write and read from it?  Well, it turns out gfortran compiled code does
this by default if the user doesn't explicitly says otherwise].

All the nodes in this test are very lightly loaded, so there does not seems
to be any noticeable performance bottleneck (network, CPU, etc.).  The code
to reproduce the problem is attached.  Simply compile it, create a test file
with a few bytes of data in it, and run the test code on two separate nodes
on the same file.

We are running ceph 10.2.9 both on the server, and we use the 10.2.9 fuse
client on the client nodes.

Any input/help would be greatly appreciated.

If you have a test/staging environment, it would be great if you could
re-test this on the 12.1.1 release candidate.  There have been MDS
fixes for similar slowdowns that were shown up in multi-mds testing,
so it's possible that the issue you're seeing here was fixed along the
way.

John


Andras


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS: concurrent access to the same file from multiple nodes

2017-07-20 Thread Andras Pataki
We are having some difficulties with cephfs access to the same file from 
multiple nodes concurrently.  After debugging some large-ish 
applications with noticeable performance problems using CephFS (with the 
fuse client), I have a small test program to reproduce the problem.


The core of the problem boils down to the following operation being run 
on the same file on multiple nodes (in a loop in the test program):


int fd = open(filename, mode);
read(fd, buffer, 100);
close(fd);

Here are some results on our cluster:

 * One node, mode=read-only: 7000 opens/second
 * One node, mode=read-write: 7000 opens/second
 * Two nodes, mode=read-only: 7000 opens/second/node
 * Two nodes, mode=read-write: around *0.5 opens/second/node* (!!!)
 * Two nodes, one read-only, one read-write: around *0.5
   opens/second/node* (!!!)
 * Two nodes, mode=read-write, but remove the 'read(fd, buffer,100)'
   line from the code: 500 opens/second/node


So there seem to be some problems with opening the same file read/write 
and reading from the file on multiple nodes.  That operation seems to be 
3 orders of magnitude slower than other parallel access patterns to the 
same file.  The 1 second time to open files almost seems like some 
timeout is happening somewhere.  I have some suspicion that this has to 
do with capability management between the fuse client and the MDS, but I 
don't know enough about that protocol to make an educated assessment.
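
For reference, the fuse client's view of its MDS sessions, caps and pending 
requests can be inspected through its admin socket (adjust the asok path to 
your setup):

   ceph daemon /var/run/ceph/ceph-client.admin.asok mds_sessions
   ceph daemon /var/run/ceph/ceph-client.admin.asok mds_requests
   ceph daemon /var/run/ceph/ceph-client.admin.asok status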


[And an aside - how does this become a problem?  I.e. why open a file 
read/write and read from it?  Well, it turns out gfortran-compiled code 
does this by default if the user doesn't explicitly say otherwise].


All the nodes in this test are very lightly loaded, so there does not 
seem to be any noticeable performance bottleneck (network, CPU, etc.).  
The code to reproduce the problem is attached.  Simply compile it, 
create a test file with a few bytes of data in it, and run the test code 
on two separate nodes on the same file.
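
For example, assuming the attached source is saved as timed_openrw_read.c 
(the exact paths here are of course just placeholders):

   gcc -O2 -o timed_openrw_read timed_openrw_read.c
   echo "some test data" > /mnt/ceph/test/TEST-READ.txt
   # then on each of two different client nodes:
   ./timed_openrw_read /mnt/ceph/test/TEST-READ.txt rw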


We are running ceph 10.2.9 on the server side, and we use the 10.2.9 
fuse client on the client nodes.


Any input/help would be greatly appreciated.

Andras

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>

#define INTERVAL 2


double now()
{
struct timeval tv;
gettimeofday(&tv, NULL);

return tv.tv_sec + tv.tv_usec / 1e6;
}


int main(int argc, char *argv[])
{
if (argc != 3) {
fprintf(stderr, "Usage: %s <filename> r|rw\n", argv[0]);
exit(1);
}

const char *filename = argv[1];

int mode = 0;
if (strcmp(argv[2], "r") == 0) {
mode = O_RDONLY;
} else if (strcmp(argv[2], "rw") == 0) {
mode = O_RDWR;
} else {
fprintf(stderr, "Second argument must be 'r' or 'rw'\n");
exit(1);
}

while (1) {

char buffer[100];
double t0 = now();
double dt;
int count = 0;

while (1) {
dt = now() - t0;
if (dt > INTERVAL) {
break;
}

int fd = open(filename, mode);
if (fd < 0) {
printf("Could not open file '%s' for read/write", filename);
exit(1);
}

read(fd, buffer, 100);

close(fd);
count++;
}

printf("File open rate: %8.2f\n", count / dt);

}

return 0;
}
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Degraded objects while OSD is being added/filled

2017-07-20 Thread Andras Pataki

Hi Greg,

I have just now added a single drive/osd to a clean cluster, and can see 
the degradation immediately.  We are on ceph 10.2.9 everywhere.


Here is how the cluster looked before the OSD got added:

cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
 health HEALTH_WARN
noout flag(s) set
 monmap e31: 3 mons at
   
{cephmon00=10.128.128.100:6789/0,cephmon01=10.128.128.101:6789/0,cephmon02=10.128.128.102:6789/0}
election epoch 46092, quorum 0,1,2
   cephmon00,cephmon01,cephmon02
  fsmap e26638: 1/1/1 up {0=cephmon01=up:active}, 2 up:standby
 osdmap e681227: 1270 osds: 1270 up, 1270 in
flags noout,sortbitwise,require_jewel_osds
  pgmap v54583934: 42496 pgs, 6 pools, 1488 TB data, 437 Mobjects
4471 TB used, 3416 TB / 7887 TB avail
   42491 active+clean
   5 active+clean+scrubbing+deep
  client io 2193 kB/s rd, 27240 kB/s wr, 85 op/s rd, 47 op/s wr


And this is shortly after it was added (after all the peering was done):

cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
 health HEALTH_WARN
141 pgs backfill_wait
117 pgs backfilling
20 pgs degraded
20 pgs recovery_wait
56 pgs stuck unclean
recovery 130/1376744346 objects degraded (0.000%)
recovery 3827502/1376744346 objects misplaced (0.278%)
noout flag(s) set
 monmap e31: 3 mons at
   
{cephmon00=10.128.128.100:6789/0,cephmon01=10.128.128.101:6789/0,cephmon02=10.128.128.102:6789/0}
election epoch 46092, quorum 0,1,2
   cephmon00,cephmon01,cephmon02
  fsmap e26638: 1/1/1 up {0=cephmon01=up:active}, 2 up:standby
 osdmap e681238: 1271 osds: 1271 up, 1271 in; 258 remapped pgs
flags noout,sortbitwise,require_jewel_osds
  pgmap v54585141: 42496 pgs, 6 pools, 1488 TB data, 437 Mobjects
4471 TB used, 3423 TB / 7895 TB avail
   *130/1376744346 objects degraded (0.000%)*
3827502/1376744346 objects misplaced (0.278%)
   42210 active+clean
 141 active+remapped+wait_backfill
 117 active+remapped+backfilling
   *  20 active+recovery_wait+degraded*
   7 active+clean+scrubbing+deep
   1 active+clean+scrubbing
   recovery io 17375 MB/s, 5069 objects/s
  client io 12210 kB/s rd, 29887 kB/s wr, 4 op/s rd, 140 op/s wr


Even though there was no failure, we have 20 degraded PGs and 130 
degraded objects.  My expectation was that some data would move around and 
start filling the added drive, but I would not expect to see degraded 
objects or PGs.


Also, as time passes, the number of degraded objects increases steadily, 
here is a snapshot a little later:


cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
 health HEALTH_WARN
63 pgs backfill_wait
4 pgs backfilling
67 pgs stuck unclean
recovery 706/1377244134 objects degraded (0.000%)
recovery 843267/1377244134 objects misplaced (0.061%)
noout flag(s) set
 monmap e31: 3 mons at
   
{cephmon00=10.128.128.100:6789/0,cephmon01=10.128.128.101:6789/0,cephmon02=10.128.128.102:6789/0}
election epoch 46092, quorum 0,1,2
   cephmon00,cephmon01,cephmon02
  fsmap e26640: 1/1/1 up {0=cephmon01=up:active}, 2 up:standby
 osdmap e681569: 1271 osds: 1271 up, 1271 in; 67 remapped pgs
flags noout,sortbitwise,require_jewel_osds
  pgmap v54588554: 42496 pgs, 6 pools, 1488 TB data, 437 Mobjects
4471 TB used, 3423 TB / 7895 TB avail
   *706/1377244134 objects degraded (0.000%)*
843267/1377244134 objects misplaced (0.061%)
   42422 active+clean
  63 active+remapped+wait_backfill
   5 active+clean+scrubbing+deep
   4 active+remapped+backfilling
   2 active+clean+scrubbing
   recovery io 779 MB/s, 229 objects/s
  client io 306 MB/s rd, 344 MB/s wr, 138 op/s rd, 226 op/s wr

From past experience, the degraded object count keeps going up for most 
of the time the disk is being filled; towards the end it decreases.  Could 
writing to a pool that is waiting for backfilling be causing degraded 
objects to appear?
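
In case it helps, this is roughly how I watch it while the new OSD fills 
(nothing fancy, just the standard views, so I may well be missing a better 
tool for this):

   ceph -s
   ceph health detail | grep -i degraded
   # PGs whose state currently includes "degraded":
   ceph pg dump pgs_brief | grep degraded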


I took a 'pg dump' before and after the change, as well as an 'osd tree' 
before and after.  All these are available at 
http://voms.simonsfoundation.org:50013/m1Maf76sV1kS95spXQpijycmne92yjm/ceph-20170720/


All pools are now with replicated size 3 and min size 2. Let me know if 
any other info would be helpful.


Andras


On 07/06/2017 02:30 PM, Andras Pataki wrote:

Hi Greg,

At the moment our cluster is all in balance.  We have one

Re: [ceph-users] Degraded objects while OSD is being added/filled

2017-07-06 Thread Andras Pataki

Hi Greg,

At the moment our cluster is all in balance.  We have one failed drive 
that will be replaced in a few days (the OSD has been removed from ceph 
and will be re-added with the replacement drive).  I'll document the 
state of the PGs before the addition of the drive and during the 
recovery process and report back.


We have a few pools, most are on 3 replicas now, some with non-critical 
data that we have elsewhere are on 2.  But I've seen the degradation 
even on the 3 replica pools (I think in my original example there was an 
example of such a pool as well).


Andras


On 06/30/2017 04:38 PM, Gregory Farnum wrote:
On Wed, Jun 21, 2017 at 6:57 AM Andras Pataki 
<apat...@flatironinstitute.org <mailto:apat...@flatironinstitute.org>> 
wrote:


Hi cephers,

I noticed something I don't understand about ceph's behavior when
adding an OSD.  When I start with a clean cluster (all PG's
active+clean) and add an OSD (via ceph-deploy for example), the
crush map gets updated and PGs get reassigned to different OSDs,
and the new OSD starts getting filled with data.  As the new OSD
gets filled, I start seeing PGs in degraded states.  Here is an
example:

  pgmap v52068792: 42496 pgs, 6 pools, 1305 TB data, 390
Mobjects
3164 TB used, 781 TB / 3946 TB avail
*8017/994261437 objects degraded (0.001%)*
2220581/994261437 objects misplaced (0.223%)
   42393 active+clean
  91 active+remapped+wait_backfill
   9 active+clean+scrubbing+deep
*   1 active+recovery_wait+degraded*
   1 active+clean+scrubbing
   1 active+remapped+backfilling


Any ideas why there would be any persistent degradation in the
cluster while the newly added drive is being filled? It takes
perhaps a day or two to fill the drive - and during all this time
the cluster seems to be running degraded.  As data is written to
the cluster, the number of degraded objects increases over time. 
Once the newly added OSD is filled, the cluster comes back to

clean again.

Here is the PG that is degraded in this picture:

7.87c10200419430477
active+recovery_wait+degraded2017-06-20 14:12:44.119921   
344610'7583572:2797[402,521] 402[402,521]402   
344610'72017-06-16 06:04:55.822503344610'72017-06-16

06:04:55.822503

The newly added osd here is 521.  Before it got added, this PG had
two replicas clean, but one got forgotten somehow?


This sounds a bit concerning at first glance. Can you provide some 
output of exactly what commands you're invoking, and the "ceph -s" 
output as it changes in response?


I really don't see how adding a new OSD can result in it "forgetting" 
about existing valid copies — it's definitely not supposed to — so I 
wonder if there's a collision in how it's deciding to remove old 
locations.


Are you running with only two copies of your data? It shouldn't matter 
but there could also be errors resulting in a behavioral difference 
between two and three copies.

-Greg


Other remapped PGs have 521 in their "up" set but still have the
two existing copies in their "acting" set - and no degradation is
shown.  Examples:

2.f241428201628564051014850801 3102   
3102active+remapped+wait_backfill 2017-06-20
14:12:42.650308583553'2033479 583573:2033266[467,521]   
467[467,499]467 582430'2072017-06-16

09:08:51.055131 582036'20308372017-05-31 20:37:54.831178
6.2b7d104990140209980 372428746873673   
3673 active+remapped+wait_backfill2017-06-20
14:12:42.070019583569'165163583572:342128 [541,37,521]   
541[541,37,532]541 582430'1618902017-06-18

09:42:49.148402 582430'1618902017-06-18 09:42:49.148402

We are running the latest Jewel patch level everywhere (10.2.7). 
Any insights would be appreciated.


Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Degraded objects while OSD is being added/filled

2017-06-21 Thread Andras Pataki

Hi cephers,

I noticed something I don't understand about ceph's behavior when adding 
an OSD.  When I start with a clean cluster (all PG's active+clean) and 
add an OSD (via ceph-deploy for example), the crush map gets updated and 
PGs get reassigned to different OSDs, and the new OSD starts getting 
filled with data.  As the new OSD gets filled, I start seeing PGs in 
degraded states.  Here is an example:


  pgmap v52068792: 42496 pgs, 6 pools, 1305 TB data, 390 Mobjects
3164 TB used, 781 TB / 3946 TB avail
   *8017/994261437 objects degraded (0.001%)*
2220581/994261437 objects misplaced (0.223%)
   42393 active+clean
  91 active+remapped+wait_backfill
   9 active+clean+scrubbing+deep
   *   1 active+recovery_wait+degraded*
   1 active+clean+scrubbing
   1 active+remapped+backfilling


Any ideas why there would be any persistent degradation in the cluster 
while the newly added drive is being filled?  It takes perhaps a day or 
two to fill the drive - and during all this time the cluster seems to be 
running degraded.  As data is written to the cluster, the number of 
degraded objects increases over time.  Once the newly added OSD is 
filled, the cluster comes back to clean again.


Here is the PG that is degraded in this picture:

7.87c10200419430477 
active+recovery_wait+degraded2017-06-20 14:12:44.119921 344610'7
583572:2797[402,521]402[402,521]402 344610'7
2017-06-16 06:04:55.822503344610'72017-06-16 06:04:55.822503


The newly added osd here is 521.  Before it got added, this PG had two 
replicas clean, but one got forgotten somehow?
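
The state of that PG can be dug into further with the usual query commands 
(listed here in case anyone wants the same view):

   ceph pg map 7.87c
   ceph pg 7.87c query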


Other remapped PGs have 521 in their "up" set but still have the two 
existing copies in their "acting" set - and no degradation is shown.  
Examples:


2.f2414282016285640510148508013102 3102
active+remapped+wait_backfill2017-06-20 14:12:42.650308
583553'2033479583573:2033266[467,521] 467[467,499]467
582430'2072017-06-16 09:08:51.055131582036'2030837
2017-05-31 20:37:54.831178
6.2b7d104990140209980372428746873673 3673
active+remapped+wait_backfill2017-06-20 14:12:42.070019
583569'165163583572:342128[541,37,521] 541[541,37,532]
541582430'1618902017-06-18 09:42:49.148402582430'161890
2017-06-18 09:42:49.148402


We are running the latest Jewel patch level everywhere (10.2.7). Any 
insights would be appreciated.


Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fsping, why you no work no mo?

2017-04-13 Thread Andras Pataki

Hi Dan,

I don't have a solution to the problem; I can only second that we've 
also been seeing strange problems when more than one node accesses the 
same file in ceph and at least one of them opens it for writing.  I've 
tried verbose logging on the client (fuse), and it seems that the fuse 
client sends some cap request to the MDS and does not get a response 
sometimes.  And it looks like it has some 5 second polling interval, and 
that sometimes (but not always) saves the day and the client continues 
with a 5 second-ish delay.  This does not happen when multiple processes 
open the file for reading, but it does when processes open it for 
writing (even if they never write to the file and only read 
afterwards).  I have some earlier mailing list messages from a week or 
two ago describing in more detail what we see (including log outputs).  
I think the issue is in some way related to cap requests being 
lost/miscommunicated between the client and the MDS.
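
In case it is useful to anyone, the client-side logging can be turned up on 
the fly via the ceph-fuse admin socket (adjust the asok path to your client):

   ceph daemon /var/run/ceph/ceph-client.admin.asok config set debug_client 20
   # and back down afterwards:
   ceph daemon /var/run/ceph/ceph-client.admin.asok config set debug_client 1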


Andras


On 04/13/2017 01:41 PM, Dan van der Ster wrote:

Dear ceph-*,

A couple weeks ago I wrote this simple tool to measure the round-trip 
latency of a shared filesystem.


https://github.com/dvanders/fsping

In our case, the tool is to be run from two clients who mount the same 
CephFS.


First, start the server (a.k.a. the ping reflector) on one machine in 
a CephFS directory:


   ./fsping --server

Then, from another client machine and in the same directory, start the 
fsping client (aka the ping emitter):


./fsping --prefix 

The idea is that the "client" writes a syn file, the reflector notices 
it, and writes an ack file. The time for the client to notice the ack 
file is what I call the rtt.
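
In other words, minus fsping's bookkeeping, the round trip being measured 
boils down to something like this (a rough sketch only, not the actual 
implementation):

   # reflector, run in the shared directory (busy-waits; sketch only):
   while true; do [ -f syn ] && { rm -f syn; touch ack; }; done
   # emitter, one round trip:
   t0=$(date +%s.%N); touch syn
   while [ ! -f ack ]; do :; done
   rm -f ack
   echo "rtt: $(echo "$(date +%s.%N) - $t0" | bc) seconds"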


And the output looks like normal ping, so that's neat. (The README.md 
shows a working example)



Anyway, two weeks ago when I wrote this, it was working very well on 
my CephFS clusters (running 10.2.5, IIRC). I was seeing ~20ms rtt for 
small files, which is more or less what I was expecting on my test 
cluster.


But when I run fsping today, it does one of two misbehaviours:

  1. Most of the time it just hangs, both on the reflector and on the 
emitter. The fsping processes are stuck in some uninterruptible state 
-- only an MDS failover breaks them out. I tried with and without 
fuse_disable_pagecache -- no big difference.


  2. When I increase the fsping --size to 512kB, it works a bit more 
reliably. But there is a weird bimodal distribution with most 
"packets" having 20-30ms rtt, some ~20% having ~5-6 seconds rtt, and 
some ~5% taking ~10-11s. I suspected the mds_tick_interval -- but 
decreasing that didn't help.



In summary, if someone is curious, please give this tool a try on your 
CephFS cluster -- let me know if its working or not (and what rtt you 
can achieve with which configuration).
And perhaps a dev would understand why it is not working with latest 
jewel ceph-fuse / ceph MDS's?


Best Regards,

Dan




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS fuse client users stuck

2017-04-06 Thread Andras Pataki

Hi John,

Have you managed to reproduce the test case on your side?  Any hints on 
how to proceed, or is there anything I could help with?  I've been trying to 
understand the protocol between the MDS and the fuse client, but if you 
can point me to any docs on the rationale of what the implementation is 
trying to achieve, that'd be super helpful.


Andras


On 03/31/2017 02:07 PM, Andras Pataki wrote:
Several clients on one node also works well for me (I guess the fuse 
client arbitrates then and the MDS perhaps doesn't need to do so 
much).  So the clients need to be on different nodes for this test to 
fail.


Andras


On 03/31/2017 01:25 PM, John Spray wrote:

On Fri, Mar 31, 2017 at 1:27 PM, Andras Pataki
<apat...@flatironinstitute.org> wrote:

Hi John,

It took a while but I believe now I have a reproducible test case 
for the
capabilities being lost issue in CephFS I wrote about a couple of 
weeks ago.

The quick summary of problem is that often processes hang using CephFS
either for a while or sometimes indefinitely.  The fuse client 
believes it

is waiting for some caps which it does not get from the MDS.

A quick summary of the setup: the whole system is on Jewel 10.2.6.  
We are

using the fuse client on all nodes for the file system.

The full test program is uploaded together with the verbose logs of all
clients and the MDS (at log level 20) to
http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/ 

It essentially runs in a loop opening a file for read/write, reading 
from it
and closing it.  The read/write open is key, if open the file 
read-only, the

problem doesn't happen:

 while (time(NULL) - t0 < INTERVAL) {

 int fd = open(FILENAME, O_RDWR);
 if (fd < 0) {
 printf("Could not open file '%s' for read/write", 
FILENAME);

 exit(1);
 }

 read(fd, buffer, 100);

 close(fd);
 count++;
 }

With INTERVAL = 10 seconds, on one machine with a single process I get
something like 30k opens per 10 seconds - an excellent result.  When 
I run 3
processes on 3 different nodes, I get a few (perhaps 4 or 5) opens 
per 10
seconds per process.  The particular case I collected logs for looks 
like

this:

apataki@worker1108:~/prg/ceph/test$ date; ssh worker1109
~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1110
~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1108
~/prg/ceph/test/timed_openrw_read < /dev/null &
Fri Mar 31 07:33:04 EDT 2017
[1] 53833
[2] 53834
[3] 53835
apataki@worker1108:~/prg/ceph/test$ Number of repeats: 4
Number of repeats: 3
Number of repeats: 5

[1]   Donessh worker1109
~/prg/ceph/test/timed_openrw_read < /dev/null
[2]-  Donessh worker1110
~/prg/ceph/test/timed_openrw_read < /dev/null
[3]+  Donessh worker1108
~/prg/ceph/test/timed_openrw_read < /dev/null
apataki@worker1108:~/prg/ceph/test$

Looking at the client, it looks like there are 5 second pauses 
waiting for
capabilities.  It seems that it doesn't get a response, and perhaps 
at some
frequency it polls back to find out what happened to its 
capabilities.  Here

is an example of such 5 second gaps from the client logs:

2017-03-31 07:33:05.010111 2aaab1a57700 10 client.3849178 waiting 
for caps

need Fr want Fc
2017-03-31 07:33:05.464058 2fd47700 20 client.3849178 trim_cache 
size 4

max 16384
2017-03-31 07:33:06.464180 2fd47700 20 client.3849178 trim_cache 
size 4

max 16384
2017-03-31 07:33:07.464307 2fd47700 20 client.3849178 trim_cache 
size 4

max 16384
2017-03-31 07:33:08.464421 2fd47700 20 client.3849178 trim_cache 
size 4

max 16384
2017-03-31 07:33:09.464554 2fd47700 20 client.3849178 trim_cache 
size 4

max 16384
2017-03-31 07:33:10.464680 2fd47700 10 client.3849178 check_caps on
100024a7b1d.head(faked_ino=0 ref=3 ll_ref=79894 
cap_refs={1024=0,2048=0}

open={3=1} mode=100664 size=23/4194304 mtime=2017-03-30 11:56:38.588478
caps=pAsLsXsFr(0=pAsLsXsFr) objectset[100024a7b1d ts 0/0 objects 0
dirty_or_tx 0] parents=0x5f5d5340 0x5f73ea00) wanted 
pAsxXsxFsxcrwb

used - issued pAsLsXsFr revoking Fsc is_delayed=1

The inode in question is 100024a7b1d which corresponds to the test file
/mnt/ceph/users/apataki/test/TEST-READ.txt
When I grep for this inode in the MDS logs, it also has a gap in the 
7:33:05
to 7:33:10 time.  I've learned some about the caps by trying to 
reproduce
the problem, but I'm afraid I don't understand enough of the MDS 
logic to

see what the problem is there.

The full logs are too large for the mailing list, so I've put them 
here:
http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/ 



Some help/advice with this would very much be appreciated. Thanks in
advance!

Thanks for the reproducer -- I've tried this against master ceph with
several clients on one node and I'm seeing hundreds of repeats

Re: [ceph-users] CephFS fuse client users stuck

2017-03-31 Thread Andras Pataki
Several clients on one node also works well for me (I guess the fuse 
client arbitrates then and the MDS perhaps doesn't need to do so much).  
So the clients need to be on different nodes for this test to fail.


Andras


On 03/31/2017 01:25 PM, John Spray wrote:

On Fri, Mar 31, 2017 at 1:27 PM, Andras Pataki
<apat...@flatironinstitute.org> wrote:

Hi John,

It took a while but I believe now I have a reproducible test case for the
capabilities being lost issue in CephFS I wrote about a couple of weeks ago.
The quick summary of problem is that often processes hang using CephFS
either for a while or sometimes indefinitely.  The fuse client believes it
is waiting for some caps which it does not get from the MDS.

A quick summary of the setup: the whole system is on Jewel 10.2.6.  We are
using the fuse client on all nodes for the file system.

The full test program is uploaded together with the verbose logs of all
clients and the MDS (at log level 20) to
http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/
It essentially runs in a loop opening a file for read/write, reading from it
and closing it.  The read/write open is key, if open the file read-only, the
problem doesn't happen:

 while (time(NULL) - t0 < INTERVAL) {

 int fd = open(FILENAME, O_RDWR);
 if (fd < 0) {
 printf("Could not open file '%s' for read/write", FILENAME);
 exit(1);
 }

 read(fd, buffer, 100);

 close(fd);
 count++;
 }

With INTERVAL = 10 seconds, on one machine with a single process I get
something like 30k opens per 10 seconds - an excellent result.  When I run 3
processes on 3 different nodes, I get a few (perhaps 4 or 5) opens per 10
seconds per process.  The particular case I collected logs for looks like
this:

apataki@worker1108:~/prg/ceph/test$ date; ssh worker1109
~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1110
~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1108
~/prg/ceph/test/timed_openrw_read < /dev/null &
Fri Mar 31 07:33:04 EDT 2017
[1] 53833
[2] 53834
[3] 53835
apataki@worker1108:~/prg/ceph/test$ Number of repeats: 4
Number of repeats: 3
Number of repeats: 5

[1]   Donessh worker1109
~/prg/ceph/test/timed_openrw_read < /dev/null
[2]-  Donessh worker1110
~/prg/ceph/test/timed_openrw_read < /dev/null
[3]+  Donessh worker1108
~/prg/ceph/test/timed_openrw_read < /dev/null
apataki@worker1108:~/prg/ceph/test$

Looking at the client, it looks like there are 5 second pauses waiting for
capabilities.  It seems that it doesn't get a response, and perhaps at some
frequency it polls back to find out what happened to its capabilities.  Here
is an example of such 5 second gaps from the client logs:

2017-03-31 07:33:05.010111 2aaab1a57700 10 client.3849178 waiting for caps
need Fr want Fc
2017-03-31 07:33:05.464058 2fd47700 20 client.3849178 trim_cache size 4
max 16384
2017-03-31 07:33:06.464180 2fd47700 20 client.3849178 trim_cache size 4
max 16384
2017-03-31 07:33:07.464307 2fd47700 20 client.3849178 trim_cache size 4
max 16384
2017-03-31 07:33:08.464421 2fd47700 20 client.3849178 trim_cache size 4
max 16384
2017-03-31 07:33:09.464554 2fd47700 20 client.3849178 trim_cache size 4
max 16384
2017-03-31 07:33:10.464680 2fd47700 10 client.3849178 check_caps on
100024a7b1d.head(faked_ino=0 ref=3 ll_ref=79894 cap_refs={1024=0,2048=0}
open={3=1} mode=100664 size=23/4194304 mtime=2017-03-30 11:56:38.588478
caps=pAsLsXsFr(0=pAsLsXsFr) objectset[100024a7b1d ts 0/0 objects 0
dirty_or_tx 0] parents=0x5f5d5340 0x5f73ea00) wanted pAsxXsxFsxcrwb
used - issued pAsLsXsFr revoking Fsc is_delayed=1

The inode in question is 100024a7b1d which corresponds to the test file
/mnt/ceph/users/apataki/test/TEST-READ.txt
When I grep for this inode in the MDS logs, it also has a gap in the 7:33:05
to 7:33:10 time.  I've learned some about the caps by trying to reproduce
the problem, but I'm afraid I don't understand enough of the MDS logic to
see what the problem is there.

The full logs are too large for the mailing list, so I've put them here:
http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/

Some help/advice with this would very much be appreciated.  Thanks in
advance!

Thanks for the reproducer -- I've tried this against master ceph with
several clients on one node and I'm seeing hundreds of repeats per 10
second interval, although my clients are all on the same node.  I got
a similar result when setting the client to inject 10ms delays on
messages to simulate being remote.  I ran it in a loop with three
clients several tens of times without seeing a hang.

I'm compiling jewel now to see if it has a different behaviour to
master (with the optimistic notion that we might have already fixed
the underlying cause...)

John


Andras




On 03/14/20

Re: [ceph-users] CephFS fuse client users stuck

2017-03-31 Thread Andras Pataki

Hi John,

It took a while but I believe now I have a reproducible test case for 
the capabilities being lost issue in CephFS I wrote about a couple of 
weeks ago.  The quick summary of problem is that often processes hang 
using CephFS either for a while or sometimes indefinitely.  The fuse 
client believes it is waiting for some caps which it does not get from 
the MDS.


A quick summary of the setup: the whole system is on Jewel 10.2.6. We 
are using the fuse client on all nodes for the file system.


The full test program is uploaded together with the verbose logs of all 
clients and the MDS (at log level 20) to 
http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/
It essentially runs in a loop opening a file for read/write, reading 
from it and closing it.  The read/write open is key, if open the file 
read-only, the problem doesn't happen:


while (time(NULL) - t0 < INTERVAL) {

int fd = open(FILENAME, O_RDWR);
if (fd < 0) {
printf("Could not open file '%s' for read/write",
   FILENAME);
exit(1);
}

read(fd, buffer, 100);

close(fd);
count++;
}

With INTERVAL = 10 seconds, on one machine with a single process I get 
something like 30k opens per 10 seconds - an excellent result. When I 
run 3 processes on 3 different nodes, I get a few (perhaps 4 or 5) opens 
per 10 seconds per process.  The particular case I collected logs for 
looks like this:


   apataki@worker1108:~/prg/ceph/test$ date; ssh worker1109
   ~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1110
   ~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1108
   ~/prg/ceph/test/timed_openrw_read < /dev/null &
   Fri Mar 31 07:33:04 EDT 2017
   [1] 53833
   [2] 53834
   [3] 53835
   apataki@worker1108:~/prg/ceph/test$ *Number of repeats: 4**
   **Number of repeats: 3**
   **Number of repeats: 5*

   [1]   Donessh worker1109
   ~/prg/ceph/test/timed_openrw_read < /dev/null
   [2]-  Donessh worker1110
   ~/prg/ceph/test/timed_openrw_read < /dev/null
   [3]+  Donessh worker1108
   ~/prg/ceph/test/timed_openrw_read < /dev/null
   apataki@worker1108:~/prg/ceph/test$

Looking at the client, it looks like there are 5 second pauses waiting 
for capabilities.  It seems that it doesn't get a response, and perhaps 
at some frequency it polls back to find out what happened to its 
capabilities.  Here is an example of such 5 second gaps from the client 
logs:


   2017-03-31 07:33:05.010111 2aaab1a57700 10 client.3849178 *waiting
   for caps need Fr want Fc*
   2017-03-31 07:33:05.464058 2fd47700 20 client.3849178 trim_cache
   size 4 max 16384
   2017-03-31 07:33:06.464180 2fd47700 20 client.3849178 trim_cache
   size 4 max 16384
   2017-03-31 07:33:07.464307 2fd47700 20 client.3849178 trim_cache
   size 4 max 16384
   2017-03-31 07:33:08.464421 2fd47700 20 client.3849178 trim_cache
   size 4 max 16384
   2017-03-31 07:33:09.464554 2fd47700 20 client.3849178 trim_cache
   size 4 max 16384
   2017-03-31 07:33:10.464680 2fd47700 10 client.3849178 check_caps
   on 100024a7b1d.head(faked_ino=0 ref=3 ll_ref=79894
   cap_refs={1024=0,2048=0} open={3=1} mode=100664 size=23/4194304
   mtime=2017-03-30 11:56:38.588478 caps=pAsLsXsFr(0=pAsLsXsFr)
   objectset[100024a7b1d ts 0/0 objects 0 dirty_or_tx 0]
   parents=0x5f5d5340 0x5f73ea00) wanted pAsxXsxFsxcrwb used -
   issued pAsLsXsFr revoking Fsc is_delayed=1

The inode in question is 100024a7b1d which corresponds to the test file 
/mnt/ceph/users/apataki/test/TEST-READ.txt
When I grep for this inode in the MDS logs, it also has a gap in the 
7:33:05 to 7:33:10 time.  I've learned some about the caps by trying to 
reproduce the problem, but I'm afraid I don't understand enough of the 
MDS logic to see what the problem is there.


The full logs are too large for the mailing list, so I've put them here: 
http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/


Some help/advice with this would very much be appreciated.  Thanks in 
advance!


Andras



On 03/14/2017 12:55 PM, John Spray wrote:

On Tue, Mar 14, 2017 at 2:10 PM, Andras Pataki
<apat...@flatironinstitute.org> wrote:

Hi John,

I've checked the MDS session list, and the fuse client does appear on that
with 'state' as 'open'.  So both the fuse client and the MDS agree on an
open connection.

Attached is the log of the ceph fuse client at debug level 20.  The MDS got
restarted at 9:44:20, and it went through its startup, and was in an
'active' state in ceph -s by 9:45:20.  As for the IP addresses in the logs,
10.128.128.110 is the MDS IP, the 10.128.128.1xy addresses are OSDs,
10.128.129.63 is the IP of the client the log is from.

So it looks like the client is getting stuck waiting for some
capabilities (the 7fff9c3

[ceph-users] CephFS: ceph-fuse segfaults

2017-03-29 Thread Andras Pataki
Below is a crash we had on a few machines with the ceph-fuse client on 
the latest Jewel release 10.2.6.  A total of 5 ceph-fuse processes 
crashed more or less the same way at different times.  The full logs are 
at 
http://voms.simonsfoundation.org:50013/9SXnEpflYPmE6UhM9EgOR3us341eqym/ceph-20170328


Andras

*** Caught signal (Segmentation fault) **
 in thread 7fffa50cb700 thread_name:ceph-fuse
 ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
 1: (()+0x29f112) [0x557f3112]
 2: (()+0xf100) [0x76d9e100]
 3: (Inode::_put(int)+0x25) [0x5578e0b5]
 4: (Client::put_inode(Inode*, int)+0x103) [0x55718ac3]
 5: (Client::_ll_put(Inode*, int)+0x93) [0x5575a263]
 6: (Client::ll_forget(Inode*, int)+0x1e0) [0x5575bb00]
 7: (()+0x19b664) [0x556ef664]
 8: (()+0x16bdb) [0x77bb3bdb]
 9: (()+0x13471) [0x77bb0471]
 10: (()+0x7dc5) [0x76d96dc5]
 11: (clone()+0x6d) [0x75c7c28d]
2017-03-26 07:59:03.354209 7fffa50cb700 -1 *** Caught signal 
(Segmentation fault) **

 in thread 7fffa50cb700 thread_name:ceph-fuse

 ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
 1: (()+0x29f112) [0x557f3112]
 2: (()+0xf100) [0x76d9e100]
 3: (Inode::_put(int)+0x25) [0x5578e0b5]
 4: (Client::put_inode(Inode*, int)+0x103) [0x55718ac3]
 5: (Client::_ll_put(Inode*, int)+0x93) [0x5575a263]
 6: (Client::ll_forget(Inode*, int)+0x1e0) [0x5575bb00]
 7: (()+0x19b664) [0x556ef664]
 8: (()+0x16bdb) [0x77bb3bdb]
 9: (()+0x13471) [0x77bb0471]
 10: (()+0x7dc5) [0x76d96dc5]
 11: (clone()+0x6d) [0x75c7c28d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.


2017-03-26 07:59:03.354209 7fffa50cb700 -1 *** Caught signal 
(Segmentation fault) **

 in thread 7fffa50cb700 thread_name:ceph-fuse

 ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
 1: (()+0x29f112) [0x557f3112]
 2: (()+0xf100) [0x76d9e100]
 3: (Inode::_put(int)+0x25) [0x5578e0b5]
 4: (Client::put_inode(Inode*, int)+0x103) [0x55718ac3]
 5: (Client::_ll_put(Inode*, int)+0x93) [0x5575a263]
 6: (Client::ll_forget(Inode*, int)+0x1e0) [0x5575bb00]
 7: (()+0x19b664) [0x556ef664]
 8: (()+0x16bdb) [0x77bb3bdb]
 9: (()+0x13471) [0x77bb0471]
 10: (()+0x7dc5) [0x76d96dc5]
 11: (clone()+0x6d) [0x75c7c28d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS fuse client users stuck

2017-03-14 Thread Andras Pataki
Thanks for decoding the logs - now I see what to look for.  Can you 
point me to any documentation that explains a bit more of the logic 
(about capabilities, Fb/Fw, how the communication between the client and 
the MDS works, etc.)?


I've tried running the client and the MDS at log level 20, but I didn't 
get any freeze-ups before the MDS (almost) filled up a 100GB root disk.  
Also, the processes were running extremely slowly, presumably because the 
MDS was too busy logging so many gigabytes.  I'm going to try just 
running the clients at 20, and see what happens before the freeze-up to 
that thread and inode.
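
For the MDS side, in case I need it again, the debug levels can be changed 
at runtime without a restart (I believe this is the usual way on jewel, 
though the exact option names are worth double-checking):

   ceph tell mds.0 injectargs '--debug_mds 20 --debug_ms 1'
   # and back to the defaults once done:
   ceph tell mds.0 injectargs '--debug_mds 1/5 --debug_ms 0/5'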


The inode in question is a file that many processes write to 
periodically, opening it in append mode (it is a file with warning output 
from the algorithm).  This is an MPI job with a few hundred ranks, and they 
all append to the same warning file.  I suspect that they 
write concurrently and things go wrong there (just from what the 
output of the processes looks like).


FsUsage is a change that I made on our side to the ceph client to 
collect information on client usage.  It sits quite on the outside of 
the client: it collects some statistics at the fuse API level about 
calls and doesn't mess with the logic of what the calls do afterwards.  
I've tried running the vanilla fuse client from the CentOS repositories 
just to rule those changes out as the cause - but that also fails the 
same way.  There are no modifications to any other parts of ceph besides 
the fuse client data collection; they come from the official CentOS 
builds.


I'll try some more debugging and write back if/when I find something.  
But any info explaining how things are supposed to work would be 
appreciated.


Andras


On 03/14/2017 12:55 PM, John Spray wrote:

On Tue, Mar 14, 2017 at 2:10 PM, Andras Pataki
<apat...@flatironinstitute.org> wrote:

Hi John,

I've checked the MDS session list, and the fuse client does appear on that
with 'state' as 'open'.  So both the fuse client and the MDS agree on an
open connection.

Attached is the log of the ceph fuse client at debug level 20.  The MDS got
restarted at 9:44:20, and it went through its startup, and was in an
'active' state in ceph -s by 9:45:20.  As for the IP addresses in the logs,
10.128.128.110 is the MDS IP, the 10.128.128.1xy addresses are OSDs,
10.128.129.63 is the IP of the client the log is from.

So it looks like the client is getting stuck waiting for some
capabilities (the 7fff9c3f7700 thread in that log, which eventually
completes a ll_write on inode 100024ebea8 after the MDS restart).
Hard to say whether the MDS failed to send it the proper messages, or
if the client somehow missed it.

It would be useful to have equally verbose logs from the MDS side from
earlier on, at the point that the client started trying to do the
write.  I wonder if you could see if your MDS+client can handle both
being run at "debug mds = 20", "debug client = 20" respectively for a
while, then when a client gets stuck, do the MDS restart, and follow
back in the client log to work out which inode it was stuck on, then
find log areas on the MDS side relating to that inode number.

BTW I see this "FsUsage" stuff in your log, which I don't recognise
from mainline Ceph, are you running something modified?

John



As per another suggestion, I've also tried kick_stale_sessions on the fuse
client, which didn't help (I guess since it doesn't think the session is
stale).
Let me know if there is anything else I can do to help.

Andras



On 03/13/2017 06:08 PM, John Spray wrote:

On Mon, Mar 13, 2017 at 8:15 PM, Andras Pataki
<apat...@flatironinstitute.org> wrote:

Dear Cephers,

We're using the ceph file system with the fuse client, and lately some of
our processes are getting stuck seemingly waiting for fuse operations.  At
the same time, the cluster is healthy, no slow requests, all OSDs up and
running, and both the MDS and the fuse client think that there are no
pending operations.  The situation is semi-reproducible.  When I run
various cluster jobs, some get stuck after a few hours of correct
operation.
The cluster is on ceph 10.2.5 and 10.2.6, the fuse clients are 10.2.6,
but I
have tried 10.2.5 and 10.2.3, all of which have the same issue.  This is
on
CentOS (7.2 for the clients, 7.3 for the MDS/OSDs).

Here are some details:

The node with the stuck processes:

[root@worker1070 ~]# ps -auxwww | grep 30519
apataki   30519 39.8  0.9 8728064 5257588 ? Dl   12:11  60:50 ./Arepo
param.txt 2 6
[root@worker1070 ~]# cat /proc/30519/stack
[] fuse_file_aio_write+0xbb/0x340 [fuse]
[] do_sync_write+0x8d/0xd0
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x7f/0xe0
[] system_call_fastpath+0x16/0x1b
[] 0x

[root@worker1070 ~]# ps -auxwww | grep 30533
apataki   30533 39.8  0.9 8795316 5261308 ? Sl   12:11  60:55 ./Arepo
param.txt 2 6
[root@worker1070 ~]# cat /proc/30533/stack
[] wait_answer_interruptible+0x9

Re: [ceph-users] CephFS fuse client users stuck

2017-03-14 Thread Andras Pataki

Hi John,

I've checked the MDS session list, and the fuse client does appear on 
that with 'state' as 'open'.  So both the fuse client and the MDS agree 
on an open connection.


Attached is the log of the ceph fuse client at debug level 20.  The MDS 
got restarted at 9:44:20, and it went through its startup, and was in an 
'active' state in ceph -s by 9:45:20.  As for the IP addresses in the 
logs, 10.128.128.110 is the MDS IP, the 10.128.128.1xy addresses are 
OSDs, 10.128.129.63 is the IP of the client the log is from.


As per another suggestion, I've also tried kick_stale_sessions on the 
fuse client, which didn't help (I guess since it doesn't think the 
session is stale).

Let me know if there is anything else I can do to help.

Andras


On 03/13/2017 06:08 PM, John Spray wrote:

On Mon, Mar 13, 2017 at 8:15 PM, Andras Pataki
<apat...@flatironinstitute.org> wrote:

Dear Cephers,

We're using the ceph file system with the fuse client, and lately some of
our processes are getting stuck seemingly waiting for fuse operations.  At
the same time, the cluster is healthy, no slow requests, all OSDs up and
running, and both the MDS and the fuse client think that there are no
pending operations.  The situation is semi-reproducible.  When I run
various cluster jobs, some get stuck after a few hours of correct operation.
The cluster is on ceph 10.2.5 and 10.2.6, the fuse clients are 10.2.6, but I
have tried 10.2.5 and 10.2.3, all of which have the same issue.  This is on
CentOS (7.2 for the clients, 7.3 for the MDS/OSDs).

Here are some details:

The node with the stuck processes:

[root@worker1070 ~]# ps -auxwww | grep 30519
apataki   30519 39.8  0.9 8728064 5257588 ? Dl   12:11  60:50 ./Arepo
param.txt 2 6
[root@worker1070 ~]# cat /proc/30519/stack
[] fuse_file_aio_write+0xbb/0x340 [fuse]
[] do_sync_write+0x8d/0xd0
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x7f/0xe0
[] system_call_fastpath+0x16/0x1b
[] 0x

[root@worker1070 ~]# ps -auxwww | grep 30533
apataki   30533 39.8  0.9 8795316 5261308 ? Sl   12:11  60:55 ./Arepo
param.txt 2 6
[root@worker1070 ~]# cat /proc/30533/stack
[] wait_answer_interruptible+0x91/0xe0 [fuse]
[] __fuse_request_send+0x253/0x2c0 [fuse]
[] fuse_request_send+0x12/0x20 [fuse]
[] fuse_send_write+0xd6/0x110 [fuse]
[] fuse_perform_write+0x2ed/0x590 [fuse]
[] fuse_file_aio_write+0x2a1/0x340 [fuse]
[] do_sync_write+0x8d/0xd0
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x7f/0xe0
[] system_call_fastpath+0x16/0x1b
[] 0x

Presumably the second process is waiting on the first holding some lock ...

The fuse client on the node:

[root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok status
{
 "metadata": {
 "ceph_sha1": "656b5b63ed7c43bd014bcafd81b001959d5f089f",
 "ceph_version": "ceph version 10.2.6
(656b5b63ed7c43bd014bcafd81b001959d5f089f)",
 "entity_id": "admin",
 "hostname": "worker1070",
 "mount_point": "\/mnt\/ceph",
 "root": "\/"
 },
 "dentry_count": 40,
 "dentry_pinned_count": 23,
 "inode_count": 123,
 "mds_epoch": 19041,
 "osd_epoch": 462327,
 "osd_epoch_barrier": 462326
}

[root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
mds_sessions
{
 "id": 3616543,
 "sessions": [
 {
 "mds": 0,
 "addr": "10.128.128.110:6800\/909443124",
 "seq": 338,
 "cap_gen": 0,
 "cap_ttl": "2017-03-13 14:47:37.575229",
 "last_cap_renew_request": "2017-03-13 14:46:37.575229",
 "cap_renew_seq": 12694,
 "num_caps": 713,
 "state": "open"
 }
 ],
 "mdsmap_epoch": 19041
}

[root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
mds_requests
{}


The overall cluster health and the MDS:

[root@cephosd000 ~]# ceph -s
 cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
  health HEALTH_WARN
 noscrub,nodeep-scrub,require_jewel_osds flag(s) set
  monmap e17: 3 mons at
{hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
 election epoch 29148, quorum 0,1,2 hyperv029,hyperv030,hyperv031
   fsmap e19041: 1/1/1 up {0=cephosd000=up:active}
  osdmap e462328: 624 osds: 624 up, 624 in
 flags noscrub,nodeep-scrub,require_jewel_osds
   pgmap v44458747: 42496 pgs, 6 pools, 924 TB data, 272 Mobjects
 2154 TB used, 1791 TB / 3946 TB avail
42496 active+clean
   client io 86911 kB/s rd, 556 MB/s wr, 227 op/s rd, 303 op/s wr

[root@cephosd000 ~]# ceph daemon /var/run/ceph/ceph-

[ceph-users] CephFS fuse client users stuck

2017-03-13 Thread Andras Pataki

Dear Cephers,

We're using the ceph file system with the fuse client, and lately some 
of our processes are getting stuck seemingly waiting for fuse 
operations.  At the same time, the cluster is healthy, no slow requests, 
all OSDs up and running, and both the MDS and the fuse client think that 
there are no pending operations.  The situation is semi-reproducible.  
When I run various cluster jobs, some get stuck after a few hours of 
correct operation.  The cluster is on ceph 10.2.5 and 10.2.6, the fuse 
clients are 10.2.6, but I have tried 10.2.5 and 10.2.3, all of which 
have the same issue.  This is on CentOS (7.2 for the clients, 7.3 for 
the MDS/OSDs).


Here are some details:

The node with the stuck processes:

   [root@worker1070 ~]# ps -auxwww | grep 30519
   *apataki   30519 39.8  0.9 8728064 5257588 ? Dl 12:11  60:50
   ./Arepo param.txt 2 6*
   [root@worker1070 ~]# cat /proc/30519/stack
   *[] fuse_file_aio_write+0xbb/0x340 [fuse]*
   [] do_sync_write+0x8d/0xd0
   [] vfs_write+0xbd/0x1e0
   [] SyS_write+0x7f/0xe0
   [] system_call_fastpath+0x16/0x1b
   [] 0x

   [root@worker1070 ~]# ps -auxwww | grep 30533
   *apataki   30533 39.8  0.9 8795316 5261308 ? Sl   12:11 60:55
   ./Arepo param.txt 2 6*
   [root@worker1070 ~]# cat /proc/30533/stack
   *[] wait_answer_interruptible+0x91/0xe0 [fuse]**
   **[] __fuse_request_send+0x253/0x2c0 [fuse]**
   **[] fuse_request_send+0x12/0x20 [fuse]**
   **[] fuse_send_write+0xd6/0x110 [fuse]**
   **[] fuse_perform_write+0x2ed/0x590 [fuse]**
   **[] fuse_file_aio_write+0x2a1/0x340 [fuse]**
   **[] do_sync_write+0x8d/0xd0*
   [] vfs_write+0xbd/0x1e0
   [] SyS_write+0x7f/0xe0
   [] system_call_fastpath+0x16/0x1b
   [] 0x

Presumably the second process is waiting on the first holding some lock ...

The fuse client on the node:

   [root@worker1070 ~]# ceph daemon
   /var/run/ceph/ceph-client.admin.asok status
   {
"metadata": {
"ceph_sha1": "656b5b63ed7c43bd014bcafd81b001959d5f089f",
"ceph_version": "ceph version 10.2.6
   (656b5b63ed7c43bd014bcafd81b001959d5f089f)",
"entity_id": "admin",
"hostname": "worker1070",
"mount_point": "\/mnt\/ceph",
"root": "\/"
},
"dentry_count": 40,
"dentry_pinned_count": 23,
"inode_count": 123,
"mds_epoch": 19041,
"osd_epoch": 462327,
"osd_epoch_barrier": 462326
   }

   [root@worker1070 ~]# ceph daemon
   /var/run/ceph/ceph-client.admin.asok mds_sessions
   {
"id": 3616543,
"sessions": [
{
"mds": 0,
"addr": "10.128.128.110:6800\/909443124",
"seq": 338,
"cap_gen": 0,
"cap_ttl": "2017-03-13 14:47:37.575229",
"last_cap_renew_request": "2017-03-13 14:46:37.575229",
"cap_renew_seq": 12694,
"num_caps": 713,
"state": "open"
}
],
"mdsmap_epoch": 19041
   }

   [root@worker1070 ~]# ceph daemon
   /var/run/ceph/ceph-client.admin.asok mds_requests
   {}


The overall cluster health and the MDS:

   [root@cephosd000 ~]# ceph -s
cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
 health HEALTH_WARN
noscrub,nodeep-scrub,require_jewel_osds flag(s) set
 monmap e17: 3 mons at
   
{hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
election epoch 29148, quorum 0,1,2
   hyperv029,hyperv030,hyperv031
  fsmap e19041: 1/1/1 up {0=cephosd000=up:active}
 osdmap e462328: 624 osds: 624 up, 624 in
flags noscrub,nodeep-scrub,require_jewel_osds
  pgmap v44458747: 42496 pgs, 6 pools, 924 TB data, 272 Mobjects
2154 TB used, 1791 TB / 3946 TB avail
   42496 active+clean
  client io 86911 kB/s rd, 556 MB/s wr, 227 op/s rd, 303 op/s wr

   [root@cephosd000 ~]# ceph daemon
   /var/run/ceph/ceph-mds.cephosd000.asok ops
   {
"ops": [],
"num_ops": 0
   }


The odd thing is that if in this state I restart the MDS, the client 
process wakes up and proceeds with its work without any errors.  As if a 
request was lost and somehow retransmitted/restarted when the MDS got 
restarted and the fuse layer reconnected to it.


When I try to attach a gdb session to either of the client processes, 
gdb just hangs.  However, right after the MDS restart gdb attaches to 
the process successfully, and shows that the hang happened while 
closing a file.  In fact, it looks like both processes were trying to 
write to the same file opened with fopen("filename", "a") and close it:


   (gdb) where
   #0  0x2dc53abd in write () from /lib64/libc.so.6
   #1  0x2dbe2383 in _IO_new_file_write () from /lib64/libc.so.6
   #2  0x2dbe37ec in __GI__IO_do_write () from /lib64/libc.so.6
   #3  0x2dbe30e0 in 

Re: [ceph-users] Ceph pg active+clean+inconsistent

2017-01-09 Thread Andras Pataki
Yes, it doesn't cause issues, but I don't see any way to "repair" the 
problem.  One thing I might do eventually, if no solution is found, is 
to copy the CephFS files in question and remove the ones with 
inconsistencies (which should remove the underlying rados objects).  But 
it would perhaps be good to do some digging into how/why this problem came 
about before doing this.
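
What I have in mind is essentially just rewriting the affected file so it 
gets backed by fresh rados objects (the paths here are placeholders, and I'd 
keep a copy elsewhere first):

   cp -a /mnt/ceph/path/to/file /mnt/ceph/path/to/file.new
   rm /mnt/ceph/path/to/file
   mv /mnt/ceph/path/to/file.new /mnt/ceph/path/to/file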


andras


On 01/07/2017 06:48 PM, Shinobu Kinjo wrote:

Sorry for the late.

Are you still facing inconsistent pg status?

On Wed, Jan 4, 2017 at 11:39 PM, Andras Pataki
<apat...@simonsfoundation.org> wrote:

# ceph pg debug unfound_objects_exist
FALSE

Andras


On 01/03/2017 11:38 PM, Shinobu Kinjo wrote:

Would you run:

   # ceph pg debug unfound_objects_exist

On Wed, Jan 4, 2017 at 5:31 AM, Andras Pataki
<apat...@simonsfoundation.org> wrote:

Here is the output of ceph pg query for one of hte
active+clean+inconsistent
PGs:

{
  "state": "active+clean+inconsistent",
  "snap_trimq": "[]",
  "epoch": 342982,
  "up": [
  319,
  90,
  51
  ],
  "acting": [
  319,
  90,
  51
  ],
  "actingbackfill": [
  "51",
  "90",
  "319"
  ],
  "info": {
  "pgid": "6.92c",
  "last_update": "342982'41304",
  "last_complete": "342982'41304",
  "log_tail": "342980'38259",
  "last_user_version": 41304,
  "last_backfill": "MAX",
  "last_backfill_bitwise": 0,
  "purged_snaps": "[]",
  "history": {
  "epoch_created": 262553,
  "last_epoch_started": 342598,
  "last_epoch_clean": 342613,
  "last_epoch_split": 0,
  "last_epoch_marked_full": 0,
  "same_up_since": 342596,
  "same_interval_since": 342597,
  "same_primary_since": 342597,
  "last_scrub": "342982'41177",
  "last_scrub_stamp": "2017-01-02 18:19:48.081750",
  "last_deep_scrub": "342965'37465",
  "last_deep_scrub_stamp": "2016-12-20 16:31:06.438823",
  "last_clean_scrub_stamp": "2016-12-11 12:51:19.258816"
  },
  "stats": {
  "version": "342982'41304",
  "reported_seq": "43600",
  "reported_epoch": "342982",
  "state": "active+clean+inconsistent",
  "last_fresh": "2017-01-03 15:27:15.075176",
  "last_change": "2017-01-02 18:19:48.081806",
  "last_active": "2017-01-03 15:27:15.075176",
  "last_peered": "2017-01-03 15:27:15.075176",
  "last_clean": "2017-01-03 15:27:15.075176",
  "last_became_active": "2016-11-01 16:21:23.328639",
  "last_became_peered": "2016-11-01 16:21:23.328639",
  "last_unstale": "2017-01-03 15:27:15.075176",
  "last_undegraded": "2017-01-03 15:27:15.075176",
  "last_fullsized": "2017-01-03 15:27:15.075176",
  "mapping_epoch": 342596,
  "log_start": "342980'38259",
  "ondisk_log_start": "342980'38259",
  "created": 262553,
  "last_epoch_clean": 342613,
  "parent": "0.0",
  "parent_split_bits": 0,
  "last_scrub": "342982'41177",
  "last_scrub_stamp": "2017-01-02 18:19:48.081750",
  "last_deep_scrub": "342965'37465",
  "last_deep_scrub_stamp": "2016-12-20 16:31:06.438823",
  "last_clean_scrub_stamp": "2016-12-11 12:51:19.258816",
  "log_size": 3045,
  "ondisk_log_size": 3045,
  "stats_invalid": false,
  "dirty_stats_invalid": false,
  "omap_stats_invalid": false,
  "hitset_stats_invalid": false,
  "hitset_bytes_stats_invalid": false,
  "pin_stats_invalid": true,

Re: [ceph-users] Ceph pg active+clean+inconsistent

2017-01-04 Thread Andras Pataki

# ceph pg debug unfound_objects_exist
FALSE

Andras

On 01/03/2017 11:38 PM, Shinobu Kinjo wrote:

Would you run:

  # ceph pg debug unfound_objects_exist

On Wed, Jan 4, 2017 at 5:31 AM, Andras Pataki
<apat...@simonsfoundation.org> wrote:

Here is the output of ceph pg query for one of hte active+clean+inconsistent
PGs:

{
 "state": "active+clean+inconsistent",
 "snap_trimq": "[]",
 "epoch": 342982,
 "up": [
 319,
 90,
 51
 ],
 "acting": [
 319,
 90,
 51
 ],
 "actingbackfill": [
 "51",
 "90",
 "319"
 ],
 "info": {
 "pgid": "6.92c",
 "last_update": "342982'41304",
 "last_complete": "342982'41304",
 "log_tail": "342980'38259",
 "last_user_version": 41304,
 "last_backfill": "MAX",
 "last_backfill_bitwise": 0,
 "purged_snaps": "[]",
 "history": {
 "epoch_created": 262553,
 "last_epoch_started": 342598,
 "last_epoch_clean": 342613,
 "last_epoch_split": 0,
 "last_epoch_marked_full": 0,
 "same_up_since": 342596,
 "same_interval_since": 342597,
 "same_primary_since": 342597,
 "last_scrub": "342982'41177",
 "last_scrub_stamp": "2017-01-02 18:19:48.081750",
 "last_deep_scrub": "342965'37465",
 "last_deep_scrub_stamp": "2016-12-20 16:31:06.438823",
 "last_clean_scrub_stamp": "2016-12-11 12:51:19.258816"
 },
 "stats": {
 "version": "342982'41304",
 "reported_seq": "43600",
 "reported_epoch": "342982",
 "state": "active+clean+inconsistent",
 "last_fresh": "2017-01-03 15:27:15.075176",
 "last_change": "2017-01-02 18:19:48.081806",
 "last_active": "2017-01-03 15:27:15.075176",
 "last_peered": "2017-01-03 15:27:15.075176",
 "last_clean": "2017-01-03 15:27:15.075176",
 "last_became_active": "2016-11-01 16:21:23.328639",
 "last_became_peered": "2016-11-01 16:21:23.328639",
 "last_unstale": "2017-01-03 15:27:15.075176",
 "last_undegraded": "2017-01-03 15:27:15.075176",
 "last_fullsized": "2017-01-03 15:27:15.075176",
 "mapping_epoch": 342596,
 "log_start": "342980'38259",
 "ondisk_log_start": "342980'38259",
 "created": 262553,
 "last_epoch_clean": 342613,
 "parent": "0.0",
 "parent_split_bits": 0,
 "last_scrub": "342982'41177",
 "last_scrub_stamp": "2017-01-02 18:19:48.081750",
 "last_deep_scrub": "342965'37465",
 "last_deep_scrub_stamp": "2016-12-20 16:31:06.438823",
 "last_clean_scrub_stamp": "2016-12-11 12:51:19.258816",
 "log_size": 3045,
 "ondisk_log_size": 3045,
 "stats_invalid": false,
 "dirty_stats_invalid": false,
 "omap_stats_invalid": false,
 "hitset_stats_invalid": false,
 "hitset_bytes_stats_invalid": false,
 "pin_stats_invalid": true,
 "stat_sum": {
 "num_bytes": 16929346269,
 "num_objects": 4881,
 "num_object_clones": 0,
 "num_object_copies": 14643,
 "num_objects_missing_on_primary": 0,
 "num_objects_missing": 0,
 "num_objects_degraded": 0,
 "num_objects_misplaced": 0,
 "num_objects_unfound": 0,
 "num_objects_dirty": 4881,
 "num_whiteouts": 0,
 "num_read": 7592,
 "num_read_kb": 195939

Re: [ceph-users] Ceph pg active+clean+inconsistent

2017-01-03 Thread Andras Pataki
The attributes for one of the inconsistent objects for the following 
scrub error:


2016-12-20 11:58:25.825830 7f3e1a4b1700 -1 log_channel(cluster) log 
[ERR] : deep-scrub 6.92c 6:34932257:::1000187bbb5.0009:head on disk 
size (0) does not match object info size (3014656) adjusted for ondisk 
to (3014656)


# cd ~ceph/osd/ceph-319/current/6.92c_head/
# find . -name *1000187bbb* -ls
51023230850 -rw-r--r--   1 ceph ceph0 Dec 13 17:05 
./DIR_C/DIR_2/DIR_9/DIR_C/DIR_4/1000187bbb5.0009__head_EA44C92C__6
# attr -l 
./DIR_C/DIR_2/DIR_9/DIR_C/DIR_4/1000187bbb5.0009__head_EA44C92C__6
Attribute "cephos.spill_out" has a 2 byte value for 
./DIR_C/DIR_2/DIR_9/DIR_C/DIR_4/1000187bbb5.0009__head_EA44C92C__6
Attribute "ceph._" has a 250 byte value for 
./DIR_C/DIR_2/DIR_9/DIR_C/DIR_4/1000187bbb5.0009__head_EA44C92C__6
Attribute "ceph._@1" has a 5 byte value for 
./DIR_C/DIR_2/DIR_9/DIR_C/DIR_4/1000187bbb5.0009__head_EA44C92C__6
Attribute "ceph.snapset" has a 31 byte value for 
./DIR_C/DIR_2/DIR_9/DIR_C/DIR_4/1000187bbb5.0009__head_EA44C92C__6


The rados list-inconsistent-obj command doesn't return anything, 
interestingly enough ...


# rados list-inconsistent-obj 6.92c
{"epoch":342597,"inconsistents":[]}

# ceph pg dump | grep 6.92c
dumped all in format plain
6.92c   48810   0   0   0 16929346269 3045
3045active+clean+inconsistent 2017-01-02 18:19:48.081806  
342982'41304342982:43601 [319,90,51] 319 [319,90,51] 
319 342982'41177 2017-01-02 18:19:48.081750  342965'37465
2016-12-20 16:31:06.438823



Andras


On 12/23/2016 05:10 PM, Shinobu Kinjo wrote:

Plus do this as well:

  # rados list-inconsistent-obj ${PG ID}

On Fri, Dec 23, 2016 at 7:08 PM, Brad Hubbard <bhubb...@redhat.com> wrote:

Could you also try this?

$ attr -l ./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.0009__head_EED893F1__6

Take note of any of ceph._, ceph._@1, ceph._@2, etc.

For me on my test cluster it looks like this.

$ attr -l 
dev/osd1/current/0.3_head/benchmark\\udata\\urskikr.localdomain\\u16952\\uobject99__head_2969453B__0
Attribute "cephos.spill_out" has a 2 byte value for
dev/osd1/current/0.3_head/benchmark\udata\urskikr.localdomain\u16952\uobject99__head_2969453B__0
Attribute "ceph._" has a 250 byte value for
dev/osd1/current/0.3_head/benchmark\udata\urskikr.localdomain\u16952\uobject99__head_2969453B__0
Attribute "ceph.snapset" has a 31 byte value for
dev/osd1/current/0.3_head/benchmark\udata\urskikr.localdomain\u16952\uobject99__head_2969453B__0
Attribute "ceph._@1" has a 53 byte value for
dev/osd1/current/0.3_head/benchmark\udata\urskikr.localdomain\u16952\uobject99__head_2969453B__0
Attribute "selinux" has a 37 byte value for
dev/osd1/current/0.3_head/benchmark\udata\urskikr.localdomain\u16952\uobject99__head_2969453B__0

Then dump out ceph._ to a file and append all ceph._@X attributes like so.

$ attr -q -g ceph._
dev/osd1/current/0.3_head/benchmark\\udata\\urskikr.localdomain\\u16952\\uobject99__head_2969453B__0
> /tmp/attr1

$ attr -q -g ceph._@1
dev/osd1/current/0.3_head/benchmark\\udata\\urskikr.localdomain\\u16952\\uobject99__head_2969453B__0
>> /tmp/attr1

Note the ">>" on the second command to append the output, not
overwrite. Do this for each ceph._@X attribute.

Then display the file as an object_info_t structure and check the size value.

$ bin/ceph-dencoder type object_info_t import /tmp/attr1 decode dump_json
{
 "oid": {
 "oid": "benchmark_data_rskikr.localdomain_16952_object99",
 "key": "",
 "snapid": -2,
 "hash": 694764859,
 "max": 0,
 "pool": 0,
 "namespace": ""
 },
 "version": "9'19",
 "prior_version": "0'0",
 "last_reqid": "client.4110.0:100",
 "user_version": 19,
 "size": 4194304,
 "mtime": "2016-12-23 19:13:57.012681",
 "local_mtime": "2016-12-23 19:13:57.032306",
 "lost": 0,
 "flags": 52,
 "snaps": [],
 "truncate_seq": 0,
 "truncate_size": 0,
 "data_digest": 2293522445,
 "omap_digest": 4294967295,
 "expected_object_size": 4194304,
 "expected_write_size": 4194304,
 "alloc_hint_flags": 53,
 "watchers": {}
}

Depending on the output one method for fixing this may be to use a
binary editing technique such as laid out in
https://www.spinics.net/lists/ceph-devel/msg16519.html to set the size
value to zero. Your target value is 1c0000.

$ printf '%x\n' 1835008
1c0000

Make sure you check it is right before injecti
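The steps above can be strung together into a small helper (a sketch only: it
assumes the FileStore on-disk layout shown in this thread, a ceph-dencoder
matching the cluster's release, and it takes the object's path as its argument):

#!/bin/bash
# dump-object-info.sh <path to the object file under the PG's _head directory>
# Concatenate the ceph._ xattr plus any ceph._@N spill-over xattrs, then
# decode the result as an object_info_t and print it as JSON.
OBJ="$1"
OUT=$(mktemp)

attr -q -g ceph._ "$OBJ" > "$OUT"
# Append spill-over attributes in numeric order (ceph._@1, ceph._@2, ...).
for a in $(attr -l "$OBJ" | awk -F'"' '{print $2}' | grep '^ceph\._@' | sort -t@ -k2 -n); do
    attr -q -g "$a" "$OBJ" >> "$OUT"
done

ceph-dencoder type object_info_t import "$OUT" decode dump_json
rm -f "$OUT"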

Re: [ceph-users] Ceph pg active+clean+inconsistent

2017-01-03 Thread Andras Pataki
"last_became_peered": "2016-11-01 16:05:44.990630",
"last_unstale": "2016-11-01 16:21:20.584113",
"last_undegraded": "2016-11-01 16:21:20.584113",
"last_fullsized": "2016-11-01 16:21:20.584113",
"mapping_epoch": 342596,
"log_start": "341563'12014",
"ondisk_log_start": "341563'12014",
"created": 262553,
"last_epoch_clean": 342587,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "342266'14514",
"last_scrub_stamp": "2016-10-28 16:41:06.563820",
"last_deep_scrub": "342266'14514",
"last_deep_scrub_stamp": "2016-10-28 16:41:06.563820",
"last_clean_scrub_stamp": "2016-10-28 16:41:06.563820",
"log_size": 3019,
"ondisk_log_size": 3019,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": true,
"stat_sum": {
"num_bytes": 12528581359,
"num_objects": 3562,
"num_object_clones": 0,
"num_object_copies": 10686,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 3562,
"num_whiteouts": 0,
"num_read": 3678,
"num_read_kb": 10197642,
"num_write": 15656,
"num_write_kb": 19564203,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 5806,
"num_bytes_recovered": 22687335556,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0
},
"up": [
319,
90,
51
],
"acting": [
319,
90,
51
],
"blocked_by": [],
"up_primary": 319,
"acting_primary": 319
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 342598,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}
],
"recovery_state": [
{
"name": "Started\/Primary\/Active",
"enter_time": "2016-11-01 16:21:23.007072",
"might_have_unfound": [
{
"osd": "51",
"status": "already probed"
},
{
"osd": "90",
"status": "already probed"
}
],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started&quo

Re: [ceph-users] Ceph pg active+clean+inconsistent

2016-12-21 Thread Andras Pataki
Yes, size = 3, and I have checked that all three replicas are the same 
zero length object on the disk.  I think some metadata info is 
mismatching what the OSD log refers to as "object info size".  But I'm 
not sure what to do about it.  pg repair does not fix it.  In fact, the 
file this object corresponds to in CephFS is shorter so this chunk 
shouldn't even exist I think (details are in the original email).  
Although I may be understanding the situation wrong ...


Andras


On 12/21/2016 07:17 AM, Mehmet wrote:

Hi Andras,

I am not the most experienced user, but I guess you could have a look at this 
object on each related OSD for the PG, compare them, and delete the 
differing object. I assume you have size = 3.


Then again pg repair.

But be careful - IIRC the replica will be recovered from the primary PG.

Hth

Am 20. Dezember 2016 22:39:44 MEZ, schrieb Andras Pataki 
<apat...@simonsfoundation.org>:


Hi cephers,

Any ideas on how to proceed on the inconsistencies below?  At the
moment our ceph setup has 5 of these - in all cases it seems like
some zero length objects that match across the three replicas, but
do not match the object info size.  I tried running pg repair on
one of them, but it didn't repair the problem:

2016-12-20 16:24:40.870307 7f3e1a4b1700  0
log_channel(cluster) log [INF] : 6.92c repair starts
2016-12-20 16:27:06.183186 7f3e1a4b1700 -1
log_channel(cluster) log [ERR] : repair 6.92c
6:34932257:::1000187bbb5.0009:head on disk size (0) does
not match object info size (3014656) adjusted for ondisk to
(3014656)
2016-12-20 16:27:35.885496 7f3e17cac700 -1
log_channel(cluster) log [ERR] : 6.92c repair 1 errors, 0 fixed


Any help/hints would be appreciated.

Thanks,

Andras


On 12/15/2016 10:13 AM, Andras Pataki wrote:

Hi everyone,

Yesterday scrubbing turned up an inconsistency in one of our
placement groups.  We are running ceph 10.2.3, using CephFS and
RBD for some VM images.

[root@hyperv017 ~]# ceph -s
cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
 health HEALTH_ERR
1 pgs inconsistent
1 scrub errors
noout flag(s) set
 monmap e15: 3 mons at

{hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
election epoch 27192, quorum 0,1,2
hyperv029,hyperv030,hyperv031
  fsmap e17181: 1/1/1 up {0=hyperv029=up:active}, 2 up:standby
 osdmap e342930: 385 osds: 385 up, 385 in
flags noout
  pgmap v37580512: 34816 pgs, 5 pools, 673 TB data, 198 Mobjects
1583 TB used, 840 TB / 2423 TB avail
   34809 active+clean
   4 active+clean+scrubbing+deep
   2 active+clean+scrubbing
   1 active+clean+inconsistent
  client io 87543 kB/s rd, 671 MB/s wr, 23 op/s rd, 2846 op/s wr

# ceph pg dump | grep inconsistent
6.13f1  46920   0   0   0 16057314767 3087   
3087active+clean+inconsistent 2016-12-14 16:49:48.391572 
342929'41011342929:43966 [158,215,364]   158
[158,215,364]   158 342928'40540 2016-12-14

16:49:48.391511  342928'405402016-12-14 16:49:48.391511

I tried a couple of other deep scrubs on pg 6.13f1 but got
repeated errors.  In the OSD logs:

2016-12-14 16:48:07.733291 7f3b56e3a700 -1 log_channel(cluster)
log [ERR] : deep-scrub 6.13f1
6:8fc91b77:::1000187bb70.0009:head on disk size (0) does not
match object info size (1835008) adjusted for ondisk to (1835008)
I looked at the objects on the 3 OSD's on their respective hosts
and they are the same, zero length files:

# cd ~ceph/osd/ceph-158/current/6.13f1_head
# find . -name *1000187bb70* -ls
6697380 -rw-r--r--   1 ceph ceph0 Dec 13
17:00
./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.0009__head_EED893F1__6


# cd ~ceph/osd/ceph-215/current/6.13f1_head
# find . -name *1000187bb70* -ls
5398156470 -rw-r--r--   1 ceph ceph0 Dec 13
17:00
./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.0009__head_EED893F1__6


# cd ~ceph/osd/ceph-364/current/6.13f1_head
# find . -name *1000187bb70* -ls
18814322150 -rw-r--r--   1 ceph ceph0 Dec 13
17:00
./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.0009__head_EED893F1__6


At the time of the write, there wasn't anything unusual going on
as far as I can tell (no hardware/network issues, all processes
were up, etc).

This pool is a CephFS data pool, and the corresponding file
(inode hex 1000187bb70, decimal 1099537300336) looks like this:

# ls -li chr4.tags.tsv
1099537300336 -rw-r--r-- 1 xichen xichen 14469915 Dec 13 17:01
chr4.tags.tsv

Reading the file is al

Re: [ceph-users] Ceph pg active+clean+inconsistent

2016-12-20 Thread Andras Pataki

Hi cephers,

Any ideas on how to proceed on the inconsistencies below?  At the moment 
our ceph setup has 5 of these - in all cases it seems like some zero 
length objects that match across the three replicas, but do not match 
the object info size.  I tried running pg repair on one of them, but it 
didn't repair the problem:


   2016-12-20 16:24:40.870307 7f3e1a4b1700  0 log_channel(cluster) log
   [INF] : 6.92c repair starts
   2016-12-20 16:27:06.183186 7f3e1a4b1700 -1 log_channel(cluster) log
   [ERR] : repair 6.92c 6:34932257:::1000187bbb5.0009:head on disk
   size (0) does not match object info size (3014656) adjusted for
   ondisk to (3014656)
   2016-12-20 16:27:35.885496 7f3e17cac700 -1 log_channel(cluster) log
   [ERR] : 6.92c repair 1 errors, 0 fixed


Any help/hints would be appreciated.

Thanks,

Andras


On 12/15/2016 10:13 AM, Andras Pataki wrote:

Hi everyone,

Yesterday scrubbing turned up an inconsistency in one of our placement 
groups.  We are running ceph 10.2.3, using CephFS and RBD for some VM 
images.


[root@hyperv017 ~]# ceph -s
cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
 health HEALTH_ERR
1 pgs inconsistent
1 scrub errors
noout flag(s) set
 monmap e15: 3 mons at 
{hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
election epoch 27192, quorum 0,1,2 
hyperv029,hyperv030,hyperv031

  fsmap e17181: 1/1/1 up {0=hyperv029=up:active}, 2 up:standby
 osdmap e342930: 385 osds: 385 up, 385 in
flags noout
  pgmap v37580512: 34816 pgs, 5 pools, 673 TB data, 198 Mobjects
1583 TB used, 840 TB / 2423 TB avail
   34809 active+clean
   4 active+clean+scrubbing+deep
   2 active+clean+scrubbing
   1 active+clean+inconsistent
  client io 87543 kB/s rd, 671 MB/s wr, 23 op/s rd, 2846 op/s wr

# ceph pg dump | grep inconsistent
6.13f1  46920   0   0   0 16057314767 3087 3087
active+clean+inconsistent 2016-12-14 16:49:48.391572 342929'41011
342929:43966 [158,215,364]   158 [158,215,364]   158 342928'40540 
2016-12-14 16:49:48.391511  342928'405402016-12-14 
16:49:48.391511


I tried a couple of other deep scrubs on pg 6.13f1 but got repeated 
errors.  In the OSD logs:


2016-12-14 16:48:07.733291 7f3b56e3a700 -1 log_channel(cluster) log 
[ERR] : deep-scrub 6.13f1 6:8fc91b77:::1000187bb70.0009:head on 
disk size (0) does not match object info size (1835008) adjusted for 
ondisk to (1835008)
I looked at the objects on the 3 OSD's on their respective hosts and 
they are the same, zero length files:


# cd ~ceph/osd/ceph-158/current/6.13f1_head
# find . -name *1000187bb70* -ls
6697380 -rw-r--r--   1 ceph ceph0 Dec 13 17:00 
./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.0009__head_EED893F1__6


# cd ~ceph/osd/ceph-215/current/6.13f1_head
# find . -name *1000187bb70* -ls
5398156470 -rw-r--r--   1 ceph ceph0 Dec 13 17:00 
./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.0009__head_EED893F1__6


# cd ~ceph/osd/ceph-364/current/6.13f1_head
# find . -name *1000187bb70* -ls
18814322150 -rw-r--r--   1 ceph ceph0 Dec 13 17:00 
./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.0009__head_EED893F1__6


At the time of the write, there wasn't anything unusual going on as 
far as I can tell (no hardware/network issues, all processes were up, 
etc).


This pool is a CephFS data pool, and the corresponding file (inode hex 
1000187bb70, decimal 1099537300336) looks like this:


# ls -li chr4.tags.tsv
1099537300336 -rw-r--r-- 1 xichen xichen 14469915 Dec 13 17:01 
chr4.tags.tsv


Reading the file is also ok (no errors, right number of bytes):
# cat chr4.tags.tsv > /dev/null
# wc chr4.tags.tsv
  592251  2961255 14469915 chr4.tags.tsv

We are using the standard 4MB block size for CephFS, and if I 
interpret this right, this is the 9th chunk, so there shouldn't be any 
data (or even a 9th chunk), since the file is only 14MB. Should I run 
pg repair on this?  Any ideas on how this could come about? Any other 
recommendations?


Thanks,

Andras
apat...@apataki.net



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph pg active+clean+inconsistent

2016-12-15 Thread Andras Pataki

Hi everyone,

Yesterday scrubbing turned up an inconsistency in one of our placement 
groups.  We are running ceph 10.2.3, using CephFS and RBD for some VM 
images.


[root@hyperv017 ~]# ceph -s
cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
 health HEALTH_ERR
1 pgs inconsistent
1 scrub errors
noout flag(s) set
 monmap e15: 3 mons at 
{hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
election epoch 27192, quorum 0,1,2 
hyperv029,hyperv030,hyperv031

  fsmap e17181: 1/1/1 up {0=hyperv029=up:active}, 2 up:standby
 osdmap e342930: 385 osds: 385 up, 385 in
flags noout
  pgmap v37580512: 34816 pgs, 5 pools, 673 TB data, 198 Mobjects
1583 TB used, 840 TB / 2423 TB avail
   34809 active+clean
   4 active+clean+scrubbing+deep
   2 active+clean+scrubbing
   1 active+clean+inconsistent
  client io 87543 kB/s rd, 671 MB/s wr, 23 op/s rd, 2846 op/s wr

# ceph pg dump | grep inconsistent
6.13f1  46920   0   0   0 16057314767 3087
3087active+clean+inconsistent 2016-12-14 16:49:48.391572  
342929'41011342929:43966 [158,215,364]   158 [158,215,364]   
158 342928'40540 2016-12-14 16:49:48.391511  342928'40540
2016-12-14 16:49:48.391511


I tried a couple of other deep scrubs on pg 6.13f1 but got repeated 
errors.  In the OSD logs:


2016-12-14 16:48:07.733291 7f3b56e3a700 -1 log_channel(cluster) log 
[ERR] : deep-scrub 6.13f1 6:8fc91b77:::1000187bb70.0009:head on disk 
size (0) does not match object info size (1835008) adjusted for ondisk 
to (1835008)
I looked at the objects on the 3 OSD's on their respective hosts and 
they are the same, zero length files:


# cd ~ceph/osd/ceph-158/current/6.13f1_head
# find . -name *1000187bb70* -ls
6697380 -rw-r--r--   1 ceph ceph0 Dec 13 17:00 
./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.0009__head_EED893F1__6


# cd ~ceph/osd/ceph-215/current/6.13f1_head
# find . -name *1000187bb70* -ls
5398156470 -rw-r--r--   1 ceph ceph0 Dec 13 17:00 
./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.0009__head_EED893F1__6


# cd ~ceph/osd/ceph-364/current/6.13f1_head
# find . -name *1000187bb70* -ls
18814322150 -rw-r--r--   1 ceph ceph0 Dec 13 17:00 
./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.0009__head_EED893F1__6


At the time of the write, there wasn't anything unusual going on as far 
as I can tell (no hardware/network issues, all processes were up, etc).


This pool is a CephFS data pool, and the corresponding file (inode hex 
1000187bb70, decimal 1099537300336) looks like this:


# ls -li chr4.tags.tsv
1099537300336 -rw-r--r-- 1 xichen xichen 14469915 Dec 13 17:01 chr4.tags.tsv

Reading the file is also ok (no errors, right number of bytes):
# cat chr4.tags.tsv > /dev/null
# wc chr4.tags.tsv
  592251  2961255 14469915 chr4.tags.tsv

We are using the standard 4MB block size for CephFS, and if I interpret 
this right, this is the 9th chunk, so there shouldn't be any data (or 
even a 9th chunk), since the file is only 14MB.  Should I run pg repair 
on this?  Any ideas on how this could come about? Any other recommendations?
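As a quick sanity check of that arithmetic (a minimal sketch, assuming the
default 4 MiB = 4194304-byte CephFS object size):

SIZE=14469915                       # file size in bytes, from ls -li above
OBJ=4194304                         # 4 MiB object size
echo $(( (SIZE + OBJ - 1) / OBJ ))  # -> 4, i.e. objects .0000 through .0003
# Object .0009 would cover byte offsets 37748736..41943039, well past EOF.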


Thanks,

Andras
apat...@apataki.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck in active+clean+replay

2015-11-09 Thread Andras Pataki
Hi Greg,

I’ve tested the patch below on top of the 0.94.5 hammer sources, and it
works beautifully.  No more active+clean+replay stuck PGs.

Thanks!

Andras


On 10/27/15, 4:46 PM, "Andras Pataki" <apat...@simonsfoundation.org> wrote:

>Yes, this definitely sounds plausible (the peering/activating process does
>take a long time).  At the moment I’m trying to get our cluster back to a
>more working state.  Once everything works, I could try building a patched
>set of ceph processes from source (currently I’m using the pre-built
>centos RPMs) before a planned larger rebalance.
>
>Andras
>
>
>On 10/27/15, 2:36 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:
>
>>On Tue, Oct 27, 2015 at 11:22 AM, Andras Pataki
>><apat...@simonsfoundation.org> wrote:
>>> Hi Greg,
>>>
>>> No, unfortunately I haven’t found any resolution to it.  We are using
>>> cephfs, the whole installation is on 0.94.4.  What I did notice is that
>>> performance is extremely poor when backfilling is happening.  I wonder
>>>if
>>> timeouts of some kind could cause PG’s to get stuck in replay.  I
>>>lowered
>>> the ‘osd max backfills’ parameter today from the default 10 all the way
>>> down to 1 to see if it improves things.  Client read/write performance
>>>has
>>> definitely improved since then, whether this improves the
>>> ‘stuck-in-replay’ situation, I’m still waiting to see.
>>
>>Argh. Looks like known bug http://tracker.ceph.com/issues/13116. I've
>>pushed a new branch hammer-pg-replay to the gitbuilders which
>>backports that patch and ought to improve things if you're able to
>>install that to test. (It's untested but I don't foresee any issues
>>arising.) I've also added it to the backport queue.
>>-Greg
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck in active+clean+replay

2015-10-27 Thread Andras Pataki
Hi Greg,

No, unfortunately I haven’t found any resolution to it.  We are using
cephfs, the whole installation is on 0.94.4.  What I did notice is that
performance is extremely poor when backfilling is happening.  I wonder if
timeouts of some kind could cause PG’s to get stuck in replay.  I lowered
the ‘osd max backfills’ parameter today from the default 10 all the way
down to 1 to see if it improves things.  Client read/write performance has
definitely improved since then, whether this improves the
‘stuck-in-replay’ situation, I’m still waiting to see.
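That throttle can be applied at runtime without restarting the OSDs (a sketch;
the injectargs syntax of hammer-era releases is assumed):

ceph tell osd.\* injectargs '--osd-max-backfills 1'
# To keep it across restarts, also set "osd max backfills = 1" in the [osd]
# section of ceph.conf.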

Andras


On 10/27/15, 2:06 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:

>On Tue, Oct 27, 2015 at 11:03 AM, Gregory Farnum <gfar...@redhat.com>
>wrote:
>> On Thu, Oct 22, 2015 at 3:58 PM, Andras Pataki
>> <apat...@simonsfoundation.org> wrote:
>>> Hi ceph users,
>>>
>>> We’ve upgraded to 0.94.4 (all ceph daemons got restarted) – and are in
>>>the
>>> middle of doing some rebalancing due to crush changes (removing some
>>>disks).
>>> During the rebalance, I see that some placement groups get stuck in
>>> ‘active+clean+replay’ for a long time (essentially until I restart the
>>>OSD
>>> they are on).  All IO for these PGs gets queued, and clients hang.
>>>
>>> ceph health details the blocked ops in it:
>>>
>>> 4 ops are blocked > 2097.15 sec
>>> 1 ops are blocked > 131.072 sec
>>> 2 ops are blocked > 2097.15 sec on osd.41
>>> 2 ops are blocked > 2097.15 sec on osd.119
>>> 1 ops are blocked > 131.072 sec on osd.124
>>>
>>> ceph pg dump | grep replay
>>> dumped all in format plain
>>> 2.121b 3836 0 0 0 0 15705994377 3006 3006 active+clean+replay
>>>2015-10-22
>>> 14:12:01.104564 123840'2258640 125080:1252265 [41,111] 41 [41,111] 41
>>> 114515'2258631 2015-10-20 18:44:09.757620 114515'2258631 2015-10-20
>>> 18:44:09.757620
>>> 2.b4 3799 0 0 0 0 15604827445 3003 3003 active+clean+replay 2015-10-22
>>> 13:57:25.490150 119558'2322127 125084:1174759 [119,75] 119 [119,75] 119
>>> 114515'2322124 2015-10-20 11:00:51.448239 114515'2322124 2015-10-17
>>> 09:22:14.676006
>>>
>>> Both osd.41 and OSD.119 are doing this “replay”.
>>>
>>> The end of the log of osd.41:
>>>
>>> 2015-10-22 10:44:35.727000 7f037929b700  0 -- 10.4.36.105:6827/98624 >>
>>> 10.4.36.170:6913/121602 pipe(0x3b4d sd=125 :6827 s=2 pgs=618 cs=1
>>>l=0
>>> c=0x374398c0).fault with nothing to send, going to standby
>>> 2015-10-22 10:50:25.954404 7f038adae700  0 -- 10.4.36.105:6827/98624 >>
>>> 10.4.36.105:6809/141110 pipe(0x3adff000 sd=229 :6827 s=2 pgs=94 cs=3
>>>l=0
>>> c=0x3e9d0940).fault with nothing to send, going to standby
>>> 2015-10-22 12:11:28.029214 7f03a0e0d700  0 -- 10.4.36.105:6827/98624 >>
>>> 10.4.36.106:6864/102556 pipe(0x40afe000 sd=621 :6827 s=2 pgs=91 cs=3
>>>l=0
>>> c=0x3acf5860).fault with nothing to send, going to standby
>>> 2015-10-22 12:45:45.404765 7f038050d700  0 -- 10.4.36.105:6827/98624 >>
>>> 10.4.36.102:6837/77957 pipe(0x39cbe000 sd=578 :6827 s=0 pgs=0 cs=0 l=0
>>> c=0x37b3cec0).accept connect_seq 1 vs existing 1 state standby
>>> 2015-10-22 12:45:45.405232 7f038050d700  0 -- 10.4.36.105:6827/98624 >>
>>> 10.4.36.102:6837/77957 pipe(0x39cbe000 sd=578 :6827 s=0 pgs=0 cs=0 l=0
>>> c=0x37b3cec0).accept connect_seq 2 vs existing 1 state standby
>>> 2015-10-22 12:52:49.062752 7f036525c700  0 -- 10.4.36.105:6827/98624 >>
>>> 10.4.36.105:6809/141110 pipe(0x3f637000 sd=405 :6827 s=0 pgs=0 cs=0 l=0
>>> c=0x37b3ba20).accept connect_seq 3 vs existing 3 state standby
>>> 2015-10-22 12:52:49.063169 7f036525c700  0 -- 10.4.36.105:6827/98624 >>
>>> 10.4.36.105:6809/141110 pipe(0x3f637000 sd=405 :6827 s=0 pgs=0 cs=0 l=0
>>> c=0x37b3ba20).accept connect_seq 4 vs existing 3 state standby
>>> 2015-10-22 13:02:16.573546 7f038050d700  0 -- 10.4.36.105:6827/98624 >>
>>> 10.4.36.102:6837/77957 pipe(0x39cbe000 sd=578 :6827 s=2 pgs=110 cs=3
>>>l=0
>>> c=0x37b92000).fault with nothing to send, going to standby
>>> 2015-10-22 13:07:58.667432 7f036525c700  0 -- 10.4.36.105:6827/98624 >>
>>> 10.4.36.105:6809/141110 pipe(0x3f637000 sd=405 :6827 s=2 pgs=146 cs=5
>>>l=0
>>> c=0x3e9d0940).fault with nothing to send, going to standby
>>> 2015-10-22 13:25:35.020722 7f038191a700  0 -- 10.4.36.105:6827/98624 >>
>>> 10.4.36.111:6841/71447 pipe(0x3e78e000 sd=205 :6827 s=2 

Re: [ceph-users] PGs stuck in active+clean+replay

2015-10-27 Thread Andras Pataki
Yes, this definitely sounds plausible (the peering/activating process does
take a long time).  At the moment I’m trying to get our cluster back to a
more working state.  Once everything works, I could try building a patched
set of ceph processes from source (currently I’m using the pre-built
centos RPMs) before a planned larger rebalance.

Andras


On 10/27/15, 2:36 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:

>On Tue, Oct 27, 2015 at 11:22 AM, Andras Pataki
><apat...@simonsfoundation.org> wrote:
>> Hi Greg,
>>
>> No, unfortunately I haven’t found any resolution to it.  We are using
>> cephfs, the whole installation is on 0.94.4.  What I did notice is that
>> performance is extremely poor when backfilling is happening.  I wonder
>>if
>> timeouts of some kind could cause PG’s to get stuck in replay.  I
>>lowered
>> the ‘osd max backfills’ parameter today from the default 10 all the way
>> down to 1 to see if it improves things.  Client read/write performance
>>has
>> definitely improved since then, whether this improves the
>> ‘stuck-in-replay’ situation, I’m still waiting to see.
>
>Argh. Looks like known bug http://tracker.ceph.com/issues/13116. I've
>pushed a new branch hammer-pg-replay to the gitbuilders which
>backports that patch and ought to improve things if you're able to
>install that to test. (It's untested but I don't foresee any issues
>arising.) I've also added it to the backport queue.
>-Greg

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PGs stuck in active+clean+replay

2015-10-22 Thread Andras Pataki
Hi ceph users,

We’ve upgraded to 0.94.4 (all ceph daemons got restarted) – and are in the 
middle of doing some rebalancing due to crush changes (removing some disks).  
During the rebalance, I see that some placement groups get stuck in 
‘active+clean+replay’ for a long time (essentially until I restart the OSD they 
are on).  All IO for these PGs gets queued, and clients hang.

ceph health details the blocked ops in it:

4 ops are blocked > 2097.15 sec
1 ops are blocked > 131.072 sec
2 ops are blocked > 2097.15 sec on osd.41
2 ops are blocked > 2097.15 sec on osd.119
1 ops are blocked > 131.072 sec on osd.124

ceph pg dump | grep replay
dumped all in format plain
2.121b 3836 0 0 0 0 15705994377 3006 3006 active+clean+replay 2015-10-22 
14:12:01.104564 123840'2258640 125080:1252265 [41,111] 41 [41,111] 41 
114515'2258631 2015-10-20 18:44:09.757620 114515'2258631 2015-10-20 
18:44:09.757620
2.b4 3799 0 0 0 0 15604827445 3003 3003 active+clean+replay 2015-10-22 
13:57:25.490150 119558'2322127 125084:1174759 [119,75] 119 [119,75] 119 
114515'2322124 2015-10-20 11:00:51.448239 114515'2322124 2015-10-17 
09:22:14.676006

Both osd.41 and OSD.119 are doing this “replay”.

The end of the log of osd.41:

2015-10-22 10:44:35.727000 7f037929b700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.170:6913/121602 pipe(0x3b4d sd=125 :6827 s=2 pgs=618 cs=1 l=0 
c=0x374398c0).fault with nothing to send, going to standby
2015-10-22 10:50:25.954404 7f038adae700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.105:6809/141110 pipe(0x3adff000 sd=229 :6827 s=2 pgs=94 cs=3 l=0 
c=0x3e9d0940).fault with nothing to send, going to standby
2015-10-22 12:11:28.029214 7f03a0e0d700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.106:6864/102556 pipe(0x40afe000 sd=621 :6827 s=2 pgs=91 cs=3 l=0 
c=0x3acf5860).fault with nothing to send, going to standby
2015-10-22 12:45:45.404765 7f038050d700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.102:6837/77957 pipe(0x39cbe000 sd=578 :6827 s=0 pgs=0 cs=0 l=0 
c=0x37b3cec0).accept connect_seq 1 vs existing 1 state standby
2015-10-22 12:45:45.405232 7f038050d700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.102:6837/77957 pipe(0x39cbe000 sd=578 :6827 s=0 pgs=0 cs=0 l=0 
c=0x37b3cec0).accept connect_seq 2 vs existing 1 state standby
2015-10-22 12:52:49.062752 7f036525c700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.105:6809/141110 pipe(0x3f637000 sd=405 :6827 s=0 pgs=0 cs=0 l=0 
c=0x37b3ba20).accept connect_seq 3 vs existing 3 state standby
2015-10-22 12:52:49.063169 7f036525c700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.105:6809/141110 pipe(0x3f637000 sd=405 :6827 s=0 pgs=0 cs=0 l=0 
c=0x37b3ba20).accept connect_seq 4 vs existing 3 state standby
2015-10-22 13:02:16.573546 7f038050d700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.102:6837/77957 pipe(0x39cbe000 sd=578 :6827 s=2 pgs=110 cs=3 l=0 
c=0x37b92000).fault with nothing to send, going to standby
2015-10-22 13:07:58.667432 7f036525c700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.105:6809/141110 pipe(0x3f637000 sd=405 :6827 s=2 pgs=146 cs=5 l=0 
c=0x3e9d0940).fault with nothing to send, going to standby
2015-10-22 13:25:35.020722 7f038191a700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.111:6841/71447 pipe(0x3e78e000 sd=205 :6827 s=2 pgs=82 cs=3 l=0 
c=0x36bf5860).fault with nothing to send, going to standby
2015-10-22 13:45:48.610068 7f0361620700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.105:6841/99063 pipe(0x3e43b000 sd=539 :6827 s=0 pgs=0 cs=0 l=0 
c=0x373e11e0).accept we reset (peer sent cseq 1), sending RESETSESSION
2015-10-22 13:45:48.880698 7f0361620700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.105:6841/99063 pipe(0x3e43b000 sd=539 :6827 s=2 pgs=199 cs=1 l=0 
c=0x373e11e0).reader missed message?  skipped from seq 0 to 825623574
2015-10-22 14:11:32.967937 7f035d9e4700  0 -- 10.4.36.105:6827/98624 >> 
10.4.36.105:6802/98037 pipe(0x3ce82000 sd=63 :43711 s=2 pgs=144 cs=3 l=0 
c=0x3bf8c100).fault with nothing to send, going to standby
2015-10-22 14:12:35.338635 7f03afffb700  0 log_channel(cluster) log [WRN] : 2 
slow requests, 2 included below; oldest blocked for > 30.079053 secs
2015-10-22 14:12:35.338875 7f03afffb700  0 log_channel(cluster) log [WRN] : 
slow request 30.079053 seconds old, received at 2015-10-22 14:12:05.259156: 
osd_op(client.734338.0:50618164 10b8f73.03ef [read 0~65536] 2.338a921b 
ack+read+known_if_redirected e124995) currently waiting for replay end
2015-10-22 14:12:35.339050 7f03afffb700  0 log_channel(cluster) log [WRN] : 
slow request 30.063995 seconds old, received at 2015-10-22 14:12:05.274213: 
osd_op(client.734338.0:50618166 10b8f73.03ef [read 65536~131072] 
2.338a921b ack+read+known_if_redirected e124995) currently waiting for replay 
end
2015-10-22 14:13:11.817243 7f03afffb700  0 log_channel(cluster) log [WRN] : 2 
slow requests, 2 included below; oldest blocked for > 66.557970 secs
2015-10-22 14:13:11.817408 7f03afffb700  0 log_channel(cluster) log [WRN] : 
slow request 66.557970 seconds old, received at 2015-10-22 14:12:05.259156: 

Re: [ceph-users] CephFS file to rados object mapping

2015-09-29 Thread Andras Pataki
Thanks, that worked.  Is there a mapping in the other direction easily
available, i.e. to find where all the 4MB pieces of a file are?

On 9/28/15, 4:56 PM, "John Spray" <jsp...@redhat.com> wrote:

>On Mon, Sep 28, 2015 at 9:46 PM, Andras Pataki
><apat...@simonsfoundation.org> wrote:
>> Hi,
>>
>> Is there a way to find out which rados objects a file in cephfs is
>>mapped
>> to from the command line?  Or vice versa, which file a particular rados
>> object belongs to?
>
>The part of the object name before the period is the inode number (in
>hex).
>
>John
>
>> Our ceph cluster has some inconsistencies/corruptions and I am trying to
>> find out which files are impacted in cephfs.
>>
>> Thanks,
>>
>> Andras
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS file to rados object mapping

2015-09-29 Thread Andras Pataki
Thanks, that makes a lot of sense.
One more question about checksumming objects in rados.  Our cluster uses
two copies per object, and I have some where the checksums mismatch between
the two copies (that deep scrub warns about).  Does ceph store an
authoritative checksum of what the block should look like?  I.e., is there
a way to tell which version of the block is correct?  I seem to recall
some changelog entry that Hammer is adding checksum storage for blocks, or
am I wrong?

Andras


On 9/29/15, 9:58 AM, "Gregory Farnum" <gfar...@redhat.com> wrote:

>The formula for objects in a file is <inode number (hex)>.<sequence number>. So you'll have noticed they all look something like
>12345.0001, 12345.0002, 12345.0003, ...
>
>So if you've got a particular inode and file size, you can generate a
>list of all the possible objects in it. To find the object->OSD
>mapping you'd need to run crush, by making use of the crushtool or
>similar.
>-Greg
>
>On Tue, Sep 29, 2015 at 6:29 AM, Andras Pataki
><apat...@simonsfoundation.org> wrote:
>> Thanks, that worked.  Is there a mapping in the other direction easily
>> available, i.e. to find where all the 4MB pieces of a file are?
>>
>> On 9/28/15, 4:56 PM, "John Spray" <jsp...@redhat.com> wrote:
>>
>>>On Mon, Sep 28, 2015 at 9:46 PM, Andras Pataki
>>><apat...@simonsfoundation.org> wrote:
>>>> Hi,
>>>>
>>>> Is there a way to find out which rados objects a file in cephfs is
>>>>mapped
>>>> to from the command line?  Or vice versa, which file a particular
>>>>rados
>>>> object belongs to?
>>>
>>>The part of the object name before the period is the inode number (in
>>>hex).
>>>
>>>John
>>>
>>>> Our ceph cluster has some inconsistencies/corruptions and I am trying
>>>>to
>>>> find out which files are impacted in cephfs.
>>>>
>>>> Thanks,
>>>>
>>>> Andras
>>>>
>>>>
>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
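Putting the two answers in this thread together, a minimal sketch of both
directions (the data pool name, the mount point and the default 4 MiB object
size are assumptions, not taken from the thread):

#!/bin/bash
# cephfs-objects.sh <file on a CephFS mount>
# List the RADOS objects backing a file and where CRUSH places them.
FILE="$1"
POOL=cephfs_data                     # adjust to the actual data pool name

INO_HEX=$(printf '%x' "$(stat -c %i "$FILE")")
SIZE=$(stat -c %s "$FILE")
NOBJ=$(( (SIZE + 4194304 - 1) / 4194304 ))

for ((i = 0; i < NOBJ; i++)); do
    ceph osd map "$POOL" "$(printf '%s.%08x' "$INO_HEX" "$i")"   # PG and up/acting OSDs
done

# Reverse direction: the hex prefix of an object name is the inode number,
# e.g.  find /mnt/ceph -inum $(printf '%d' 0x10000022d93)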


[ceph-users] Uneven data distribution across OSDs

2015-09-21 Thread Andras Pataki
Hi ceph users,

I am using CephFS for file storage and I have noticed that the data gets 
distributed very unevenly across OSDs.

  *   I have about 90 OSDs across 8 hosts, and 4096 PGs for the cephfs_data 
pool with 2 replicas, which is in line with the total PG recommendation of 
"Total PGs = (OSDs * 100) / pool_size" from the docs.
  *   CephFS distributes the data pretty much evenly across the PGs as shown by 
'ceph pg dump'
  *   However - the number of PGs assigned to various OSDs (per weight 
unit/terabyte) varies quite a lot.  The fullest OSD has as many as 44 PGs per 
terabyte (weight unit), while the emptier ones have as few as 19 or 20.
  *   Even if I consider the total number of PGs for all pools per OSD, the 
number varies similarly wildly (as with the cephfs_data pool only).

As a result, when the whole CephFS file system is at 60% full, some of the OSDs 
already reach the 95% full condition, and no more data can be written to the 
system.
Is there any way to force a more even distribution of PGs to OSDs?  I am using 
the default crush map, with two levels (root/host).  Can any changes to the 
crush map help?  I would really like to be get higher disk utilization than 60% 
without 1 of 90 disks filling up so early.

Thanks,

Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
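The spread described above can be measured directly (a sketch; it assumes the
'ceph pg dump pgs_brief' column order of this release, with the up set in the
third field):

# Count how many PG copies land on each OSD.
ceph pg dump pgs_brief 2>/dev/null \
  | awk '$3 ~ /^\[/ {
            gsub(/[\[\]]/, "", $3)
            n = split($3, osds, ",")
            for (i = 1; i <= n; i++) count[osds[i]]++
          }
          END { for (o in count) print "osd." o, count[o] }' \
  | sort -n -k2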


Re: [ceph-users] Uneven data distribution across OSDs

2015-09-21 Thread Andras Pataki
0 1862G  1352G  509G 72.65 1.22
33 1.81999  1.0 1862G  1065G  796G 57.22 0.96
34 1.81999  1.0 1862G  1128G  733G 60.61 1.02
35 1.81999  1.0 1862G  1269G  592G 68.18 1.14
36 1.81999  1.0 1862G  1398G  464G 75.08 1.26
37 1.81999  1.0 1862G  1172G  689G 62.98 1.06
38 1.81999  1.0 1862G  1176G  685G 63.16 1.06
39 1.81999  1.0 1862G  1220G  641G 65.55 1.10
84 0.81000  1.0  931G   816G  114G 87.73 1.47
85 0.90999  1.0  931G   769G  161G 82.67 1.39
86 0.90999  1.0  931G   704G  226G 75.63 1.27
87 0.90999  1.0  931G   638G  292G 68.55 1.15
88 0.90999  1.0  931G   523G  407G 56.28 0.94
89 0.90999  1.0  931G   502G  428G 53.96 0.90
90 0.90999  1.0  931G   729G  201G 78.33 1.31
91 0.90999  1.0  931G   548G  383G 58.86 0.99
 9 0.90999  1.0  931G   818G  112G 87.94 1.47
17 0.90999  1.0  931G   479G  451G 51.50 0.86
18 0.90999  1.0  931G   547G  383G 58.78 0.99
19 0.90999  1.0  931G   637G  293G 68.46 1.15
20 0.90999  1.0  931G   322G  608G 34.69 0.58
21 0.90999  1.0  931G   523G  407G 56.20 0.94
22 0.90999  1.0  931G   615G  315G 66.12 1.11
23 0.90999  1.0  931G   480G  450G 51.56 0.86
49 1.81999  1.0 1862G  1467G  394G 78.83 1.32
50 1.81999  1.0 1862G  1198G  663G 64.38 1.08
51 1.81999  1.0 1862G  1087G  774G 58.41 0.98
52 1.81999  1.0 1862G  1174G  687G 63.09 1.06
53 1.81999  1.0 1862G  1246G  615G 66.96 1.12
54 1.81999  1.0 1862G   771G 1090G 41.43 0.69
55 1.81999  1.0 1862G   885G  976G 47.58 0.80
56 1.81999  1.0 1862G  1489G  373G 79.96 1.34
79 5.45999  1.0 5588G  3441G 2146G 61.59 1.03
80 5.45999  1.0 5588G  3427G 2160G 61.33 1.03
81 5.45999  1.0 5588G  3607G 1980G 64.55 1.08
82 5.45999  1.0 5588G  3311G 2276G 59.26 0.99
83 5.45999  1.0 5588G  3295G 2292G 58.98 0.99
40 5.45999  1.0 5587G  3548G 2038G 63.51 1.06
41 5.45999  1.0 5587G  3471G 2115G 62.13 1.04
42 5.45999  1.0 5587G  3540G 2046G 63.37 1.06
43 5.45999  1.0 5587G  3356G 2230G 60.07 1.01
44 5.45999  1.0 5587G  3113G 2473G 55.72 0.93
45 5.45999  1.0 5587G  3426G 2160G 61.33 1.03
46 5.45999  1.0 5587G  3136G 2451G 56.13 0.94
47 5.45999  1.0 5587G  3222G 2364G 57.67 0.97
74 5.45999  1.0 5588G  3536G 2051G 63.28 1.06
75 5.45999  1.0 5588G  3672G 1915G 65.72 1.10
76 5.45999  1.0 5588G  3784G 1803G 67.73 1.14
77 5.45999  1.0 5588G  3652G 1935G 65.36 1.10
78 5.45999  1.0 5588G  3291G 2297G 58.89 0.99
 1 5.45999  1.0 5587G  3200G 2386G 57.28 0.96
 2 5.45999  1.0 5587G  2680G 2906G 47.98 0.80
 3 5.45999  1.0 5587G  3382G 2204G 60.54 1.01
 4 5.45999  1.0 5587G  3095G 2491G 55.41 0.93
 5 5.45999  1.0 5587G  3851G 1735G 68.94 1.16
 6 5.45999  1.0 5587G  3312G 2274G 59.29 0.99
 7 5.45999  1.0 5587G  2884G 2702G 51.63 0.87
 8 5.45999  1.0 5587G  3407G 2179G 60.98 1.02
67 5.45999  1.0 5587G  3452G 2134G 61.80 1.04
68 5.45999  1.0 5587G  2780G 2806G 49.76 0.83
69 5.45999  1.0 5587G  3337G 2249G 59.74 1.00
70 5.45999  1.0 5587G  3578G 2008G 64.06 1.07
71 5.45999  1.0 5587G  3358G 2228G 60.12 1.01
72 5.45999  1.0 5587G  3021G 2565G 54.08 0.91
73 5.45999  1.0 5587G  3160G 2426G 56.57 0.95
24 5.45999  1.0 5587G  3085G 2501G 55.22 0.93
25 5.45999  1.0 5587G  3495G 2091G 62.56 1.05
26 5.45999  1.0 5587G  3141G 2445G 56.22 0.94
27 5.45999  1.0 5587G  3897G 1689G 69.76 1.17
28 5.45999  1.0 5587G  3243G 2343G 58.05 0.97
29 5.45999  1.0 5587G  2907G 2679G 52.05 0.87
30 5.45999  1.0 5587G  3788G 1798G 67.81 1.14
31 5.45999  1.0 5587G  3289G 2297G 58.88 0.99
57 4.45999  1.0 4563G  2824G 1738G 61.90 1.04
58 1.81999  1.0 1862G  1267G  594G 68.09 1.14
59 1.81999  1.0 1862G  1064G  798G 57.14 0.96
60 1.81999  1.0 1862G  1468G  393G 78.86 1.32
61 1.81999  1.0 1862G  1219G  642G 65.50 1.10
62 1.81999  1.0 1862G  1175G  686G 63.13 1.06
63 1.81999  1.0 1862G  1290G  571G 69.32 1.16
64 1.81999  1.0 1862G  1358G  503G 72.96 1.22
65 1.81999  1.0 1862G  1401G  460G 75.28 1.26
66 1.81999  1.0 1862G  1309G  552G 70.31 1.18

Thanks for the help :)


Andras





On 9/21/15, 2:55 PM, "Michael Hackett" <mhack...@redhat.com> wrote:

>Hello Andras,
>
>Some initial observations and questions:
>
>The total PG recommendation for this cluster would actually be 8192 PGs
>per the formula. 
>
>Total PG's = (90 * 100) / 2 = 4500
>
>Next power of 2 = 8192.
>
>The result should be rounded up to the nearest power of two. Rounding up
>is optional, but recommended for CRUSH to evenly balance the number of
>objects among placement groups.
>
>How many data pools are being used for storing objects?
>
>'ceph osd dump |grep pool'
>
>Also how are these 90 OSD's laid out across the 8 hosts and is there any
>discrepancy between disk sizes and weight?
>
>'ceph osd tree'
>
>Also what are you using for CRUSH tunables a
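The rounding step quoted above is easy to reproduce (a minimal sketch):

pgs=$(( 90 * 100 / 2 ))                    # 4500
p=1; while (( p < pgs )); do p=$(( p * 2 )); done
echo "$pgs rounded up to $p"               # prints: 4500 rounded up to 8192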

Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix

2015-09-08 Thread Andras Pataki
Hi Sam,

I saw that ceph 0.94.3 is out and it contains a resolution to the issue below 
(http://tracker.ceph.com/issues/12577).  I installed it on our cluster, but 
unfortunately it didn't resolve the issue.  Same as before, I have a couple of 
inconsistent pg's, and run ceph pg repair on them - the OSD says:

2015-09-08 11:21:53.930324 7f49c17ea700  0 log_channel(cluster) log [INF] : 
2.439 repair starts
2015-09-08 11:27:57.708394 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
repair 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest 
0xb3d78a6e != 0xa3944ad0
2015-09-08 11:28:32.359938 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
2.439 repair 1 errors, 0 fixed
2015-09-08 11:28:32.364506 7f49c17ea700  0 log_channel(cluster) log [INF] : 
2.439 deep-scrub starts
2015-09-08 11:29:18.650876 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest 
0xb3d78a6e != 0xa3944ad0
2015-09-08 11:29:23.136109 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
2.439 deep-scrub 1 errors

$ ceph tell osd.* version | grep version | sort | uniq -c
 94 "version": "ceph version 0.94.3 
(95cefea9fd9ab740263bf8bb4796fd864d9afe2b)"

Could you have another look?

Thanks,

Andras


________
From: Andras Pataki
Sent: Monday, August 3, 2015 4:09 PM
To: Samuel Just
Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix

Done: http://tracker.ceph.com/issues/12577
BTW, I’m using the latest release 0.94.2 on all machines.

Andras


On 8/3/15, 3:38 PM, "Samuel Just" <sj...@redhat.com> wrote:

>Hrm, that's certainly supposed to work.  Can you make a bug?  Be sure
>to note what version you are running (output of ceph-osd -v).
>-Sam
>
>On Mon, Aug 3, 2015 at 12:34 PM, Andras Pataki
><apat...@simonsfoundation.org> wrote:
>> Summary: I am having problems with inconsistent PG's that the 'ceph pg
>> repair' command does not fix.  Below are the details.  Any help would be
>> appreciated.
>>
>> # Find the inconsistent PG's
>> ~# ceph pg dump | grep inconsistent
>> dumped all in format plain
>> 2.439 42080 00 017279507143 31033103 active+clean+inconsistent2015-08-03
>> 14:49:17.29288477323'2250145 77480:890566 [78,54]78 [78,54]78
>> 77323'22501452015-08-03 14:49:17.29253877323'2250145 2015-08-03
>> 14:49:17.292538
>> 2.8b9 40830 00 016669590823 30513051 active+clean+inconsistent2015-08-03
>> 14:46:05.14006377323'2249886 77473:897325 [7,72]7 [7,72]7
>> 77323'22498862015-08-03 14:22:47.83406377323'2249886 2015-08-03
>> 14:22:47.834063
>>
>> # Look at the first one:
>> ~# ceph pg deep-scrub 2.439
>> instructing pg 2.439 on osd.78 to deep-scrub
>>
>> # The logs of osd.78 show:
>> 2015-08-03 15:16:34.409738 7f09ec04a700  0 log_channel(cluster) log
>>[INF] :
>> 2.439 deep-scrub starts
>> 2015-08-03 15:16:51.364229 7f09ec04a700 -1 log_channel(cluster) log
>>[ERR] :
>> deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data
>>digest
>> 0xb3d78a6e != 0xa3944ad0
>> 2015-08-03 15:16:52.763977 7f09ec04a700 -1 log_channel(cluster) log
>>[ERR] :
>> 2.439 deep-scrub 1 errors
>>
>> # Finding the object in question:
>> ~# find ~ceph/osd/ceph-78/current/2.439_head -name
>>1022d93.0f0c* -ls
>> 21510412310 4100 -rw-r--r--   1 root root  4194304 Jun 30 17:09
>>
>>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
>>0022d93.0f0c__head_B029E439__2
>> ~# md5sum
>>
>>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
>>0022d93.0f0c__head_B029E439__2
>> 4e4523244deec051cfe53dd48489a5db
>>
>>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
>>0022d93.0f0c__head_B029E439__2
>>
>> # The object on the backup osd:
>> ~# find ~ceph/osd/ceph-54/current/2.439_head -name
>>1022d93.0f0c* -ls
>> 6442614367 4100 -rw-r--r--   1 root root  4194304 Jun 30 17:09
>>
>>/var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
>>0022d93.0f0c__head_B029E439__2
>> ~# md5sum
>>
>>/var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
>>0022d93.0f0c__head_B029E439__2
>> 4e4523244deec051cfe53dd48489a5db
>>
>>/var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
>>0022d93.0f0c__head_B029E439__2
>>
>> # They don't seem to be different.
>> # When I try repair:
>> ~# ceph pg repair 2.439
>> instructing pg 2.439 on osd.78 to repair
>

Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix

2015-09-08 Thread Andras Pataki
Cool, thanks!

Andras


From: Sage Weil <sw...@redhat.com>
Sent: Tuesday, September 8, 2015 2:07 PM
To: Andras Pataki
Cc: Samuel Just; ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix

On Tue, 8 Sep 2015, Andras Pataki wrote:
> Hi Sam,
>
> I saw that ceph 0.94.3 is out and it contains a resolution to the issue below 
> (http://tracker.ceph.com/issues/12577).  I installed it on our cluster, but 
> unfortunately it didn't resolve the issue.  Same as before, I have a couple 
> of inconsistent pg's, and run ceph pg repair on them - the OSD says:
>
> 2015-09-08 11:21:53.930324 7f49c17ea700  0 log_channel(cluster) log [INF] : 
> 2.439 repair starts
> 2015-09-08 11:27:57.708394 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> repair 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest 
> 0xb3d78a6e != 0xa3944ad0
> 2015-09-08 11:28:32.359938 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> 2.439 repair 1 errors, 0 fixed
> 2015-09-08 11:28:32.364506 7f49c17ea700  0 log_channel(cluster) log [INF] : 
> 2.439 deep-scrub starts
> 2015-09-08 11:29:18.650876 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest 
> 0xb3d78a6e != 0xa3944ad0
> 2015-09-08 11:29:23.136109 7f49c17ea700 -1 log_channel(cluster) log [ERR] : 
> 2.439 deep-scrub 1 errors
>
> $ ceph tell osd.* version | grep version | sort | uniq -c
>  94 "version": "ceph version 0.94.3 
> (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)"
>
> Could you have another look?

The fix was merged into master in
6a949e10198a1787f2008b6c537b7060d191d236, after v0.94.3 was released.  It
will be in v0.94.4.

Note that we had a bunch of similar errors on our internal lab cluster and
this resolved them.  We installed the test build from gitbuilder,
available at
http://gitbuilder.ceph.com/ceph-rpm-centos7-x86_64-basic/ref/hammer/ (or
similar, adjust URL for your distro).

sage


>
> Thanks,
>
> Andras
>
>
> 
> From: Andras Pataki
> Sent: Monday, August 3, 2015 4:09 PM
> To: Samuel Just
> Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
> Subject: Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix
>
> Done: http://tracker.ceph.com/issues/12577
> BTW, I’m using the latest release 0.94.2 on all machines.
>
> Andras
>
>
> On 8/3/15, 3:38 PM, "Samuel Just" <sj...@redhat.com> wrote:
>
> >Hrm, that's certainly supposed to work.  Can you make a bug?  Be sure
> >to note what version you are running (output of ceph-osd -v).
> >-Sam
> >
> >On Mon, Aug 3, 2015 at 12:34 PM, Andras Pataki
> ><apat...@simonsfoundation.org> wrote:
> >> Summary: I am having problems with inconsistent PG's that the 'ceph pg
> >> repair' command does not fix.  Below are the details.  Any help would be
> >> appreciated.
> >>
> >> # Find the inconsistent PG's
> >> ~# ceph pg dump | grep inconsistent
> >> dumped all in format plain
> >> 2.439 42080 00 017279507143 31033103 active+clean+inconsistent2015-08-03
> >> 14:49:17.29288477323'2250145 77480:890566 [78,54]78 [78,54]78
> >> 77323'22501452015-08-03 14:49:17.29253877323'2250145 2015-08-03
> >> 14:49:17.292538
> >> 2.8b9 40830 00 016669590823 30513051 active+clean+inconsistent2015-08-03
> >> 14:46:05.14006377323'2249886 77473:897325 [7,72]7 [7,72]7
> >> 77323'22498862015-08-03 14:22:47.83406377323'2249886 2015-08-03
> >> 14:22:47.834063
> >>
> >> # Look at the first one:
> >> ~# ceph pg deep-scrub 2.439
> >> instructing pg 2.439 on osd.78 to deep-scrub
> >>
> >> # The logs of osd.78 show:
> >> 2015-08-03 15:16:34.409738 7f09ec04a700  0 log_channel(cluster) log
> >>[INF] :
> >> 2.439 deep-scrub starts
> >> 2015-08-03 15:16:51.364229 7f09ec04a700 -1 log_channel(cluster) log
> >>[ERR] :
> >> deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data
> >>digest
> >> 0xb3d78a6e != 0xa3944ad0
> >> 2015-08-03 15:16:52.763977 7f09ec04a700 -1 log_channel(cluster) log
> >>[ERR] :
> >> 2.439 deep-scrub 1 errors
> >>
> >> # Finding the object in question:
> >> ~# find ~ceph/osd/ceph-78/current/2.439_head -name
> >>1022d93.0f0c* -ls
> >> 21510412310 4100 -rw-r--r--   1 root root  4194304 Jun 30 17:09
> >>
> >>/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
> >>0022d93.0

[ceph-users] Inconsistent PGs that ceph pg repair does not fix

2015-08-03 Thread Andras Pataki
Summary: I am having problems with inconsistent PG's that the 'ceph pg repair' 
command does not fix.  Below are the details.  Any help would be appreciated.

# Find the inconsistent PG's
~# ceph pg dump | grep inconsistent
dumped all in format plain
2.439 42080 00 017279507143 31033103 active+clean+inconsistent2015-08-03 
14:49:17.29288477323'2250145 77480:890566 [78,54]78 [78,54]78 
77323'22501452015-08-03 14:49:17.29253877323'2250145 2015-08-03 14:49:17.292538
2.8b9 40830 00 016669590823 30513051 active+clean+inconsistent2015-08-03 
14:46:05.14006377323'2249886 77473:897325 [7,72]7 [7,72]7 
77323'22498862015-08-03 14:22:47.83406377323'2249886 2015-08-03 14:22:47.834063

# Look at the first one:
~# ceph pg deep-scrub 2.439
instructing pg 2.439 on osd.78 to deep-scrub

# The logs of osd.78 show:
2015-08-03 15:16:34.409738 7f09ec04a700  0 log_channel(cluster) log [INF] : 
2.439 deep-scrub starts
2015-08-03 15:16:51.364229 7f09ec04a700 -1 log_channel(cluster) log [ERR] : 
deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest 
0xb3d78a6e != 0xa3944ad0
2015-08-03 15:16:52.763977 7f09ec04a700 -1 log_channel(cluster) log [ERR] : 
2.439 deep-scrub 1 errors

# Finding the object in question:
~# find ~ceph/osd/ceph-78/current/2.439_head -name 1022d93.0f0c* -ls
21510412310 4100 -rw-r--r--   1 root root  4194304 Jun 30 17:09 
/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1022d93.0f0c__head_B029E439__2
~# md5sum 
/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1022d93.0f0c__head_B029E439__2
4e4523244deec051cfe53dd48489a5db  
/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1022d93.0f0c__head_B029E439__2

# The object on the backup osd:
~# find ~ceph/osd/ceph-54/current/2.439_head -name 1022d93.0f0c* -ls
6442614367 4100 -rw-r--r--   1 root root  4194304 Jun 30 17:09 
/var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1022d93.0f0c__head_B029E439__2
~# md5sum 
/var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1022d93.0f0c__head_B029E439__2
4e4523244deec051cfe53dd48489a5db  
/var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1022d93.0f0c__head_B029E439__2

# They don't seem to be different.
# When I try repair:
~# ceph pg repair 2.439
instructing pg 2.439 on osd.78 to repair

# The osd.78 logs show:
2015-08-03 15:19:21.775933 7f09ec04a700  0 log_channel(cluster) log [INF] : 
2.439 repair starts
2015-08-03 15:19:38.088673 7f09ec04a700 -1 log_channel(cluster) log [ERR] : 
repair 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest 
0xb3d78a6e != 0xa3944ad0
2015-08-03 15:19:39.958019 7f09ec04a700 -1 log_channel(cluster) log [ERR] : 
2.439 repair 1 errors, 0 fixed
2015-08-03 15:19:39.962406 7f09ec04a700  0 log_channel(cluster) log [INF] : 
2.439 deep-scrub starts
2015-08-03 15:19:56.510874 7f09ec04a700 -1 log_channel(cluster) log [ERR] : 
deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest 
0xb3d78a6e != 0xa3944ad0
2015-08-03 15:19:58.348083 7f09ec04a700 -1 log_channel(cluster) log [ERR] : 
2.439 deep-scrub 1 errors

The inconsistency is not fixed.  Any hints of what should be done next?
I have tried  a few things:
 * Stop the primary osd, remove the object from the filesystem, restart the OSD 
and issue a repair.  It didn't work - it says that one object is missing, but 
did not copy it from the backup.
 * I tried the same on the backup (remove the file) - it also didn't get copied 
back from the primary in a repair.

Any help would be appreciated.

Thanks,

Andras
apat...@simonsfoundation.org

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
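For comparing the replicas of a suspect object across its acting set, a small
loop can save some typing (a sketch: the osd-to-host pairs and SSH access are
assumptions to fill in, and the paths follow the FileStore layout shown above):

PG=2.439
OBJ=10000022d93.00000f0c
ceph pg map "$PG"                    # shows the up/acting OSD ids, e.g. [78,54]

# osd:host pairs below are placeholders - take them from 'ceph osd tree'.
for pair in 78:hostA 54:hostB; do
    osd=${pair%%:*}; host=${pair##*:}
    ssh "$host" "find /var/lib/ceph/osd/ceph-$osd/current/${PG}_head -name '${OBJ}*' -exec md5sum {} +"
done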


Re: [ceph-users] Inconsistent PGs that ceph pg repair does not fix

2015-08-03 Thread Andras Pataki
Done: http://tracker.ceph.com/issues/12577
BTW, I’m using the latest release 0.94.2 on all machines.

Andras


On 8/3/15, 3:38 PM, Samuel Just sj...@redhat.com wrote:

Hrm, that's certainly supposed to work.  Can you make a bug?  Be sure
to note what version you are running (output of ceph-osd -v).
-Sam

On Mon, Aug 3, 2015 at 12:34 PM, Andras Pataki
apat...@simonsfoundation.org wrote:
 Summary: I am having problems with inconsistent PG's that the 'ceph pg
 repair' command does not fix.  Below are the details.  Any help would be
 appreciated.

 # Find the inconsistent PG's
 ~# ceph pg dump | grep inconsistent
 dumped all in format plain
 2.439 42080 00 017279507143 31033103 active+clean+inconsistent2015-08-03
 14:49:17.29288477323'2250145 77480:890566 [78,54]78 [78,54]78
 77323'22501452015-08-03 14:49:17.29253877323'2250145 2015-08-03
 14:49:17.292538
 2.8b9 40830 00 016669590823 30513051 active+clean+inconsistent2015-08-03
 14:46:05.14006377323'2249886 77473:897325 [7,72]7 [7,72]7
 77323'22498862015-08-03 14:22:47.83406377323'2249886 2015-08-03
 14:22:47.834063

 # Look at the first one:
 ~# ceph pg deep-scrub 2.439
 instructing pg 2.439 on osd.78 to deep-scrub

 # The logs of osd.78 show:
 2015-08-03 15:16:34.409738 7f09ec04a700  0 log_channel(cluster) log
[INF] :
 2.439 deep-scrub starts
 2015-08-03 15:16:51.364229 7f09ec04a700 -1 log_channel(cluster) log
[ERR] :
 deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data
digest
 0xb3d78a6e != 0xa3944ad0
 2015-08-03 15:16:52.763977 7f09ec04a700 -1 log_channel(cluster) log
[ERR] :
 2.439 deep-scrub 1 errors

 # Finding the object in question:
 ~# find ~ceph/osd/ceph-78/current/2.439_head -name
1022d93.0f0c* -ls
 21510412310 4100 -rw-r--r--   1 root root  4194304 Jun 30 17:09
 
/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
0022d93.0f0c__head_B029E439__2
 ~# md5sum
 
/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
0022d93.0f0c__head_B029E439__2
 4e4523244deec051cfe53dd48489a5db
 
/var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
0022d93.0f0c__head_B029E439__2

 # The object on the backup osd:
 ~# find ~ceph/osd/ceph-54/current/2.439_head -name
1022d93.0f0c* -ls
 6442614367 4100 -rw-r--r--   1 root root  4194304 Jun 30 17:09
 
/var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
0022d93.0f0c__head_B029E439__2
 ~# md5sum
 
/var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
0022d93.0f0c__head_B029E439__2
 4e4523244deec051cfe53dd48489a5db
 
/var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/1000
0022d93.0f0c__head_B029E439__2

 # They don't seem to be different.
 # When I try repair:
 ~# ceph pg repair 2.439
 instructing pg 2.439 on osd.78 to repair

 # The osd.78 logs show:
 2015-08-03 15:19:21.775933 7f09ec04a700  0 log_channel(cluster) log
[INF] :
 2.439 repair starts
 2015-08-03 15:19:38.088673 7f09ec04a700 -1 log_channel(cluster) log
[ERR] :
 repair 2.439 b029e439/1022d93.0f0c/head//2 on disk data digest
 0xb3d78a6e != 0xa3944ad0
 2015-08-03 15:19:39.958019 7f09ec04a700 -1 log_channel(cluster) log
[ERR] :
 2.439 repair 1 errors, 0 fixed
 2015-08-03 15:19:39.962406 7f09ec04a700  0 log_channel(cluster) log
[INF] :
 2.439 deep-scrub starts
 2015-08-03 15:19:56.510874 7f09ec04a700 -1 log_channel(cluster) log
[ERR] :
 deep-scrub 2.439 b029e439/1022d93.0f0c/head//2 on disk data
digest
 0xb3d78a6e != 0xa3944ad0
 2015-08-03 15:19:58.348083 7f09ec04a700 -1 log_channel(cluster) log
[ERR] :
 2.439 deep-scrub 1 errors

 The inconsistency is not fixed.  Any hints of what should be done next?
 I have tried  a few things:
  * Stop the primary osd, remove the object from the filesystem, restart
the
 OSD and issue a repair.  It didn't work - it says that one object is
 missing, but did not copy it from the backup.
  * I tried the same on the backup (remove the file) - it also didn't get
 copied back from the primary in a repair.

 Any help would be appreciated.

 Thanks,

 Andras
 apat...@simonsfoundation.org


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com