We're using CephFS with Luminous 12.2.5 and the fuse client (on CentOS 7.4, kernel 3.10.0-693.5.2.el7.x86_64).  Performance has generally been very good, but we're currently running into some strange performance issues with one of our applications.  The client in this case is on a higher-latency link - it is about 2.5ms away from all the ceph server nodes (the ceph server nodes are all near each other on 10/40Gbps local ethernet; only the client is "away").

The application is reading contiguous data in 64k chunks; the strace (with the -tt -T flags) looks something like:

   06:37:04.152667 read(3, ".:.:.\t./.:.:.:.:.\t./.:.:.:.:.\t./"...,
   65536) = 65536 <0.024052>
   06:37:04.178432 read(3, ",1523\t./.:.:.:.:.\t0/0:34,0:34:99"...,
   65536) = 65536 <0.023990>
   06:37:04.204087 read(3, ":20:21:0,21,738\t0/0:8,0:8:0:0,0,"...,
   65536) = 65536 <0.024053>
   06:37:04.229919 read(3, "665\t0/0:35,0:35:99:0,102,1530\t./"...,
   65536) = 65536 <0.024574>
   06:37:04.255623 read(3, ":37:99:0,99,1485\t0/0:34,0:34:99:"...,
   65536) = 65536 <0.023795>
   06:37:04.280914 read(3, ":.\t./.:.:.:.:.:.:.\t./.:.:.:.:.:."...,
   65536) = 65536 <0.023614>
   06:37:04.306022 read(3, "0,0,0\t./.:0,0:0:.:0,0,0\t./.:0,0:"...,
   65536) = 65536 <0.024037>


So each 64k read takes about 23-24ms.  The client has the file open for read, the machine is not busy (load of 0.2), and neither are the ceph nodes.  The fuse client also seems pretty idle.
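
As a sanity check independent of the application, a minimal C sketch along these lines times the same sequential 64k read pattern (the file path is an argument; the read count of 16 is arbitrary):

   /* Minimal sketch (not the application itself): time sequential 64k
    * reads of a file.  Build with: cc -o readtime readtime.c */
   #include <stdio.h>
   #include <fcntl.h>
   #include <unistd.h>
   #include <time.h>

   int main(int argc, char **argv)
   {
       static char buf[65536];       /* same 64k chunk size as the app */
       struct timespec t0, t1;
       ssize_t n;
       int fd, i;

       if (argc != 2) {
           fprintf(stderr, "usage: %s <file>\n", argv[0]);
           return 1;
       }
       fd = open(argv[1], O_RDONLY);
       if (fd < 0) {
           perror("open");
           return 1;
       }
       for (i = 0; i < 16; i++) {    /* time a handful of reads */
           clock_gettime(CLOCK_MONOTONIC, &t0);
           n = read(fd, buf, sizeof(buf));
           clock_gettime(CLOCK_MONOTONIC, &t1);
           if (n <= 0)
               break;
           printf("read %zd bytes in %.3f ms\n", n,
                  (t1.tv_sec - t0.tv_sec) * 1e3 +
                  (t1.tv_nsec - t0.tv_nsec) / 1e6);
       }
       close(fd);
       return 0;
   }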

After increasing the log level to 20 for 'client' and 'objectcacher' on ceph-fuse, it looks like ceph-fuse receives ll_read requests 4k in size and issues async reads to the OSDs in 4k chunks (if I'm interpreting the logs right).  Here is a trace of one ll_read:

   2018-06-06 08:14:17.609495 7fffe7a35700  3 client.16794661 ll_read
   0x5556dadfc1a0 0x1000d092e5f 238173646848~4096
   2018-06-06 08:14:17.609506 7fffe7a35700 10 client.16794661 get_caps
   0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
   cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x55555f187680 0x55555f138680) have pAsLsXsFsxcrwb need Fr
   want Fc revoking -
   2018-06-06 08:14:17.609517 7fffe7a35700 10 client.16794661
   _read_async 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
   cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x55555f187680 0x55555f138680) 238173646848~4096
   2018-06-06 08:14:17.609523 7fffe7a35700 10 client.16794661
   min_bytes=4194304 max_bytes=268435456 max_periods=64
   2018-06-06 08:14:17.609528 7fffe7a35700 10 objectcacher readx
   extent(1000d092e5f.0000ddd1 (56785) in @6 94208~4096 -> [0,4096])
   2018-06-06 08:14:17.609532 7fffe7a35700 10
   objectcacher.object(1000d092e5f.0000ddd1/head) map_read
   1000d092e5f.0000ddd1 94208~4096
   2018-06-06 08:14:17.609535 7fffe7a35700 20
   objectcacher.object(1000d092e5f.0000ddd1/head) map_read miss 4096
   left, bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing]
   waiters = {}
   2018-06-06 08:14:17.609537 7fffe7a35700  7 objectcacher bh_read on
   bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0 missing]
   waiters = {} outstanding reads 0
   2018-06-06 08:14:17.609576 7fffe7a35700 10 objectcacher readx
   missed, waiting on bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0)
   v 0 rx] waiters = {} off 94208
   2018-06-06 08:14:17.609579 7fffe7a35700 20 objectcacher readx defer
   0x55570211ec00
   2018-06-06 08:14:17.609580 7fffe7a35700  5 client.16794661
   get_cap_ref got first FILE_CACHE ref on
   0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
   cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x55555f187680 0x55555f138680)
   2018-06-06 08:14:17.609587 7fffe7a35700 15 inode.get on
   0x55555f138680 0x1000d092e5f.head now 4
   2018-06-06 08:14:17.612318 7fffefa45700  7 objectcacher
   bh_read_finish 1000d092e5f.0000ddd1/head tid 29067611 94208~4096 (bl
   is 4096) returned 0 outstanding reads 1
   2018-06-06 08:14:17.612338 7fffefa45700 20 objectcacher checking bh
   bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (0) v 0 rx] waiters = {
   94208->[0x5557007383a0, ]}
   2018-06-06 08:14:17.612341 7fffefa45700 10 objectcacher
   bh_read_finish read bh[ 0x55555fecdd40 94208~4096 0x5556226235c0
   (4096) v 0 clean firstbyte=46] waiters = {}
   2018-06-06 08:14:17.612344 7fffefa45700 10
   objectcacher.object(1000d092e5f.0000ddd1/head) try_merge_bh bh[
   0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean
   firstbyte=46] waiters = {}
   2018-06-06 08:14:17.612346 7fffefa45700 20 objectcacher finishing
   waiters 0x5557007383a0
   2018-06-06 08:14:17.612347 7fffefa45700 10 objectcacher readx
   extent(1000d092e5f.0000ddd1 (56785) in @6 94208~4096 -> [0,4096])
   2018-06-06 08:14:17.612349 7fffefa45700 10
   objectcacher.object(1000d092e5f.0000ddd1/head) map_read
   1000d092e5f.0000ddd1 94208~4096
   2018-06-06 08:14:17.612350 7fffefa45700 20
   objectcacher.object(1000d092e5f.0000ddd1/head) map_read hit bh[
   0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean
   firstbyte=46] waiters = {}
   2018-06-06 08:14:17.612352 7fffefa45700 10 objectcacher readx hit bh
   bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0 clean
   firstbyte=46] waiters = {}
   2018-06-06 08:14:17.612354 7fffefa45700 10 objectcacher readx rmap
   opos 94208: bh[ 0x55555fecdd40 94208~4096 0x5556226235c0 (4096) v 0
   clean firstbyte=46] waiters = {} +0 frag 0~4096 +0~4096
   2018-06-06 08:14:17.612357 7fffefa45700 10 objectcacher readx has
   all buffers
   2018-06-06 08:14:17.612357 7fffefa45700 10 objectcacher readx adding
   buffer len 4096 at 0
   2018-06-06 08:14:17.612358 7fffefa45700 10 objectcacher readx result
   is 4096
   2018-06-06 08:14:17.612359 7fffefa45700 20 objectcacher readx done
   0x55570211ec00 4096
   2018-06-06 08:14:17.612360 7fffefa45700 10 objectcacher trim start:
   bytes: max 2147483640  clean 2145931264, objects: max 8192 current 8192
   2018-06-06 08:14:17.612361 7fffefa45700 10 objectcacher trim
   finish:  max 2147483640  clean 2145931264, objects: max 8192 current
   8192
   2018-06-06 08:14:17.612410 7fffe7a35700  5 client.16794661
   put_cap_ref dropped last FILE_CACHE ref on
   0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31
   cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x55555f187680 0x55555f138680)
   2018-06-06 08:14:17.612438 7fffe7a35700 10 client.16794661 put_inode
   on 0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31
   cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x55555f187680 0x55555f138680)
   2018-06-06 08:14:17.612446 7fffe7a35700 15 inode.put on
   0x55555f138680 0x1000d092e5f.head now 3


The 2-3ms round trip to the OSD is clearly visible between 17.609s and 17.612s.  This is also confirmed by running 'objecter_requests' through the admin socket of the client (commands shown after the trace below).  Then the next ll_read starts immediately:

   2018-06-06 08:14:17.612503 7fffe6a33700  3 client.16794661 ll_read
   0x5556dadfc1a0 0x1000d092e5f 238173655040~4096
   2018-06-06 08:14:17.612516 7fffe6a33700 10 client.16794661 get_caps
   0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
   cap_refs={4=0,1024=0,2048=0,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x55555f187680 0x55555f138680) have pAsLsXsFsxcrwb need Fr
   want Fc revoking -
   2018-06-06 08:14:17.612526 7fffe6a33700 10 client.16794661
   _read_async 0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
   cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x55555f187680 0x55555f138680) 238173655040~4096
   2018-06-06 08:14:17.612533 7fffe6a33700 10 client.16794661
   min_bytes=4194304 max_bytes=268435456 max_periods=64
   2018-06-06 08:14:17.612539 7fffe6a33700 10 objectcacher readx
   extent(1000d092e5f.0000ddd1 (56785) in @6 102400~4096 -> [0,4096])
   2018-06-06 08:14:17.612542 7fffe6a33700 10
   objectcacher.object(1000d092e5f.0000ddd1/head) map_read
   1000d092e5f.0000ddd1 102400~4096
   2018-06-06 08:14:17.612545 7fffe6a33700 20
   objectcacher.object(1000d092e5f.0000ddd1/head) map_read miss 4096
   left, bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0 missing]
   waiters = {}
   2018-06-06 08:14:17.612548 7fffe6a33700  7 objectcacher bh_read on
   bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0 missing]
   waiters = {} outstanding reads 0
   2018-06-06 08:14:17.612587 7fffe6a33700 10 objectcacher readx
   missed, waiting on bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0)
   v 0 rx] waiters = {} off 102400
   2018-06-06 08:14:17.612591 7fffe6a33700 20 objectcacher readx defer
   0x55568850eee0
   2018-06-06 08:14:17.612591 7fffe6a33700  5 client.16794661
   get_cap_ref got first FILE_CACHE ref on
   0x1000d092e5f.head(faked_ino=0 ref=3 ll_ref=31
   cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x55555f187680 0x55555f138680)
   2018-06-06 08:14:17.612598 7fffe6a33700 15 inode.get on
   0x55555f138680 0x1000d092e5f.head now 4
   2018-06-06 08:14:17.615300 7fffefa45700  7 objectcacher
   bh_read_finish 1000d092e5f.0000ddd1/head tid 29067612 102400~4096
   (bl is 4096) returned 0 outstanding reads 1
   2018-06-06 08:14:17.615321 7fffefa45700 20 objectcacher checking bh
   bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (0) v 0 rx] waiters =
   { 102400->[0x555724927d10, ]}
   2018-06-06 08:14:17.615324 7fffefa45700 10 objectcacher
   bh_read_finish read bh[ 0x555653fd3b00 102400~4096 0x5556226235c0
   (4096) v 0 clean firstbyte=58] waiters = {}
   2018-06-06 08:14:17.615327 7fffefa45700 10
   objectcacher.object(1000d092e5f.0000ddd1/head) try_merge_bh bh[
   0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean
   firstbyte=58] waiters = {}
   2018-06-06 08:14:17.615329 7fffefa45700 20 objectcacher finishing
   waiters 0x555724927d10
   2018-06-06 08:14:17.615330 7fffefa45700 10 objectcacher readx
   extent(1000d092e5f.0000ddd1 (56785) in @6 102400~4096 -> [0,4096])
   2018-06-06 08:14:17.615332 7fffefa45700 10
   objectcacher.object(1000d092e5f.0000ddd1/head) map_read
   1000d092e5f.0000ddd1 102400~4096
   2018-06-06 08:14:17.615349 7fffefa45700 20
   objectcacher.object(1000d092e5f.0000ddd1/head) map_read hit bh[
   0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean
   firstbyte=58] waiters = {}
   2018-06-06 08:14:17.615351 7fffefa45700 10 objectcacher readx hit bh
   bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v 0 clean
   firstbyte=58] waiters = {}
   2018-06-06 08:14:17.615353 7fffefa45700 10 objectcacher readx rmap
   opos 102400: bh[ 0x555653fd3b00 102400~4096 0x5556226235c0 (4096) v
   0 clean firstbyte=58] waiters = {} +0 frag 0~4096 +0~4096
   2018-06-06 08:14:17.615356 7fffefa45700 10 objectcacher readx has
   all buffers
   2018-06-06 08:14:17.615357 7fffefa45700 10 objectcacher readx adding
   buffer len 4096 at 0
   2018-06-06 08:14:17.615358 7fffefa45700 10 objectcacher readx result
   is 4096
   2018-06-06 08:14:17.615358 7fffefa45700 20 objectcacher readx done
   0x55568850eee0 4096
   2018-06-06 08:14:17.615359 7fffefa45700 10 objectcacher trim start:
   bytes: max 2147483640  clean 2145935360, objects: max 8192 current 8192
   2018-06-06 08:14:17.615361 7fffefa45700 10 objectcacher trim
   finish:  max 2147483640  clean 2145935360, objects: max 8192 current
   8192
   2018-06-06 08:14:17.615432 7fffe6a33700  5 client.16794661
   put_cap_ref dropped last FILE_CACHE ref on
   0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31
   cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x55555f187680 0x55555f138680)
   2018-06-06 08:14:17.615462 7fffe6a33700 10 client.16794661 put_inode
   on 0x1000d092e5f.head(faked_ino=0 ref=4 ll_ref=31
   cap_refs={4=0,1024=0,2048=1,4096=0,8192=0} open={1=1,2=0}
   mode=100664 size=244712765330/249011634176 mtime=2018-06-05
   00:33:31.332901 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb)
   objectset[0x1000d092e5f ts 0/0 objects 6769 dirty_or_tx 0]
   parents=0x55555f187680 0x55555f138680)
   2018-06-06 08:14:17.615470 7fffe6a33700 15 inode.put on
   0x55555f138680 0x1000d092e5f.head now 3
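
For completeness: the log levels above and the in-flight OSD requests were checked through the ceph-fuse admin socket, with commands along these lines (the .asok path here is just an example - use whatever path your client's admin socket is at):

   ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok config set debug_client 20
   ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok config set debug_objectcacher 20
   ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok objecter_requests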


We're running with increased client limits in ceph.conf:

   client_cache_size = 262144
   client_oc_max_dirty = 536870912
   client_oc_max_objects = 8192
   client_oc_size = 2147483640
   client_readahead_max_periods = 64
   client_readahead_min = 4194304
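
(The min_bytes=4194304 and max_periods=64 in the traces above suggest these did take effect; they can also be verified at runtime through the same admin socket, e.g.:)

   ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok config get client_readahead_min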


Some things I'm confused about:

 * This node has exclusive access to the file; no other node is
   reading/writing/accessing it.  This seems to be confirmed by the
   caps: the client has Fr/Fc and almost everything else.  I also
   don't see any trips to the MDS for caps, so no problems there.
   Yet, for some reason, ceph-fuse goes to the OSDs to read 4k chunks
   without any read-ahead or larger reads of any kind, so it suffers
   the 2-3ms latency for every 4k read.  Why would this be?  I.e. why
   is read-ahead not working here?
 * Why does ceph-fuse get such tiny read requests from the kernel when
   the client application is reading in 64k chunks?  On various other
   clients (on other nodes) I generally see the kernel issuing 128k
   read requests to fuse.
 * Something else that is strange: successive ll_read offsets are 8k
   apart even though the reads are 4k in size (238173646848~4096
   followed by 238173655040~4096, for example).  I don't understand
   how the 4k reads end up on an 8k stride (see the back-of-the-
   envelope check below).
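
As a back-of-the-envelope check on that last point (taking the trace at face value, and assuming each 8k stride costs one ~3ms OSD round trip on this stream):

   238173655040 - 238173646848 = 8192 bytes between successive ll_read offsets
   65536 / 8192 = 8 strides per 64k application read
   8 x ~3ms per round trip ~= 24ms

which is consistent with the 23-24ms per 64k read seen in the strace.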


Perhaps some of my caching settings aren't quite right here?  Any hints/suggestions would be welcome.  I'd also be interested in any pointers on the internals that would help me better understand how things are supposed to work.

Thanks,

Andras
