Hi,

The 'problem' I have is low aggregate bandwidth no matter how much I increase the concurrency. I have tried using MPI to launch 322 processes, each calling librados to create a cluster handle, initialize the IO context, and write one 80MB object. In aggregate I only got ~160 MB/sec, while with one process I can get ~40 MB/sec. I'm wondering if the number of client-OSD connections is limited by the number of hosts.
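For reference, each MPI rank does roughly the following with the librados C API. This is a minimal sketch, not our exact code: the pool name, ceph.conf path, and object naming are placeholders, and error checking is omitted.

  /* sketch.c -- what each rank does; compile with: mpicc sketch.c -lrados */
  #include <mpi.h>
  #include <rados/librados.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define OBJ_SIZE (80UL * 1024 * 1024)   /* one 80MB object per rank */

  int main(int argc, char **argv)
  {
      int rank;
      char oid[64];
      rados_t cluster;
      rados_ioctx_t io;
      char *buf = malloc(OBJ_SIZE);

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      memset(buf, 'x', OBJ_SIZE);

      /* per-process cluster handle and IO context */
      rados_create(&cluster, NULL);
      rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
      rados_connect(cluster);
      rados_ioctx_create(cluster, "testpool", &io);

      /* one synchronous full-object write per rank */
      snprintf(oid, sizeof(oid), "obj.rank%05d", rank);
      double t0 = MPI_Wtime();
      rados_write_full(io, oid, buf, OBJ_SIZE);
      double t1 = MPI_Wtime();
      printf("rank %d: %.1f MB/sec\n", rank,
             OBJ_SIZE / (1024.0 * 1024.0) / (t1 - t0));

      rados_ioctx_destroy(io);
      rados_shutdown(cluster);
      free(buf);
      MPI_Finalize();
      return 0;
  }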
To make the problem clearer, here is the configuration of the cluster:

$ ceph osd tree
ID  WEIGHT     TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 1047.59473 root default
 -2  261.89868     host ngfdv036
  0   21.82489         osd.0       up  1.00000          1.00000
  4   21.82489         osd.4       up  1.00000          1.00000
  8   21.82489         osd.8       up  1.00000          1.00000
 12   21.82489         osd.12      up  1.00000          1.00000
 16   21.82489         osd.16      up  1.00000          1.00000
 20   21.82489         osd.20      up  1.00000          1.00000
 24   21.82489         osd.24      up  1.00000          1.00000
 28   21.82489         osd.28      up  1.00000          1.00000
 32   21.82489         osd.32      up  1.00000          1.00000
 36   21.82489         osd.36      up  1.00000          1.00000
 40   21.82489         osd.40      up  1.00000          1.00000
 44   21.82489         osd.44      up  1.00000          1.00000
 -3  261.89868     host ngfdv037
  1   21.82489         osd.1       up  1.00000          1.00000
  5   21.82489         osd.5       up  1.00000          1.00000
  9   21.82489         osd.9       up  1.00000          1.00000
 13   21.82489         osd.13      up  1.00000          1.00000
 17   21.82489         osd.17      up  1.00000          1.00000
 21   21.82489         osd.21      up  1.00000          1.00000
 25   21.82489         osd.25      up  1.00000          1.00000
 29   21.82489         osd.29      up  1.00000          1.00000
 33   21.82489         osd.33      up  1.00000          1.00000
 37   21.82489         osd.37      up  1.00000          1.00000
 41   21.82489         osd.41      up  1.00000          1.00000
 45   21.82489         osd.45      up  1.00000          1.00000
 -4  261.89868     host ngfdv038
  2   21.82489         osd.2       up  1.00000          1.00000
  6   21.82489         osd.6       up  1.00000          1.00000
 10   21.82489         osd.10      up  1.00000          1.00000
 14   21.82489         osd.14      up  1.00000          1.00000
 18   21.82489         osd.18      up  1.00000          1.00000
 22   21.82489         osd.22      up  1.00000          1.00000
 26   21.82489         osd.26      up  1.00000          1.00000
 30   21.82489         osd.30      up  1.00000          1.00000
 34   21.82489         osd.34      up  1.00000          1.00000
 38   21.82489         osd.38      up  1.00000          1.00000
 42   21.82489         osd.42      up  1.00000          1.00000
 46   21.82489         osd.46      up  1.00000          1.00000
 -5  261.89868     host ngfdv039
  3   21.82489         osd.3       up  1.00000          1.00000
  7   21.82489         osd.7       up  1.00000          1.00000
 11   21.82489         osd.11      up  1.00000          1.00000
 15   21.82489         osd.15      up  1.00000          1.00000
 19   21.82489         osd.19      up  1.00000          1.00000
 23   21.82489         osd.23      up  1.00000          1.00000
 27   21.82489         osd.27      up  1.00000          1.00000
 31   21.82489         osd.31      up  1.00000          1.00000
 35   21.82489         osd.35      up  1.00000          1.00000
 39   21.82489         osd.39      up  1.00000          1.00000
 43   21.82489         osd.43      up  1.00000          1.00000
 47   21.82489         osd.47      up  1.00000          1.00000

$ ceph -s
    cluster 2b0e2d2b-3f63-4815-908a-b032c7f9427a
     health HEALTH_OK
     monmap e1: 2 mons at {ngfdv076=128.55.xxx.xx:6789/0,ngfdv078=128.55.xxx.xx:6789/0}
            election epoch 4, quorum 0,1 ngfdv076,ngfdv078
     osdmap e280: 48 osds: 48 up, 48 in
            flags sortbitwise,require_jewel_osds
      pgmap v117283: 3136 pgs, 11 pools, 25600 MB data, 510 objects
            79218 MB used, 1047 TB / 1047 TB avail
                3136 active+clean

Best,
Jialin
NERSC/LBNL

On Mon, Jun 18, 2018 at 1:06 AM Jialin Liu <jaln...@lbl.gov> wrote:
> Thank you Dan. I'll try it.
>
> Best,
> Jialin
> NERSC/LBNL
>
> > On Jun 18, 2018, at 12:22 AM, Dan van der Ster <d...@vanderster.com> wrote:
> >
> > Hi,
> >
> > One way you can see exactly what is happening when you write an object
> > is with --debug_ms=1.
> >
> > For example, I write a 100MB object to a test pool:
> >
> >   rados --debug_ms=1 -p test put 100M.dat 100M.dat
> >
> > I pasted the output of this here: https://pastebin.com/Zg8rjaTV
> > In this case, it first gets the cluster maps from a mon, then writes
> > the object to osd.58, which is the primary osd for PG 119.77:
> >
> >   # ceph pg 119.77 query | jq .up
> >   [
> >     58,
> >     49,
> >     31
> >   ]
> >
> > Otherwise I answered your questions below...
> >
> >> On Sun, Jun 17, 2018 at 8:29 PM Jialin Liu <jaln...@lbl.gov> wrote:
> >>
> >> Hello,
> >>
> >> I have a couple of questions regarding the IO on OSD via librados.
> >>
> >> 1. How to check which osd is receiving data?
> >
> > See `ceph osd map`. For my example above:
> >
> >   # ceph osd map test 100M.dat
> >   osdmap e236396 pool 'test' (119) object '100M.dat' -> pg 119.864b0b77
> >   (119.77) -> up ([58,49,31], p58) acting ([58,49,31], p58)
> >
> >> 2. Can the write operation return immediately to the application once
> >> the write to the primary OSD is done, or does it return only when the
> >> data is replicated twice? (size=3)
> >
> > Write returns once it is safe on *all* replicas or EC chunks.
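Since a synchronous write only returns once it is safe everywhere, one way a single client can drive more bandwidth is to keep several writes in flight with the librados aio calls. A minimal sketch, with placeholder pool/object names and error checking omitted (on Jewel, rados_aio_wait_for_safe() blocks until the data is durable on all replicas):

  /* aio_sketch.c -- overlap several async writes from one client;
   * compile with: cc aio_sketch.c -lrados */
  #include <rados/librados.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define NOBJ     16
  #define OBJ_SIZE (8UL * 1024 * 1024)

  int main(void)
  {
      rados_t cluster;
      rados_ioctx_t io;
      rados_completion_t comp[NOBJ];
      char oid[64];
      char *buf = malloc(OBJ_SIZE);
      int i;
      memset(buf, 'x', OBJ_SIZE);

      rados_create(&cluster, NULL);
      rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
      rados_connect(cluster);
      rados_ioctx_create(cluster, "testpool", &io);

      /* queue all writes before waiting on any of them */
      for (i = 0; i < NOBJ; i++) {
          snprintf(oid, sizeof(oid), "aio.obj.%d", i);
          rados_aio_create_completion(NULL, NULL, NULL, &comp[i]);
          rados_aio_write_full(io, oid, comp[i], buf, OBJ_SIZE);
      }
      /* wait until each write is durable on all replicas */
      for (i = 0; i < NOBJ; i++) {
          rados_aio_wait_for_safe(comp[i]);
          rados_aio_release(comp[i]);
      }

      rados_ioctx_destroy(io);
      rados_shutdown(cluster);
      free(buf);
      return 0;
  }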
> >> 3. What is the I/O size at the lower level in librados? E.g., if I
> >> send a 100MB request with 1 thread, does librados send the data in a
> >> fixed transaction size?
> >
> > This depends on the client. The `rados` CLI example I showed you broke
> > the 100MB object into 4MB parts.
> > Most use-cases keep the objects around 4MB or 8MB.
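Client-side, that kind of fixed-size splitting looks roughly like the hypothetical helper below (the name and 4MB chunk size are illustrative; it assumes an ioctx set up as in the earlier sketches, and libradosstriper can also stripe a large write over multiple objects transparently):

  /* Split a large buffer into 4MB objects named <base>.000000,
   * <base>.000001, ... -- similar in spirit to what the rados CLI does. */
  #include <rados/librados.h>
  #include <stdio.h>

  #define CHUNK (4UL * 1024 * 1024)

  static int put_in_chunks(rados_ioctx_t io, const char *base,
                           const char *buf, size_t len)
  {
      char oid[128];
      size_t off, n;
      unsigned idx = 0;
      int r;

      for (off = 0; off < len; off += CHUNK, idx++) {
          n = (len - off < CHUNK) ? (len - off) : CHUNK;
          snprintf(oid, sizeof(oid), "%s.%06u", base, idx);
          r = rados_write_full(io, oid, buf + off, n);  /* one object per chunk */
          if (r < 0)
              return r;
      }
      return 0;
  }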
> >> 4. I have 4 OSS, 48 OSDs; will the 4 OSS become the bottleneck? From
> >> the Ceph documentation, once the cluster map is received by the
> >> client, the client can talk to the OSDs directly, so the assumption
> >> is that the maximum parallelism depends on the number of OSDs. Is
> >> this correct?
> >
> > That's more or less correct -- the IOPS and BW capacity of the cluster
> > generally scales linearly with the number of OSDs.
> >
> > Cheers,
> > Dan
> > CERN

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com