Thanks for the advice, Dan. I'll try to reconfigure the cluster and see if the performance changes.
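In case it's useful to others on the list, here is roughly the per-rank write path I'm benchmarking, sketched with the librados Python bindings. This is a minimal sketch, not my exact benchmark: the pool name, conffile path, and object-naming scheme below are placeholders for my setup, and the real runs launch 322 ranks via mpi4py.

```python
# Sketch of a per-MPI-rank librados write (assumes the "rados" Python
# bindings and mpi4py are installed; pool/conffile names are placeholders).

OBJ_MB = 80  # one 80 MB object per rank, as in my test

def object_name(rank):
    """Unique object name per MPI rank so ranks don't overwrite each other."""
    return "bench.rank%05d" % rank

def make_payload(rank, mb=OBJ_MB):
    """mb megabytes of deterministic bytes, cheap to generate per rank."""
    return (b"%08d" % rank) * (mb * 1024 * 1024 // 8)

def write_one(rank, pool="testpool", conffile="/etc/ceph/ceph.conf"):
    import rados  # librados Python binding
    cluster = rados.Rados(conffile=conffile)
    cluster.connect()  # each rank gets its own cluster handle
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            ioctx.write_full(object_name(rank), make_payload(rank))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

if __name__ == "__main__":
    from mpi4py import MPI
    write_one(MPI.COMM_WORLD.Get_rank())
```

Timing the `write_full` calls across ranks (e.g. around an `MPI.COMM_WORLD.Barrier()`) is how I derive the aggregate MB/s figures below.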
Best,
Jialin

On Tue, Jun 19, 2018 at 12:02 AM Dan van der Ster <d...@vanderster.com> wrote:
> On Tue, Jun 19, 2018 at 1:04 AM Jialin Liu <jaln...@lbl.gov> wrote:
> >
> > Hi Dan, thanks for the follow-ups.
> >
> > I have just tried running multiple librados MPI applications from
> > multiple nodes, and it does show increased bandwidth. With ceph -w, I
> > observed as high as 500 MB/s (previously only 160 MB/s). I think I can
> > do finer tuning by coordinating more concurrent applications to reach
> > the peak. (Sorry, I only have one node with the rados CLI installed,
> > so I can't follow your example to stress the server.)
> >
> >> Then you can try different replication or erasure coding settings to
> >> learn their impact on performance...
> >
> > Good points.
> >
> >> PPS. What are those 21.8TB devices?
> >
> > The storage arrays are Nexsan E60 arrays with two active-active
> > redundant controllers and 60 3 TB disk drives. The disk drives are
> > organized into six 8+2 RAID 6 LUNs of 24 TB each.
>
> This is not ideal Ceph hardware. Ceph is designed to use disks
> directly -- JBODs. All redundancy is handled at the RADOS level, so
> you can happily save lots of cash on your servers. I suggest reading
> through the various Ceph hardware recommendations that you can find
> via Google.
>
> I can't tell from here whether this is the root cause of your
> performance issue -- but you should plan future clusters to use JBODs
> instead of expensive arrays.
>
> >> PPPS. Any reason you are running jewel instead of luminous or mimic?
> >
> > I have to ask the cluster admin; I'm not sure about it.
> > I have one more question, regarding the OSD servers and OSDs: I was
> > told that the IO has to go through the 4 OSD servers (hosts) before
> > touching the OSDs. This is confusing to me, as I learned from the Ceph
> > documentation
> > http://docs.ceph.com/docs/jewel/rados/operations/monitoring-osd-pg/#monitoring-osds
> > that librados can talk to the OSDs directly. What am I missing here?
>
> You should have one ceph-osd process per disk (or per LUN in your
> case). The clients connect to the ceph-osd processes directly.
>
> -- dan
>
> > Best,
> > Jialin
> > NERSC/LBNL
> >
> >> On Mon, Jun 18, 2018 at 3:43 PM Jialin Liu <jaln...@lbl.gov> wrote:
> >> >
> >> > Hi, to make the problem clearer, here is the configuration of the
> >> > cluster:
> >> >
> >> > The 'problem' I have is the low bandwidth, no matter how much I
> >> > increase the concurrency. I have tried using MPI to launch 322
> >> > processes, each calling librados to create a handle, initialize the
> >> > io context, and write one 80 MB object. I only got ~160 MB/s; with
> >> > one process I can get ~40 MB/s. I'm wondering if the number of
> >> > client-osd connections is limited by the number of hosts.
> >> > Best,
> >> > Jialin
> >> > NERSC/LBNL
> >> >
> >> > $ ceph osd tree
> >> >
> >> > ID WEIGHT     TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >> > -1 1047.59473 root default
> >> > -2  261.89868     host ngfdv036
> >> >  0   21.82489         osd.0           up  1.00000          1.00000
> >> >  4   21.82489         osd.4           up  1.00000          1.00000
> >> >  8   21.82489         osd.8           up  1.00000          1.00000
> >> > 12   21.82489         osd.12          up  1.00000          1.00000
> >> > 16   21.82489         osd.16          up  1.00000          1.00000
> >> > 20   21.82489         osd.20          up  1.00000          1.00000
> >> > 24   21.82489         osd.24          up  1.00000          1.00000
> >> > 28   21.82489         osd.28          up  1.00000          1.00000
> >> > 32   21.82489         osd.32          up  1.00000          1.00000
> >> > 36   21.82489         osd.36          up  1.00000          1.00000
> >> > 40   21.82489         osd.40          up  1.00000          1.00000
> >> > 44   21.82489         osd.44          up  1.00000          1.00000
> >> > -3  261.89868     host ngfdv037
> >> >  1   21.82489         osd.1           up  1.00000          1.00000
> >> >  5   21.82489         osd.5           up  1.00000          1.00000
> >> >  9   21.82489         osd.9           up  1.00000          1.00000
> >> > 13   21.82489         osd.13          up  1.00000          1.00000
> >> > 17   21.82489         osd.17          up  1.00000          1.00000
> >> > 21   21.82489         osd.21          up  1.00000          1.00000
> >> > 25   21.82489         osd.25          up  1.00000          1.00000
> >> > 29   21.82489         osd.29          up  1.00000          1.00000
> >> > 33   21.82489         osd.33          up  1.00000          1.00000
> >> > 37   21.82489         osd.37          up  1.00000          1.00000
> >> > 41   21.82489         osd.41          up  1.00000          1.00000
> >> > 45   21.82489         osd.45          up  1.00000          1.00000
> >> > -4  261.89868     host ngfdv038
> >> >  2   21.82489         osd.2           up  1.00000          1.00000
> >> >  6   21.82489         osd.6           up  1.00000          1.00000
> >> > 10   21.82489         osd.10          up  1.00000          1.00000
> >> > 14   21.82489         osd.14          up  1.00000          1.00000
> >> > 18   21.82489         osd.18          up  1.00000          1.00000
> >> > 22   21.82489         osd.22          up  1.00000          1.00000
> >> > 26   21.82489         osd.26          up  1.00000          1.00000
> >> > 30   21.82489         osd.30          up  1.00000          1.00000
> >> > 34   21.82489         osd.34          up  1.00000          1.00000
> >> > 38   21.82489         osd.38          up  1.00000          1.00000
> >> > 42   21.82489         osd.42          up  1.00000          1.00000
> >> > 46   21.82489         osd.46          up  1.00000          1.00000
> >> > -5  261.89868     host ngfdv039
> >> >  3   21.82489         osd.3           up  1.00000          1.00000
> >> >  7   21.82489         osd.7           up  1.00000          1.00000
> >> > 11   21.82489         osd.11          up  1.00000          1.00000
> >> > 15   21.82489         osd.15          up  1.00000          1.00000
> >> > 19   21.82489         osd.19          up  1.00000          1.00000
> >> > 23   21.82489         osd.23          up  1.00000          1.00000
> >> > 27   21.82489         osd.27          up  1.00000          1.00000
> >> > 31   21.82489         osd.31          up  1.00000          1.00000
> >> > 35   21.82489         osd.35          up  1.00000          1.00000
> >> > 39   21.82489         osd.39          up  1.00000          1.00000
> >> > 43   21.82489         osd.43          up  1.00000          1.00000
> >> > 47   21.82489         osd.47          up  1.00000          1.00000
> >> >
> >> > $ ceph -s
> >> >     cluster 2b0e2d2b-3f63-4815-908a-b032c7f9427a
> >> >      health HEALTH_OK
> >> >      monmap e1: 2 mons at {ngfdv076=128.55.xxx.xx:6789/0,ngfdv078=128.55.xxx.xx:6789/0}
> >> >             election epoch 4, quorum 0,1 ngfdv076,ngfdv078
> >> >      osdmap e280: 48 osds: 48 up, 48 in
> >> >             flags sortbitwise,require_jewel_osds
> >> >       pgmap v117283: 3136 pgs, 11 pools, 25600 MB data, 510 objects
> >> >             79218 MB used, 1047 TB / 1047 TB avail
> >> >                 3136 active+clean
> >> >
> >> > On Mon, Jun 18, 2018 at 1:06 AM Jialin Liu <jaln...@lbl.gov> wrote:
> >> >>
> >> >> Thank you Dan. I'll try it.
> >> >>
> >> >> Best,
> >> >> Jialin
> >> >> NERSC/LBNL
> >> >>
> >> >> > On Jun 18, 2018, at 12:22 AM, Dan van der Ster <d...@vanderster.com> wrote:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > One way you can see exactly what is happening when you write an
> >> >> > object is with --debug_ms=1.
> >> >> >
> >> >> > For example, I write a 100MB object to a test pool:
> >> >> >   rados --debug_ms=1 -p test put 100M.dat 100M.dat
> >> >> > I pasted the output of this here: https://pastebin.com/Zg8rjaTV
> >> >> > In this case, it first gets the cluster maps from a mon, then
> >> >> > writes the object to osd.58, which is the primary osd for PG 119.77:
> >> >> >
> >> >> > # ceph pg 119.77 query | jq .up
> >> >> > [
> >> >> >   58,
> >> >> >   49,
> >> >> >   31
> >> >> > ]
> >> >> >
> >> >> > Otherwise I answered your questions below...
> >> >> >
> >> >> >> On Sun, Jun 17, 2018 at 8:29 PM Jialin Liu <jaln...@lbl.gov> wrote:
> >> >> >>
> >> >> >> Hello,
> >> >> >>
> >> >> >> I have a couple of questions regarding the IO on OSDs via librados.
> >> >> >>
> >> >> >> 1. How to check which OSD is receiving data?
> >> >> >
> >> >> > See `ceph osd map`. For my example above:
> >> >> >
> >> >> > # ceph osd map test 100M.dat
> >> >> > osdmap e236396 pool 'test' (119) object '100M.dat' -> pg 119.864b0b77
> >> >> > (119.77) -> up ([58,49,31], p58) acting ([58,49,31], p58)
> >> >> >
> >> >> >> 2. Can the write operation return immediately to the application
> >> >> >> once the write to the primary OSD is done? Or does it return only
> >> >> >> when the data is replicated twice? (size=3)
> >> >> >
> >> >> > A write returns once it is safe on *all* replicas or EC chunks.
> >> >> >
> >> >> >> 3. What is the I/O size at the lower level in librados? E.g., if I
> >> >> >> send a 100MB request with 1 thread, does librados send the data in
> >> >> >> a fixed transaction size?
> >> >> >
> >> >> > This depends on the client. The `rados` CLI example I showed you
> >> >> > broke the 100MB object into 4MB parts.
> >> >> > Most use cases keep objects around 4MB or 8MB.
> >> >> >
> >> >> >> 4. I have 4 OSS and 48 OSDs; will the 4 OSS become the bottleneck?
> >> >> >> From the Ceph documentation, once the cluster map is received by
> >> >> >> the client, the client can talk to the OSDs directly, so the
> >> >> >> assumption is that the max parallelism depends on the number of
> >> >> >> OSDs. Is this correct?
> >> >> >
> >> >> > That's more or less correct -- the IOPS and bandwidth capacity of
> >> >> > the cluster generally scales linearly with the number of OSDs.
> >> >> >
> >> >> > Cheers,
> >> >> > Dan
> >> >> > CERN
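One more note on Dan's point 3 above: the ~4 MB chunking the rados CLI does is easy to mimic from the client side. Here is a rough Python sketch of the offset arithmetic (my own illustration, not the actual rados CLI code; `CHUNK` matches the 4 MB op size Dan observed):

```python
CHUNK = 4 * 1024 * 1024  # ~4 MB per write op, as in Dan's rados CLI trace

def chunk_offsets(total_bytes, chunk=CHUNK):
    """Yield (offset, length) pairs covering total_bytes in chunk-sized ops."""
    off = 0
    while off < total_bytes:
        yield off, min(chunk, total_bytes - off)
        off += chunk

# A 100 MB object maps to 25 ops of 4 MB each; each pair could feed a call
# like ioctx.write(name, data[off:off + length], off) in the Python bindings.
ops = list(chunk_offsets(100 * 1024 * 1024))
print(len(ops))  # 25
```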
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com