Thanks for the advice, Dan.

I'll try to reconfigure the cluster and see if the performance changes.

Best,
Jialin

On Tue, Jun 19, 2018 at 12:02 AM Dan van der Ster <d...@vanderster.com>
wrote:

> On Tue, Jun 19, 2018 at 1:04 AM Jialin Liu <jaln...@lbl.gov> wrote:
> >
> > Hi Dan, Thanks for the follow-ups.
> >
> > I have just tried running multiple librados MPI applications from
> > multiple nodes, and it does show increased bandwidth. With ceph -w I
> > observed as high as 500 MB/sec (previously only 160 MB/sec). I think
> > I can do finer tuning by coordinating more concurrent applications to
> > reach the peak. (Sorry, I only have one node with the rados CLI
> > installed, so I can't follow your example to stress the servers.)
> >
> >> Then you can try different replication or erasure coding settings to
> >> learn their impact on performance...
> >
> >
> > Good points.
> >
> >>
> >> PPS. What are those 21.8TB devices?
> >
> >
> > The storage arrays are Nexsan E60 arrays with two active-active
> > redundant controllers and 60 3 TB disk drives. The drives are
> > organized into six 8+2 RAID 6 LUNs of 24 TB each.
> >
>
> This is not the ideal Ceph hardware. Ceph is designed to use disks
> directly -- JBODs. All redundancy is handled at the RADOS level, so
> you can happily save lots of cash on your servers. I suggest reading
> through the various Ceph hardware recommendations that you can find
> via Google.
>
> I can't tell from here if this is the root cause of your performance
> issue -- but you should plan future clusters to use JBODs instead of
> expensive arrays.
>
> >
> >>
> >> PPPS. Any reason you are running jewel instead of luminous or mimic?
> >
> >
> > I have to ask the cluster admin; I'm not sure about it.
> >
> > I have one more question regarding the OSD servers and OSDs. I was
> > told that the I/O has to go through the 4 OSD servers (hosts) before
> > touching the OSDs. This is confusing to me, because according to the
> > Ceph documentation at
> > http://docs.ceph.com/docs/jewel/rados/operations/monitoring-osd-pg/#monitoring-osds
> > librados can talk to the OSDs directly. What am I missing here?
>
> You should have one ceph-osd process per disk (or per LUN in your
> case). The clients connect to the ceph-osd processes directly.
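>
> If it helps to see it concretely, here is a minimal librados sketch in
> Python (the pool name 'test' and the ceph.conf path are assumptions --
> adjust for your cluster). The monitors are only contacted to fetch the
> cluster maps; the object write itself goes straight to the relevant
> ceph-osd process:
>
> import rados
>
> # The monitors listed in ceph.conf are used only for authentication
> # and to retrieve the cluster maps.
> cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
> cluster.connect()
>
> # Object I/O then goes directly to the primary OSD of the object's
> # placement group -- there is no intermediate "OSD server" hop.
> ioctx = cluster.open_ioctx('test')
> ioctx.write_full('hello.obj', b'hello ceph')
>
> ioctx.close()
> cluster.shutdown()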
>
> -- dan
>
>
> >
> >
> > Best,
> > Jialin
> > NERSC/LBNL
> >
> >
> >>
> >> On Mon, Jun 18, 2018 at 3:43 PM Jialin Liu <jaln...@lbl.gov> wrote:
> >> >
> >> > Hi, to make the problem clearer, here is the configuration of the
> >> > cluster:
> >> >
> >> > The 'problem' I have is the low bandwidth, no matter how much I
> >> > increase the concurrency. I have tried using MPI to launch 322
> >> > processes, each calling librados to create a cluster handle,
> >> > initialize the io context, and write one 80 MB object. I only got
> >> > ~160 MB/sec; with one process I can get ~40 MB/sec. I'm wondering
> >> > whether the number of client-OSD connections is limited by the
> >> > number of hosts.
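> >> >
> >> > For reference, here is a minimal sketch of the kind of test each
> >> > rank runs (using mpi4py and the Python rados bindings purely for
> >> > illustration; the pool name 'ior_pool' is an assumption -- the
> >> > real benchmark differs in the details):
> >> >
> >> > from mpi4py import MPI
> >> > import rados
> >> >
> >> > comm = MPI.COMM_WORLD
> >> > rank = comm.Get_rank()
> >> >
> >> > # Each rank opens its own cluster handle and io context.
> >> > cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
> >> > cluster.connect()
> >> > ioctx = cluster.open_ioctx('ior_pool')
> >> >
> >> > # Each rank writes one 80 MB object.
> >> > data = b'x' * (80 * 1024 * 1024)
> >> > comm.Barrier()
> >> > t0 = MPI.Wtime()
> >> > ioctx.write_full('obj.%d' % rank, data)
> >> > comm.Barrier()
> >> > t1 = MPI.Wtime()
> >> >
> >> > if rank == 0:
> >> >     total_mb = 80 * comm.Get_size()
> >> >     print('aggregate bandwidth: %.1f MB/sec' % (total_mb / (t1 - t0)))
> >> >
> >> > ioctx.close()
> >> > cluster.shutdown()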
> >> >
> >> > Best,
> >> > Jialin
> >> > NERSC/LBNL
> >> >
> >> > $ceph osd tree
> >> >
> >> > ID WEIGHT     TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >> > -1 1047.59473 root default
> >> > -2  261.89868     host ngfdv036
> >> >  0   21.82489         osd.0          up  1.00000          1.00000
> >> >  4   21.82489         osd.4          up  1.00000          1.00000
> >> >  8   21.82489         osd.8          up  1.00000          1.00000
> >> > 12   21.82489         osd.12         up  1.00000          1.00000
> >> > 16   21.82489         osd.16         up  1.00000          1.00000
> >> > 20   21.82489         osd.20         up  1.00000          1.00000
> >> > 24   21.82489         osd.24         up  1.00000          1.00000
> >> > 28   21.82489         osd.28         up  1.00000          1.00000
> >> > 32   21.82489         osd.32         up  1.00000          1.00000
> >> > 36   21.82489         osd.36         up  1.00000          1.00000
> >> > 40   21.82489         osd.40         up  1.00000          1.00000
> >> > 44   21.82489         osd.44         up  1.00000          1.00000
> >> > -3  261.89868     host ngfdv037
> >> >  1   21.82489         osd.1          up  1.00000          1.00000
> >> >  5   21.82489         osd.5          up  1.00000          1.00000
> >> >  9   21.82489         osd.9          up  1.00000          1.00000
> >> > 13   21.82489         osd.13         up  1.00000          1.00000
> >> > 17   21.82489         osd.17         up  1.00000          1.00000
> >> > 21   21.82489         osd.21         up  1.00000          1.00000
> >> > 25   21.82489         osd.25         up  1.00000          1.00000
> >> > 29   21.82489         osd.29         up  1.00000          1.00000
> >> > 33   21.82489         osd.33         up  1.00000          1.00000
> >> > 37   21.82489         osd.37         up  1.00000          1.00000
> >> > 41   21.82489         osd.41         up  1.00000          1.00000
> >> > 45   21.82489         osd.45         up  1.00000          1.00000
> >> > -4  261.89868     host ngfdv038
> >> >  2   21.82489         osd.2          up  1.00000          1.00000
> >> >  6   21.82489         osd.6          up  1.00000          1.00000
> >> > 10   21.82489         osd.10         up  1.00000          1.00000
> >> > 14   21.82489         osd.14         up  1.00000          1.00000
> >> > 18   21.82489         osd.18         up  1.00000          1.00000
> >> > 22   21.82489         osd.22         up  1.00000          1.00000
> >> > 26   21.82489         osd.26         up  1.00000          1.00000
> >> > 30   21.82489         osd.30         up  1.00000          1.00000
> >> > 34   21.82489         osd.34         up  1.00000          1.00000
> >> > 38   21.82489         osd.38         up  1.00000          1.00000
> >> > 42   21.82489         osd.42         up  1.00000          1.00000
> >> > 46   21.82489         osd.46         up  1.00000          1.00000
> >> > -5  261.89868     host ngfdv039
> >> >  3   21.82489         osd.3          up  1.00000          1.00000
> >> >  7   21.82489         osd.7          up  1.00000          1.00000
> >> > 11   21.82489         osd.11         up  1.00000          1.00000
> >> > 15   21.82489         osd.15         up  1.00000          1.00000
> >> > 19   21.82489         osd.19         up  1.00000          1.00000
> >> > 23   21.82489         osd.23         up  1.00000          1.00000
> >> > 27   21.82489         osd.27         up  1.00000          1.00000
> >> > 31   21.82489         osd.31         up  1.00000          1.00000
> >> > 35   21.82489         osd.35         up  1.00000          1.00000
> >> > 39   21.82489         osd.39         up  1.00000          1.00000
> >> > 43   21.82489         osd.43         up  1.00000          1.00000
> >> > 47   21.82489         osd.47         up  1.00000          1.00000
> >> >
> >> >
> >> > ceph -s
> >> >
> >> >     cluster 2b0e2d2b-3f63-4815-908a-b032c7f9427a
> >> >      health HEALTH_OK
> >> >      monmap e1: 2 mons at {ngfdv076=128.55.xxx.xx:6789/0,ngfdv078=128.55.xxx.xx:6789/0}
> >> >             election epoch 4, quorum 0,1 ngfdv076,ngfdv078
> >> >      osdmap e280: 48 osds: 48 up, 48 in
> >> >             flags sortbitwise,require_jewel_osds
> >> >       pgmap v117283: 3136 pgs, 11 pools, 25600 MB data, 510 objects
> >> >             79218 MB used, 1047 TB / 1047 TB avail
> >> >                 3136 active+clean
> >> >
> >> >
> >> >
> >> > On Mon, Jun 18, 2018 at 1:06 AM Jialin Liu <jaln...@lbl.gov> wrote:
> >> >>
> >> >> Thank you Dan. I’ll try it.
> >> >>
> >> >> Best,
> >> >> Jialin
> >> >> NERSC/LBNL
> >> >>
> >> >> > On Jun 18, 2018, at 12:22 AM, Dan van der Ster <d...@vanderster.com>
> wrote:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > One way you can see exactly what is happening when you write an
> >> >> > object is with --debug_ms=1.
> >> >> >
> >> >> > For example, I write a 100MB object to a test pool:
> >> >> >   rados --debug_ms=1 -p test put 100M.dat 100M.dat
> >> >> > I pasted the output of this here: https://pastebin.com/Zg8rjaTV
> >> >> > In this case, it first gets the cluster maps from a mon, then writes
> >> >> > the object to osd.58, which is the primary osd for PG 119.77:
> >> >> >
> >> >> > # ceph pg 119.77 query | jq .up
> >> >> > [
> >> >> >  58,
> >> >> >  49,
> >> >> >  31
> >> >> > ]
> >> >> >
> >> >> > Otherwise I answered your questions below...
> >> >> >
> >> >> >> On Sun, Jun 17, 2018 at 8:29 PM Jialin Liu <jaln...@lbl.gov>
> wrote:
> >> >> >>
> >> >> >> Hello,
> >> >> >>
> >> >> >> I have a couple questions regarding the IO on OSD via librados.
> >> >> >>
> >> >> >>
> >> >> >> 1. How to check which osd is receiving data?
> >> >> >>
> >> >> >
> >> >> > See `ceph osd map`.
> >> >> > For my example above:
> >> >> >
> >> >> > # ceph osd map test 100M.dat
> >> >> > osdmap e236396 pool 'test' (119) object '100M.dat' -> pg 119.864b0b77
> >> >> > (119.77) -> up ([58,49,31], p58) acting ([58,49,31], p58)
> >> >> >
> >> >> >> 2. Can the write operation return immediately to the application
> >> >> >> once the write to the primary OSD is done? Or does it return only
> >> >> >> when the data is replicated twice? (size=3)
> >> >> >
> >> >> > Write returns once it is safe on *all* replicas or EC chunks.
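> >> >> >
> >> >> > As a rough illustration (a sketch with the Python rados bindings;
> >> >> > the pool name 'test' is an assumption): even an asynchronous write
> >> >> > is only reported complete after the primary OSD has heard back from
> >> >> > all of the replicas.
> >> >> >
> >> >> > import rados
> >> >> >
> >> >> > cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
> >> >> > cluster.connect()
> >> >> > ioctx = cluster.open_ioctx('test')
> >> >> >
> >> >> > # Queue an asynchronous write of the whole object. The returned
> >> >> > # completion only finishes once every replica (or EC chunk) has
> >> >> > # acknowledged the write.
> >> >> > comp = ioctx.aio_write_full('async.obj', b'payload')
> >> >> > comp.wait_for_complete()
> >> >> >
> >> >> > ioctx.close()
> >> >> > cluster.shutdown()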
> >> >> >
> >> >> >> 3. What is the lower-level I/O size in librados? E.g., if I send a
> >> >> >> 100MB request with 1 thread, does librados send the data in a fixed
> >> >> >> transaction size?
> >> >> >
> >> >> > This depends on the client. The `rados` CLI example I showed you
> >> >> > broke the 100MB object into 4MB parts.
> >> >> > Most use-cases keep the objects around 4MB or 8MB.
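> >> >> >
> >> >> > As a rough sketch of that chunking (not the rados tool's actual
> >> >> > code -- just an illustration with the Python bindings; the pool
> >> >> > name 'test' is an assumption), a large buffer can be pushed as a
> >> >> > series of fixed-size writes at increasing offsets:
> >> >> >
> >> >> > import rados
> >> >> >
> >> >> > CHUNK = 4 * 1024 * 1024  # 4 MB per write op
> >> >> >
> >> >> > cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
> >> >> > cluster.connect()
> >> >> > ioctx = cluster.open_ioctx('test')
> >> >> >
> >> >> > data = b'x' * (100 * 1024 * 1024)  # 100 MB payload
> >> >> > for off in range(0, len(data), CHUNK):
> >> >> >     # Each call becomes one write op of at most 4 MB to the OSD.
> >> >> >     ioctx.write('100M.dat', data[off:off + CHUNK], off)
> >> >> >
> >> >> > ioctx.close()
> >> >> > cluster.shutdown()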
> >> >> >
> >> >> >> 4. I have 4 OSS and 48 OSDs; will the 4 OSS become the bottleneck?
> >> >> >> From the Ceph documentation, once the cluster map is received by the
> >> >> >> client, the client can talk to the OSDs directly, so the assumption
> >> >> >> is that the max parallelism depends on the number of OSDs. Is this
> >> >> >> correct?
> >> >> >>
> >> >> >
> >> >> > That's more or less correct -- the IOPS and BW capacity of the
> >> >> > cluster generally scales linearly with the number of OSDs.
> >> >> >
> >> >> > Cheers,
> >> >> > Dan
> >> >> > CERN
>