Hi,

To make the problem clearer, here is the configuration of the cluster:

The 'problem' I have is low bandwidth no matter how much I increase the
concurrency. I have tried using MPI to launch 322 processes, each calling
librados to create a cluster handle, initialize an io context, and write
one 80MB object. In aggregate I only got ~160 MB/sec, while a single
process can reach ~40 MB/sec. I'm wondering whether the number of
client-OSD connections is limited by the number of hosts.
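
Roughly, each rank does the equivalent of the following (a minimal sketch
using the Python rados bindings rather than my actual code; the conf path,
the pool name 'testpool', and the object naming are placeholders):

# Per-rank write path (sketch only; placeholders as noted above).
import rados
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()

# One cluster handle and one io context per process.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('testpool')   # placeholder pool name

# Each rank writes a single 80 MB object.
data = b'\0' * (80 * 1024 * 1024)
ioctx.write_full('obj-%d' % rank, data)

ioctx.close()
cluster.shutdown()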

Best,
Jialin
NERSC/LBNL

$ ceph osd tree

ID WEIGHT     TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1047.59473 root default
-2  261.89868     host ngfdv036
 0   21.82489         osd.0          up  1.00000          1.00000
 4   21.82489         osd.4          up  1.00000          1.00000
 8   21.82489         osd.8          up  1.00000          1.00000
12   21.82489         osd.12         up  1.00000          1.00000
16   21.82489         osd.16         up  1.00000          1.00000
20   21.82489         osd.20         up  1.00000          1.00000
24   21.82489         osd.24         up  1.00000          1.00000
28   21.82489         osd.28         up  1.00000          1.00000
32   21.82489         osd.32         up  1.00000          1.00000
36   21.82489         osd.36         up  1.00000          1.00000
40   21.82489         osd.40         up  1.00000          1.00000
44   21.82489         osd.44         up  1.00000          1.00000
-3  261.89868     host ngfdv037
 1   21.82489         osd.1          up  1.00000          1.00000
 5   21.82489         osd.5          up  1.00000          1.00000
 9   21.82489         osd.9          up  1.00000          1.00000
13   21.82489         osd.13         up  1.00000          1.00000
17   21.82489         osd.17         up  1.00000          1.00000
21   21.82489         osd.21         up  1.00000          1.00000
25   21.82489         osd.25         up  1.00000          1.00000
29   21.82489         osd.29         up  1.00000          1.00000
33   21.82489         osd.33         up  1.00000          1.00000
37   21.82489         osd.37         up  1.00000          1.00000
41   21.82489         osd.41         up  1.00000          1.00000
45   21.82489         osd.45         up  1.00000          1.00000
-4  261.89868     host ngfdv038
 2   21.82489         osd.2          up  1.00000          1.00000
 6   21.82489         osd.6          up  1.00000          1.00000
10   21.82489         osd.10         up  1.00000          1.00000
14   21.82489         osd.14         up  1.00000          1.00000
18   21.82489         osd.18         up  1.00000          1.00000
22   21.82489         osd.22         up  1.00000          1.00000
26   21.82489         osd.26         up  1.00000          1.00000
30   21.82489         osd.30         up  1.00000          1.00000
34   21.82489         osd.34         up  1.00000          1.00000
38   21.82489         osd.38         up  1.00000          1.00000
42   21.82489         osd.42         up  1.00000          1.00000
46   21.82489         osd.46         up  1.00000          1.00000
-5  261.89868     host ngfdv039
 3   21.82489         osd.3          up  1.00000          1.00000
 7   21.82489         osd.7          up  1.00000          1.00000
11   21.82489         osd.11         up  1.00000          1.00000
15   21.82489         osd.15         up  1.00000          1.00000
19   21.82489         osd.19         up  1.00000          1.00000
23   21.82489         osd.23         up  1.00000          1.00000
27   21.82489         osd.27         up  1.00000          1.00000
31   21.82489         osd.31         up  1.00000          1.00000
35   21.82489         osd.35         up  1.00000          1.00000
39   21.82489         osd.39         up  1.00000          1.00000
43   21.82489         osd.43         up  1.00000          1.00000
47   21.82489         osd.47         up  1.00000          1.00000

$ ceph -s

    cluster 2b0e2d2b-3f63-4815-908a-b032c7f9427a
     health HEALTH_OK
     monmap e1: 2 mons at {ngfdv076=128.55.xxx.xx:6789/0,ngfdv078=128.55.xxx.xx:6789/0}
            election epoch 4, quorum 0,1 ngfdv076,ngfdv078
     osdmap e280: 48 osds: 48 up, 48 in
            flags sortbitwise,require_jewel_osds
      pgmap v117283: 3136 pgs, 11 pools, 25600 MB data, 510 objects
            79218 MB used, 1047 TB / 1047 TB avail
                3136 active+clean
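
One way to check whether those 322 objects actually spread across many
OSDs (rather than hitting only a few primaries) is to map each object to
its primary OSD from the client side, along the lines of the `ceph osd
map` command mentioned below. A rough sketch via the Python bindings (same
placeholder pool/object names as above; the 'osd map' mon command and the
'acting_primary' JSON field are my assumptions and may differ by release):

# Sketch: count objects per primary OSD (assumptions as noted above).
import json
import rados
from collections import Counter

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

primaries = Counter()
for i in range(322):
    cmd = json.dumps({'prefix': 'osd map',
                      'pool': 'testpool',        # placeholder pool name
                      'object': 'obj-%d' % i,    # placeholder object names
                      'format': 'json'})
    ret, outbuf, outs = cluster.mon_command(cmd, b'')
    if ret == 0:
        # 'acting_primary' field name assumed from the JSON output.
        primaries[json.loads(outbuf)['acting_primary']] += 1

print('objects per primary OSD:', dict(primaries))
cluster.shutdown()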


On Mon, Jun 18, 2018 at 1:06 AM Jialin Liu <jaln...@lbl.gov> wrote:

> Thank you Dan. I’ll try it.
>
> Best,
> Jialin
> NERSC/LBNL
>
> > On Jun 18, 2018, at 12:22 AM, Dan van der Ster <d...@vanderster.com> wrote:
> >
> > Hi,
> >
> > One way you can see exactly what is happening when you write an object
> > is with --debug_ms=1.
> >
> > For example, I write a 100MB object to a test pool:  rados
> > --debug_ms=1 -p test put 100M.dat 100M.dat
> > I pasted the output of this here: https://pastebin.com/Zg8rjaTV
> > In this case, it first gets the cluster maps from a mon, then writes
> > the object to osd.58, which is the primary osd for PG 119.77:
> >
> > # ceph pg 119.77 query | jq .up
> > [
> >  58,
> >  49,
> >  31
> > ]
> >
> > Otherwise I answered your questions below...
> >
> >> On Sun, Jun 17, 2018 at 8:29 PM Jialin Liu <jaln...@lbl.gov> wrote:
> >>
> >> Hello,
> >>
> >> I have a couple questions regarding the IO on OSD via librados.
> >>
> >>
> >> 1. How can I check which OSD is receiving data?
> >>
> >
> > See `ceph osd map`.
> > For my example above:
> >
> > # ceph osd map test 100M.dat
> > osdmap e236396 pool 'test' (119) object '100M.dat' -> pg 119.864b0b77
> > (119.77) -> up ([58,49,31], p58) acting ([58,49,31], p58)
> >
> >> 2. Can the write operation return immediately to the application once the write to the primary OSD is done, or does it return only when the data is replicated twice? (size=3)
> >
> > Write returns once it is safe on *all* replicas or EC chunks.
> >
> >> 3. What is the I/O size at the lower level in librados? E.g., if I send a 100MB request with 1 thread, does librados send the data in a fixed transaction size?
> >
> > This depends on the client. The `rados` CLI example I showed you broke
> > the 100MB object into 4MB parts.
> > Most use-cases keep the objects around 4MB or 8MB.
> >
> >> 4. I have 4 OSS and 48 OSDs; will the 4 OSS become the bottleneck? From the Ceph documentation, once the cluster map is received by the client, the client can talk to the OSDs directly, so the assumption is that the max parallelism depends on the number of OSDs. Is this correct?
> >>
> >
> > That's more or less correct -- the IOPS and BW capacity of the cluster
> > generally scales linearly with the number of OSDs.
> >
> > Cheers,
> > Dan
> > CERN
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
