Christian,

Thanks again for providing this level of insight while helping us work
through our issues.

I am going to move ahead with the new hardware purchase...just to make sure
we eliminate hardware (or under-powered hardware) as the bottleneck.

At this point the directory listings seem fast enough...however we are seeing
periodic hang-ups on our samba mount (but I am having trouble determining
whether we have an rbd issue or a samba issue).

We currently have the rbd block device shared out with samba...and then we are
mounting it on the php app server via smbmount.

I may try nfs instead of samba to see if that provides better stability moving 
forward.
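In case it helps, a rough sketch of what the NFS equivalent might look like
(the path, hostname, and subnet below are placeholders, not our actual
values):

```
# /etc/exports on the rbd head node -- placeholder path and subnet:
/mnt/archive  10.0.0.0/24(rw,sync,no_subtree_check)

# apply and verify the export:
#   exportfs -ra
#   showmount -e localhost

# on the php app server, in place of smbmount:
#   mount -t nfs headnode:/mnt/archive /mnt/archive
```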

Thanks,

Shain

Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
smi...@npr.org | 202.513.3649

________________________________________
From: Christian Balzer [ch...@gol.com]
Sent: Monday, January 12, 2015 7:57 PM
To: ceph-us...@ceph.com
Cc: Shain Miley
Subject: Re: [ceph-users] rbd directory listing performance issues

On Mon, 12 Jan 2015 13:49:28 +0000 Shain Miley wrote:

> Hi,
> I am just wondering if anyone has any thoughts on the questions
> below...I would like to order some additional hardware ASAP...and the
> order that I place may change depending on the feedback that I receive.
>
> Thanks again,
>
> Shain
>
> Sent from my iPhone
>
> > On Jan 9, 2015, at 2:45 PM, Shain Miley <smi...@npr.org> wrote:
> >
> > Although it seems like having a regularly scheduled cron job to do a
> > recursive directory listing may be OK for us as a bit of a
> > workaround...I am still in the process of trying to improve performance.
> >
Once you have enough RAM in the host where you mount this image, and that
directory is accessed frequently enough, the cron job should not be needed.
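For reference, the cron workaround could be as simple as the sketch below;
/mnt/archive and the script path are placeholders for the real rbd mount
point, not something from this thread.

```shell
#!/bin/sh
# warm-dentries.sh -- walk the directory tree so dentry/inode metadata
# gets pulled into the client's page cache. /mnt/archive is a placeholder.
MOUNTPOINT=${1:-/mnt/archive}
if [ -d "$MOUNTPOINT" ]; then
    # -type d keeps this to directory metadata; the listing itself
    # is discarded, we only want the cache side effect.
    find "$MOUNTPOINT" -type d -print > /dev/null
fi
```

Run it daily from cron, e.g. a line like
"0 5 * * * root /usr/local/sbin/warm-dentries.sh" in /etc/crontab.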

> > A few other questions have come up as a result.
> >
> > a)I am in the process of looking at specs for a new rbd 'headnode'
> > that will be used to mount our 100TB rbd image.  At some point in the
> > future we may look into the performance, and multi client access that
> > cephfs could offer...is there any reason that I would not be able to
> > use this new server as both an rbd client and an mds server (assuming
> > the hardware is good enough)?  I know that some cluster functions
> > should not and cannot be mixed on the same server...is this by any
> > chance one of them?
> >
AFAIK that combination should be safe; however, never having used CephFS I
can't really tell you what that node will need most for this function.
Clearly one can't go wrong with as much RAM, followed by CPU, as your
budget allows.

> > b)Currently the 100TB rbd image is acting as one large repository for
> > our archive....this will only grow over time.   I understand that ceph
> > is pool based...however I am wondering if I would somehow see any
> > better per rbd image performance...if for example...instead of having
> > 1 x 100TB rbd image...I had 4 x 25TB rbd images (since we really could
> > split these up based on our internal groups).
> >
Not really, no. However splitting things up might have other advantages in
terms of access control and manageability.

> > c)Would adding a few ssd drives (in the right quantity) to each node
> > help out with reads as well as writes?
> >
Not directly, as Ceph journals are only used for writes.
That said, halving the number of IOPS your HDDs need to perform for writes
(with those writes hopefully being coalesced by the journal) will of course
benefit reads as well (less spindle contention).

Using SSDs for things like dm-cache or bcache may work, but from what I
gathered in this ML the improvements are not that great.
The same goes for SSD based cache tiers in Ceph itself.

As I said before, lots of RAM in the storage nodes will improve read
speeds, especially if that data has been accessed before.
In addition, this RAM (page cache) of course also helps reduce disk
accesses if the data is already in it, thus benefiting both reads and
writes.

> > d)I am a bit confused about how to enable the rbd cache option on the
> > client...is this change something that only needs to be made to the
> > ceph.conf file on the rbd kernel client server...or do the mds and osd
> > servers need the ceph.conf file modified as well and their services
> > restarted?
> >
The kernel client always uses the page cache of the host, the rbd cache is
only for user space (librbd) clients like VMs.
And for your directory speedups the memory of the client is the key.
Even if you had a VM accessing that volume, RBD cache would be the wrong
place to improve things.
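For completeness, if you ever do access the image through librbd (e.g. from
a VM), the cache is enabled in the [client] section of ceph.conf on the
client side only; no OSD or MDS restart is involved. The values below are
illustrative, not recommendations:

```
[client]
    rbd cache = true
    rbd cache size = 67108864               # 64 MB, adjust to taste
    rbd cache max dirty = 50331648
    rbd cache writethrough until flush = true
```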

This is the slabtop output for a VM I have here containing a massive mail
archive with some directories having 200k files in them:
---
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
11312000 11312000 100%    0.85K 2828000        4  11312000K ext4_inode_cache
6863380 6863371  99%    0.19K 343169       20   1372676K dentry
---
Yes, that's 11 million cached inodes, taking up nearly 12GB of RAM.
However, that means that once the cache is populated, things are as fast
as they can be ("ls" is CPU bound at that point).

Regards,

Christian

> > Other options that I might be looking into going forward are moving
> > some of this data (the data actually needed by our php apps) to
> > rgw...although that option adds some more complexity and unfamiliarity
> > for our users.
> >
> > Thanks again for all the help so far.
> >
> > Shain
> >
> >> On 01/07/2015 03:40 PM, Shain Miley wrote:
> >> Just to follow up on this thread, the main reason that the rbd
> >> directory listing latency was an issue for us,  was that we were
> >> seeing a large amount of IO delay in a PHP app that reads from that
> >> rbd image.
> >>
> >> It occurred to me (based on Robert's cache_dir suggestion below) that
> >> maybe doing a recursive find or a recursive directory listing inside
> >> the one folder in question might speed things up.
> >>
> >> After doing the recursive find...the directory listing seems much
> >> faster and the responsiveness of the PHP app has increased as well.
> >>
> >> Hopefully nothing else will need to be done here, however it seems
> >> that worst case...a daily or weekly cronjob that traverses the
> >> directory tree in that folder might be all we need.
> >>
> >> Thanks again for all the help.
> >>
> >> Shain
> >>
> >>
> >>
> >> Shain Miley | Manager of Systems and Infrastructure, Digital Media |
> >> smi...@npr.org | 202.513.3649
> >>
> >> ________________________________________
> >> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> >> Shain Miley [smi...@npr.org] Sent: Tuesday, January 06, 2015 8:16 PM
> >> To: Christian Balzer; ceph-us...@ceph.com
> >> Subject: Re: [ceph-users] rbd directory listing performance issues
> >>
> >> Christian,
> >>
> >> Each of the OSD server nodes is running on Dell R-720xd's with 64
> >> GB of RAM.
> >>
> >> We have 107 OSDs so I have not checked all of them...however the ones
> >> I have checked with xfs_db have shown anywhere from 1% to 4%
> >> fragmentation.
> >>
> >> I'll try to upgrade the client server to 32 or 64 GB of ram at some
> >> point soon...however at this point all the tuning that I have done
> >> has not yielded all that much in terms of results.
> >>
> >> It may simply be that I need to look into adding some SSDs, and that
> >> the overall bottleneck here is the 4TB 7200 rpm disks we are
> >> using.
> >>
> >> In general, when looking at the graphs in Calamari, we see around
> >> 20ms latency (await) for our OSD's however there are lots of times
> >> where we see (via the graphs) spikes of 250ms to 400ms as well.
> >>
> >> Thanks again,
> >>
> >> Shain
> >>
> >>
> >> Shain Miley | Manager of Systems and Infrastructure, Digital Media |
> >> smi...@npr.org | 202.513.3649
> >>
> >> ________________________________________
> >> From: Christian Balzer [ch...@gol.com]
> >> Sent: Tuesday, January 06, 2015 7:34 PM
> >> To: ceph-us...@ceph.com
> >> Cc: Shain Miley
> >> Subject: Re: [ceph-users] rbd directory listing performance issues
> >>
> >> Hello,
> >>
> >>> On Tue, 6 Jan 2015 15:29:50 +0000 Shain Miley wrote:
> >>>
> >>> Hello,
> >>>
> >>> We currently have a 12 node (3 monitor+9 OSD) ceph cluster, made up
> >>> of 107 x 4TB drives formatted with xfs. The cluster is running ceph
> >>> version 0.80.7:
> >> I assume journals on the same HDD then.
> >>
> >> How much memory per node?
> >>
> >> [snip]
> >>> A while back I created an 80 TB rbd image to be used as an archive
> >>> repository for some of our audio and video files. We are still seeing
> >>> good rados and rbd read and write throughput performance, however we
> >>> seem to be having quite a long delay in response times when we try to
> >>> list out the files in directories with a large number of folders,
> >>> files, etc.
> >>>
> >>> Subsequent directory listing times seem to run a lot faster (but I am
> >>> not sure how long that remains the case before we see another instance
> >>> of slowness); however, the initial directory listings can take 20 to 45
> >>> seconds.
> >> Basically the same thing(s) that Robert said.
> >> How big is "large"?
> >> How much memory on the machine you're mounting this image?
> >> Ah, never mind, just saw your follow-up.
> >>
> >> Definitely add memory to this machine if you can.
> >>
> >> The initial listing is always going to be slow-ish, depending
> >> on a number of things in the cluster.
> >>
> >> As in, how busy is it (IOPS)? With journals on disk your HDDs are
> >> going to be sluggish individually and your directory information
> >> might reside mostly in one object (on one OSD), thus limiting you to
> >> the speed of that particular disk.
> >>
> >> And this is also where the memory of your storage nodes comes in, if
> >> it is large enough your "hot" objects will get cached there as well.
> >> To see if that's the case (at least temporarily), drop the caches on
> >> all of your storage nodes (echo 3 > /proc/sys/vm/drop_caches), mount
> >> your image, do the "ls -l" until it's "fast", umount it, mount it
> >> again and do the listing again.
> >> In theory, unless your cluster is extremely busy or your storage nodes
> >> have very little pagecache, the re-mounted image should get all the
> >> info it needs from said pagecache on your storage nodes, never having
> >> to go to the actual OSD disks and thus be fast(er) than the initial
> >> test.
> >>
> >> Finally to potentially improve the initial scan that has to come from
> >> the disks obviously, see how fragmented your OSDs are and depending
> >> on the results defrag them.
> >>
> >> Christian
> >> --
> >> Christian Balzer        Network/Systems Engineer
> >> ch...@gol.com           Global OnLine Japan/Fusion Communications
> >> http://www.gol.com/
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>


--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
