Re: CephFS use cases + MDS limitations

2013-11-05 Thread Malcolm Haak

Michael,

I haven't seen any on-list replies yet, so I wasn't sure if this was the 
right place. But I'll just reply and somebody will let me know if I am 
wrong.


The use cases I have encountered in my clustered-computing universe were 
implemented with a different, proprietary clustered file system. These 
file systems were being used for home directories or "shared scratch" 
space. The specific issues occur when you have users who 'misbehave', or 
who have code that by design creates (and destroys) large numbers of 
files, and in the process bogs down file-system access for everybody. I 
have not yet deployed Ceph in production in this role, but basic 
testing shows it will run into the same issues.
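
For what it's worth, a crude way to reproduce that kind of metadata 
storm (purely a sketch; the mount point, directory and file count are 
made up, not our real workload) is just a tight create/unlink loop run 
from a few clients at once:

  # crude metadata-storm reproducer; path and count are assumptions
  for i in $(seq 1 100000); do
      f=/mnt/cephfs/scratch/junk.$$.$i
      touch "$f" && rm -f "$f"   # one create plus one unlink per iteration, every one hitting the MDS
  done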


While ideally you would not do such things to a clustered file system, it 
would be nice to be able to dedicate an MDS to specific subfolders 
without having to create a whole separate sub-file-system/mount point 
(which is the current procedure with other solutions).


It would be really AWESOME to do this 'on the fly'. Having more than one 
MDS look after the whole file system in an ACTIVE/ACTIVE fashion would 
be nice/ideal (as long as latency is not too negatively impacted), but 
really just being able to 'shard' the file system up would be more than 
sufficient to solve most of the issues I usually encounter. Having this 
kind of functionality would be a 'killer feature' for this kind of workload.
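
Purely as a sketch of the interface I have in mind (the xattr name, the 
rank and the paths below are illustrative assumptions, not something 
that exists today), the kind of 'sharding' I mean would look like:

  # pin a heavy user's home directory to a dedicated MDS rank (illustrative only)
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/home/heavy-user
  # check that it took
  getfattr -n ceph.dir.pin /mnt/cephfs/home/heavy-user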


I hope my wall of text makes sense. Please feel free to ping me with 
questions.


Regards

Malcolm Haak




On 04/11/13 09:53, Michael Sevilla wrote:

Hi Ceph community,

I’d like to get a feel for some of the problems that CephFS users are
encountering with single MDS deployments. There were requests for
stable distributed metadata/MDS services [1], and I'm guessing it's
because your workloads exhibit many, many metadata operations. Some of
you mentioned opening many files in a directory for checkpointing,
recursive stats on a directory, etc. [2] and I’d like more details,
such as:
- workloads/applications that stress the MDS service that would cause
you to call for multi-MDS support
- use cases for the Ceph file system (I’m not really too interested in
users using CephFS to host VMs, since many of these use cases are
migrating to RBD)

I’m just trying to get an idea of what’s out there and the problems
CephFS users encounter as a result of a bottlenecked MDS (single node
or cluster).

Thanks!

Michael

[1] CephFS MDS Status Discussion,
http://ceph.com/dev-notes/cephfs-mds-status-discussion/
[2] CephFS First Product Release Discussion,
http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13524




Re: writing a ceph client for MS Windows

2013-11-07 Thread Malcolm Haak

I'm just going to throw these in there:

http://www.acc.umu.se/~bosse/

They are GPLv2, and some already use sockets and such from inside the 
kernel. Heck, you might even be able to mod the HTTP one to use the 
RADOS gateway. I don't know, as I haven't sat down and pulled them 
apart enough yet.

They might help, but they might be useless. Not sure.

On 08/11/13 06:47, Alphe Salas Michels wrote:

Hello all, I finally finished my first source-code extraction, starting
from ceph/src/client/fuse_ll.cc.
The result is accurate, unlike the previously provided results. Basically,
the script starts from a file, extracts all of its private include
directives (#include "something.h"), and then recursively extracts their
private includes too; it is the best way to know who is related to whom.

Starting from fuse_ll.cc I obtained 390 files and 120,000 lines of code!
The directories involved (under ceph/src) are:
objclass/, common/, msg/, osdc/, include/, client/, mds/,
global/, json_spirit/, log/, os/, crush/, mon/, osd/, auth/

This is probably not a good way to estimate the amount of work involved,
since most of those directories implement the servers (osd, mon, mds),
and only a tiny bit of them is needed at the client level: you need two
structures from ./osd/OSD.h, and my script, by relation, will pull in
the whole directory...

I ran the script with libcephfs.cc as the starting point and got almost
the same result: 131,000 lines of code and 386 files, with mostly the
same directories involved.
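
For reference, a minimal sketch of the kind of include-crawling script I
mean (the source root and starting file are assumptions, and this is not
the exact script I ran):

  #!/bin/bash
  # Recursively collect the private #include "..." dependencies of one file.
  # SRC and START are assumptions; adjust for your checkout.
  SRC=ceph/src
  START=client/fuse_ll.cc
  seen=/tmp/ceph-includes.txt
  : > "$seen"
  queue=("$START")
  while [ ${#queue[@]} -gt 0 ]; do
      f=${queue[0]}; queue=("${queue[@]:1}")
      grep -qxF "$f" "$seen" && continue   # already visited
      echo "$f" >> "$seen"
      # keep only the quoted (private) includes and enqueue the ones that exist
      for inc in $(sed -n 's/^#include "\([^"]*\)".*/\1/p' "$SRC/$f" 2>/dev/null); do
          [ -e "$SRC/$inc" ] && queue+=("$inc")
      done
  done
  echo "$(wc -l < "$seen") files reached"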



I think I will spend a lot of time doing the manual source-code isolation
and understanding why each #include is present in the files I read (what
purpose it serves, and whether it pulls in a crucial data type or not).


The other way around would be to read src/libcephfs.cc. It seems shorter,
but without understanding which part of each included header is actually
used I can't say anything...



I will keep reading the source code and taking notes. I think in the case
of libcephfs I will save a lot of time.


Alphé Salas
IT Engineer

asa...@kepler.cl
www.kepler.cl

On 11/07/13 15:02, Alphe Salas Michels wrote:

Hello D. Ketor and Matt Benjamin,
You have given me a lot to think about, and this is great!
I have merged your previous posts so I can make a single reply that
anyone can refer to easily.

Windows NFS 4.1 is available here:
http://www.citi.umich.edu/projects/nfsv4/windows/readme.html

pNFS is another name for NFS 4.x. It is presented as an alternative to
Ceph, and we find familiar terminology such as MDS and OSD, but without
the self-healing part, if I understood my quick look at the topic
correctly. (When I say quick look, I mean about five minutes spent on
it, which is really too little time to get an accurate view of anything.)


Starting from mount.ceph... I know that mount.ceph does little, but it
is a great hint as to what Ceph needs in order to do things.
Basically, mount.ceph modprobes the ceph driver into the Linux kernel,
then calls mount with the arguments passed on the command line and the
cephfs type as an argument. The kernel then does the work. I don't yet
understand what initial calls are made into the ceph driver, but it
seemed to me that it is relatively light (a first impression, compared
to ceph-fuse).
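
Roughly speaking, mount.ceph boils down to something like the following
(a sketch only; the monitor address, mount point and secret file are
made up):

  # approximately what mount.ceph ends up doing (addresses and paths are made up)
  modprobe ceph
  mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret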

I think I will do both: isolate the source code from ceph-client (the
cephfs module for the Linux kernel) and the code Sage pointed to,
starting from client/fuse_ll.cc in the Ceph master branch. The files
common to those two extractions will be our core set of mandatory
features.
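
(For the "common files" step, assuming each extraction dumps its file
list into a text file, something like this would do; the list names are
made up:)

  # files present in both extractions become the core set; list names are assumptions
  comm -12 <(sort kernel-extract.txt) <(sort fuse-extract.txt) > core-files.txt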

Then we will try to compile a cephfs client library with Cygwin. Then we
will try to interface it with a modified Windows NFS 4.1 client, or pNFS,
or any other client that will accept being compiled with gcc for Win32...

The fact that Windows 8.1 and Windows 2012 are out of reach at the
moment is not a problem for me.

Our first concern is to understand the Ceph protocol, and then adapt it
to something that can be used on Windows versions prior to Windows 8.1.
Dokan FS, if I remember correctly, also uses the WDK (Windows Driver
Kit) for its compilation, so possibly we will run into the same
limitations.

We need to multiply our sources of information and examples regarding
the Ceph client (kernel or FUSE; radosgw is on a different layer, so I
will not try anything around it at first), and we need to multiply our
sources of information and examples regarding virtual file system
technologies on the Windows OS.
That is a lot of work, but all of the available source code everyone has
pointed me at will lead us to the best solution. In the end we will
choose technologies knowing what we are doing and what consequences
they have.

Regards,

Alphé Salas
IT Engineer

asa...@kepler.cl


On 11/07/13 11:29, Ketor D wrote:

Hi Alphe:
   Yes, Callback File System is very expensive and can't be open
sourced. It's not a good choice for ceph4win.
   Another way for ceph4win may be to develop a kernel-mode fs like
pNFS. pNFS has a kernel-mode Windows client. I think you can read its
src code, and maybe migrating from the Ceph kernel client to a Windows
kernel fs is easier than f

Re: HSM

2013-11-10 Thread Malcolm Haak

Hi All,

If you are talking specifically about Lustre HSM, it's really an 
interface that adds HSM functionality by leveraging existing HSMs (DMF, 
for example).


So with Lustre HSM you have a policy engine that triggers the migrations 
out of the filesystem. Rules are based on size, last-accessed time and 
target state (online, dual or offline).
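
(As a purely illustrative sketch of what such a policy boils down to; 
this is not any real policy-engine syntax, and the path, thresholds and 
copytool name are made up:)

  # illustrative only: archive files over 1G that have not been accessed in 30 days
  # 'copytool-archive' is a stand-in name, not a real command
  find /lustre/scratch -type f -size +1G -atime +30 -print0 \
      | xargs -0 -n1 copytool-archive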


There is a 'coordinator' process involved here as well; it runs (from 
what I understand) on the MDS nodes and handles the interaction with the 
copytool. The copytool is provided by the HSM solution you are actually 
using.


For recalls, when caps are acquired on the MDS for an exported file, the 
responsible MDS contacts the coordinator, which in turn uses the 
copytool to pull the required file out of the HSM.


In the Lustre HSM, the objects that make up a file are all recalled 
together, and the file, not the objects, is what is handed to the HSM.


For Lustre, all it needs to keep track of is the current state of the 
file and the correct ID to request from the HSM. This is done inside the 
normal metadata storage.


So there aren't really any hooks as such: exports are triggered by the 
policy engine after a scan of the metadata, and recalls are triggered 
when caps are requested on offline files. Then it's just standard POSIX 
blocking until the file is available.


Most of the state and ID information could be stored as xattrs in 
CephFS. I'm not as sure how to do it for other things, but as long as 
you could store some kind of extended metadata against whole objects, 
they could use the same interfaces as well.
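
(For example, something as simple as this would cover the state and 
HSM-ID bookkeeping; the user.hsm.* attribute names and the path are 
made up for illustration:)

  # record HSM state and identifier against a CephFS file (attribute names are made up)
  setfattr -n user.hsm.state -v offline  /cephfs/data/bigfile.dat
  setfattr -n user.hsm.id    -v 0x1a2b3c /cephfs/data/bigfile.dat
  # ...and read them back when a recall is triggered
  getfattr -n user.hsm.id /cephfs/data/bigfile.dat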


Hope that was actually helpful and not just an obvious rehash...

Regards

Malcolm Haak

On 09/11/13 18:33, Sage Weil wrote:

The latest Lustre just added HSM support:


http://archive.hpcwire.com/hpcwire/2013-11-06/lustre_scores_business_class_upgrade_with_hsm.html

Here is a slide deck with some high-level detail:


https://jira.hpdd.intel.com/secure/attachment/13185/Lustre_HSM_Design.pdf

Is anyone familiar with the interfaces and requirements of the file system
itself?  I don't know much about how these systems are implemented, but I
would guess there are relatively lightweight requirements on the fs (ceph
mds in our case) to keep track of file state (online or archived
elsewhere).  And some hooks to trigger migrations?

If anyone is interested in this area, I would be happy to help figure out
how to integrate things cleanly!

sage




Re: HSM

2013-11-11 Thread Malcolm Haak

Hi Gregory,


On 12/11/13 10:13, Gregory Farnum wrote:

On Mon, Nov 11, 2013 at 3:04 AM, John Spray  wrote:

This is a really useful summary from Malcolm.

In addition to the coordinator/copytool interface, there is the question of
where the policy engine gets its data from.  Lustre has the MDS changelog,
which Robinhood uses to replicate metadata into its MySQL database with all
the indices that it wants.



On Sun, Nov 10, 2013 at 11:17 PM, Malcolm Haak  wrote:

So there aren't really any hooks as such: exports are triggered by the policy 
engine after a scan of the metadata, and recalls are triggered when caps 
are requested on offline files


Wait, is the HSM using a changelog or is it just scanning the full
filesystem tree? Scanning the whole tree seems awfully expensive.


While I can't speak at length about Lustre HSM (it may well just use 
incremental updates to its SQL database via the metadata logs), I do 
know that filesystem scans are done regularly in other HSM solutions. I 
also know that the scan is multi-threaded and, when backed by decent 
disks, does not take an excessive amount of time.





I don't know if CephFS MDS currently has a similar interface.

Well, the MDSes each have their journal of course, but more than that
we can stick whatever we want into the metadata and expose it via
virtual xattrs or whatever else.
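
(The recursive directory statistics are the obvious existing example of
that, if I have the vxattr names right; the mount point below is made
up:)

  # existing CephFS virtual xattrs, e.g. recursive stats on a directory (path is made up)
  getfattr -n ceph.dir.rbytes   /mnt/cephfs/projects
  getfattr -n ceph.dir.rentries /mnt/cephfs/projects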
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com



John


On Sun, Nov 10, 2013 at 11:17 PM, Malcolm Haak  wrote:


Hi All,

If you are talking specifically about Lustre HSM, it's really an interface that 
adds HSM functionality by leveraging existing HSMs (DMF, for example).

So with Lustre HSM you have a policy engine that triggers the migrations out of 
the filesystem. Rules are based around size, last accessed and target state 
(online, dual and offline).

There is a 'coordinator' process involved here as well; it runs (from what I 
understand) on the MDS nodes and handles the interaction with the copytool. 
The copytool is provided by the HSM solution you are actually using.

For recalls, when caps are acquired on the MDS for an exported file, the 
responsible MDS contacts the coordinator, which in turn uses the copytool to 
pull the required file out of the HSM.

In the Lustre HSM, the objects that make up a file are all recalled and the 
file, not the objects, are handed to the HSM.

For Lustre, all it needs to keep track of is the current state of the file and 
the correct ID to request from the HSM. This is done inside the normal metadata 
storage.

So there aren't really any hooks as such: exports are triggered by the policy 
engine after a scan of the metadata, and recalls are triggered when caps 
are requested on offline files. Then it's just standard POSIX blocking until the 
file is available.

Most of the state and ID stuff could be stored as XATTRS in cephfs. I'm not as 
sure how to do it for other things but as long as you could store some kind of 
extended metadata about whole objects, it could use the same interfaces as well.

Hope that was actually helpful and not just an obvious rehash...

Regards

Malcolm Haak



Re: HSM

2013-11-20 Thread Malcolm Haak

It is, except it might not be.

DMAPI only works if you are the one in charge of both the HSM and the 
filesystem.

So, for example, in a DMF solution the filesystem mounted with DMAPI 
options is on your NFS head node, and your HSM solution is also 
installed there.


Things get a bit more odd when you look at DMAPI plus clustered systems: 
you would need HSM agents on every client node, if we are talking about 
CephFS, that is.


This is also true with the Lustre solution. The Lustre clients have no 
idea this stuff is happening. This is how it should work. It means the 
current requirement for installed software on the bulk of your clients 
is a working kernel or fuse module.


On 19/11/13 05:22, Dmitry Borodaenko wrote:

On Tue, Nov 12, 2013 at 1:47 AM, Andreas Joachim Peters
 wrote:

I think you need to support the following functionality to support HSM 
(file-based, not block-based):

1. Implement a trigger on file creation/modification/deletion.

2. Store the additional HSM identifier for recall as a file attribute.

3. Policy-based purging of file-related blocks (LRU cache etc.).

4. Implement an optional trigger to recall a purged file and block the IO (our 
experience is that automatic recalls are problematic for huge installations if 
the aggregation window for desired recalls is short, since they create 
inefficient and chaotic access on tapes).

5. Either snapshot a file before migration, take an exclusive lock, or freeze 
it to avoid modifications during migration (you need a sufficiently unique 
identifier for a file; either inode/path + checksum or inode/path + 
modification time works).


DMAPI seems to be the natural choice for items 1 & 4 above.


FYI: there was a paper about migration policy scanning performance by IBM two 
years ago:
http://domino.watson.ibm.com/library/CyberDig.nsf/papers/4A50C2D66A1F90F7852578E3005A2034/$File/rj10484.pdf


An important omission in that paper is the exact ILM policy that was
used to scan the file system. I strongly suspect that it was a
catch-all policy that matches every file without examining any
metadata. When you add conditions that check file metadata, scan time
would increase, probably by a few orders of magnitude.




Re: [GIT PULL] Ceph updates and fixes for 3.13

2013-12-04 Thread Malcolm Haak

Hi Dave,

This is a definite bug/regression.

I've bumped into it as well.

It's still in 3.13-rc2.

I've lodged a bug report on it.

Regards

Malcolm Haak

On 24/11/13 19:59, Dave (Bob) wrote:

I have just tried ceph 0.72.1 and kernel 3.13.0-rc1.

There seems to be a problem with ceph file system access from this kernel.

I mount a Ceph filesystem running on another machine; that seems to go OK.

I create a directory on that mount, that seems to go OK.

I can 'ls -l' that mount and all looks good.

I cannot remove the directory, nor can I write anything to the
directory. I get a message to the effect 'is not a directory'.

I step back to kernel 3.12.1 and all is well.

Does the message below explain this? Known issue to be fixed in rc2?

Thank you,
David

On 23/11/2013 19:16, Sage Weil wrote:

Hi Linus,

I just returned from two weeks off the grid to discover I'd miscalculated
and just missed the merge window.  If you're feeling inclined, there are a
few non-fixes mixed into this request (improved readv/writev, nicer
behavior for unlinked files) that can be pulled from here:

   git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

If not, I have a fixes only branch here:

   git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git 
for-linus-bugs-only

These include a couple fixes to the new fscache code that went in during
the last cycle (which will need to go stable@ shortly as well), a couple
client-side directory fragmentation fixes, a fix for a race in the cap
release queuing path, and a couple race fixes in the request abort
and resend code.

Obviously some of this could have gone into 3.12 final, but I preferred to
overtest rather than send things in for a late -rc, and then my travel
schedule intervened--my apologies there.

Thanks!
sage


everything (for-linus):


Li Wang (1):
   ceph: allocate non-zero page to fscache in readpage()

Milosz Tanski (1):
   ceph: hung on ceph fscache invalidate in some cases

Yan, Zheng (8):
   ceph: remove outdated frag information
   ceph: handle frag mismatch between readdir request and reply
   ceph: drop unconnected inodes
   ceph: queue cap release in __ceph_remove_cap()
   ceph: set caps count after composing cap reconnect message
   ceph: handle race between cap reconnect and cap release
   ceph: cleanup aborted requests when re-sending requests.
   ceph: wake up 'safe' waiters when unregistering request

majianpeng (2):
   ceph: Implement writev/pwritev for sync operation.
   ceph: implement readv/preadv for sync operation

  fs/ceph/addr.c   |2 +-
  fs/ceph/cache.c  |3 +
  fs/ceph/caps.c   |   27 ++--
  fs/ceph/dir.c|   11 +-
  fs/ceph/file.c   |  435 +++---
  fs/ceph/inode.c  |   59 ++-
  fs/ceph/mds_client.c |   61 +--
  fs/ceph/mds_client.h |1 +
  fs/ceph/super.c  |1 +
  fs/ceph/super.h  |9 +-
  10 files changed, 442 insertions(+), 167 deletions(-)


or the bug fixes only (for-linus-bugs):


Li Wang (1):
   ceph: allocate non-zero page to fscache in readpage()

Milosz Tanski (1):
   ceph: hung on ceph fscache invalidate in some cases

Yan, Zheng (7):
   ceph: remove outdated frag information
   ceph: handle frag mismatch between readdir request and reply
   ceph: queue cap release in __ceph_remove_cap()
   ceph: set caps count after composing cap reconnect message
   ceph: handle race between cap reconnect and cap release
   ceph: cleanup aborted requests when re-sending requests.
   ceph: wake up 'safe' waiters when unregistering request

  fs/ceph/addr.c   |2 +-
  fs/ceph/cache.c  |3 +++
  fs/ceph/caps.c   |   27 +-
  fs/ceph/dir.c|   11 -
  fs/ceph/inode.c  |   49 +++-
  fs/ceph/mds_client.c |   61 +-
  fs/ceph/mds_client.h |1 +
  fs/ceph/super.h  |8 +--
  8 files changed, 121 insertions(+), 41 deletions(-)









RBD Read performance

2013-04-17 Thread Malcolm Haak

Hi all,

I jumped into the IRC channel yesterday and they said to email 
ceph-devel. I have been having some read performance issues, with reads 
being slower than writes by a factor of ~5-8.


First info:
Server:
SLES 11 SP2
Ceph 0.56.4
12 OSDs that are each a hardware RAID 5; each of the twelve is made from 
5 NL-SAS disks, for a total of 60 disks (each LUN can do around 320 MB/s 
streaming write and the same, if not better, read). Connected via 2x QDR IB

OSDs/MDS and such all on the same box (for testing)
Box is a quad AMD Opteron 6234
RAM is 256 GB
10 GB journals
osd_op_threads: 8
osd_disk_threads: 2
filestore_op_threads: 4
OSDs are all XFS

All nodes are connected via QDR IB using IPoIB. We get 1.7 GB/s on TCP 
performance tests between the nodes.


Clients: one is FC17, the other is Ubuntu 12.10; they only have around 
32-70 GB of RAM.


We ran into an odd issue where the OSDs would all start in the same NUMA 
node, and pretty much on the same processor core. We fixed that up with 
some cpuset magic.
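
(Roughly the sort of thing we did, sketched with numactl rather than the 
raw cpuset commands; the node and OSD ids are made up:)

  # spread the OSD daemons across NUMA nodes instead of letting them pile up on one core
  # (node and OSD ids are illustrative)
  numactl --cpunodebind=0 --membind=0 ceph-osd -i 0 -c /etc/ceph/ceph.conf
  numactl --cpunodebind=1 --membind=1 ceph-osd -i 1 -c /etc/ceph/ceph.conf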


Performance testing we have done: (Note oflag=direct was yielding 
results within 5% of cached results)



root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=3200
3200+0 records in
3200+0 records out
33554432000 bytes (34 GB) copied, 47.6685 s, 704 MB/s
root@ty3:~#
root@ty3:~# rm /test-rbd-fs/DELETEME
root@ty3:~#
root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=4800
4800+0 records in
4800+0 records out
50331648000 bytes (50 GB) copied, 69.5527 s, 724 MB/s

[root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M 
count=2400

2400+0 records in
2400+0 records out
25165824000 bytes (25 GB) copied, 26.3593 s, 955 MB/s
[root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME
[root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M 
count=9600

9600+0 records in
9600+0 records out
100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s

Both clients each doing a 140GB write (2x dogbreath's RAM) at the same 
time to two different rbds in the same pool.


root@ty3:~# rm /test-rbd-fs/DELETEME
root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=14000
14000+0 records in
14000+0 records out
146800640000 bytes (147 GB) copied, 412.404 s, 356 MB/s
root@ty3:~#

[root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME
[root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M 
count=14000

14000+0 records in
14000+0 records out
146800640000 bytes (147 GB) copied, 433.351 s, 339 MB/s
[root@dogbreath ~]#

Onto reads...
Also we found that doing iflag=direct increased read performance.

[root@dogbreath ~]# dd of=/dev/null if=/test-rbd-fs/DELETEME bs=10M 
count=160

160+0 records in
160+0 records out
1677721600 bytes (1.7 GB) copied, 29.4242 s, 57.0 MB/s
[root@dogbreath ~]#
[root@dogbreath ~]# echo 1 > /proc/sys/vm/drop_caches
[root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M 
count=10000

10000+0 records in
10000+0 records out
41943040000 bytes (42 GB) copied, 382.334 s, 110 MB/s
[root@dogbreath ~]#
[root@dogbreath ~]# echo 1 > /proc/sys/vm/drop_caches
[root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M 
count=10000 iflag=direct

10000+0 records in
10000+0 records out
41943040000 bytes (42 GB) copied, 150.774 s, 278 MB/s
[root@dogbreath ~]#


So what info do you want/where do I start hunting for my wumpus?

Regards

Malcolm Haak




Re: RBD Read performance

2013-04-18 Thread Malcolm Haak

Hi Mark!

Thanks for the quick reply!

I'll reply inline below.

On 18/04/13 17:04, Mark Nelson wrote:

On 04/17/2013 11:35 PM, Malcolm Haak wrote:

Hi all,


Hi Malcolm!



I jumped into the IRC channel yesterday and they said to email
ceph-devel. I have been having some read performance issues. With Reads
being slower than writes by a factor of ~5-8.


I recently saw this kind of behaviour (writes were fine, but reads were
terrible) on an IPoIB based cluster and it was caused by the same TCP
auto tune issues that Jim Schutt saw last year. It's worth a try at
least to see if it helps.

echo "0" > /proc/sys/net/ipv4/tcp_moderate_rcvbuf

on all of the clients and server nodes should be enough to test it out.
  Sage added an option in more recent Ceph builds that lets you work
around it too.
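
(For reference, the same knob in sysctl form, which can also be made 
persistent; using /etc/sysctl.conf for that is an assumption about how 
these boxes are managed:)

  # same tuning via sysctl; append to /etc/sysctl.conf to keep it across reboots
  sysctl -w net.ipv4.tcp_moderate_rcvbuf=0
  echo "net.ipv4.tcp_moderate_rcvbuf = 0" >> /etc/sysctl.conf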


Awesome I will test this first up tomorrow.


First info:
Server
SLES 11 SP2
Ceph 0.56.4.
12 OSD's  that are Hardware Raid 5 each of the twelve is made from 5
NL-SAS disks for a total of 60 disks (Each lun can do around 320MB/s
stream write and the same if not better read) Connected via 2xQDR IB
OSD's/MDS and such all on same box (for testing)
Box is a Quad AMD Opteron 6234
Ram is 256Gb
10GB Journals
osd_op_theads: 8
osd_disk_threads:2
Filestore_op_threads:4
OSD's are all XFS


Interesting setup!  QUAD socket Opteron boxes have somewhat slow and
slightly oversubscribed hypertransport links don't they?  I wonder if on
a system with so many disks and QDR-IB if that could become a problem...

We typically like smaller nodes where we can reasonably do 1 OSD per
drive, but we've tested on a couple of 60 drive chassis in RAID configs
too.  Should be interesting to hear what kind of aggregate performance
you can eventually get.


We are also going to try this out with 6 LUNs on a dual-Xeon box. The 
Opteron box was the biggest, scariest thing we had that was doing nothing.






All nodes are connected via QDR IB using IP_O_IB. We get 1.7GB/s on TCP
performance tests between the nodes.

Clients: One is FC17 the other us Ubuntu 12.10 they only have around
32GB-70GB ram.

We ran into an odd issue were the OSD's would all start in the same NUMA
node and pretty much on the same processor core. We fixed that up with
some cpuset magic.


Strange!  Was that more due to cpuset or Ceph?  I can't imagine that we
are doing anything that would cause that.



More than likely it is an odd quirk in the SLES kernel... but when I have 
time I'll do some more poking. We were seeing insane CPU usage on some 
cores because all the OSDs were piled up in one place.




Performance testing we have done: (Note oflag=direct was yielding
results within 5% of cached results)


root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=3200
3200+0 records in
3200+0 records out
33554432000 bytes (34 GB) copied, 47.6685 s, 704 MB/s
root@ty3:~#
root@ty3:~# rm /test-rbd-fs/DELETEME
root@ty3:~#
root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=4800
4800+0 records in
4800+0 records out
50331648000 bytes (50 GB) copied, 69.5527 s, 724 MB/s

[root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M
count=2400
2400+0 records in
2400+0 records out
25165824000 bytes (25 GB) copied, 26.3593 s, 955 MB/s
[root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME
[root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M
count=9600
9600+0 records in
9600+0 records out
100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s

Both clients each doing a 140GB write (2x dogbreath's RAM) at the same
time to two different rbds in the same pool.

root@ty3:~# rm /test-rbd-fs/DELETEME
root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=14000
14000+0 records in
14000+0 records out
14680064 bytes (147 GB) copied, 412.404 s, 356 MB/s
root@ty3:~#

[root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME
[root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M
count=14000
14000+0 records in
14000+0 records out
14680064 bytes (147 GB) copied, 433.351 s, 339 MB/s
[root@dogbreath ~]#

Onto reads...
Also we found that doing iflag=direct increased read performance.

[root@dogbreath ~]# dd of=/dev/null if=/test-rbd-fs/DELETEME bs=10M
count=160
160+0 records in
160+0 records out
1677721600 bytes (1.7 GB) copied, 29.4242 s, 57.0 MB/s
[root@dogbreath ~]#
[root@dogbreath ~]# echo 1 > /proc/sys/vm/drop_caches
[root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M
count=1
1+0 records in
1+0 records out
4194304 bytes (42 GB) copied, 382.334 s, 110 MB/s
[root@dogbreath ~]#
[root@dogbreath ~]# echo 1 > /proc/sys/vm/drop_caches
[root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M
count=1 iflag=direct
1+0 records in
1+0 records out
4194304 bytes (42 GB) copied, 150.774 s, 278 MB/s
[root@dogbreath ~]#


So what info do you want/where do I start hunting for my wumpus?


might also be worth looking at the size o

Re: RBD Read performance

2013-04-18 Thread Malcolm Haak

Morning all,

Did the echoes on all the boxes involved... and the results are in:

[root@dogbreath ~]#
[root@dogbreath ~]# dd if=/todd-rbd-fs/DELETEME of=/dev/null bs=4M 
count=10000 iflag=direct

10000+0 records in
10000+0 records out
41943040000 bytes (42 GB) copied, 144.083 s, 291 MB/s
[root@dogbreath ~]# dd if=/todd-rbd-fs/DELETEME of=/dev/null bs=4M 
count=10000

10000+0 records in
10000+0 records out
41943040000 bytes (42 GB) copied, 316.025 s, 133 MB/s
[root@dogbreath ~]#

No change, which is a shame. What other information should I provide, or 
what testing should I start?


Regards

Malcolm Haak

On 18/04/13 17:22, Malcolm Haak wrote:

Hi Mark!

Thanks for the quick reply!

I'll reply inline below.

On 18/04/13 17:04, Mark Nelson wrote:

On 04/17/2013 11:35 PM, Malcolm Haak wrote:

Hi all,


Hi Malcolm!



I jumped into the IRC channel yesterday and they said to email
ceph-devel. I have been having some read performance issues. With Reads
being slower than writes by a factor of ~5-8.


I recently saw this kind of behaviour (writes were fine, but reads were
terrible) on an IPoIB based cluster and it was caused by the same TCP
auto tune issues that Jim Schutt saw last year. It's worth a try at
least to see if it helps.

echo "0" > /proc/sys/net/ipv4/tcp_moderate_rcvbuf

on all of the clients and server nodes should be enough to test it out.
  Sage added an option in more recent Ceph builds that lets you work
around it too.


Awesome I will test this first up tomorrow.


First info:
Server
SLES 11 SP2
Ceph 0.56.4.
12 OSD's  that are Hardware Raid 5 each of the twelve is made from 5
NL-SAS disks for a total of 60 disks (Each lun can do around 320MB/s
stream write and the same if not better read) Connected via 2xQDR IB
OSD's/MDS and such all on same box (for testing)
Box is a Quad AMD Opteron 6234
Ram is 256Gb
10GB Journals
osd_op_theads: 8
osd_disk_threads:2
Filestore_op_threads:4
OSD's are all XFS


Interesting setup!  QUAD socket Opteron boxes have somewhat slow and
slightly oversubscribed hypertransport links don't they?  I wonder if on
a system with so many disks and QDR-IB if that could become a problem...

We typically like smaller nodes where we can reasonably do 1 OSD per
drive, but we've tested on a couple of 60 drive chassis in RAID configs
too.  Should be interesting to hear what kind of aggregate performance
you can eventually get.


We are also going to try this out with 6 luns on a dual xeon box. The
Opteron box was the biggest scariest thing we had that was doing nothing.





All nodes are connected via QDR IB using IP_O_IB. We get 1.7GB/s on TCP
performance tests between the nodes.

Clients: One is FC17 the other us Ubuntu 12.10 they only have around
32GB-70GB ram.

We ran into an odd issue were the OSD's would all start in the same NUMA
node and pretty much on the same processor core. We fixed that up with
some cpuset magic.


Strange!  Was that more due to cpuset or Ceph?  I can't imagine that we
are doing anything that would cause that.



More than likely it is an odd quirk in the SLES kernel.. but when I have
time I'll do some more poking. We were seeing insane CPU usage on some
cores because all the OSD's were piled up in one place.



Performance testing we have done: (Note oflag=direct was yielding
results within 5% of cached results)


root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=3200
3200+0 records in
3200+0 records out
33554432000 bytes (34 GB) copied, 47.6685 s, 704 MB/s
root@ty3:~#
root@ty3:~# rm /test-rbd-fs/DELETEME
root@ty3:~#
root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=4800
4800+0 records in
4800+0 records out
50331648000 bytes (50 GB) copied, 69.5527 s, 724 MB/s

[root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M
count=2400
2400+0 records in
2400+0 records out
25165824000 bytes (25 GB) copied, 26.3593 s, 955 MB/s
[root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME
[root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M
count=9600
9600+0 records in
9600+0 records out
100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s

Both clients each doing a 140GB write (2x dogbreath's RAM) at the same
time to two different rbds in the same pool.

root@ty3:~# rm /test-rbd-fs/DELETEME
root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=14000
14000+0 records in
14000+0 records out
14680064 bytes (147 GB) copied, 412.404 s, 356 MB/s
root@ty3:~#

[root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME
[root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M
count=14000
14000+0 records in
14000+0 records out
14680064 bytes (147 GB) copied, 433.351 s, 339 MB/s
[root@dogbreath ~]#

Onto reads...
Also we found that doing iflag=direct increased read performance.

[root@dogbreath ~]# dd of=/dev/null if=/test-rbd-fs/DELETEME bs=10M
count=160
160+0 records in
160+0 records out
1677721600 bytes (1.7 GB) copied, 29.4242 s, 57.0 MB/s
[root@dogbreath 

Re: RBD Read performance

2013-04-18 Thread Malcolm Haak

OK, this is getting interesting.

rados -p  bench 300 write --no-cleanup

 Total time run: 301.103933
Total writes made:  22477
Write size: 4194304
Bandwidth (MB/sec): 298.595

Stddev Bandwidth:   171.941
Max bandwidth (MB/sec): 832
Min bandwidth (MB/sec): 8
Average Latency:0.214295
Stddev Latency: 0.405511
Max latency:3.26323
Min latency:0.019429


 rados -p  bench 300 seq

 Total time run:76.634659
Total reads made: 22477
Read size:4194304
Bandwidth (MB/sec):1173.203

Average Latency:   0.054539
Max latency:   0.937036
Min latency:   0.018132


So the writes in the rados bench are slower than we have achieved with 
dd, and were slower on the back-end file store as well. But the reads are 
great; we could see 1-1.5 GB/s on the back-end as well.


So we started doing some other tests to see whether it was RBD or the VFS 
layer in the kernel... and things got weird.


So using CephFS:

root@dogbreath ~]# dd if=/dev/zero of=/test-fs/DELETEME1 bs=1G count=10
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 7.28658 s, 1.5 GB/s
[root@dogbreath ~]# dd if=/dev/zero of=/test-fs/DELETEME1 bs=1G count=20
20+0 records in
20+0 records out
21474836480 bytes (21 GB) copied, 20.6105 s, 1.0 GB/s
[root@dogbreath ~]# dd if=/dev/zero of=/test-fs/DELETEME1 bs=1G count=40
40+0 records in
40+0 records out
42949672960 bytes (43 GB) copied, 53.4013 s, 804 MB/s
[root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=4 
iflag=direct

4+0 records in
4+0 records out
4294967296 bytes (4.3 GB) copied, 23.1572 s, 185 MB/s
[root@dogbreath ~]#
[root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=4
4+0 records in
4+0 records out
4294967296 bytes (4.3 GB) copied, 1.20258 s, 3.6 GB/s
[root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=20
20+0 records in
20+0 records out
21474836480 bytes (21 GB) copied, 5.40589 s, 4.0 GB/s
[root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=40
40+0 records in
40+0 records out
42949672960 bytes (43 GB) copied, 10.4781 s, 4.1 GB/s
[root@dogbreath ~]# echo 1 > /proc/sys/vm/drop_caches
[root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=40
^C24+0 records in
23+0 records out
24696061952 bytes (25 GB) copied, 56.8824 s, 434 MB/s

[root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=40
40+0 records in
40+0 records out
42949672960 bytes (43 GB) copied, 113.542 s, 378 MB/s
[root@dogbreath ~]#

So, about the same when we were not hitting cache. So we decided to just 
hit the RBD block device with no FS on it... welcome to weirdsville.


root@ty3:~# umount /test-rbd-fs
root@ty3:~#
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=4
4+0 records in
4+0 records out
4294967296 bytes (4.3 GB) copied, 18.6603 s, 230 MB/s
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=4 iflag=direct
4+0 records in
4+0 records out
4294967296 bytes (4.3 GB) copied, 1.13584 s, 3.8 GB/s
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20 iflag=direct
20+0 records in
20+0 records out
21474836480 bytes (21 GB) copied, 4.61028 s, 4.7 GB/s
root@ty3:~# echo 1 > /proc/sys/vm/drop_caches
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20 iflag=direct
20+0 records in
20+0 records out
21474836480 bytes (21 GB) copied, 4.43416 s, 4.8 GB/s
root@ty3:~# echo 1 > /proc/sys/vm/drop_caches
root@ty3:~#
root@ty3:~#
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20 iflag=direct
20+0 records in
20+0 records out
21474836480 bytes (21 GB) copied, 5.07426 s, 4.2 GB/s
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=40 iflag=direct
40+0 records in
40+0 records out
42949672960 bytes (43 GB) copied, 8.60885 s, 5.0 GB/s
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=80 iflag=direct
80+0 records in
80+0 records out
85899345920 bytes (86 GB) copied, 18.4305 s, 4.7 GB/s
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20
20+0 records in
20+0 records out
21474836480 bytes (21 GB) copied, 91.5546 s, 235 MB/s
root@ty3:~#

So... we just started reading from the block device, and the numbers 
were, well, faster than QDR IB can do TCP/IP. So we figured local 
caching. We dropped caches and ramped up to bigger than RAM (RAM is 
24 GB) and it got faster. So we went to 3x RAM... and it was a bit slower.


Oh, also, the whole time we were doing these tests the back-end disks 
were seeing no I/O at all. We were dropping caches on the OSDs as well, 
but even if it was caching at the OSD end, the IB link is only QDR and 
we aren't doing RDMA, so... yeah, no idea what is going on here...



On 19/04/13 10:40, Mark Nelson wrote:

On 04/18/2013 07:27 PM, Malcolm Haak wrote:

Morning all,

Did the echos on all boxes involved... and the results are in..

[root@dogbreath ~]#
[root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M
count=1 iflag=direct
1+0 records in
1+0 records out
41

Re: RBD Read performance

2013-04-21 Thread Malcolm Haak

Hi all,

We switched to a now-free Sandy Bridge based server.

This has resolved our read issues, so something about the quad AMD box 
was very bad for reads...

I've got numbers if people are interested, but I would say that AMD is 
not a great idea for OSDs.


Thanks for all the pointers!

Regards

Malcolm Haak

On 19/04/13 12:21, Malcolm Haak wrote:

Ok this is getting interesting.

rados -p  bench 300 write --no-cleanup

  Total time run: 301.103933
Total writes made:  22477
Write size: 4194304
Bandwidth (MB/sec): 298.595

Stddev Bandwidth:   171.941
Max bandwidth (MB/sec): 832
Min bandwidth (MB/sec): 8
Average Latency:0.214295
Stddev Latency: 0.405511
Max latency:3.26323
Min latency:0.019429


  rados -p  bench 300 seq

  Total time run:76.634659
Total reads made: 22477
Read size:4194304
Bandwidth (MB/sec):1173.203

Average Latency:   0.054539
Max latency:   0.937036
Min latency:   0.018132


So the writes on the rados bench are slower than we have achieved with
dd and were slower on the back-end file store as well. But the reads are
great. We could see 1~1.5GB/s on the back-end as well.

So we started doing some other tests to see if it was in RBD or the VFS
layer in the kernel.. And things got weird.

So using CephFS:

root@dogbreath ~]# dd if=/dev/zero of=/test-fs/DELETEME1 bs=1G count=10
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 7.28658 s, 1.5 GB/s
[root@dogbreath ~]# dd if=/dev/zero of=/test-fs/DELETEME1 bs=1G count=20
20+0 records in
20+0 records out
21474836480 bytes (21 GB) copied, 20.6105 s, 1.0 GB/s
[root@dogbreath ~]# dd if=/dev/zero of=/test-fs/DELETEME1 bs=1G count=40
40+0 records in
40+0 records out
42949672960 bytes (43 GB) copied, 53.4013 s, 804 MB/s
[root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=4
iflag=direct
4+0 records in
4+0 records out
4294967296 bytes (4.3 GB) copied, 23.1572 s, 185 MB/s
[root@dogbreath ~]#
[root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=4
4+0 records in
4+0 records out
4294967296 bytes (4.3 GB) copied, 1.20258 s, 3.6 GB/s
[root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=20
20+0 records in
20+0 records out
21474836480 bytes (21 GB) copied, 5.40589 s, 4.0 GB/s
[root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=40
40+0 records in
40+0 records out
42949672960 bytes (43 GB) copied, 10.4781 s, 4.1 GB/s
[root@dogbreath ~]# echo 1 > /proc/sys/vm/drop_caches
[root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=40
^C24+0 records in
23+0 records out
24696061952 bytes (25 GB) copied, 56.8824 s, 434 MB/s

[root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=40
40+0 records in
40+0 records out
42949672960 bytes (43 GB) copied, 113.542 s, 378 MB/s
[root@dogbreath ~]#

So about the same, when we were not hitting cache. So we decided to just
hit the RBD block device with no FS on it.. Welcome to weirdsville

root@ty3:~# umount /test-rbd-fs
root@ty3:~#
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=4
4+0 records in
4+0 records out
4294967296 bytes (4.3 GB) copied, 18.6603 s, 230 MB/s
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=4 iflag=direct
4+0 records in
4+0 records out
4294967296 bytes (4.3 GB) copied, 1.13584 s, 3.8 GB/s
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20 iflag=direct
20+0 records in
20+0 records out
21474836480 bytes (21 GB) copied, 4.61028 s, 4.7 GB/s
root@ty3:~# echo 1 > /proc/sys/vm/drop_caches
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20 iflag=direct
20+0 records in
20+0 records out
21474836480 bytes (21 GB) copied, 4.43416 s, 4.8 GB/s
root@ty3:~# echo 1 > /proc/sys/vm/drop_caches
root@ty3:~#
root@ty3:~#
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20 iflag=direct
20+0 records in
20+0 records out
21474836480 bytes (21 GB) copied, 5.07426 s, 4.2 GB/s
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=40 iflag=direct
40+0 records in
40+0 records out
42949672960 bytes (43 GB) copied, 8.60885 s, 5.0 GB/s
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=80 iflag=direct
80+0 records in
80+0 records out
85899345920 bytes (86 GB) copied, 18.4305 s, 4.7 GB/s
root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20
20+0 records in
20+0 records out
21474836480 bytes (21 GB) copied, 91.5546 s, 235 MB/s
root@ty3:~#

So.. we just started reading from the block device. And the numbers were
well.. Faster than the QDR IB can do TCP/IP. So we figured local
caching. So we dropped caches and ramped up to bigger than ram. (ram is
24GB) and it got faster. So we went to 3x ram.. and it was a bit slower..

Oh also the whole time we were doing these tests, the back-end disk was
seeing no I/O at all.. We were dropping caches on the OSD's as well, but
even if it was caching at the OSD end, the IB link is only QDR and we
aren't doing RDMA so. Y