Re: OSD crash

2012-06-18 Thread Stefan Priebe - Profihost AG

On 17.06.2012 23:16, Sage Weil wrote:

Hi Stefan,

I opened http://tracker.newdream.net/issues/2599 to track this, but the
dump strangely does not include the ceph version or commit sha1.  What
version were you running?
Sorry, that was my build system; it accidentally removed the .git dir while 
building, so the version string couldn't be compiled in.


It was 5efaa8d7799347dfae38333b1fd6e1a87dc76b28

Stefan


stable 10GBE under 3.4.2 or 3.5.0-rc2

2012-06-18 Thread Stefan Priebe - Profihost AG

Hi list,

I still have problems with stable network speed under recent kernels. With 
3.0.32 I get a stable 9.90 Gbit/s in both directions. With 3.4.2 or 
3.5.0-rc2 it sometimes drops down to around 1 Gbit/s. Sadly I have no idea 
when this happens. It sometimes works for minutes at 9.9 Gbit/s and then 
suddenly runs at only 3-4 Gbit/s or even 1 Gbit/s on recent kernels.


I'm using the various tunings recommended by Intel (Improving Performance): 
http://downloadmirror.intel.com/5874/eng/README.txt


Does anybody have a hint for me, or additional settings Intel does not mention?

Stefan


Re: Updating OSD from current stable (0.47-2) to next failed with broken filestore

2012-06-18 Thread Simon Frerichs | Fremaks GmbH

Hi Sage,

it's fixed now in the 'next' branch.
We're using XFS for data storage.

Thanks for fixing this.
Simon

On 17.06.12 23:22, Sage Weil wrote:

On Sun, 17 Jun 2012, Sage Weil wrote:

Hi Simon,

We've opened http://tracker.newdream.net/issues/2598 to track this.

Actually, having looked at the code, I'm pretty sure I see the problem.
I pushed a fix to the 'next' branch.  Can you try the latest and see if it
resolves the problem?

(Also, out of curiosity, what file system are you running underneath the
ceph-osd?)

Thanks!
sage



Thanks!
sage

On Sat, 16 Jun 2012, Simon Frerichs | Fremaks GmbH wrote:


Hi,

I tried updating one of our OSDs from stable 0.47-2 to the latest next branch;
it started updating the filestore and failed.
After that, neither the next-branch OSD nor the stable OSD would start with this
filestore anymore.
Is there something wrong with the filestore update?

Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134135 7ffed3e35780 0 filestore(/data/osd11) mount FIEMAP ioctl is supported and appears to work
Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134163 7ffed3e35780 0 filestore(/data/osd11) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134476 7ffed3e35780 0 filestore(/data/osd11) mount did NOT detect btrfs
Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134485 7ffed3e35780 0 filestore(/data/osd11) mount syncfs(2) syscall not support by glibc
Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134513 7ffed3e35780 0 filestore(/data/osd11) mount no syncfs(2), must use sync(2).
Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134514 7ffed3e35780 0 filestore(/data/osd11) mount WARNING: multiple ceph-osd daemons on the same host will be slow
Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134551 7ffed3e35780 -1 filestore(/data/osd11) FileStore::mount : stale version stamp detected: 2. Proceeding, do_update is set, DO NOT USE THIS OPTION IF YOU DO NOT KNOW WHAT IT DOES. More details can be found on the wiki.
Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134585 7ffed3e35780 0 filestore(/data/osd11) mount found snaps
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.531974 7ffed3e35780 0 filestore(/data/osd11) mount: enabling WRITEAHEAD journal mode: btrfs not detected
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.543721 7ffed3e35780 1 journal _open /dev/sdb1 fd 18: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 0
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.588059 7ffed3e35780 1 journal _open /dev/sdb1 fd 18: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 0
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.588905 7ffed3e35780 -1 FileStore is old at version 2. Updating...
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.588914 7ffed3e35780 -1 Removing tmp pgs
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.594362 7ffed3e35780 -1 Getting collections
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.594369 7ffed3e35780 -1 597 to process.
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.595195 7ffed3e35780 -1 0/597 processed
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.595213 7ffed3e35780 -1 Updating collection omap current version is 0
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.662274 7ffed3e35780 -1 os/FlatIndex.cc: In function 'virtual int FlatIndex::collection_list_partial(const hobject_t&, int, int, snapid_t, std::vector<hobject_t>*, hobject_t*)' thread 7ffed3e35780 time 2012-06-16 14:10:12.637479
os/FlatIndex.cc: 386: FAILED assert(0)

 ceph version 0.47.2-500-g1e899d0 (commit:1e899d08e61bbba0af6f3600b6bc9a5fc9e5c2e9)
 1: /usr/local/bin/ceph-osd() [0x6b337d]
 2: (FileStore::collection_list_partial(coll_t, hobject_t, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x9c) [0x67b24c]
 3: (OSD::convert_collection(ObjectStore*, coll_t)+0x529) [0x5b90e9]
 4: (OSD::do_convertfs(ObjectStore*)+0x46f) [0x5b9b9f]
 5: (OSD::convertfs(std::string const&, std::string const&)+0x47) [0x5ba127]
 6: (main()+0x967) [0x531d07]
 7: (__libc_start_main()+0xfd) [0x7ffed1d8aead]
 8: /usr/local/bin/ceph-osd() [0x5357b9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Simon










--
Kind regards,

Re: stable 10GBE under 3.4.2 or 3.5.0-rc2

2012-06-18 Thread Alexandre DERUMIER
maybe try to play with net.ipv4.tcp_congestion_control ? 



The available algorithms are: 

• BIC - The default on Gentoo 
• Reno - The classic TCP protocol. Most OSes use this. 
• highspeed - HighSpeed TCP: Sally Floyd's suggested algorithm 
• htcp - Hamilton TCP 
• hybla - For Satellite Links 
• scalable - Scalable TCP 
• vegas - Vegas TCP 
• westwood - Optimized for lossy networks 

http://lwn.net/Articles/128681/

The High-speed TCP algorithm is optimized for very fat pipes - 10G Ethernet and 
such. When things are congested, it behaves much like the Reno algorithm. When 
the congestion window is being increased, however, the high-speed algorithm 
makes use of a table to pick large increment values. This approach lets the 
congestion window get very large (i.e. tens of thousands of segments) quickly, 
and to stay large, without requiring that the network function for long periods 
of time without a single dropped packet.
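To check what is in use and switch at runtime, something like this should do it (a sketch; the non-default algorithms usually ship as modules that have to be loaded first):

# current algorithm and the ones already available
sysctl net.ipv4.tcp_congestion_control
sysctl net.ipv4.tcp_available_congestion_control

# load and select another one, e.g. htcp
modprobe tcp_htcp
sysctl -w net.ipv4.tcp_congestion_control=htcp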



- Original Message -

From: Stefan Priebe - Profihost AG s.pri...@profihost.ag 
To: ceph-devel@vger.kernel.org 
Sent: Monday, 18 June 2012 09:40:47 
Subject: stable 10GBE under 3.4.2 or 3.5.0-rc2 

Hi list, 

I still have problems with stable network speed under recent kernels. With 
3.0.32 I get a stable 9.90 Gbit/s in both directions. With 3.4.2 or 
3.5.0-rc2 it sometimes drops down to around 1 Gbit/s. Sadly I have no idea 
when this happens. It sometimes works for minutes at 9.9 Gbit/s and then 
suddenly runs at only 3-4 Gbit/s or even 1 Gbit/s on recent kernels. 

I'm using the various tunings recommended by Intel (Improving Performance): 
http://downloadmirror.intel.com/5874/eng/README.txt 

Does anybody have a hint for me, or additional settings Intel does not mention? 

Stefan 



-- 

-- 




Alexandre Derumier 
Systems Engineer 
Phone: 03 20 68 88 90 
Fax: 03 20 68 90 81 
45 Bvd du Général Leclerc 59100 Roubaix - France 
12 rue Marivaux 75002 Paris - France 



Re: all rbd users: set 'filestore fiemap = false'

2012-06-18 Thread Oliver Francke

Hi Sage,

On 06/18/2012 06:02 AM, Sage Weil wrote:

If you are using RBD, and want to avoid potential image corruption, add

filestore fiemap = false

to the [osd] section of your ceph.conf and restart your OSDs.
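For example, that change looks roughly like this (the restart command depends on how the OSDs are started; the stock init script and osd.0 are just examples):

[osd]
    filestore fiemap = false

# then restart each OSD, e.g.:
service ceph restart osd.0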


this heals some of the trouble as far as I can tell, but I don't quite understand...



We've tracked down the source of some corruption to racy/buggy FIEMAP
ioctl behavior.  The RBD client (when caching is disabled--the default)
uses a 'sparse read' operation that the OSD implements by doing an fsync
on the object file, mapping which extents are allocated, and sending only
that data over the wire.  We have observed incorrect/changing FIEMAP on
both btrfs:

fsync
fiemap returns mapping
time passes, no modifications to file
fiemap returns different mapping


... that even an initial start of a VM leads to corruption of the read data?

I get something like:

--- 8< ---

Loading, please wait
/sbin/init: relocation error: ...
 not defined in file libc.so.6...
[ 0.81...] Kernel panic - not syncing: Attempted to kill init!

--- 8< ---

The host kernel is now 3.4.1 + qemu-1.0.1, but it shows failures with other 
kernel/qemu versions, too.


Keeping fingers crossed for Josh, though ;-)
Give me a shout if I can do some debugging.

regards,

Oliver.



Josh is still tracking down which kernels and file system are affected;
fortunately it is relatively easy to reproduce with the test_librbd_fsx
tool.  In the meantime, the (mis)feature can be safely disabled. It will
default to off in 0.48. It is unclear whether it's really much of a
performance win anyway.

Thanks!
sage



--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh



Re: stable 10GBE under 3.4.2 or 3.5.0-rc2

2012-06-18 Thread Alexandre DERUMIER
Did you use the same driver version with the different kernel versions?


Maybe try to deactivate TSO, GSO, etc. with ethtool?
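Something along these lines (a sketch; eth0 stands in for your 10GbE interface):

# show the current offload settings
ethtool -k eth0

# turn off segmentation / receive offloads
ethtool -K eth0 tso off gso off gro off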


- Original Message - 

From: Stefan Priebe - Profihost AG s.pri...@profihost.ag 
To: Alexandre DERUMIER aderum...@odiso.com 
Cc: ceph-devel@vger.kernel.org 
Sent: Monday, 18 June 2012 11:11:16 
Subject: Re: stable 10GBE under 3.4.2 or 3.5.0-rc2 

On 18.06.2012 10:11, Alexandre DERUMIER wrote: 
 maybe try to play with net.ipv4.tcp_congestion_control ? 
The default on my machines, and even under RHEL6, is cubic. But I've now 
also tried reno, bic and highspeed. It doesn't change anything. 

Everything is fine under 3.0.32 and pretty bad under 3.4.X or 3.5rc2. 

Stefan 






Re: stable 10GBE under 3.4.2 or 3.5.0-rc2

2012-06-18 Thread Stefan Priebe - Profihost AG

On 18.06.2012 11:21, Alexandre DERUMIER wrote:

Did you use the same driver version with the different kernel versions?


Yes, driver: ixgbe, version: 3.9.17-NAPI.


Maybe try to deactivate TSO, GSO, etc. with ethtool?


Already tried that, no change. And remember, it works fine with 3.0.32.

Stefan


Re: stable 10GBE under 3.4.2 or 3.5.0-rc2

2012-06-18 Thread Alexandre DERUMIER
Also, I see a new feature in 3.3:

byte_queue_limits 

1.5. Bufferbloat fighting: Byte queue limits

https://lwn.net/Articles/454390/

Bufferbloat is a term used to describe the latency and throughput problems 
caused by excessive buffering through the several elements of a network 
connection. Some tools are being developed to help to alleviate these problems, 
and this feature is one of them.

Byte queue limits are a configurable limit of packet data that can be put in 
the transmission queue of a network device. As a result one can tune things 
such that high priority packets get serviced with a reasonable amount of 
latency whilst not subjecting the hardware queue to emptying when data is 
available to send. Configuration of the queue limits is in the tx-n sysfs 
directory for the queue under the byte_queue_limits directory.

I see some bug reports about spikes:

http://www.mail-archive.com/e1000-devel@lists.sourceforge.net/msg05538.html



Maybe you can try to play with the values in

/sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
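For example (a sketch; eth0/tx-0 stand in for your interface and queue, repeat per tx queue):

# current limits
cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit
cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max

# cap the amount of queued data, e.g. to 256 kB
echo 262144 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max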

- Original Message - 

From: Alexandre DERUMIER aderum...@odiso.com 
To: Stefan Priebe - Profihost AG s.pri...@profihost.ag 
Cc: ceph-devel@vger.kernel.org 
Sent: Monday, 18 June 2012 11:21:09 
Subject: Re: stable 10GBE under 3.4.2 or 3.5.0-rc2 

Did you use the same driver version with the different kernel versions? 


Maybe try to deactivate TSO, GSO, etc. with ethtool? 


- Original Message - 

From: Stefan Priebe - Profihost AG s.pri...@profihost.ag 
To: Alexandre DERUMIER aderum...@odiso.com 
Cc: ceph-devel@vger.kernel.org 
Sent: Monday, 18 June 2012 11:11:16 
Subject: Re: stable 10GBE under 3.4.2 or 3.5.0-rc2 

On 18.06.2012 10:11, Alexandre DERUMIER wrote: 
 maybe try to play with net.ipv4.tcp_congestion_control ? 
The default on my machines, and even under RHEL6, is cubic. But I've now 
also tried reno, bic and highspeed. It doesn't change anything. 

Everything is fine under 3.0.32 and pretty bad under 3.4.X or 3.5rc2. 

Stefan 







-- 

-- 




Alexandre Derumier 
Systems Engineer 
Phone: 03 20 68 88 90 
Fax: 03 20 68 90 81 
45 Bvd du Général Leclerc 59100 Roubaix - France 
12 rue Marivaux 75002 Paris - France 



Re: sync Ceph packaging efforts for Debian/Ubuntu

2012-06-18 Thread James Page
Hi Sage/Laszlo

Laszlo - thanks for sending the original email - I'd like to get
everything as closely in-sync as possible between the three packaging
sources as well.

On 16/06/12 22:50, Sage Weil wrote:
 I've taken a closer look at these patches, and have a few questions.
 
 - The URL change and nss patches I've applied; they are in the ceph.git 
 'debian' branch.

Great!

 
 - Has the leveldb patch been sent upstream?  Once it is committed to 
 the upstream git, we can update ceph to use it; that's nicer than carrying 
 the patch.  However, I thought you needed to link against the existing
 libleveldb1 package... which means we shouldn't do anything on our
 side,
 right?

I can't see any evidence that this has been sent upstream; ideally we
would be building against libleveldb1 rather than using the embedded
copy - I'm not familiar with the reason that this has not happened
already (if there is one).  This package would also need to be reviewed
for inclusion in main if that was the case.

 - I'm not sure how useful it is to break mount.ceph and cephfs into a 
 separate ceph-fs-common package, but we can do it.  Same goes for a 
 separate package for ceph-mds.  That was originally motivated by ubuntu 
 not wanting the mds in main, but in the end only the libraries went in, so 
 it's a moot point.  I'd rather hear from them what their intentions are 
 for 12.10 before complicating things...

ceph-fs-common is in Ubuntu main; so I think the original motivation
still stands IMHO.

For the Ubuntu quantal cycle we still have the same primary objective as
we had during 12.04; namely ensuring that Ceph RBD can be used as a
block store for qemu-kvm which ties nicely into the Ubuntu OpenStack
story through Cinder; In addition we will be looking at Ceph RADOS as a
backend for Glance (see [0] for more details).

The MIR for Ceph occurred quite late in the 12.04 cycle so we had to
trim the scope to actually get it done; We will be looking at libfcgi
and google-perftools this cycle for main inclusion to re-enable the
components that are currently disabled in the Ubuntu packaging.

 - That same patch also switched all the Architecture: lines back to 
 linux-any.  Was that intentional?  I just changed them from that last 
 week.

I think linux-any is correct - the change you have made would exclude
the PPC architecture in Ubuntu and Debian.

[...]

 Ben, James, can you please share in some sentences why ceph-fuse is
 dropped in Ubuntu? Do you need it Sage? If it's feasible, you may drop
 that as well.

There is an outstanding question on the 12.04 MIR as to whether this
package could still be built but not promoted to main - I'll follow up
with the MIR reviewer as to whether that's possible as I don't think it
requires any additional build dependencies.

[...]

I hope that explains the Ubuntu position on Ceph and what plans we have
this development cycle.

I expect Clint will chip in if I have missed anything.

Cheers

James

[0]
https://blueprints.launchpad.net/ubuntu/+spec/servercloud-q-ceph-object-integration
-- 
James Page
Ubuntu Core Developer
Debian Maintainer
james.p...@ubuntu.com


Re: extent rbd ls to show size and free space?

2012-06-18 Thread Wido den Hollander

Hi,

 Hello list,

 are there any plans to extend rbd ls in a way that it shows image size
 and free space of the pool?


You want something like:

$ rbd ls
NAMESIZE
alpha   50G
beta400G
charlie 150G

That is possible, but if you want to see the allocation of an image that 
will be harder, since RBD doesn't know which objects have been written 
to and which haven't.


There is also no such thing as "free pool space"; you have to look at 
the free cluster space.


But again, if your cluster has X TB of free space and you have multiple 
pools, your usage will depend on the amount of written data and the 
replication level.


If you want to know the usage of the rbd pool, I suggest using rados df.

The RBD tool could however be modified to show the image size if you 
give another flag.


$ rbd --extended ls

Wido

On 06/18/2012 02:08 PM, Stefan Priebe - Profihost AG wrote:

Stefan




Re: extent rbd ls to show size and free space?

2012-06-18 Thread Stefan Priebe - Profihost AG

On 18.06.2012 14:47, Wido den Hollander wrote:

Hi,

  Hello list,
 
  are there any plans to extend rbd ls in a way that it shows image size
  and free space of the pool?
 

You want something like:

$ rbd ls
NAME SIZE
alpha 50G
beta 400G
charlie 150G


Yes


That is possible, but if you want to see the allocation of an image that
will be harder, since RBD doesn't know which objects have been written
to and which haven't.


Sure, I didn't mean that.


If you want to know the usage of the rbd pool I suggest using rados df.

ah OK.


The RBD tool could however be modified to show the image size if you
give another flag.

$ rbd --extended ls

That would be great.

Stefan



Re: extent rbd ls to show size and free space?

2012-06-18 Thread Wido den Hollander

On 06/18/2012 03:03 PM, Stefan Priebe - Profihost AG wrote:

On 18.06.2012 14:47, Wido den Hollander wrote:

Hi,

 Hello list,

 are there any plans to extend rbd ls in a way that it shows image size
 and free space of the pool?


You want something like:

$ rbd ls
NAME SIZE
alpha 50G
beta 400G
charlie 150G


Yes


That is possible, but if you want to see the allocation of an image that
will be harder, since RBD doesn't know which objects have been written
to and which haven't.


Sure, I didn't mean that.


If you want to know the usage of the rbd pool I suggest using rados df.

ah OK.


The RBD tool could however be modified to show the image size if you
give another flag.

$ rbd --extended ls

That would be great.


I created an issue in the tracker for it: 
http://tracker.newdream.net/issues/2601




Stefan





Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ?

2012-06-18 Thread Mark Nelson

On 6/18/12 7:34 AM, Alexandre DERUMIER wrote:

Hi,

I'm doing tests with rados bench, and I see constant writes to the OSD disks.
Is that the normal behaviour? With write-ahead, shouldn't writes occur every 20-30 
seconds?


The cluster is
3 nodes (Ubuntu precise - glibc 2.14 - ceph 0.47.2), each node with 1 journal on 
an 8GB tmpfs, 1 OSD (xfs) on a SAS disk, and 1 gigabit link.


An 8GB journal can easily handle 20s of writes (1 gigabit link).

[osd]
 osd data = /srv/osd.$id
 osd journal = /tmpfs/osd.$id.journal
 osd journal size = 8000
 journal dio = false
 filestore journal parallel = false
 filestore journal writeahead = true
 filestore fiemap = false




I have done tests with different kernels (3.0, 3.2, 3.4) and different filesystems 
(xfs, btrfs, ext4), with the journal mode forced to writeahead.
Benchmarks were done with rados bench and fio.

I always see constant writes from the first second of the benchmark.

Any idea?


Hi Alex,

Sorry I got behind at looking at your output last week.  I've created a 
seekwatcher movie of your blktrace results here:


http://nhm.ceph.com/movies/mailinglist-tests/alex-test-3.4.mpg

The results match up well with your iostat output.  Peaks and valleys in 
the writes every couple of seconds.  Low numbers of seeks, so probably 
not limited by the filestore (a quick osd tell X bench might confirm 
that).
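For reference, that is roughly the following (osd 0 is just an example; the result shows up in the cluster log, e.g. via ceph -w):

# ask osd.0 to run its internal write benchmark
ceph osd tell 0 bench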


I'm wondering if you increase filestore max sync interval to something 
bigger (default is 5s) if you'd see somewhat different behavior.  Maybe 
try something like 30s and see what happens?
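That can go in ceph.conf, or be injected into a running OSD, roughly like this (osd.0 and the value are just examples):

[osd]
    filestore max sync interval = 30

# or, without restarting:
ceph osd tell 0 injectargs '--filestore-max-sync-interval 30'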


Mark



Re: stable 10GBE under 3.4.2 or 3.5.0-rc2

2012-06-18 Thread Mark Nelson

On 6/18/12 2:40 AM, Stefan Priebe - Profihost AG wrote:

Hi list,

I still have problems with stable network speed under recent kernels. With
3.0.32 I get a stable 9.90 Gbit/s in both directions. With 3.4.2 or
3.5.0-rc2 it sometimes drops down to around 1 Gbit/s. Sadly I have no idea
when this happens. It sometimes works for minutes at 9.9 Gbit/s and then
suddenly runs at only 3-4 Gbit/s or even 1 Gbit/s on recent kernels.

I'm using the various tunings recommended by Intel (Improving Performance):
http://downloadmirror.intel.com/5874/eng/README.txt



Hi Stefan,

Did you ever get a chance to talk to Jason Wang about the commit that 
was causing the problems?  It might be a good idea to report all of this 
upstream and see what they have to say.


Mark


Re: stable 10GBE under 3.4.2 or 3.5.0-rc2

2012-06-18 Thread Stefan Priebe - Profihost AG

On 18.06.2012 15:56, Mark Nelson wrote:

On 6/18/12 2:40 AM, Stefan Priebe - Profihost AG wrote:

Hi list,

I still have problems with stable network speed under recent kernels. With
3.0.32 I get a stable 9.90 Gbit/s in both directions. With 3.4.2 or
3.5.0-rc2 it sometimes drops down to around 1 Gbit/s. Sadly I have no idea
when this happens. It sometimes works for minutes at 9.9 Gbit/s and then
suddenly runs at only 3-4 Gbit/s or even 1 Gbit/s on recent kernels.

I'm using the various tunings recommended by Intel (Improving Performance):
http://downloadmirror.intel.com/5874/eng/README.txt



Hi Stefan,

Did you ever get a chance to talk to Jason Wang about the commit that
was causing the problems? It might be a good idea to report all of this
upstream and see what they have to say.


Hi Mark,

Yes, I talked to him and he told me that his change was partially 
removed in more recent kernel versions.


Right now I'm in a discussion with Eric Dumazet on the netdev kernel 
mailing list, and he has given me some good advice which seems to work well.


I'm still testing and will share the results here.

Stefan


Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ?

2012-06-18 Thread Alexandre DERUMIER
Hi Mark,

Sorry I got behind at looking at your output last week. I've created a 
seekwatcher movie of your blktrace results here: 

http://nhm.ceph.com/movies/mailinglist-tests/alex-test-3.4.mpg 

How do you create a seekwatcher movie from blktrace? (I'd like to create them 
myself; it seems good for debugging.)


The results match up well with your iostat output. Peaks and valleys in 
the writes every couple of seconds. Low numbers of seeks, so probably 
not limited by the filestore (a quick osd tell X bench might confirm 
that). 

Yet I'm pretty sure that the limitation is not hardware (each OSD is a 15k 
drive handling around 10MB/s during the test, so I think it should be OK ^_^).
How do you use osd tell X bench?

I'm wondering if you increase filestore max sync interval to something 
bigger (default is 5s) if you'd see somewhat different behavior. Maybe 
try something like 30s and see what happens? 

I have done a test with 30s; that doesn't change anything.
I have also tried filestore min sync interval = 29 + filestore max sync interval 
= 30.




- Original Message - 

From: Mark Nelson mark.nel...@inktank.com 
To: Alexandre DERUMIER aderum...@odiso.com 
Cc: ceph-devel@vger.kernel.org 
Sent: Monday, 18 June 2012 15:29:58 
Subject: Re: iostat show constants write to osd disk with writeahead journal, 
normal behaviour ? 

On 6/18/12 7:34 AM, Alexandre DERUMIER wrote: 
 Hi, 
 
 I'm doing test with rados bench, and I see constant writes to osd disks. 
 Is it the normal behaviour ? with write-ahead should write occur each 20-30 
 seconde ? 
 
 
 Cluster is 
 3 nodes (ubuntu precise - glibc 2.14 - ceph 0.47.2) with each node 1 journal 
 on tmpfs 8GB - 1 osd (xfs) on sas disk - 1 gigabit link 
 
 
 8GB journal can handle easily 20s of write (1 gigabit link) 
 
 [osd] 
 osd data = /srv/osd.$id 
 osd journal = /tmpfs/osd.$id.journal 
 osd journal size = 8000 
 journal dio = false 
 filestore journal parallel = false 
 filestore journal writeahead = true 
 filestore fiemap = false 
 
 
 
 
 I have done tests with differents kernel (3.0,3.2,3.4) , differents 
 filesystem (xfs,btrfs,ext4), forced journal mode to writeahead. 
 Bench were done write rados bench and fio. 
 
 I always have constant write since the first second of bench start. 
 
 Any idea ? 

Hi Alex, 

Sorry I got behind at looking at your output last week. I've created a 
seekwatcher movie of your blktrace results here: 

http://nhm.ceph.com/movies/mailinglist-tests/alex-test-3.4.mpg 

The results match up well with your iostat output. Peaks and valleys in 
the writes every couple of seconds. Low numbers of seeks, so probably 
not limited by the filestore (a quick osd tell X bench might confirm 
that). 

I'm wondering if you increase filestore max sync interval to something 
bigger (default is 5s) if you'd see somewhat different behavior. Maybe 
try something like 30s and see what happens? 

Mark 




-- 

-- 




Alexandre Derumier 
Systems Engineer 
Phone: 03 20 68 88 90 
Fax: 03 20 68 90 81 
45 Bvd du Général Leclerc 59100 Roubaix - France 
12 rue Marivaux 75002 Paris - France 



Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ?

2012-06-18 Thread Mark Nelson

On 6/18/12 9:04 AM, Alexandre DERUMIER wrote:

Hi Mark,


Sorry I got behind at looking at your output last week. I've created a
seekwatcher movie of your blktrace results here:

http://nhm.ceph.com/movies/mailinglist-tests/alex-test-3.4.mpg


How do you create a seekwatcher movie from blktrace? (I'd like to create them 
myself; it seems good for debugging.)


You'll need to download seekwatcher from Chris Mason's website.  Get the 
newest unstable version.  To make movies you'll need mencoder.  (It also 
needs numpy and matplotlib.)  There is a small bug in the code where 
"> /dev/null" should be changed to "> /dev/null 2>&1".  If you have trouble, 
let me know and I can send you a fixed version of the script.
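The rough workflow is something like this (a sketch; the device, trace name, run time and output names are just examples):

# capture a blktrace of the OSD disk while the benchmark runs
blktrace -d /dev/sdb -o osd-sdb -w 120

# then turn the trace into a graph or a movie
seekwatcher -t osd-sdb -o osd-sdb.png
seekwatcher -t osd-sdb -o osd-sdb.mpg --movie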






The results match up well with your iostat output. Peaks and valleys in
the writes every couple of seconds. Low numbers of seeks, so probably
not limited by the filestore (a quick osd tell X bench might confirm
that).


Yet I'm pretty sure that the limitation is not hardware (each OSD is a 15k 
drive handling around 10MB/s during the test, so I think it should be OK ^_^).
How do you use osd tell X bench?


Yeah, I just wanted to make sure that the constant writes weren't 
because the filestore was falling behind.  You may want to take a look 
at some of the information that is provided by the admin socket for the 
OSD while the test is running. dump_ops_in_flight, perf schema, and perf 
dump are all useful.


Try:

ceph --admin-daemon <socket> help

The osd admin sockets should be available in /var/run/ceph.
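Concretely, with the socket names under /var/run/ceph, that looks like (osd.0 is just an example):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok help
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perfcounters_dump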




I'm wondering if you increase filestore max sync interval to something
bigger (default is 5s) if you'd see somewhat different behavior. Maybe
try something like 30s and see what happens?


I have done a test with 30s; that doesn't change anything.
I have also tried filestore min sync interval = 29 + filestore max sync interval 
= 30.



Nuts.  Do you still see the little peaks/valleys every couple seconds?





- Original Message -

From: Mark Nelson mark.nel...@inktank.com
To: Alexandre DERUMIER aderum...@odiso.com
Cc: ceph-devel@vger.kernel.org
Sent: Monday, 18 June 2012 15:29:58
Subject: Re: iostat show constants write to osd disk with writeahead journal, 
normal behaviour ?

On 6/18/12 7:34 AM, Alexandre DERUMIER wrote:

Hi,

I'm doing test with rados bench, and I see constant writes to osd disks.
Is it the normal behaviour ? with write-ahead should write occur each 20-30 
seconde ?


Cluster is
3 nodes (ubuntu precise - glibc 2.14 - ceph 0.47.2) with each node 1 journal on 
tmpfs 8GB - 1 osd (xfs) on sas disk - 1 gigabit link


8GB journal can handle easily 20s of write (1 gigabit link)

[osd]
osd data = /srv/osd.$id
osd journal = /tmpfs/osd.$id.journal
osd journal size = 8000
journal dio = false
filestore journal parallel = false
filestore journal writeahead = true
filestore fiemap = false




I have done tests with differents kernel (3.0,3.2,3.4) , differents filesystem 
(xfs,btrfs,ext4), forced journal mode to writeahead.
Bench were done write rados bench and fio.

I always have constant write since the first second of bench start.

Any idea ?


Hi Alex,

Sorry I got behind at looking at your output last week. I've created a
seekwatcher movie of your blktrace results here:

http://nhm.ceph.com/movies/mailinglist-tests/alex-test-3.4.mpg

The results match up well with your iostat output. Peaks and valleys in
the writes every couple of seconds. Low numbers of seeks, so probably
not limited by the filestore (a quick osd tell X bench might confirm
that).

I'm wondering if you increase filestore max sync interval to something
bigger (default is 5s) if you'd see somewhat different behavior. Maybe
try something like 30s and see what happens?

Mark








Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ?

2012-06-18 Thread Alexandre DERUMIER
Yeah, I just wanted to make sure that the constant writes weren't 
because the filestore was falling behind. You may want to take a look 
at some of the information that is provided by the admin socket for the 
OSD while the test is running. dump_ops_in_flight, perf schema, and perf 
dump are all useful.


I don't know which values to check in these big JSON responses ;)
But I have tried with more OSDs, so writes are split across more disks and are 
smaller, and the behaviour is the same.


root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok 
dump_ops_in_flight
{ num_ops: 1,
  ops: [
{ description: osd_op(client.4179.0:83 kvmtest1_1006560_object82 
[write 0~4194304] 3.9f5c55af),
  received_at: 2012-06-18 16:41:17.995167,
  age: 0.406678,
  flag_point: waiting for sub ops,
  client_info: { client: client.4179,
  tid: 83}}]}


root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok 
perfcounters_dump

{filestore:{journal_queue_max_ops:500,journal_queue_ops:0,journal_ops:2198,journal_queue_max_bytes:104857600,journal_queue_bytes:0,journal_bytes:1012769525,journal_latency:{avgcount:2198,sum:3.13569},op_queue_max_ops:500,op_queue_ops:0,ops:2198,op_queue_max_bytes:104857600,op_queue_bytes:0,bytes:1012757330,apply_latency:{avgcount:2198,sum:290.27},committing:0,commitcycle:59,commitcycle_interval:{avgcount:59,sum:300.04},commitcycle_latency:{avgcount:59,sum:4.76299},journal_full:0},osd:{opq:0,op_wip:0,op:127,op_in_bytes:532692449,op_out_bytes:0,op_latency:{avgcount:127,sum:49.2627},op_r:0,op_r_out_bytes:0,op_r_latency:{avgcount:0,sum:0},op_w:127,op_w_in_bytes:532692449,op_w_rlat:{avgcount:127,sum:0},op_w_latency:{avgcount:127,sum:49.2627},op_rw:0,op_rw_in_bytes:0,op_rw_out_bytes:0,op_rw_rlat:{avgcount:0,sum:0},op_rw_latency:{avgcount:0,sum:0},subop:114,subop_in_bytes:478212311,subop_latency:{avgcount:114,sum:8.82174},subop_w:0,subop_w_in_bytes:478212311,subop_w_latency:{avgcount:114,sum:8.82174},subop_pull:0,subop_pull_latency:{avgcount:0,sum:0},subop_push:0,subop_push_in_bytes:0,subop_push_latency:{avgcount:0,sum:0},pull:0,push:0,push_out_bytes:0,recovery_ops:0,loadavg:0.47,buffer_bytes:0,numpg:423,numpg_primary:259,numpg_replica:164,numpg_stray:0,heartbeat_to_peers:10,heartbeat_from_peers:0,map_messages:34,map_message_epochs:44,map_message_epoch_dups:24},throttle-filestore_bytes:{val:0,max:104857600,get:0,get_sum:0,get_or_fail_fail:0,get_or_fail_success:0,take:2198,take_sum:1012769525,put:1503,put_sum:1012769525,wait:{avgcount:0,sum:0}},throttle-filestore_ops:{val:0,max:500,get:0,get_sum:0,get_or_fail_fail:0,get_or_fail_success:0,take:2198,take_sum:2198,put:1503,put_sum:2198,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-client:{val:4194469,max:104857600,get:243,get_sum:536987810,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:242,put_sum:532793341,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-cluster:{val:0,max:104857600,get:1480,get_sum:482051948,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:1480,put_sum:482051948,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-hbclient:{val:0,max:104857600,get:1077,get_sum:50619,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:1077,put_sum:50619,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-hbserver:{val:0,max:104857600,get:972,get_sum:45684,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:972,put_sum:45684,wait:{avgcount:0,sum:0}},throttle-osd_client_bytes:{val:4194469,max:524288000,get:128,get_sum:536892019,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:254,put_sum:532697550,wait:{avgcount:0,sum:0}}}


root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok 
perfcounters_schema


Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ?

2012-06-18 Thread Alexandre DERUMIER
Forgot to send the iostat -x 1 trace.

(OSDs are on sdb, sdc, sdd, sde, sdf)

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sda   0,0055,000,00   31,00 0,00  1468,0094,71 
0,216,770,006,77   5,16  16,00
sdb   0,00 0,000,00   74,00 0,00 20516,00   554,49 
2,74   38,780,00   38,78   3,51  26,00
sdc   0,00 0,000,00   57,00 0,00 15520,00   544,56 
1,77   28,600,00   28,60   3,68  21,00
sdd   0,00 0,000,00   16,00 0,00  4108,00   513,50 
0,52   32,500,00   32,50   4,38   7,00
sde   0,00 0,000,00   15,00 0,00  4104,00   547,20 
0,48   32,000,00   32,00   4,00   6,00
sdf   0,00 0,000,00   46,00 0,00 12316,00   535,48 
1,42   30,870,00   30,87   3,70  17,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   1,550,007,091,160,00   90,21

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sda   0,0037,000,00   20,00 0,00   236,0023,60 
0,126,000,006,00   5,00  10,00
sdb   0,00 0,000,00   41,00 0,00 10780,00   525,85 
1,03   21,460,00   21,46   3,66  15,00
sdc   0,00 0,000,00   78,00 0,00 21416,00   549,13 
3,20   42,820,00   42,82   3,08  24,00
sdd   0,0018,000,00  121,00 0,00 24859,00   410,89 
3,00   24,790,00   24,79   3,06  37,00
sde   0,00 0,000,000,00 0,00 0,00 0,00 
0,000,000,000,00   0,00   0,00
sdf   0,0015,000,00   75,00 0,00 12521,00   333,89 
2,12   28,270,00   28,27   3,47  26,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   2,510,006,521,380,00   89,59

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sda   0,0030,000,00   19,00 0,00   204,0021,47 
0,105,260,005,26   5,26  10,00
sdb   0,0023,000,00  105,00 0,00 18281,50   348,22 
3,92   38,670,00   38,67   3,33  35,00
sdc   0,00 0,000,00   31,00 0,00  8212,00   529,81 
0,89   28,710,00   28,71   3,87  12,00
sdd   0,00 0,000,00   45,00 0,00 12312,00   547,20 
1,35   30,000,00   30,00   3,78  17,00
sde   0,0017,000,00   42,00 0,00  4308,00   205,14 
1,14   27,140,00   27,14   3,33  14,00
sdf   0,00 0,000,00   45,00 0,00 12312,00   547,20 
1,33   29,560,00   29,56   3,78  17,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   2,280,004,310,000,00   93,41

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sda   0,00 0,000,000,00 0,00 0,00 0,00 
0,000,000,000,00   0,00   0,00
sdb   0,00 0,000,00   29,00 0,00  8204,00   565,79 
0,89   31,030,00   31,03   3,45  10,00
sdc   0,0021,000,00   85,00 0,00 12627,50   297,12 
2,66   31,290,00   31,29   2,94  25,00
sdd   0,00 0,000,00   16,00 0,00  4108,00   513,50 
0,45   28,120,00   28,12   4,38   7,00
sde   0,00 0,000,00   75,00 0,00 20520,00   547,20 
2,32   30,930,00   30,93   3,47  26,00
sdf   0,00 0,000,00   17,00 0,00  4112,00   483,76 
0,39   22,940,00   22,94   2,94   5,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   1,920,008,971,540,00   87,56

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sda   0,0051,000,00   32,00 0,00  1432,0089,50 
0,217,190,007,19   5,00  16,00
sdb   0,00 0,000,00   60,00 0,00 16416,00   547,20 
1,59   26,500,00   26,50   3,33  20,00
sdc   0,00 0,000,00   48,00 0,00 12324,00   513,50 
1,41   23,960,00   23,96   3,54  17,00
sdd   0,00 0,000,00   31,00 0,00  8212,00   529,81 
0,79   25,480,00   25,48   3,23  10,00
sde   0,00 0,000,00   66,00 0,00 17704,00   536,48 
2,96   40,760,00   40,76   3,79  25,00
sdf   0,00 0,000,00   46,00 0,00 12316,00   535,48 
1,33   28,910,00   28,91   3,91  18,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   2,290,005,221,660,00   90,83

Device: rrqm/s   wrqm/s r/s w/s

Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ?

2012-06-18 Thread Alexandre DERUMIER
Forgot to say:

The blktrace of the OSD was done with 15 OSDs on 3 nodes,
so the peaks and valleys could come from RBD block distribution.


I have done the same test with 1 OSD per node on 3 nodes;
I get around 60MB/s per disk (with the same behaviour).

So this is not a bottleneck.

I'm going to do some blktrace and seekwatcher movies with 1 OSD per node.



- Original Message - 

From: Alexandre DERUMIER aderum...@odiso.com 
To: Mark Nelson mark.nel...@inktank.com 
Cc: ceph-devel@vger.kernel.org 
Sent: Monday, 18 June 2012 16:50:57 
Subject: Re: iostat show constants write to osd disk with writeahead journal, 
normal behaviour ? 

Forgot to send the iostat -x 1 trace. 

(OSDs are on sdb, sdc, sdd, sde, sdf) 

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
w_await svctm %util 
sda 0,00 55,00 0,00 31,00 0,00 1468,00 94,71 0,21 6,77 0,00 6,77 5,16 16,00 
sdb 0,00 0,00 0,00 74,00 0,00 20516,00 554,49 2,74 38,78 0,00 38,78 3,51 26,00 
sdc 0,00 0,00 0,00 57,00 0,00 15520,00 544,56 1,77 28,60 0,00 28,60 3,68 21,00 
sdd 0,00 0,00 0,00 16,00 0,00 4108,00 513,50 0,52 32,50 0,00 32,50 4,38 7,00 
sde 0,00 0,00 0,00 15,00 0,00 4104,00 547,20 0,48 32,00 0,00 32,00 4,00 6,00 
sdf 0,00 0,00 0,00 46,00 0,00 12316,00 535,48 1,42 30,87 0,00 30,87 3,70 17,00 

avg-cpu: %user %nice %system %iowait %steal %idle 
1,55 0,00 7,09 1,16 0,00 90,21 

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
w_await svctm %util 
sda 0,00 37,00 0,00 20,00 0,00 236,00 23,60 0,12 6,00 0,00 6,00 5,00 10,00 
sdb 0,00 0,00 0,00 41,00 0,00 10780,00 525,85 1,03 21,46 0,00 21,46 3,66 15,00 
sdc 0,00 0,00 0,00 78,00 0,00 21416,00 549,13 3,20 42,82 0,00 42,82 3,08 24,00 
sdd 0,00 18,00 0,00 121,00 0,00 24859,00 410,89 3,00 24,79 0,00 24,79 3,06 
37,00 
sde 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 
sdf 0,00 15,00 0,00 75,00 0,00 12521,00 333,89 2,12 28,27 0,00 28,27 3,47 26,00 

avg-cpu: %user %nice %system %iowait %steal %idle 
2,51 0,00 6,52 1,38 0,00 89,59 

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
w_await svctm %util 
sda 0,00 30,00 0,00 19,00 0,00 204,00 21,47 0,10 5,26 0,00 5,26 5,26 10,00 
sdb 0,00 23,00 0,00 105,00 0,00 18281,50 348,22 3,92 38,67 0,00 38,67 3,33 
35,00 
sdc 0,00 0,00 0,00 31,00 0,00 8212,00 529,81 0,89 28,71 0,00 28,71 3,87 12,00 
sdd 0,00 0,00 0,00 45,00 0,00 12312,00 547,20 1,35 30,00 0,00 30,00 3,78 17,00 
sde 0,00 17,00 0,00 42,00 0,00 4308,00 205,14 1,14 27,14 0,00 27,14 3,33 14,00 
sdf 0,00 0,00 0,00 45,00 0,00 12312,00 547,20 1,33 29,56 0,00 29,56 3,78 17,00 

avg-cpu: %user %nice %system %iowait %steal %idle 
2,28 0,00 4,31 0,00 0,00 93,41 

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
w_await svctm %util 
sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 
sdb 0,00 0,00 0,00 29,00 0,00 8204,00 565,79 0,89 31,03 0,00 31,03 3,45 10,00 
sdc 0,00 21,00 0,00 85,00 0,00 12627,50 297,12 2,66 31,29 0,00 31,29 2,94 25,00 
sdd 0,00 0,00 0,00 16,00 0,00 4108,00 513,50 0,45 28,12 0,00 28,12 4,38 7,00 
sde 0,00 0,00 0,00 75,00 0,00 20520,00 547,20 2,32 30,93 0,00 30,93 3,47 26,00 
sdf 0,00 0,00 0,00 17,00 0,00 4112,00 483,76 0,39 22,94 0,00 22,94 2,94 5,00 

avg-cpu: %user %nice %system %iowait %steal %idle 
1,92 0,00 8,97 1,54 0,00 87,56 

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
w_await svctm %util 
sda 0,00 51,00 0,00 32,00 0,00 1432,00 89,50 0,21 7,19 0,00 7,19 5,00 16,00 
sdb 0,00 0,00 0,00 60,00 0,00 16416,00 547,20 1,59 26,50 0,00 26,50 3,33 20,00 
sdc 0,00 0,00 0,00 48,00 0,00 12324,00 513,50 1,41 23,96 0,00 23,96 3,54 17,00 
sdd 0,00 0,00 0,00 31,00 0,00 8212,00 529,81 0,79 25,48 0,00 25,48 3,23 10,00 
sde 0,00 0,00 0,00 66,00 0,00 17704,00 536,48 2,96 40,76 0,00 40,76 3,79 25,00 
sdf 0,00 0,00 0,00 46,00 0,00 12316,00 535,48 1,33 28,91 0,00 28,91 3,91 18,00 

avg-cpu: %user %nice %system %iowait %steal %idle 
2,29 0,00 5,22 1,66 0,00 90,83 

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
w_await svctm %util 
sda 0,00 51,00 0,00 30,00 0,00 1460,00 97,33 0,15 5,00 0,00 5,00 4,67 14,00 
sdb 0,00 0,00 0,00 45,00 0,00 12312,00 547,20 1,31 29,11 0,00 29,11 3,78 17,00 
sdc 0,00 0,00 0,00 29,00 0,00 8204,00 565,79 0,62 30,34 0,00 30,34 3,45 10,00 
sdd 0,00 0,00 0,00 33,00 0,00 8220,00 498,18 1,13 30,30 0,00 30,30 4,24 14,00 
sde 0,00 0,00 0,00 40,00 0,00 11028,00 551,40 0,91 29,50 0,00 29,50 3,50 14,00 
sdf 0,00 0,00 0,00 64,00 0,00 16432,00 513,50 1,69 26,41 0,00 26,41 3,91 25,00 

avg-cpu: %user %nice %system %iowait %steal %idle 
1,93 0,00 6,05 1,93 0,00 90,09 

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
w_await svctm %util 
sda 0,00 34,00 0,00 19,00 0,00 220,00 23,16 0,11 5,79 0,00 5,79 5,79 11,00 
sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 
sdc 0,00 0,00 0,00 45,00 0,00 12312,00 547,20 1,13 25,11 0,00 25,11 3,33 15,00 
sdd 0,00 25,00 0,00 110,00 0,00 20841,00 378,93 3,39 

Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ?

2012-06-18 Thread Mark Nelson

On 6/18/12 9:47 AM, Alexandre DERUMIER wrote:

Yeah, I just wanted to make sure that the constant writes weren't
because the filestore was falling behind. You may want to take a look
at some of the information that is provided by the admin socket for the
OSD while the test is running. dump_ops_in_flight, perf schema, and perf
dump are all useful.



I don't know which values to check in these big JSON responses ;)
But I have tried with more OSDs, so writes are split across more disks and are 
smaller, and the behaviour is the same.


No worries, there is a lot of data there!




root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok 
dump_ops_in_flight
{ num_ops: 1,
   ops: [
 { description: osd_op(client.4179.0:83 kvmtest1_1006560_object82 [write 
0~4194304] 3.9f5c55af),
   received_at: 2012-06-18 16:41:17.995167,
   age: 0.406678,
   flag_point: waiting for sub ops,
   client_info: { client: client.4179,
   tid: 83}}]}


root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok 
perfcounters_dump

{filestore:{journal_queue_max_ops:500,journal_queue_ops:0,journal_ops:2198,journal_queue_max_bytes:104857600,journal_queue_bytes:0,journal_bytes:1012769525,journal_latency:{avgcount:2198,sum:3.13569},op_queue_max_ops:500,op_queue_ops:0,ops:2198,op_queue_max_bytes:104857600,op_queue_bytes:0,bytes:1012757330,apply_latency:{avgcount:2198,sum:290.27},committing:0,commitcycle:59,commitcycle_interval:{avgcount:59,sum:300.04},commitcycle_latency:{avgcount:59,sum:4.76299},journal_full:0},osd:{opq:0,op_wip:0,op:127,op_in_bytes:532692449,op_out_bytes:0,op_latency:{avgcount:127,sum:49.2627},op_r:0,op_r_out_bytes:0,op_r_latency:{avgcount:0,sum:0},op_w:127,op_w_in_bytes:532692449,op_w_rlat:{avgcount:127,sum:0},op_w_latency:{avgcount:127,sum:49.2627},op_rw:0,op_rw_in_bytes:0,op_rw_out_bytes:0,op_rw_rlat:{avgcount:0,sum:0},op_rw_latency:{avgcount:0,sum:0},subop:114,subop_in_byte

s:478212311,subop_latency:{avgcount:114,sum:8.82174},subop_w:0,subop_w_in_bytes:478212311,subop_w_latency:{avgcount:114,sum:8.82174},subop_pull:0,subop_pull_latency:{avgcount:0,sum:0},subop_push:0,subop_push_in_bytes:0,subop_push_latency:{avgcount:0,sum:0},pull:0,push:0,push_out_bytes:0,recovery_ops:0,loadavg:0.47,buffer_bytes:0,numpg:423,numpg_primary:259,numpg_replica:164,numpg_stray:0,heartbeat_to_peers:10,heartbeat_from_peers:0,map_messages:34,map_message_epochs:44,map_message_epoch_dups:24},throttle-filestore_bytes:{val:0,max:104857600,get:0,get_sum:0,get_or_fail_fail:0,get_or_fail_success:0,take:2198,take_sum:1012769525,put:1503,put_sum:1012769525,wait:{avgcount:0,sum:0}},throttle-filestore_ops:{val:0,max:500,get:0,get_sum:0,get_or_fail_fail:0,get_or_fail_success:0,take:2198,take_sum:2198,put:1503,put_sum:2198,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_t
hrottler-client:{val:4194469,max:104857600,get:243,get_sum:536987810,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:242,put_sum:532793341,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-cluster:{val:0,max:104857600,get:1480,get_sum:482051948,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:1480,put_sum:482051948,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-hbclient:{val:0,max:104857600,get:1077,get_sum:50619,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:1077,put_sum:50619,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-hbserver:{val:0,max:104857600,get:972,get_sum:45684,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:972,put_sum:45684,wait:{avgcount:0,sum:0}},throttle-osd_client_bytes:{val:4194469,max:524288000,get:128,get_sum:536892019,get_or_fail_fail:0,get_or_fail_su
ccess:0,take:0,take_sum:0,put:254,put_sum:532697550,wait:{avgcount:0,sum:0}}}



root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok 
perfcounters_schema

{filestore:{journal_queue_max_ops:{type:2},journal_queue_ops:{type:2},journal_ops:{type:10},journal_queue_max_bytes:{type:2},journal_queue_bytes:{type:2},journal_bytes:{type:10},journal_latency:{type:5},op_queue_max_ops:{type:2},op_queue_ops:{type:2},ops:{type:10},op_queue_max_bytes:{type:2},op_queue_bytes:{type:2},bytes:{type:10},apply_latency:{type:5},committing:{type:2},commitcycle:{type:10},commitcycle_interval:{type:5},commitcycle_latency:{type:5},journal_full:{type:10}},osd:{opq:{type:2},op_wip:{type:2},op:{type:10},op_in_bytes:{type:10},op_out_bytes:{type:10},op_latency:{type:5},op_r:{type:10},op_r_out_bytes:{type:10},op_r_latency:{type:5},op_w:{type:10},op_w_in_bytes:{type:10},op_w_rlat:{type:5},op_w_latency:{type:5},op_rw:{type:10},op_rw_in_bytes:{type:10},op_rw_out_bytes:{type:10},op_rw_rlat:{type:5},op_rw_latency:{type:5},


[PATCH 0/6] ceph: a few more messenger cleanups

2012-06-18 Thread Alex Elder
Here are a few more messenger cleanup patches.

[PATCH 1/6] libceph: encapsulate out message data setup
[PATCH 2/6] libceph: encapsulate advancing msg page
These two encapsulate some code involved in sending message data
into separate functions.
[PATCH 3/6] libceph: don't mark footer complete before it is
This moves the setting of the FOOTER_COMPLETE flag so that it
doesn't get done until the footer really is complete.
[PATCH 4/6] libceph: move init_bio_*() functions up
This simply moves two functions, preparing for the next patch.
[PATCH 5/6] libceph: move init of bio_iter
This makes a message's bio_iter field get initialized when
the rest of the message is initialized, rather than conditionally
every time any attempt is made to send message data.
[PATCH 6/6] libceph: don't use bio_iter as a flag
Because bio_iter is now initialized in the right place we no
longer need to use its value as a flag to determine whether it
needs initialization.

-Alex


Re: all rbd users: set 'filestore fiemap = false'

2012-06-18 Thread Sage Weil
On Mon, 18 Jun 2012, Christoph Hellwig wrote:
 On Sun, Jun 17, 2012 at 09:02:15PM -0700, Sage Weil wrote:
  that data over the wire.  We have observed incorrect/changing FIEMAP on 
  both btrfs:
 
 both btrfs and?

Whoops, it was XFS.  :/ 

 Btw, btrfs had SEEK_HOLE/SEEK_DATA which are a lot more useful for this
 kind of operations, and xfs has added support for it as well now.

Yeah, started looking at that last night.  (This code predates SEEK_HOLE.)

sage


[PATCH 2/6] libceph: encapsulate advancing msg page

2012-06-18 Thread Alex Elder
In write_partial_msg_pages(), once all the data from a page has been
sent we advance to the next one.  Put the code that takes care of
this into its own function.

While modifying write_partial_msg_pages(), make its local variable
in_trail be Boolean, and use the local variable msg (which is
just the connection's current out_msg pointer) consistently.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |   58
+--
 1 file changed, 34 insertions(+), 24 deletions(-)

Index: b/net/ceph/messenger.c
===
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -915,6 +915,33 @@ static void iter_bio_next(struct bio **b
 }
 #endif

+static void out_msg_pos_next(struct ceph_connection *con, struct page
*page,
+   size_t len, size_t sent, bool in_trail)
+{
+   struct ceph_msg *msg = con-out_msg;
+
+   BUG_ON(!msg);
+   BUG_ON(!sent);
+
+   con-out_msg_pos.data_pos += sent;
+   con-out_msg_pos.page_pos += sent;
+   if (sent == len) {
+   con-out_msg_pos.page_pos = 0;
+   con-out_msg_pos.page++;
+   con-out_msg_pos.did_page_crc = false;
+   if (in_trail)
+   list_move_tail(page-lru,
+  msg-trail-head);
+   else if (msg-pagelist)
+   list_move_tail(page-lru,
+  msg-pagelist-head);
+#ifdef CONFIG_BLOCK
+   else if (msg-bio)
+   iter_bio_next(msg-bio_iter, msg-bio_seg);
+#endif
+   }
+}
+
 /*
  * Write as much message data payload as we can.  If we finish, queue
  * up the footer.
@@ -930,11 +957,11 @@ static int write_partial_msg_pages(struc
bool do_datacrc = !con-msgr-nocrc;
int ret;
int total_max_write;
-   int in_trail = 0;
+   bool in_trail = false;
size_t trail_len = (msg-trail ? msg-trail-length : 0);

dout(write_partial_msg_pages %p msg %p page %d/%d offset %d\n,
-con, con-out_msg, con-out_msg_pos.page, con-out_msg-nr_pages,
+con, msg, con-out_msg_pos.page, msg-nr_pages,
 con-out_msg_pos.page_pos);

 #ifdef CONFIG_BLOCK
@@ -958,13 +985,12 @@ static int write_partial_msg_pages(struc

/* have we reached the trail part of the data? */
if (con-out_msg_pos.data_pos = data_len - trail_len) {
-   in_trail = 1;
+   in_trail = true;

total_max_write = data_len - con-out_msg_pos.data_pos;

page = list_first_entry(msg-trail-head,
struct page, lru);
-   max_write = PAGE_SIZE;
} else if (msg-pages) {
page = msg-pages[con-out_msg_pos.page];
} else if (msg-pagelist) {
@@ -988,14 +1014,14 @@ static int write_partial_msg_pages(struc
if (do_datacrc  !con-out_msg_pos.did_page_crc) {
void *base;
u32 crc;
-   u32 tmpcrc = le32_to_cpu(con-out_msg-footer.data_crc);
+   u32 tmpcrc = le32_to_cpu(msg-footer.data_crc);
char *kaddr;

kaddr = kmap(page);
BUG_ON(kaddr == NULL);
base = kaddr + con-out_msg_pos.page_pos + bio_offset;
crc = crc32c(tmpcrc, base, len);
-   con-out_msg-footer.data_crc = cpu_to_le32(crc);
+   msg-footer.data_crc = cpu_to_le32(crc);
con-out_msg_pos.did_page_crc = true;
}
ret = ceph_tcp_sendpage(con-sock, page,
@@ -1008,30 +1034,14 @@ static int write_partial_msg_pages(struc
	if (ret <= 0)
	goto out;

-   con->out_msg_pos.data_pos += ret;
-   con->out_msg_pos.page_pos += ret;
-   if (ret == len) {
-   con->out_msg_pos.page_pos = 0;
-   con->out_msg_pos.page++;
-   con->out_msg_pos.did_page_crc = false;
-   if (in_trail)
-   list_move_tail(&page->lru,
-  &msg->trail->head);
-   else if (msg->pagelist)
-   list_move_tail(&page->lru,
-  &msg->pagelist->head);
-#ifdef CONFIG_BLOCK
-   else if (msg->bio)
-   iter_bio_next(&msg->bio_iter, &msg->bio_seg);
-#endif
-   }
+   out_msg_pos_next(con, page, len, (size_t) ret, in_trail);
}

	dout("write_partial_msg_pages %p msg %p done\n", con, msg);

/* prepare and queue up footer, too */
  

[PATCH 4/6] libceph: move init_bio_*() functions up

2012-06-18 Thread Alex Elder
Move init_bio_iter() and iter_bio_next() up in their source file so
they'll be defined before they're needed.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |   50
+-
 1 file changed, 25 insertions(+), 25 deletions(-)

Index: b/net/ceph/messenger.c
===
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -590,6 +590,31 @@ static void con_out_kvec_add(struct ceph
	con->out_kvec_bytes += size;
 }

+#ifdef CONFIG_BLOCK
+static void init_bio_iter(struct bio *bio, struct bio **iter, int *seg)
+{
+   if (!bio) {
+   *iter = NULL;
+   *seg = 0;
+   return;
+   }
+   *iter = bio;
+   *seg = bio->bi_idx;
+}
+
+static void iter_bio_next(struct bio **bio_iter, int *seg)
+{
+   if (*bio_iter == NULL)
+   return;
+
+   BUG_ON(*seg >= (*bio_iter)->bi_vcnt);
+
+   (*seg)++;
+   if (*seg == (*bio_iter)->bi_vcnt)
+   init_bio_iter((*bio_iter)->bi_next, bio_iter, seg);
+}
+#endif
+
 static void prepare_write_message_data(struct ceph_connection *con)
 {
	struct ceph_msg *msg = con->out_msg;
@@ -892,31 +917,6 @@ out:
return ret;  /* done! */
 }

-#ifdef CONFIG_BLOCK
-static void init_bio_iter(struct bio *bio, struct bio **iter, int *seg)
-{
-   if (!bio) {
-   *iter = NULL;
-   *seg = 0;
-   return;
-   }
-   *iter = bio;
-   *seg = bio->bi_idx;
-}
-
-static void iter_bio_next(struct bio **bio_iter, int *seg)
-{
-   if (*bio_iter == NULL)
-   return;
-
-   BUG_ON(*seg >= (*bio_iter)->bi_vcnt);
-
-   (*seg)++;
-   if (*seg == (*bio_iter)->bi_vcnt)
-   init_bio_iter((*bio_iter)->bi_next, bio_iter, seg);
-}
-#endif
-
 static void out_msg_pos_next(struct ceph_connection *con, struct page *page,
 	size_t len, size_t sent, bool in_trail)
 {

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/6] libceph: move init of bio_iter

2012-06-18 Thread Alex Elder
If a message has a non-null bio pointer, its bio_iter field is
initialized in write_partial_msg_pages() if this has not been done
already.  This is really a one-time setup operation for sending a
message's (bio) data, so move that initialization code into
prepare_write_message_data() which serves that purpose.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

Index: b/net/ceph/messenger.c
===
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -627,6 +627,10 @@ static void prepare_write_message_data(s
	con->out_msg_pos.page_pos = msg->page_alignment;
	else
	con->out_msg_pos.page_pos = 0;
+#ifdef CONFIG_BLOCK
+   if (msg->bio && !msg->bio_iter)
+   init_bio_iter(msg->bio, &msg->bio_iter, &msg->bio_seg);
+#endif
	con->out_msg_pos.data_pos = 0;
	con->out_msg_pos.did_page_crc = false;
	con->out_more = 1;  /* data + footer will follow */
@@ -966,11 +970,6 @@ static int write_partial_msg_pages(struc
	 con, msg, con->out_msg_pos.page, msg->nr_pages,
	 con->out_msg_pos.page_pos);

-#ifdef CONFIG_BLOCK
-   if (msg->bio && !msg->bio_iter)
-   init_bio_iter(msg->bio, &msg->bio_iter, &msg->bio_seg);
-#endif
-
	while (data_len > con->out_msg_pos.data_pos) {
struct page *page = NULL;
int max_write = PAGE_SIZE;

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/6] libceph: don't use bio_iter as a flag

2012-06-18 Thread Alex Elder
Recently a bug was fixed in which the bio_iter field in a ceph
message was not being properly re-initialized when a message got
re-transmitted:
commit 43643528cce60ca184fe8197efa8e8da7c89a037
Author: Yan, Zheng zheng.z@intel.com
rbd: Clear ceph_msg->bio_iter for retransmitted message

We are now only initializing the bio_iter field when we are about to
start to write message data (in prepare_write_message_data()),
rather than every time we are attempting to write any portion of the
message data (in write_partial_msg_pages()).  This means we no
longer need to use the msg-bio_iter field as a flag.

So just don't do that any more.  Trust prepare_write_message_data()
to ensure msg-bio_iter is properly initialized, every time we are
about to begin writing (or re-writing) a message's bio data.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

Index: b/net/ceph/messenger.c
===
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -628,7 +628,7 @@ static void prepare_write_message_data(s
else
	con->out_msg_pos.page_pos = 0;
 #ifdef CONFIG_BLOCK
-   if (msg->bio && !msg->bio_iter)
+   if (msg->bio)
	init_bio_iter(msg->bio, &msg->bio_iter, &msg->bio_seg);
 #endif
	con->out_msg_pos.data_pos = 0;
@@ -696,10 +696,6 @@ static void prepare_write_message(struct
	m->hdr.seq = cpu_to_le64(++con->out_seq);
	m->needs_out_seq = false;
	}
-#ifdef CONFIG_BLOCK
-   else
-   m->bio_iter = NULL;
-#endif

	dout("prepare_write_message %p seq %lld type %d len %d+%d+%d %d pgs\n",
	 m, con->out_seq, le16_to_cpu(m->hdr.type),
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ?

2012-06-18 Thread Alexandre DERUMIER
Hrm, look at your journal_queue_max_ops, journal_queue_max_bytes, 
op_queue_max_ops, and op_queue_max_bytes. Looks like you are set at 500 
ops and a maximum of 100MB. With 1GigE you'd be able to max out the 
data in the journal really fast. Try tweaking these up and see what 
happens. 

test was made with 15 osd, each osd with 1GB journal.

(so 1 Gbit/s is roughly 100 MB/s; *3 with replication = 300 MB/s; /15 osds = 20 MB/s per osd, which is around 
what I see with iostat)
with a 1GB journal, it should handle around 50s of writes.




I have redone a test, with 1 osd on 3 nodes with 8GB journal (write around 
60-80MB/S on each osd)

journal_queue_max_bytes show again 100MB
journal_queue_max_ops = 500
but
journal_ops = 6500 
journal_queue_ops = 0
journal_queue_bytes = 0
(I have done perfcounters_dump each second and 
journal_queue_ops,journal_queue_bytes are always 0)

op_queue_max_bytes:100MB
op_queue_max_ops:500
(what are op_ counters ? osd counter ?)

Shouldn't the queue values be as low as possible? (0 queue = 0 bottleneck)



root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok 
perfcounters_dump
{filestore:{journal_queue_max_ops:500,journal_queue_ops:0,journal_ops:6554,journal_queue_max_bytes:104857600,journal_queue_bytes:0,journal_bytes:6624795873,journal_latency:{avgcount:6554,sum:11.5094},op_queue_max_ops:500,op_queue_ops:0,ops:6554,op_queue_max_bytes:104857600,op_queue_bytes:0,bytes:6624755213,apply_latency:{avgcount:6554,sum:4462.6},committing:0,commitcycle:143,commitcycle_interval:{avgcount:143,sum:736.741},commitcycle_latency:{avgcount:143,sum:17.2976},journal_full:0},osd:{opq:0,op_wip:1,op:838,op_in_bytes:3514930636,op_out_bytes:0,op_latency:{avgcount:838,sum:201.494},op_r:0,op_r_out_bytes:0,op_r_latency:{avgcount:0,sum:0},op_w:838,op_w_in_bytes:3514930636,op_w_rlat:{avgcount:838,sum:0},op_w_latency:{avgcount:838,sum:201.494},op_rw:0,op_rw_in_bytes:0,op_rw_out_bytes:0,op_rw_rlat:{avgcount:0,sum:0},op_rw_latency:{avgcount:0,sum:0},subop:739,subop_in_bytes:3099988795,subop_latency:{avgcount:739,sum:45.7711},subop_w:0,subop_w_in_bytes:3099988795,subop_w_latency:{avgcount:739,sum:45.7711},subop_pull:0,subop_pull_latency:{avgcount:0,sum:0},subop_push:0,subop_push_in_bytes:0,subop_push_latency:{avgcount:0,sum:0},pull:0,push:0,push_out_bytes:0,recovery_ops:0,loadavg:0.56,buffer_bytes:0,numpg:1387,numpg_primary:701,numpg_replica:686,numpg_stray:0,heartbeat_to_peers:2,heartbeat_from_peers:0,map_messages:18,map_message_epochs:37,map_message_epoch_dups:31},throttle-filestore_bytes:{val:0,max:104857600,get:0,get_sum:0,get_or_fail_fail:0,get_or_fail_success:0,take:6554,take_sum:6624795873,put:6078,put_sum:6624795873,wait:{avgcount:0,sum:0}},throttle-filestore_ops:{val:0,max:500,get:0,get_sum:0,get_or_fail_fail:0,get_or_fail_success:0,take:6554,take_sum:6554,put:6078,put_sum:6554,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-client:{val:0,max:104857600,get:1076,get_sum:3523503185,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:1076,put_sum:3523503185,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-cluster:{val:0,max:104857600,get:5006,get_sum:3103900299,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:5006,put_sum:3103900299,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-hbclient:{val:0,max:104857600,get:478,get_sum:22466,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:478,put_sum:22466,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-hbserver:{val:0,max:104857600,get:484,get_sum:22748,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:484,put_sum:22748,wait:{avgcount:0,sum:0}},throttle-osd_client_bytes:{val:0,max:524288000,get:840,get_sum:3523353965,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:1679,put_sum:3523353965,wait:{avgcount:0,sum:0}}}
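
For what it's worth, a minimal way to watch exactly those two counters over time is to poll the admin socket in a loop and pull the fields out of the JSON. A rough sketch (same asok as the command above; the counter names are the ones from the dump):

while sleep 1; do
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perfcounters_dump | \
        python -c 'import json,sys; f = json.load(sys.stdin)["filestore"]; print f["journal_queue_ops"], f["journal_queue_bytes"]'
done

If both stay at 0 while the data disk is busy, the journal queue itself is not what is throttling the writes.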




- Mail original - 

De: Mark Nelson mark.nel...@inktank.com 
À: Alexandre DERUMIER aderum...@odiso.com 
Cc: ceph-devel@vger.kernel.org 
Envoyé: Lundi 18 Juin 2012 17:16:17 
Objet: Re: iostat show constants write to osd disk with writeahead journal, 
normal behaviour ? 

On 6/18/12 9:47 AM, Alexandre DERUMIER wrote: 
 Yeah, I just wanted to make sure that the constant writes weren't 
 because the filestore was falling behind. You may want to take a look 
 at some of the information that is provided by the admin socket for the 
 OSD while the test is running. dump_ops_in_flight, perf schema, and perf 
 dump are all useful. 
 
 
 don't know which values to check in these big json responses ;) 
 But I have tried with more osds, so writes are split across more disks and 
 the writes are smaller, and the behaviour is the same 

No worries, there is a lot of data there! 

 
 
 root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok 
 dump_ops_in_flight 
 { num_ops: 1, 
 ops: [ 
 { description: osd_op(client.4179.0:83 kvmtest1_1006560_object82 [write 
 0~4194304] 3.9f5c55af), 
 received_at: 2012-06-18 16:41:17.995167, 
 age: 0.406678, 
 flag_point: waiting for sub ops, 
 

Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ?

2012-06-18 Thread Tommi Virtanen
On Mon, Jun 18, 2012 at 5:34 AM, Alexandre DERUMIER aderum...@odiso.com wrote:
 I'm doing test with rados bench, and I see constant writes to osd disks.
 Is this the normal behaviour? With the writeahead journal, shouldn't writes occur only every 20-30 
 seconds?

Is the osd data filesystem perhaps doing atime updates? noatime is your friend.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ?

2012-06-18 Thread Alexandre DERUMIER
noatime and nodiratime are already enabled

cat /etc/fstab

/dev/sdb   /srv/osd.0  xfs noatime,nodiratime  0   0


(drive was formatted simply with mkfs.xfs /dev/sdb)
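
One thing worth double-checking: an fstab entry only takes effect when the filesystem is actually mounted through fstab. The live mount flags can be confirmed from /proc/mounts (a quick sketch, using the mount point from the line above):

grep /srv/osd.0 /proc/mounts

If noatime/nodiratime are missing from that line, the disk was mounted by hand without the options and the fstab entry never applied.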




- Mail original - 

De: Tommi Virtanen t...@inktank.com 
À: Alexandre DERUMIER aderum...@odiso.com 
Cc: ceph-devel@vger.kernel.org 
Envoyé: Lundi 18 Juin 2012 18:01:52 
Objet: Re: iostat show constants write to osd disk with writeahead journal, 
normal behaviour ? 

On Mon, Jun 18, 2012 at 5:34 AM, Alexandre DERUMIER aderum...@odiso.com 
wrote: 
 I'm doing test with rados bench, and I see constant writes to osd disks. 
 Is this the normal behaviour? With the writeahead journal, shouldn't writes occur only every 20-30 
 seconds? 

Is the osd data filesystem perhaps doing atime updates? noatime is your friend. 



-- 

-- 




Alexandre D erumier 
Ingénieur Système 
Fixe : 03 20 68 88 90 
Fax : 03 20 68 90 81 
45 Bvd du Général Leclerc 59100 Roubaix - France 
12 rue Marivaux 75002 Paris - France 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD layering design draft

2012-06-18 Thread Tommi Virtanen
On Fri, Jun 15, 2012 at 5:46 PM, Sage Weil s...@inktank.com wrote:
 Is 'preserve' and 'unpreserve' the verbiage we want to use here?  Not sure
 I have a better suggestion, but preserve is unusual.

protect/unprotect? The flag protects the image snapshot from being deleted.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: sync Ceph packaging efforts for Debian/Ubuntu

2012-06-18 Thread Sage Weil
On Mon, 18 Jun 2012, James Page wrote:
 Hi Sage/Laszlo
 
 Laszlo - thanks for sending the original email - I'd like to get
 everything as closely in-sync as possible between the three packaging
 sources as well.
 
 On 16/06/12 22:50, Sage Weil wrote:
  I've take a closer look at these patches, and have a few questions.
  
  - The URL change and nss patches I've applied; they are in the ceph.git 
  'debian' branch.
 
 Great!
 
  
  - Has the leveldb patch been sent upstream?  Once it is committed to 
  the upstream git, we can update ceph to use it; that's nicer than carrying 
  the patch.  However, I thought you needed to link against the existing
  libleveldb1 package... which means we shouldn't do anything on our
  side,
  right?
 
 I can't see any evidence that this has been sent upstream; ideally we
 would be building against libleveldb1 rather than using the embedded
 copy - I'm not familiar with the reason that this has not happened
 already (if there is one).  This package would also need to be reviewed
 for inclusion in main if that was the case.

We bundled it for expediency, that's all.  I just sent the patch off to 
the leveldb mailing list (in case that hadn't happened yet); we'll see if 
they apply it.

  - I'm not sure how useful it is to break mount.ceph and cephfs into a 
  separate ceph-fs-common package, but we can do it.  Same goes for a 
  separate package for ceph-mds.  That was originally motivated by ubuntu 
  not wanting the mds in main, but in the end only the libraries went in, so 
  it's a moot point.  I'd rather hear from them what their intentions are 
  for 12.10 before complicating things...
 
 ceph-fs-common is in Ubuntu main; so I think the original motivation
 still stands IMHO.

Okay, split that part.

 For the Ubuntu quantal cycle we still have the same primary objective as
 we had during 12.04; namely ensuring that Ceph RBD can be used as a
 block store for qemu-kvm which ties nicely into the Ubuntu OpenStack
 story through Cinder; In addition we will be looking at Ceph RADOS as a
 backend for Glance (see [0] for more details).

I'm reading this to mean that you still want the mds separated out; did 
that too.

 The MIR for Ceph occurred quite late in the 12.04 cycle so we had to
 trim the scope to actually get it done; We will be looking at libfcgi
 and google-perftools this cycle for main inclusion to re-enable the
 components that are currently disabled in the Ubuntu packaging.

Including those (and libleveldb1) would be ideal.

  - That same patch also switched all the Architecture: lines back to 
  linux-any.  Was that intentional?  I just changed them from that last 
  week.
 
 I think linux-any is correct - the change you have made would exclude
 the PPC architecture in Ubuntu and Debian.
 
 [...]
 
  Ben, James, can you please share in some sentences why ceph-fuse is
  dropped in Ubuntu? Do you need it Sage? If it's feasible, you may drop
  that as well.
 
 There is an outstanding question on the 12.04 MIR as to whether this
 package could still be built but not promoted to main - I'll follow up
 with the MIR reviewer as to whether that's possible as I don't think it
 requires any additional build dependencies.
 
 [...]
 
 I hope that explains the Ubuntu position on Ceph and what plans we have
 this development cycle.

Okay, keep us posted! 

I pushed a new 'debian' branch with those changes; please take a look and 
let me know if it looks okay.

Thanks-
sage
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD layering design draft

2012-06-18 Thread Tommi Virtanen
On Fri, Jun 15, 2012 at 1:48 PM, Josh Durgin josh.dur...@inktank.com wrote:
    $ rbd unpreserve pool/image@snap
    Error unpreserving: child images rely on this image

UX nit: this should also say what image it found.

rbd: Cannot unpreserve: Still in use by pool2/image2

    $ rbd list_children pool/image@snap
    pool2/child1
    pool2/child2

How about just rbd children? Especially the underscore makes me unhappy.

    $ rbd copyup pool2/child1

Does copyup make sense to everyone? Every time you say it, my brain
needs to flip the image inside the other way around -- I naturally
imagine a tree with the parent at the top, and children and
grandchildren down from it, but then I can't call that operation
copyup without wrecking my mental image.

I also can't seem to google good evidence that the term would be in
widespread use in the enterprisey block storage world, outside of the
unionfs world.. What do people call the un-dedupping, un-thinning of
copy-on-write thin provisioning?

unshare?

 In addition to knowing which parent a given image has, we want to be
 able to tell if a preserved image still has children. This is
 accomplished with a new per-pool object, `rbd_children`, which maps
 (parent pool, parent id, parent snapshot id) to a list of child
 image ids.

So the omap value is a list, and you need to support atomic add/remove
on the list members? Are you thinking of using an rbd class method
that does read-modify-write for that?

My instincts would have gone for (parent_pool, parent_id,
parent_snapshot_id, child_id) -> None, to get atomic operations for
free.
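
To make the two layouts concrete, here is a rough sketch in terms of raw omap keys on the per-pool object, written with the rados omap subcommands (assuming a build that has setomapval/rmomapkey/listomapvals; the key strings are purely illustrative, not a proposed encoding):

# (a) one key per (parent pool, parent id, parent snap id, child id) with an
#     empty value: adding or removing a child is a single atomic omap update
rados -p pool2 setomapval rbd_children pool/parent_id/snap_id/child1_id ''
rados -p pool2 rmomapkey rbd_children pool/parent_id/snap_id/child1_id

# (b) one key per (parent pool, parent id, parent snap id) whose value is the
#     whole child list: every update is a read-modify-write, so it has to go
#     through a class method (or hold a lock) to stay atomic
rados -p pool2 listomapvals rbd_children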
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD layering design draft

2012-06-18 Thread Josh Durgin

On 06/18/2012 10:00 AM, Tommi Virtanen wrote:

On Fri, Jun 15, 2012 at 1:48 PM, Josh Durginjosh.dur...@inktank.com  wrote:

$ rbd unpreserve pool/image@snap
Error unpreserving: child images rely on this image


UX nit: this should also say what image it found.

rbd: Cannot unpreserve: Still in use by pool2/image2


Agreed.


$ rbd list_children pool/image@snap
pool2/child1
pool2/child2


How about just rbd children? Especially the underscore makes me unhappy.


Yeah, that sounds better.


$ rbd copyup pool2/child1


Does copyup make sense to everyone? Every time you say it, my brain
needs to flip the image inside the other way around -- I naturally
imagine a tree with the parent at the top, and children and
grandchildren down from it, but then I can't call that operation
copyup without wrecking my mental image.

I also can't seem to google good evidence that the term would be in
widespread use in the enterprisey block storage world, outside of the
unionfs world.. What do people call the un-dedupping, un-thinning of
copy-on-write thin provisioning?

unshare?


I'm not sure what best term is, but there's probably something better 
than copyup.



In addition to knowing which parent a given image has, we want to be
able to tell if a preserved image still has children. This is
accomplished with a new per-pool object, `rbd_children`, which maps
(parent pool, parent id, parent snapshot id) to a list of child
image ids.


So the omap value is a list, and you need to support atomic add/remove
on the list members? Are you thinking of using an rbd class method
that does read-modify-write for that?

My instincts would have gone for (parent_pool, parent_id,
parent_snapshot_id, child_id) -> None, to get atomic operations for
free.


The reason for making it a class method is more about hiding the
implementation from clients. It could be the mapping you describe in
an omap.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD hotplugging Chef cookbook (chef-1)

2012-06-18 Thread Tommi Virtanen
On Thu, Jun 14, 2012 at 6:32 AM, Danny Kukawka danny.kuka...@bisect.de wrote:
 And where can I find this branch? I've checked the git repo at:

        https://github.com/ceph/ceph-cookbooks

 But couldn't find any branch called ceph-1.

Use master of ceph-cookbooks.git, and where those instructions said
ceph-1, put in master (= use master of ceph.git).

The instructions from that email are being distilled into proper
documentation at

http://ceph.com/docs/master/install/chef/
http://ceph.com/docs/master/config-cluster/chef/
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Ceph performance on Ubuntu Oneiric vs Ubuntu Precise

2012-06-18 Thread Mark Nelson

Hi Guys,

I've been tracking down some performance issues over the past month with 
our internal test nodes and believe I have narrowed it down to something 
related to Ubuntu Oneiric.  Tests done on nodes running Ubuntu Precise 
are significantly faster.


One of the major differences between the releases is the support for 
syncfs in libc.  Theoretically this shouldn't have a big effect on btrfs 
so I'm not totally sure that this is the culprit.  Having said that, 
previous tests showed good SSD performance on Oneiric leading me to 
believe the lower latency mitigates the effect.  Some of spinning disk 
seekwatcher results for Oneiric are quite strange with long periods of 
inactivity on the OSD data disks.
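
For anyone who wants to check their own boxes, a quick way to see whether the installed libc exports a syncfs wrapper and whether the running kernel implements the syscall (a sketch; the libc path is the usual multiarch location on these releases and may differ elsewhere):

nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep -w syncfs
grep -w sys_syncfs /proc/kallsyms

glibc only grew the wrapper in 2.14, which is presumably why Oneiric and Precise behave differently here even on the same 3.4 kernel.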


I wanted to post these results for those of you who have had performance 
problems in the past.  If you are continuing to have issues, you may 
want to try testing on precise and see if you notice any changes.  It is 
possible that all of this could be specific to our internal testing 
nodes, so I wouldn't mind hearing if other people have seen similar 
behavior.


These tests were done using rados bench with 16 concurrent requests. 
There are two nodes that each have a single 7200rpm OSD data disk and 
journal on a second 7200rpm disk.  Replication is set at the default 
level (2).  Kernel is 3.4 in all cases.


Here's a run down (Numbers are MB/s)

4KB Requests

                        BTRFS   EXT4    XFS
Ceph 0.46/Oneiric:      0.073   0.694   0.723
Ceph 0.46/Precise:      2.15    2.031   1.546
Ceph 0.47.2/Oneiric:    1.072   0.836   0.749
Ceph 0.47.2/Precise:    2.566   2.579   1.498

128KB Requests:

                        BTRFS   EXT4    XFS
Ceph 0.46/Oneiric:      11.874  20.066  12.641
Ceph 0.46/Precise:      49.304  39.736  38.982
Ceph 0.47.2/Oneiric:    13.81   19.05   12.739
Ceph 0.47.2/Precise:    47.943  49.655  36.764


4MB Requests:

                        BTRFS   EXT4    XFS
Ceph 0.46/Oneiric:      110.202 26.58   15.445
Ceph 0.46/Precise:      135.975 128.759 106.426
Ceph 0.47.2/Oneiric:    91.337  46.277  23.897
Ceph 0.47.2/Precise:    136.906 134.955 106.545

I've posted seekwatcher results for all of the tests:

Ceph 0.46/Oneiric:      http://nhm.ceph.com/movies/sprint/test2
Ceph 0.46/Precise:      http://nhm.ceph.com/movies/sprint/test3
Ceph 0.47.2/Oneiric:    http://nhm.ceph.com/movies/sprint/test4
Ceph 0.47.2/Precise:    http://nhm.ceph.com/movies/sprint/test5

Mark
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD layering design draft

2012-06-18 Thread Sage Weil
On Mon, 18 Jun 2012, Josh Durgin wrote:
   $ rbd copyup pool2/child1
  
  Does copyup make sense to everyone? Every time you say it, my brain
  needs to flip the image inside the other way around -- I naturally
  imagine a tree with the parent at the top, and children and
  grandchildren down from it, but then I can't call that operation
  copyup without wrecking my mental image.
  
  I also can't seem to google good evidence that the term would be in
  widespread use in the enterprisey block storage world, outside of the
  unionfs world.. What do people call the un-dedupping, un-thinning of
  copy-on-write thin provisioning?
  
  unshare?
 
 I'm not sure what best term is, but there's probably something better than
 copyup.

flatten?  My mental model is stuck on the layering analogy, where the 
child is a copy-on-write layer on top of a read-only parent.

Someday we may want to support the ability to add a parent to an existing 
image and do a sort of dedup, so having an opposite for whatever term we 
pick would be a bonus.

sage
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph performance on Ubuntu Oneiric vs Ubuntu Precise

2012-06-18 Thread Gregory Farnum
Do I correctly assume that these nodes hosted only the OSDs, and the
monitors were on a separate node?

On Mon, Jun 18, 2012 at 10:56 AM, Mark Nelson mark.nel...@inktank.com wrote:
 Hi Guys,

 I've been tracking down some performance issues over the past month with our
 internal test nodes and believe I have narrowed it down to something related
 to Ubuntu Oneiric.  Tests done on nodes running Ubuntu Precise are
 significantly faster.

 One of the major differences between the releases is the support for syncfs
 in libc.  Theoretically this shouldn't have a big effect on btrfs so I'm not
 totally sure that this is the culprit.  Having said that, previous tests
 showed good SSD performance on Oneiric leading me to believe the lower
 latency mitigates the effect.  Some of spinning disk seekwatcher results for
 Oneiric are quite strange with long periods of inactivity on the OSD data
 disks.

 I wanted to post these results for those of you who have had performance
 problems in the past.  If you are continuing to have issues, you may want to
 try testing on precise and see if you notice any changes.  It is possible
 that all of this could be specific to our internal testing nodes, so I
 wouldn't mind hearing if other people have seen similar behavior.

 These tests were done using rados bench with 16 concurrent requests. There
 are two nodes that each have a single 7200rpm OSD data disk and journal on a
 second 7200rpm disk.  Replication is set at the default level (2).  Kernel
 is 3.4 in all cases.

 Here's a run down (Numbers are MB/s)

 4KB Requests

                        BTRFS   EXT4    XFS
 Ceph 0.46/Oneiric:      0.073   0.694   0.723
 Ceph 0.46/Precise:      2.15    2.031   1.546
 Ceph 0.47.2/Oneiric:    1.072   0.836   0.749
 Ceph 0.47.2/Precise:    2.566   2.579   1.498

 128KB Requests:

                        BTRFS   EXT4    XFS
 Ceph 0.46/Oneiric:      11.874  20.066  12.641
 Ceph 0.46/Precise:      49.304  39.736  38.982
 Ceph 0.47.2/Oneiric:    13.81   19.05   12.739
 Ceph 0.47.2/Precise:    47.943  49.655  36.764


 4MB Requests:

                        BTRFS   EXT4    XFS
 Ceph 0.46/Oneiric:      110.202 26.58   15.445
 Ceph 0.46/Precise:      135.975 128.759 106.426
 Ceph 0.47.2/Oneiric:    91.337  46.277  23.897
 Ceph 0.47.2/Precise:    136.906 134.955 106.545

 I've posted seekwatcher results for all of the tests:

 Ceph 0.46/Oneiric:      http://nhm.ceph.com/movies/sprint/test2
 Ceph 0.46/Precise:      http://nhm.ceph.com/movies/sprint/test3
 Ceph 0.47.2/Oneiric:    http://nhm.ceph.com/movies/sprint/test4
 Ceph 0.47.2/Precise:    http://nhm.ceph.com/movies/sprint/test5

 Mark
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD layering design draft

2012-06-18 Thread Gregory Farnum
Locking is a separate mechanism we're already working on, which will
lock images so that they can't accidentally be mounted at more than
one location. :)
-Greg

On Sun, Jun 17, 2012 at 6:42 AM, Martin Mailand mar...@tuxadero.com wrote:
 Hi,
 what's up locked, unlocked, unlocking?

 -martin

 Am 16.06.2012 17:11, schrieb Sage Weil:

 On Fri, 15 Jun 2012, Yehuda Sadeh wrote:

 On Fri, Jun 15, 2012 at 5:46 PM, Sage Weils...@inktank.com  wrote:

 Looks good!  Couple small things:

     $ rbd unpreserve pool/image@snap


 Is 'preserve' and 'unpreserve' the verbiage we want to use here?  Not
 sure
 I have a better suggestion, but preserve is unusual.


 freeze, thaw/unfreeze?


 Freeze/thaw usually mean something like quiesce I/O or read-only, usually
  temporarily.  What we actually mean is you can't delete this.  Maybe
 pin/unpin?  preserve/unpreserve may be fine, too!

 sage

 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph performance on Ubuntu Oneiric vs Ubuntu Precise

2012-06-18 Thread Mark Nelson

Hi Greg,

Yep, 3 monitors each on their own node.

Mark

On 06/18/2012 01:04 PM, Gregory Farnum wrote:

Do I correctly assume that these nodes hosted only the OSDs, and the
monitors were on a separate node?

On Mon, Jun 18, 2012 at 10:56 AM, Mark Nelsonmark.nel...@inktank.com  wrote:

Hi Guys,

I've been tracking down some performance issues over the past month with our
internal test nodes and believe I have narrowed it down to something related
to Ubuntu Oneiric.  Tests done on nodes running Ubuntu Precise are
significantly faster.

One of the major differences between the releases is the support for syncfs
in libc.  Theoretically this shouldn't have a big effect on btrfs so I'm not
totally sure that this is the culprit.  Having said that, previous tests
showed good SSD performance on Oneiric leading me to believe the lower
latency mitigates the effect.  Some of spinning disk seekwatcher results for
Oneiric are quite strange with long periods of inactivity on the OSD data
disks.

I wanted to post these results for those of you who have had performance
problems in the past.  If you are continuing to have issues, you may want to
try testing on precise and see if you notice any changes.  It is possible
that all of this could be specific to our internal testing nodes, so I
wouldn't mind hearing if other people have seen similar behavior.

These tests were done using rados bench with 16 concurrent requests. There
are two nodes that each have a single 7200rpm OSD data disk and journal on a
second 7200rpm disk.  Replication is set at the default level (2).  Kernel
is 3.4 in all cases.

Here's a run down (Numbers are MB/s)

4KB Requests

BTRFS   EXT4XFS
Ceph 0.46/Oneiric:  0.073   0.694   0.723
Ceph 0.46/Precise:  2.15    2.031   1.546
Ceph 0.47.2/Oneiric:1.072   0.836   0.749
Ceph 0.47.2/Precise:2.566   2.579   1.498

128KB Requests:

BTRFS   EXT4XFS
Ceph 0.46/Oneiric:  11.874  20.066  12.641
Ceph 0.46/Precise:  49.304  39.736  38.982
Ceph 0.47.2/Oneiric:13.81   19.05   12.739
Ceph 0.47.2/Precise:47.943  49.655  36.764


4MB Requests:

BTRFS   EXT4XFS
Ceph 0.46/Oneiric:  110.202 26.58   15.445
Ceph 0.46/Precise:  135.975 128.759 106.426
Ceph 0.47.2/Oneiric:91.337  46.277  23.897
Ceph 0.47.2/Precise:136.906 134.955 106.545

I've posted seekwatcher results for all of the tests:

Ceph 0.46/Oneiric:  http://nhm.ceph.com/movies/sprint/test2
Ceph 0.46/Precise:  http://nhm.ceph.com/movies/sprint/test3
Ceph 0.47.2/Oneiric:http://nhm.ceph.com/movies/sprint/test4
Ceph 0.47.2/Precise:http://nhm.ceph.com/movies/sprint/test5

Mark
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph performance on Ubuntu Oneiric vs Ubuntu Precise

2012-06-18 Thread Tommi Virtanen
On Mon, Jun 18, 2012 at 10:56 AM, Mark Nelson mark.nel...@inktank.com wrote:
 I've been tracking down some performance issues over the past month with our
 internal test nodes and believe I have narrowed it down to something related
 to Ubuntu Oneiric.  Tests done on nodes running Ubuntu Precise are
 significantly faster.

Did you use Ubuntu kernels or our own builds? Different/same across runs?
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph performance on Ubuntu Oneiric vs Ubuntu Precise

2012-06-18 Thread Mark Nelson

On 06/18/2012 01:12 PM, Tommi Virtanen wrote:

On Mon, Jun 18, 2012 at 10:56 AM, Mark Nelsonmark.nel...@inktank.com  wrote:

I've been tracking down some performance issues over the past month with our
internal test nodes and believe I have narrowed it down to something related
to Ubuntu Oneiric.  Tests done on nodes running Ubuntu Precise are
significantly faster.


Did you use Ubuntu kernels or our own builds? Different/same across runs?


Each set of nodes is using our kernel from gitbuilder:

http://gitbuilder.ceph.com/kernel-deb-oneiric-x86_64-basic/ref/v3.4/

and

http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/v3.4/

respectively.

I should note that this problem was also seen on kernel 3.3 from 
gitbuilder with oneiric, though I do not have comparative numbers available.


Mark
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Heavy speed difference between rbd and custom pool

2012-06-18 Thread Stefan Priebe

Hello list,

i'm getting these rbd bench values for pool rbd. They're high and constant.
- RBD pool
# rados -p rbd bench 30 write -t 16
 Maintaining 16 concurrent writes of 4194304 bytes for at least 30 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 0   0 0 0 0 0 - 0
 1  16   274   258   1031.77  1032  0.043758 0.0602236
 2  16   549   533   1065.82  1100  0.072168 0.0590944
 3  16   825   809    1078.5  1104  0.040162  0.058682
 4  16  1103  1087   1086.84  1112  0.052508 0.0584277
 5  16  1385  1369   1095.04  1128  0.060233 0.0581288
 6  16  1654  1638   1091.85  1076  0.050697 0.0583385
 7  16  1939  1923   1098.71  1140  0.063716  0.057964
 8  16  2219  2203   1101.35  1120  0.055435 0.0579105
 9  16  2497  2481   1102.52  1112  0.060413 0.0578282
10  16  2773  2757   1102.66  1104  0.051134 0.0578561
11  16  3049  3033   1102.77  1104  0.057742 0.0578803
12  16  3326  3310   1103.19  1108  0.053769 0.0578627
13  16  3604  3588   1103.86  1112  0.064574 0.0578453
14  16  3883  3867   1104.72  1116  0.056524 0.0578018
15  16  4162  4146   1105.46  1116  0.054581 0.0577626
16  16  4440  4424   1105.86  1112  0.079015  0.057758
17  16  4725  4709   1107.86  1140  0.043511 0.0576647
18  16  5007  4991   1108.97  1128  0.053005 0.0576147
19  16  5292  5276    1110.6  1140  0.069004  0.057538
2012-06-18 23:36:19.124472min lat: 0.028568 max lat: 0.201941 avg lat: 
0.0574953

   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
20  16  5574  5558   .46  1128  0.048482 0.0574953
21  16  5861  5845   1113.18  1148  0.051923 0.0574146
22  16  6147  6131   1114.58  1144   0.04461 0.0573461
23  16  6438  6422   1116.72  1164  0.050383 0.0572406
24  16  6724  6708   1117.85  1144  0.067827 0.0571864
25  16  7008  6992   1118.57  1136  0.049128  0.057147
26  16  7296  7280   1119.85  1152  0.050331 0.0570879
27  16  7573  7557    1119.4  1108  0.052711  0.0571132
28  16  7858  7842   1120.13  1140  0.056369 0.0570764
29  16  8143  8127   1120.81  1140  0.046558 0.0570438
30  16  8431  8415   1121.85  1152  0.049958 0.0569942
 Total time run: 30.045481
Total writes made:  8431
Write size: 4194304
Bandwidth (MB/sec): 1122.432

Stddev Bandwidth:   26.0451
Max bandwidth (MB/sec): 1164
Min bandwidth (MB/sec): 1032
Average Latency:0.0570069
Stddev Latency: 0.0128039
Max latency:0.235536
Min latency:0.028568
-

I created then a custom pool called kvmpool.

~# ceph osd pool create kvmpool
pool 'kvmpool' created

But with this one i get slow and jumping values:
 kvmpool
~# rados -p kvmpool bench 30 write -t 16
 Maintaining 16 concurrent writes of 4194304 bytes for at least 30 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 0   0 0 0 0 0 - 0
 1  16   231   215   859.863   860  0.204867  0.069195
 2  16   393   377   753.899   648  0.049444 0.0811933
 3  16   535   519   691.908   568  0.232365 0.0899074
 4  16   634   618   617.913   396  0.032758 0.0963399
 5  16   806   790   631.913   688  0.075811  0.099529
 6  16   948   932   621.249   568  0.156988   0.10179
 7  16  1086  1070   611.348   552  0.036177  0.102064
 8  16  1206  1190   594.922   480  0.028491  0.105235
 9  16  1336  1320   586.589   520  0.041009  0.108735
10  16  1512  1496    598.32   704  0.258165  0.105086
11  16  1666  1650   599.921   616  0.040967  0.106146
12  15  1825  1810   603.255   640  0.198851  0.105463
13  16  1925  1909   587.309   396  0.042577  0.108449
14  16  2135  2119   605.352   840  0.035767  0.105219
15  16  2272  2256   601.523   548  0.246136  0.105357
16  16  2426  2410   602.424   616   0.19881  0.105692
17  16  2529  2513    591.22   412  0.031322  0.105463
18  16  2696  2680    595.48   668  0.028081  0.106749

Re: Possible deadlock condition

2012-06-18 Thread Mandell Degerness
Here is, perhaps, a more useful traceback from a different run of
tests that we just ran into:

Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.680815] INFO: task
flush-254:0:29582 blocked for more than 120 seconds.
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681040] echo 0 >
/proc/sys/kernel/hung_task_timeout_secs disables this message.
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681458] flush-254:0
  D 880bd9ca2fc0 0 29582  2 0x
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681740]
88006e51d160 0046 0002 88061b362040
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.682173]
88006e51d160 000120c0 000120c0 000120c0
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.682659]
88006e51dfd8 000120c0 000120c0 88006e51dfd8
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683088] Call Trace:
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683302]
[81520132] schedule+0x5a/0x5c
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683514]
[815203e7] schedule_timeout+0x36/0xe3
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683784]
[8101e0b2] ? physflat_send_IPI_mask+0xe/0x10
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683999]
[8101a237] ? native_smp_send_reschedule+0x46/0x48
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684219]
[811e0071] ? list_move_tail+0x27/0x2c
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684432]
[81520d13] __down_common+0x90/0xd4
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684708]
[811e1120] ? _xfs_buf_find+0x17f/0x210
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684925]
[81520dca] __down+0x1d/0x1f
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685139]
[8105db4e] down+0x2d/0x3d
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685350]
[811e0f68] xfs_buf_lock+0x76/0xaf
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685565]
[811e1120] _xfs_buf_find+0x17f/0x210
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685836]
[811e13b6] xfs_buf_get+0x2a/0x177
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686052]
[811e19f6] xfs_buf_read+0x1f/0xca
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686270]
[8122a0b7] xfs_trans_read_buf+0x205/0x308
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686490]
[81205e01] xfs_btree_read_buf_block.clone.22+0x4f/0xa7
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687015]
[8122a3ee] ? xfs_trans_log_buf+0xb2/0xc1
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687232]
[81205edd] xfs_btree_lookup_get_block+0x84/0xac
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687449]
[81208e83] xfs_btree_lookup+0x12b/0x3dc
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687721]
[811f6bb2] ? xfs_alloc_vextent+0x447/0x469
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687939]
[811fd171] xfs_bmbt_lookup_eq+0x1f/0x21
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688156]
[811ffa88] xfs_bmap_add_extent_delay_real+0x5b5/0xfec
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688378]
[810f155b] ? kmem_cache_alloc+0x87/0xf3
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688650]
[81204c40] ? xfs_bmbt_init_cursor+0x3f/0x107
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688867]
[81201160] xfs_bmapi_allocate+0x1f6/0x23a
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689084]
[812185bd] ? xfs_iext_bno_to_irec+0x95/0xb9
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689301]
[81203414] xfs_bmapi_write+0x32d/0x5a2
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689519]
[811e99e4] xfs_iomap_write_allocate+0x1a5/0x29f
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689797]
[811df12a] xfs_map_blocks+0x13e/0x1dd
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690016]
[811dfbff] xfs_vm_writepage+0x24e/0x410
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690233]
[810bde1e] __writepage+0x17/0x30
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690446]
[810be6ed] write_cache_pages+0x276/0x3c8
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690693]
[810bde07] ? set_page_dirty+0x60/0x60
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690908]
[810be884] generic_writepages+0x45/0x5c
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691123]
[811defcb] xfs_vm_writepages+0x4d/0x54
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691337]
[810bf832] do_writepages+0x21/0x2a
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691552]
[811218f5] writeback_single_inode+0x12a/0x2cc
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691800]
[81121d92] writeback_sb_inodes+0x174/0x215
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.692016]
[81122185] __writeback_inodes_wb+0x78/0xb9
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.692231]
[811224b5] 

Re: Heavy speed difference between rbd and custom pool

2012-06-18 Thread Mark Nelson

On 06/18/2012 04:39 PM, Stefan Priebe wrote:

Hello list,

i'm getting these rbd bench values for pool rbd. They're high and constant.
- RBD pool
# rados -p rbd bench 30 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 30 seconds.
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 274 258 1031.77 1032 0.043758 0.0602236
2 16 549 533 1065.82 1100 0.072168 0.0590944
3 16 825 809 1078.5 1104 0.040162 0.058682
4 16 1103 1087 1086.84 1112 0.052508 0.0584277
5 16 1385 1369 1095.04 1128 0.060233 0.0581288
6 16 1654 1638 1091.85 1076 0.050697 0.0583385
7 16 1939 1923 1098.71 1140 0.063716 0.057964
8 16 2219 2203 1101.35 1120 0.055435 0.0579105
9 16 2497 2481 1102.52 1112 0.060413 0.0578282
10 16 2773 2757 1102.66 1104 0.051134 0.0578561
11 16 3049 3033 1102.77 1104 0.057742 0.0578803
12 16 3326 3310 1103.19 1108 0.053769 0.0578627
13 16 3604 3588 1103.86 1112 0.064574 0.0578453
14 16 3883 3867 1104.72 1116 0.056524 0.0578018
15 16 4162 4146 1105.46 1116 0.054581 0.0577626
16 16 4440 4424 1105.86 1112 0.079015 0.057758
17 16 4725 4709 1107.86 1140 0.043511 0.0576647
18 16 5007 4991 1108.97 1128 0.053005 0.0576147
19 16 5292 5276 1110.6 1140 0.069004 0.057538
2012-06-18 23:36:19.124472min lat: 0.028568 max lat: 0.201941 avg lat:
0.0574953
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 16 5574 5558 .46 1128 0.048482 0.0574953
21 16 5861 5845 1113.18 1148 0.051923 0.0574146
22 16 6147 6131 1114.58 1144 0.04461 0.0573461
23 16 6438 6422 1116.72 1164 0.050383 0.0572406
24 16 6724 6708 1117.85 1144 0.067827 0.0571864
25 16 7008 6992 1118.57 1136 0.049128 0.057147
26 16 7296 7280 1119.85 1152 0.050331 0.0570879
27 16 7573 7557 1119.4 1108 0.052711 0.0571132
28 16 7858 7842 1120.13 1140 0.056369 0.0570764
29 16 8143 8127 1120.81 1140 0.046558 0.0570438
30 16 8431 8415 1121.85 1152 0.049958 0.0569942
Total time run: 30.045481
Total writes made: 8431
Write size: 4194304
Bandwidth (MB/sec): 1122.432

Stddev Bandwidth: 26.0451
Max bandwidth (MB/sec): 1164
Min bandwidth (MB/sec): 1032
Average Latency: 0.0570069
Stddev Latency: 0.0128039
Max latency: 0.235536
Min latency: 0.028568
-

I created then a custom pool called kvmpool.

~# ceph osd pool create kvmpool
pool 'kvmpool' created

But with this one i get slow and jumping values:
 kvmpool
~# rados -p kvmpool bench 30 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 30 seconds.
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 231 215 859.863 860 0.204867 0.069195
2 16 393 377 753.899 648 0.049444 0.0811933
3 16 535 519 691.908 568 0.232365 0.0899074
4 16 634 618 617.913 396 0.032758 0.0963399
5 16 806 790 631.913 688 0.075811 0.099529
6 16 948 932 621.249 568 0.156988 0.10179
7 16 1086 1070 611.348 552 0.036177 0.102064
8 16 1206 1190 594.922 480 0.028491 0.105235
9 16 1336 1320 586.589 520 0.041009 0.108735
10 16 1512 1496 598.32 704 0.258165 0.105086
11 16 1666 1650 599.921 616 0.040967 0.106146
12 15 1825 1810 603.255 640 0.198851 0.105463
13 16 1925 1909 587.309 396 0.042577 0.108449
14 16 2135 2119 605.352 840 0.035767 0.105219
15 16 2272 2256 601.523 548 0.246136 0.105357
16 16 2426 2410 602.424 616 0.19881 0.105692
17 16 2529 2513 591.22 412 0.031322 0.105463
18 16 2696 2680 595.48 668 0.028081 0.106749
19 16 2878 2862 602.449 728 0.044929 0.105856
2012-06-18 23:38:45.566094min lat: 0.023295 max lat: 0.763797 avg lat:
0.105597
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 16 3041 3025 604.921 652 0.036028 0.105597
21 16 3182 3166 602.964 564 0.035072 0.104915
22 16 3349  605.916 668 0.030493 0.105304
23 16 3512 3496 607.917 652 0.030523 0.10479
24 16 3668 3652 608.584 624 0.232933 0.10475
25 16 3821 3805 608.717 612 0.029881 0.104513
26 16 3963 3947 607.148 568 0.050244 0.10531
27 16 4112 4096 606.733 596 0.259069 0.105008
28 16 4261 4245 606.347 596 0.211877 0.105215
29 16 4437 4421 609.712 704 0.02802 0.104613
30 16 4566 4550 606.586 516 0.047076 0.105111
Total time run: 30.062141
Total writes made: 4566
Write size: 4194304
Bandwidth (MB/sec): 607.542

Stddev Bandwidth: 109.112
Max bandwidth (MB/sec): 860
Min bandwidth (MB/sec): 396
Average Latency: 0.10532
Stddev Latency: 0.108369
Max latency: 0.763797
Min latency: 0.023295


Why do these pools differ? Where is the difference?

Stefan


Are the number of placement groups the same for each pool?

try running ceph osd dump -o - | grep pool and looking for the 
pg_num value.


Mark

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Heavy speed difference between rbd and custom pool

2012-06-18 Thread Dan Mick
Yes, this is almost certainly the problem.  When you create the pool, 
you can specify a pg count; the default is 8, which is quite low.
The count can't currently be adjusted after pool-creation time (we're 
working on an enhancement for that).


http://ceph.com/docs/master/control/  shows

ceph osd pool create POOL [pg_num [pgp_num]]

You'll want to set pg_num the same for similar pools in order to get 
similar pool performance.


I note also that you can get that field directly:
$ ceph osd pool get rbd pg_num
PG_NUM: 448

I have a 'nova' pool that was created with pool create:

$ ceph osd pool get nova pg_num
PG_NUM: 8
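
So the practical fix is to create the custom pool with an explicit pg count in the same ballpark as the default pools, e.g. (a sketch; the new pool name is only illustrative, since the existing kvmpool's pg count can't be changed, and 448 should be whatever your own rbd pool reports):

$ ceph osd pool create kvmpool2 448 448
$ ceph osd pool get kvmpool2 pg_num

and point the benchmark at that pool instead.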



On 06/18/2012 03:23 PM, Mark Nelson wrote:

On 06/18/2012 04:39 PM, Stefan Priebe wrote:

Hello list,

i'm getting these rbd bench values for pool rbd. They're high and
constant.
- RBD pool
# rados -p rbd bench 30 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 30
seconds.
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 274 258 1031.77 1032 0.043758 0.0602236
2 16 549 533 1065.82 1100 0.072168 0.0590944
3 16 825 809 1078.5 1104 0.040162 0.058682
4 16 1103 1087 1086.84 1112 0.052508 0.0584277
5 16 1385 1369 1095.04 1128 0.060233 0.0581288
6 16 1654 1638 1091.85 1076 0.050697 0.0583385
7 16 1939 1923 1098.71 1140 0.063716 0.057964
8 16 2219 2203 1101.35 1120 0.055435 0.0579105
9 16 2497 2481 1102.52 1112 0.060413 0.0578282
10 16 2773 2757 1102.66 1104 0.051134 0.0578561
11 16 3049 3033 1102.77 1104 0.057742 0.0578803
12 16 3326 3310 1103.19 1108 0.053769 0.0578627
13 16 3604 3588 1103.86 1112 0.064574 0.0578453
14 16 3883 3867 1104.72 1116 0.056524 0.0578018
15 16 4162 4146 1105.46 1116 0.054581 0.0577626
16 16 4440 4424 1105.86 1112 0.079015 0.057758
17 16 4725 4709 1107.86 1140 0.043511 0.0576647
18 16 5007 4991 1108.97 1128 0.053005 0.0576147
19 16 5292 5276 1110.6 1140 0.069004 0.057538
2012-06-18 23:36:19.124472min lat: 0.028568 max lat: 0.201941 avg lat:
0.0574953
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 16 5574 5558 .46 1128 0.048482 0.0574953
21 16 5861 5845 1113.18 1148 0.051923 0.0574146
22 16 6147 6131 1114.58 1144 0.04461 0.0573461
23 16 6438 6422 1116.72 1164 0.050383 0.0572406
24 16 6724 6708 1117.85 1144 0.067827 0.0571864
25 16 7008 6992 1118.57 1136 0.049128 0.057147
26 16 7296 7280 1119.85 1152 0.050331 0.0570879
27 16 7573 7557 1119.4 1108 0.052711 0.0571132
28 16 7858 7842 1120.13 1140 0.056369 0.0570764
29 16 8143 8127 1120.81 1140 0.046558 0.0570438
30 16 8431 8415 1121.85 1152 0.049958 0.0569942
Total time run: 30.045481
Total writes made: 8431
Write size: 4194304
Bandwidth (MB/sec): 1122.432

Stddev Bandwidth: 26.0451
Max bandwidth (MB/sec): 1164
Min bandwidth (MB/sec): 1032
Average Latency: 0.0570069
Stddev Latency: 0.0128039
Max latency: 0.235536
Min latency: 0.028568
-

I created then a custom pool called kvmpool.

~# ceph osd pool create kvmpool
pool 'kvmpool' created

But with this one i get slow and jumping values:
 kvmpool
~# rados -p kvmpool bench 30 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 30
seconds.
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 231 215 859.863 860 0.204867 0.069195
2 16 393 377 753.899 648 0.049444 0.0811933
3 16 535 519 691.908 568 0.232365 0.0899074
4 16 634 618 617.913 396 0.032758 0.0963399
5 16 806 790 631.913 688 0.075811 0.099529
6 16 948 932 621.249 568 0.156988 0.10179
7 16 1086 1070 611.348 552 0.036177 0.102064
8 16 1206 1190 594.922 480 0.028491 0.105235
9 16 1336 1320 586.589 520 0.041009 0.108735
10 16 1512 1496 598.32 704 0.258165 0.105086
11 16 1666 1650 599.921 616 0.040967 0.106146
12 15 1825 1810 603.255 640 0.198851 0.105463
13 16 1925 1909 587.309 396 0.042577 0.108449
14 16 2135 2119 605.352 840 0.035767 0.105219
15 16 2272 2256 601.523 548 0.246136 0.105357
16 16 2426 2410 602.424 616 0.19881 0.105692
17 16 2529 2513 591.22 412 0.031322 0.105463
18 16 2696 2680 595.48 668 0.028081 0.106749
19 16 2878 2862 602.449 728 0.044929 0.105856
2012-06-18 23:38:45.566094min lat: 0.023295 max lat: 0.763797 avg lat:
0.105597
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 16 3041 3025 604.921 652 0.036028 0.105597
21 16 3182 3166 602.964 564 0.035072 0.104915
22 16 3349  605.916 668 0.030493 0.105304
23 16 3512 3496 607.917 652 0.030523 0.10479
24 16 3668 3652 608.584 624 0.232933 0.10475
25 16 3821 3805 608.717 612 0.029881 0.104513
26 16 3963 3947 607.148 568 0.050244 0.10531
27 16 4112 4096 606.733 596 0.259069 0.105008
28 16 4261 4245 606.347 596 0.211877 0.105215
29 16 4437 4421 609.712 704 0.02802 0.104613
30 16 4566 4550 606.586 516 0.047076 0.105111
Total time run: 30.062141
Total writes made: 4566
Write size: 4194304
Bandwidth (MB/sec): 607.542

Stddev Bandwidth: 109.112
Max bandwidth (MB/sec): 860
Min bandwidth (MB/sec): 396
Average 

Re: Possible deadlock condition

2012-06-18 Thread Dan Mick
Does the xfs on the OSD have plenty of free space left, or could this be 
an allocation deadlock?


On 06/18/2012 03:17 PM, Mandell Degerness wrote:

Here is, perhaps, a more useful traceback from a different run of
tests that we just ran into:

Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.680815] INFO: task
flush-254:0:29582 blocked for more than 120 seconds.
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681040] echo 0
/proc/sys/kernel/hung_task_timeout_secs disables this message.
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681458] flush-254:0
   D 880bd9ca2fc0 0 29582  2 0x
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681740]
88006e51d160 0046 0002 88061b362040
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.682173]
88006e51d160 000120c0 000120c0 000120c0
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.682659]
88006e51dfd8 000120c0 000120c0 88006e51dfd8
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683088] Call Trace:
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683302]
[81520132] schedule+0x5a/0x5c
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683514]
[815203e7] schedule_timeout+0x36/0xe3
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683784]
[8101e0b2] ? physflat_send_IPI_mask+0xe/0x10
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683999]
[8101a237] ? native_smp_send_reschedule+0x46/0x48
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684219]
[811e0071] ? list_move_tail+0x27/0x2c
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684432]
[81520d13] __down_common+0x90/0xd4
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684708]
[811e1120] ? _xfs_buf_find+0x17f/0x210
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684925]
[81520dca] __down+0x1d/0x1f
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685139]
[8105db4e] down+0x2d/0x3d
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685350]
[811e0f68] xfs_buf_lock+0x76/0xaf
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685565]
[811e1120] _xfs_buf_find+0x17f/0x210
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685836]
[811e13b6] xfs_buf_get+0x2a/0x177
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686052]
[811e19f6] xfs_buf_read+0x1f/0xca
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686270]
[8122a0b7] xfs_trans_read_buf+0x205/0x308
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686490]
[81205e01] xfs_btree_read_buf_block.clone.22+0x4f/0xa7
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687015]
[8122a3ee] ? xfs_trans_log_buf+0xb2/0xc1
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687232]
[81205edd] xfs_btree_lookup_get_block+0x84/0xac
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687449]
[81208e83] xfs_btree_lookup+0x12b/0x3dc
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687721]
[811f6bb2] ? xfs_alloc_vextent+0x447/0x469
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687939]
[811fd171] xfs_bmbt_lookup_eq+0x1f/0x21
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688156]
[811ffa88] xfs_bmap_add_extent_delay_real+0x5b5/0xfec
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688378]
[810f155b] ? kmem_cache_alloc+0x87/0xf3
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688650]
[81204c40] ? xfs_bmbt_init_cursor+0x3f/0x107
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688867]
[81201160] xfs_bmapi_allocate+0x1f6/0x23a
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689084]
[812185bd] ? xfs_iext_bno_to_irec+0x95/0xb9
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689301]
[81203414] xfs_bmapi_write+0x32d/0x5a2
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689519]
[811e99e4] xfs_iomap_write_allocate+0x1a5/0x29f
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689797]
[811df12a] xfs_map_blocks+0x13e/0x1dd
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690016]
[811dfbff] xfs_vm_writepage+0x24e/0x410
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690233]
[810bde1e] __writepage+0x17/0x30
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690446]
[810be6ed] write_cache_pages+0x276/0x3c8
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690693]
[810bde07] ? set_page_dirty+0x60/0x60
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690908]
[810be884] generic_writepages+0x45/0x5c
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691123]
[811defcb] xfs_vm_writepages+0x4d/0x54
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691337]
[810bf832] do_writepages+0x21/0x2a
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691552]
[811218f5] writeback_single_inode+0x12a/0x2cc
Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691800]
[81121d92] writeback_sb_inodes+0x174/0x215
Jun 18 17:58:51 node-172-29-0-15 kernel: 

Re: RBD layering design draft

2012-06-18 Thread Dan Mick

On 06/18/2012 11:01 AM, Sage Weil wrote:

On Mon, 18 Jun 2012, Josh Durgin wrote:

 $ rbd copyup pool2/child1


Does copyup make sense to everyone? Every time you say it, my brain
needs to flip the image the other way around -- I naturally
imagine a tree with the parent at the top, and children and
grandchildren down from it, but then I can't call that operation
copyup without wrecking my mental image.

I also can't seem to google good evidence that the term would be in
widespread use in the enterprisey block storage world, outside of the
unionfs world.. What do people call the un-dedupping, un-thinning of
copy-on-write thin provisioning?

unshare?


I'm not sure what best term is, but there's probably something better than
copyup.


flatten?  My mental model is stuck on the layering analogy, where the
child is a copy-on-write layer on top of a read-only parent.

Someday we may want to support the ability to add a parent to an existing
image and do a sort of dedup, so having an opposite for whatever term we
pick would be a bonus.


disown and adopt?  :)  (I actually started that as a joke, but I really 
kinda like it; it fits with the parent-child naming)
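
To recap how the pieces under discussion might fit together, here is a rough
sketch of the workflow. The verbs are illustrative only -- 'copyup' is the
name used in the design draft, and the snapshot/protect/clone commands are
hypothetical placeholders, not an implemented CLI:

    $ rbd snap create pool1/parent@snap1          # snapshot the image that will act as parent
    $ rbd snap protect pool1/parent@snap1         # the preserve/protect flag: snapshot may not be deleted
    $ rbd clone pool1/parent@snap1 pool2/child1   # child is a copy-on-write layer over the read-only snapshot
    $ rbd copyup pool2/child1                     # a.k.a. flatten: copy the parent's data into the child,
                                                  # removing the child's dependency on the parent snapshot
    $ rbd snap unprotect pool1/parent@snap1       # once nothing depends on it, the snapshot can be removed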



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Possible deadlock condition

2012-06-18 Thread Mandell Degerness
None of the OSDs seem to be more than 82% full. I didn't think we were
running quite that close to the margin, but it is still far from
actually full.
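
A quick way to double-check this from the shell (the mount points below are
only examples, and ceph health should also warn if any OSD has crossed the
near-full threshold):

    $ df -h /data/osd*     # filesystem-level usage of each OSD data partition
    $ ceph health          # reports near-full / full OSDs, among other warnings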


On Mon, Jun 18, 2012 at 3:57 PM, Dan Mick dan.m...@inktank.com wrote:
 Does the xfs on the OSD have plenty of free space left, or could this be an
 allocation deadlock?


 On 06/18/2012 03:17 PM, Mandell Degerness wrote:

 Here is, perhaps, a more useful traceback from a different run of
 tests that we just ran into:

 [... kernel traceback snipped; identical to the traceback quoted earlier in this digest ...]

Re: RBD layering design draft

2012-06-18 Thread Dan Mick



On 06/18/2012 09:25 AM, Tommi Virtanen wrote:

On Fri, Jun 15, 2012 at 5:46 PM, Sage Weil s...@inktank.com wrote:

Is 'preserve' and 'unpreserve' the verbiage we want to use here?  Not sure
I have a better suggestion, but preserve is unusual.


protect/unprotect? The flag protects the image snapshot from being deleted.


unremovable/removable?

undeletable/deletable?
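
Whatever name wins, the semantics being discussed are simply that the flag
blocks snapshot removal while it is set. A hypothetical session (command
names are illustrative, not an implemented CLI):

    $ rbd snap protect pool1/parent@snap1
    $ rbd snap rm pool1/parent@snap1          # expected to be refused while the flag is set
    $ rbd snap unprotect pool1/parent@snap1   # clear the flag once no child images need the snapshot
    $ rbd snap rm pool1/parent@snap1          # now allowed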

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: Possible deadlock condition

2012-06-18 Thread Dan Mick
I don't know enough to know if there's a connection, but I do note this 
prior thread that sounds kinda similar:


http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/6574


On 06/18/2012 04:08 PM, Mandell Degerness wrote:

None of the OSDs seem to be more than 82% full. I didn't think we were
running quite that close to the margin, but it is still far from
actually full.


On Mon, Jun 18, 2012 at 3:57 PM, Dan Mick dan.m...@inktank.com wrote:

Does the xfs on the OSD have plenty of free space left, or could this be an
allocation deadlock?


On 06/18/2012 03:17 PM, Mandell Degerness wrote:


Here is, perhaps, a more useful traceback from a different run of
tests that we just ran into:

 [... kernel traceback snipped; identical to the traceback quoted earlier in this digest ...]

RE: Performance benchmark of rbd

2012-06-18 Thread Eric_YH_Chen
Hi, Mark and all:

I think you may have missed this mail before, so I am sending it again.

==

I forgot to mention one thing: I created the rbd image on the same machine
that runs the cluster and tested it there, so the network latency may be
lower than in a normal deployment.

1. 
I use ext4 as the backend filesystem with the following mount options:
data=writeback,noatime,nodiratime,user_xattr

2. 
I use the default replication number, I think it is 2, right?

3. 
On my platform, I have 192GB memory

4. Sorry, the column names in my earlier table were reversed. Here is the
corrected one:
         Seq-write   Seq-read
  32 KB 23 MB/s  690 MB/s
 512 KB 26 MB/s  960 MB/s
   4 MB 27 MB/s 1290 MB/s
  32 MB 36 MB/s 1435 MB/s

5. If I put all the journal data on an SSD device (Intel 520),
  the sequential write performance reaches 135 MB/s instead of the original
  27 MB/s (object size = 4 MB). The other results do not change, including
  random write. I am curious why the SSD device doesn't help random-write
  performance.

6. For the random read/write tests, the data I provided before was correct,
  but here are the details. Are the numbers higher than you expected?
  
rand-write-4k           rand-write-16k
bw      iops            bw      iops
3,524   881             9,032   564

mix-4k (50/50)
r:bw    r:iops          w:bw    w:iops
2,925   731             2,924   731

mix-8k (50/50)
r:bw    r:iops          w:bw    w:iops
4,509   563             4,509   563

mix-16k (50/50)
r:bw    r:iops          w:bw    w:iops
8,366   522             8,345   521


7. 
Here is the HW RAID cache policy we use now:
Write Policy    Write Back with BBU
Read Policy     ReadAhead

If you are interested in how the HW RAID affects performance, I can help a
little, since we also want to know the best configuration for our platform.
Is there any particular test you would like to see?


Furthermore, do you have any suggestions for improving performance on our
platform? Thanks!
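
For reference, the 4k random-write numbers above would typically come from an
fio job along the following lines; the device path, queue depth and runtime
are assumptions for illustration, not the exact parameters of our test:

    fio --name=rand-write-4k --filename=/dev/rbd1 --direct=1 \
        --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 \
        --size=20g --runtime=300 --time_based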



-----Original Message-----
From: Mark Nelson [mailto:mark.nel...@inktank.com] 
Sent: Wednesday, June 13, 2012 8:30 PM
To: Eric YH Chen/WYHQ/Wiwynn
Cc: ceph-devel@vger.kernel.org
Subject: Re: Performance benchmark of rbd

Hi Eric!

On 6/13/12 5:06 AM, eric_yh_c...@wiwynn.com wrote:
 Hi, all:

  I am doing some benchmark of rbd.
  The platform is on a NAS storage.

  CPU: Intel E5640 2.67GHz
  Memory: 192 GB
  Hard Disk: SATA 250G * 1, 7200 rpm (H0) + SATA 1T * 12 , 7200rpm
 (H1~ H12)
  RAID Card: LSI 9260-4i
  OS: Ubuntu12.04 with Kernel 3.2.0-24
  Network:  1 Gb/s

  We create 12 OSD on H1 ~ H12 with the journal is put on H0.

Just to make sure I understand, you have a single node with 12 OSDs and 
3 mons, and all 12 OSDs are using the H0 disk for their journals?  What 
filesystem are you using for the OSDs?  How much replication?

  We also create 3 MON in the cluster.
  In briefly, we setup a ceph cluster all-in-one, with 3 monitors
and
 12 OSD.

  The benchmark tool we used is fio 2.0.3. We had 7 basic test case
  1)  sequence write with bs=64k
  2)  sequence read with bs=64k
  3)  random write with bs=4k
  4)  random write with bs=16k
  5)  mix read/write with bs=4k
  6)  mix read/write with bs=8k
  7)  mix read/write with bs=16k

  We create several rbd with different object size for the
benchmark.

  1.  size = 20G, object size =  32KB
  2.  size = 20G, object size = 512KB
  3.  size = 20G, object size =  4MB
  4.  size = 20G, object size = 32MB

Given how much memory you have, you may want to increase the amount of 
data you are writing during each test to rule out caching.


  We have some conclusion after the benchmark.

  a.  We can get better performance of sequence read/write when the
 object size is bigger.
           Seq-read   Seq-write
   32 KB   23 MB/s    690 MB/s
  512 KB   26 MB/s    960 MB/s
    4 MB   27 MB/s   1290 MB/s
   32 MB   36 MB/s   1435 MB/s

Which test are these results from?  I'm suspicious that the write 
numbers are so high.  Figure that even with a local client and 1X 
replication, your journals and data partitions are each writing out a 
copy of the data.  You don't have enough disk in that box to sustain 
1.4GB/s to both even under perfectly ideal conditions.  Given that it 
sounds like you are using a single 7200rpm disk for 12 journals, I would
expect far lower numbers...


  b. There is no obvious influence for random read/write when the
 object size is different.
All the result are in a range not more than 10%.

 rand-write-4K   rand-write-16K   mix-4K      mix-8k   mix-16k
 881 iops        564 iops         1462 iops 

Re: Heavy speed difference between rbd and custom pool

2012-06-18 Thread Alexandre DERUMIER
Hi Stefan,
the recommendation is 30-50 PGs per OSD, if I remember correctly.
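
In practice that means giving pg_num explicitly when the pool is created; a
minimal sketch, assuming a 12-OSD cluster and roughly 50 PGs per OSD (the
count of 600 is only an example):

    $ ceph osd pool create kvmpool 600   # pg_num set at creation time
    $ ceph osd dump | grep pool          # compare pg_num with the default rbd pool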

----- Original Message ----- 

From: Mark Nelson mark.nel...@inktank.com 
To: Stefan Priebe s.pri...@profihost.ag 
Cc: ceph-devel@vger.kernel.org 
Sent: Tuesday, June 19, 2012 00:23:49 
Subject: Re: Heavy speed difference between rbd and custom pool 

On 06/18/2012 04:39 PM, Stefan Priebe wrote: 
 Hello list, 
 
 I'm getting these rbd bench values for the rbd pool. They're high and constant. 
 - RBD pool 
 # rados -p rbd bench 30 write -t 16 
 Maintaining 16 concurrent writes of 4194304 bytes for at least 30 seconds. 
 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 
 0 0 0 0 0 0 - 0 
 1 16 274 258 1031.77 1032 0.043758 0.0602236 
 2 16 549 533 1065.82 1100 0.072168 0.0590944 
 3 16 825 809 1078.5 1104 0.040162 0.058682 
 4 16 1103 1087 1086.84 1112 0.052508 0.0584277 
 5 16 1385 1369 1095.04 1128 0.060233 0.0581288 
 6 16 1654 1638 1091.85 1076 0.050697 0.0583385 
 7 16 1939 1923 1098.71 1140 0.063716 0.057964 
 8 16 2219 2203 1101.35 1120 0.055435 0.0579105 
 9 16 2497 2481 1102.52 1112 0.060413 0.0578282 
 10 16 2773 2757 1102.66 1104 0.051134 0.0578561 
 11 16 3049 3033 1102.77 1104 0.057742 0.0578803 
 12 16 3326 3310 1103.19 1108 0.053769 0.0578627 
 13 16 3604 3588 1103.86 1112 0.064574 0.0578453 
 14 16 3883 3867 1104.72 1116 0.056524 0.0578018 
 15 16 4162 4146 1105.46 1116 0.054581 0.0577626 
 16 16 4440 4424 1105.86 1112 0.079015 0.057758 
 17 16 4725 4709 1107.86 1140 0.043511 0.0576647 
 18 16 5007 4991 1108.97 1128 0.053005 0.0576147 
 19 16 5292 5276 1110.6 1140 0.069004 0.057538 
 2012-06-18 23:36:19.124472min lat: 0.028568 max lat: 0.201941 avg lat: 
 0.0574953 
 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 
 20 16 5574 5558 1111.46 1128 0.048482 0.0574953 
 21 16 5861 5845 1113.18 1148 0.051923 0.0574146 
 22 16 6147 6131 1114.58 1144 0.04461 0.0573461 
 23 16 6438 6422 1116.72 1164 0.050383 0.0572406 
 24 16 6724 6708 1117.85 1144 0.067827 0.0571864 
 25 16 7008 6992 1118.57 1136 0.049128 0.057147 
 26 16 7296 7280 1119.85 1152 0.050331 0.0570879 
 27 16 7573 7557 1119.4 1108 0.052711 0.0571132 
 28 16 7858 7842 1120.13 1140 0.056369 0.0570764 
 29 16 8143 8127 1120.81 1140 0.046558 0.0570438 
 30 16 8431 8415 1121.85 1152 0.049958 0.0569942 
 Total time run: 30.045481 
 Total writes made: 8431 
 Write size: 4194304 
 Bandwidth (MB/sec): 1122.432 
 
 Stddev Bandwidth: 26.0451 
 Max bandwidth (MB/sec): 1164 
 Min bandwidth (MB/sec): 1032 
 Average Latency: 0.0570069 
 Stddev Latency: 0.0128039 
 Max latency: 0.235536 
 Min latency: 0.028568 
 - 
 
 I then created a custom pool called kvmpool. 
 
 ~# ceph osd pool create kvmpool 
 pool 'kvmpool' created 
 
 But with this one I get slow and fluctuating values: 
  kvmpool 
 ~# rados -p kvmpool bench 30 write -t 16 
 Maintaining 16 concurrent writes of 4194304 bytes for at least 30 seconds. 
 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 
 0 0 0 0 0 0 - 0 
 1 16 231 215 859.863 860 0.204867 0.069195 
 2 16 393 377 753.899 648 0.049444 0.0811933 
 3 16 535 519 691.908 568 0.232365 0.0899074 
 4 16 634 618 617.913 396 0.032758 0.0963399 
 5 16 806 790 631.913 688 0.075811 0.099529 
 6 16 948 932 621.249 568 0.156988 0.10179 
 7 16 1086 1070 611.348 552 0.036177 0.102064 
 8 16 1206 1190 594.922 480 0.028491 0.105235 
 9 16 1336 1320 586.589 520 0.041009 0.108735 
 10 16 1512 1496 598.32 704 0.258165 0.105086 
 11 16 1666 1650 599.921 616 0.040967 0.106146 
 12 15 1825 1810 603.255 640 0.198851 0.105463 
 13 16 1925 1909 587.309 396 0.042577 0.108449 
 14 16 2135 2119 605.352 840 0.035767 0.105219 
 15 16 2272 2256 601.523 548 0.246136 0.105357 
 16 16 2426 2410 602.424 616 0.19881 0.105692 
 17 16 2529 2513 591.22 412 0.031322 0.105463 
 18 16 2696 2680 595.48 668 0.028081 0.106749 
 19 16 2878 2862 602.449 728 0.044929 0.105856 
 2012-06-18 23:38:45.566094min lat: 0.023295 max lat: 0.763797 avg lat: 
 0.105597 
 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 
 20 16 3041 3025 604.921 652 0.036028 0.105597 
 21 16 3182 3166 602.964 564 0.035072 0.104915 
 22 16 3349 3333 605.916 668 0.030493 0.105304 
 23 16 3512 3496 607.917 652 0.030523 0.10479 
 24 16 3668 3652 608.584 624 0.232933 0.10475 
 25 16 3821 3805 608.717 612 0.029881 0.104513 
 26 16 3963 3947 607.148 568 0.050244 0.10531 
 27 16 4112 4096 606.733 596 0.259069 0.105008 
 28 16 4261 4245 606.347 596 0.211877 0.105215 
 29 16 4437 4421 609.712 704 0.02802 0.104613 
 30 16 4566 4550 606.586 516 0.047076 0.105111 
 Total time run: 30.062141 
 Total writes made: 4566 
 Write size: 4194304 
 Bandwidth (MB/sec): 607.542 
 
 Stddev Bandwidth: 109.112 
 Max bandwidth (MB/sec): 860 
 Min bandwidth (MB/sec): 396 
 Average Latency: 0.10532 
 Stddev Latency: 0.108369 
 Max latency: 0.763797 
 Min latency: 0.023295 
  
 
 Why do these pools differ? Where is the