RE: Designing a cluster guide

2012-05-29 Thread Quenten Grasso
Interesting, I've been thinking about this and I think most Ceph installations 
could benefit from more nodes and fewer disks per node.

For example 

We have a replica level of 2 and an RBD block size of 4MB. You start writing a 
file of 10GB; this is effectively divided into 4MB chunks.

The first chunk goes to node 1 and node 2 (at the same time, I assume), where it is 
written to a journal and then replayed to the data file system.

The second chunk might be sent to nodes 2 and 3 at the same time, where it is written to 
a journal and then replayed. (We now have overlap from chunk 1.) 

The third chunk might be sent to nodes 1 and 3 (we have more overlap from chunks 1 and 2), 
and as you can see, this quickly becomes an issue.

So if we have 10 nodes vs. 3 nodes with the same amount of disks, we should see 
better write and read performance, as you would have less overlap.

Now we take BTRFS into the picture: as I understand it, journals are not necessary 
due to the way it writes/snapshots and reads data, so this alone 
would be a major performance increase on a BTRFS RAID level (like ZFS RAIDZ).
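
To make the overlap point concrete, here is a rough sketch (Python; the 4MB object size, 10GB file and replica count are taken from the example above, but the node counts and the random placement are only a stand-in for real CRUSH behaviour) of how many chunk writes the busiest node has to absorb on a small vs. a larger cluster:

import random
from collections import Counter

OBJECT_MB, FILE_MB, REPLICAS = 4, 10 * 1024, 2   # 4 MB objects, 10 GB file, 2 replicas

def chunks_per_node(num_nodes, seed=1):
    # Place each 4 MB chunk on REPLICAS distinct nodes (random stand-in for
    # CRUSH) and return how many chunk writes each node has to absorb.
    rng = random.Random(seed)
    load = Counter()
    for _ in range(FILE_MB // OBJECT_MB):
        load.update(rng.sample(range(num_nodes), REPLICAS))
    return load

for nodes in (3, 10):
    load = chunks_per_node(nodes)
    print(nodes, "nodes: busiest node handles", max(load.values()),
          "of", sum(load.values()), "chunk writes")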

Side note: this may sound crazy, but the more I read about SSDs the less I wish 
to use/rely on them, and RAM SSDs are crazily priced imo. =)

Regards,
Quenten


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Slawomir Skowron
Sent: Tuesday, 22 May 2012 3:52 PM
To: Quenten Grasso
Cc: Gregory Farnum; ceph-devel@vger.kernel.org
Subject: Re: Designing a cluster guide

I get around 320MB/s on a VM from a 3-node rbd cluster, but that is with 10GbE
and 26 2.5" SAS drives in every machine, so it's not everything that could be
achieved.
Every OSD drive is a single-drive RAID0 behind the battery-backed NVRAM cache
of the hardware RAID controller.
Every OSD takes a lot of RAM for caching.

That's why I'm thinking about swapping 2 drives for SSDs in RAID1,
with HPA tuned to increase drive durability, for journaling - if
this works ;)

With the newest drives I can theoretically get 500MB/s with a long queue
depth. This means that in theory I can improve the bandwidth score, get
lower latency, and handle multiple IO writes from many hosts better.
Reads are cached in RAM by the OSD daemon, the VFS in the kernel, the NVRAM in
the controller, and in the near future by the cache in kvm (I need to test that -
it should improve performance).

But if the SSD drive slows down, it can drag the whole write performance
down. It is very delicate.

Regards,

iSS

On 22 May 2012, at 02:47, Quenten Grasso qgra...@onq.com.au wrote:

 I should have added: for storage I'm considering something like enterprise 
 nearline SAS 3TB disks, running individual disks (not RAIDed) with a rep level of 
 2 as suggested :)


 Regards,
 Quenten


 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Quenten Grasso
 Sent: Tuesday, 22 May 2012 10:43 AM
 To: 'Gregory Farnum'
 Cc: ceph-devel@vger.kernel.org
 Subject: RE: Designing a cluster guide

 Hi Greg,

 I'm only talking about journal disks not storage. :)



 Regards,
 Quenten


 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Gregory Farnum
 Sent: Tuesday, 22 May 2012 10:30 AM
 To: Quenten Grasso
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: Designing a cluster guide

 On Mon, May 21, 2012 at 4:52 PM, Quenten Grasso qgra...@onq.com.au wrote:
 Hi All,


 I've been thinking about this issue myself for the past few days, and an idea I've 
 come up with is running 16 x 2.5" 15K 72/146GB disks
 in RAID10 inside a 2U server, with JBODs attached to the server for the actual 
 storage.

 Can someone help clarify this one,

 Is the data written to the journal disk, then read back from the journal disk 
 and written to the storage disk, and only once this is complete is it 
 considered a successful write by the client?
 Or
 Once the data is written to the journal disk, is this considered successful 
 by the client?
 This one — the write is considered safe once it is on-disk on all
 OSDs currently responsible for hosting the object.

 Every time anybody mentions RAID10 I have to remind them of the
 storage amplification that entails, though. Are you sure you want that
 on top of (well, underneath, really) Ceph's own replication?

 Or
 Is the data written to the journal disk and the storage disk at the same 
 time, and once both complete the write is considered successful by the 
 client? (if this is the case SSDs may not be so useful)


 Pros
 Quite fast write throughput to the journal disks,
 No write wear-out of SSDs
 RAID 10 with a 1GB cache controller also helps improve things (if really keen 
 you could use CacheCade as well)


 Cons
 Not as fast as SSD's
 More rackspace required per server.


 Regards,
 Quenten


OSD deadlock with cephfs client and OSD on same machine

2012-05-29 Thread Amon Ott
Hello again!

On Linux, if you run OSD on an ext4 filesystem, have a cephfs kernel client mount 
on the same system and no syncfs system call (as is to be expected with libc6 < 
2.14 or kernel < 2.6.39), OSD deadlocks in sys_sync(). Only a reboot recovers 
the system.

After some investigation in the code, this is what I found:
In src/common/sync_filesystem.h, the function sync_filesystem() first tries a 
syncfs() (not available), then a btrfs ioctl sync (not available with 
non-btrfs), then finally a sync(). sys_sync tries to sync all filesystems, 
including the journal device, the osd storage area and the cephfs mount. 
Under some load, when OSD calls sync(), the cephfs sync waits for the local osd, 
which already waits for its storage to sync, which the kernel wants to do 
after the cephfs sync. Deadlock.
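
For illustration, a minimal Python sketch of the fallback order described above (the real implementation is C++ in src/common/sync_filesystem.h; the btrfs ioctl step is omitted here and the helper name is made up):

import ctypes, ctypes.util, os

def sync_one_filesystem(path):
    # Prefer syncfs() on just this filesystem; otherwise fall back to a global
    # sync() - and the global sync() is the call that can deadlock against a
    # local cephfs mount, as described above.
    fd = os.open(path, os.O_RDONLY)
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        syncfs = getattr(libc, "syncfs", None)   # glibc >= 2.14 / kernel >= 2.6.39
        if syncfs is not None and syncfs(fd) == 0:
            return "syncfs"                      # only this filesystem was flushed
        os.sync()                                # flushes everything, cephfs included
        return "sync"
    finally:
        os.close(fd)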

The function sync_filesystem() is called by FileStore::sync_entry() in 
src/os/FileStore.cc, but only on non-btrfs storage and if 
filestore_fsync_flushes_journal_data is false. After forcing this to true in 
OSD config, our test cluster survived three days of heavy load (and still 
running fine) instead of deadlocking all nodes within an hour. Reproduced 
with 0.47.2 and kernel 3.2.18, but the related code seems unchanged in 
current master.

Conclusion: If you want to run OSD and cephfs kernel client on the same Linux 
server and have a libc6 before 2.14 (e.g. Debian's newest in experimental is 
2.13) or a kernel before 2.6.39, either do not use ext4 (but btrfs is still 
unstable) or risk data loss by missing syncs through the workaround of 
forcing filestore_fsync_flushes_journal_data to true.
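
For reference, the workaround would look roughly like this in ceph.conf, using the option named above (see Sage Weil's reply further down for the caveat that this is only safe where fsync() really does flush prior writes):

[osd]
        # risky workaround from this thread: on non-btrfs, skip the explicit
        # sync and rely on fsync() having already flushed the journaled data
        filestore fsync flushes journal data = true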

Please consider putting out a fat warning at least at build time if syncfs() 
is not available, e.g. "No syncfs() syscall, please expect a deadlock when 
running osd on non-btrfs together with a local cephfs mount." Even better 
would be a quick runtime test for missing syncfs() and storage on non-btrfs 
that spits out a warning if a deadlock is possible.

As a side effect, the experienced lockup seems to be a good way to reproduce 
the long standing bug 1047 - when our cluster tried to recover, all MDS 
instances died with those symptoms. It seems that a partial sync of journal 
or data partition causes that broken state.

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH   Tel: +49 30 24342334
Am Köllnischen Park 1Fax: +49 30 24342336
10179 Berlin http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649


RE: distributed cluster

2012-05-29 Thread Quenten Grasso
This is also something I'm very interested in, from the perspective of a power outage or 
some other data centre issue.

I assume the main issue here would be our friend latency; however, there is a 
bloke on the mailing list who is currently running a 2-site cluster setup as 
well.

I've been thinking about a setup with a replica level of 2 (1 replica per site). With 
the sites only 2-3km apart, latency shouldn't be much of an issue, but the 
obvious bottleneck will be the 10GbE link between sites, and split brain isn't 
an issue if the RBD vol is only mounted at a single site anyway.

If the data is sitting on a BTRFS/ZFS RAID (or RAID6 until BTRFS is ready), this 
would be a reasonable level of risk. As for data integrity/availability with only 
2 replicas: the likelihood of having a complete server failure 
and a link outage at the same time would be fairly minimal.

Regards,
Quenten 


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Jimmy Tang
Sent: Monday, 28 May 2012 11:48 PM
To: Jerker Nyberg
Cc: ceph-devel@vger.kernel.org
Subject: Re: distributed cluster

Hi All,

On 28 May 2012, at 12:28, Jerker Nyberg wrote:

 
 This may not really be a subject for the ceph-devel mailing list but rather for a 
 potential ceph-users? I hope it is ok to write here. I would like to discuss 
 whether it sounds reasonable to run a Ceph cluster distributed over a metro 
 (city) network.
 
 Let us assume we have a couple of sites distributed over a metro network with 
 at least gigabit interconnect. The demands for storage capacity and speed at 
 our sites are increasing together with the demands for reasonably stable 
 storage. May Ceph be part of a solution?
 
 One idea is to set up Ceph distributed over this metro network. A public 
 service network is announced at all sites, anycasted from the storage 
 SMB/NFS/RGW(?)-to-Ceph gateway (for stateless connections). Stateful 
 connections (iSCSI?) have to contact the individual storage gateways, and 
 redundancy is handled at the application level (dual path). Ceph kernel 
 clients contact the storage servers directly.
 
 Hopefully this means that clients at the sites with a storage gateway will 
 contact it. Clients at a site without a local storage gateway, or when the 
 local gateway is down, will contact a storage gateway at another site.
 
 Hopefully not all power and network at the whole city will go down at once!
 
 Does this sound reasonable? It should be easy to scale up with more storage 
 nodes with Ceph. Or is it better to put all servers in the same server room?
 
                   Internet
                    |     |
                   Routers
                    |     |
   ============== Metro network ==============
            |     |     |     |     |     |
   Sites    R     R     R     R     R     R
            |     |     |     |
   Servers  Ceph1 Ceph2 Ceph3 Ceph4
 
 


I'm also interested in this type of use case; I would be interested in running 
a ceph cluster across a metropolitan area network. Has anyone tried running 
ceph in a WAN/MAN environment across a city/state/country?

Regards,
Jimmy Tang

--
Senior Software Engineer, Digital Repository of Ireland (DRI)
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/



Re: poor OSD performance using kernel 3.4

2012-05-29 Thread Stefan Priebe - Profihost AG
On 29.05.2012 05:54, Alexandre DERUMIER wrote:
 This happens with ext4 or btrfs too. 
 
 maybe this is related to io scheduler ?
 did you have compared cfq,deadline,noop scheduler ?

This is something I will consider for performance tuning later on, when
everything is running smoothly. Right now I'm using CFQ with the tuned IBM
settings (which Proxmox uses too).


Here are some outputs of basic fio Tests running on 3.4 and 3.0.

3.4: http://pastebin.com/raw.php?i=6GEKsCYH
3.0: http://pastebin.com/raw.php?i=FU4AtUck

Strangely, 3.4 is faster, but this corresponds to the fact that normal
disk I/O is working fine with 3.4. It's just Ceph which isn't working fine.

 also what's is your sas/sata controller  ?
Intel onboard SATA controller in this test setup.

Stefan


Re: poor OSD performance using kernel 3.4

2012-05-29 Thread Stefan Priebe - Profihost AG
It would be really nice if somebody from Inktank could comment on this whole
situation.

Thanks!

Stefan

On 29.05.2012 05:54, Alexandre DERUMIER wrote:
 This happens with ext4 or btrfs too. 
 
 maybe this is related to io scheduler ?
 
 did you have compared cfq,deadline,noop scheduler ?
 
 noop should be fast with ssd.
 
 
 also what's is your sas/sata controller  ?
 
 - Original Message - 
 
 From: Stefan Priebe s.pri...@profihost.ag 
 To: Alexandre DERUMIER aderum...@odiso.com 
 Cc: ceph-devel@vger.kernel.org, Mark Nelson mark.nel...@inktank.com 
 Sent: Monday, 28 May 2012 21:48:34 
 Subject: Re: poor OSD performance using kernel 3.4 
 
 Am 28.05.2012 08:52, schrieb Alexandre DERUMIER: 
 I think filestore journal parallel works only with btrfs. 
 Other filesystem are writeahead. 
 ... you might be right but i can't change ceph's implementation. 

 See my schema, 
 I think you see parallel writes, because you see flush write of first wave 
 to disk, in the same time 
 of second wave write to journal. 
 Yes I fully understand and agree - but still this should at least 
 result in a constant bandwidth near the max of the underlying disk. 
 
 I totally agree with you, but this is just a test setup AND if you have 
 a big log file to copy, let's say 100GB, your journal will never be big 
 enough and the speed should never drop to 0MB/s. Also I see the correct 
 behaviour with 3.0.X where the speed is maxed to the underlying device. 
 So I still see no reason why with 3.4 the speed drops to 0MB/s and is 
 mostly 10-20MB/s instead of 130MB/s. 

 Maybe something is wrong with 3.4, then your disk write more slowly. (xfs 
 bug, sata driver controller bug, ...) 
 
 This happens with ext4 or btrfs too. 
 
 Sequential write speed to the FS is exactly the same under 3.0 and 3.4 using 
 oflag=direct. 
 
 3.4: 
 1+0 records in 
 1+0 records out 
 1048576 bytes (10 GB) copied, 41,4899 s, 253 MB/s 
 
 3.0: 
 1+0 records in 
 1+0 records out 
 1048576 bytes (10 GB) copied, 40,861 s, 257 MB/s 
 
 maybe some local benchmark of your ssd with 3.4 can give some tips ? 
 
 How many disks (7,2K) do you have by osd ? 
 One intel 520 SSD per OSD. 

 I see some benchmark on internet about 150-300MB/s (depend of the 
 blocksize). 
 bench OSD shows around 260MB/s 
 
 ceph osd tell X bench shows me a speed of 260MB/s under both kernels 
 which corresponds to the dd from above. 
 
 Something must be wrong, Doing local benchmark can really help I think. 
 You can use sysbench-tools 
 https://github.com/tsuna/sysbench-tools 
 It make bench compare with nice graphs. 
 Thx hopefully i'll find something. 
 
 Stefan 
 
 
 


Re: poor OSD performance using kernel 3.4

2012-05-29 Thread Alexandre DERUMIER
A fio benchmark will give you raw device performance, bypassing the filesystem.

So maybe the problem is in xfs or the Linux VFS layer.

I think you need to bench the filesystem to compare performance.


- Original Message - 

From: Stefan Priebe - Profihost AG s.pri...@profihost.ag 
To: Alexandre DERUMIER aderum...@odiso.com 
Cc: ceph-devel@vger.kernel.org, Mark Nelson mark.nel...@inktank.com 
Sent: Tuesday, 29 May 2012 10:22:34 
Subject: Re: poor OSD performance using kernel 3.4 

Am 29.05.2012 05:54, schrieb Alexandre DERUMIER: 
 This happens with ext4 or btrfs too. 
 
 maybe this is related to io scheduler ? 
 did you have compared cfq,deadline,noop scheduler ? 

This is something i consider for performance tuning later on, when 
everything is running smooth. Right now i'm using CFQ with the tuned IBM 
settings (which proxmox uses too). 


Here are some outputs of basic fio Tests running on 3.4 and 3.0. 

3.4: http://pastebin.com/raw.php?i=6GEKsCYH 
3.0: http://pastebin.com/raw.php?i=FU4AtUck 

strangely 3.4 is faster but this corresponds to the fact that the normal 
Disk I/O is working fine with 3.4 It's just ceph which isn't working fine. 

 also what's is your sas/sata controller ? 
Intel onboard SATA controller in this testsetup. 

Stefan 



-- 

Alexandre Derumier 
Systems Engineer 
Phone: 03 20 68 88 90 
Fax: 03 20 68 90 81 
45 Bvd du Général Leclerc 59100 Roubaix - France 
12 rue Marivaux 75002 Paris - France 



Re: poor OSD performance using kernel 3.4

2012-05-29 Thread Yann Dupont

On 29/05/2012 11:46, Stefan Priebe - Profihost AG wrote:

It would be really nice if somebody from inktank can comment this whole
sitation.


Hello.
I think I have the same bug:

My setup is 8 OSD nodes, 3 MDS (1 active) & 3 MON.
All my machines are Debian, using a custom 3.4.0 kernel. Ceph is 
0.47.2-1~bpo60+1 (Debian package)


root@label5:~#  rados -p data bench 20 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
0   0 0 0 0 0 - 0
1  169983 331.9   332  0.059756 0.0946512
2  16   141   125   249.946   168  0.049822  0.212338
3  16   166   150   199.963   100  0.057352  0.257179
4  16   227   211   210.965   244  0.043592  0.265005
5  16   257   241   192.767   120  0.040883  0.276718
6  16   260   244   162.64112   1.59593  0.293439
7  16   319   303   173.118   236  0.056913  0.357856
8  16   348   332   165.976   116  0.052954  0.332424
9  16   348   332   147.535 0 -  0.332424
   10  16   472   456   182.374   248  0.038543  0.343745
   11  16   485   469   170.52252  0.040475  0.347328
   12  16   485   469   156.312 0 -  0.347328
   13  16   517   501   154.13364  0.047759  0.378595
   14  16   562   546155.98   180  0.042814  0.395036
   15  16   563   547   145.847 4  0.045834  0.394398
   16  16   563   547   136.732 0 -  0.394398
   17  16   563   547   128.689 0 -  0.394398
   18  16   667   651   144.648   138.667   0.06501  0.440847
   19  16   703   687   144.613   144  0.040772  0.421935
min lat: 0.030505 max lat: 5.05834 avg lat: 0.421935
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   20  16   703   687   137.382 0 -  0.421935
   21  16   704   688   131.031 2   2.65675  0.425184
   22  14   704   690   125.439 8   3.26857  0.433417
Total time run:22.042041
Total writes made: 704
Write size:4194304
Bandwidth (MB/sec):127.756

Average Latency:   0.498932
Max latency:   5.05834
Min latency:   0.030505


What puzzles me is that if I test with the rbd pool instead:


root@label5:~#  rados -p rbd bench 20 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
0   0 0 0 0 0 - 0
1  16   191   175   699.782   700  0.236737 0.0841979
2  16   397   381   761.837   824  0.065643 0.0813094
3  16   602   586   781.193   820   0.07921 0.0808584
4  16   815   799798.88   852  0.066597 0.0785906
5  16  1026  1010   807.885   844   0.10364 0.0785475
6  16  1249  1233   821.886   892  0.069324 0.0773951
7  16  1461  1445   825.608   848  0.053176 0.0770628
8  16  1680  1664   831.895   876   0.09612 0.0765263
9  16  1897  1881   835.891   868  0.100736 0.0761617
   10  16  2105  2089   835.491   832  0.114913 0.0761897
   11  16  2329  2313   840.983   896  0.042009 0.0758589
   12  16  2553  2537   845.559   896   0.07017 0.0754364
   13  16  2786  2770   852.203   932  0.066365 0.0749136
   14  16  3009  2993   855.041   892   0.06491 0.0746046
   15  16  3228  3212   856.431   876   0.05698 0.0745573
   16  16  3437  3421   855.148   836  0.062162 0.0746339
   17  16  3652  3636   855.428   860  0.140451  0.074534
   18  16  3878  3862   858.121   904  0.081505 0.0743125
   19  16  4106  4090   860.952   912  0.079922 0.0742146
min lat: 0.032342 max lat: 0.63151 avg lat: 0.0741575
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   20  16  4324  4308   861.495   872   0.06199 0.0741575
Total time run:20.102264
Total writes made: 4325
Write size:4194304
Bandwidth (MB/sec):860.600

Average Latency:   0.0743131
Max latency:   0.63151
Min latency:   0.032342


As you can see, much more stable bandwidth with this pool.

I understand the data & rbd pools probably don't use the same internals, but 
is this difference expected?


Disclaimer: by no means am I a ceph expert, I'm just experimenting with 
it, and still don't 

Re: NFS re-exporting CEPH cluster

2012-05-29 Thread madhusudhana U
Greg Farnum greg at inktank.com writes:

 
 Have you tried something and it failed? Or are you looking for suggestions?
 If the former, please report the failure. :)
 If the latter: http://ceph.com/wiki/Re-exporting_NFS
 -Greg

Greg,
I have tried the link. But, my production build (t_make) is failing on the
NFS exported ceph_cluster where as it runs fine over another NFS directory
coming from a NFS server.

Is CEPH is 100% compatible with NFS ?

Thanks
__M



Re: poor OSD performance using kernel 3.4

2012-05-29 Thread Stefan Priebe - Profihost AG
On 29.05.2012 15:01, Alexandre DERUMIER wrote:
 fio benchmark will give you raw device performance bypassing filesystem.
 
 So maybe the problem is in xfs or linux vfs layer.
 
 I think you need to bench the filesystem to compare performance
Here is another test with bonnie, which shows the same:
http://pastebin.com/raw.php?i=fGTt4NLi

Stefan


Re: poor OSD performance using kernel 3.4

2012-05-29 Thread Stefan Priebe - Profihost AG
On 29.05.2012 15:39, Yann Dupont wrote:
 On 29/05/2012 11:46, Stefan Priebe - Profihost AG wrote:
 It would be really nice if somebody from inktank can comment this whole
 sitation.

 Hello.
 I think I have the same bug :
 
 My setup is with 8 OSD nodes, 3 MDS (1 active) & 3 MON.
 All my machines are debian, using a custom 3.4.0 kernel. Ceph is
 0.47.2-1~bpo60+1 (debian package)

That sounds exactly like the same issue. Sadly nobody from Inktank
has replied to these problems in the last few days.

 As you can see, much more stable bandwidth with this pool.
That's pretty strange...

 I understand the data & rbd pools probably don't use the same internals, but
 is this difference expected?

There must be differences in pool handling.

Stefan


Re: OSD deadlock with cephfs client and OSD on same machine

2012-05-29 Thread Sage Weil
On Tue, 29 May 2012, Amon Ott wrote:
 Hello again!
 
 On Linux, if you run OSD on ext4 filesystem, have a cephfs kernel client 
 mount 
 on the same system and no syncfs system call (as to be expected with libc6 < 
 2.14 or kernel < 2.6.39), OSD deadlocks in sys_sync(). Only reboot recovers 
 the system.
 
 After some investigation in the code, this is what I found:
 In src/common/sync_filesystem.h, the function sync_filesystem() first tries a 
 syncfs() (not available), then a btrfs ioctl sync (not available with 
 non-btrfs), then finally a sync(). sys_sync tries to sync all filesystems, 
 including the journal device, the osd storage area and the cephfs mount. 
 Under some load, when OSD calls sync(), cephfs sync waits for the local osd, 
 which already waits for its storage to sync, which the kernel wants to do 
 after the cephfs sync. Deadlock.
 
 The function sync_filesystem() is called by FileStore::sync_entry() in 
 src/os/FileStore.cc, but only on non-btrfs storage and if 
 filestore_fsync_flushes_journal_data is false. After forcing this to true in 
 OSD config, our test cluster survived three days of heavy load (and still 
 running fine) instead of deadlocking all nodes within an hour. Reproduced 
 with 0.47.2 and kernel 3.2.18, but the related code seems unchanged in 
 current master.
 
 Conclusion: If you want to run OSD and cephfs kernel client on the same Linux 
 server and have a libc6 before 2.14 (e.g. Debian's newest in experimental is 
 2.13) or a kernel before 2.6.39, either do not use ext4 (but btrfs is still 
 unstable) or risk data loss by missing syncs through the workaround of 
 forcing filestore_fsync_flushes_journal_data to true.

Note that fsync_flushes_journal_data should only be set to true with ext3 
and the 'data=ordered' or 'data=journal' mount option.  It is an 
implementation artifact only that fsync() will flush all previous writes.

 Please consider putting out a fat warning at least at build time, if syncfs() 
 is not available, e.g. No syncfs() syscall, please expect a deadlock when 
 running osd on non-btrfs together with a local cephfs mount. Even better 
 would be a quick runtime test for missing syncfs() and storage on non-btrfs 
 that spits out a warning, if deadlock is possible.

I think a runtime warning makes more sense; nobody will see the build time 
warning (e.g., those who installed debs).

 As a side effect, the experienced lockup seems to be a good way to reproduce 
 the long standing bug 1047 - when our cluster tried to recover, all MDS 
 instances died with those symptoms. It seems that a partial sync of journal 
 or data partition causes that broken state.

Interesting!  If you could also note on that bug what the metadata 
workload was (what was making hard links?), that would be great!

Thanks-
sage



Re: OSD deadlock with cephfs client and OSD on same machine

2012-05-29 Thread Tommi Virtanen
On Tue, May 29, 2012 at 12:44 AM, Amon Ott a@m-privacy.de wrote:
 On Linux, if you run OSD on ext4 filesystem, have a cephfs kernel client mount
 on the same system and no syncfs system call (as to be expected with libc6 <
 2.14 or kernel < 2.6.39), OSD deadlocks in sys_sync(). Only reboot recovers
 the system.

This is the classic issue of memory pressure needing free memory to be
relieved. While syncfs(2) may make the hang less common, I do not
think having syncfs(2) is enough; nothing short of having a reserved
memory pool guaranteed to be big enough to handle the request will,
and maintaining that solution is hideously complex.

Loopback NFS suffers from the exact same thing.

Apparently using ceph-fuse is enough to move so much of the processing
to user space that the pageability of userspace memory allows the
system to recover.

Here's a fragment of the earlier conversation on this topic. Apologies
for gmane/mail clients breaking the thread, anything with that subject
line is part of the conversation:

http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/1673


Re: OSD per disk.

2012-05-29 Thread Tommi Virtanen
On Mon, May 28, 2012 at 2:34 AM, Alexandre DERUMIER aderum...@odiso.com wrote:
 maybe try

 [osd.0]
 host = testnode
 osd data = /data/osd0
 osd journal = /data/osd0/osd0journal
 osd journal size = 1000

That shouldn't be needed. osd.0 will happily read the [osd] section in
the config file.


Re: OSD per disk.

2012-05-29 Thread Tommi Virtanen
On Mon, May 28, 2012 at 2:25 AM, chandrashekhar chandub...@gmail.com wrote:
 Thanks Alexandre,  I created four directories in /data  (osd0,osd1,osd2,osd3)
 and mounted as below:

 /dev/sdb1 -> /data/osd1
 /dev/sdc1 -> /data/osd2
 /dev/sdd1 -> /data/osd3


 But when I start ceph, it's starting mons and mds daemons but not osds. Please 
 help me to get this working.

How did you create the cluster? mkcephfs?

Do you see log entries in /var/log/ceph/*osd*.log ? What do they say?


Re: distributed cluster

2012-05-29 Thread Tommi Virtanen
On Mon, May 28, 2012 at 4:28 AM, Jerker Nyberg jer...@update.uu.se wrote:
 This may not really be a subject ceph-devel mailinglist but rather a
 potential ceph-users? I hope it is ok to write here.

It's absolutely ok to talk on this mailing list about using Ceph. We
may create a separate ceph-users later on, but right now this list is
where the conversation should go.

 Let us assume we have a couple of sites distributed over a metro network
 with at least gigabit interconnect. The demands for storage capacity and
 speed at our sites are increasing together with the demands for reasonably
 stable storage. May Ceph be part of a solution?

Ceph was designed to work within a single data center. If parts of the
cluster reside in remote locations, you essentially suffer the worst
combination of their latency and bandwidth limits. A write that gets
replicated to three different data centers is not complete until the
data has been transferred to all three, and an acknowledgement has
been received.

For example: with data replicated over data centers A, B, C, connected
at 1Gb/s, the fastest all of A will ever handle writes is 0.5Gb/s --
it'll need to replicate everything to B and C, over that single pipe.
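
As a back-of-the-envelope sketch of that constraint (simple arithmetic only; the function name is made up and protocol overhead is ignored):

def max_ingest_gbps(uplink_gbps, offsite_copies):
    # A site taking client writes must forward one copy to every off-site
    # replica over its single uplink, so sustained ingest is bounded by
    # uplink bandwidth divided by the number of off-site copies.
    return uplink_gbps / offsite_copies

# TV's example: sites A, B, C on a 1 Gb/s interconnect, data replicated to all
# three, so every write entering A must also go out to both B and C.
print(max_ingest_gbps(1.0, 2))   # -> 0.5 Gb/s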

I am aware of a few people building multi-dc Ceph clusters. Some have
shared their network latency, bandwidth and availability numbers with
me (confidentially), and at first glance their wide-area network
performs better than many single-dc networks. They are far above a 1
gigabit interconnect.

I would really recommend you embark on a project like this only if you
are able to understand the Ceph replication model, and do the math for
yourself and figure out what your expected service levels for Ceph
operations would be. (Naturally, Inktank Professional Services will
help you in your endeavors, though their first response should be
"that's not a recommended setup".)

 One idea is to set up Ceph distributed over this metro network. A public
 service network is announced at all sites, anycasted from the storage
 SMB/NFS/RGW(?)-to-Ceph gateway. (for stateless connections). Statefull
 connections (iSCSI?) has to contact the individual storage gateways and
 redundancy is handled at the application level (dual path). Ceph kernel
 clients contact the storage servers directly.

The Ceph Distributed File System is not considered production ready yet.


Re: Designing a cluster guide

2012-05-29 Thread Tommi Virtanen
On Tue, May 29, 2012 at 12:25 AM, Quenten Grasso qgra...@onq.com.au wrote:
 So if we have 10 nodes vs. 3 nodes with the same amount of disks we should see 
 better write and read performance as you would have less overlap.

First of all, a typical way to run Ceph is with say 8-12 disks per
node, and an OSD per disk. That means your 3-10 node clusters actually
have 24-120 OSDs on them. The number of physical machines is not
really a factor, number of OSDs is what matters.

Secondly, 10-node or 3-node clusters are fairly uninteresting for
Ceph. The real challenge is at the hundreds, thousands and above
range.

 Now we take BTRFS into the picture as I understand journals are not necessary 
 due to the nature of the way it writes/snapshots and reads data this alone 
 would be a major performance increase on a BTRFS Raid level (like ZFS RAIDZ).

A journal is still needed on btrfs, snapshots just enable us to write
to the journal in parallel to the real write, instead of needing to
journal first.
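
A toy sketch of that difference (hypothetical millisecond figures, only meant to show why parallel journaling on btrfs helps without removing the journal):

def time_until_applied(journal_ms, datafs_ms, parallel):
    # Writeahead mode (non-btrfs) must finish the journal write before the
    # data-fs write may start; parallel mode (btrfs snapshots) issues both at
    # once. The journal itself is still required in both modes.
    if parallel:
        return max(journal_ms, datafs_ms)
    return journal_ms + datafs_ms

print(time_until_applied(5, 20, parallel=False))   # writeahead: 25 ms
print(time_until_applied(5, 20, parallel=True))    # parallel:   20 ms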


Re: distributed cluster

2012-05-29 Thread Sam Zaydel
I could see a lot of use cases for BC and DR tiers where performance
may not be as much of an issue, but availability is critical above
all.

Most options in use today rely on some form of async replication and are in most
cases quite expensive, and still do not view performance as their
primary concern.

On Tue, May 29, 2012 at 9:44 AM, Tommi Virtanen t...@inktank.com wrote:
 On Mon, May 28, 2012 at 4:28 AM, Jerker Nyberg jer...@update.uu.se wrote:
 This may not really be a subject ceph-devel mailinglist but rather a
 potential ceph-users? I hope it is ok to write here.

 It's absolutely ok to talk on this mailing list about using Ceph. We
 may create a separate ceph-users later on, but right now this list is
 where the conversation should go.

 Let us assume we have a couple of sites distributed over a metro network
 with at least gigabit interconnect. The demands for storage capacity and
 speed at our sites are increasing together with the demands for reasonably
 stable storage. May Ceph be a port of a solution?

 Ceph was designed to work within a single data center. If parts of the
 cluster reside in remote locations, you essentially suffer the worst
 combination of their latency and bandwidth limits. A write that gets
 replicated to three different data centers is not complete until the
 data has been transferred to all three, and an acknowledgement has
 been received.

 For example: with data replicated over data centers A, B, C, connected
 at 1Gb/s, the fastest all of A will ever handle writes is 0.5Gb/s --
 it'll need to replicate everything to B and C, over that single pipe.

 I am aware of a few people building multi-dc Ceph clusters. Some have
 shared their network latency, bandwidth and availability numbers with
 me (confidentially), and at first glance their wide-area network
 performs better than many single-dc networks. They are far above a 1
 gigabit interconnect.

 I would really recommend you embark on a project like this only if you
 are able to understand the Ceph replication model, and do the math for
 yourself and figure out what your expected service levels for Ceph
 operations would be. (Naturally, Inktank Professional Services will
 help you in your endeavors, though their first response should be
 that's not a recommended setup.)

 One idea is to set up Ceph distributed over this metro network. A public
 service network is announced at all sites, anycasted from the storage
 SMB/NFS/RGW(?)-to-Ceph gateway. (for stateless connections). Statefull
 connections (iSCSI?) has to contact the individual storage gateways and
 redundancy is handled at the application level (dual path). Ceph kernel
 clients contact the storage servers directly.

 The Ceph Distributed File System is not considered production ready yet.
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: poor OSD performance using kernel 3.4

2012-05-29 Thread Mark Nelson

On 05/29/2012 09:43 AM, Stefan Priebe - Profihost AG wrote:

Am 29.05.2012 15:39, schrieb Yann Dupont:

On 29/05/2012 11:46, Stefan Priebe - Profihost AG wrote:

It would be really nice if somebody from inktank can comment this whole
sitation.


Hello.
I think I have the same bug :

My setup is with 8 OSD nodes, 3 MDS (1 active)  3 MON.
All my machines are debian, using a custom 3.4.0 kernel. Ceph is
0.47.2-1~bpo60+1 (debian package)

That sounds absolutely like the same issue. Sadly nobody from inktank
has replied to this problems for the last days.


Sorry about that, yesterday was a holiday in the US.

I did some quick tests on a couple of nodes I had laying around this 
morning.


Distro: Oneiric (IE no syncfs in glibc)
Ceph: 0.46-65-gf6c5dff

1 1GbE Client node
3 1GbE Mon nodes
2 1GbE OSD nodes with 1 OSD on each mounted on a 7200rpm SAS drive.  
btrfs with -l 64k -n64k, mounted using noatime.  H700 Raid controller 
with each drive in a 1 disk raid0.  Journals are partitioned on a 
separate drive.


/proc/version:
Linux version 3.4.0-ceph (autobuild-ceph@gitbuilder-kernel-amd64)

rados -p data bench 120 write:

Total time run:120.601286
Total writes made: 2979
Write size:4194304
Bandwidth (MB/sec):98.805

Average Latency:   0.647507
Max latency:   1.39966
Min latency:   0.181663

Once I get these nodes up to 0.47 and get them switched over to 10GbE 
I'll redo the btrfs tests and try out xfs as well with longer running tests.



As you can see, much more stable bandwith with this pool.

That's pretty strange...


Indeed, that is very strange!  Can you check to see how many pgs are in 
each?  Any difference in replication level?  You can check with:


ceph osd pool get <pool> size
ceph osd pool get <pool> pg_num


I understand data  rbd pool probably don't use the same internals, but
is this difference expected ?

There must be differences in pool handling.

Stefan


Thanks,
Mark


Re: Appearing messages per 10 sec

2012-05-29 Thread Tommi Virtanen
On Mon, May 28, 2012 at 3:35 AM, Tomoki BENIYA ben...@bit-isle.co.jp wrote:
 Following messages are appearing per 10 sec on terminal of mds.1.
 But, not appearing on mds.0.

 What does these mean? And, how to stop these?
 
 Message from syslogd@mds1 at May 28 19:26:03 ...
  ceph-mds: 2012-05-28 19:26:03.497958 7fede74e5700  0 mds.0.bal   mds.0 
 mdsload[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.05 = 0 ~ 0

That looks like
https://github.com/ceph/ceph/blob/master/src/mds/MDBalancer.cc#L472
which looks like a diagnostic message only output by the root MDS.
That's why you only see it on one of your MDS servers.

It's a harmless status message. It's logged at a fairly high priority,
but you can ignore it.

The decision to output it to a console is made by your syslog daemon,
and that is the right place to configure that.
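
For what it's worth, the "Message from syslogd" prefix is the classic broadcast syslogd does for very-high-priority messages, so the knob is usually the emergency rule in the syslog configuration. A hedged example for a stock rsyslog/sysklogd setup (the file paths, rule placement and log file name are assumptions; adjust for your distro):

# /etc/rsyslog.conf (or /etc/syslog.conf; path and syntax vary by distro)

# Option 1: placed before the rules below, divert ceph-mds messages to their
# own file and discard them so they never reach the emergency broadcast rule
:programname, isequal, "ceph-mds"    /var/log/ceph-mds-syslog.log
& ~

# Option 2: this stock rule is what walls very-high-priority messages to every
# terminal; narrowing or commenting it out stops the on-screen messages
#*.emerg    *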


Re: Question regarding API doc

2012-05-29 Thread Sam Just
You are quite right.  I've updated the documentation in master, commit
f953c4c0b0ba69342cab52243c1b73987f7f94f6.  Thanks for the info!
-Sam

On Fri, May 25, 2012 at 8:08 PM, Xiaopong Tran xiaopong.t...@gmail.com wrote:
 I'm looking at the description in this API:

 http://ceph.com/docs/master/api/librados/#rados_objects_list_next

 For the parameters entry and key, the doc said (caller must free).
 I looked up in the code, and found this statement in the doc
 a bit misleading.

 Is the doc outdated, or did I miss anything?

 Cheers

 xp
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Multiple named clusters on same nodes

2012-05-29 Thread Greg Farnum
On Thursday, May 24, 2012 at 1:58 AM, Amon Ott wrote:
 On Thursday 24 May 2012 wrote Amon Ott:
  Attached is a patch based on current git stable that makes mkcephfs work
  fine for me with --cluster name. ceph-mon uses the wrong mkfs path for mon
  data (default ceph instead of supplied cluster name), so I put in a
  workaround.
  
  Please have a look and consider inclusion as well as fixing mon data path.
  Thanks.
 
 
 
 And another patch for the init script to handle multiple clusters.

Amon:
Thanks for the patches! Unfortunately nobody who's competent to review these 
(i.e., not me) has time to look into them right now, but they're in the queue 
for when TV or Sage gets some time. :)
-Greg




Re: poor OSD performance using kernel 3.4

2012-05-29 Thread Yann Dupont

On 29/05/2012 19:50, Mark Nelson wrote:


1 1GbE Client node
3 1GbE Mon nodes
2 1GbE OSD nodes with 1 OSD on each mounted on a 7200rpm SAS drive.  
btrfs with -l 64k -n64k, mounted using noatime.  H700 Raid controller 
with each drive in a 1 disk raid0.  Journals are partitioned on a 
separate drive.



Hello,
Forgot to mention I'm using 10GbE, and the FS is btrfs with -l 64k -n64k, 
but also space_cache,compress=lzo,nobarrier,noatime.

journal is on tmpfs :

 osd journal = /dev/shm/journal
 osd journal size = 6144

Remember it's not a production system for the moment. I'm just trying to 
evaluate the best performance I can get (and whether the system is 
stable enough to start alpha/pre-production services). BTW, I noticed 
OSDs using XFS are much, much slower than OSDs with btrfs right now, 
particularly in rbd tests. btrfs has some stability problems, even if 
with newer kernels it seems better.



/proc/version:
Linux version 3.4.0-ceph (autobuild-ceph@gitbuilder-kernel-amd64)

rados -p data bench 120 write:

Total time run:120.601286
Total writes made: 2979
Write size:4194304
Bandwidth (MB/sec):98.805

Average Latency:   0.647507
Max latency:   1.39966
Min latency:   0.181663

Once I get these nodes up to 0.47 and get them switched over to 10GbE 
I'll redo the btrfs tests and try out xfs as well with longer running 
tests.



As you can see, much more stable bandwith with this pool.

That's pretty strange...


Indeed, that is very strange!  Can you check to see how many pgs are 
in each?  Any difference in replication level?  You can check with:


ceph osd pool get <pool> size

root@label5:~# ceph osd pool get data size
don't know how to get pool field size
root@label5:~# ceph osd pool get rbd size
don't know how to get pool field size

Is size the right name for the field? In the wiki, size isn't listed 
as a valid field.



ceph osd pool get <pool> pg_num


root@label5:~# ceph osd pool get rbd pg_num
PG_NUM: 576
root@label5:~# ceph osd pool get data pg_num
PG_NUM: 576


The pg num is quite low because I started with small OSDs (9 osds with 200G 
each - internal disks) when I formatted. Now I have reduced to 8 osds (osd.4 
is out) but with much larger (& faster) storage. 6 OSDs have 5T on them, 2 
still have 200G, but they are planned to migrate before the end of the week.


I try, for the moment, to keep the OSDs similar. Replication is set to 2.

No OSD is full, I don't have much data stored for the moment.

Concerning crush map, I'm not using the default one :

The 8 nodes are in 3 different locations (some kilometers away). 2 are 
in one place, 2 in another, and the last 4 in the principal place.
I try to group hosts together to avoid problems when I lose a location 
(an electrical problem, for example). Not sure I really customized the 
crush map as I should have.


here is the map :
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8

# types
type 0 osd
type 1 host
type 2 rack
type 3 pool

# buckets
host karuizawa {
id -5# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.2 weight 1.000
}
host hazelburn {
id -6# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.3 weight 1.000
}
rack loire {
id -3# do not change unnecessarily
# weight 2.000
alg straw
hash 0# rjenkins1
item karuizawa weight 1.000
item hazelburn weight 1.000
}
host carsebridge {
id -8# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.5 weight 1.000
}
host cameronbridge {
id -9# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.6 weight 1.000
}
rack chantrerie {
id -7# do not change unnecessarily
# weight 2.000
alg straw
hash 0# rjenkins1
item carsebridge weight 1.000
item cameronbridge weight 1.000
}
host chichibu {
id -2# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.0 weight 1.000
}
host glenesk {
id -4# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.1 weight 1.000
}
host braeval {
id -10# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.7 weight 1.000
}
host hanyu {
id -11# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.8 weight 1.000
}
rack lombarderie {
id -12# do not change unnecessarily
# weight 4.000
alg straw
hash 0# rjenkins1
item chichibu weight 1.000
item glenesk weight 1.000
item braeval weight 1.000
item hanyu weight 1.000
}
pool default {
id -1# do 

Re: poor OSD performance using kernel 3.4

2012-05-29 Thread Stefan Priebe

On 29.05.2012 19:50, Mark Nelson wrote:

Once I get these nodes up to 0.47 and get them switched over to 10GbE
I'll redo the btrfs tests and try out xfs as well with longer running
tests.
I always test on 1GbE and see this problem no matter whether btrfs or xfs. 
So I think this is just a waste of time.


At least my tests differ, as I see this problem on ALL pools.

Mark, should I try 0.46?

Thanks,
Stefan


Re: poor OSD performance using kernel 3.4

2012-05-29 Thread Stefan Priebe

On 29.05.2012 19:50, Mark Nelson wrote:

I did some quick tests on a couple of nodes I had laying around this
morning.


I just noticed that I get a constant rate of 40MB/s while using 1 
thread. When I use two threads or more, I get drops to 0MB/s and crazy 
jumping values.


~# rados -p rbd bench 90 write -t 1
Maintaining 1 concurrent writes of 4194304 bytes for at least 90 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
0   0 0 0 0 0 - 0
1   110 935.99436  0.100147  0.101133
2   12019   37.993140  0.096893  0.100719
3   13130   39.992144   0.09784 0.0999607
4   14140   39.992940  0.099156 0.0999003
5   15150   39.993240  0.098239 0.0996518
6   16160   39.993240  0.098682 0.0994851
7   17170   39.993340  0.094397  0.099184
8   18180   39.993140  0.099823 0.0993327
9   19190   39.993140  0.101013 0.0992236
   10   1   101   10039.99340  0.098277  0.099237



# rados -p rbd bench 90 write -t 2
Maintaining 2 concurrent writes of 4194304 bytes for at least 90 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
0   0 0 0 0 0 - 0
1   21513   51.9888520.0956  0.115315
2   22220   39.992828  0.120065  0.193125
3   24139   51.991776   0.09557   0.15246
4   25856   55.991268   0.09875  0.137688
5   2676551.99236  0.111211  0.139465
6   28583   55.325172  0.136967  0.143079
7   2   10199   56.562564  0.098664  0.136263
8   2   10199   49.4919 0 -  0.136263
9   2   112   110   48.880822  0.099479  0.160563

Stefan


Re: poor OSD performance using kernel 3.4

2012-05-29 Thread Yann Dupont

On 29/05/2012 23:08, Stefan Priebe wrote:

Am 29.05.2012 19:50, schrieb Mark Nelson:

I did some quick tests on a couple of nodes I had laying around this
morning.


I just noticed that i get a constant rate of 40MB/s while using 1 
thread. When i use two thread or more i get drop to 0MB/s and crazy 
jumping values.


~# rados -p rbd bench 90 write -t 1
Maintaining 1 concurrent writes of 4194304 bytes for at least 90 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
0   0 0 0 0 0 - 0
1   110 935.99436  0.100147  0.101133
2   12019   37.993140  0.096893  0.100719
3   13130   39.992144   0.09784 0.0999607
4   14140   39.992940  0.099156 0.0999003
5   15150   39.993240  0.098239 0.0996518
6   16160   39.993240  0.098682 0.0994851
7   17170   39.993340  0.094397  0.099184
8   18180   39.993140  0.099823 0.0993327
9   19190   39.993140  0.101013 0.0992236
   10   1   101   10039.99340  0.098277  0.099237




not here :

on data :
root@label5:~# rados -p data bench 20 write -t 1
Maintaining 1 concurrent writes of 4194304 bytes for at least 20 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
0   0 0 0 0 0 - 0
1   11514   55.983756  0.096813 0.0677311
2   13332   63.985272  0.088802 0.0612602
3   15150   66.652972  0.056883 0.0594909
4   1605958.98936  0.046377 0.0577145
5   16059   47.1916 0 - 0.0577145
6   17978   51.991138  0.041831 0.0768918
7   1989755.41976  0.050436 0.0718439
8   1   101   100   49.991912  0.043673 0.0712079
9   1   101   100   44.4375 0 - 0.0712079
   10   1   115   114   45.592928  0.043768 0.0876947
   11   1   134   13348.35676  0.052382 0.0826428
   12   1   154   153   50.991980  0.042077 0.0783619
   13   1   175   174   53.529984  0.053474 0.0745956
   14   1   194   193   55.133976  0.049631 0.0724711
   15   1   211   21055.99168  0.052683 0.0712887
   16   1   232   231   57.740784  0.044341 0.0692121
   17   1   249   248   58.343668  0.053707 0.0684414
   18   1   258   25757.10236  0.086088 0.0680656
   19   1   267   266   55.991136  0.050902 0.0713341
min lat: 0.033395 max lat: 2.14757 avg lat: 0.0703545
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   20   1   285   284   56.790972  0.047755 0.0703545
Total time run:20.066134
Total writes made: 286
Write size:4194304
Bandwidth (MB/sec):57.011

on rbd :


Maintaining 1 concurrent writes of 4194304 bytes for at least 20 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
0   1 1 0 0 0 - 0
1   11817   67.980168  0.065869 0.0587313
2   13534   67.984268  0.056982 0.0580468
3   15554   71.984880  0.050305 0.0554721
4   17271   70.985868  0.039387 0.0561269
5   1919071.98676  0.055236 0.0554057
6   1   109   108   71.986472  0.069547 0.0554112
7   1   126   125   71.415468  0.049234 0.0556564
8   1   146   145   72.486880  0.052302 0.0551064
9   1   165   164   72.8758760.0533 0.0548858
   10   1   184   18373.18776  0.041342 0.0543598
   11   1   202   20173.07872  0.048963 0.0544978
   12   1   218   217   72.320764  0.071926 0.0549402
   13   1   236   235   72.295172  0.055804 0.0551936
   14   1   254   253   72.273172  0.058315 0.0552612
   15   1   272   271   72.254172  0.047687 0.0552036
   16   1   290   289   72.237572  0.059162  0.055275
   17   1   308   307   72.222972  0.051991 0.0553467
   18   1   327   32672.43276  0.053271 0.0552114
   19   1   346   345   72.619276  0.058125 0.0550658
min lat: 

Re: poor OSD performance using kernel 3.4

2012-05-29 Thread Stefan Priebe

On 29.05.2012 23:31, Yann Dupont wrote:

on the contrary, pool data is jumping up & down, no matter how many
threads are involved :)

Maybe this is because journal is too tight ? Or because 2 of the 8 nodes
have slower disks ?
Can you try with 3.0.X? I would be really interested in what happens in 
this case.


Stefan


Re: poor OSD performance using kernel 3.4

2012-05-29 Thread Mark Nelson

On 05/29/2012 04:08 PM, Stefan Priebe wrote:

Am 29.05.2012 19:50, schrieb Mark Nelson:

I did some quick tests on a couple of nodes I had laying around this
morning.


I just noticed that i get a constant rate of 40MB/s while using 1 
thread. When i use two thread or more i get drop to 0MB/s and crazy 
jumping values.


~# rados -p rbd bench 90 write -t 1
Maintaining 1 concurrent writes of 4194304 bytes for at least 90 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
0   0 0 0 0 0 - 0
1   110 935.99436  0.100147  0.101133
2   12019   37.993140  0.096893  0.100719
3   13130   39.992144   0.09784 0.0999607
4   14140   39.992940  0.099156 0.0999003
5   15150   39.993240  0.098239 0.0996518
6   16160   39.993240  0.098682 0.0994851
7   17170   39.993340  0.094397  0.099184
8   18180   39.993140  0.099823 0.0993327
9   19190   39.993140  0.101013 0.0992236
   10   1   101   10039.99340  0.098277  0.099237




When you are using 1 thread, you are hitting a ~40MB/s limit (probably 
networking related) before the data gets to the journal.  Because (in 
this case) the filestore data disk can handle that throughput, 
everything looks nice and consistent.




# rados -p rbd bench 90 write -t 2
Maintaining 2 concurrent writes of 4194304 bytes for at least 90 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
0   0 0 0 0 0 - 0
1   21513   51.9888520.0956  0.115315
2   22220   39.992828  0.120065  0.193125
3   24139   51.991776   0.09557   0.15246
4   25856   55.991268   0.09875  0.137688
5   2676551.99236  0.111211  0.139465
6   28583   55.325172  0.136967  0.143079
7   2   10199   56.562564  0.098664  0.136263
8   2   10199   49.4919 0 -  0.136263
9   2   112   110   48.880822  0.099479  0.160563



In this case, that 40MB/s limit with 1 thread has increased.  Now more 
data is getting fed into the journal than the filestore can write out to 
disk.  Eventually writes stall while the data is being written out.
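
Mark's explanation can be sketched with a toy model (all rates and the journal size below are made up, nothing here is measured) showing how a journal that fills faster than the filestore drains produces exactly these bursts and 0 MB/s stalls:

def simulate(seconds, ingest_mb_s, filestore_mb_s, journal_mb):
    # Data enters the journal at the client rate and drains at the filestore
    # rate; once the journal cannot take a full second's worth, writes stall
    # (0 MB/s) until the backlog has drained back below half the journal.
    backlog, stalled, rates = 0.0, False, []
    for _ in range(seconds):
        if stalled and backlog <= journal_mb / 2:
            stalled = False
        if not stalled and backlog + ingest_mb_s <= journal_mb:
            took = ingest_mb_s
        else:
            took, stalled = 0.0, True
        backlog = max(0.0, backlog + took - filestore_mb_s)
        rates.append(int(took))
    return rates

# roughly the two cases from the bench output above, with invented disk numbers:
print(simulate(20, ingest_mb_s=40, filestore_mb_s=50, journal_mb=500))  # steady 40s
print(simulate(40, ingest_mb_s=55, filestore_mb_s=35, journal_mb=500))  # bursts, then 0s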

Stefan




Re: RBD format changes and layering

2012-05-29 Thread Tommi Virtanen
On Fri, May 25, 2012 at 4:07 PM, Josh Durgin josh.dur...@inktank.com wrote:
 To check whether children exist, you can iterate over
 all the pools and check the rbd_clones object in each one.
 Since the number of pools is relatively small, this isn't
 very expensive. If the pool is deleted, by definition all the children in it
 are deleted.

 With separate namespaces in the future, this will be a bit more
 expensive, but it's only needed at base image deletion time,
 which is relatively rare. Deleting the image itself already
 requires an I/O per object, so this is probably not the slow
 part anyway.

 Yehuda, Tv, did I miss anything?

One thing: that's still racy, and we discussed a solution.

1. A: walk through all pools, look for clones, find none
2. B: create a clone
3. A: rbd unpreserve parent
4. A: rbd rm parent

Oopsie.

To avoid that, I proposed a "deleting" flag. Clones can only be
created when parent is preserved && !deleting. Now,

1. A: rbd deleting parent
2. A: walk through all pools, look for clones, find none
3. B: attempt to create a clone, fails
...

Now, that doesn't have to be strictly "deleting"... "going_unpreserved"
or something; instead of deletion, the intended operation might be
starting a vm against the parent image to e.g. add security updates.

And, as we discussed, these flags would be per snapshot (or also on
the master image, if you want to support that). Thus, one snapshot can
preserved while an older one is scheduled for deletion.
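
A small sketch of that protocol (Python, with an in-memory stand-in for the real librbd/rados state; the class and method names are hypothetical):

class ParentSnapshot:
    # Toy model of the scheme above: clones may only be created while the
    # snapshot is preserved and not marked deleting / going_unpreserved.
    def __init__(self):
        self.preserved = True
        self.deleting = False
        self.clones = set()

    def create_clone(self, name):
        if not (self.preserved and not self.deleting):
            raise RuntimeError("parent not preserved or scheduled for deletion")
        self.clones.add(name)

    def delete(self):
        self.deleting = True            # step 1: block new clones first
        if self.clones:                 # step 2: only now scan for children
            self.deleting = False       # back out; children still reference us
            raise RuntimeError("parent still has clones")
        self.preserved = False          # step 3: safe to unpreserve and remove

parent = ParentSnapshot()
parent.create_clone("vm-image-1")       # fine: preserved and not deleting
try:
    parent.delete()                     # refused: a child still references it
except RuntimeError as e:
    print("delete refused:", e)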


Re: poor OSD performance using kernel 3.4

2012-05-29 Thread Mark Nelson

On 05/24/2012 09:10 AM, Stefan Priebe - Profihost AG wrote:

Hi list,

today while testing btrfs i discovered a very poor osd performance using
kernel 3.4.

Underlying FS is XFS but it is the same with btrfs.

3.0.30:
~# rados -p data bench 10 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 0   0 0 0 0 0 - 0
 1  164125   99.9767   100  0.586984  0.447293
 2  167155   109.979   120  0.934388  0.488375
 3  169983   110.647   112   1.15982  0.503111
 4  16   130   114   113.981   124   1.05952  0.516925
 5  16   159   143   114.382   116  0.149313  0.510734
 6  16   188   172   114.649   116  0.287166   0.52203
 7  16   215   199   113.697   108  0.151784  0.531461
 8  16   242   226   112.984   108  0.623478  0.539896
 9  16   265   249   110.65192   0.50354  0.538504
10  16   296   280   111.984   124  0.155048  0.542846
Total time run:10.776153
Total writes made: 297
Write size:4194304
Bandwidth (MB/sec):110.243

Average Latency:   0.577534
Max latency:   1.85499
Min latency:   0.091473


3.4:
~# rados -p data bench 10 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 0   0 0 0 0 0 - 0
 1  164024   95.979496  0.393196  0.455936
 2  166852   103.983   112  0.835652  0.517297
 3  168569   91.984968   1.00535  0.493058
 4  169680   79.986944  0.096564  0.577948
 5  16   10387   69.587928  0.092722  0.589147
 6  16   117   101   67.321656  0.222175  0.675334
 7  16   130   114   65.132152   0.15677  0.623806
 8  16   144   128   63.989656  0.089157   0.56746
 9  16   144   128   56.8794 0 -   0.56746
10  16   144   128   51.1912 0 -   0.56746
11  16   144   128   46.5373 0 -   0.56746
12  16   144   128   42.6591 0 -   0.56746
13  16   144   128   39.3776 0 -   0.56746
14  16   144   128   36.5649 0 -   0.56746
15  16   144   128   34.1272 0 -   0.56746
16  16   145   129   32.2443   0.5   11.3422  0.650985
Total time run:16.193871
Total writes made: 145
Write size:4194304
Bandwidth (MB/sec):35.816

Average Latency:   1.78467
Max latency:   14.4744
Min latency:   0.088753

Stefan


I set up some tests today to try to replicate your findings (and also 
check the results against some previous ones I've done).  I don't think I'm 
seeing exactly the same results as you, but I definitely see xfs 
performing worse in this specific test than btrfs.  I've included the 
results here.


Distro: Ubuntu Oneiric (IE no syncfs in glibc)
Ceph: 0.47.2
Kernel 3.4.0-ceph (autobuild-ceph@gitbuilder-kernel-amd64)
Network: 10GbE

1 Client node
3 Mon nodes
2 OSD nodes with 1 OSD each mounted on a 7200rpm SAS drive.  H700 Raid 
controller with each drive in a 1 disk raid0.  Journals are partitioned 
on a separate drive.  OSD data disks are using WT cache while journals 
are using WB.

btrfs created with -l 64k -n64k, mounted using noatime.
xfs created with -f -d su=64k,sw=1 -i size=2048, mounted using noatime.
rados bench invocation: rados -p data bench 300 write -t 16 -b 4194304

btrfs:

Total time run:300.413696
Total writes made: 7582
Write size:4194304
Bandwidth (MB/sec):100.954

Average Latency:   0.633932
Max latency:   3.78661
Min latency:   0.065734

xfs:

Total time run:304.435966
Total writes made: 5023
Write size:4194304
Bandwidth (MB/sec):65.997

Average Latency:   0.96965
Max latency:   36.4993
Min latency:   0.07516

Full results are available here:

http://nhm.ceph.com/results/mailinglist-tests/

I created seekwatcher movies by running blktrace on the underlying OSD 
data disks during the tests.  These show throughput over time, 
seeks/sec, and a visual representation of where the disk is being written 
to for each OSD.  You can 

Kernel crash bug status

2012-05-29 Thread Nick Bartos
We are still seeing a crash on 0.47.2 with 3.2.18, which seems to be this bug:

http://tracker.newdream.net/issues/2260

Any one else seeing this problem, and/or have any ideas how to fix or
work around it?