[ceph-users] Samsung DC SV843 SSD

2016-09-14 Thread Quenten Grasso
Hi Everyone,

I'm looking for some SSDs for our cluster and came across these Samsung DC 
SV843 SSDs. I noticed in the mailing list archives from a while back that some 
people were talking about them.

Just wondering if anyone ended up using them and how they are going?

Thanks in advance,

Regards,
Quenten 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Single OSD down

2015-04-20 Thread Quenten Grasso
r_entry done
2015-04-16 00:43:10.198391 7f963a08c780 20 -- 10.100.128.13:6800/43939 wait: 
stopped reaper thread
2015-04-16 00:43:10.198399 7f963a08c780 10 -- 10.100.128.13:6800/43939 wait: 
closing pipes
2015-04-16 00:43:10.198401 7f963a08c780 10 -- 10.100.128.13:6800/43939 reaper
2015-04-16 00:43:10.198406 7f963a08c780 10 -- 10.100.128.13:6800/43939 reaper 
done
2015-04-16 00:43:10.198409 7f963a08c780 10 -- 10.100.128.13:6800/43939 wait: 
waiting for pipes  to close
2015-04-16 00:43:10.198411 7f963a08c780 10 -- 10.100.128.13:6800/43939 wait: 
done.
2015-04-16 00:43:10.198413 7f963a08c780  1 -- 10.100.128.13:6800/43939 shutdown 
complete.
2015-04-16 00:43:10.198416 7f963a08c780 10 -- 10.100.96.13:6830/43939 wait: 
waiting for dispatch queue
2015-04-16 00:43:10.198429 7f963a08c780 10 -- 10.100.96.13:6830/43939 wait: 
dispatch queue is stopped
2015-04-16 00:43:10.198433 7f963a08c780 20 -- 10.100.96.13:6830/43939 wait: 
stopping accepter thread
2015-04-16 00:43:10.198436 7f963a08c780 10 accepter.stop accepter
2015-04-16 00:43:10.198450 7f962558d700 20 accepter.accepter poll got 1
2015-04-16 00:43:10.198457 7f962558d700 20 accepter.accepter closing
2015-04-16 00:43:10.198465 7f962558d700 10 accepter.accepter stopping
2015-04-16 00:43:10.198495 7f963a08c780 20 -- 10.100.96.13:6830/43939 wait: 
stopped accepter thread
2015-04-16 00:43:10.198500 7f963a08c780 20 -- 10.100.96.13:6830/43939 wait: 
stopping reaper thread
2015-04-16 00:43:10.198517 7f96347d6700 10 -- 10.100.96.13:6830/43939 
reaper_entry done
2015-04-16 00:43:10.198565 7f963a08c780 20 -- 10.100.96.13:6830/43939 wait: 
stopped reaper thread
2015-04-16 00:43:10.198578 7f963a08c780 10 -- 10.100.96.13:6830/43939 wait: 
closing pipes
2015-04-16 00:43:10.198581 7f963a08c780 10 -- 10.100.96.13:6830/43939 reaper
2015-04-16 00:43:10.198583 7f963a08c780 10 -- 10.100.96.13:6830/43939 reaper 
done
2015-04-16 00:43:10.198586 7f963a08c780 10 -- 10.100.96.13:6830/43939 wait: 
waiting for pipes  to close
2015-04-16 00:43:10.198588 7f963a08c780 10 -- 10.100.96.13:6830/43939 wait: 
done.
2015-04-16 00:43:10.198590 7f963a08c780  1 -- 10.100.96.13:6830/43939 shutdown 
complete.


Full OSD log below
https://drive.google.com/file/d/0B578d6cBmDPYQ1lCMUR2Y0tLNTA/view?usp=sharing


Regards,
Quenten Grasso

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] use ZFS for OSDs

2015-04-14 Thread Quenten Grasso
Hi Michal,

Really nice work on the ZFS testing.

I've been thinking about this myself from time to time; however, I wasn't sure 
if ZoL was ready for production use with Ceph.

Rather than running multiple OSDs per host in zfs/ceph, I would like to see a 
single raidz2 of, say, 8-12 3-4TB spinners per OSD, leveraging a nice SSD (maybe 
a P3700 400GB) for the ZIL/L2ARC, with compression enabled, and going back to 
2x replicas - which could then give us some pretty fast/safe/efficient storage.

Now to find that money tree.

Regards,
Quenten Grasso

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michal 
Kozanecki
Sent: Friday, 10 April 2015 5:15 AM
To: Christian Balzer; ceph-users
Subject: Re: [ceph-users] use ZFS for OSDs

I had surgery and have been off for a while. Had to rebuild test ceph+openstack 
cluster with whatever spare parts I had. I apologize for the delay for anyone 
who's been interested.

Here are the results;
==
Hardware/Software
3 node CEPH cluster, 3 OSDs (one OSD per node)
--
CPU = 1x E5-2670 v1
RAM = 8GB
OS Disk = 500GB SATA
OSD = 900GB 10k SAS (sdc - whole device)
Journal = Shared Intel SSD DC3500 80GB (sdb1 - 10GB partition)
ZFS log = Shared Intel SSD DC3500 80GB (sdb2 - 4GB partition)
ZFS L2ARC = Intel SSD 320 40GB (sdd - whole device)
-
ceph 0.87
ZoL 0.63
CentOS 7.0

2 node KVM/Openstack cluster

CPU = 2x Xeon X5650
RAM = 24 GB
OS Disk = 500GB SATA
-
Ubuntu 14.04
OpenStack Juno

The rough performance of this oddball-sized test ceph cluster is 1000-1500 
IOPS at 8k.

==
Compression; (cut out unneeded details)
Various Debian and CentOS images, with lots of test SVN and GIT data 
KVM/OpenStack

[root@ceph03 ~]# zfs get all SAS1
NAME  PROPERTY          VALUE  SOURCE
SAS1  used              586G   -
SAS1  compressratio     1.50x  -
SAS1  recordsize        32K    local
SAS1  checksum          on     default
SAS1  compression       lz4    local
SAS1  refcompressratio  1.50x  -
SAS1  written           586G   -
SAS1  logicalused       877G   -

==
Dedupe; (dedupe is enabled at the dataset level, but the space savings can only 
be viewed at the pool level - a bit odd, I know) Various Debian and CentOS 
images, with lots of test SVN and GIT data, KVM/OpenStack

[root@ceph01 ~]# zpool get all SAS1
NAME  PROPERTY   VALUE  SOURCE
SAS1  size   836G   -
SAS1  capacity   70%-
SAS1  dedupratio 1.02x  -
SAS1  free   250G   -
SAS1  allocated  586G   -

==
Bitrot/Corruption;
Injected random data to random locations (changed seek to random value) of sdc 
with;

dd if=/dev/urandom of=/dev/sdc seek=54356 bs=4k count=1

Results;

1. ZFS detects an error on disk affecting PG files; as this is a single vdev 
(no raidz or mirror) it cannot automatically fix it. It blocks all access 
(except delete) to the affected files, making them inaccessible. 
*note: I ran this status after already repairing 2 PGs (5.15 and 5.25); ZFS 
status will no longer list a filename after it has been 
repaired/deleted/cleared*
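
(The PG repairs mentioned in the note would typically have been done along 
these lines - a sketch, not commands copied from this test:)

ceph pg deep-scrub 5.15    # re-check the inconsistent PG
ceph pg repair 5.15        # then ask the OSDs to repair it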



[root@ceph01 ~]# zpool status -v
  pool: SAS1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub in progress since Thu Apr  9 13:04:54 2015
153G scanned out of 586G at 40.3M/s, 3h3m to go
0 repaired, 26.05% done
config:

NAME      STATE     READ WRITE CKSUM
SAS1      ONLINE       0     0    35
  sdc     ONLINE       0     0    70
logs
  sdb2    ONLINE       0     0     0
cache
  sdd     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files: 

/SAS1/current/5.e_head/DIR_E/DIR_0/DIR_6/rbd\udata.2ba762ae8944a.24cc__head_6153260E__5



2. CEPH-OSD cannot read PG file. Kicks off scrub/deep-scrub



/var/log/ceph/ceph-osd.2.log
2015-04-09 13:10:18.319312 7fcbb163a700 -1 log_channel(default) log [ERR] : 
5.18 shard 1: soid cd635018/rbd_data.93d1f74b0dc51.18ee/head//5 
candidate had a read error, digest 1835988768 != known digest 473354757
2015-04-09 13:11:38.587014 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 
5.1

Re: [ceph-users] Consumer Grade SSD Clusters

2015-01-27 Thread Quenten Grasso
Hi Nick,

Agreed, I see your point: basically once you're past the 150TBW, or whatever 
that number may be, you're effectively just waiting for failure - but aren't we 
anyway?

I guess it depends on your use case at the end of the day. I wonder what the 
likes of Amazon, Rackspace etc. are doing in the way of SSDs; either they are 
buying them so cheap per GB due to the "volume", or they are possibly using 
"consumer grade" SSDs.

Hmm.. consumer grade SSDs may be an interesting option if you have decent 
monitoring and alerting; using SMART you should still be able to see how much 
spare flash you have available.
As suggested by Wido, using multiple brands would help remove the possible 
cascading failure effect, which I guess we all should be doing anyway on our 
spinners.
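
(For the SMART-based monitoring mentioned above, something like this is a 
reasonable sketch - attribute names vary by vendor and the device path is only 
an example:)

# check remaining flash endurance / wear-levelling indicators
smartctl -A /dev/sda | grep -Ei 'wear|media_wearout|percent_lifetime|lbas_written'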

I guess we have to decide whether it's worth the extra effort in the long run 
vs running enterprise SSDs.

Regards,
Quenten Grasso

From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Saturday, 24 January 2015 7:33 PM
To: Quenten Grasso; ceph-users@lists.ceph.com
Subject: RE: Consumer Grade SSD Clusters

Hi Quenten,

There is no real answer to your question. It really depends on how busy your 
storage will be and particularly if it is mainly reads or writes.

I wouldn't pay too much attention to that SSD endurance test; whilst it's great 
to know that they have a lot more headroom than their official specs, you run 
the risk of a spectacular multiple disk failure if you intend to run them all 
that high. You can probably guarantee that as 1 SSD starts to fail, the 
increase in workload to re-balance the cluster will cause failures on the rest.

I guess it really comes down to how important is the availability of your data. 
Whilst an average pc user might balk at the price of paying 4 times per GB more 
for a S3700 SSD, in the enterprise world they are still comparatively cheap.

The other thing you need to be aware of is that most consumer SSD's don't have 
power loss protection, again if you are mainly doing reads and cost is more 
important than availability, there may be an argument to use them.

Nick

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Quenten Grasso
Sent: 24 January 2015 09:13
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Consumer Grade SSD Clusters

Hi Everyone,

Just wondering if anyone has had any experience in using consumer grade SSD's 
for a Ceph cluster?

I came across this article 
http://techreport.com/review/26523/the-ssd-endurance-experiment-casualties-on-the-way-to-a-petabyte/3

They have been testing different SSDs' write endurance and have been able to 
write 1PB+ to a Samsung 840 Pro 256GB, which is only "rated" at 150TBW; of 
course other SSDs have failed well before 1PBW. So definitely worth a read.

So I've been thinking about using consumer grade SSDs for OSDs and enterprise 
SSDs for journals.

The reasoning is that enterprise SSDs are a lot faster at journaling than 
consumer grade drives, plus this would effectively halve the overall write 
requirements on the consumer grade disks.

This could also be a cost effective alternative to using enterprise SSDs as 
OSDs; if you're happy to use 2x replication it's a pretty good cost saving, 
however with 3x replication not so much.

Cheers,
Quenten Grasso



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD removal rebalancing again

2015-01-26 Thread Quenten Grasso
Hi Christian,

Ahh yes. The overall host weight changed when removing the OSD, as all OSDs 
make up the host weight; removal of the OSD then decreased the host weight, 
which in turn triggered the rebalancing.

I guess it would have made more sense if setting the OSD as "out" caused the 
same effect earlier, instead of after removing the already emptied disk. 
*frustrating*

So would it be possible/recommended to "statically" set the host weight to 11 
in this case, so that once the removal from CRUSH happens it shouldn't cause a 
rebalance, because it's already been rebalanced anyway?
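
(For reference, pinning a bucket weight by hand generally means editing the 
CRUSH map directly - a sketch, not something tested in this thread:)

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt and set the desired weight on the host bucket entry
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new

# Alternatively, draining the OSD via a CRUSH weight of 0 before removal
# avoids the second rebalance altogether (osd.0 used here only as the example):
ceph osd crush reweight osd.0 0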

Regards,
Quenten Grasso


-Original Message-
From: Christian Balzer [mailto:ch...@gol.com] 
Sent: Tuesday, 27 January 2015 11:53 AM
To: ceph-users@lists.ceph.com
Cc: Quenten Grasso
Subject: Re: [ceph-users] OSD removal rebalancing again

On Tue, 27 Jan 2015 01:37:52 + Quenten Grasso wrote:

> Hi Christian,
> 
> As you'll probably notice we have 11,22,33,44 marked as out as well. 
> but here's our tree.
> 
> all of the OSD's in question had already been rebalanced/emptied from 
> the hosts. osd.0 existed on pbnerbd01
> 
Ah, lemme re-phrase that then, I was assuming a simpler scenario. 

Same reasoning: by removing the OSD, the weight (not reweight) of the host 
changed (from 11 to 10) and that then triggered the re-balancing. 

Clear as mud? ^.^

Christian

> 
> # ceph osd tree
> # id    weight  type name       up/down reweight
> -1  54  root default
> -3  54  rack unknownrack
> -2  10  host pbnerbd01
> 1   1   osd.1   up  1
> 10  1   osd.10  up  1
> 2   1   osd.2   up  1
> 3   1   osd.3   up  1
> 4   1   osd.4   up  1
> 5   1   osd.5   up  1
> 6   1   osd.6   up  1
> 7   1   osd.7   up  1
> 8   1   osd.8   up  1
> 9   1   osd.9   up  1
> -4  11  host pbnerbd02
> 11  1   osd.11  up  0
> 12  1   osd.12  up  1
> 13  1   osd.13  up  1
> 14  1   osd.14  up  1
> 15  1   osd.15  up  1
> 16  1   osd.16  up  1
> 17  1   osd.17  up  1
> 18  1   osd.18  up  1
> 19  1   osd.19  up  1
> 20  1   osd.20  up  1
> 21  1   osd.21  up  1
> -5  11  host pbnerbd03
> 22  1   osd.22  up  0
> 23  1   osd.23  up  1
> 24  1   osd.24  up  1
> 25  1   osd.25  up  1
> 26  1   osd.26  up  1
> 27  1   osd.27  up  1
> 28  1   osd.28  up  1
> 29  1   osd.29  up  1
> 30  1   osd.30  up  1
> 31  1   osd.31  up  1
> 32  1   osd.32  up  1
> -6  11  host pbnerbd04
> 33  1   osd.33  up  0
> 34  1   osd.34  up  1
> 35  1   osd.35  up  1
> 36  1   osd.36  up  1
> 37  1   osd.37  up  1
> 38  1   osd.38  up  1
> 39  1   osd.39  up  1
> 40  1   osd.40  up  1
> 41  1   osd.41  up  1
> 42  1   osd.42  up  1
> 43  1   osd.43  up  1
> -7  11  host pbnerbd05
> 44  1   osd.44  up  0
> 45  1   osd.45  up  1
> 46  1   osd.46  up  1
> 47  1   osd.47  up  1
> 48  1   osd.48  up  1
> 49  1   osd.49  up  1
> 50  1   osd.50  

Re: [ceph-users] OSD removal rebalancing again

2015-01-26 Thread Quenten Grasso
Hi Christian,

As you'll probably notice, we have 11, 22, 33 and 44 marked as out as well, but 
here's our tree.

All of the OSDs in question had already been rebalanced/emptied from the 
hosts. osd.0 existed on pbnerbd01.


# ceph osd tree
# id    weight  type name       up/down reweight
-1  54  root default
-3  54  rack unknownrack
-2  10  host pbnerbd01
1   1   osd.1   up  1
10  1   osd.10  up  1
2   1   osd.2   up  1
3   1   osd.3   up  1
4   1   osd.4   up  1
5   1   osd.5   up  1
6   1   osd.6   up  1
7   1   osd.7   up  1
8   1   osd.8   up  1
9   1   osd.9   up  1
-4  11  host pbnerbd02
11  1   osd.11  up  0
12  1   osd.12  up  1
13  1   osd.13  up  1
14  1   osd.14  up  1
15  1   osd.15  up  1
16  1   osd.16  up  1
17  1   osd.17  up  1
18  1   osd.18  up  1
19  1   osd.19  up  1
20  1   osd.20  up  1
21  1   osd.21  up  1
-5  11  host pbnerbd03
22  1   osd.22  up  0
23  1   osd.23  up  1
24  1   osd.24  up  1
25  1   osd.25  up  1
26  1   osd.26  up  1
27  1   osd.27  up  1
28  1   osd.28  up  1
29  1   osd.29  up  1
30  1   osd.30  up  1
31  1   osd.31  up  1
32  1   osd.32  up  1
-6  11  host pbnerbd04
33  1   osd.33  up  0
34  1   osd.34  up  1
35  1   osd.35  up  1
36  1   osd.36  up  1
37  1   osd.37  up  1
38  1   osd.38  up  1
39  1   osd.39  up  1
40  1   osd.40  up  1
41  1   osd.41  up  1
42  1   osd.42  up  1
43  1   osd.43  up  1
-7  11  host pbnerbd05
44  1   osd.44  up  0
45  1   osd.45  up  1
46  1   osd.46  up  1
47  1   osd.47  up  1
48  1   osd.48  up  1
49  1   osd.49  up  1
50  1   osd.50  up  1
51  1   osd.51  up  1
52  1   osd.52  up  1
53  1   osd.53  up  1
54  1   osd.54  up      1

Regards,
Quenten Grasso

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com] 
Sent: Tuesday, 27 January 2015 11:33 AM
To: ceph-users@lists.ceph.com
Cc: Quenten Grasso
Subject: Re: [ceph-users] OSD removal rebalancing again


Hello,

A "ceph -s" and "ceph osd tree" would have been nice, but my guess is that
osd.0 was the only osd on that particular storage server?

In that case the removal of the bucket (host) by removing the last OSD in it 
also triggered a re-balancing.
Not really/well documented AFAIK and annoying, but OTOH both expected (from a 
CRUSH perspective) and harmless.

Christian

On Tue, 27 Jan 2015 01:21:28 + Quenten Grasso wrote:

> Hi All,
> 
> I just removed an OSD from our cluster following the steps on 
> http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
> 
> First I set the OSD as out,
> 
> ceph osd out osd.0
> 
> This emptied the OSD and eventually health of the cluster came back to 
> normal/ok. and OSD was up and out. (took about 2-3 hours) (OSD.0 used 
> space before setting as OUT was 900~ GB after rebalance took place OSD 
> Usage was ~150MB)
> 
> Once this was all ok I then proceeded to STOP the OSD.
> 
> se

[ceph-users] OSD removal rebalancing again

2015-01-26 Thread Quenten Grasso
Hi All,

I just removed an OSD from our cluster following the steps on 
http://ceph.com/docs/master/rados/operations/add-or-rm-osds/

First I set the OSD as out,

ceph osd out osd.0

This emptied the OSD and eventually the health of the cluster came back to 
normal/OK, and the OSD was up and out (took about 2-3 hours). (osd.0 used space 
before setting it as OUT was ~900 GB; after the rebalance took place, OSD usage 
was ~150MB.)

Once this was all ok I then proceeded to STOP the OSD.

service ceph stop osd.0

checked cluster health and all looked ok, then I decided to remove the osd 
using the following commands.

ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0


Now our cluster says
health HEALTH_WARN 414 pgs backfill; 12 pgs backfilling; 19 pgs recovering; 344 
pgs recovery_wait; 789 pgs stuck unclean; recovery 390967/10986568 objects 
degraded (3.559%)

Before using the removal procedure everything was "ok" and osd.0 had been 
emptied and seemingly rebalanced.

Any ideas why it's rebalancing again?

We're using Ubuntu 12.04 w/ Ceph 0.80.8 & Kernel 3.13.0-43-generic 
#72~precise1-Ubuntu SMP Tue Dec 9 12:14:18 UTC 2014 x86_64 x86_64 x86_64 
GNU/Linux



Regards,
Quenten Grasso
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Consumer Grade SSD Clusters

2015-01-24 Thread Quenten Grasso
Hi Everyone,

Just wondering if anyone has had any experience in using consumer grade SSD's 
for a Ceph cluster?

I came across this article 
http://techreport.com/review/26523/the-ssd-endurance-experiment-casualties-on-the-way-to-a-petabyte/3

They have been testing different SSDs' write endurance and have been able to 
write 1PB+ to a Samsung 840 Pro 256GB, which is only "rated" at 150TBW; of 
course other SSDs have failed well before 1PBW. So definitely worth a read.

So I've been thinking about using consumer grade SSDs for OSDs and enterprise 
SSDs for journals.

The reasoning is that enterprise SSDs are a lot faster at journaling than 
consumer grade drives, plus this would effectively halve the overall write 
requirements on the consumer grade disks.
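
(A rough back-of-envelope on endurance, using the 150TBW rating above; the 
daily write figure is purely an assumed example:)

rated_tbw=150          # Samsung 840 Pro 256GB rating quoted above
daily_writes_gb=200    # assumed average writes landing on the data disk per day
echo "rough lifetime: $(( rated_tbw * 1000 / daily_writes_gb )) days"   # ~750 days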

This could also be a cost effective alternative to using enterprise SSDs as 
OSDs; if you're happy to use 2x replication it's a pretty good cost saving, 
however with 3x replication not so much.

Cheers,
Quenten Grasso

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] IO wait spike in VM

2014-09-29 Thread Quenten Grasso
Hi Alexandre,

No problem, I hope this saves you some pain.

It's probably worth going for a larger journal, around 20GB, if you wish to 
play with tuning "filestore max sync interval" - that could have some 
interesting results.
You probably already know this, however most of us, when starting with ceph, 
use an xfs file for the journal instead of a raw partition; using a raw 
partition removes the file system overhead on the journal.
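
(A sketch of what that tuning looks like in ceph.conf - the values are only 
examples to illustrate, not recommendations from this thread:)

[osd]
    osd journal size = 20480              # ~20GB journal
    filestore max sync interval = 30      # example value only
    filestore min sync interval = 10      # example value only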

I highly recommend looking into dedicated journals for your systems, as your 
spinning disks are going to work very hard trying to keep up with all the 
read/write seeking on these disks, particularly if you're going to be using 
them for VMs.
Also, you'll get about 1/3 of the write performance as a "best case scenario" 
using journals on the same disk, and this comes down to the disks' IOPS.

Depending on your hardware & budget you could look into one of these options 
for dedicated journals:

Intel DC P3700 400GB PCIe - good for about ~1000MB/s write (haven't tested 
these myself, however we are looking to use them in our additional nodes)
Intel DC S3700 200GB - good for about ~360MB/s write

At the time we used the Intel DC S3700 100GB; these drives don't have enough 
throughput, so I'd recommend you stay away from that particular 100GB model.

So if you have spare hard disk slots in your servers, the 200GB DC S3700 is the 
best bang for buck. Usually I run 6 spinning disks to 1 SSD; in an ideal world 
I'd like to cut this back to 4 instead of 6 when using the 200GB disks.

Both of these SSD options would do nicely and have on board capacitors and very 
high write/wear rates as well.

Cheers,
Quenten Grasso

-Original Message-
From: Bécholey Alexandre [mailto:alexandre.becho...@nagra.com] 
Sent: Monday, 29 September 2014 4:15 PM
To: Quenten Grasso; ceph-users@lists.ceph.com
Cc: Aviolat Romain
Subject: RE: [ceph-users] IO wait spike in VM

Hello Quenten,

Thanks for your reply.

We have a 5GB journal for each OSD on the same disk.

Right now, we are migrating our OSD to XFS and we'll add a 5th monitor. We will 
perform the benchmarks afterwards.

Cheers,
Alexandre

-Original Message-
From: Quenten Grasso [mailto:qgra...@onq.com.au] 
Sent: lundi 29 septembre 2014 01:56
To: Bécholey Alexandre; ceph-users@lists.ceph.com
Cc: Aviolat Romain
Subject: RE: [ceph-users] IO wait spike in VM

G'day Alexandre

I'm not sure if this is causing your issues, however it could be contributing 
to them. 

I noticed you have 4 MONs; this could be contributing to your problems, as it's 
recommended (due to the Paxos algorithm which ceph uses for achieving quorum of 
mons) to run an odd number of mons: 1, 3, 5, 7, etc. Also worth mentioning: 
running 4 mons would still only give you a possible failure of 1 mon without an 
outage.

Spec-wise the machines look pretty good; the only things I can see are the lack 
of journals and the use of btrfs at this stage.

You could try some iperf testing between the machines to make sure the 
networking is working as expected.

If you do rados benches for extended time what kind of stats do you see?

For example,

Write)
ceph osd pool create benchmark1 
ceph osd pool set benchmark1 size 3
rados bench -p benchmark1 180 write --no-cleanup --concurrent-ios=32

* I suggest you create a 2nd benchmark pool and write for another 180 seconds 
or so to ensure nothing is cached then do a read test.

Read)
rados bench -p benchmark1 180 seq --concurrent-ios=32

You can also try the same using 4k blocks

rados bench -p benchmark1 180 write -b 4096 --no-cleanup --concurrent-ios=32 
rados bench -p benchmark1 180 seq -b 4096

As you may know increasing the concurrent io's will increase cpu/disk load.

Total PG = OSD * 100 / Replicas
i.e. a 50 OSD system with 3 replicas would be around 1600

Hope this helps a little,

Cheers,
Quenten Grasso


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Bécholey Alexandre
Sent: Thursday, 25 September 2014 1:27 AM
To: ceph-users@lists.ceph.com
Cc: Aviolat Romain
Subject: [ceph-users] IO wait spike in VM

Dear Ceph guru,

We have a Ceph cluster (version 0.80.5 
38b73c67d375a2552d8ed67843c8a65c2c0feba6) with 4 MON and 16 OSDs (4 per host) 
used as a backend storage for libvirt.

Hosts:
Ubuntu 14.04
CPU: 2 Xeon X5650
RAM: 48 GB (no swap)
No SSD for journals
HDD: 4 WDC WD2003FYYS-02W0B0 (2 TB, 7200 rpm) dedicated to OSD (one partition 
for the journal, the rest for the OSD)
FS: btrfs (I know it's not recommended in the doc and I hope it's not the 
culprit)
Network: dedicated 10GbE

As we added some VMs to the cluster, we saw some sporadic huge IO wait on the 
VM. The hosts running the OSDs seem fine.
I followed a similar discussion here: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/04062

Re: [ceph-users] IO wait spike in VM

2014-09-28 Thread Quenten Grasso
G'day Alexandre

I'm not sure if this is causing your issues, however it could be contributing 
to them. 

I noticed you have 4 MONs; this could be contributing to your problems, as it's 
recommended (due to the Paxos algorithm which ceph uses for achieving quorum of 
mons) to run an odd number of mons: 1, 3, 5, 7, etc.
Also worth mentioning: running 4 mons would still only give you a possible 
failure of 1 mon without an outage.

Spec-wise the machines look pretty good; the only things I can see are the lack 
of journals and the use of btrfs at this stage.

You could try some iperf testing between the machines to make sure the 
networking is working as expected.
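
(For example - a sketch, where the target host name is just a placeholder:)

iperf -s                      # on one node
iperf -c node-b -P 4 -t 30    # on another node: 4 parallel streams for 30s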

If you do rados benches for extended time what kind of stats do you see?

For example,

Write) 
ceph osd pool create benchmark1  
ceph osd pool set benchmark1 size 3
rados bench -p benchmark1 180 write --no-cleanup --concurrent-ios=32

* I suggest you create a 2nd benchmark pool and write for another 180 seconds 
or so to ensure nothing is cached then do a read test.

Read)
rados bench -p benchmark1 180 seq --concurrent-ios=32

You can also try the same using 4k blocks

rados bench -p benchmark1 180 write -b 4096 --no-cleanup --concurrent-ios=32
rados bench -p benchmark1 180 seq -b 4096

As you may know increasing the concurrent io's will increase cpu/disk load.

Total PG = OSD * 100 / Replicas
i.e. a 50 OSD system with 3 replicas would be around 1600

Hope this helps a little,

Cheers,
Quenten Grasso


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Bécholey Alexandre
Sent: Thursday, 25 September 2014 1:27 AM
To: ceph-users@lists.ceph.com
Cc: Aviolat Romain
Subject: [ceph-users] IO wait spike in VM

Dear Ceph guru,

We have a Ceph cluster (version 0.80.5 
38b73c67d375a2552d8ed67843c8a65c2c0feba6) with 4 MON and 16 OSDs (4 per host) 
used as a backend storage for libvirt.

Hosts:
Ubuntu 14.04
CPU: 2 Xeon X5650
RAM: 48 GB (no swap)
No SSD for journals
HDD: 4 WDC WD2003FYYS-02W0B0 (2 TB, 7200 rpm) dedicated to OSD (one partition 
for the journal, the rest for the OSD)
FS: btrfs (I know it's not recommended in the doc and I hope it's not the 
culprit)
Network: dedicated 10GbE

As we added some VMs to the cluster, we saw some sporadic huge IO wait on the 
VM. The hosts running the OSDs seem fine.
I followed a similar discussion here: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040621.html

Here is an example of a transaction that took some time:

{ "description": "osd_op(client.5275.0:262936 
rbd_data.22e42ae8944a.0807 [] 3.c9699248 ack+ondisk+write e3158)",
  "received_at": "2014-09-23 15:23:30.820958",
  "age": "108.329989",
  "duration": "5.814286",
  "type_data": [
"commit sent; apply or cleanup",
{ "client": "client.5275",
  "tid": 262936},
[
{ "time": "2014-09-23 15:23:30.821097",
  "event": "waiting_for_osdmap"},
{ "time": "2014-09-23 15:23:30.821282",
  "event": "reached_pg"},
{ "time": "2014-09-23 15:23:30.821384",
  "event": "started"},
{ "time": "2014-09-23 15:23:30.821401",
  "event": "started"},
{ "time": "2014-09-23 15:23:30.821459",
  "event": "waiting for subops from 14"},
{ "time": "2014-09-23 15:23:30.821561",
  "event": "commit_queued_for_journal_write"},
{ "time": "2014-09-23 15:23:30.821666",
  "event": "write_thread_in_journal_buffer"},
{ "time": "2014-09-23 15:23:30.822591",
  "event": "op_applied"},
{ "time": "2014-09-23 15:23:30.824707",
  "event": "sub_op_applied_rec"},
{ "time": "2014-09-23 15:23:31.225157",
  "event": "journaled_completion_queued"},
{ "time": "2014-09-23 15:23:31.225297",
  "event": "op_commit"},
{ "time": "2014-09-23 15:23:36.635085",
  "event": "sub_op_commit_rec"},

Re: [ceph-users] NAS on RBD

2014-09-09 Thread Quenten Grasso

We have been using the NFS/Pacemaker/RBD method for a while; this explains it a 
bit better: http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/
PS: Thanks Sebastien.

Our use case is VMware storage. As I mentioned, we've been running it for some 
time and we've had pretty mixed results. 
Pros: when it works, it works really well!
Cons: when it doesn't... I've had a couple of instances where the XFS volumes 
needed fsck, and this took about 3 hours on a 4TB volume. (Lesson learnt: use 
smaller volumes.)
 

The ZFS RAID-Z option could be interesting but expensive: say 3 pools with 2x 
replicas, an RBD volume from each, and a RAID-Z on top of that. (I assume you 
would use 3 pools here so we don't end up with data in the same PG which may be 
corrupted.)


Currently we also use FreeNAS VMs which are backed via RBD w/ 3 replicas and 
ZFS striped volumes, with iSCSI/NFS out of these. While not really HA, it seems 
to mostly work, though FreeNAS iSCSI can get a bit cranky at times. 

For the VMs which don't quite fit into our OpenStack environment, we are moving 
towards another KVM hypervisor such as Proxmox, instead of having to use 
"RBD proxys".

Regards,
Quenten Grasso

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dan 
Van Der Ster
Sent: Wednesday, 10 September 2014 12:54 AM
To: Michal Kozanecki
Cc: ceph-users@lists.ceph.com; Blair Bethwaite
Subject: Re: [ceph-users] NAS on RBD


> On 09 Sep 2014, at 16:39, Michal Kozanecki  wrote:
> On 9 September 2014 08:47, Blair Bethwaite  wrote:
>> On 9 September 2014 20:12, Dan Van Der Ster  
>> wrote:
>>> One thing I’m not comfortable with is the idea of ZFS checking the data in 
>>> addition to Ceph. Sure, ZFS will tell us if there is a checksum error, but 
>>> without any redundancy at the ZFS layer there will be no way to correct 
>>> that error. Of course, the hope is that RADOS will ensure 100% data 
>>> consistency, but what happens if not?...
>> 
>> The ZFS checksumming would tell us if there has been any corruption, which 
>> as you've pointed out shouldn't happen anyway on top of Ceph.
> 
> Just want to quickly address this, someone correct me if I'm wrong, but IIRC 
> even with replica value of 3 or more, ceph does not(currently) have any 
> intelligence when it detects a corrupted/"incorrect" PG, it will always 
> replace/repair the PG with whatever data is in the primary, meaning that if 
> the primary PG is the one that’s corrupted/bit-rotted/"incorrect", it will 
> replace the good replicas with the bad.  

According to the the "scrub error on firefly” thread, repair "tends to choose 
the copy with the lowest osd number which is not obviously corrupted.  Even 
with three replicas, it does not do any kind of voting at this time.”

Cheers, Dan




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD journal deployment experiences

2014-09-08 Thread Quenten Grasso
This reminds me of something I was trying to find out a while back.

If we have 2000 "random" IOPS of 4K blocks, our cluster (assuming 3x replicas) 
will generate 6000 IOPS @ 4K onto the journals.

Does this mean our journals will absorb 6000 IOPS and turn these into X IOPS 
onto our spindles? 

If this is the case, is it possible to calculate how many IOPS a journal would 
"absorb" and how this would translate to X IOPS on the spindle disks?

Regards,
Quenten Grasso

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Christian Balzer
Sent: Sunday, 7 September 2014 1:38 AM
To: ceph-users
Subject: Re: [ceph-users] SSD journal deployment experiences

On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote:

> September 6 2014 4:01 PM, "Christian Balzer"  wrote: 
> > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote:
> > 
> >> Hi Christian,
> >> 
> >> Let's keep debating until a dev corrects us ;)
> > 
> > For the time being, I give the recent:
> > 
> > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html
> > 
> > And not so recent:
> > http://www.spinics.net/lists/ceph-users/msg04152.html
> > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
> > 
> > And I'm not going to use BTRFS for mainly RBD backed VM images 
> > (fragmentation city), never mind the other stability issues that 
> > crop up here ever so often.
> 
> 
> Thanks for the links... So until I learn otherwise, I better assume 
> the OSD is lost when the journal fails. Even though I haven't 
> understood exactly why :( I'm going to UTSL to understand the consistency 
> better.
> An op state diagram would help, but I didn't find one yet.
> 
Using the source as an option of last resort is always nice, having to actually 
do so for something like this feels a bit lacking in the documentation 
department (that or my google foo being weak). ^o^

> BTW, do you happen to know, _if_ we re-use an OSD after the journal 
> has failed, are any object inconsistencies going to be found by a 
> scrub/deep-scrub?
> 
No idea. 
And really a scenario I hope to never encounter. ^^;;

> >> 
> >> We have 4 servers in a 3U rack, then each of those servers is 
> >> connected to one of these enclosures with a single SAS cable.
> >> 
> >>>> With the current config, when I dd to all drives in parallel I 
> >>>> can write at 24*74MB/s = 1776MB/s.
> >>> 
> >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 
> >>> 2.0 lanes, so as far as that bus goes, it can do 4GB/s.
> >>> And given your storage pod I assume it is connected with 2 
> >>> mini-SAS cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s 
> >>> SATA bandwidth.
> >> 
> >> From above, we are only using 4 lanes -- so around 2GB/s is expected.
> > 
> > Alright, that explains that then. Any reason for not using both ports?
> > 
> 
> Probably to minimize costs, and since the single 10Gig-E is a 
> bottleneck anyway. The whole thing is suboptimal anyway, since this 
> hardware was not purchased for Ceph to begin with. Hence retrofitting SSDs, 
> etc...
>
The single 10Gb/s link is the bottleneck for sustained stuff, but when looking 
at spikes...
Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port might 
also get some loving. ^o^

The cluster I'm currently building is based on storage nodes with 4 SSDs (100GB 
DC 3700s, so 800MB/s would be the absolute write speed limit) and 8 HDDs. 
Connected with 40Gb/s Infiniband. Dual port, dual switch for redundancy, not 
speed. ^^ 
 
> >>> Impressive, even given your huge cluster with 1128 OSDs.
> >>> However that's not really answering my question, how much data is 
> >>> on an average OSD and thus gets backfilled in that hour?
> >> 
> >> That's true -- our drives have around 300TB on them. So I guess it 
> >> will take longer - 3x longer - when the drives are 1TB full.
> > 
> > On your slides, when the crazy user filled the cluster with 250 
> > million objects and thus 1PB of data, I recall seeing a 7 hour backfill 
> > time?
> > 
> 
> Yeah that was fun :) It was 250 million (mostly) 4k objects, so not 
> close to 1PB. The point was that to fill the cluster with RBD, we'd 
> need
> 250 million (4MB) objects. So, object-count-wise this was a full 
> cluster, but for the real volume it was more like 70TB IIRC (there 
> were some other larger objects too).
> 
Ah, I see. ^^

> In that case, the backfilling was CPU-boun

Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time

2014-07-16 Thread Quenten Grasso
Hi Sage & List

I understand this is probably a hard question to answer.

I mentioned previously that our cluster has MONs co-located on the OSD servers, 
which are R515's w/ 1 x AMD 6-core processor & 11 x 3TB OSDs w/ dual 10GbE.

When our cluster is doing these busy operations and IO has stopped, as in my 
case (I mentioned earlier setting tunables to optimal, or heavy recovery 
operations), is there a way to ensure our IO doesn't get completely 
blocked/stopped/frozen in our VMs?
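
(The usual knobs here are the recovery/backfill throttles that already appear 
in our ceph.conf; a sketch of turning them right down at runtime - the values 
are only an example, not a recommendation from this thread:)

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'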

Could it be as simple as putting all 3 of our mon servers on baremetal  
w/ssd’s? (I recall reading somewhere that a mon disk was doing several thousand 
IOPS during a recovery operation)

I assume putting just one on baremetal won’t help because our mon’s will only 
ever be as fast as our slowest mon server?

Thanks,
Quenten
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time

2014-07-16 Thread Quenten Grasso
Hi Sage, Andrija & List

I have seen the tunables issue on our cluster when I upgraded to firefly.

I ended up going back to the legacy settings after about an hour: my cluster is 
55 x 3TB OSDs over 5 nodes, and it decided it needed to move around 32% of our 
data. After an hour all of our VMs were frozen, and I had to revert the change 
back to the legacy settings and wait about the same time again until our 
cluster had recovered, then reboot our VMs. (Wasn't really expecting that one 
from the patch notes.)
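
(For anyone following along, the change and the revert were simply these 
commands - shown here as a sketch:)

ceph osd crush tunables optimal   # kicked off the ~32% data movement
ceph osd crush tunables legacy    # revert; triggers movement back again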

Also our CPU usage went through the roof on our nodes. Do you by chance have 
your metadata servers co-located on your OSD nodes as we do? I've been thinking 
about trying to move these to dedicated nodes as it may resolve our issues.

Regards,
Quenten

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Andrija Panic
Sent: Tuesday, 15 July 2014 8:38 PM
To: Sage Weil
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at 
the same time

Hi Sage,

since this problem is tunables-related, do we need to expect same behavior or 
not  when we do regular data rebalancing caused by adding new/removing OSD? I 
guess not, but would like your confirmation.
I'm already on optimal tunables, but I'm afraid to test this by, e.g., shutting 
down 1 OSD.

Thanks,
Andrija

On 14 July 2014 18:18, Sage Weil <sw...@redhat.com> wrote:
I've added some additional notes/warnings to the upgrade and release
notes:

 https://github.com/ceph/ceph/commit/fc597e5e3473d7db6548405ce347ca7732832451

If there is somewhere else where you think a warning flag would be useful,
let me know!

Generally speaking, we want to be able to cope with huge data rebalances
without interrupting service.  It's an ongoing process of improving the
recovery vs client prioritization, though, and removing sources of
overhead related to rebalancing... and it's clearly not perfect yet. :/

sage


On Sun, 13 Jul 2014, Andrija Panic wrote:

> Hi,
> after seting ceph upgrade (0.72.2 to 0.80.3) I have issued "ceph osd crush
> tunables optimal" and after only few minutes I have added 2 more OSDs to the
> CEPH cluster...
>
> So these 2 changes were more or a less done at the same time - rebalancing
> because of tunables optimal, and rebalancing because of adding new OSD...
>
> Result - all VMs living on CEPH storage have gone mad, no disk access
> efectively, blocked so to speak.
>
> Since this rebalancing took 5h-6h, I had bunch of VMs down for that long...
>
> Did I do wrong by causing "2 rebalancing" to happen at the same time ?
> Is this behaviour normal, to cause great load on all VMs because they are
> unable to access CEPH storage efectively ?
>
> Thanks for any input...
> --
>
> Andrija Panić
>
>



--

Andrija Panić
--
  http://admintweets.com
--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Firefly Upgrade

2014-07-14 Thread Quenten Grasso
Hi All,

Just a quick question for the list: has anyone seen a significant increase in 
RAM usage since firefly? I upgraded from 0.72.2 to 0.80.3 and now all of my 
Ceph servers are using about double the RAM they used to.

The only other significant change to our setup was an upgrade to kernel 
3.13.0-30-generic #55~precise1-Ubuntu SMP.

Any ideas?
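
(One way to see where the memory is going, assuming the daemons are built 
against tcmalloc - a sketch:)

ceph tell osd.0 heap stats      # per-daemon heap statistics
ceph tell osd.0 heap release    # ask tcmalloc to return freed memory to the OS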

Regards,
Quenten
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Restarts cause excessively high load average and "requests are blocked > 32 sec"

2014-03-31 Thread Quenten Grasso
Thanks Greg,

Looking forward to the new release!

Regards,
Quenten Grasso

-Original Message-
From: Gregory Farnum [mailto:g...@inktank.com] 
Sent: Tuesday, 1 April 2014 3:08 AM
To: Quenten Grasso
Cc: Kyle Bader; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OSD Restarts cause excessively high load average and 
"requests are blocked > 32 sec"

Yep, that looks like http://tracker.ceph.com/issues/7093, which is fixed in 
dumpling and most of the dev releases since emperor. ;) I also cherry-picked 
the fix to the emperor branch and it will be included whenever we do another 
point release of that.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Mar 25, 2014 at 6:39 PM, Quenten Grasso  wrote:
> Hi Greg,
>
> Restarting the actual service ie: service ceph restart osd.50, only takes a 
> few seconds.
>
> Attached is a ceph -w of just running a service ceph restart osd.50,
>
> You can see it marks itself down pretty much straight away. Takes a little 
> while to mark itself as up and finish "recovery"
>
> If I do this to all 12 osd's the node goes crazy, It's almost like the 
> node is cpu bound but it has 6 cores, and load average goes to 300+
>
> http://pastie.org/pastes/8968950/text?key=0e0bs1ojbm2arnexn52iwq
>
> Regards,
> Quenten
>
> -Original Message-
> From: Gregory Farnum [mailto:g...@inktank.com]
> Sent: Wednesday, 26 March 2014 2:02 AM
> To: Quenten Grasso
> Cc: Kyle Bader; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD Restarts cause excessively high load average 
> and "requests are blocked > 32 sec"
>
> How long does it take for the OSDs to restart? Are you just issuing a restart 
> command via upstart/sysvinit/whatever? How many OSDMaps are generated from 
> the time you issue that command to the time the cluster is healthy again?
>
> This sounds like an issue we had for a while where OSDs would start peering 
> before they had processed the maps they needed to look at; the fix might not 
> have been backported to Emperor. But I'd like to be sure this isn't some 
> other issue you're seeing.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Sat, Mar 22, 2014 at 8:16 PM, Quenten Grasso  wrote:
>> Hi Kyle,
>>
>> Thanks, I turned on debug ms = 1 and debug osd = 10 and restarted osd.54 
>> heres here's log for that one.
>>
>> ceph-osd.54.log.bz2
>> http://www67.zippyshare.com/v/99704627/file.html
>>
>>
>> Strace osd 53,
>> strace.zip
>> http://www43.zippyshare.com/v/17581165/file.html
>>
>>
>> Thanks,
>> Quenten
>> -Original Message-
>> From: Kyle Bader [mailto:kyle.ba...@gmail.com]
>> Sent: Sunday, 23 March 2014 12:10 PM
>> To: Quenten Grasso
>> Subject: Re: [ceph-users] OSD Restarts cause excessively high load average 
>> and "requests are blocked > 32 sec"
>>
>>> Any ideas on why the load average goes so crazy & starts to block IO?
>>
>> Could you turn on "debug ms = 1" and "debug osd = 10" prior to restarting 
>> the OSDs on one of your hosts and sharing the logs so we can take a look?
>>
>> It also might be worth while to strace one of the OSDs to try to determine 
>> what it's working so hard on, maybe:
>>
>> strace -fc -p   > strace.osd1.log
>>
>> Thanks!
>>
>> --
>>
>> Kyle
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Restarts cause excessively high load average and "requests are blocked > 32 sec"

2014-03-25 Thread Quenten Grasso
Hi Greg,

Restarting the actual service ie: service ceph restart osd.50, only takes a few 
seconds.

Attached is a ceph -w of just running a service ceph restart osd.50, 

You can see it marks itself down pretty much straight away. Takes a little 
while to mark itself as up and finish "recovery"

If I do this to all 12 OSDs the node goes crazy; it's almost like the node is 
CPU bound, but it has 6 cores, and load average goes to 300+.

http://pastie.org/pastes/8968950/text?key=0e0bs1ojbm2arnexn52iwq

Regards,
Quenten

-Original Message-
From: Gregory Farnum [mailto:g...@inktank.com] 
Sent: Wednesday, 26 March 2014 2:02 AM
To: Quenten Grasso
Cc: Kyle Bader; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OSD Restarts cause excessively high load average and 
"requests are blocked > 32 sec"

How long does it take for the OSDs to restart? Are you just issuing a restart 
command via upstart/sysvinit/whatever? How many OSDMaps are generated from the 
time you issue that command to the time the cluster is healthy again?

This sounds like an issue we had for a while where OSDs would start peering 
before they had processed the maps they needed to look at; the fix might not 
have been backported to Emperor. But I'd like to be sure this isn't some other 
issue you're seeing.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sat, Mar 22, 2014 at 8:16 PM, Quenten Grasso  wrote:
> Hi Kyle,
>
> Thanks, I turned on debug ms = 1 and debug osd = 10 and restarted osd.54 
> heres here's log for that one.
>
> ceph-osd.54.log.bz2
> http://www67.zippyshare.com/v/99704627/file.html
>
>
> Strace osd 53,
> strace.zip
> http://www43.zippyshare.com/v/17581165/file.html
>
>
> Thanks,
> Quenten
> -Original Message-----
> From: Kyle Bader [mailto:kyle.ba...@gmail.com]
> Sent: Sunday, 23 March 2014 12:10 PM
> To: Quenten Grasso
> Subject: Re: [ceph-users] OSD Restarts cause excessively high load average 
> and "requests are blocked > 32 sec"
>
>> Any ideas on why the load average goes so crazy & starts to block IO?
>
> Could you turn on "debug ms = 1" and "debug osd = 10" prior to restarting the 
> OSDs on one of your hosts and sharing the logs so we can take a look?
>
> It also might be worth while to strace one of the OSDs to try to determine 
> what it's working so hard on, maybe:
>
> strace -fc -p   > strace.osd1.log
>
> Thanks!
>
> --
>
> Kyle
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Restarts cause excessively high load average and "requests are blocked > 32 sec"

2014-03-22 Thread Quenten Grasso
Hi Kyle,

Thanks, I turned on debug ms = 1 and debug osd = 10 and restarted osd.54; 
here's the log for that one.

ceph-osd.54.log.bz2
http://www67.zippyshare.com/v/99704627/file.html


Strace osd 53,
strace.zip
http://www43.zippyshare.com/v/17581165/file.html
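
(For reference, the debug settings above were applied roughly like this - 
either via ceph.conf plus a daemon restart, or injected at runtime; a sketch:)

[osd]
    debug ms = 1
    debug osd = 10

ceph tell osd.54 injectargs '--debug-ms 1 --debug-osd 10'   # runtime alternative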


Thanks,
Quenten
-Original Message-
From: Kyle Bader [mailto:kyle.ba...@gmail.com] 
Sent: Sunday, 23 March 2014 12:10 PM
To: Quenten Grasso
Subject: Re: [ceph-users] OSD Restarts cause excessively high load average and 
"requests are blocked > 32 sec"

> Any ideas on why the load average goes so crazy & starts to block IO?

Could you turn on "debug ms = 1" and "debug osd = 10" prior to restarting the 
OSDs on one of your hosts and sharing the logs so we can take a look?

It also might be worth while to strace one of the OSDs to try to determine what 
it's working so hard on, maybe:

strace -fc -p   > strace.osd1.log

Thanks!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Restarts cause excessively high load average and "requests are blocked > 32 sec"

2014-03-20 Thread Quenten Grasso
Hi All,

I left out my OS/kernel version, Ubuntu 12.04.4 LTS w/ Kernel 
3.10.33-031033-generic (We upgrade our kernels to 3.10 due to Dell Drivers).

Here's an example of starting all the OSD's after a reboot.

top - 09:10:51 up 2 min,  1 user,  load average: 332.93, 112.28, 39.96
Tasks: 310 total,   1 running, 309 sleeping,   0 stopped,   0 zombie
Cpu(s): 50.3%us, 32.5%sy,  0.0%ni,  0.0%id,  0.0%wa, 17.2%hi,  0.0%si,  0.0%st
Mem:  32917276k total,  6331224k used, 26586052k free, 1332k buffers
Swap: 33496060k total,0k used, 33496060k free,  1474084k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15875 root  20   0  910m 381m  50m S   60  1.2   0:50.57 ceph-osd
2996 root  20   0  867m 330m  44m S   59  1.0   0:58.32 ceph-osd
4502 root  20   0  907m 372m  47m S   58  1.2   0:55.14 ceph-osd
12465 root  20   0  949m 418m  55m S   58  1.3   0:51.79 ceph-osd
4171 root  20   0  886m 348m  45m S   57  1.1   0:56.17 ceph-osd
3707 root  20   0  941m 405m  50m S   57  1.3   0:59.68 ceph-osd
3560 root  20   0  924m 394m  51m S   56  1.2   0:59.37 ceph-osd
4318 root  20   0  965m 435m  55m S   56  1.4   0:54.80 ceph-osd
3337 root  20   0  935m 407m  51m S   56  1.3   1:01.96 ceph-osd
3854 root  20   0  897m 366m  48m S   55  1.1   1:00.55 ceph-osd
3143 root  20   0 1364m 424m  24m S   16  1.3   1:08.72 ceph-osd
2509 root  20   0  652m 261m  62m S2  0.8   0:26.42 ceph-mon
4 root  20   0 000 S0  0.0   0:00.08 kworker/0:0

Regards,
Quenten Grasso

From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Quenten Grasso
Sent: Tuesday, 18 March 2014 10:19 PM
To: 'ceph-users@lists.ceph.com'
Subject: [ceph-users] OSD Restarts cause excessively high load average and 
"requests are blocked > 32 sec"

Hi All,

I'm trying to troubleshoot a strange issue with my Ceph cluster.

We're Running Ceph Version 0.72.2
All Nodes are Dell R515's w/ 6C AMD CPU w/ 32GB Ram, 12 x 3TB NearlineSAS 
Drives and 2 x 100GB Intel DC S3700 SSD's for Journals.
All Pools have a replica of 2 or better. I.e. metadata replica of 3.

I have 55 OSD's in the cluster across 5 nodes. When I restart the OSD's on a 
single node (any node) the load average of that node shoots up to 230+ and the 
whole cluster starts blocking IO requests until it settles down and its fine 
again.

Any ideas on why the load average goes so crazy & starts to block IO?



[osd]
osd data = /var/ceph/osd.$id
osd journal size = 15000
osd mkfs type = xfs
osd mkfs options xfs = "-i size=2048 -f"
osd mount options xfs = 
"rw,noexec,nodev,noatime,nodiratime,barrier=0,inode64,logbufs=8,logbsize=256k"
osd max backfills = 5
osd recovery max active = 3

[osd.0]
host = pbnerbd01
public addr = 10.100.96.10
cluster addr = 10.100.128.10
osd journal = 
/dev/disk/by-id/scsi-36b8ca3a0eaa2660019deaf8d3a40bec4-part1
devs = /dev/sda4


Thanks,
Quenten

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD Restarts cause excessively high load average and "requests are blocked > 32 sec"

2014-03-18 Thread Quenten Grasso
Hi All,

I'm trying to troubleshoot a strange issue with my Ceph cluster.

We're Running Ceph Version 0.72.2
All Nodes are Dell R515's w/ 6C AMD CPU w/ 32GB Ram, 12 x 3TB NearlineSAS 
Drives and 2 x 100GB Intel DC S3700 SSD's for Journals.
All Pools have a replica of 2 or better. I.e. metadata replica of 3.

I have 55 OSD's in the cluster across 5 nodes. When I restart the OSD's on a 
single node (any node) the load average of that node shoots up to 230+ and the 
whole cluster starts blocking IO requests until it settles down and it's fine 
again.

Any ideas on why the load average goes so crazy & starts to block IO?



[osd]
osd data = /var/ceph/osd.$id
osd journal size = 15000
osd mkfs type = xfs
osd mkfs options xfs = "-i size=2048 -f"
osd mount options xfs = 
"rw,noexec,nodev,noatime,nodiratime,barrier=0,inode64,logbufs=8,logbsize=256k"
osd max backfills = 5
osd recovery max active = 3

[osd.0]
host = pbnerbd01
public addr = 10.100.96.10
cluster addr = 10.100.128.10
osd journal = 
/dev/disk/by-id/scsi-36b8ca3a0eaa2660019deaf8d3a40bec4-part1
devs = /dev/sda4


Thanks,
Quenten

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw public url

2013-12-20 Thread Quenten Grasso
Hi All,

Does radosgw support a "public URL" for static content?

I wish to share a "file" publicly without giving out usernames/passwords etc.

I noticed that http://ceph.com/docs/master/radosgw/swift/ says static websites 
aren't supported, which I assume is talking about this feature; I'm just not 
100% sure.
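
(What I'm after is roughly what a public-read container ACL gives you with the 
Swift API - a sketch assuming the standard swift client and a container called 
"public"; whether radosgw honours this is exactly the question:)

swift post -r '.r:*' public        # make the container world-readable
swift upload public myfile.pdf
# the object would then be fetched unauthenticated via the radosgw Swift
# endpoint, e.g. http://<rgw-host>/swift/v1/public/myfile.pdf (path depends
# on the rgw configuration)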

Cheers,
Quenten
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] issues with 'https://ceph.com/git/?p=ceph.git; a=blob_plain; f=keys/release.asc'

2013-09-29 Thread Quenten Grasso
Hey Guys,

Looks like 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' 
is down.

Regards,
Quenten Grasso
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph write performance and my Dell R515's

2013-09-25 Thread Quenten Grasso
G'day Mark,

I stumbled across an older thread, which it looks like you were involved with, 
about CentOS and poor seq write performance on the R515's.

Were you using CentOS or Ubuntu on your server at the time? (I'm wondering if 
this could be related to Ubuntu.)

http://marc.info/?t=13481911702&r=1&w=2

Also, I tried as you suggested to put the raid controller into JBOD mode, but 
no joy. I also tried cross flashing the card, as it's apparently a 9260, but we 
don't have any spare slots outside of the storage slot which the raid 
controller cables can reach, so that was a non-event :(

If you want to give it a try, you'll need access to longer cables and/or 
another server that you can put the Perc H700 into.

I downloaded this flashing kit from here (it has all of the tools), grabbed a 
FreeDOS USB stick and copied it all onto that.

http://forums.laptopvideo2go.com/topic/29166-sas2108-lsi-9260-based-firmware-files/

Then grabbed the latest 9260 firmware from,

http://www.lsi.com/downloads/Public/MegaRAID%20Common%20Files/12.13.0-0154_SAS_2108_Fw_Image_APP2.130.383-2315.zip


*** Steps to Cross Flash ***
 Disclaimer you do this at your own risk, I take no responsibility if you 
brick your card, Warranty, etc 

In a Dell R515, if you write the SBR of an LSI card (i.e. the 9260) and reboot 
the system, the system will be halted as it's now a non-Dell card in the 
storage slot.
However, if you attempt to flash the LSI firmware onto the Perc H700 without 
the correct SBR, it won't flash correctly, it seems.

So if you have longer cables and/or another server that's not a Dell to try the 
H700 in, you can try and cross flash the card.
(FYI, if you're trying to do this in a Dell and you fudge up, you can recover 
your system/raid card by plugging it into another PCI-e slot and reapplying the 
Dell H700 SBR/firmware.)

Now I'll assume you have one raid controller in your system so you only have 
adapter 0

1) Backup your SBR in case you need to restore it ie:

Megarec -readsbr 0 prch700.sbr

2) Write the SBR of the card you want to flash ie:

megarec -writesbr 0 sbr9260.bin

3) Erase the raid controller bios/firmware
Megarec -cleanflash 0

4) Reboot

5) flash new firmware
Megarec -m0flash 0 mr2108fw.rom

6) Reboot & Done.

Also if your command errors out half way through flashing/erasing run it again.

Regards,
Quenten Grasso

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Sunday, 22 September 2013 10:40 PM
Cc: ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] Ceph write performance and my Dell R515's

On 09/22/2013 03:12 AM, Quenten Grasso wrote:
>
> Hi All,
>
> I'm finding my write performance is less than I would have expected. 
> After spending some considerable amount of time testing several 
> different configurations I can never seems to break over ~360mb/s 
> write even when using tmpfs for journaling.
>
> So I've purchased 3x Dell R515's with 1 x AMD 6C CPU with 12 x 3TB SAS 
> & 2 x 100GB Intel DC S3700 SSD's & 32GB Ram with the Perc H710p Raid 
> controller and Dual Port 10GBE Network Cards.
>
> So first up I realise the SSD's were a mistake, I should have bought 
> the 200GB Ones as they have considerably better write though put of
> ~375 Mb/s vs 200 Mb/s
>
> So to our Nodes Configuration,
>
> 2 x 3TB disks in Raid1 for OS/MON & 1 partition for OSD, 12 Disks in a 
> Single each in a Raid0 (like a JBOD Fashion) with a 1MB Stripe size,
>
> (Stripe size this part was particularly important because I found the 
> stripe size matters considerably even on a single disk raid0. contrary 
> to what you might read on the internet)
>
> Also each disk is configured with (write back cache) is enabled and 
> (read head) disabled.
>
> For Networking, All nodes are connected via LACP bond with L3 hashing 
> and using iperf I can get up to 16gbit/s tx and rx between the nodes.
>
> OS: Ubuntu 12.04.3 LTS w/ Kernel 3.10.12-031012-generic (had to 
> upgrade kernel due to 10Gbit Intel NIC's driver issues)
>
> So this gives me 11 OSD's & 2 SSD's Per Node.
>

I'm a bit leery about that 1 OSD on the RAID1. It may be fine, but you 
definitely will want to do some investigation to make sure that OSD isn't 
holding the other ones back. iostat or collectl might be useful, along with the 
ceph osd admin socket and the dump_ops_in_flight and dump_historic_ops commands.
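
(The admin socket commands mentioned above, for reference - a sketch, with the 
socket path assumed to be the default:)

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops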

> Next I've tried several different configurations which I'll briefly 
> describe 2 of which below,
>
> 1)Cluster Configuration 1,
>
> 33 OSD's with 6x SSD's as Journals, w/ 15GB Journals on SSD.
>
> # ceph osd pool create benchmark1 1800 1800
>
> # rados bench -p benchmark1 180 write --no-cleanup
>
> 

[ceph-users] Ceph write performance and my Dell R515's

2013-09-22 Thread Quenten Grasso
Hi All,

I'm finding my write performance is less than I would have expected. After 
spending a considerable amount of time testing several different 
configurations, I can never seem to break over ~360MB/s write, even when using 
tmpfs for journaling.

So I've purchased 3x Dell R515's with 1 x AMD 6C CPU with 12 x 3TB SAS & 2 x 
100GB Intel DC S3700 SSD's & 32GB Ram with the Perc H710p Raid controller and 
Dual Port 10GBE Network Cards.

So first up, I realise the SSDs were a mistake; I should have bought the 200GB 
ones as they have considerably better write throughput: ~375 MB/s vs 200 MB/s.

So to our node configuration:
2 x 3TB disks in RAID1 for OS/MON & 1 partition for an OSD, and 12 disks each 
in a single-disk RAID0 (in a JBOD fashion) with a 1MB stripe size.
(The stripe size was particularly important, because I found the stripe size 
matters considerably even on a single-disk RAID0, contrary to what you might 
read on the internet.)
Also, each disk is configured with write-back cache enabled and read-ahead 
disabled.

For networking, all nodes are connected via an LACP bond with L3 hashing, and 
using iperf I can get up to 16Gbit/s tx and rx between the nodes.

OS: Ubuntu 12.04.3 LTS w/ Kernel 3.10.12-031012-generic (had to upgrade kernel 
due to 10Gbit Intel NIC's driver issues)

So this gives me 11 OSD's & 2 SSD's Per Node.

Next I tried several different configurations, 2 of which I'll briefly describe 
below.


1)  Cluster Configuration 1,

33 OSD's with 6x SSD's as Journals,  w/ 15GB Journals on SSD.


# ceph osd pool create benchmark1 1800 1800

# rados bench -p benchmark1 180 write --no-cleanup

--

Maintaining 16 concurrent writes of 4194304 bytes for up to 180 seconds or 0 
objects



Total time run: 180.250417

Total writes made:  10152

Write size: 4194304

Bandwidth (MB/sec): 225.287



Stddev Bandwidth:   35.0897

Max bandwidth (MB/sec): 312

Min bandwidth (MB/sec): 0

Average Latency:0.284054

Stddev Latency: 0.199075

Max latency:1.46791

Min latency:0.038512

--



# rados bench -p benchmark1 180 seq



-

Total time run:43.782554

Total reads made: 10120

Read size:4194304

Bandwidth (MB/sec):924.569



Average Latency:   0.0691903

Max latency:   0.262542

Min latency:   0.015756

-



In this configuration I found my write performance suffers a lot; the SSDs seem 
to be the bottleneck, and my write performance using rados bench was around 
224-230MB/s.



2)  Cluster Configuration 2,

33 OSD's with 1Gbyte Journals on tmpfs.


# ceph osd pool create benchmark1 1800 1800

# rados bench -p benchmark1 180 write --no-cleanup

--

Maintaining 16 concurrent writes of 4194304 bytes for up to 180 seconds or 0 
objects



Total time run: 180.044669

Total writes made:  15328

Write size: 4194304

Bandwidth (MB/sec): 340.538



Stddev Bandwidth:   26.6096

Max bandwidth (MB/sec): 380

Min bandwidth (MB/sec): 0

Average Latency:0.187916

Stddev Latency: 0.0102989

Max latency:0.336581

Min latency:0.034475

--



# rados bench -p benchmark1 180 seq



-

Total time run:76.481303

Total reads made: 15328

Read size:4194304

Bandwidth (MB/sec):801.660



Average Latency:   0.079814

Max latency:   0.317827

Min latency:   0.016857

-



Now it seems there is no bottleneck for journaling as we are using tmpfs; 
however, the write speed is still less than what I would expect, and the SAS 
disks are barely busy according to iostat.



So I thought it might be a disk bus throughput issue.



Next I completed some dd tests...



The commands below are in a script, dd-x.sh, which executes the 11 readers or 
writers at once.



dd if=/dev/zero of=/srv/ceph/osd.0/ddfile bs=32k count=100k oflag=direct &

dd if=/dev/zero of=/srv/ceph/osd.1/ddfile bs=32k count=100k oflag=direct &

dd if=/dev/zero of=/srv/ceph/osd.2/ddfile bs=32k count=100k oflag=direct &

dd if=/dev/zero of=/srv/ceph/osd.3/ddfile bs=32k count=100k oflag=direct &

dd if=/dev/zero of=/srv/ceph/osd.4/ddfile bs=32k count=100k oflag=direct &

dd if=/dev/zero of=/srv/ceph/osd.5/ddfile bs=32k count=100k oflag=direct &

dd if=/dev/zero of=/srv/ceph/osd.6/ddfile bs=32k count=100k oflag=direct &

dd if=/dev/zero of=/srv/ceph/osd.7/ddfile bs=32k count=100k oflag=direct &

dd if=/dev/zero of=/srv/ceph/osd.8/ddfile bs=32k count=100k oflag=direct &

dd if=/dev/zero of=/srv/ceph/osd.9/ddfile bs=32k count=100k oflag=direct &

dd if=/dev/zero of=/srv/