[ceph-users] Failed to read JournalPointer - MDS error (mds rank 0 is damaged)

2017-04-29 Thread Martin B Nielsen
Hi,

We're using ceph 10.2.5 and cephfs.

We had a weird issue where one monitor node (mon0r0), which was also the
currently active MDS node, had some sort of meltdown.

The monitor node called elections on and off over ~1 hour, sometimes with
5-10 min between them.

On every occasion the MDS also went through replay, reconnect, rejoin => active
(it never switched to a standby MDS).

Then, after roughly an hour of it mostly working, it gave up with:

[ ... ]
2017-04-29 07:30:24.444980 7fe6d7e9c700  0 mds.beacon.mon0r0
handle_mds_beacon no longer laggy
2017-04-29 07:30:46.783817 7fe6d7e9c700  0 monclient: hunting for new mon
*< bunch of errors like this >*
2017-04-29 07:31:11.782049 7fe6d7e9c700  1 mds.mon0r0 handle_mds_map i (
172.16.130.10:6811/8235) dne in the mdsmap, respawning myself
2017-04-29 07:31:11.782054 7fe6d7e9c700  1 mds.mon0r0 respawn
2017-04-29 07:31:11.782056 7fe6d7e9c700  1 mds.mon0r0  e:
'/usr/bin/ceph-mds'
2017-04-29 07:31:11.782058 7fe6d7e9c700  1 mds.mon0r0  0:
'/usr/bin/ceph-mds'
2017-04-29 07:31:11.782060 7fe6d7e9c700  1 mds.mon0r0  1: '--cluster=ceph'
2017-04-29 07:31:11.782071 7fe6d7e9c700  1 mds.mon0r0  2: '-i'
2017-04-29 07:31:11.782072 7fe6d7e9c700  1 mds.mon0r0  3: 'mon0r0'
2017-04-29 07:31:11.782073 7fe6d7e9c700  1 mds.mon0r0  4: '-f'
2017-04-29 07:31:11.782074 7fe6d7e9c700  1 mds.mon0r0  5: '--setuser'
2017-04-29 07:31:11.782075 7fe6d7e9c700  1 mds.mon0r0  6: 'ceph'
2017-04-29 07:31:11.782076 7fe6d7e9c700  1 mds.mon0r0  7: '--setgroup'
2017-04-29 07:31:11.782077 7fe6d7e9c700  1 mds.mon0r0  8: 'ceph'
2017-04-29 07:31:11.782106 7fe6d7e9c700  1 mds.mon0r0  exe_path
/usr/bin/ceph-mds
2017-04-29 07:31:11.799625 7f5487a92180  0 ceph version 10.2.5
(c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-mds, pid 8235
2017-04-29 07:31:11.800097 7f5487a92180  0 pidfile_write: ignore empty
--pid-file
2017-04-29 07:31:12.746033 7f5481a40700  1 mds.mon0r0 handle_mds_map standby
2017-04-29 07:32:01.941948 7f5481a40700  0 monclient: hunting for new mon
2017-04-29 07:32:48.186313 7f5481a40700  1 mds.mon0r0 handle_mds_map standby
2017-04-29 07:33:04.539413 7f5481a40700  0 monclient: hunting for new mon
2017-04-29 07:33:09.560848 7f5481a40700  1 mds.0.764 handle_mds_map i am
now mds.0.764
2017-04-29 07:33:09.560857 7f5481a40700  1 mds.0.764 handle_mds_map state
change up:boot --> up:replay
2017-04-29 07:33:09.560879 7f5481a40700  1 mds.0.764 replay_start
2017-04-29 07:33:09.560882 7f5481a40700  1 mds.0.764  recovery set is
2017-04-29 07:33:09.560890 7f5481a40700  1 mds.0.764  waiting for osdmap
17134 (which blacklists prior instance)
2017-04-29 07:33:09.571120 7f547c733700 -1 log_channel(cluster) log [ERR] :
failed to read JournalPointer: -108 ((108) Cannot send after transport
endpoint shutdown)
2017-04-29 07:33:09.575176 7f547c733700  1 mds.mon0r0 respawn
2017-04-29 07:33:09.575185 7f547c733700  1 mds.mon0r0  e:
'/usr/bin/ceph-mds'
2017-04-29 07:33:09.575187 7f547c733700  1 mds.mon0r0  0:
'/usr/bin/ceph-mds'
2017-04-29 07:33:09.575189 7f547c733700  1 mds.mon0r0  1: '--cluster=ceph'
2017-04-29 07:33:09.575191 7f547c733700  1 mds.mon0r0  2: '-i'
2017-04-29 07:33:09.575192 7f547c733700  1 mds.mon0r0  3: 'mon0r0'
2017-04-29 07:33:09.575193 7f547c733700  1 mds.mon0r0  4: '-f'
2017-04-29 07:33:09.575194 7f547c733700  1 mds.mon0r0  5: '--setuser'
2017-04-29 07:33:09.575195 7f547c733700  1 mds.mon0r0  6: 'ceph'
2017-04-29 07:33:09.575196 7f547c733700  1 mds.mon0r0  7: '--setgroup'
2017-04-29 07:33:09.575197 7f547c733700  1 mds.mon0r0  8: 'ceph'
2017-04-29 07:33:09.575221 7f547c733700  1 mds.mon0r0  exe_path
/usr/bin/ceph-mds
2017-04-29 07:33:09.589993 7f9a9d0d1180  0 ceph version 10.2.5
(c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-mds, pid 8235
2017-04-29 07:33:09.590461 7f9a9d0d1180  0 pidfile_write: ignore empty
--pid-file
2017-04-29 07:33:10.567466 7f9a9707f700  1 mds.mon0r0 handle_mds_map standby
2017-04-29 07:34:46.972551 7f9a9707f700  0 monclient: hunting for new mon
2017-04-29 07:34:50.583321 7f9a9707f700  1 mds.mon0r0 handle_mds_map standby
2017-04-29 07:35:24.575818 7f9a9707f700  0 monclient: hunting for new mon
2017-04-29 07:36:31.988193 7f9a9707f700  0 monclient: hunting for new mon
2017-04-29 07:38:06.999197 7f9a9707f700  0 monclient: hunting for new mon
2017-04-29 07:39:12.009821 7f9a9707f700  0 monclient: hunting for new mon
2017-04-29 07:39:21.855605 7f9a9707f700  1 mds.mon0r0 handle_mds_map standby
2017-04-29 07:41:39.994418 7f9a9707f700  0 monclient: hunting for new mon
*< Continues like the above until the mds was restarted ~1 h later >*
[ ... ]
2017-04-29 08:49:22.204803 7f9a9300 -1 mds.mon0r0 *** got signal
Terminated ***
2017-04-29 08:49:22.204821 7f9a9300  1 mds.mon0r0 suicide.  wanted
state up:standby
2017-04-29 09:00:31.510392 7ff9acd5e180  0 set uid:gid to 64045:64045
(ceph:ceph)
2017-04-29 09:00:31.510412 7ff9acd5e180  0 ceph version 10.2.5
(c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-mds, pid 23804
2017-04-29 09:00:31.510853 7ff9acd5e180  0 pidfile_write: ignore empty
--pid-file
2017-04-29 

Re: [ceph-users] Find out the location of OSD Journal

2015-05-07 Thread Martin B Nielsen
Hi,

Inside your mounted OSD there is a symlink - journal - pointing to the file
or disk/partition used for it.
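
If you use the default mount locations, something like this (just a sketch -
adjust paths/ids to your setup) will list it for every OSD on a node and
resolve where it actually points:

# ls -l /var/lib/ceph/osd/ceph-*/journal
# readlink -f /var/lib/ceph/osd/ceph-0/journal     (resolves e.g. by-partuuid links to the real device)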

Cheers,
Martin

On Thu, May 7, 2015 at 11:06 AM, Patrik Plank pat...@plank.me wrote:

  Hi,


 I can't remember on which drive I installed which OSD journal :-||
 Is there any command to show this?


 thanks
 regards



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel

2015-02-28 Thread Martin B Nielsen
Hi Andrei,

If there is one thing I've come to understand by now, it is that ceph configs,
performance, hw and, well, everything seems to vary on an almost per-person
basis.

I do not recognize that latency issue either. This is from one of our nodes
(4x 500GB Samsung 840 Pro - sd[c-f]) which has been running for 600+ days
(so the iostat -x output is an average over that period):

# uptime
 16:24:57 up 611 days,  4:03,  1 user,  load average: 1.18, 1.55, 1.72

# iostat -x
[ ... ]
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.16    4.87   22.62   344.18   458.65    58.41     0.05    1.92    0.45    2.24   0.76   2.10
sdd               0.00     0.12    4.37   20.02   317.98   437.95    61.98     0.05    1.90    0.44    2.21   0.78   1.91
sde               0.00     0.12    4.17   19.33   302.45   403.02    60.02     0.04    1.87    0.43    2.18   0.77   1.80
sdf               0.00     0.12    4.51   20.84   322.84   439.70    60.17     0.05    1.84    0.43    2.15   0.76   1.93
[ ... ]

Granted, we do not have very high SSD usage on this cluster and it might
change as we put more load on it, but we will deal with that then. I think
~2ms access time is neither particularly good nor bad.

This is from another cluster we operate - this one has an Intel DC S3700
800GB SSD (sdb):
# uptime
 09:37:26 up 654 days,  8:40,  1 user,  load average: 0.33, 0.40, 0.54

# iostat -x
[ ... ]
sdb               0.01     1.49   39.76   86.79  1252.80  2096.98    52.94     0.02    0.76    1.22    0.54   0.41   5.21
[ ... ]

It is a bit misleading, as the latter has just 3 disks plus a hardware RAID
controller with a 1GB battery-backed cache, whereas the first is a 'cheap',
dumb 12-disk JBOD setup on an IT-mode controller.

All the SSDs in both clusters have 3 partitions - 1 for ceph data and 2
journal partitions (one journal for the SSD itself and one for a platter
disk).

The Intel SSD is very sturdy though - it has averaged 2.1MB/sec of writes
over 654 days - that is somewhere around 120TB so far.

But ultimately it boils down to what you need - in our use case the latter
cluster has to be rock-stable and performing, and we chose the Intel drives
based on that. For the first one we don't really care if we lose a node or
two, and we replace disks every month or whenever it fits into our
going-to-the-datacenter schedule - we wanted an ok'ish performing cluster and
focused more on total space / price than high-performing hardware. The
fantastic thing is that we are not locked into any specific hardware, and we
can replace any of it if we need to and/or find it is suddenly starting to
have issues.

Cheers,
Martin



On Sat, Feb 28, 2015 at 2:55 PM, Andrei Mikhailovsky and...@arhont.com
wrote:


 Martin,

 I have been using Samsung 840 Pro for journals about 2 years now and have
 just replaced all my samsung drives with Intel. We have found a lot of
 performance issues with the 840 Pro (we are using the 128GB model). In particular, a very
 strange behaviour with using 4 partitions (with 50% underprovisioning left
 as empty unpartitioned space on the drive) where the drive would grind to
 almost a halt after a few weeks of use. I was getting 100% utilisation on
 the drives doing just 3-4MB/s writes. This was not the case when I've
 installed the new drives. Manual Trimming helps for a few weeks until the
 same happens again.

 This has been happening with all 840 Pro ssds that we have and contacting
 Samsung Support has proven to be utterly useless. They do not want to speak
 with you until you install windows and run their monkey utility ((.

 Also, I've noticed the latencies of the Samsung 840 Pro ssd drives to be
 about 15-20x slower compared with consumer grade Intel drives, like the Intel
 520. According to ceph osd perf, I would consistently get higher figures on
 the osds with the Samsung journal drive compared with the Intel drive on the
 same server. Something like 2-3ms for Intel vs 40-50ms for the Samsungs.

 At some point we had enough with Samsungs and scrapped them.

 Andrei

 --

 *From: *Martin B Nielsen mar...@unity3d.com
 *To: *Philippe Schwarz p...@schwarz-fr.net
 *Cc: *ceph-users@lists.ceph.com
 *Sent: *Saturday, 28 February, 2015 11:51:57 AM
 *Subject: *Re: [ceph-users] Extreme slowness in SSD cluster with 3 nodes
 and 9 OSD with 3.16-3 kernel


 Hi,

 I cannot recognize that picture; we've been using Samsung 840 Pro in
 production for almost 2 years now - and have had 1 fail.

 We run an 8-node mixed SSD/platter cluster with 4x Samsung 840 Pro (500GB)
 in each, so that is 32x SSD.

 They've written ~25TB of data on average each.

 Using the dd command you had, inside an existing semi-busy mysql guest I get:

 102400000 bytes (102 MB) copied, 5.58218 s, 18.3 MB/s

 Which is still not a lot, but I think it is more a limitation of our
 setup/load.

 We are using dumpling.

 All that aside, I would prob. go with something tried and tested if I was
 to redo it today - we haven't had any issues, but it is still nice to use
 something

Re: [ceph-users] error adding OSD to crushmap

2015-01-14 Thread Martin B Nielsen
Hi Luis,

I might remember wrong, but don't you need to actually create the OSD first
(ceph osd create)?

Then you can assign it a position using the CLI crush commands, along the
lines of the sketch below.
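
Roughly this order (a sketch only - the id and weight are the ones from your
mail, adjust to your setup):

# ceph osd create                          (allocates the next free osd id)
# ceph osd crush add 6 0.0 root=default    (now the crush add should succeed)
# ceph osd tree                            (verify it shows up where you expect)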

Like Jason said, can you send the ceph osd tree output?

Cheers,
Martin

On Mon, Jan 12, 2015 at 1:45 PM, Luis Periquito periqu...@gmail.com wrote:

 Hi all,

 I've been trying to add a few new OSDs, and as I manage everything with
 puppet, it was manually adding via the CLI.

 At one point it adds the OSD to the crush map using:

 # ceph osd crush add 6 0.0 root=default

 but I get
 Error ENOENT: osd.6 does not exist.  create it before updating the crush
 map

 If I read correctly this command should be the correct one to create the
 OSD to the crush map...

 is this a bug? I'm running the latest firefly 0.80.7.

 thanks

 PS: I just edited the crushmap, but it would make it a lot easier to do it
 by the CLI commands...

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD MTBF

2014-10-07 Thread Martin B Nielsen
A bit late getting back on this one.

On Wed, Oct 1, 2014 at 5:05 PM, Christian Balzer ch...@gol.com wrote:

  smartctl states something like
  Wear = 092%, Hours = 12883, Datawritten = 15321.83 TB avg on those. I
  think that is ~30TB/day if I'm doing the calc right.
 
 Something very much does not add up there.
 Either you've written 15321.83 GB on those drives, making it about
 30GB/day and well within the Samsung specs, or you've written 10-20 times
 the expected TBW level of those drives...


My bad, I forgot to say the Wear indicator here (92%) is sorta backwards -
so it means it still has 92% to go before reaching expected TBW limit.

I agree with what Massimiliano Cuttini wrote later as well - if your I/O is
well within the expected TBW over the drive's lifetime, I see no reason to go
for more expensive disks. Just monitor for wear and have a few in stock ready
for replacement.

Regarding the table of ssd and vendors:
Brand    Model          TBW    €     €/TB
Intel    S3500 120Go    70     122   1,74
Intel    S3500 240Go    140    225   1,60
Intel    S3700 100Go    1873   220   0,11
Intel    S3700 200Go    3737   400   0,10
Samsung  840 Pro 120Go  70     120   1,71

I don't disagree with the above - but the table assumes you'll wear out your
SSDs. Adjust the expected wear and the price changes proportionally - if
you're only writing 50-100TB/year per SSD, the value swings heavily in favor
of the cheaper consumer-grade SSDs. It all comes down to your estimated usage
pattern and whether they're 'good enough' for your scenario or not (and/or
whether you trust that vendor).

In my experience ceph seldom (if ever) maxes out the I/O of an SSD - it is
much more likely to hit CPU or network limits before that.

Cheers,
Martin



 In the article I mentioned previously:

 http://www.anandtech.com/show/8239/update-on-samsung-850-pro-endurance-vnand-die-size

 The author clearly comes with a relationship of durability versus SSD
 size, as one would expect. But the Samsung homepage just stated 150TBW,
 for all those models...

 Christian

  Not to advertise or say every samsung 840 ssd is like this:
  http://www.vojcik.net/samsung-ssd-840-endurance-destruct-test/
 
 Seen it before, but I have a feeling that this test doesn't quite put the
 same strain on the poor NANDs as Emmanuel's environment.

 Christian

  Cheers,
  Martin
 
 
  On Wed, Oct 1, 2014 at 10:18 AM, Christian Balzer ch...@gol.com wrote:
 
   On Wed, 1 Oct 2014 09:28:12 +0200 Kasper Dieter wrote:
  
On Tue, Sep 30, 2014 at 04:38:41PM +0200, Mark Nelson wrote:
 On 09/29/2014 03:58 AM, Dan Van Der Ster wrote:
  Hi Emmanuel,
  This is interesting, because we've had sales guys telling us that
  those Samsung drives are definitely the best for a Ceph journal
  O_o !

 Our sales guys or Samsung sales guys?  :)  If it was ours, let me
 know.

  The conventional wisdom has been to use the Intel DC S3700
  because of its massive durability.

 The S3700 is definitely one of the better drives on the market for
 Ceph journals.  Some of the higher end PCIE SSDs have pretty high
 durability (and performance) as well, but cost more (though you can
 save SAS bay space, so it's a trade-off).
Intel P3700 could be an alternative with 10 Drive-Writes/Day for 5
years (see attachment)
   
   They're certainly nice and competitively priced (TBW/$ wise at least).
   However as I said in another thread, once your SSDs start to outlive
   your planned server deployment time (in our case 5 years) that's
   probably good enough.
  
   It's all about finding the balance between cost, speed (BW and IOPS),
   durability and space.
  
   For example I'm currently building a cluster based on 2U, 12 hotswap
   bays servers (because I already had 2 floating around) and am using 4
   100GB DC S3700 (at US$200 each) and 8 HDDS in them.
   Putting in a 400GB DC P3700 (US$1200) instead and 4 more HDDs would
   have pushed me over the budget and left me with a less than 30% used
   SSD 5 years later, at a time when we clearly can expect these things
   to be massively faster and cheaper.
  
   Now if you're actually having a cluster that would wear out a P3700 in
   5 years (or you're planning to run your machines until they burst into
   flames), then that's another story. ^.^
  
   Christian
  
-Dieter
   

 
  Anyway, I'm curious what do the SMART counters say on your SSDs??
  are they really failing due to worn out P/E cycles or is it
  something else?
 
  Cheers, Dan
 
 
  On 29 Sep 2014, at 10:31, Emmanuel Lacour
  elac...@easter-eggs.com wrote:
 
 
  Dear ceph users,
 
 
  we are managing ceph clusters since 1 year now. Our setup is
  typically made of Supermicro servers with OSD sata drives and
  journal on SSD.
 
  Those SSD are all failing one after the other after one year :(
 
  We used Samsung 850 pro (120Go) with two setup (small 

Re: [ceph-users] SSD MTBF

2014-10-01 Thread Martin B Nielsen
Hi,

We settled on Samsung 840 Pro 240GB drives 1½ years ago and we've been happy
so far. We've over-provisioned them a lot (left 120GB unpartitioned).

We have 16x 240GB and 32x 500GB - we've lost 1x 500GB so far.

smartctl states something like
Wear = 092%, Hours = 12883, Datawritten = 15321.83 TB avg on those. I think
that is ~30TB/day if I'm doing the calc right.
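
For reference, the numbers come from something like this (just a sketch - the
attribute names vary per vendor/model, and iirc Total_LBAs_Written is counted
in 512-byte units on these Samsungs):

# smartctl -a /dev/sdc | egrep -i 'wear|lbas_written|power_on'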

Not to advertise or say every samsung 840 ssd is like this:
http://www.vojcik.net/samsung-ssd-840-endurance-destruct-test/

Cheers,
Martin


On Wed, Oct 1, 2014 at 10:18 AM, Christian Balzer ch...@gol.com wrote:

 On Wed, 1 Oct 2014 09:28:12 +0200 Kasper Dieter wrote:

  On Tue, Sep 30, 2014 at 04:38:41PM +0200, Mark Nelson wrote:
   On 09/29/2014 03:58 AM, Dan Van Der Ster wrote:
Hi Emmanuel,
    This is interesting, because we've had sales guys telling us that
those Samsung drives are definitely the best for a Ceph journal O_o !
  
   Our sales guys or Samsung sales guys?  :)  If it was ours, let me know.
  
The conventional wisdom has been to use the Intel DC S3700 because
of its massive durability.
  
   The S3700 is definitely one of the better drives on the market for
   Ceph journals.  Some of the higher end PCIE SSDs have pretty high
   durability (and performance) as well, but cost more (though you can
   save SAS bay space, so it's a trade-off).
  Intel P3700 could be an alternative with 10 Drive-Writes/Day for 5 years
  (see attachment)
 
 They're certainly nice and competitively priced (TBW/$ wise at least).
 However as I said in another thread, once your SSDs start to outlive your
 planned server deployment time (in our case 5 years) that's probably good
 enough.

 It's all about finding the balance between cost, speed (BW and IOPS),
 durability and space.

 For example I'm currently building a cluster based on 2U, 12 hotswap bays
 servers (because I already had 2 floating around) and am using 4 100GB DC
 S3700 (at US$200 each) and 8 HDDS in them.
 Putting in a 400GB DC P3700 (US$1200) instead and 4 more HDDs would have
 pushed me over the budget and left me with a less than 30% used SSD 5
 years later, at a time when we clearly can expect these things to be
 massively faster and cheaper.

 Now if you're actually having a cluster that would wear out a P3700 in 5
 years (or you're planning to run your machines until they burst into
 flames), then that's another story. ^.^

 Christian

  -Dieter
 
  
   
    Anyway, I'm curious what do the SMART counters say on your SSDs??
are they really failing due to worn out P/E cycles or is it
something else?
   
Cheers, Dan
   
   
On 29 Sep 2014, at 10:31, Emmanuel Lacour elac...@easter-eggs.com
wrote:
   
   
Dear ceph users,
   
   
we are managing ceph clusters since 1 year now. Our setup is
typically made of Supermicro servers with OSD sata drives and
journal on SSD.
   
Those SSD are all failing one after the other after one year :(
   
We used Samsung 850 pro (120Go) with two setup (small nodes with 2
ssd, 2 HD in 1U):
   
1) raid 1 :( (bad idea, each SSD support all the OSDs journals
writes :() 2) raid 1 for OS (nearly no writes) and dedicated
partition for journals (one per OSD)
   
   
I'm convinced that the second setup is better and we migrate old
setup to this one.
   
Thought, statistics gives 60GB (option 2) to 100 GB (option 1)
writes per day on SSD on a not really over loaded cluster. Samsung
claims to give 5 years warranty if under 40GB/day. Those numbers
seems very low to me.
   
What are your experiences on this? What write volumes do you
encounter, on wich SSD models, which setup and what MTBF?
   
   
--
Easter-eggs  Spécialiste GNU/Linux
44-46 rue de l'Ouest  -  75014 Paris  -  France -  Métro Gaité
Phone: +33 (0) 1 43 35 00 37-   Fax: +33 (0) 1 43 35 00 76
mailto:elac...@easter-eggs.com  -   http://www.easter-eggs.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
   
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
   
  
   ___
   ceph-users mailing list
   ceph-users@lists.ceph.com
   http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 --
 Christian BalzerNetwork/Systems Engineer
 ch...@gol.com   Global OnLine Japan/Fusion Communications
 http://www.gol.com/
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] resizing the OSD

2014-09-09 Thread Martin B Nielsen
Hi,

Or did you mean that some OSDs are near full while others are under-utilized?

On Sat, Sep 6, 2014 at 5:04 PM, Christian Balzer ch...@gol.com wrote:


 Hello,

 On Fri, 05 Sep 2014 15:31:01 -0700 JIten Shah wrote:

  Hello Cephers,
 
  We created a ceph cluster with 100 OSD, 5 MON and 1 MDS and most of the
  stuff seems to be working fine but we are seeing some degrading on the
  osd's due to lack of space on the osd's.

 Please elaborate on that degradation.

  Is there a way to resize the
  OSD without bringing the cluster down?
 

 Define both resize and cluster down.

 As in, resizing how?
 Are your current OSDs on disks/LVMs that are not fully used and thus could
 be grown?
 What is the size of your current OSDs?

 The normal way of growing a cluster is to add more OSDs.
 Preferably of the same size and same performance disks.
 This will not only simplify things immensely but also make them a lot more
 predictable.
 This of course depends on your use case and usage patterns, but often when
 running out of space you're also running out of other resources like CPU,
 memory or IOPS of the disks involved. So adding more instead of growing
 them is most likely the way forward.

 If you were to replace actual disks with larger ones, take them (the OSDs)
 out one at a time and re-add it. If you're using ceph-deploy, it will use
 the disk size as basic weight, if you're doing things manually make sure
 to specify that size/weight accordingly.
 Again, you do want to do this for all disks to keep things uniform.


Just want to emphasize this - if your disks already have high utilization and
you add a [much] larger drive that gets auto-weighted at, say, 2 or 3x the
other disks, that disk will see that much higher utilization and will most
likely max out and bottleneck your cluster. So keep that in mind :).
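
If need be, you can check and cap the weight manually (a sketch - the osd id
and weight value here are just examples):

# ceph osd tree                           (shows the current crush weight per osd)
# ceph osd crush reweight osd.12 1.0      (keep the new, larger disk at the same weight as the rest)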

Cheers,
Martin



 If your cluster (pools really) are set to a replica size of at least 2
 (risky!) or 3 (as per Firefly default), taking a single OSD out would of
 course never bring the cluster down.
 However taking an OSD out and/or adding a new one will cause data movement
 that might impact your cluster's performance.

 Regards,

 Christian
 --
 Christian BalzerNetwork/Systems Engineer
 ch...@gol.com   Global OnLine Japan/Fusion Communications
 http://www.gol.com/
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One stuck PG

2014-09-04 Thread Martin B Nielsen
Hi Erwin,

Did you try restarting the primary OSD for that PG (osd.24)? Sometimes it
needs a little nudge that way.

Otherwise, what does ceph pg dump say about that PG?
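
Something along these lines (a sketch - adjust the pg/osd ids and the init
system to your setup):

# ceph pg 206.3f query                    (shows backfill state and which osds it is waiting on)
# ceph pg dump | grep ^206.3f
# /etc/init.d/ceph restart osd.24         (or 'restart ceph-osd id=24' on upstart-based systems)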

Cheers,
Martin


On Thu, Sep 4, 2014 at 9:00 AM, Erwin Lubbers c...@erwin.lubbers.org
wrote:

 Hi,

 My cluster is giving one stuck pg which seems to be backfilling for days
 now. Any suggestions on how to solve it?

 HEALTH_WARN 1 pgs backfilling; 1 pgs stuck unclean; recovery 32/6000626
 degraded (0.001%)
 pg 206.3f is stuck unclean for 557655.601540, current state
 active+remapped+backfilling, last acting [24,28,3,44]
 pg 206.3f is active+remapped+backfilling, acting [24,28,3,44]
 recovery 32/6000626 degraded (0.001%)

 Regards,
 Erwin
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD journal deployment experiences

2014-09-04 Thread Martin B Nielsen
Hi Dan,

We took a different approach (and our cluster is tiny compared to many
others) - we have two pools; normal and ssd.

We use 14 disks in each OSD server: 8 platter and 4 SSD for ceph, plus 2 SSDs
for OS/journals. The two OS SSDs are partitioned as RAID1, using about half
the space for the OS and leaving the rest on each for 2x platter-disk
journals plus unprovisioned space. On top of that, our SSD pool disks also
hold 2x journals each; their own plus an additional one from a platter disk.
We have 8 OSD nodes.

So whenever an SSD fails we lose 2 OSDs (but never more).

We've had this system in production for ~1½ year now and so far we've had 1
ssd and 2 platter disk fail. We run a couple of hundred vm-guests on it and
use ~60TB.

On a daily basis we avg. 30MB/sec r/w and ~600 iops so not very high usage.
The times we lost disks we hardly noticed. All SSD (OS included) have a
general utilization of 5%, platter disks near 10%.

We did a lot of initial testing around putting journals on the OS SSDs as
well as extra journals on the SSD OSDs, but we didn't see the big differences
or high latencies that others have experienced. When/if we notice otherwise
we'll probably switch to pure SSDs as journal holders.

We originally deployed using saltstack and even though we have automated
replacing disks we still do it manually 'just to be sure'. It takes 5-10min
to replace an old disk and get it backfilling, so I don't expect us to
spend any time automating this.

Recovering 2 disks at once for us takes a long time but we've intentionally
set backfilling low and it is not noticeable on the cluster when it happens.

Anyways, we have pretty low cluster usage, but in our experience the SSDs
seem to handle the constant load very well.

Cheers,
Martin




On Thu, Sep 4, 2014 at 6:21 PM, Dan Van Der Ster daniel.vanders...@cern.ch
wrote:

 Dear Cephalopods,

 In a few weeks we will receive a batch of 200GB Intel DC S3700’s to
 augment our cluster, and I’d like to hear your practical experience and
 discuss options how best to deploy these.

 We’ll be able to equip each of our 24-disk OSD servers with 4 SSDs, so
 they will become 20 OSDs + 4 SSDs per server. Until recently I’ve been
 planning to use the traditional deployment: 5 journal partitions per SSD.
 But as SSD-day approaches, I growing less comfortable with the idea of 5
 OSDs going down every time an SSD fails, so perhaps there are better
 options out there.

 Before getting into options, I’m curious about real reliability of these
 drives:

 1) How often are DC S3700's failing in your deployments?
 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is the
 backfilling which results from an SSD failure? Have you considered tricks
 like increasing the down out interval so backfilling doesn’t happen in this
 case (leaving time for the SSD to be replaced)?

 Beyond the usually 5 partitions deployment, is anyone running a RAID1 or
 RAID10 for the journals? If so, are you using the raw block devices or
 formatting it and storing the journals as files on the SSD array(s)? Recent
 discussions seem to indicate that XFS is just as fast as the block dev,
 since these drives are so fast.

 Next, I wonder how people with puppet/chef/… are handling the
 creation/re-creation of the SSD devices. Are you just wiping and rebuilding
 all the dependent OSDs completely when the journal dev fails? I’m not keen
 on puppetizing the re-creation of journals for OSDs...

 We also have this crazy idea of failing over to a local journal file in
 case an SSD fails. In this model, when an SSD fails we’d quickly create a
 new journal either on another SSD or on the local OSD filesystem, then
 restart the OSDs before backfilling started. Thoughts?

 Lastly, I would also consider using 2 of the SSDs in a data pool (with the
 other 2 SSDs to hold 20 journals — probably in a RAID1 to avoid backfilling
 10 OSDs when an SSD fails). If the 10-1 ratio of SSDs would perform
 adequately, that’d give us quite a few SSDs to build a dedicated high-IOPS
 pool.

 I’d also appreciate any other suggestions/experiences which might be
 relevant.

 Thanks!
 Dan

 -- Dan van der Ster || Data  Storage Services || CERN IT Department --


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge issues with slow requests

2014-09-04 Thread Martin B Nielsen
Just echoing what Christian said.

Also, iirc the 'currently waiting for subops from [...]' could also mean a
problem on those OSDs, as it waits for acks from them (I might remember
wrong).

If that is the case you might want to check in on osd 13 & 37 as well.

With that cluster load and size you should not have this problem; I'm pretty
sure you're dealing with a rogue/faulty OSD or node somewhere.
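
A quick way to look for an outlier (a sketch - 'ceph osd perf' needs a
reasonably recent ceph version):

# ceph osd perf              (commit/apply latency per osd - look for ones sticking out)
# ceph health detail         (shows which osds the blocked requests are sitting on)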

Cheers,
Martin


On Fri, Sep 5, 2014 at 2:28 AM, Christian Balzer ch...@gol.com wrote:

 On Thu, 4 Sep 2014 12:02:13 +0200 David wrote:

  Hi,
 
  We’re running a ceph cluster with version:
 
  0.67.7-1~bpo70+1
 
  All of a sudden we’re having issues with the cluster (running RBD images
  for kvm) with slow requests on all of the OSD servers. Any idea why and
  how to fix it?
 
 You give us a Ceph version at least, but for anybody to make guesses we
 need much more information than a log spew.

 How many nodes/OSDs, OS, hardware, OSD details (FS, journals on SSDs), etc.

 Run atop (in a sufficiently large terminal) on all your nodes, see if you
 can spot a bottleneck, like a disk being at 100% all the time with a
 much higher avio than the others.
 Looking at your logs, I'd pay particular attention to the disk holding
 osd.22.
 A single slow disk can bring a whole large cluster to a crawl.
 If you're using a hardware controller with a battery backed up cache,
 check if that is fine, loss of the battery would switch from writeback to
 writethrough and massively slow down IOPS.

 Regards,

 Christian
 
  2014-09-04 11:56:35.868521 mon.0 [INF] pgmap v12504451: 6860 pgs: 6860
  active+clean; 12163 GB data, 36308 GB used, 142 TB / 178 TB avail;
  634KB/s rd, 487KB/s wr, 90op/s 2014-09-04 11:56:29.510270 osd.22 [WRN]
  15 slow requests, 1 included below; oldest blocked for  44.745754 secs
  2014-09-04 11:56:29.510276 osd.22 [WRN] slow request 30.999821 seconds
  old, received at 2014-09-04 11:55:58.510424:
  osd_op(client.10731617.0:81868956
  rbd_data.967e022eb141f2.0e72 [write 0~4194304] 3.c585cebe
  e13901) v4 currently waiting for subops from [37,13] 2014-09-04
  11:56:30.510528 osd.22 [WRN] 21 slow requests, 6 included below; oldest
  blocked for  45.745989 secs 2014-09-04 11:56:30.510534 osd.22 [WRN]
  slow request 30.122555 seconds old, received at 2014-09-04
  11:56:00.387925: osd_op(client.13425082.0:11962345
  rbd_data.54f24c3d1b58ba.3753 [stat,write 1114112~8192]
  3.c9e49140 e13901) v4 currently waiting for subops from [13,42]
  2014-09-04 11:56:30.510537 osd.22 [WRN] slow request 30.122362 seconds
  old, received at 2014-09-04 11:56:00.388118:
  osd_op(client.13425082.0:11962352
  rbd_data.54f24c3d1b58ba.3753 [stat,write 1134592~4096]
  3.c9e49140 e13901) v4 currently waiting for subops from [13,42]
  2014-09-04 11:56:30.510541 osd.22 [WRN] slow request 30.122298 seconds
  old, received at 2014-09-04 11:56:00.388182:
  osd_op(client.13425082.0:11962353
  rbd_data.54f24c3d1b58ba.3753 [stat,write 4046848~8192]
  3.c9e49140 e13901) v4 currently waiting for subops from [13,42]
  2014-09-04 11:56:30.510544 osd.22 [WRN] slow request 30.121577 seconds
  old, received at 2014-09-04 11:56:00.388903:
  osd_op(client.13425082.0:11962374
  rbd_data.54f24c3d1b58ba.47f2 [stat,write 2527232~4096]
  3.cd9a9015 e13901) v4 currently waiting for subops from [45,1]
  2014-09-04 11:56:30.510548 osd.22 [WRN] slow request 30.121518 seconds
  old, received at 2014-09-04 11:56:00.388962:
  osd_op(client.13425082.0:11962375
  rbd_data.54f24c3d1b58ba.47f2 [stat,write 3133440~4096]
  3.cd9a9015 e13901) v4 currently waiting for subops from [45,1]
  2014-09-04 11:56:31.510706 osd.22 [WRN] 26 slow requests, 6 included
  below; oldest blocked for  46.746163 secs 2014-09-04 11:56:31.510711
  osd.22 [WRN] slow request 31.035418 seconds old, received at 2014-09-04
  11:56:00.475236: osd_op(client.9266625.0:135900595
  rbd_data.42d6792eb141f2.bc00 [stat,write 2097152~4096]
  3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
  2014-09-04 11:56:31.510715 osd.22 [WRN] slow request 31.035335 seconds
  old, received at 2014-09-04 11:56:00.475319:
  osd_op(client.9266625.0:135900596
  rbd_data.42d6792eb141f2.bc00 [stat,write 2162688~4096]
  3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
  2014-09-04 11:56:31.510718 osd.22 [WRN] slow request 31.035270 seconds
  old, received at 2014-09-04 11:56:00.475384:
  osd_op(client.9266625.0:135900597
  rbd_data.42d6792eb141f2.bc00 [stat,write 2400256~16384]
  3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
  2014-09-04 11:56:31.510721 osd.22 [WRN] slow request 31.035093 seconds
  old, received at 2014-09-04 11:56:00.475561:
  osd_op(client.9266625.0:135900598
  rbd_data.42d6792eb141f2.bc00 [stat,write 2420736~4096]
  3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
  2014-09-04 11:56:31.510724 osd.22 [WRN] slow request 31.034990 seconds
  old, 

Re: [ceph-users] Ceph Not getting into a clean state

2014-05-09 Thread Martin B Nielsen
Hi,

I experienced exactly the same with 14.04 and the 0.79 release.

It was a fresh clean install with the default crushmap and a ceph-deploy
install as per the quick-start guide.

Oddly enough, changing the replica size (incl. min_size) from 3 -> 2 (and
2 -> 1) and back again made it work.
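
For reference, changing it is just something like the following per pool
(pool names here are the defaults - adjust to yours), and then the same in
reverse:

# ceph osd pool set rbd size 2
# ceph osd pool set rbd min_size 1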

I didn't have time to look into replicating the issue.

Cheers,
Martin


On Thu, May 8, 2014 at 4:30 PM, Georg Höllrigl
georg.hoellr...@xidras.comwrote:

 Hello,

 We've a fresh cluster setup - with Ubuntu 14.04 and ceph firefly. By now
 I've tried this multiple times - but the result keeps the same and shows me
 lots of troubles (the cluster is empty, no client has accessed it)

 #ceph -s
 cluster b04fc583-9e71-48b7-a741-92f4dff4cfef
  health HEALTH_WARN 470 pgs stale; 470 pgs stuck stale; 18 pgs stuck
  unclean; 26 requests are blocked > 32 sec
  monmap e2: 3 mons at {ceph-m-01=10.0.0.100:6789/0,
 ceph-m-02=10.0.1.101:6789/0,ceph-m-03=10.0.1.102:6789/0}, election epoch
 8, quorum 0,1,2 ceph-m-01,ceph-m-02,ceph-m-03
  osdmap e409: 9 osds: 9 up, 9 in
   pgmap v1231: 480 pgs, 9 pools, 822 bytes data, 43 objects
 9373 MB used, 78317 GB / 78326 GB avail
  451 stale+active+clean
1 stale+active+clean+scrubbing
   10 active+clean
   18 stale+active+remapped

 Anyone an idea what happens here? Should an empty cluster not show only
 active+clean pgs?


 Regards,
 Georg
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Red Hat to acquire Inktank

2014-05-01 Thread Martin B Nielsen
First off, congrats to inktank!

I'm sure that with Red Hat backing the project it will see even quicker
development.

My only worry is support for future non-RHEL platforms; like many others
we've built our ceph stack around Ubuntu, and I'm just hoping support won't
deteriorate into something that is only built/tested around CentOS/Red Hat
(i.e. moving the I+C from Ubuntu to only be on CentOS/Red Hat -
http://ceph.com/docs/master/start/os-recommendations/ - and just keeping a
basic build test for all other distros). I fear a political decision to only
run those extra tests on CentOS/Red Hat would eventually 'force' people to
run it on CentOS/Red Hat.

Cheers,
Martin



On Wed, Apr 30, 2014 at 2:18 PM, Sage Weil s...@inktank.com wrote:

 Today we are announcing some very big news: Red Hat is acquiring Inktank.
 We are very excited about what this means for Ceph, the community, the
 team, our partners, and our customers. Ceph has come a long way in the ten
 years since the first line of code has been written, particularly over the
 last two years that Inktank has been focused on its development. The fifty
 members of the Inktank team, our partners, and the hundreds of other
 contributors have done amazing work in bringing us to where we are today.

 We believe that, as part of Red Hat, the Inktank team will be able to
 build a better quality Ceph storage platform that will benefit the entire
 ecosystem. Red Hat brings a broad base of expertise in building and
 delivering hardened software stacks as well as a wealth of resources that
 will help Ceph become the transformative and ubiquitous storage platform
 that we always believed it could be.

 For existing Inktank customers, this is going to mean turning a reliable
 and robust storage system into something that delivers even more value. In
 particular, joining forces with the Red Hat team will improve our ability
 to address problems at all layers of the storage stack, including in the
 kernel. We naturally recognize that many customers and users have built
 platforms based on other Linux distributions. We will continue to support
 these installations while we determine how to provide the best customer
 experience moving forward and how the next iteration of the enterprise
 Ceph product will be structured. In the meantime, our team remains
 committed to keeping Ceph an open, multiplatform project that works in any
 environment where it makes sense, including other Linux distributions and
 non-Linux operating systems.

 Red Hat is one of only a handful of companies that I trust to steward the
 Ceph project. When we started Inktank two years ago, our goal was to build
 the business by making Ceph successful as a broad-based, collaborative
 open source project with a vibrant user, developer, and commercial
 community. Red Hat shares this vision. They are passionate about open
 source, and have demonstrated that they are strong and fair stewards with
 other critical projects (like KVM). Red Hat intends to administer the Ceph
 trademark in a manner that protects the ecosystem as a whole and creates a
 level playing field where everyone is held to the same standards of use.
 Similarly, policies like upstream first ensure that bug fixes and
 improvements that go into Ceph-derived products are always shared with the
 community to streamline development and benefit all members of the
 ecosystem.

 One important change that will take place involves Inktank's product
 strategy, in which some add-on software we have developed is proprietary.
 In contrast, Red Hat favors a pure open source model. That means that
 Calamari, the monitoring and diagnostics tool that Inktank has developed
 as part of the Inktank Ceph Enterprise product, will soon be open sourced.

 This is a big step forward for the Ceph community. Very little will change
 on day one as it will take some time to integrate the Inktank business and
 for any significant changes to happen with our engineering activities.
 However, we are very excited about what is coming next for Ceph and are
 looking forward to this new chapter.

 I'd like to thank everyone who has helped Ceph get to where we are today:
 the amazing research group at UCSC where it began, DreamHost for
 supporting us for so many years, the incredible Inktank team, and the many
 contributors and users that have helped shape the system. We continue to
 believe that robust, scalable, and completely open storage platforms like
 Ceph will transform a storage industry that is still dominated by
 proprietary systems. Let's make it happen!

 sage
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Live database files on Ceph

2014-04-04 Thread Martin B Nielsen
Hi,

We're running MySQL in a multi-master cluster (galera), MySQL standalones,
PostgreSQL, MSSQL and Oracle DBs on ceph RBD via QEMU/KVM. As someone else
pointed out it is usually faster with ceph, but sometimes you'll get some
odd slow reads.

Latency is our biggest enemy.

Oracle comes with an awesome way to capture an insane amount of perf stats,
and through that we can see that our avg. latency for writes is ~12ms, with
reads slightly higher at around 15ms. In our use case that is acceptable. If
we used local [SSD] disks this would be much lower (< 1-2ms).

We've also once experienced our galera cluster going out of sync due to a
very stressed cluster/network (this particular cluster is saturated every
now and then - both disk and network).

We had to change the scheduler from cfq -> deadline on most db servers to get
acceptable speeds, or we encountered writes taking up to 2 sec whenever lots
of sequential data had to be written.
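
A minimal sketch of that change (sdX being the disk backing the guest; this
only lasts until reboot unless you persist it via e.g. a udev rule or kernel
boot parameter):

# cat /sys/block/sdX/queue/scheduler      (current scheduler is shown in brackets)
# echo deadline > /sys/block/sdX/queue/scheduler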

I wouldn't run super high precision/performance databases on it though. Your
db performance will always reflect the state of your entire cluster. I'd say
it runs very well for anything not requiring extremely fine-tuned,
always-consistent access times. At the very least, if you plan to do that,
I'd suggest finding some way to isolate and guarantee performance for your
guests no matter how busy your cluster gets (which I don't think you can do).

We run with SSD journals and SSD backends for most of our db stuff, as we
found that using normal platter disks as the backend could cause some issues
if we hit a spiky period of cluster activity (even with SSD journals).

Cheers,
Martin


On Thu, Apr 3, 2014 at 8:04 PM, Brian Beverage 
bbever...@americandatanetwork.com wrote:

 I am looking at setting up a Ganeti cluster using KVM and CentOS. While
 looking at storage I first looked at Gluster but noticed in the
 documentation it does not allow Live Database files to be saved to it. Does
 Ceph allow the use of LIVE database files being saved to it. If so does the
 database perform well? We have a couple Database servers that will be
 virtualized. I would like to know what other Ceph users are doing with
 their virtual environments that contain databases. I do not want to be
 locked into a SAN. I also would like to do this without being locked into a
 proprietary VM software. That is why Ganeti and KVM was the preferred
 software.



 Thanks,

 Brian

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS debugging

2014-03-31 Thread Martin B Nielsen
Hi,

I can see you're running mon, mds and osd on the same server.

Also, from a quick glance you're using around 13GB resident memory.

If you only have 16GB in your system I'm guessing you'll be swapping about
now (or close). How much mem does the system hold?

Also, how busy are the disks? Or is it primarily cpu-bound? Do you have
many processes waiting for run time or high interrupt count?
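
A few quick things to look at while it is slow (just a sketch):

# free -m                 (is it dipping into swap?)
# vmstat 1                (procs waiting (r/b), swap in/out (si/so), interrupts and context switches)
# iostat -x 1             (per-disk utilization and await)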

/Martin


On Mon, Mar 31, 2014 at 1:49 PM, Kenneth Waegeman kenneth.waege...@ugent.be
 wrote:

 Hi all,

 Before the weekend we started some copying tests over ceph-fuse.
 Initially, this went ok. But then the performance started dropping
 gradually. Things are going very slow now:

 2014-03-31 13:36:37.047423 mon.0 [INF] pgmap v265871: 1300 pgs: 1300
 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 44747
 kB/s rd, 216 kB/s wr, 10 op/s
 2014-03-31 13:36:38.049286 mon.0 [INF] pgmap v265872: 1300 pgs: 1300
 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 4069
 B/s rd, 363 kB/s wr, 24 op/s
 2014-03-31 13:36:39.057680 mon.0 [INF] pgmap v265873: 1300 pgs: 1300
 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 5092
 B/s rd, 151 kB/s wr, 22 op/s
 2014-03-31 13:36:40.075718 mon.0 [INF] pgmap v265874: 1300 pgs: 1300
 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 25961
 B/s rd, 1527 B/s wr, 10 op/s
 2014-03-31 13:36:41.087764 mon.0 [INF] pgmap v265875: 1300 pgs: 1300
 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 71574
 kB/s rd, 4564 B/s wr, 17 op/s
 2014-03-31 13:36:42.109200 mon.0 [INF] pgmap v265876: 1300 pgs: 1300
 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 71238
 kB/s rd, 3534 B/s wr, 9 op/s
 2014-03-31 13:36:43.128113 mon.0 [INF] pgmap v265877: 1300 pgs: 1300
 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 4022
 B/s rd, 116 kB/s wr, 24 op/s
 2014-03-31 13:36:44.143382 mon.0 [INF] pgmap v265878: 1300 pgs: 1300
 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 8030
 B/s rd, 117 kB/s wr, 29 op/s
 2014-03-31 13:36:45.160405 mon.0 [INF] pgmap v265879: 1300 pgs: 1300
 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 7049
 B/s rd, 4531 B/s wr, 9 op/s


 ceph-mds seems very busy, and also only one osd!

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 54279 root  20   0 8561m 7.5g 4408 S 105.6 23.8   3202:05 ceph-mds
 50242 root  20   0 1378m 373m 6452 S  0.7  1.2 523:38.77 ceph-osd
 49446 root  18  -2 10644  356  320 S  0.0  0.0   0:00.00 udevd
 49444 root  18  -2 10644  428  320 S  0.0  0.0   0:00.00 udevd
 49319 root  20   0 1444m 405m 5684 S  0.0  1.3 513:41.13 ceph-osd
 48452 root  20   0 1365m 364m 5636 S  0.0  1.1 551:52.31 ceph-osd
 47641 root  20   0 1567m 388m 5880 S  0.0  1.2 754:50.60 ceph-osd
 46811 root  20   0 1441m 393m 8256 S  0.0  1.2 603:11.26 ceph-osd
 46028 root  20   0 1594m 398m 6156 S  0.0  1.2 657:22.16 ceph-osd
 45275 root  20   0 1545m 510m 9920 S 18.9  1.6 943:11.99 ceph-osd
 44532 root  20   0 1509m 395m 7380 S  0.0  1.2 665:30.66 ceph-osd
 43835 root  20   0 1397m 384m 8292 S  0.0  1.2 466:35.47 ceph-osd
 43146 root  20   0 1412m 393m 5884 S  0.0  1.2 506:42.07 ceph-osd
 42496 root  20   0 1389m 364m 5292 S  0.0  1.1 522:37.70 ceph-osd
 41863 root  20   0 1504m 393m 5864 S  0.0  1.2 462:58.11 ceph-osd
 39035 root  20   0  918m 694m 3396 S  3.3  2.2  55:53.59 ceph-mon

 Does this look familiar to someone?

 How can we debug this further?
 I already have set the debug level of mds to 5. There are a lot of
 'lookup' entries, but I can't see any reported warnings or errors.

 Thanks!

 Kind regards,
 Kenneth

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] help, add mon failed lead to cluster failure

2014-03-26 Thread Martin B Nielsen
Hi,

I experienced this from time to time with older releases of ceph, but haven't
stumbled upon it for some time.

Often I had to revert to the older state by following:
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#removing-monitors-from-an-unhealthy-cluster

i.e. dump the monmap, find the original monitor(s), remove the newest
addition, inject the modified map and restart - then it should come online
again.
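
The rough sequence from that page, as far as I remember it (a sketch - stop
the monitors first and double-check ids/paths against the docs; 'mon0' and
'<new-mon-id>' are placeholders):

# ceph-mon -i mon0 --extract-monmap /tmp/monmap
# monmaptool --print /tmp/monmap          (find the monitor that was just added)
# monmaptool --rm <new-mon-id> /tmp/monmap
# ceph-mon -i mon0 --inject-monmap /tmp/monmap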

Cheers,
Martin


On Wed, Mar 26, 2014 at 11:40 AM, duan.xuf...@zte.com.cn wrote:


 Hi,
 I just add a new mon to a health cluster by following website
 manual http://ceph.com/docs/master/rados/operations/add-or-rm-mons/; ADDING
 MONITORS step by step,

 but when i execute step 6:
 ceph mon add mon-id ip[:port]

 the command didn't return, then i execute ceph -s on health mon node,
 this command didn't return either.

 so i try to restart mon to recover the whole cluster, but it seems never
 recover.

 Please anyone tell me how to deal with it?


 === mon.storage1 ===
 Starting Ceph mon.storage1 on storage1...
 Starting ceph-create-keys on storage1...

 [root@storage1 ~]# ceph -s   //after restart mon , ceph -s still have
 no output




 [root@storage1 ceph]# tail ceph-mon.storage1.log
 2014-03-26 18:20:33.338554 7f60dbb967a0  0 ceph version 0.72.2
 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 24214
 2014-03-26 18:20:33.460282 7f60dbb967a0  1 mon.storage1@-1(probing) e2
 preinit fsid 3429fd17-4a92-4d3b-a7fa-04adedb0da82
 2014-03-26 18:20:33.460694 7f60dbb967a0  1 mon.storage1@-1(probing).pg v0
 on_upgrade discarding in-core PGMap
 2014-03-26 18:20:33.487899 7f60dbb967a0  0 mon.storage1@-1(probing) e2
  my rank is now 0 (was -1)
 2014-03-26 18:20:33.488575 7f60d6854700  0 -- 193.168.1.100:6789/0 
 193.168.1.133:6789/0 pipe(0x3f38280 sd=21 :0 s=1 pgs=0 cs=0 l=0
 c=0x3f19600).fault
 2014-03-26 18:21:33.487686 7f60d8657700  0 
 mon.storage1@0(probing).data_health(0)
 update_stats avail 86% total 51606140 used 4324004 avail 44660696
 2014-03-26 18:22:33.488091 7f60d8657700  0 
 mon.storage1@0(probing).data_health(0)
 update_stats avail 86% total 51606140 used 4324004 avail 44660696
 2014-03-26 18:23:33.488500 7f60d8657700  0 
 mon.storage1@0(probing).data_health(0)
 update_stats avail 86% total 51606140 used 4324004 avail 44660696


 




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Restarts cause excessively high load average and requests are blocked > 32 sec

2014-03-23 Thread Martin B Nielsen
Hi,

I can see ~17% hardware interrupts which I find a little high - can you
make sure all load is spread over all your cores (/proc/interrupts)?
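
Something like this gives a quick picture (a sketch - the interface/queue
names depend on your hardware):

# grep -i eth /proc/interrupts            (are the nic queues pinned to a single core?)
# mpstat -P ALL 1                         (does %irq/%soft pile up on one cpu?)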

What about disk util once you restart them? Are they all 100% utilized or
is it 'only' mostly cpu-bound?

Also you're running a monitor on this node - how is the load on the nodes
where you run a monitor compared to those where you don't?

Cheers,
Martin


On Thu, Mar 20, 2014 at 10:18 AM, Quenten Grasso qgra...@onq.com.au wrote:

  Hi All,



 I left out my OS/kernel version, Ubuntu 12.04.4 LTS w/ Kernel
 3.10.33-031033-generic (We upgrade our kernels to 3.10 due to Dell Drivers).



 Here's an example of starting all the OSD's after a reboot.



 top - 09:10:51 up 2 min,  1 user,  load average: 332.93, 112.28, 39.96

 Tasks: 310 total,   1 running, 309 sleeping,   0 stopped,   0 zombie

 Cpu(s): 50.3%us, 32.5%sy,  0.0%ni,  0.0%id,  0.0%wa, 17.2%hi,  0.0%si,
 0.0%st

 Mem:  32917276k total,  6331224k used, 26586052k free, 1332k buffers

 Swap: 33496060k total,0k used, 33496060k free,  1474084k cached



   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

 15875 root  20   0  910m 381m  50m S   60  1.2   0:50.57 ceph-osd

 2996 root  20   0  867m 330m  44m S   59  1.0   0:58.32 ceph-osd

 4502 root  20   0  907m 372m  47m S   58  1.2   0:55.14 ceph-osd

 12465 root  20   0  949m 418m  55m S   58  1.3   0:51.79 ceph-osd

 4171 root  20   0  886m 348m  45m S   57  1.1   0:56.17 ceph-osd

 3707 root  20   0  941m 405m  50m S   57  1.3   0:59.68 ceph-osd

 3560 root  20   0  924m 394m  51m S   56  1.2   0:59.37 ceph-osd

 4318 root  20   0  965m 435m  55m S   56  1.4   0:54.80 ceph-osd

 3337 root  20   0  935m 407m  51m S   56  1.3   1:01.96 ceph-osd

 3854 root  20   0  897m 366m  48m S   55  1.1   1:00.55 ceph-osd

 3143 root  20   0 1364m 424m  24m S   16  1.3   1:08.72 ceph-osd

 2509 root  20   0  652m 261m  62m S2  0.8   0:26.42 ceph-mon

 4 root  20   0 000 S0  0.0   0:00.08 kworker/0:0



 Regards,

 Quenten Grasso



 *From:* ceph-users-boun...@lists.ceph.com [mailto:
 ceph-users-boun...@lists.ceph.com] *On Behalf Of *Quenten Grasso
 *Sent:* Tuesday, 18 March 2014 10:19 PM
 *To:* 'ceph-users@lists.ceph.com'
 *Subject:* [ceph-users] OSD Restarts cause excessively high load average
  and requests are blocked > 32 sec



 Hi All,



 I'm trying to troubleshoot a strange issue with my Ceph cluster.



 We're Running Ceph Version 0.72.2

 All Nodes are Dell R515's w/ 6C AMD CPU w/ 32GB Ram, 12 x 3TB NearlineSAS
 Drives and 2 x 100GB Intel DC S3700 SSD's for Journals.

 All Pools have a replica of 2 or better. I.e. metadata replica of 3.



 I have 55 OSD's in the cluster across 5 nodes. When I restart the OSD's on
 a single node (any node) the load average of that node shoots up to 230+
 and the whole cluster starts blocking IO requests until it settles down and
 its fine again.



  Any ideas on why the load average goes so crazy & starts to block IO?





 snips from my ceph.conf

 [osd]

 osd data = /var/ceph/osd.$id

 osd journal size = 15000

 osd mkfs type = xfs

 osd mkfs options xfs = -i size=2048 -f

 osd mount options xfs =
 rw,noexec,nodev,noatime,nodiratime,barrier=0,inode64,logbufs=8,logbsize=256k

 osd max backfills = 5

 osd recovery max active = 3



 [osd.0]

 host = pbnerbd01

 public addr = 10.100.96.10

 cluster addr = 10.100.128.10

 osd journal =
 /dev/disk/by-id/scsi-36b8ca3a0eaa2660019deaf8d3a40bec4-part1

 devs = /dev/sda4

 /end



 Thanks,

 Quenten



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fluctuating I/O speed degrading over time

2014-03-08 Thread Martin B Nielsen
Hi Indra,

I'd probably start by looking at your nodes and check if the SSDs are
 saturated
 or if they have high write access times.

 Any recommended benchmark tool to do this? Especially those specific to
 Ceph OSDs and will not cause any impact on overall performance?


A simple way would be using iostat. I think something like 'iostat 1 -x' will
show both the total disk utilization (rightmost column) and, in the 3 columns
just before it, the read/write access times in realtime. For SSDs they should
be low (< 6-8 ms). Columns 3 and 4 (r/s, w/s - iops) might be of interest as
well. Just 'iostat -x' should show the avg since uptime, iirc.

A better way is probably collecting and graphing it - I think something like
collectd or munin can do that for you. We use munin to keep historical data
about all nodes' performance - we can see if performance drops over time,
iops increase, or wait/latency times for the disks explode. I think the
default munin-node on Ubuntu will capture this data automatically.



 Maybe test them individually directly on the cluster.

 Is it possible to test I/O speed from client directly to certain OSD on
 the cluster? From what I understand, the PGs are being randomly mapped to
 any of the OSDs (based on the crush map?).


I was thinking more low-level - like fio or bonnie++ or the like :). I think
with fio you can get the most detailed picture of how your SSDs perform in
terms of throughput and iops.
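
For example something along these lines against one of the SSDs (a sketch -
it writes directly to the device given as --filename, so only run it against
a disk you can wipe, never a live OSD):

# fio --name=randwrite --filename=/dev/sdX --direct=1 --ioengine=libaio \
      --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based --group_reporting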




 At some point in time we accidentially had a node being reinstalled with
 a non-LTS
 image (13.04 I think) - and the kernel (3.5.something)  had a
 bug/'feature' which
 caused lots of tcp segments to be retransmittet (approx. 1/100).

 Do you have any information about this bug? We are using Ubuntu 13.04 for
 all our Ceph nodes. If you can refer me to any documentation on this bug
 and how to resolve this issue, I will appreciate it very much.


To be honest, I can't recall - the server in question was hosted at SoftLayer
in Dallas and it was their techs who asked us to upgrade the kernel after we
found the issue with the high retransmit count and reported it. It was easy
to just upgrade the kernel and test - and the issue went away. I didn't dig
any deeper; if I remember, I'll try accessing the ticket Monday to get all
the details, if it is still there.

Cheers,
Martin



 Looking forward to your reply, thank you.

 Cheers.


 On Fri, Mar 7, 2014 at 6:10 PM, Martin B Nielsen mar...@unity3d.com wrote:

 Hi,

 I'd probably start by looking at your nodes and check if the SSDs are
 saturated or if they have high write access times. If any of that is true,
 does that account for all SSD or just some of them? Maybe some of the disks
 needs a trim. Maybe test them individually directly on the cluster.

 If you can't find anything with the disks, then try and look further up
 the stack. Network, interrupts etc. At some point in time we accidentally
 had a node reinstalled with a non-LTS image (13.04 I think) - and the
 kernel (3.5.something) had a bug/'feature' which caused lots of tcp
 segments to be retransmitted (approx. 1/100). This one node slowed down our
 entire cluster and caused high access times across the board. 'Upgrading' to
 LTS fixed it.

 As you say, it can just be that the increased utilization of the
 cluster causes it and that you'll 'just' have to add more nodes.

 Cheers,
 Martin


 On Fri, Mar 7, 2014 at 10:50 AM, Indra Pramana in...@sg.or.id wrote:

 Hi,

 I have a Ceph cluster, currently with 5 osd servers and around 22 OSDs
 with SSD drives, and I noted that the I/O speed, especially write access to
 the cluster, is degrading over time. When we first started the cluster we
 could get up to 250-300 MB/s write speed to the SSD cluster, but now we can
 only get up to half that mark. Furthermore, it now fluctuates, so sometimes I
 can get slightly better speed but at other times I get very bad results.

 We started with 3 osd servers and 12 OSDs and gradually added more
 servers. We are using KVM hypervisors as the Ceph clients, and connections
 between clients and servers and between the servers go through a 10 Gbps
 switch with jumbo frames enabled on all interfaces.

 Any advice on how I can start to troubleshoot what might have caused the
 degradation of the I/O speed? Does utilisation contribute to it (since now
 we have more users compared to when we started)? Is there any optimisation
 we can do to improve the I/O performance?

 Appreciate any advice, thank you.

 Cheers.





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fluctuating I/O speed degrading over time

2014-03-07 Thread Martin B Nielsen
Hi,

I'd probably start by looking at your nodes and check if the SSDs are
saturated or if they have high write access times. If any of that is true,
does that account for all SSD or just some of them? Maybe some of the disks
needs a trim. Maybe test them individually directly on the cluster.

If you can't find anything with the disks, then try and look further up the
stack. Network, interrupts etc. At some point in time we accidentally had
a node reinstalled with a non-LTS image (13.04 I think) - and the
kernel (3.5.something) had a bug/'feature' which caused lots of tcp
segments to be retransmitted (approx. 1/100). This one node slowed down our
entire cluster and caused high access times across the board. 'Upgrading' to
LTS fixed it.

As you say, it can just be that the increased utilization of the
cluster causes it and that you'll 'just' have to add more nodes.

Cheers,
Martin


On Fri, Mar 7, 2014 at 10:50 AM, Indra Pramana in...@sg.or.id wrote:

 Hi,

 I have a Ceph cluster, currently with 5 osd servers and around 22 OSDs
 with SSD drives, and I noted that the I/O speed, especially write access to
 the cluster, is degrading over time. When we first started the cluster we
 could get up to 250-300 MB/s write speed to the SSD cluster, but now we can
 only get up to half that mark. Furthermore, it now fluctuates, so sometimes I
 can get slightly better speed but at other times I get very bad results.

 We started with 3 osd servers and 12 OSDs and gradually added more servers.
 We are using KVM hypervisors as the Ceph clients, and connections between
 clients and servers and between the servers go through a 10 Gbps switch with
 jumbo frames enabled on all interfaces.

 Any advice on how I can start to troubleshoot what might have caused the
 degradation of the I/O speed? Does utilisation contribute to it (since now
 we have more users compared to when we started)? Is there any optimisation
 we can do to improve the I/O performance?

 Appreciate any advice, thank you.

 Cheers.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Trying to rescue a lost quorum

2014-03-01 Thread Martin B Nielsen
Hi,

You can't form quorum with your monitors on cuttlefish if you're mixing
pre-0.61.5 with any 0.61.5+ ( https://ceph.com/docs/master/release-notes/ =>
see the section about 0.61.5).

I'd advise installing pre-0.61.5, forming quorum and then upgrading to 0.61.9
(if need be) - and then the latest dumpling on top.

Cheers,
Martin


On Fri, Feb 28, 2014 at 2:09 AM, Marc m...@shoowin.de wrote:

 Hi,

 thanks for the reply. I updated one of the new mons. And after a
 resonably long init phase (inconsistent state), I am now seeing these:

 2014-02-28 01:05:12.344648 7fe9d05cb700  0 cephx: verify_reply coudln't
 decrypt with error: error decoding block for decryption
 2014-02-28 01:05:12.345599 7fe9d05cb700  0 -- X.Y.Z.207:6789/0 >>
 X.Y.Z.201:6789/0 pipe(0x14e1400 sd=21 :49082 s=1 pgs=5421935 cs=12
 l=0).failed verifying authorize reply

 with .207 being the updated mon and .201 being one of the old alive
 mons. I guess they don't understand each other? I would rather not try
 to update the mons running on servers that also host OSDs, especially
 since there seem to be communication issues between those versions... or
 am I reading this wrong?

 KR,
 Marc

 On 28/02/2014 01:32, Gregory Farnum wrote:
  On Thu, Feb 27, 2014 at 4:25 PM, Marc m...@shoowin.de wrote:
  Hi,
 
  I was handed a Ceph cluster that had just lost quorum due to 2/3 mons
  (b,c) running out of disk space (using up 15GB each). We were trying to
  rescue this cluster without service downtime. As such we freed up some
  space to keep mon b running a while longer, which succeeded, quorum
  restored (a,b), mon c remained offline. Even though we have freed up
  some space on mon c's disk also, that mon just won't start. Its log
  file does say
 
  ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60), process
  ceph-mon, pid 27846
 
  and that's all she wrote. Even when starting ceph-mon with -d, mind you.
 
  So we had a cluster with 2/3 mons up and wanted to add another mon since
  it was only a matter of time until mon b failed again due to disk space.
 
  As such I added mon.g to the cluster, which took a long while to sync,
  but now reports running.
 
  Then mon.h got added for the same reason. mon.h fails to start much the
  same as mon.c does.
 
  Still that should leave us with 3/5 mons up. However running ceph
  daemon mon.{g,h} mon_status on the respective node also blocks. The
  only output we get from those are fault messages.
 
  Ok so now mon.g apparently crashed:
 
  2014-02-28 00:11:48.861263 7f4728042700 -1 mon/Monitor.cc: In function
  'void Monitor::sync_timeout(entity_inst_t)' thread 7f4728042700 time
  2014-02-28 00:11:48.782305 mon/Monitor.cc: 1099: FAILED
  assert(sync_state == SYNC_STATE_CHUNKS)
 
  ... and now blocks trying to start much like c and h.
 
  Long story short: is it possible to add .61.9 mons to a cluster running
  .61.2 on the 2 alive mons and all the osds? I'm guessing this is the
  last shot at trying to rescue the cluster without downtime.
  That should be fine, and is likely (though not guaranteed) to resolve
  your sync issues -- although it's pretty unfortunate that you're that
  far behind on the point releases; they fixed a whole lot of sync
  issues and related things and you might need to upgrade the existing
  monitors too in order to get the fixes you need... :/
  -Greg
  Software Engineer #42 @ http://inktank.com | http://ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] questions about monitor data and ceph recovery

2014-02-25 Thread Martin B Nielsen
Hi Pavel,

Will try and answer some of your questions:

 My first question will be about the monitor data directory. How much space do
 I need to reserve for it? Can the monitor fs be corrupted if the monitor runs
 out of storage space?


We have about 20GB partitions for the monitors - they really don't use much
space, but it is nice to have some headroom in case you need to do extra
logging (ceph doing max debug consumes scary amounts of space).
Also, if you look in the monitor log, the monitors constantly check for free
space. I don't know what will happen if a monitor runs full (or close to
full), but I'm guessing that monitor will simply be marked as down or stopped
somehow. You can change some of the values for a mon regarding how much data
to keep before trimming etc.
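
A minimal sketch of the kind of [mon] settings meant here (the option names
exist in that era's releases, the path is just the common default and the
value purely illustrative - check the docs for your version):

[mon]
mon data = /var/lib/ceph/mon/ceph-$id
# compact the monitor's leveldb store on startup to reclaim space
mon compact on start = true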



 I also have questions about ceph auto-recovery process.
 For example, I have two nodes with 8 drives each; each drive is
 presented as a separate osd. The number of replicas = 2. I have written a
 crush ruleset which picks two nodes and one osd on each to store replicas.
 What will happen in the following scenarios:

 1. One drive in one node fails. Will ceph automatically re-replicate
 affected objects? Where will the replicas be stored?

Yes, as long as you have available space on the node that lost the OSD, the
data that was on that disk will be redistributed across the remaining 7 OSDs
on that node (according to your CRUSH rules).



 1.1 The failed osd appears online again with all of its data. How will
 the ceph cluster deal with it?

This is just how I _think_ it works; please correct me if I'm wrong. All
OSDs have an internal map (the pg map) which is constantly updated throughout
the cluster. When any OSD goes offline/down and is started back up, the
latest pg map of that OSD is 'diffed' against the latest map from the cluster,
and the cluster can then generate a new map based on what it has/had and what
is missing/updated, and work out which objects the newly started OSD should
have. Then it will start to replicate and only fetch the changed/new objects.

Bottom line, this really just works and works very well.



 2. One node (with 8 osds) goes offline. Will ceph automatically replicate
 all objects on the remaining node to maintain number of replicas = 2?

No, because it can no longer satisfy your CRUSH rules. Your CRUSH rule
states one copy per node and it will keep it that way. The cluster will go
into a degraded state until you can bring up another node (i.e. all your data
is now very vulnerable). I think it is often suggested to run with 3x
replicas if possible - or at the very least nr_nodes = replicas + 1. If you
had to make it replicate on the remaining node you'd have to change your
CRUSH rule to replicate based on OSD and not node. But then you'll most
likely have problems when 1 node dies, because objects could easily end up on
2x OSDs on the failed node.
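
For illustration, a sketch of what that distinction looks like in a CRUSH rule
(bucket and rule names are made up; 'type host' puts each replica on a
different node, 'type osd' would allow replicas to share a node):

rule replicated_per_host {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host   # change to 'type osd' to relax this
        step emit
}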



 2.1 The failed node comes online again with all its data. How will the ceph
 cluster deal with it?

Same as the above with the OSD.

Cheers,
Martin


 Thanks in advance,
   Pavel.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pages stuck unclean (but remapped)

2014-02-23 Thread Martin B Nielsen
Hi,

I would prob. start by figuring out exactly which PGs are stuck unclean.

You can do 'ceph pg dump_stuck unclean' (or grep the output of 'ceph pg dump')
to get that info - then, if your theory holds, you should be able to verify
the disk(s) in question.

I cannot see any backfill_too_full in there, so I am curious what could be the
cause.

You can also always adjust the weights manually if needed (
http://ceph.com/docs/master/rados/operations/control/#osd-subsystem ) with
the (re)weight command.
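
A minimal sketch (the osd id and weight are made-up examples; reweight takes a
value between 0 and 1):

ceph pg dump_stuck unclean
# nudge an over-full osd down a bit, e.g. osd.7 to 80%
ceph osd reweight 7 0.8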

Cheers,
Martin


On Mon, Feb 24, 2014 at 2:09 AM, Gautam Saxena gsax...@i-a-inc.com wrote:

 I have 19 PGs that are stuck unclean (see below the result of ceph -s). This
 occurred after I executed a ceph osd reweight-by-utilization 108 to
 resolve problems with backfill_too_full messages, which I believe
 occurred because my OSD sizes vary significantly (from a low of
 600GB to a high of 3 TB). How can I get ceph to get these PGs out of
 stuck-unclean? (And why is this occurring anyway?) My best guess of how to
 fix it (though I don't know why) is that I need to run:

 ceph osd crush tunables optimal.

 However, my kernel version (on a fully up-to-date CentOS 6.5) is 2.6.32,
 which is well below the minimum required version of 3.6 that's stated in
 the documentation (http://ceph.com/docs/master/rados/operations/crush-map/
 ) -- so if I must run ceph osd crush tunables optimal to fix this
 problem, I presume I must upgrade my kernel first, right? ...Any thoughts, or
 am I chasing the wrong solution? (I want to avoid a kernel upgrade unless
 it's needed.)

 =

 [root@ia2 ceph4]# ceph -s
 cluster 14f78538-6085-43f9-ac80-e886ca4de119
  health HEALTH_WARN 19 pgs backfilling; 19 pgs stuck unclean; recovery
 42959/5511127 objects degraded (0.779%)
  monmap e9: 3 mons at {ia1=
 192.168.1.11:6789/0,ia2=192.168.1.12:6789/0,ia3=192.168.1.13:6789/0},
 election epoch 496, quorum 0,1,2 ia1,ia2,ia3
  osdmap e7931: 23 osds: 23 up, 23 in
   pgmap v1904820: 1500 pgs, 1 pools, 10531 GB data, 2670 kobjects
 18708 GB used, 26758 GB / 45467 GB avail
 42959/5511127 objects degraded (0.779%)
 1481 active+clean
   19 active+remapped+backfilling
   client io 1457 B/s wr, 0 op/s

 [root@ia2 ceph4]# ceph -v
 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)

 [root@ia2 ceph4]# uname -r
 2.6.32-431.3.1.el6.x86_64

 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can one slow hardisk slow whole cluster down?

2014-01-29 Thread Martin B Nielsen
Hi,

At least it used to be like that - I'm not sure if that has changed. I
believe this is also part of why it is advised to go with the same kind of hw
and setup if possible.

Since at least rbd images are spread in objects throughout the cluster,
you'll probably have to wait for a slow disk when reading - writes will still
go journal -> disk, so if you have the ssd journal -> sata setup you probably
won't notice it that much unless you're doing lots of and/or heavy writes.

You can peek into it via the admin socket and get some perf stats for each
osd (iirc it is 'perf dump' you want). You could set something up to poll
at given intervals and graph it, and probably spot trends/slow disks that way.
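
A minimal sketch of what that looks like (the osd id and socket path are just
examples - the path depends on your distro/config):

ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok perf dump
# newer releases also accept the shorter form:
ceph daemon osd.3 perf dump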

I think it is a manual process to locate a slow drive and either drop it
from the cluster or give it a lower weight.

If possible, I'd suggest toying with something like fio/bonnie++ in a guest
and running some tests with and without the osd/node in question - you'll
know for certain then.

Cheers,
Martin


On Tue, Jan 28, 2014 at 4:22 PM, Gautam Saxena gsax...@i-a-inc.com wrote:

 If one node, which happens to have a single raid 0 harddisk, is slow, would
 that impact the whole ceph cluster? That is, when VMs interact with the rbd
 pool to read and write data, would the kvm client wait for that slow
 harddisk/node to return the requested data, thus making that slow
 harddisk/node the ultimate bottleneck? Or would kvm/ceph be smart enough to
 get the needed data from whichever node is ready to serve it up? That is,
 kvm/ceph would request all possible osds to return data, but if one osd is
 done with its request, it can choose to return more data that the slow
 harddisk/node still hasn't returned... I'm trying to decide whether to
 remove the slow harddisk/node from the ceph cluster (depending on how ceph
 works).



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] many blocked requests when recovery

2013-12-09 Thread Martin B Nielsen
Hi,

You didn't state which version of ceph or kvm/qemu you're using. I think it
wasn't until qemu 1.5.0 (1.4.2+?) that an async patch from Inktank was
accepted into mainline, which significantly helps in situations like this.

If you're not using that, on top of not limiting recovery threads, you'll
probably see issues like the ones you describe.
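
Limiting recovery impact usually boils down to something like this sketch (the
option names are real, the values just a conservative example):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'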

Also, more nodes make recovery easier on the entire cluster, so it might
make sense to add smaller ones if/when you expand it.

Cheers,
Martin


On Tue, Dec 3, 2013 at 7:09 AM, 飞 duron...@qq.com wrote:

 hello, I'm testing Ceph as storage for KVM virtual machine images.
 My cluster has 3 mons and 3 data nodes; every data node has 8x 2T SATA
 HDDs and 1 SSD for journals.
 When I shut down one data node to imitate a server fault, the cluster begins
 to recover, and during the recovery
 I can see many blocked requests, and the KVM VMs crash (they crash because
 they think their disk is offline).
 How can I solve this issue? Any ideas? Thank you.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Big or small node ?

2013-11-20 Thread Martin B Nielsen
Hi,

I'd almost always go with more, less beefy nodes rather than fewer big ones.
You're much more vulnerable if the big one(s) die, and replication will not
impact your cluster as much.

I also find it easier to extend a cluster with smaller nodes. At least it
feels like you can grow in smoother increments with smaller nodes, at your
preferred rate, instead of in big chunks of extra added storage.

But I guess it depends on intended cluster usage.

In your example you can lose one of the smaller nodes (depending on
replication level and total space usage, of course), but losing the big one
means nothing works.

If it's only one node I would probably not go with ceph and just opt for zfs
or raid6 instead (and drop the extra ssd and get 12x sata) - it will probably
perform better and you'll have more total space, assuming you'd go with 2x
replication with ceph.

Cheers,
Martin


On Wed, Nov 20, 2013 at 8:47 AM, Daniel Schwager
daniel.schwa...@dtnet.de wrote:

 Hallo,

 we are going to set up a 36-40TB (raw) test system in our company for
 disk2disk2tape backup. Now we have to decide whether to go the high- or
 low-density ceph way.

 --- The big node (only one necessary):

 1 x Supermicro, System 6027R-E1R12T with 2 x CPU E5-2620v2 (6 cores (Hyper
 Threading) per CPU), 32 GB, with
 - 2 x SSD 80GB (RAID1, OS only)
 - 2 x SSD 100GB, Intel S3700 (bandwidth R/W: 500/200MB/sec) for 10
 journals (5 per SSD)
 - 10 x 4TB Seagate Enterprise Capacity ST4000NM0033
 - 2 x embedded 10GBit for public/storage network

 So I would also install 1 monitor on this node, which would also hold 10 OSDs.
 The price is about 20 US cents / GB

 The small node (ok - we would have to buy 3 of them) could be like:

 3 x Supermicro, SuperServer 6017R-TDF+ with 1 x E5-2603 (4 cores without
 Hyper Threading), 16GB, with
 1 x 120 GB SSD Intel S3500 Series SATA for OS and 3 OSD journals
 (bandwidth R/W: 340/100 MB/sec)
 2 x Intel® PRO/1000 PT Dual Port Server Adapter (LACP link aggregation, 2
 ports for the public, 2 ports for the storage network)
 3 x 4TB Seagate Enterprise Capacity ST4000NM0033

 I would also install a monitor on each node.
 The price is about 23 US cents / GB

 I think the performance is much better in the big node (because of the
 better components like 10GBit, SSD, CPU). Because we may not add more HDDs
 to the cluster later, I'm not sure how to decide - big or small node.

 Is there a recommendation? Maybe also in regard to my chosen hardware
 components?

 best regards
 Danny












___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HDD bad sector, pg inconsistent, no object remapping

2013-11-13 Thread Martin B Nielsen
Probably common sense, but I was bitten by this once in a similar
situation...

If you run 3x replicas and distribute them over 3 hosts (is that the default
now?), make sure that the disks on the host with the failed disk have space
for its data - the remaining two disks will have to hold the content of the
failed disk, and if they can't, your cluster will run full and halt.

Cheers,
Martin


On Wed, Nov 13, 2013 at 12:59 AM, David Zafman david.zaf...@inktank.comwrote:


 Since the disk is failing and you have 2 other copies I would take osd.0
 down.  This means that ceph will not attempt to read the bad disk either
 for clients or to make another copy of the data:

 * Not sure about the syntax of this for the version of ceph you are
 running
 ceph osd down 0

 Mark it “out” which will immediately trigger recovery to create more
 copies of the data with the remaining OSDs.
 ceph osd out 0

 You can now finish the process of removing the osd by looking at these
 instructions:


 http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual

 David Zafman
 Senior Developer
 http://www.inktank.com

 On Nov 12, 2013, at 3:16 AM, Mihály Árva-Tóth 
 mihaly.arva-t...@virtual-call-center.eu wrote:

  Hello,
 
  I have 3 nodes, with 3 OSDs in each node. I'm using the .rgw.buckets pool with
 3 replicas. One of my HDDs (osd.0) has bad sectors; when I try to read
 an object from the OSD directly, I get an Input/output error. dmesg:
 
  [1214525.670065] mpt2sas0: log_info(0x3108): originator(PL),
 code(0x08), sub_code(0x)
  [1214525.670072] mpt2sas0: log_info(0x3108): originator(PL),
 code(0x08), sub_code(0x)
  [1214525.670100] sd 0:0:2:0: [sdc] Unhandled sense code
  [1214525.670104] sd 0:0:2:0: [sdc]
  [1214525.670107] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
  [1214525.670110] sd 0:0:2:0: [sdc]
  [1214525.670112] Sense Key : Medium Error [current]
  [1214525.670117] Info fld=0x60c8f21
  [1214525.670120] sd 0:0:2:0: [sdc]
  [1214525.670123] Add. Sense: Unrecovered read error
  [1214525.670126] sd 0:0:2:0: [sdc] CDB:
  [1214525.670128] Read(16): 88 00 00 00 00 00 06 0c 8f 20 00 00 00 08 00
 00
 
  Okay, I know I need to replace the HDD.
 
  Fragment of ceph -s  output:
pgmap v922039: 856 pgs: 855 active+clean, 1 active+clean+inconsistent;
 
  ceph pg dump | grep inconsistent
 
  11.15d  25443   0   0   0   6185091790  30013001
  active+clean+inconsistent   2013-11-06 02:30:45.23416.
 
  ceph pg map 11.15d
 
  osdmap e1600 pg 11.15d (11.15d) -> up [0,8,3] acting [0,8,3]
 
  pg repair or deep-scrub cannot fix this issue. But if I understand
 correctly, the osd has to know it cannot retrieve the object from osd.0, and
 the object needs to be replicated to another osd because there are no longer
 3 working replicas.
 
  Thank you,
  Mihaly

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD question

2013-10-22 Thread Martin B Nielsen
Hi,

Plus, reads will still come from your non-SSD disks unless you're using
something like flashcache in front, and as Greg said, having much more IOPS
available for your db often makes a difference (depending on load, usage
etc., of course).

We're using Samsung Pro 840 256GB pretty much like Martin describes and we
haven't had any issues (yet).

We've set up our environment so that if we lose a node or take one offline
for maintenance it won't impact the cluster much; with only 3 nodes I would
probably go with more durable hw specs.

Cheers,
Martin


On Tue, Oct 22, 2013 at 6:38 AM, Gregory Farnum g...@inktank.com wrote:

 On Mon, Oct 21, 2013 at 7:05 PM, Martin Catudal mcatu...@metanor.ca
 wrote:
  Hi,
   I have purchased my hardware for my Ceph storage cluster but have not
  opened any of my 960GB SSD drive boxes yet, since I need to answer my
  question first.
 
  Here's my hardware.
 
  THREE servers: dual 6-core Xeon, 2U, with 8 hot-swap trays plus 2 SSDs
  mounted internally.
  In each server I will have :
  2 x SSD 840 Pro Samsung 128 GB in RAID 1 for the OS
  2 x SSD 840 Pro Samsung for journal
  4 x 4TB Hitachi 7K4000 7200RPM
  1 x 960GB Crucial M500 for one fast OSD pool.
 
  Configuration: one SSD journal for two 4TB drives, so if I lose one SSD
  journal I will only lose two OSDs instead of all my storage for that
  particular node.
 
  I have also bought 3 x 960GB M500 SSDs from Crucial for the creation of a
  fast pool of OSDs made from SSDs - one 960GB per server, for database
  applications.
  Is it advisable to do that, or is it better to return them and for the
  same price buy 6 more 4TB Hitachis?
 
  Since the write acknowledgment comes from the SSD journal, do I get
  a huge improvement by using SSDs as OSDs?
  My goal is to have solid, fast performance for a database ERP and 3D
  modeling of mining galleries running in VMs.

 The specifics depend on a lot of factors, but for database
 applications you are likely to have better performance with an SSD
 pool. This is because even though the journal can do fast
 acknowledgements, that's for evening out write bursts — on average it
 will restrict itself to the speed of the backing store. A good SSD can
 generally do much more than 6x a HDD's random IOPS.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com

 
  Thank's
  Martin
 
  --
  Martin Catudal
  Responsable TIC
  Ressources Metanor Inc
  Ligne directe: (819) 218-2708

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph with high disk densities?

2013-10-07 Thread Martin B Nielsen
Hi Scott,

Just some observations from here.

We run 8 nodes, 2U units with 12 OSDs each (4x 500GB ssd, 8x 4TB platter)
attached to 2x LSI 2308 cards. Each node uses an Intel E5-2620 with 32GB mem.

Granted, we only have around 25 VMs (some fairly io-hungry, both iops- and
throughput-wise though) on that cluster, but we hardly see any cpu usage at
all. We have ~6k PGs and according to munin our avg. cpu time is ~9% (that
is out of all cores, so 9% out of 1200% (6 cores, 6 HT)).

Sadly I didn't record cpu-usage while stresstesting or breaking it.

We're using cuttlefish and XFS. And again, this cluster is still pretty
underused, so the cpu-usage does not reflect a more active system.

Cheers,
Martin


On Mon, Oct 7, 2013 at 6:15 PM, Scott Devoid dev...@anl.gov wrote:

 I brought this up within the context of the RAID discussion, but it did
 not garner any responses. [1]

 In our small test deployments (160 HDs and OSDs across 20 machines) our
 performance is quickly bounded by CPU and memory overhead. These are 2U
 machines with 2x 6-core Nehalem; and running 8 OSDs consumed 25% of the
 total CPU time. This was a cuttlefish deployment.

 This seems like a rather high CPU overhead. Particularly when we are
 looking to hit density target of 10-15 4TB drives / U within 1.5 years.
 Does anyone have suggestions for hitting this requirement? Are there ways
 to reduce CPU and memory overhead per OSD?

 My one suggestion was to do some form of RAID to join multiple drives and
 present them to a single OSD. A 2 drive RAID-0 would halve the OSD overhead
 while doubling the failure rate and doubling the rebalance overhead. It is
 not clear to me if that is better or not.

 [1]
 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/004833.html



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hardware recommendations

2013-08-26 Thread Martin B Nielsen
Hi Shain,

Those R515s seem to mimic our servers (2U supermicro w. 12x 3.5" bays and 2x
2.5" in the rear for the OS).

Since we need a mix of SSD & platter we have 8x 4TB drives and 4x 500GB SSDs
+ 2x 250GB SSDs for the OS in each node (2x 8-port LSI 2308 in IT mode).

We've partitioned 10GB from each of the 4x 500GB SSDs to use as journals for
the 4x 4TB drives, and each of the OS disks holds 2 journals for the
remaining 4 platter disks.

We tested a lot of ways to place these journals and this setup seemed to fit
best for our use case (pure VM block storage - 3x replicas).
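
In ceph.conf terms that journal placement is just the per-OSD 'osd journal'
option pointing at an SSD partition - a minimal sketch with made-up
names/paths:

[osd.12]
host = storage01
osd journal = /dev/disk/by-id/ata-Samsung_SSD_840_example-part3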

Everything connected via 10GbE (1 network for cluster, 1 for public) and 3
standalone monitor servers.

For storage nodes we use an E5-2620/32GB RAM, and for monitor nodes an
E3-1260L/16GB RAM - we've tested with both 1 and 2 nodes going down and
starting to redistribute data, and they seem to cope more than fine.

Overall I find these nodes a good compromise between capacity, price and
performance - we looked into getting 2U servers with 8x 3.5" bays and getting
more of them, but ultimately went with this.

We also have some boxes from Coraid (SR & SRX, with and without
flashcache/etherflash) so we've been able to do some direct comparison, and
so far ceph is looking good - especially the price/storage ratio.

At any rate, back to your mail: I think the most important factor is
looking at all the pieces and making sure you're not being [hard]
bottlenecked somewhere - we found 24GB RAM to be a little on the low side
when all 12 disks started to redistribute, but 32GB is fine. Also, not having
journals on SSD before writing to platter really hurt a lot when we tested
- this can probably be mitigated somewhat with better raid controllers.
CPU-wise the E5-2620 hardly breaks a sweat, even when it has to do a bit
extra with a node going down.

Good luck with your HW-adventure :).

Cheers,
Martin


On Mon, Aug 26, 2013 at 3:56 PM, Shain Miley smi...@npr.org wrote:

 Good morning,

 I am in the process of deciding what hardware we are going to purchase for
 our new ceph based storage cluster.

 I have been informed that I must submit my purchase needs by the end of
 this week in order to meet our FY13 budget requirements  (which does not
 leave me much time).

 We are planning to build multiple clusters (one primarily for radosgw at
 location 1; the other for vm block storage at location 2).

 We will be building our radosgw storage out first, so this is the main
 focus of this email thread.

 I have read all the docs and the white papers, etc. on hardware suggestions
 ...and we have an existing relationship with Dell, so I have been planning
 on buying a bunch of Dell R515's with 4TB drives and using 10GigE
 networking for this radosgw setup (although this will be primarily used for
 radosgw purposes... I will be testing running a limited number of VMs on
 this infrastructure as well... in order to see what kind of performance we
 can achieve).

 I am just wondering if anyone else has any quick thoughts on these
 hardware choices, or any alternative suggestions that I might look at as I
 seek to finalize our purchasing this week.

 Thanks in advance,

 Shain

 Sent from my iPhone

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance questions

2013-08-20 Thread Martin B Nielsen
Hi Jeff,

I would be surprised as well - we initially tested on a 2-replica cluster
with 8 nodes having 12 OSDs each - and went to 3 replicas as we rebuilt the
cluster.

The performance seems to be where I'd expect it (doing sustained writes in
an rbd VM at ~400MB/sec on 10GbE, which I'd expect is limited by either the
disks, the network, qemu/kvm or the 3-replica setup kicking in).

Just curious, anything in dmesg about the disk mounted as osd.4?
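
(Roughly the kind of checks I mean - the device name is just an example, and
smartctl needs the smartmontools package:)

grep -c 'slow request' /var/log/ceph/ceph.log
smartctl -a /dev/sdd          # the disk backing osd.4
# if the drive looks bad, take the osd out so data re-replicates elsewhere:
ceph osd out 4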

Cheers,
Martin


On Tue, Aug 20, 2013 at 4:02 PM, Mark Nelson mark.nel...@inktank.comwrote:

 On 08/20/2013 08:42 AM, Jeff Moskow wrote:

 Hi,

  More information. If I look in /var/log/ceph/ceph.log, I see 7893 slow
  requests in the last 3 hours, of which 7890 are from osd.4. Should I
  assume a bad drive? SMART says the drive is healthy. A bad osd?


 Definitely sounds suspicious!  Might be worth taking that OSD out and
 doing some testing on the drive.


 Thanks,
   Jeff



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph instead of RAID

2013-08-13 Thread Martin B Nielsen
Hi,

I'd just like to echo what Wolfgang said about ceph being a complex system.

I initially started out testing ceph with a setup much like yours. And
while it overall performed ok, it was not as good as sw raid on the same
machine.

Also, as Mark said, you'll get at very best half the write speed because of
how the journaling works if you do larger continuous writes.

Ceph really shines with multiple servers & lots of concurrency.

My test machine was running for half a year+ (going from argonaut ->
cuttlefish) and in that process I came to realize that mixing types of disk
(and sizes) was a bad idea (some enterprise SATA, some fast desktop and some
green disks) - as speed will be determined by the slowest drive in your
setup (that's why they're advocating using similar hw if at all possible, I
guess).

I also experienced all the challenging issues of dealing with a very
young technology: osds suddenly refusing to start, pgs going into various
incomplete/down/inconsistent states, monitor leveldb running full, monitors
dying at weird times - and well, I think it is good for a learning
experience, but like Wolfgang said I think it is too much hassle for too
little gain when you have something like raid10/zfs around.

But, by all means, don't let us discourage you if you want to go this route
- ceph's unique self-healing ability was what drew me into running a single
machine in the first place.

Cheers,
Martin



On Tue, Aug 13, 2013 at 9:32 AM, Wolfgang Hennerbichler 
wolfgang.hennerbich...@risc-software.at wrote:



 On 08/13/2013 09:23 AM, Jeffrey 'jf' Lim wrote:
  Anyway, I thought what if instead of RAID-10 I use ceph? All 6 disks
 will be local, so I could simply create
  6 local OSDs + a monitor, right? Is there anything I need to watch out
 for in such configuration?
 
  You can do that. Although it's nice to play with and everything, I
  wouldn't recommend doing it. It will give you more pain than pleasure.
 
  How so? Care to elaborate?

 Ceph is a complex system, built for clusters. It does some stuff in
 software that is otherwise done in hardware (raid controllers). The
 nature of the complexity of a cluster system is a lot of overhead
 compared to a local raid [whatever] system, and latency of disk i/o will
 naturally suffer a bit. An OSD needs about 300 MB of RAM (may vary with
 your PGs), times 6 is a waste of nearly 2 GB of RAM (compared to a
 local RAID). Also ceph is young, and it does indeed have some bugs. RAID
 is old, and very mature. Although I rely on ceph on a production
 cluster, too, it is way harder to maintain than a simple local raid.
 When a disk fails in ceph you don't have to worry about your data, which
 is a good thing, but you have to worry about the rebuilding (which isn't
 too hard, but at least you need to know SOMETHING about ceph), with
 (hardware) RAID you simply replace the disk, and it will be rebuilt.

 Others will find more reasons why this is not the best idea for a
 production system.

 Don't get me wrong, I'm a big supporter of ceph, but only for clusters,
 not for single systems.

 wogri

  -jf
 
 
  --
  He who settles on the idea of the intelligent man as a static entity
  only shows himself to be a fool.
 
  Every nonfree program has a lord, a master --
  and if you use the program, he is your master.
  --Richard Stallman
 


 --
 DI (FH) Wolfgang Hennerbichler
 Software Development
 Unit Advanced Computing Technologies
 RISC Software GmbH
 A company of the Johannes Kepler University Linz

 IT-Center
 Softwarepark 35
 4232 Hagenberg
 Austria

 Phone: +43 7236 3343 245
 Fax: +43 7236 3343 250
 wolfgang.hennerbich...@risc-software.at
 http://www.risc-software.at

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rebuild the monitor infrastructure

2013-04-24 Thread Martin B Nielsen
Hi Bryan,

I asked the same question a few months ago:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-February/000221.html

But basically, that is pretty bad; you'll be stuck on your own and
would need to get in contact with Inktank - they might be able to help
rebuild a monitor for you.

Cheers,
Martin

On Wed, Apr 24, 2013 at 8:19 AM, Bryan Stansell br...@stansell.org wrote:
 Sorry for possibly a silly new user question, but I was wondering if there 
 was any way to rebuild the monitor infrastructure in case of catastrophic 
 failure.

 My simple case is a single monitor.  If the data is lost because of hardware 
 failure, etc, can it be recreated from scratch?  The same could be said for 
 the suggested 3-node monitor setup - you'd probably have to have a user error 
 for that kind of destruction, but could happen.

 I've been searching for anything that explains how to recreate this.  I found 
 the docs that talk about how to recreate a single monitor from scratch if one 
 out of many are misbehaving (treat it like adding a new instance).

 Losing this data is REALLY bad, I understand that. I'm hoping that
 dumping out some critical set of data while it is working would provide
 enough to recreate things from scratch.

 Is this a possibility?  Is there a documented procedure any place?

 Thanks much.

 Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph error: active+clean+scrubbing+deep

2013-04-16 Thread Martin B Nielsen
Hi Kakito,

You def. _want_ scrubbing to happen!

http://ceph.com/docs/master/rados/configuration/osd-config-ref/#scrubbing

If you feel it is killing your system you can tweak some of the values, like:
osd scrub load threshold
osd scrub max interval
osd deep scrub interval

I have no experience in changing those values, so I can't say how it
will influence your system.
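
For what it's worth, a minimal [osd] sketch of what tuning them could look
like (the values are purely illustrative, not recommendations; the intervals
are in seconds):

[osd]
osd scrub load threshold = 2.0
osd scrub max interval = 604800       # scrub each PG at least once a week
osd deep scrub interval = 1209600     # deep scrub every two weeks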

Also, not that it is any of my business, but it seems you're running
with replication set to 1.

Cheers,
Martin

On Tue, Apr 16, 2013 at 3:11 AM, kakito tientienminh080...@gmail.com wrote:
 Dear all,

 I use Ceph Storage,

 Recently, I often get an error:

 mon.0 [INF] pgmap v277690: 640 pgs: 639 active+clean, 1
 active+clean+scrubbing+deep; 14384 GB data, 14409 GB used, 90007 GB / 107 TB
 avail.

 It seems that it is not correct.

 I tried to restart, but that did not help.

 It slows my system down.

 I use ceph 0.56.4, kernel 3.8.6-1.el6.elrepo.x86_64

 How can I fix it?!



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to calculate the capacity of a ceph cluster

2013-03-13 Thread Martin B Nielsen
Hi Ashish,

Yep, that would be the correct way to do it.

If you already have a cluster running, 'ceph -s' will also show usage, e.g.
like this:

ceph -s
pgmap v1842777: 8064 pgs: 8064 active+clean; 1069 GB data, 2144 GB used,
7930 GB / 10074 GB avail; 3569B/s wr, 0op/s

This is a small test cluster with 2x replicas: 1TB of data and roughly 2x
that amount used on disk.

Also 'rados df' will show usage per pool :)
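
So for your example the back-of-the-envelope math is roughly (ignoring
filesystem, journal and near-full overhead):

5 servers x 2 TB = 10 TB raw
10 TB / 3 replicas ≈ 3.3 TB usable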

Cheers,
Martin




On Wed, Mar 13, 2013 at 11:15 AM, Ashish Kumar ku...@weclapp.com wrote:

 Hi Guys,

 Just want to know, how can I calculate the capacity of my ceph cluster? I
 don't know whether a simple RAID-style calculation will work or not.

 I have 5 servers, each with 2TB of storage, and there are three copies
 of data. Will it be ok to calculate the capacity in the following way:

 5 (servers) * 2TB / 3

 Ashish Kumar | Software Development | weclapp GmbH
 Frauenbergstraße 31-33 | D-35039 Marburg
 +49 6421 999 1805 office | +49 6421 999 1899 fax
 http://www.weclapp.com/

 weclapp GmbH | Registered office: Marburg | Commercial register: Amtsgericht Marburg HRB 5438
 Managing directors: Michael Schmidt, Ertan Özdil, Uwe Knoke



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com