Re: [ceph-users] calamari gui

2014-11-17 Thread idzzy
Hello,

I might have misunderstood this. I executed the command on the Ceph node like this:
[ceph node]# ceph-deploy calamari connect {Calamari Server}

But it seems this command should be executed on the Calamari server:
[calamari server]# ceph-deploy calamari connect {Ceph Nodes}

Is this correct? That is the only point I would like to confirm.
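
For reference, a minimal sketch of the second form, assuming it is run from the admin host where ceph-deploy and the Calamari master live (the hostnames are placeholders):

  # run on the Calamari/admin host, naming the Ceph nodes to be connected
  ceph-deploy calamari connect ceph-node1 ceph-node2 ceph-node3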

 


On November 14, 2014 at 5:59:19 PM, idzzy (idez...@gmail.com) wrote:

Hello,

I see following message on calamari GUI.
--
New Calamari Installation

This appears to be the first time you have started Calamari and there are no 
clusters currently configured.

3 Ceph servers are connected to Calamari, but no Ceph cluster has been created 
yet. Please use ceph-deploy to create a cluster; please see the Inktank Ceph 
Enterprise documentation for more details.
--

As the next step, I executed the command below to add the Ceph node to the Calamari server,
but a “no calamari-minion repo found” message was output.

--
# ceph-deploy calamari connect 10.32.37.44
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.9): /usr/bin/ceph-deploy calamari 
connect 10.32.37.44
[ceph_deploy][ERROR ] RuntimeError: no calamari-minion repo found
--
(10.32.37.44 is calamari server)


So I added the repository to ~/.cephdeploy.conf on the Ceph node, like this:

--
[calamari-minion]
name=ceph repo noarch packages
baseurl=http://ceph.com/rpm-emperor/el6/noarch
#baseurl=http://ceph.com/rpm-emperor/rhel6/x86_64/
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/autobuild.asc
--

Then I ran it again, but the following error was output:

--
# ceph-deploy calamari connect 10.32.37.44
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.9): /usr/bin/ceph-deploy calamari 
connect 10.32.37.44
Warning: Permanently added '10.32.37.44' (RSA) to the list of known hosts.
root@10.32.37.44's password:
[10.32.37.44][DEBUG ] connected to host: 10.32.37.44
[10.32.37.44][DEBUG ] detect platform information from remote host
[ceph_deploy][ERROR ] RuntimeError: ImportError: No module named ceph_deploy
--

How can I proceed to add the Ceph nodes to the Calamari server and start using the Calamari
GUI?
Sorry if this is a basic question; any advice would be helpful.

Thank you.

—
idzzy

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to upgrade ceph from Firefly to Giant on Wheezy smoothly?

2014-11-17 Thread debian Only
thanks a lot

2014-11-16 1:44 GMT+07:00 Alexandre DERUMIER aderum...@odiso.com:

 simply change your debian repository to giant

 deb http://ceph.com/debian-giant wheezy main


 then

 apt-get update
 apt-get dist-upgrade

 on each node


 then

 /etc/init.d/ceph restart mon

 on each node


 then

 /etc/init.d/ceph restart osd

 on each node
 ...

 - Original Message -

 From: debian Only onlydeb...@gmail.com
 To: ceph-users@lists.ceph.com
 Sent: Saturday, 15 November 2014 08:10:30
 Subject: [ceph-users] How to upgrade ceph from Firefly to Giant on Wheezy
 smoothly?


 Dear all


 I have a Ceph Firefly test cluster on Debian Wheezy too, and I want to
 upgrade it from Firefly to Giant. Could you tell me how to do the upgrade?


 I saw the release notes below, but I do not know how to upgrade.
 Could you give me some guidance?




 Upgrade Sequencing
 --

 * If your existing cluster is running a version older than v0.80.x
 Firefly, please first upgrade to the latest Firefly release before
 moving on to Giant . We have not tested upgrades directly from
 Emperor, Dumpling, or older releases.

 We *have* tested:

 * Firefly to Giant
 * Dumpling to Firefly to Giant

 * Please upgrade daemons in the following order:

 #. Monitors
 #. OSDs
 #. MDSs and/or radosgw

 Note that the relative ordering of OSDs and monitors should not matter, but
 we primarily tested upgrading monitors first.
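
 Putting the quoted steps together, a rough per-node sketch, assuming the
 repository line lives in /etc/apt/sources.list.d/ceph.list (adjust to your setup):

   # switch the APT repository from Firefly to Giant and upgrade the packages
   sed -i 's/debian-firefly/debian-giant/' /etc/apt/sources.list.d/ceph.list
   apt-get update && apt-get dist-upgrade

   # then restart the daemons on each node, monitors first, OSDs second
   /etc/init.d/ceph restart mon
   /etc/init.d/ceph restart osd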

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance data collection for Ceph

2014-11-17 Thread 10 minus
Thanks Dan. If I understand correctly, the perf counters have to be queried per OSD
(I mean, for every OSD I have to run a check).



On Fri, Nov 14, 2014 at 8:44 PM, Dan Ryder (daryder) dary...@cisco.com
wrote:

  Hi,



 Take a look at the built in perf counters -
 http://ceph.com/docs/master/dev/perf_counters/. Through this you can get
 individual daemon performance as well as some cluster level statistics.



 Other (cluster-level) disk space utilization and pool
 utilization/performance is available through “ceph df detail”. Hope this
 helps.
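
 For example, a quick sketch of pulling these numbers; osd.0 and the socket path
 are only illustrative, and the admin socket location depends on your packaging:

   # per-daemon counters via the admin socket
   ceph daemon osd.0 perf dump
   # equivalently, by socket path
   ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump

   # cluster-wide and per-pool usage, plus per-OSD commit/apply latency
   ceph df detail
   ceph osd perf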







 Dan Ryder



 *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
 Of *10 minus
 *Sent:* Friday, November 14, 2014 10:26 AM
 *To:* ceph-users
 *Subject:* [ceph-users] Performance data collection for Ceph



 Hi,

 I'm trying to collect performance data for Ceph.

 I'm looking to run some commands at regular intervals to collect data.

 Apart from ceph osd perf, are there other commands one can use?

 Can I also track how much data is being replicated?

 Does Ceph maintain performance counters for individual OSDs?

   Something along the lines of zpool iostat.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osd crashed while there was no space

2014-11-17 Thread han vincent
hello, every one:

A Ceph problem has been troubling me for several days.

I built a cluster with 3 hosts, each with three OSDs. After that,
I used the command rados bench 360 -p data -b 4194304 -t 300 write --no-cleanup
to test the write performance of the cluster.

When the cluster was near full, no more data could be written to it. Unfortunately,
a host then hung, and a lot of PGs started migrating to other OSDs.
After a while, a lot of OSDs were marked down and out, and my cluster couldn't work
any more.

The following is the output of ceph -s:

cluster 002c3742-ab04-470f-8a7a-ad0658b547d6
health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs
incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625
pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean;
recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons
down, quorum 0,2 2,1
 monmap e1: 3 mons at
{0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election
epoch 40, quorum 0,2 2,1
 osdmap e173: 9 osds: 2 up, 2 in
flags full
  pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects
37541 MB used, 3398 MB / 40940 MB avail
945/29649 objects degraded (3.187%)
  34 stale+active+degraded+remapped
 176 stale+incomplete
 320 stale+down+peering
  53 active+degraded+remapped
 408 incomplete
   1 active+recovering+degraded
 673 down+peering
   1 stale+active+degraded
  15 remapped+peering
   3 stale+active+recovering+degraded+remapped
   3 active+degraded
  33 remapped+incomplete
   8 active+recovering+degraded+remapped

The following is the output of ceph osd tree:
# idweight  type name   up/down reweight
-1  9   root default
-3  9   rack unknownrack
-2  3   host 10.0.0.97
 0   1   osd.0   down0
 1   1   osd.1   down0
 2   1   osd.2   down0
 -4  3   host 10.0.0.98
 3   1   osd.3   down0
 4   1   osd.4   down0
 5   1   osd.5   down0
 -5  3   host 10.0.0.70
 6   1   osd.6   up  1
 7   1   osd.7   up  1
 8   1   osd.8   down0

The following is part of the output of osd.0.log:

-3 2014-11-14 17:33:02.166022 7fd9dd1ab700  0
filestore(/data/osd/osd.0)  error (28) No space left on device not
handled on operation 10 (15804.0.13, or op 13, counting from 0)
-2 2014-11-14 17:33:02.216768 7fd9dd1ab700  0
filestore(/data/osd/osd.0) ENOSPC handling not implemented
-1 2014-11-14 17:33:02.216783 7fd9dd1ab700  0
filestore(/data/osd/osd.0)  transaction dump:
...
...
0 2014-11-14 17:33:02.541008 7fd9dd1ab700 -1 os/FileStore.cc: In
function 'unsigned int
FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int,
ThreadPool::TPHandle*)' thread 7fd9dd1ab700 time
2014-11-14 17:33:02.251570
  os/FileStore.cc: 2540: FAILED assert(0 == unexpected error)

  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0x17f8675]
 2: (FileStore::_do_transaction(ObjectStore::Transaction,
unsigned long, int, ThreadPool::TPHandle*)+0x4855) [0x1534c21]
 3: (FileStore::_do_transactions(std::listObjectStore::Transaction*,
std::allocatorObjectStore::Transaction* ,  unsigned long,
ThreadPool::TPHandle*)+0x101) [0x152d67d]
 4: (FileStore::_do_op(FileStore::OpSequencer*,
ThreadPool::TPHandle)+0x57b) [0x152bdc3]
 5: (FileStore::OpWQ::_process(FileStore::OpSequencer*,
ThreadPool::TPHandle)+0x2f) [0x1553c6f]
 6: (ThreadPool::WorkQueueFileStore::OpSequencer::_void_process(void*,
ThreadPool::TPHandle)+0x37)  [0x15625e7]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x7a4) [0x18801de]
 8: (ThreadPool::WorkThread::entry()+0x23) [0x1881f2d]
 9: (Thread::_entry_func(void*)+0x23) [0x1998117]
10: (()+0x79d1) [0x7fd9e92bf9d1]
11: (clone()+0x6d) [0x7fd9e78ca9dd]
NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.

It seems the error code was ENOSPC (no space left), so why did the OSD
process exit with an assert at that point? If there was no space left, why did
the cluster choose to migrate? Only osd.6 and osd.7 were alive. I tried to
restart the other OSDs, but after a while those OSDs crashed again.
And now I can't read the data any more.
Is 

[ceph-users] OSDs down

2014-11-17 Thread NEVEU Stephane
Hi all :) ,

I need some help, I'm in a sad situation: I've lost 2 Ceph server nodes
physically (5 nodes initially / 5 monitors), so 3 nodes are left: node1, node2, node3.
On my first remaining node, I've updated the crush map to remove every OSD
running on those 2 lost servers:
Ceph osd crush remove osds && ceph auth del osds && ceph osd rm osds && ceph
osd remove my2Lostnodes
So the crush map seems to be ok now on node1.
Ceph osd tree on node 1 shows that every OSD running on node2 is “down 1”,
while the OSDs on node3 and node1 are “up 1”. Nevertheless, on node3 every ceph *
command stays frozen, so I'm not sure the crush map has been updated on node2
and node3. I don't know how to bring the OSDs on node2 up again.
My node2 says it cannot connect to the cluster !

Ceph -s on node 1 gives me (so still 5 monitors):

cluster 45d9195b-365e-491a-8853-34b46553db94
 health HEALTH_WARN 10016 pgs degraded; 10016 pgs stuck unclean; recovery 
181055/544038 objects degraded (33.280%); 11/33 in osds are down; noout flag(s) 
set; 2 mons down, quorum 0,1,2 node1,node2,node3; clock skew detected on 
mon.node2
 monmap e1: 5 mons at 
{node1=172.23.6.11:6789/0,node2=172.23.6.12:6789/0,node3=172.23.6.13:6789/0,node4=172.23.6.14:6789/0,node5=172.23.6.15:6789/0},
 election epoch 488, quorum 0,1,2 node1,node2,node3
 mdsmap e48: 1/1/1 up {0=node3=up:active}
 osdmap e3852: 33 osds: 22 up, 33 in
flags noout
  pgmap v8189785: 10016 pgs, 9 pools, 705 GB data, 177 kobjects
2122 GB used, 90051 GB / 92174 GB avail
181055/544038 objects degraded (33.280%)
   10016 active+degraded
  client io 0 B/s rd, 233 kB/s wr, 22 op/s


Thx for your help !!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.88 released

2014-11-17 Thread Ken Dreyer
On 11/14/14 12:54 PM, Robert LeBlanc wrote:
 Will there be RPMs built for this release?

Hi Robert,

Since this is a development release, you can find the v0.88 RPMs in the
development download area on ceph.com, which is currently at
http://ceph.com/rpm-testing/ .

(Down the road, we're working on making these URLs a bit easier to
find... but for now, that's where you can find v0.88)

- Ken
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance data collection for Ceph

2014-11-17 Thread Dan Ryder (daryder)
For OSDs, that is correct. FYI - perf counters are also available for all Ceph 
daemon types (mon, mds, rgw).


Dan Ryder

From: 10 minus [mailto:t10te...@gmail.com]
Sent: Monday, November 17, 2014 7:25 AM
To: Dan Ryder (daryder)
Cc: ceph-users
Subject: Re: [ceph-users] Performance data collection for Ceph

Thanks Dan. If I understand correctly, the perf counters have to be queried per OSD
(I mean, for every OSD I have to run a check).


On Fri, Nov 14, 2014 at 8:44 PM, Dan Ryder (daryder) 
dary...@cisco.commailto:dary...@cisco.com wrote:
Hi,

Take a look at the built in perf counters - 
http://ceph.com/docs/master/dev/perf_counters/. Through this you can get 
individual daemon performance as well as some cluster level statistics.

Other (cluster-level) disk space utilization and pool utilization/performance 
is available through “ceph df detail”. Hope this helps.



Dan Ryder

From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.commailto:ceph-users-boun...@lists.ceph.com]
 On Behalf Of 10 minus
Sent: Friday, November 14, 2014 10:26 AM
To: ceph-users
Subject: [ceph-users] Performance data collection for Ceph

Hi,
I'm trying to collect performance data for Ceph.
I'm looking to run some commands at regular intervals to collect data.
Apart from ceph osd perf, are there other commands one can use?
Can I also track how much data is being replicated?
Does Ceph maintain performance counters for individual OSDs?
Something along the lines of zpool iostat.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Troubleshooting an erasure coded pool with a cache tier

2014-11-17 Thread Laurent GUERBY
Hi,

Just a follow-up on this issue, we're probably hitting:

http://tracker.ceph.com/issues/9285

We had the issue a few weeks ago with a replicated SSD pool in front of a
rotational pool, and turned off cache tiering.

Yesterday we made a new test: activating cache tiering on a single
erasure-coded pool threw the whole Ceph cluster's performance to the floor
(including non-cached, non-erasure-coded pools), with frequent “slow
write” messages in the logs. Removing cache tiering was enough to go back to
normal performance.

I assume no one uses cache tiering on 0.80.7 in production clusters?

Sincerely,

Laurent

On Sunday 09 November 2014 at 00:24 +0100, Loic Dachary wrote:
 
 On 09/11/2014 00:03, Gregory Farnum wrote:
  It's all about the disk accesses. What's the slow part when you dump 
  historic and in-progress ops?
 
 This is what I see on g1 (6% iowait)
 
 root@g1:~# ceph daemon osd.0 dump_ops_in_flight
 { num_ops: 0,
   ops: []}
 
 root@g1:~# ceph daemon osd.0 dump_ops_in_flight
 { num_ops: 1,
   ops: [
 { description: osd_op(client.4407100.0:11030174 
 rb.0.410809.238e1f29.1038 [set-alloc-hint object_size 4194304 
 write_size 4194304,write 4095488~4096] 58.3aabb66d ack+ondisk+write e15613),
   received_at: 2014-11-09 00:14:17.385256,
   age: 0.538802,
   duration: 0.011955,
   type_data: [
 waiting for sub ops,
 { client: client.4407100,
   tid: 11030174},
 [
 { time: 2014-11-09 00:14:17.385393,
   event: waiting_for_osdmap},
 { time: 2014-11-09 00:14:17.385563,
   event: reached_pg},
 { time: 2014-11-09 00:14:17.385793,
   event: started},
 { time: 2014-11-09 00:14:17.385807,
   event: started},
 { time: 2014-11-09 00:14:17.385875,
   event: waiting for subops from 1,10},
 { time: 2014-11-09 00:14:17.386201,
   event: commit_queued_for_journal_write},
 { time: 2014-11-09 00:14:17.386336,
   event: write_thread_in_journal_buffer},
 { time: 2014-11-09 00:14:17.396293,
   event: journaled_completion_queued},
 { time: 2014-11-09 00:14:17.396332,
   event: op_commit},
 { time: 2014-11-09 00:14:17.396678,
   event: op_applied},
 { time: 2014-11-09 00:14:17.397211,
   event: sub_op_commit_rec}]]}]}
 
 and it looks OK. When I go to n7, which has 20% iowait, I see a much larger
 output (http://pastebin.com/DPxsaf6z) which includes a number of “event:
 waiting_for_osdmap” entries.
 
 I'm not sure what to make of this, and it would certainly be better if n7 had
 a lower iowait. Also, when I run ceph -w I see a new pgmap created every second,
 which is also not a good sign.
 
 2014-11-09 00:22:47.090795 mon.0 [INF] pgmap v4389613: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 3889 B/s 
 rd, 2125 kB/s wr, 237 op/s
 2014-11-09 00:22:48.143412 mon.0 [INF] pgmap v4389614: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1586 
 kB/s wr, 204 op/s
 2014-11-09 00:22:49.172794 mon.0 [INF] pgmap v4389615: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 343 kB/s 
 wr, 88 op/s
 2014-11-09 00:22:50.222958 mon.0 [INF] pgmap v4389616: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 412 kB/s 
 wr, 130 op/s
 2014-11-09 00:22:51.281294 mon.0 [INF] pgmap v4389617: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1195 
 kB/s wr, 167 op/s
 2014-11-09 00:22:52.318895 mon.0 [INF] pgmap v4389618: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 5864 B/s 
 rd, 2762 kB/s wr, 206 op/s
 
 Cheers
 
 
  On Sat, Nov 8, 2014 at 2:30 PM Loic Dachary l...@dachary.org 
  mailto:l...@dachary.org wrote:
  
  Hi Greg,
  
  On 08/11/2014 20:19, Gregory Farnum wrote: When acting as a cache pool 
  it needs to go do a lookup on the base pool for every object it hasn't 
  encountered before. I assume that's why it's slower.
   (The penalty should not be nearly as high as you're seeing here, but 
  based on the low numbers I imagine you're running everything on an 
  overloaded laptop or something.)
  
  It's running on a small cluster that is busy but not to a point that I 
  expect such a difference:
  
  # dsh --concurrent-shell --show-machine-names --remoteshellopt=-p 
  -m g1 -m g2 -m g3 -m n7 -m stri dstat -c 10 3
  g1: total-cpu-usage
  g1: usr sys idl wai hiq siq
  g1:   6   1  88   6   0   0
  g2: total-cpu-usage
  g2: usr sys idl wai hiq siq
  g2:   4   1  

Re: [ceph-users] Troubleshooting an erasure coded pool with a cache tier

2014-11-17 Thread Erik Logtenberg
I think I might be running into the same issue. I'm using Giant though.
A lot of slow writes. My thoughts went to: the OSD's get too much work
to do (commodity hardware), so I'll have to do some performance tuning
to limit parallelism a bit. And indeed, limiting the amount of threads
for different tasks reduced some of the load, but I keep getting slow
writes very often, especially if the load is coming from CephFS (which
is the only thing I use a cache tier for).

To answer your question: no, it's not yet production, and it's not
suited for production currently either.

In my case the slow writes keep stacking up, until OSD's commit suicide,
and then the recovery process adds even further to the load of the
remaining OSD's, causing a chain reaction in which other OSD's also kill
themselves.

Non-optimal performance could in my case be acceptable for
semi-production, but stability is essential. So I hope these issues can
be fixed.

Kind regards,

Erik.


On 17-11-14 17:45, Laurent GUERBY wrote:
 Hi,
 
 Just a follow-up on this issue, we're probably hitting:
 
 http://tracker.ceph.com/issues/9285
 
 We had the issue a few weeks ago with replicated SSD pool in front of
 rotational pool and turned off cache tiering. 
 
 Yesterday we made a new test and activating cache tiering on a single
 erasure pool threw the whole ceph cluster performance to the floor
 (including non cached non erasure coded pools) with frequent slow
 write in the logs. Removing cache tiering was enough to go back to
 normal performance.
 
 I assume no one use cache tiering on 0.80.7 in production clusters?
 
 Sincerely,
 
 Laurent
 
 On Sunday 09 November 2014 at 00:24 +0100, Loic Dachary wrote:

 On 09/11/2014 00:03, Gregory Farnum wrote:
 It's all about the disk accesses. What's the slow part when you dump 
 historic and in-progress ops?

 This is what I see on g1 (6% iowait)

 root@g1:~# ceph daemon osd.0 dump_ops_in_flight
 { num_ops: 0,
   ops: []}

 root@g1:~# ceph daemon osd.0 dump_ops_in_flight
 { num_ops: 1,
   ops: [
 { description: osd_op(client.4407100.0:11030174 
 rb.0.410809.238e1f29.1038 [set-alloc-hint object_size 4194304 
 write_size 4194304,write 4095488~4096] 58.3aabb66d ack+ondisk+write e15613),
   received_at: 2014-11-09 00:14:17.385256,
   age: 0.538802,
   duration: 0.011955,
   type_data: [
 waiting for sub ops,
 { client: client.4407100,
   tid: 11030174},
 [
 { time: 2014-11-09 00:14:17.385393,
   event: waiting_for_osdmap},
 { time: 2014-11-09 00:14:17.385563,
   event: reached_pg},
 { time: 2014-11-09 00:14:17.385793,
   event: started},
 { time: 2014-11-09 00:14:17.385807,
   event: started},
 { time: 2014-11-09 00:14:17.385875,
   event: waiting for subops from 1,10},
 { time: 2014-11-09 00:14:17.386201,
   event: commit_queued_for_journal_write},
 { time: 2014-11-09 00:14:17.386336,
   event: write_thread_in_journal_buffer},
 { time: 2014-11-09 00:14:17.396293,
   event: journaled_completion_queued},
 { time: 2014-11-09 00:14:17.396332,
   event: op_commit},
 { time: 2014-11-09 00:14:17.396678,
   event: op_applied},
 { time: 2014-11-09 00:14:17.397211,
   event: sub_op_commit_rec}]]}]}

 and it looks ok. When I go to n7 which has 20% iowait, I see a much larger 
 output http://pastebin.com/DPxsaf6z which includes a number of event: 
 waiting_for_osdmap.

 I'm not sure what to make of this and it would certainly be better if n7 had 
 a lower iowait. Also when I ceph -w I see a new pgmap is created every 
 second which is also not a good sign.

 2014-11-09 00:22:47.090795 mon.0 [INF] pgmap v4389613: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 3889 
 B/s rd, 2125 kB/s wr, 237 op/s
 2014-11-09 00:22:48.143412 mon.0 [INF] pgmap v4389614: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1586 
 kB/s wr, 204 op/s
 2014-11-09 00:22:49.172794 mon.0 [INF] pgmap v4389615: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 343 
 kB/s wr, 88 op/s
 2014-11-09 00:22:50.222958 mon.0 [INF] pgmap v4389616: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 412 
 kB/s wr, 130 op/s
 2014-11-09 00:22:51.281294 mon.0 [INF] pgmap v4389617: 460 pgs: 460 
 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1195 
 kB/s wr, 167 op/s
 2014-11-09 00:22:52.318895 mon.0 [INF] pgmap v4389618: 460 pgs: 460 
 active+clean; 2580 GB data, 

Re: [ceph-users] Creating RGW S3 User using the Admin Ops API

2014-11-17 Thread Yehuda Sadeh
On Sun, Nov 16, 2014 at 10:50 PM, Wido den Hollander w...@42on.com wrote:
 On 17-11-14 07:44, Lei Dong wrote:
 I think you should send the data (uid & display-name) as arguments. I
 successfully created a user via the admin ops API without any problems.


 To be clear:

 PUT /admin/user?format=json&uid=XXX&display-name=

 Did you try this with Dumpling (0.67.X) as well?


I just tested it on Dumpling, and it worked. One thing that I did see
is that you can get a 403 response if the uid was not provided. Maybe
this param is getting clobbered somehow?

Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mds cluster degraded

2014-11-17 Thread JIten Shah
After I rebuilt the OSDs, the MDS went into degraded mode and will not
recover.


[jshah@Lab-cephmon001 ~]$ sudo tail -100f 
/var/log/ceph/ceph-mds.Lab-cephmon001.log
2014-11-17 17:55:27.855861 7fffef5d3700  0 -- X.X.16.111:6800/3046050  
X.X.16.114:0/838757053 pipe(0x1e18000 sd=22 :6800 s=0 pgs=0 cs=0 l=0 
c=0x1e02c00).accept peer addr is really X.X.16.114:0/838757053 (socket is 
X.X.16.114:34672/0)
2014-11-17 17:57:27.855519 7fffef5d3700  0 -- X.X.16.111:6800/3046050  
X.X.16.114:0/838757053 pipe(0x1e18000 sd=22 :6800 s=2 pgs=2 cs=1 l=0 
c=0x1e02c00).fault with nothing to send, going to standby
2014-11-17 17:58:47.883799 7fffef3d1700  0 -- X.X.16.111:6800/3046050  
X.X.16.114:0/26738200 pipe(0x1e1be80 sd=23 :6800 s=0 pgs=0 cs=0 l=0 
c=0x1e04ba0).accept peer addr is really X.X.16.114:0/26738200 (socket is 
X.X.16.114:34699/0)
2014-11-17 18:00:47.882484 7fffef3d1700  0 -- X.X.16.111:6800/3046050  
X.X.16.114:0/26738200 pipe(0x1e1be80 sd=23 :6800 s=2 pgs=2 cs=1 l=0 
c=0x1e04ba0).fault with nothing to send, going to standby
2014-11-17 18:01:47.886662 7fffef1cf700  0 -- X.X.16.111:6800/3046050  
X.X.16.114:0/3673954317 pipe(0x1e1c380 sd=24 :6800 s=0 pgs=0 cs=0 l=0 
c=0x1e05540).accept peer addr is really X.X.16.114:0/3673954317 (socket is 
X.X.16.114:34718/0)
2014-11-17 18:03:47.885488 7fffef1cf700  0 -- X.X.16.111:6800/3046050  
X.X.16.114:0/3673954317 pipe(0x1e1c380 sd=24 :6800 s=2 pgs=2 cs=1 l=0 
c=0x1e05540).fault with nothing to send, going to standby
2014-11-17 18:04:47.888983 7fffeefcd700  0 -- X.X.16.111:6800/3046050  
X.X.16.114:0/3403131574 pipe(0x1e18a00 sd=25 :6800 s=0 pgs=0 cs=0 l=0 
c=0x1e05280).accept peer addr is really X.X.16.114:0/3403131574 (socket is 
X.X.16.114:34744/0)
2014-11-17 18:06:47.888427 7fffeefcd700  0 -- X.X.16.111:6800/3046050  
X.X.16.114:0/3403131574 pipe(0x1e18a00 sd=25 :6800 s=2 pgs=2 cs=1 l=0 
c=0x1e05280).fault with nothing to send, going to standby
2014-11-17 20:02:03.558250 707de700 -1 mds.0.1 *** got signal Terminated ***
2014-11-17 20:02:03.558297 707de700  1 mds.0.1 suicide.  wanted down:dne, 
now up:active
2014-11-17 20:02:56.053339 77fe77a0  0 ceph version 0.80.5 
(38b73c67d375a2552d8ed67843c8a65c2c0feba6), process ceph-mds, pid 3424727
2014-11-17 20:02:56.121367 730e4700  1 mds.-1.0 handle_mds_map standby
2014-11-17 20:02:56.124343 730e4700  1 mds.0.2 handle_mds_map i am now 
mds.0.2
2014-11-17 20:02:56.124345 730e4700  1 mds.0.2 handle_mds_map state change 
up:standby -- up:replay
2014-11-17 20:02:56.124348 730e4700  1 mds.0.2 replay_start
2014-11-17 20:02:56.124359 730e4700  1 mds.0.2  recovery set is 
2014-11-17 20:02:56.124362 730e4700  1 mds.0.2  need osdmap epoch 93, have 
92
2014-11-17 20:02:56.124363 730e4700  1 mds.0.2  waiting for osdmap 93 
(which blacklists prior instance)


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs down

2014-11-17 Thread Craig Lewis
Firstly, any chance of getting node4 and node5 back up?  You can move the
disks (monitor and osd) to a new chassis, and bring it back up.  As long as
it has the same IP as the original node4 and node5, the monitor should join.

How much is the clock skewed on node2?  I haven't had problems with small
skew (~100 ms), but I've seen posts to the mailing list about large skews
(minutes) causing quorum and authentication problems.
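
A quick way to see how large the measured skew actually is (a sketch; run the
NTP check on the skewed node itself):

  ceph health detail | grep -i skew   # the monitors report the measured offset
  ntpq -p                             # on node2, check NTP peers and offsets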

When you say “Nevertheless on node3 every ceph * command stays frozen”, do
you by chance mean node2 instead of node3?  If so, that supports the clock
skew being a problem, preventing the commands and the OSDs
from authenticating with the monitors.

If you really did mean node3, then something else strange is going on.



On Mon, Nov 17, 2014 at 7:07 AM, NEVEU Stephane 
stephane.ne...@thalesgroup.com wrote:

 Hi all J ,



 I need some help, I’m in a sad situation : i’ve lost 2 ceph server nodes
 physically (5 nodes initialy/ 5 monitors). So 3 nodes left : node1, node2,
 node3

 On my first node leaving, I’ve updated the crush map to remove every osds
 running on those 2 lost servers :

 Ceph osd crush remove osds && ceph auth del osds && ceph osd rm osds &&
 ceph osd remove my2Lostnodes

 So the crush map seems to be ok now on node1.

 Ceph osd tree on node 1 returns that every osds running on node2 are “down
 1” and “up 1” on node 3 and “up 1” on node1. Nevertheless on node3 every
 ceph * commands stay freezed, so I’m not sure the crush map has been
 updated on node2 and node3. I don’t know how to set ods on node 2 up again.

 My node2 says it cannot connect to the cluster !



 Ceph -s on node 1 gives me (so still 5 monitors):



 cluster 45d9195b-365e-491a-8853-34b46553db94
  health HEALTH_WARN 10016 pgs degraded; 10016 pgs stuck unclean;
 recovery 181055/544038 objects degraded (33.280%); 11/33 in osds are down;
 noout flag(s) set; 2 mons down, quorum 0,1,2 node1,node2,node3; clock skew
 detected on mon.node2
  monmap e1: 5 mons at {node1=172.23.6.11:6789/0,node2=172.23.6.12:6789/0,node3=172.23.6.13:6789/0,node4=172.23.6.14:6789/0,node5=172.23.6.15:6789/0},
 election epoch 488, quorum 0,1,2 node1,node2,node3
  mdsmap e48: 1/1/1 up {0=node3=up:active}
  osdmap e3852: 33 osds: 22 up, 33 in
 flags noout
   pgmap v8189785: 10016 pgs, 9 pools, 705 GB data, 177 kobjects
 2122 GB used, 90051 GB / 92174 GB avail
 181055/544038 objects degraded (33.280%)
10016 active+degraded
   client io 0 B/s rd, 233 kB/s wr, 22 op/s





 Thx for your help !!



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd crashed while there was no space

2014-11-17 Thread Craig Lewis
At this point, it's probably best to delete the pool.  I'm assuming the
pool only contains benchmark data, and nothing important.

Assuming you can delete the pool:
First, figure out the ID of the data pool.  You can get that from ceph osd
dump | grep '^pool'

Once you have the number, delete the data pool: rados rmpool data
data --yes-i-really-really-mean-it

That will only free up space on OSDs that are up.  You'll need to manually
delete some PGs on the OSDs that are 100% full.  Go
to /var/lib/ceph/osd/ceph-OSDID/current, and delete a few directories
that start with your data pool ID.  You don't need to delete all of them.
Once the disk is below 95% full, you should be able to start that OSD.
Once it's up, it will finish deleting the pool.
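
A sketch of that first path; the pool name, the pool ID 0, and the OSD ID are
illustrative:

  # find the numeric ID of the pool
  ceph osd dump | grep '^pool'

  # delete the benchmark pool (the name is given twice on purpose)
  rados rmpool data data --yes-i-really-really-mean-it

  # on an OSD too full to start, list the PG directories belonging to that
  # pool (here pool ID 0), remove a few of them, then start the OSD again
  ls /var/lib/ceph/osd/ceph-2/current | grep '^0\.'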

If you can't delete the pool, it is possible, but it's more work, and you
still run the risk of losing data if you make a mistake.  You need to
disable backfilling, then delete some PGs on each OSD that's full. Try to
only delete one copy of each PG.  If you delete every copy of a PG on all
OSDs, then you lost the data that was in that PG.  As before, once you
delete enough that the disk is less than 95% full, you can start the OSD.
Once you start it, start deleting your benchmark data out of the data
pool.  Once that's done, you can re-enable backfilling.  You may need to
scrub or deep-scrub the OSDs you deleted data from to get everything back
to normal.


So how did you get the disks 100% full anyway?  Ceph normally won't let you
do that.  Did you increase mon_osd_full_ratio, osd_backfill_full_ratio, or
 osd_failsafe_full_ratio?


On Mon, Nov 17, 2014 at 7:00 AM, han vincent hang...@gmail.com wrote:

 hello, every one:

 These days a problem of ceph has troubled me for a long time.

 I build a cluster with 3 hosts and each host has three osds in it.
 And after that
 I used the command rados bench 360 -p data -b 4194304 -t 300 write
 --no-cleanup
 to test the write performance of the cluster.

 When the cluster is near full, there couldn't write any data to
 it. Unfortunately,
 there was a host hung up, then a lots of PG was going to migrate to other
 OSDs.
 After a while, a lots of OSD was marked down and out, my cluster couldn't
 work
 any more.

 The following is the output of ceph -s:

 cluster 002c3742-ab04-470f-8a7a-ad0658b547d6
 health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs
 incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625
 pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean;
 recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons
 down, quorum 0,2 2,1
  monmap e1: 3 mons at
 {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election
 epoch 40, quorum 0,2 2,1
  osdmap e173: 9 osds: 2 up, 2 in
 flags full
   pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects
 37541 MB used, 3398 MB / 40940 MB avail
 945/29649 objects degraded (3.187%)
   34 stale+active+degraded+remapped
  176 stale+incomplete
  320 stale+down+peering
   53 active+degraded+remapped
  408 incomplete
1 active+recovering+degraded
  673 down+peering
1 stale+active+degraded
   15 remapped+peering
3 stale+active+recovering+degraded+remapped
3 active+degraded
   33 remapped+incomplete
8 active+recovering+degraded+remapped

 The following is the output of ceph osd tree:
 # idweight  type name   up/down reweight
 -1  9   root default
 -3  9   rack unknownrack
 -2  3   host 10.0.0.97
  0   1   osd.0   down0
  1   1   osd.1   down0
  2   1   osd.2   down0
  -4  3   host 10.0.0.98
  3   1   osd.3   down0
  4   1   osd.4   down0
  5   1   osd.5   down0
  -5  3   host 10.0.0.70
  6   1   osd.6   up  1
  7   1   osd.7   up  1
  8   1   osd.8   down0

 The following is part of output os osd.0.log

 -3 2014-11-14 17:33:02.166022 7fd9dd1ab700  0
 filestore(/data/osd/osd.0)  error (28) No space left on device not
 handled on operation 10 (15804.0.13, or op 13, counting from 0)
 -2 2014-11-14 17:33:02.216768 7fd9dd1ab700  0
 filestore(/data/osd/osd.0) ENOSPC handling not implemented
 -1 2014-11-14 17:33:02.216783 7fd9dd1ab700  0
 filestore(/data/osd/osd.0)  transaction dump:
 ...
 ...
 0 2014-11-14 17:33:02.541008 7fd9dd1ab700 -1 

Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-17 Thread Craig Lewis
I use `dd` to force activity to the disk I want to replace, and watch the
activity lights.  That only works if your disks aren't 100% busy.  If they
are, stop the ceph-osd daemon, and see which drive stops having activity.
Repeat until you're 100% confident that you're pulling the right drive.
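
A sketch of the dd trick; the device name is a placeholder, and reading (rather
than writing) keeps the data safe:

  # generate read traffic on the suspect disk and watch its activity LED
  dd if=/dev/sdX of=/dev/null bs=1M count=4096 iflag=direct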

On Wed, Nov 12, 2014 at 5:05 AM, SCHAER Frederic frederic.sch...@cea.fr
wrote:

  Hi,



 I’m used to RAID software giving me the failing disks & slots, and most
 often blinking the disks on the disk bays.

 I recently installed a DELL “6GB HBA SAS” JBOD card, said to be an LSI
 2008 one, and I now have to identify 3 pre-failed disks (so says S.M.A.R.T.).



 Since this is an LSI, I thought I’d use MegaCli to identify the disks
 slot, but MegaCli does not see the HBA card.

 Then I found the LSI “sas2ircu” utility, but again, this one fails at
 giving me the disk slots (it finds the disks, serials and others, but slot
 is always 0)

 Because of this, I’m going to head over to the disk bay and unplug the
 disk which I think corresponds to the alphabetical order in linux, and see
 if it’s the correct one…. But even if this is correct this time, it might
 not be next time.



 But this makes me wonder : how do you guys, Ceph users, manage your disks
 if you really have JBOD servers ?

 I can’t imagine having to guess slots like that each time, and I can’t imagine
 creating serial-number stickers for every single disk I could have
 to manage, either …

 Is there any specific advice reguarding JBOD cards people should (not) use
 in their systems ?

 Any magical way to “blink” a drive in linux ?



 Thanks & regards

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD commits suicide

2014-11-17 Thread Craig Lewis
I did have a problem in my secondary cluster that sounds similar to yours.
I was using XFS, and traced my problem back to 64 kB inodes (osd mkfs
options xfs = -i size=64k).   This showed up with a lot of “XFS: possible
memory allocation deadlock in kmem_alloc” in the kernel logs.  I was able
to keep things limping along by flushing the cache frequently, but I
eventually re-formatted every OSD to get rid of the 64k inodes.

After I finished the reformat, I had problems because of deep-scrubbing.
While reformatting, I disabled deep-scrubbing.  Once I re-enabled it, Ceph
wanted to deep-scrub the whole cluster, and sometimes 90% of my OSDs would
be doing a deep-scrub.  I'm manually deep-scrubbing now, trying to spread
out the schedule a bit.  Once this finishes in a few days, I should be able
to re-enable deep-scrubbing and keep my HEALTH_OK.
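
For what it's worth, spreading the manual deep-scrubs can be scripted roughly
like this (a sketch; the one-minute pause is arbitrary):

  # deep-scrub one PG at a time, pausing between PGs
  for pg in $(ceph pg dump pgs_brief 2>/dev/null | awk '/active/ {print $1}'); do
      ceph pg deep-scrub "$pg"
      sleep 60
  done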


My primary cluster has always been well behaved.  It completed the
re-format without having any problems.  The clusters are nearly identical,
the biggest difference being that the secondary had a higher sustained load
due to a replication backlog.




On Sat, Nov 15, 2014 at 12:38 PM, Erik Logtenberg e...@logtenberg.eu
wrote:

 Hi,

 Thanks for the tip, I applied these configuration settings and it does
 lower the load during rebuilding a bit. Are there settings like these
 that also tune Ceph down a bit during regular operations? The slow
 requests, timeouts and OSD suicides are killing me.

 If I allow the cluster to regain consciousness and stay idle a bit, it
 all seems to settle down nicely, but as soon as I apply some load it
 immediately starts to overstress and complain like crazy.

 I'm also seeing this behaviour: http://tracker.ceph.com/issues/9844
 This was reported by Dmitry Smirnov 26 days ago, but the report has no
 response yet. Any ideas?

 In my experience, OSD's are quite unstable in Giant and very easily
 stressed, causing chain effects, further worsening the issues. It would
 be nice to know if this is also noticed by other users?

 Thanks,

 Erik.


 On 11/10/2014 08:40 PM, Craig Lewis wrote:
  Have you tuned any of the recovery or backfill parameters?  My ceph.conf
  has:
  [osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
 
  Still, if it's running for a few hours, then failing, it sounds like
  there might be something else at play.  OSDs use a lot of RAM during
  recovery.  How much RAM and how many OSDs do you have in these nodes?
  What does memory usage look like after a fresh restart, and what does it
  look like when the problems start?  Even better if you know what it
  looks like 5 minutes before the problems start.
 
  Is there anything interesting in the kernel logs?  OOM killers, or
  memory deadlocks?
 
 
 
  On Sat, Nov 8, 2014 at 11:19 AM, Erik Logtenberg e...@logtenberg.eu
  mailto:e...@logtenberg.eu wrote:
 
  Hi,
 
  I have some OSD's that keep committing suicide. My cluster has ~1.3M
  misplaced objects, and it can't really recover, because OSD's keep
  failing before recovering finishes. The load on the hosts is quite
 high,
  but the cluster currently has no other tasks than just the
  backfilling/recovering.
 
  I attached the logfile from a failed OSD. It shows the suicide, the
  recent events and also me starting the OSD again after some time.
 
  It'll keep running for a couple of hours and then fail again, for the
  same reason.
 
  I noticed a lot of timeouts. Apparently ceph stresses the hosts to
 the
  limit with the recovery tasks, so much that they timeout and can't
  finish that task. I don't understand why. Can I somehow throttle
 ceph a
  bit so that it doesn't keep overrunning itself? I kinda feel like it
  should chill out a bit and simply recover one step at a time instead
 of
  full force and then fail.
 
  Thanks,
 
  Erik.
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD commits suicide

2014-11-17 Thread Andrey Korolyov
On Tue, Nov 18, 2014 at 12:54 AM, Craig Lewis cle...@centraldesktop.com wrote:
 I did have a problem in my secondary cluster that sounds similar to yours.
 I was using XFS, and traced my problem back to 64 kB inodes (osd mkfs
 options xfs = -i size=64k).   This showed up with a lot of XFS: possible
 memory allocation deadlock in kmem_alloc in the kernel logs.  I was able to
 keep things limping along by flushing the cache frequently, but I eventually
 re-formatted every OSD to get rid of the 64k inodes.

 After I finished the reformat, I had problems because of deep-scrubbing.
 While reformatting, I disabled deep-scrubbing.  Once I re-enabled it, Ceph
 wanted to deep-scrub the whole cluster, and sometimes 90% of my OSDs would
 be doing a deep-scrub.  I'm manually deep-scrubbing now, trying to spread
 out the schedule a bit.  Once this finishes in a few day, I should be able
 to re-enable deep-scrubbing and keep my HEALTH_OK.



Would you mind checking those suggestions (my hints, or the hints from the
URLs mentioned in that thread:
http://marc.info/?l=linux-mm&m=141607712831090&w=2) with 64k inodes again? As
for me, I am no longer observing the lock loop after setting min_free_kbytes to
half a gigabyte per OSD. Even if your locks have a different cause, it may be
worth trying anyway.
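
A sketch of that setting, following the half-a-gigabyte-per-OSD rule of thumb
above (roughly 4 GB reserved on a node with 8 OSDs; the figure is an example,
not a recommendation):

  sysctl -w vm.min_free_kbytes=4194304
  echo 'vm.min_free_kbytes = 4194304' >> /etc/sysctl.conf   # persist across reboots
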
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deep scrub parameter tuning

2014-11-17 Thread Craig Lewis
The minimum value for osd_deep_scrub_interval  is osd_scrub_min_interval,
and it wouldn't be advisable to go that low.

I can't find the documentation, but basically Ceph will attempt a scrub
sometime between osd_scrub_min_interval and osd_scrub_max_interval.  If the
PG hasn't been deep-scrubbed in the last osd_deep_scrub_interval seconds,
it does a deep-scrub instead.

So if you set  osd_deep_scrub_interval to osd_scrub_min_interval, you'll
never scrub your PGs, you'll only deep-scrub.

Obviously, you can lower the two scrub intervals too.  As Loïc says, test
it well.  I find when I'm playing with these values, I use injectargs to
find a good value, then persist that value in the ceph.conf.
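
For example, with a three-day interval that is purely illustrative:

  # try the new deep-scrub interval (in seconds) at runtime on all OSDs
  ceph tell osd.* injectargs '--osd_deep_scrub_interval 259200'

and, once happy with it, persist the value in ceph.conf:

  [osd]
      osd deep scrub interval = 259200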


On Fri, Nov 14, 2014 at 3:16 AM, Loic Dachary l...@dachary.org wrote:

 Hi,

 On 14/11/2014 12:11, Mallikarjun Biradar wrote:
  Hi,
 
  Default deep scrub interval is once per week, which we can set using
 osd_deep_scrub_interval parameter.
 
  Can we reduce it to less than a week, or is one week the minimum
  interval?

 You can reduce it to a shorter period. It is worth testing the impact on
 disk IO before going to production with shorter intervals though.

 Cheers

 
  -Thanks  regards,
  Mallikarjun Biradar
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

 --
 Loïc Dachary, Artisan Logiciel Libre


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Negative number of objects degraded for extended period of time

2014-11-17 Thread Craig Lewis
Well, after 4 days, this is probably moot.  Hopefully it's finished
backfilling, and your problem is gone.

If not, I believe that if you fix those backfill_toofull, the negative
numbers will start approaching zero.  I seem to recall that negative
degraded is a special case of degraded, but I don't remember exactly, and
can't find any references.  I have seen it before, and it went away when my
cluster became healthy.

As long as you still have OSDs completing their backfilling, I'd let it
run.

If you get to the point that all of the backfills are done, and you're left
with only wait_backfill+backfill_toofull, then you can bump
osd_backfill_full_ratio, mon_osd_nearfull_ratio, and maybe
osd_failsafe_nearfull_ratio.
 If you do, be careful, and only bump them just enough to let them start
backfilling.  If you set them to 0.99, bad things will happen.
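
A cautious sketch of such a bump; the 0.87 figures are placeholders, not
recommendations:

  # raise the per-OSD backfill cut-off slightly so the toofull PGs can drain
  ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.87'
  # and, if needed, the cluster near-full threshold
  ceph pg set_nearfull_ratio 0.87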




On Thu, Nov 13, 2014 at 7:57 AM, Fred Yang frederic.y...@gmail.com wrote:

 Hi,

 The Ceph cluster we are running had a few OSDs approaching 95% usage 1+ weeks
 ago, so I ran a reweight to balance it out, in the meantime instructing the
 application to purge data that was no longer required. But after the large data
 purge issued from the application side (all OSDs' usage dropped below 20%), the
 cluster fell into this weird state for days: the objects-degraded count has remained
 negative for more than 7 days. I'm seeing some I/O going on on the OSDs
 consistently, but the (negative) number of objects degraded does not change
 much:

 2014-11-13 10:43:07.237292 mon.0 [INF] pgmap v5935301: 44816 pgs: 44713
 active+clean, 1 active+backfilling, 20 active+remapped+wait_backfill, 27
 active+remapped+wait_backfill+backfill_toofull, 11 active+recovery_wait, 33
 active+remapped+backfilling, 11 active+wait_backfill+backfill_toofull; 1473
 GB data, 2985 GB used, 17123 GB / 20109 GB avail; 30172 kB/s wr, 58 op/s;
 -13582/1468299 objects degraded (-0.925%)
 2014-11-13 10:43:08.248232 mon.0 [INF] pgmap v5935302: 44816 pgs: 44713
 active+clean, 1 active+backfilling, 20 active+remapped+wait_backfill, 27
 active+remapped+wait_backfill+backfill_toofull, 11 active+recovery_wait, 33
 active+remapped+backfilling, 11 active+wait_backfill+backfill_toofull; 1473
 GB data, 2985 GB used, 17123 GB / 20109 GB avail; 26459 kB/s wr, 51 op/s;
 -13582/1468303 objects degraded (-0.925%)

 Any idea what might be happening here? It
 seems the active+remapped+wait_backfill+backfill_toofull PGs are stuck?

  osdmap e43029: 36 osds: 36 up, 36 in
   pgmap v5935658: 44816 pgs, 32 pools, 1488 GB data, 714 kobjects
 3017 GB used, 17092 GB / 20109 GB avail
 -13438/1475773 objects degraded (-0.911%)
44713 active+clean
1 active+backfilling
   20 active+remapped+wait_backfill
   27 active+remapped+wait_backfill+backfill_toofull
   11 active+recovery_wait
   33 active+remapped+backfilling
   11 active+wait_backfill+backfill_toofull
   client io 478 B/s rd, 40170 kB/s wr, 80 op/s

 The cluster is running v0.72.2; we are planning to upgrade the cluster to
 Firefly, but I would like to get the cluster state clean first before the
 upgrade.

 Thanks,
 Fred

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-17 Thread Cedric Lemarchand
Hi,

Try looking for a file named “locate” in a folder named “Slot X”, where X is the
number of the slot; echoing 1 into the locate file will make the
LED blink:

# find /sys -name locate  |grep Slot
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
01/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
02/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
03/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
04/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
05/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
06/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
07/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
08/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
09/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
10/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
11/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
12/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
13/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
14/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
15/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
16/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
17/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
18/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
19/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
20/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
21/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
22/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
23/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
24/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
25/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
26/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
27/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
28/locate

LSI 9200-8e with a Supermicro JBOD 28 slots, Ubuntu 12.04, 3.13 kernel.

Cheers


Le 12/11/2014 14:05, SCHAER Frederic a écrit :

 Hi,

  

 I’m used to RAID software giving me the failing disks  slots, and most
 

Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-17 Thread Cedric Lemarchand
Sorry, I forgot to say that under Slot X/device/block you can find the
device name, e.g. sdc.
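
Putting the two mails together, something like this for a single slot; the
enclosure address 6:0:28:0 comes from the listing above and will differ on
other systems:

  SLOT='/sys/class/enclosure/6:0:28:0/Slot 05'
  ls "$SLOT/device/block"     # prints the kernel device name, e.g. sdc
  echo 1 > "$SLOT/locate"     # start blinking the bay LED
  echo 0 > "$SLOT/locate"     # stop blinking when finished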

Cheers



Le 18/11/2014 00:15, Cedric Lemarchand a écrit :
 Hi,

 Try looking for a file named “locate” in a folder named “Slot X”, where X is
 the number of the slot; echoing 1 into the locate file will make
 the LED blink:

 # find /sys -name locate  |grep Slot
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 01/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 02/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 03/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 04/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 05/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 06/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 07/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 08/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 09/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 10/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 11/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 12/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 13/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 14/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 15/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 16/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 17/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 18/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 19/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 20/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 21/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 22/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 23/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 24/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 25/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 26/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 27/locate
 /sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
 28/locate


[ceph-users] CephFS unresponsive at scale (2M files,

2014-11-17 Thread Kevin Sumner
I’ve got a test cluster together with ~500 OSDs, 5 MONs, and 1 MDS.  All
the OSDs also mount CephFS at /ceph.  I’ve got Graphite pointing at a space
under /ceph.  Over the weekend, I drove almost 2 million metrics, each of which
creates a ~3MB file in a hierarchical path, with a datapoint sent into each
metric file once a minute.  CephFS seemed to handle the writes OK while I was
driving load.  The files containing each metric are at paths like this:

/ceph/whisper/sandbox/cephtest-osd0013/2/3/4/5.wsp

Today, however, with the load generator still running, reading metadata of 
files (e.g. directory entries and stat(2) info) in the filesystem (presumably 
MDS-managed data) seems nearly impossible, especially deeper into the tree.  
For example, in a shell cd seems to work but ls hangs, seemingly indefinitely.  
After turning off the load generator and allowing a while for things to settle 
down, everything seems to behave better.

ceph status and ceph health both return good statuses the entire time.  During 
load generation, the ceph-mds process seems pegged at between 100% and 150%, 
but with load generation turned off, the process has some high variability from 
near-idle up to similar 100-150% CPU.

Hopefully, I’ve missed something in the CephFS tuning.  However, I’m looking 
for direction on figuring out if it is, indeed, a tuning problem or if this 
behavior is a symptom of the “not ready for production” banner in the 
documentation.
--
Kevin Sumner
ke...@sumner.io



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS unresponsive at scale (2M files,

2014-11-17 Thread Sage Weil
On Mon, 17 Nov 2014, Kevin Sumner wrote:
 I’ve got a test cluster together with ~500 OSDs, 5 MONs, and 1 MDS.  All
 the OSDs also mount CephFS at /ceph.  I’ve got Graphite pointing at a space
 under /ceph.  Over the weekend, I drove almost 2 million metrics, each of
 which creates a ~3MB file in a hierarchical path, each sending a datapoint
 into the metric file once a minute.  CephFS seemed to handle the writes ok
 while I was driving load.  All files containing each metric are at paths
 like this:
 /ceph/whisper/sandbox/cephtest-osd0013/2/3/4/5.wsp
 
 Today, however, with the load generator still running, reading metadata of
 files (e.g. directory entries and stat(2) info) in the filesystem
 (presumably MDS-managed data) seems nearly impossible, especially deeper
 into the tree.  For example, in a shell cd seems to work but ls hangs,
 seemingly indefinitely.  After turning off the load generator and allowing a
 while for things to settle down, everything seems to behave better.
 
 ceph status and ceph health both return good statuses the entire time.
  During load generation, the ceph-mds process seems pegged at between 100%
 and 150%, but with load generation turned off, the process has some high
 variability from near-idle up to similar 100-150% CPU.
 
 Hopefully, I’ve missed something in the CephFS tuning.  However, I’m looking 
 for
 direction on figuring out if it is, indeed, a tuning problem or if this
 behavior is a symptom of the “not ready for production” banner in the
 documentation.

My first guess is that the MDS cache is just too small and it is 
thrashing.  Try

 ceph mds tell 0 injectargs '--mds-cache-size 1000000'

That's 10x bigger than the default, though be aware that it will eat up 10x 
as much RAM too.
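
A minimal sketch of making that change stick across MDS restarts (assuming a standard
ceph.conf layout; the runtime line is just the injectargs call above repeated):

    # runtime change on the active MDS (rank 0):
    ceph mds tell 0 injectargs '--mds-cache-size 1000000'

    # persistent setting, in the [mds] section of ceph.conf on the MDS host:
    [mds]
        mds cache size = 1000000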

We've also seen the cache behave in a non-optimal way when evicting 
things, making it thrash more often than it should.  I'm hoping we can 
implement something like MQ instead of our two-level LRU, but it isn't 
high on the priority list right now.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-17 Thread Christian Balzer

On Mon, 17 Nov 2014 13:31:57 -0800 Craig Lewis wrote:

 I use `dd` to force activity to the disk I want to replace, and watch the
 activity lights.  That only works if your disks aren't 100% busy.  If
 they are, stop the ceph-osd daemon, and see which drive stops having
 activity. Repeat until you're 100% confident that you're pulling the
 right drive.

I use smartctl for lighting up the disk, but same diff. 
JBOD can become a big PITA quickly with large deployments and if you don't
have people with sufficient skill doing disk replacements.

Also depending on how a disk died you might not be able to reclaim the
drive ID (sdc for example) without a reboot, making things even more
confusing. 

Some RAID cards in IT/JBOD mode _will_ actually light up the fail LED if
a disk fails and/or have tools to blink a specific disk. 
However, with the latter the task of matching a disk from the controller's
perspective to what linux enumerated it as is still on you.

Ceph might scale up to really large deployments, but you had better have a
well-staffed data center to go with that, or deploy it in a non-JBOD
fashion. 
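
For reference, the activity-LED tricks mentioned above boil down to something like
this (the device name is only an example; triple-check it before pulling anything):

    # steady reads make one drive's activity LED stand out
    dd if=/dev/sdc of=/dev/null bs=1M count=4096
    # a short SMART self-test also exercises just that drive
    smartctl -t short /dev/sdc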

Christian

 On Wed, Nov 12, 2014 at 5:05 AM, SCHAER Frederic frederic.sch...@cea.fr
 wrote:
 
   Hi,
 
 
 
  I’m used to RAID software giving me the failing disks & slots, and most
  often blinking the disks on the disk bays.
 
  I recently installed a  DELL “6GB HBA SAS” JBOD card, said to be an LSI
  2008 one, and I now have to identify 3 pre-failed disks (so says
  S.M.A.R.T) .
 
 
 
  Since this is an LSI, I thought I’d use MegaCli to identify the disks
  slot, but MegaCli does not see the HBA card.
 
  Then I found the LSI “sas2ircu” utility, but again, this one fails at
  giving me the disk slots (it finds the disks, serials and others, but
  slot is always 0)
 
  Because of this, I’m going to head over to the disk bay and unplug the
  disk which I think corresponds to the alphabetical order in linux, and
  see if it’s the correct one…. But even if this is correct this time,
  it might not be next time.
 
 
 
  But this makes me wonder : how do you guys, Ceph users, manage your
  disks if you really have JBOD servers ?
 
  I can’t imagine having to guess slots that each time, and I can’t
  imagine neither creating serial number stickers for every single disk
  I could have to manage …
 
  Is there any specific advice regarding JBOD cards people should (not)
  use in their systems ?
 
  Any magical way to “blink” a drive in linux ?
 
 
 
  Thanks & regards
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS unresponsive at scale (2M files,

2014-11-17 Thread Kevin Sumner
 On Nov 17, 2014, at 15:52, Sage Weil s...@newdream.net wrote:
 
 On Mon, 17 Nov 2014, Kevin Sumner wrote:
 I’ve got a test cluster together with ~500 OSDs, 5 MONs, and 1 MDS.  All
 the OSDs also mount CephFS at /ceph.  I’ve got Graphite pointing at a space
 under /ceph.  Over the weekend, I drove almost 2 million metrics, each of
 which creates a ~3MB file in a hierarchical path, each sending a datapoint
 into the metric file once a minute.  CephFS seemed to handle the writes ok
 while I was driving load.  All files containing each metric are at paths
 like this:
 /ceph/whisper/sandbox/cephtest-osd0013/2/3/4/5.wsp
 
 Today, however, with the load generator still running, reading metadata of
 files (e.g. directory entries and stat(2) info) in the filesystem
 (presumably MDS-managed data) seems nearly impossible, especially deeper
 into the tree.  For example, in a shell cd seems to work but ls hangs,
 seemingly indefinitely.  After turning off the load generator and allowing a
 while for things to settle down, everything seems to behave better.
 
 ceph status and ceph health both return good statuses the entire time.
  During load generation, the ceph-mds process seems pegged at between 100%
 and 150%, but with load generation turned off, the process has some high
 variability from near-idle up to similar 100-150% CPU.
 
 Hopefully, I’ve missed something in the CephFS tuning.  However, I’m looking 
 for
 direction on figuring out if it is, indeed, a tuning problem or if this
 behavior is a symptom of the “not ready for production” banner in the
 documentation.
 
 My first guess is that the MDS cache is just too small and it is 
 thrashing.  Try
 
  ceph mds tell 0 injectargs '--mds-cache-size 1000000'
 
  That's 10x bigger than the default, though be aware that it will eat up 10x 
 as much RAM too.
 
  We've also seen the cache behave in a non-optimal way when evicting 
 things, making it thrash more often than it should.  I'm hoping we can 
 implement something like MQ instead of our two-level LRU, but it isn't 
 high on the priority list right now.
 
 sage


Thanks!  I’ll pursue mds cache size tuning.  Is there any guidance on setting 
the cache and other mds tunables correctly, or is it an adjust-and-test sort of 
thing?  Cursory searching doesn’t turn up any relevant documentation on 
ceph.com.  I’m plowing through some other list posts now.
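
In the absence of formal guidance, one way to take some of the guesswork out of it is
to watch the MDS cache counters while adjusting. A sketch, assuming the admin socket is
on the MDS host and the daemon id is mds.a:

    # comparing "inodes" against "inode_max" shows how full the metadata cache is
    ceph daemon mds.a perf dump | python -m json.tool | grep -E '"inode_max"|"inodes"'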
--
Kevin Sumner
ke...@sumner.io

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Troubleshooting an erasure coded pool with a cache tier

2014-11-17 Thread Christian Balzer

Hello,

On Mon, 17 Nov 2014 17:45:54 +0100 Laurent GUERBY wrote:

 Hi,
 
 Just a follow-up on this issue, we're probably hitting:
 
 http://tracker.ceph.com/issues/9285
 
 We had the issue a few weeks ago with replicated SSD pool in front of
 rotational pool and turned off cache tiering. 
 
 Yesterday we made a new test and activating cache tiering on a single
 erasure pool threw the whole ceph cluster performance to the floor
 (including non cached non erasure coded pools) with frequent slow
 write in the logs. Removing cache tiering was enough to go back to
 normal performance.

Ouch!
 
 I assume no one use cache tiering on 0.80.7 in production clusters?

Not me, and now I'm even less inclined to do so.
This particular item is not the first one that puts cache tiers in
doubt, but it is certainly the most compelling one.

I wonder how much pressure was on that cache tier, though. 
If I understand the bug report correctly, this should only happen if
some object gets evicted before it was fully replicated.
So I suppose if the cache pool is sized correctly for the working set in
question (which of course is a bugger given a 4MB granularity), things
should work. Until you hit the threshold and they don't anymore...
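
Sizing it for the working set mostly comes down to the flush/evict knobs on the cache
pool; a rough sketch with a hypothetical pool name and purely illustrative values:

    ceph osd pool set cachepool target_max_bytes 107374182400   # hard cap, ~100 GB
    ceph osd pool set cachepool cache_target_dirty_ratio 0.4    # start flushing at 40% dirty
    ceph osd pool set cachepool cache_target_full_ratio 0.8     # start evicting at 80% full
    ceph osd pool set cachepool cache_min_flush_age 600         # seconds
    ceph osd pool set cachepool cache_min_evict_age 1800        # seconds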

Given that this isn't fixed in Giant either, there goes my plan to speed
up a cluster with ample space but insufficient IOPS with cache tiering.

Christian
 
 Sincerely,
 
 Laurent
 
 Le Sunday 09 November 2014 à 00:24 +0100, Loic Dachary a écrit :
  
  On 09/11/2014 00:03, Gregory Farnum wrote:
   It's all about the disk accesses. What's the slow part when you dump
   historic and in-progress ops?
  
  This is what I see on g1 (6% iowait)
  
  root@g1:~# ceph daemon osd.0 dump_ops_in_flight
  { num_ops: 0,
ops: []}
  
  root@g1:~# ceph daemon osd.0 dump_ops_in_flight
  { num_ops: 1,
ops: [
  { description: osd_op(client.4407100.0:11030174
  rb.0.410809.238e1f29.1038 [set-alloc-hint object_size 4194304
  write_size 4194304,write 4095488~4096] 58.3aabb66d ack+ondisk+write
  e15613), received_at: 2014-11-09 00:14:17.385256, age:
  0.538802, duration: 0.011955, type_data: [
  waiting for sub ops,
  { client: client.4407100,
tid: 11030174},
  [
  { time: 2014-11-09 00:14:17.385393,
event: waiting_for_osdmap},
  { time: 2014-11-09 00:14:17.385563,
event: reached_pg},
  { time: 2014-11-09 00:14:17.385793,
event: started},
  { time: 2014-11-09 00:14:17.385807,
event: started},
  { time: 2014-11-09 00:14:17.385875,
event: waiting for subops from 1,10},
  { time: 2014-11-09 00:14:17.386201,
event: commit_queued_for_journal_write},
  { time: 2014-11-09 00:14:17.386336,
event: write_thread_in_journal_buffer},
  { time: 2014-11-09 00:14:17.396293,
event: journaled_completion_queued},
  { time: 2014-11-09 00:14:17.396332,
event: op_commit},
  { time: 2014-11-09 00:14:17.396678,
event: op_applied},
  { time: 2014-11-09 00:14:17.397211,
event: sub_op_commit_rec}]]}]}
  
  and it looks ok. When I go to n7 which has 20% iowait, I see a much
  larger output http://pastebin.com/DPxsaf6z which includes a number of
  event: waiting_for_osdmap.
  
  I'm not sure what to make of this and it would certainly be better if
  n7 had a lower iowait. Also when I ceph -w I see a new pgmap is
  created every second which is also not a good sign.
  
  2014-11-09 00:22:47.090795 mon.0 [INF] pgmap v4389613: 460 pgs: 460
  active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail;
  3889 B/s rd, 2125 kB/s wr, 237 op/s 2014-11-09 00:22:48.143412 mon.0
  [INF] pgmap v4389614: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB
  used, 18850 GB / 26955 GB avail; 1586 kB/s wr, 204 op/s 2014-11-09
  00:22:49.172794 mon.0 [INF] pgmap v4389615: 460 pgs: 460 active+clean;
  2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 343 kB/s wr, 88
  op/s 2014-11-09 00:22:50.222958 mon.0 [INF] pgmap v4389616: 460 pgs:
  460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB
  avail; 412 kB/s wr, 130 op/s 2014-11-09 00:22:51.281294 mon.0 [INF]
  pgmap v4389617: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used,
  18850 GB / 26955 GB avail; 1195 kB/s wr, 167 op/s 2014-11-09
  00:22:52.318895 mon.0 [INF] pgmap v4389618: 460 pgs: 460 active+clean;
  2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 5864 B/s rd,
  2762 kB/s wr, 206 op/s
  
  Cheers
  
  
   On Sat, Nov 8, 2014 at 2:30 PM Loic Dachary l...@dachary.org
   mailto:l...@dachary.org wrote:
   
   

Re: [ceph-users] osd crashed while there was no space

2014-11-17 Thread han vincent
hi, craig:

Your solution did work very well. But if the data is important, removing PG
directories from the OSDs by hand means a small mistake can result in loss of
data. And if the cluster is very large, don't you think deleting data on each
disk to get from 100% down to 95% is a tedious and error-prone job, with so
many OSDs, such large disks, and so on?

So my key question is: if there is no space left in the cluster when some
OSDs have crashed, why does the cluster still choose to migrate data? And
during that migration, the other OSDs crash one by one until the cluster can
no longer work.

2014-11-18 5:28 GMT+08:00 Craig Lewis cle...@centraldesktop.com:
 At this point, it's probably best to delete the pool.  I'm assuming the pool
 only contains benchmark data, and nothing important.

 Assuming you can delete the pool:
 First, figure out the ID of the data pool.  You can get that from ceph osd
 dump | grep '^pool'

 Once you have the number, delete the data pool: rados rmpool data data
 --yes-i-really-really-mean-it

 That will only free up space on OSDs that are up.  You'll need to manually
 delete some PGs on the OSDs that are 100% full.  Go to
 /var/lib/ceph/osd/ceph-OSDID/current, and delete a few directories that
 start with your data pool ID.  You don't need to delete all of them.  Once
 the disk is below 95% full, you should be able to start that OSD.  Once it's
 up, it will finish deleting the pool.

 If you can't delete the pool, it is possible, but it's more work, and you
 still run the risk of losing data if you make a mistake.  You need to
 disable backfilling, then delete some PGs on each OSD that's full. Try to
 only delete one copy of each PG.  If you delete every copy of a PG on all
 OSDs, then you lost the data that was in that PG.  As before, once you
 delete enough that the disk is less than 95% full, you can start the OSD.
 Once you start it, start deleting your benchmark data out of the data pool.
 Once that's done, you can re-enable backfilling.  You may need to scrub or
 deep-scrub the OSDs you deleted data from to get everything back to normal.


 So how did you get the disks 100% full anyway?  Ceph normally won't let you
 do that.  Did you increase mon_osd_full_ratio, osd_backfill_full_ratio, or
 osd_failsafe_full_ratio?
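
A quick way to check those thresholds on a live OSD is via the admin socket (assuming
the default socket path and an osd.0 on the local host):

    ceph daemon osd.0 config show | grep full_ratio
    # defaults are roughly: mon_osd_full_ratio 0.95, mon_osd_nearfull_ratio 0.85,
    # osd_backfill_full_ratio 0.85, osd_failsafe_full_ratio 0.97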


 On Mon, Nov 17, 2014 at 7:00 AM, han vincent hang...@gmail.com wrote:

 hello, every one:

 These days a problem of ceph has troubled me for a long time.

 I built a cluster with 3 hosts and each host has three OSDs in it.
 And after that
 I used the command rados bench 360 -p data -b 4194304 -t 300 write
 --no-cleanup
 to test the write performance of the cluster.

 When the cluster was near full, no more data could be written to it.
 Unfortunately, a host hung up, and then a lot of PGs started to migrate to
 other OSDs. After a while, a lot of OSDs were marked down and out, and my
 cluster couldn't work any more.

 The following is the output of ceph -s:

 cluster 002c3742-ab04-470f-8a7a-ad0658b547d6
 health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs
 incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625
 pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean;
 recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons
 down, quorum 0,2 2,1
  monmap e1: 3 mons at
 {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election
 epoch 40, quorum 0,2 2,1
  osdmap e173: 9 osds: 2 up, 2 in
 flags full
   pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects
 37541 MB used, 3398 MB / 40940 MB avail
 945/29649 objects degraded (3.187%)
   34 stale+active+degraded+remapped
  176 stale+incomplete
  320 stale+down+peering
   53 active+degraded+remapped
  408 incomplete
1 active+recovering+degraded
  673 down+peering
1 stale+active+degraded
   15 remapped+peering
3 stale+active+recovering+degraded+remapped
3 active+degraded
   33 remapped+incomplete
8 active+recovering+degraded+remapped

 The following is the output of ceph osd tree:
 # id    weight  type name       up/down reweight
 -1  9   root default
 -3  9   rack unknownrack
 -2  3   host 10.0.0.97
  0   1   osd.0   down0
  1   1   osd.1   down0
  2   1   osd.2   down0
  -4  3   host 10.0.0.98
  3   1   osd.3   down0
  4   1   osd.4   down0
  5   1   osd.5   down0
  -5  3   host 10.0.0.70
  6   1   osd.6   up

Re: [ceph-users] Troubleshooting an erasure coded pool with a cache tier

2014-11-17 Thread Laurent GUERBY
Le Tuesday 18 November 2014 à 10:11 +0900, Christian Balzer a écrit :
 Hello,
 
 On Mon, 17 Nov 2014 17:45:54 +0100 Laurent GUERBY wrote:
 
  Hi,
  
  Just a follow-up on this issue, we're probably hitting:
  
  http://tracker.ceph.com/issues/9285
  
 I wonder how much pressure was on that cache tier, though. 
 If I understand the bug report correctly, this should only happen if
 some object gets evicted before it was fully replicated.
 So I suppose if the cache pool is sized correctly for the working set in
 question (which of course is a bugger given a 4MB granularity), things
 should work. Until you hit the threshold and they don't anymore...

Hi,

Same experience here: a 10 GB size=3 min=2 cache on a 1 TB 4+1 ec pool
and a 500 GB size=3 min=2 cache on an 8 TB 3+1 ec pool (5 hosts, 9
rotational disks total).
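
For anyone wanting to reproduce a setup like that, the tiering is typically wired up
and torn down with commands along these lines; a sketch only, with made-up pool and
profile names and the 10 GB figure above as the cap:

    # 4+1 erasure coded pool plus a small replicated cache pool
    ceph osd erasure-code-profile set ec41 k=4 m=1
    ceph osd pool create ecpool 256 256 erasure ec41
    ceph osd pool create cachepool 128 128
    ceph osd pool set cachepool size 3
    ceph osd pool set cachepool min_size 2

    # attach the cache tier in writeback mode, capped at ~10 GB
    ceph osd tier add ecpool cachepool
    ceph osd tier cache-mode cachepool writeback
    ceph osd tier set-overlay ecpool cachepool
    ceph osd pool set cachepool target_max_bytes 10737418240

    # and tearing it down again ("removing cache tiering"):
    ceph osd tier cache-mode cachepool forward
    rados -p cachepool cache-flush-evict-all
    ceph osd tier remove-overlay ecpool
    ceph osd tier remove ecpool cachepool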

We also noticed that, well after we deleted the cache and ec pools, we
still had frequent slow writes until we restarted some of the affected
OSDs. Now slow writes are very rare: a short episode of roughly ten
seconds every few hours, according to the logs.

Let's hope the ceph developers fix this bug so that people can give
erasure coding more testing; I have added a comment on the ticket.

Sincerely,

Laurent

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com