Re: [ceph-users] PG num calculator live on Ceph.com

2015-01-08 Thread Michael J. Kidd
Lindsay,
  Yes, I would suggest starting with the 'RBD and libRados' use case from
the drop down, then adjusting the percentages / pool names (if you desire)
as appropriate.  I don't have a ton of experience with CephFS, but I would
suspect that the metadata is less than 5% of the total data usage across
those two pools.

I welcome anyone with more CephFS experience to weigh in on this! :)

Thanks,

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Jan 7, 2015 at 3:59 PM, Lindsay Mathieson 
lindsay.mathie...@gmail.com wrote:

 With cephfs we have the two pools - data & metadata. Does that affect the
 pg calculations? The metadata pool will have substantially less data than the
 data pool.


 --
 Lindsay

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG num calculator live on Ceph.com

2015-01-07 Thread Michael J. Kidd
Hello Bill,
  Either 2048 or 4096 should be acceptable.  4096 gives about a 300 PG per
OSD ratio, which would leave room for tripling the OSD count without
needing to increase the PG number, while 2048 gives about 150 PGs per OSD,
leaving room for only about a 50% OSD count expansion.
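
For reference, that ratio is just (pg_num * replicas) / OSD count, so with
your 40 OSDs and size 3:

4096 * 3 / 40 = ~307 PGs per OSD
2048 * 3 / 40 = ~153 PGs per OSD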

The high PG count per OSD issue really doesn't manifest aggressively until
you get to around 1000 PGs per OSD and beyond.  At those levels, steady state
operation continues without issue.. but recovery within the cluster will
see the memory utilization of the OSDs climb and could push into out-of-memory
conditions on the OSD host (or at a minimum, heavy swap usage if
enabled).  It still depends, of course, on the number of OSDs per node and the
amount of memory on the node as to whether you'll actually experience issues
or not.

As an example though, I worked on a cluster which was at about 5500 PGs per
OSD.  The cluster experienced a network config issue in the switchgear
which isolated 2/3 of the OSD nodes from each other and from the other 1/3 of
the cluster.  When the network issue was cleared, the OSDs started dropping
like flies... They'd start up, spool up the memory they needed for map
update parsing, and get killed before making any real headway.  We were
finally able to get the cluster online by limiting what the OSDs were doing
to a small slice of the normal start-up, waiting for the OSDs to calm down,
then opening up a bit more for them to do (noup, noin, norecover,
nobackfill, pause, noscrub and nodeep-scrub were all set, and then unset one
at a time until all OSDs were up/in and able to handle the recovery).

6 weeks later, that same cluster lost about 40% of the OSDs during a power
outage due to corruption from an HBA bug (it didn't flush the write cache
to disk).  This pushed the PG per OSD count over 9000!!  It simply couldn't
recover with the available memory at that PG count.  Each OSD, started by
itself, would consume > 60 GB of RAM and get killed (the nodes only had 64 GB
total).

While this is an extreme example... we see cases generated with > 1000 PGs
per OSD on a regular basis.  This is the type of thing we're trying to head
off.

It should be noted that you can increase the PG num of a pool, but cannot
decrease it!  The only way to reduce your cluster PG count is to create new,
smaller PG num pools, migrate the data, and then delete the old, high PG
count pools.  You could also simply add more OSDs to reduce the PG per OSD
ratio.
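
If you do need to bump a pool up, the general form is below (the pool name
and target count are just placeholders); remember to raise pgp_num as well so
the new PGs actually get rebalanced:

ceph osd pool set <poolname> pg_num 2048
ceph osd pool set <poolname> pgp_num 2048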

The issue with too few PGs is poor data distribution.  So it's all about
having enough PGs to get good data distribution without going too high and
having resource exhaustion during recovery.
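
The sizing rule behind the calculator is roughly (OSD count * target PGs per
OSD) / replica count, rounded up to the next power of 2.  With a target of
~100 PGs per OSD:

(40 OSDs * 100) / 3 replicas = ~1333  ->  2048

which is where the 2048 recommendation above comes from.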

Hope this helps put things into perspective.

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Jan 7, 2015 at 4:34 PM, Sanders, Bill bill.sand...@teradata.com
wrote:

  This is interesting.  Kudos to you guys for getting the calculator up, I
 think this'll help some folks.

 I have 1 pool, 40 OSDs, and replica of 3.  I based my PG count on:
 http://ceph.com/docs/master/rados/operations/placement-groups/

 '''
 Less than 5 OSDs set pg_num to 128
 Between 5 and 10 OSDs set pg_num to 512
 Between 10 and 50 OSDs set pg_num to 4096
 '''

 But the calculator gives a different result of 2048.  Out of curiosity,
 what sorts of issues might one encounter by having too many placement
 groups?  I understand there's some resource overhead.  I don't suppose it
 would manifest itself in a recognizable way?

 Bill

 --
 From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
 Michael J. Kidd [michael.k...@inktank.com]
 Sent: Wednesday, January 07, 2015 3:51 PM
 To: Loic Dachary
 Cc: ceph-us...@ceph.com
 Subject: Re: [ceph-users] PG num calculator live on Ceph.com

 Where is the source ?
  On the page.. :)  It does link out to jquery and jquery-ui, but all the
 custom bits are embedded in the HTML.

  Glad it's helpful :)

   Michael J. Kidd
 Sr. Storage Consultant
 Inktank Professional Services
   - by Red Hat

 On Wed, Jan 7, 2015 at 3:46 PM, Loic Dachary l...@dachary.org wrote:



 On 07/01/2015 23:08, Michael J. Kidd wrote:
  Hello all,
Just a quick heads up that we now have a PG calculator to help
 determine the proper PG per pool numbers to achieve a target PG per OSD
 ratio.
 
  http://ceph.com/pgcalc
 
  Please check it out!  Happy to answer any questions, and always welcome
 any feedback on the tool / verbiage, etc...

 Great work ! That will be immensely useful :-)

 Where is the source ?

 Cheers

 
  As an aside, we're also working to update the documentation to reflect
 the best practices.  See Ceph.com tracker for this at:
  http://tracker.ceph.com/issues/9867
 
  Thanks!
  Michael J. Kidd
  Sr. Storage Consultant
  Inktank Professional Services
   - by Red Hat
 
 
   ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph

[ceph-users] PG num calculator live on Ceph.com

2015-01-07 Thread Michael J. Kidd
Hello all,
  Just a quick heads up that we now have a PG calculator to help determine
the proper PG per pool numbers to achieve a target PG per OSD ratio.

http://ceph.com/pgcalc

Please check it out!  Happy to answer any questions, and always welcome any
feedback on the tool / verbiage, etc...

As an aside, we're also working to update the documentation to reflect the
best practices.  See Ceph.com tracker for this at:
http://tracker.ceph.com/issues/9867

Thanks!
Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG num calculator live on Ceph.com

2015-01-07 Thread Michael J. Kidd
Hello Christopher,
  Keep in mind that the PGs per OSD (and per pool) calculations take into
account the replica count (the pool 'size' parameter).  So, for example, if
you're using a default of 3 replicas, 16 * 3 = 48 PG copies, which allows for
at least one PG per OSD on that pool.  Even with size=2, the resulting 32
copies still give very close to 1 PG per OSD.  Being that it's such a low
utilization pool, this is still sufficient.
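
As a concrete check against your 36 OSDs: 16 PGs * 3 replicas = 48 PG copies
/ 36 OSDs = ~1.3 per OSD, and with size=2, 16 * 2 = 32 copies / 36 OSDs =
~0.9 per OSD.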

Thanks,
Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Jan 7, 2015 at 3:17 PM, Christopher O'Connell c...@sendfaster.com
wrote:

 Hi,

 I'm playing with this with a modest-sized ceph cluster (36x6TB disks).
 Based on this it says that small pools (such as .users) would have just 16
 PGs. Is this correct? I've historically always made even these small pools
 have at least as many PGs as the next power of 2 over my number of OSDs (64
 in this case).

 All the best,

 ~ Christopher

 On Wed, Jan 7, 2015 at 3:08 PM, Michael J. Kidd michael.k...@inktank.com
 wrote:

 Hello all,
   Just a quick heads up that we now have a PG calculator to help
 determine the proper PG per pool numbers to achieve a target PG per OSD
 ratio.

 http://ceph.com/pgcalc

 Please check it out!  Happy to answer any questions, and always welcome
 any feedback on the tool / verbiage, etc...

 As an aside, we're also working to update the documentation to reflect
 the best practices.  See Ceph.com tracker for this at:
 http://tracker.ceph.com/issues/9867

 Thanks!
 Michael J. Kidd
 Sr. Storage Consultant
 Inktank Professional Services
  - by Red Hat

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG num calculator live on Ceph.com

2015-01-07 Thread Michael J. Kidd
 Where is the source ?
On the page.. :)  It does link out to jquery and jquery-ui, but all the
custom bits are embedded in the HTML.

Glad it's helpful :)

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Jan 7, 2015 at 3:46 PM, Loic Dachary l...@dachary.org wrote:



 On 07/01/2015 23:08, Michael J. Kidd wrote:
  Hello all,
Just a quick heads up that we now have a PG calculator to help
 determine the proper PG per pool numbers to achieve a target PG per OSD
 ratio.
 
  http://ceph.com/pgcalc
 
  Please check it out!  Happy to answer any questions, and always welcome
 any feedback on the tool / verbiage, etc...

 Great work ! That will be immensely useful :-)

 Where is the source ?

 Cheers

 
  As an aside, we're also working to update the documentation to reflect
 the best practices.  See Ceph.com tracker for this at:
  http://tracker.ceph.com/issues/9867
 
  Thanks!
  Michael J. Kidd
  Sr. Storage Consultant
  Inktank Professional Services
   - by Red Hat
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

 --
 Loïc Dachary, Artisan Logiciel Libre


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD process exhausting server memory

2014-10-30 Thread Michael J. Kidd
Hello Lukas,
  The 'slow request' logs are expected while the cluster is in such a
state.. the OSD processes simply aren't able to respond quickly to client
IO requests.

I would recommend trying to recover without the most problematic disk
(seems to be OSD.10?).  Simply shut it down and see if the other OSDs
settle down.  You should also take a look at the kernel logs for any
indications of a problem with the disks themselves, or possibly run an FIO
test with the OSD shut down (against a file on the OSD filesystem, not the
raw drive, since writing to the raw device would be destructive).
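
Something along these lines would do as a quick check (the path, size and
runtime here are just placeholders; run it only while the OSD is stopped, and
delete the test file afterwards):

fio --name=osd10-check --filename=/var/lib/ceph/osd/ceph-10/fio-test.bin \
    --size=1G --bs=4k --rw=randwrite --direct=1 --runtime=60 --time_based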

Also, you could upgrade to 0.80.7.  There are some bug fixes, but I'm not
sure if any would specifically help this situation; it's not likely to hurt,
though.

The desired state is for the cluster to be steady-state before the next
move (unsetting the next flag).  Hopefully this can be achieved without
needing to take down OSDs in multiple hosts.

I'm also unsure about the cache tiering and how it could relate to the load
being seen.

Hope this helps...

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Thu, Oct 30, 2014 at 4:00 AM, Lukáš Kubín lukas.ku...@gmail.com wrote:

 Hi,
 I've noticed the following messages always accumulate in OSD log before it
 exhausts all memory:

 2014-10-30 08:48:42.994190 7f80a2019700  0 log [WRN] : slow request
 38.901192 seconds old, received at 2014-10-30 08:48:04.092889:
 osd_op(osd.29.3076:207644827 rbd_data.2e4ee3ba663be.363b@17
 [copy-get max 8388608] 7.af87e887
 ack+read+ignore_cache+ignore_overlay+map_snap_clone e3359) v4 currently
 reached pg


 Note this is always from the most frequently failing osd.10 (sata tier)
 referring to osd.29 (ssd cache tier). That osd.29 is consuming huge CPU and
 memory resources, but keeps running without failures.

 Can this be eg. a bug? Or some erroneous I/O request which initiated this
 behaviour? Can I eg. attempt to upgrade the Ceph to a more recent release
 in the current unhealthy status of the cluster? Can I eg. try disabling the
 caching tier? Or just somehow evacuate the problematic OSD?

 I'll welcome any ideas. Currently, I'm keeping the osd.10 in an automatic
 restart loop with 60 seconds pause before starting again.

 Thanks and greetings,

 Lukas

 On Wed, Oct 29, 2014 at 8:04 PM, Lukáš Kubín lukas.ku...@gmail.com
 wrote:

 I should have figured that out myself since I did that recently. Thanks.

 Unfortunately, I'm still at the step ceph osd unset noin. After setting
 all the OSDs in, the original issue reappears, preventing me from proceeding
 with recovery. It now appears mostly at a single OSD - osd.10 - which consumes
 ~200% CPU and all memory within 45 seconds and is then killed by Linux:

 Oct 29 18:24:38 q09 kernel: Out of memory: Kill process 17202 (ceph-osd)
 score 912 or sacrifice child
 Oct 29 18:24:38 q09 kernel: Killed process 17202, UID 0, (ceph-osd)
 total-vm:62713176kB, anon-rss:62009772kB, file-rss:328kB


 I've tried to restart it several times with same result. Similar
 situation with OSDs 0 and 13.

 Also, I've noticed one of SSD cache tier's OSD - osd.29 generating high
 CPU utilization around 180%.

 All the problematic OSDs have been the same ones all the time - OSD
 0, 8, 10, 13 and 29 - they are the ones which I found to be down this morning.

 There is some minor load coming from client - Openstack instances, I
 preferred not to kill them:

 [root@q04 ceph-recovery]# ceph -s
 cluster ec433b4a-9dc0-4d08-bde4-f1657b1fdb99
  health HEALTH_ERR 31 pgs backfill; 241 pgs degraded; 62 pgs down;
 193 pgs incomplete; 13 pgs inconsistent; 62 pgs peering; 12 pgs recovering;
 205 pgs recovery_wait; 93 pgs stuck inactive; 608 pgs stuck unclean; 381138
  requests are blocked > 32 sec; recovery 1162468/35207488 objects degraded
 (3.302%); 466/17112963 unfound (0.003%); 13 scrub errors; 1/34 in osds are
 down; nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
  monmap e2: 3 mons at {q03=
 10.255.253.33:6789/0,q04=10.255.253.34:6789/0,q05=10.255.253.35:6789/0},
 election epoch 92, quorum 0,1,2 q03,q04,q05
  osdmap e2782: 34 osds: 33 up, 34 in
 flags nobackfill,norecover,noscrub,nodeep-scrub
   pgmap v7440374: 5632 pgs, 7 pools, 1449 GB data, 16711 kobjects
 3148 GB used, 15010 GB / 18158 GB avail
 1162468/35207488 objects degraded (3.302%); 466/17112963
 unfound (0.003%)
   13 active
   22 active+recovery_wait+remapped
1 active+recovery_wait+inconsistent
 4794 active+clean
  193 incomplete
   62 down+peering
9 active+degraded+remapped+wait_backfill
  182 active+recovery_wait
   74 active+remapped
   12 active+recovering
   12 active+clean+inconsistent
   22 active+remapped+wait_backfill
4 active+clean+replay
  232

Re: [ceph-users] OSD process exhausting server memory

2014-10-30 Thread Michael J. Kidd
Hello Lukas,
  Unfortunately, I'm all out of ideas at the moment.  There are some memory
profiling techniques which can help identify what is causing the memory
utilization, but it's a bit beyond what I typically work on.  Others on the
list may have experience with this (or otherwise have ideas) and may chip
in...

Wish I could be more help..

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Thu, Oct 30, 2014 at 11:00 AM, Lukáš Kubín lukas.ku...@gmail.com wrote:

 Thanks Michael, still no luck.

 Letting the problematic OSD.10 down has no effect. Within minutes more
 OSDs fail on the same issue after consuming ~50GB of memory. Also, I can see
 two of those cache-tier OSDs on separate hosts which remain utilized at almost
 200% CPU all the time.

 I've performed upgrade of all cluster to 0.80.7. Did not help.

 I have also tried to unset norecovery+nobackfill flags to attempt a
 recovery completion. No luck, several OSDs fail with the same issue
 preventing the recovery to complete. I've performed your fix steps from the
 start again and currently I'm behind the unset noin step.

 I could get some of pools to a state with no degraded objects temporarily.
 Then (within minutes) some OSD fails and it's degraded again.

 I have also tried to let the OSD processes get restarted automatically to
 keep them up as much as possible.

 I am considering disabling the tiering pool 'volumes-cache', as that's
 something I can do without:

 pool name   category KB  objects   clones
 degraded
 backups -  000
0
 data-  000
0
 images  -  777989590950270
 8883
 metadata-  000
0
 rbd -  000
0
 volumes -  11560869325965  179
 3307
 volumes-cache   -  649577103 16708730 9894
  1144650


 Can I just switch it into the forward mode and let it empty
 (cache-flush-evict-all) to see if that changes anything?

 Could you or any of your colleagues provide anything else to try?

 Thank you,

 Lukas


 On Thu, Oct 30, 2014 at 3:05 PM, Michael J. Kidd michael.k...@inktank.com
  wrote:

 Hello Lukas,
   The 'slow request' logs are expected while the cluster is in such a
 state.. the OSD processes simply aren't able to respond quickly to client
 IO requests.

  I would recommend trying to recover without the most problematic disk
  (seems to be OSD.10?).  Simply shut it down and see if the other OSDs
  settle down.  You should also take a look at the kernel logs for any
  indications of a problem with the disks themselves, or possibly run an FIO
  test with the OSD shut down (against a file on the OSD filesystem, not the
  raw drive, since writing to the raw device would be destructive).

  Also, you could upgrade to 0.80.7.  There are some bug fixes, but I'm not
  sure if any would specifically help this situation; it's not likely to hurt,
  though.

 The desired state is for the cluster to be steady-state before the next
 move (unsetting the next flag).  Hopefully this can be achieved without
 needing to take down OSDs in multiple hosts.

 I'm also unsure about the cache tiering and how it could relate to the
 load being seen.

 Hope this helps...

 Michael J. Kidd
 Sr. Storage Consultant
 Inktank Professional Services
  - by Red Hat

 On Thu, Oct 30, 2014 at 4:00 AM, Lukáš Kubín lukas.ku...@gmail.com
 wrote:

 Hi,
 I've noticed the following messages always accumulate in OSD log before
 it exhausts all memory:

 2014-10-30 08:48:42.994190 7f80a2019700  0 log [WRN] : slow request
 38.901192 seconds old, received at 2014-10-30 08:48:04.092889:
 osd_op(osd.29.3076:207644827 rbd_data.2e4ee3ba663be.363b@17
 [copy-get max 8388608] 7.af87e887
 ack+read+ignore_cache+ignore_overlay+map_snap_clone e3359) v4 currently
 reached pg


 Note this is always from the most frequently failing osd.10 (sata tier)
 referring to osd.29 (ssd cache tier). That osd.29 is consuming huge CPU and
 memory resources, but keeps running without failures.

 Can this be eg. a bug? Or some erroneous I/O request which initiated
 this behaviour? Can I eg. attempt to upgrade the Ceph to a more recent
 release in the current unhealthy status of the cluster? Can I eg. try
 disabling the caching tier? Or just somehow evacuate the problematic OSD?

 I'll welcome any ideas. Currently, I'm keeping the osd.10 in an
 automatic restart loop with 60 seconds pause before starting again.

 Thanks and greetings,

 Lukas

 On Wed, Oct 29, 2014 at 8:04 PM, Lukáš Kubín lukas.ku...@gmail.com
 wrote:

 I should have figured that out myself since I did that recently. Thanks.

 Unfortunately, I'm still at the step ceph osd unset noin. After
 setting all the OSDs in, the original

Re: [ceph-users] OSD process exhausting server memory

2014-10-29 Thread Michael J. Kidd
Hello Lukas,
  Please try the following process for getting all your OSDs up and
operational...

* Set the following flags: noup, noin, noscrub, nodeep-scrub, norecover,
nobackfill
for i in noup noin noscrub nodeep-scrub norecover nobackfill; do ceph osd
set $i; done

* Stop all OSDs (I know, this seems counter productive)
* Set all OSDs down / out
for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd down
$i; ceph osd out $i; done
* Set recovery / backfill throttles as well as heartbeat and OSD map
processing tweaks in the /etc/ceph/ceph.conf file under the [osd] section:
[osd]
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_max_single_start = 1
osd_backfill_scan_min = 8
osd_heartbeat_interval = 36
osd_heartbeat_grace = 240
osd_map_message_max = 1000
osd_map_cache_size = 3136

* Start all OSDs
* Monitor 'top' for 0% CPU on all OSD processes.. it may take a while.  I
usually issue 'top', then the keys M and c:
 - M = Sort by memory usage
 - c = Show command arguments
 - This allows to easily monitor the OSD process and know which OSDs have
settled, etc..
* Once all OSDs have hit 0% CPU utilization, remove the 'noup' flag
 - ceph osd unset noup
* Again, wait for 0% CPU utilization (may be immediate, may take a while..
just gotta wait)
* Once all OSDs have hit 0% CPU again, remove the 'noin' flag
 - ceph osd unset noin
 - All OSDs should now appear up/in, and will go through peering..
* Once ceph -s shows no further activity, and OSDs are back at 0% CPU
again, unset 'nobackfill'
 - ceph osd unset nobackfill
* Once ceph -s shows no further activity, and OSDs are back at 0% CPU
again, unset 'norecover'
 - ceph osd unset norecover
* Monitor OSD memory usage... some OSDs may get killed off again, but their
subsequent restart should consume less memory and allow more recovery to
occur between each step above.. and ultimately, hopefully... your entire
cluster will come back online and be usable.

## Clean-up:
* Remove all of the above set options from ceph.conf
* Reset the running OSDs to their defaults:
ceph tell osd.\* injectargs '--osd_max_backfills 10
--osd_recovery_max_active 15 --osd_recovery_max_single_start 5
--osd_backfill_scan_min 64 --osd_heartbeat_interval 6 --osd_heartbeat_grace
36 --osd_map_message_max 100 --osd_map_cache_size 500'
* Unset the noscrub and nodeep-scrub flags:
 - ceph osd unset noscrub
 - ceph osd unset nodeep-scrub


## For help identifying why memory usage was so high, please provide:
* ceph osd dump | grep pool
* ceph osd crush rule dump

Let us know if this helps... I know it looks extreme, but it's worked for
me in the past..


Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Oct 29, 2014 at 8:51 AM, Lukáš Kubín lukas.ku...@gmail.com wrote:

 Hello,
 I've found my ceph v0.80.3 cluster in a state with 5 of 34 OSDs being
 down through the night, after months of running without change. From the
 Linux logs I found out the OSD processes were killed because they consumed
 all available memory.

 Those 5 failed OSDs were from different hosts of my 4-node cluster (see
 below). Two hosts act as SSD cache tier in some of my pools. The other two
 hosts are the default rotational drives storage.

 After checking the Linux was not out of memory I've attempted to restart
 those failed OSDs. Most of those OSD daemon exhaust all memory in seconds
 and got killed by Linux again:

 Oct 28 22:16:34 q07 kernel: Out of memory: Kill process 24207 (ceph-osd)
 score 867 or sacrifice child
 Oct 28 22:16:34 q07 kernel: Killed process 24207, UID 0, (ceph-osd)
 total-vm:59974412kB, anon-rss:59076880kB, file-rss:512kB


 On the host I've found lots of similar slow request messages preceding
 the crash:

 2014-10-28 22:11:20.885527 7f25f84d1700  0 log [WRN] : slow request
 31.117125 seconds old, received at 2014-10-28 22:10:49.768291:
 osd_sub_op(client.168752.0:2197931 14.2c7
 888596c7/rbd_data.293272f8695e4.006f/head//14 [] v 1551'377417
 snapset=0=[]:[] snapc=0=[]) v10 currently no flag points reached
 2014-10-28 22:11:21.885668 7f25f84d1700  0 log [WRN] : 67 slow requests, 1
 included below; oldest blocked for > 9879.304770 secs


 Apparently I can't get the cluster fixed by restarting the OSDs all over
 again. Is there any other option then?

 Thank you.

 Lukas Kubin



 [root@q04 ~]# ceph -s
 cluster ec433b4a-9dc0-4d08-bde4-f1657b1fdb99
  health HEALTH_ERR 9 pgs backfill; 1 pgs backfilling; 521 pgs
 degraded; 425 pgs incomplete; 13 pgs inconsistent; 20 pgs recovering; 50
 pgs recovery_wait; 151 pgs stale; 425 pgs stuck inactive; 151 pgs stuck
  stale; 1164 pgs stuck unclean; 12070270 requests are blocked > 32 sec;
 recovery 887322/35206223 objects degraded (2.520%); 119/17131232 unfound
 (0.001%); 13 scrub errors
  monmap e2: 3 mons at {q03=
 10.255.253.33:6789/0,q04=10.255.253.34:6789/0,q05=10.255.253.35:6789/0},
 election epoch 90, quorum 0,1,2 q03,q04,q05
  osdmap e2194: 34 osds: 31 up, 31 in
   pgmap

Re: [ceph-users] OSD process exhausting server memory

2014-10-29 Thread Michael J. Kidd
Ah, sorry... since they were set out manually, they'll need to be set in
manually..

for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd in $i;
done



Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Oct 29, 2014 at 12:33 PM, Lukáš Kubín lukas.ku...@gmail.com wrote:

 I've ended up at step ceph osd unset noin. My OSDs are up, but not in,
 even after an hour:

 [root@q04 ceph-recovery]# ceph osd stat
  osdmap e2602: 34 osds: 34 up, 0 in
 flags nobackfill,norecover,noscrub,nodeep-scrub


 There seems to be no activity generated by the OSD processes; occasionally
 they show 0.3%, which I believe is just some basic communication processing.
 No load in network interfaces.

 Is there some other step needed to bring the OSDs in?

 Thank you.

 Lukas

 On Wed, Oct 29, 2014 at 3:58 PM, Michael J. Kidd michael.k...@inktank.com
  wrote:

 Hello Lukas,
   Please try the following process for getting all your OSDs up and
 operational...

 * Set the following flags: noup, noin, noscrub, nodeep-scrub, norecover,
 nobackfill
 for i in noup noin noscrub nodeep-scrub norecover nobackfill; do ceph osd
 set $i; done

 * Stop all OSDs (I know, this seems counter productive)
 * Set all OSDs down / out
 for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd down
 $i; ceph osd out $i; done
 * Set recovery / backfill throttles as well as heartbeat and OSD map
 processing tweaks in the /etc/ceph/ceph.conf file under the [osd] section:
 [osd]
 osd_max_backfills = 1
 osd_recovery_max_active = 1
 osd_recovery_max_single_start = 1
 osd_backfill_scan_min = 8
 osd_heartbeat_interval = 36
 osd_heartbeat_grace = 240
 osd_map_message_max = 1000
 osd_map_cache_size = 3136

 * Start all OSDs
 * Monitor 'top' for 0% CPU on all OSD processes.. it may take a while..
 I usually issue 'top' then, the keys M c
  - M = Sort by memory usage
  - c = Show command arguments
  - This allows to easily monitor the OSD process and know which OSDs have
 settled, etc..
 * Once all OSDs have hit 0% CPU utilization, remove the 'noup' flag
  - ceph osd unset noup
  * Again, wait for 0% CPU utilization (may be immediate, may take a
  while.. just gotta wait)
 * Once all OSDs have hit 0% CPU again, remove the 'noin' flag
  - ceph osd unset noin
  - All OSDs should now appear up/in, and will go through peering..
 * Once ceph -s shows no further activity, and OSDs are back at 0% CPU
 again, unset 'nobackfill'
  - ceph osd unset nobackfill
 * Once ceph -s shows no further activity, and OSDs are back at 0% CPU
 again, unset 'norecover'
  - ceph osd unset norecover
 * Monitor OSD memory usage... some OSDs may get killed off again, but
 their subsequent restart should consume less memory and allow more recovery
 to occur between each step above.. and ultimately, hopefully... your entire
 cluster will come back online and be usable.

 ## Clean-up:
 * Remove all of the above set options from ceph.conf
 * Reset the running OSDs to their defaults:
 ceph tell osd.\* injectargs '--osd_max_backfills 10
 --osd_recovery_max_active 15 --osd_recovery_max_single_start 5
 --osd_backfill_scan_min 64 --osd_heartbeat_interval 6 --osd_heartbeat_grace
 36 --osd_map_message_max 100 --osd_map_cache_size 500'
 * Unset the noscrub and nodeep-scrub flags:
  - ceph osd unset noscrub
  - ceph osd unset nodeep-scrub


 ## For help identifying why memory usage was so high, please provide:
 * ceph osd dump | grep pool
 * ceph osd crush rule dump

 Let us know if this helps... I know it looks extreme, but it's worked for
 me in the past..


 Michael J. Kidd
 Sr. Storage Consultant
 Inktank Professional Services
  - by Red Hat

 On Wed, Oct 29, 2014 at 8:51 AM, Lukáš Kubín lukas.ku...@gmail.com
 wrote:

 Hello,
 I've found my ceph v 0.80.3 cluster in a state with 5 of 34 OSDs being
 down through night after months of running without change. From Linux logs
 I found out the OSD processes were killed because they consumed all
 available memory.

 Those 5 failed OSDs were from different hosts of my 4-node cluster (see
 below). Two hosts act as SSD cache tier in some of my pools. The other two
 hosts are the default rotational drives storage.

 After checking the Linux was not out of memory I've attempted to restart
 those failed OSDs. Most of those OSD daemon exhaust all memory in seconds
 and got killed by Linux again:

 Oct 28 22:16:34 q07 kernel: Out of memory: Kill process 24207 (ceph-osd)
 score 867 or sacrifice child
 Oct 28 22:16:34 q07 kernel: Killed process 24207, UID 0, (ceph-osd)
 total-vm:59974412kB, anon-rss:59076880kB, file-rss:512kB


 On the host I've found lots of similar slow request messages preceding
 the crash:

 2014-10-28 22:11:20.885527 7f25f84d1700  0 log [WRN] : slow request
 31.117125 seconds old, received at 2014-10-28 22:10:49.768291:
 osd_sub_op(client.168752.0:2197931 14.2c7
 888596c7/rbd_data.293272f8695e4.006f/head//14 [] v 1551'377417
 snapset=0=[]:[] snapc=0=[]) v10

Re: [ceph-users] RBD for ephemeral

2014-05-19 Thread Michael J. Kidd
Since the status is 'Abandoned', it would appear that the fix has not been
merged into any release of OpenStack.

Thanks,

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services


On Sun, May 18, 2014 at 5:13 PM, Yuming Ma (yumima) yum...@cisco.com wrote:

  Wondering what is the status of this fix
 https://review.openstack.org/#/c/46879/? Which release has it?
 — Yuming

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD for ephemeral

2014-05-19 Thread Michael J. Kidd
After sending my earlier email, I found another commit that was merged in
March:
https://review.openstack.org/#/c/59149/

It seems to follow the newer image handling approach that was being sought,
which is what kept the first patch from being merged in.

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services


On Mon, May 19, 2014 at 11:20 AM, Pierre Grandin 
pierre.gran...@tubemogul.com wrote:

 Actually you can get the patched code from here for Havana :
 https://github.com/jdurgin/nova/tree/havana-ephemeral-rbd

 But I'm still trying to get it to work (in my case the volumes are still
 copies, and not copy-on-write).


 On Mon, May 19, 2014 at 7:19 AM, Michael J. Kidd michael.k...@inktank.com
  wrote:

 Since the status is 'Abandoned', it would appear that the fix has not
 been merged into any release of OpenStack.

 Thanks,

 Michael J. Kidd
 Sr. Storage Consultant
 Inktank Professional Services


  On Sun, May 18, 2014 at 5:13 PM, Yuming Ma (yumima) yum...@cisco.com wrote:

  Wondering what is the status of this fix
 https://review.openstack.org/#/c/46879/? Which release has it?
 — Yuming

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







 --
 Pierre Grandin | Senior Site Reliability Engineer
 M: 510.423.2231 | 559.217.2126 | @p_grandin (https://twitter.com/p_grandin)
 http://www.tubemogul.com/solutions/playtime/brandpoint

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph not replicating

2014-04-19 Thread Michael J. Kidd
You may also want to check your 'min_size'... if it's 2, then you'll be
incomplete even with 1 complete copy.

ceph osd dump | grep pool

You can reduce the min size with the following syntax:

ceph osd pool set <poolname> min_size 1

Thanks,
Michael J. Kidd

Sent from my mobile device.  Please excuse brevity and typographical errors.
On Apr 19, 2014 12:50 PM, Jean-Charles Lopez jc.lo...@inktank.com wrote:

 Hi again

 Looked at your ceph -s.

 You have only 2 OSDs, one on each node. The default replica count is 2, and
 the default crush map places each replica on a different host (or maybe you
 set it to 2 different OSDs). Anyway, when one of your OSDs goes down, Ceph
 can no longer find another OSD to host the second replica it must create.

 Looking at your crushmap we would know better.

 Recommendation: for testing efficiently and most options available,
 functionnally speaking, deploy a cluster with 3 nodes, 3 OSDs each is my
 best practice.

 Or make 1 node with 3 OSDs modifying your crushmap to choose type osd in
 your rulesets.

 JC


 On Saturday, April 19, 2014, Gonzalo Aguilar Delgado 
 gagui...@aguilardelgado.com wrote:

 Hi,

 I'm building a cluster where two nodes replicate objects inside. I found
 that shutting down just one of the nodes (the second one), makes everything
 incomplete.

 I cannot find why, since crushmap looks good to me.

 after shutting down one node

 cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
  health HEALTH_WARN 192 pgs incomplete; 96 pgs stuck inactive; 96 pgs
 stuck unclean; 1/2 in osds are down
  monmap e9: 1 mons at {blue-compute=172.16.0.119:6789/0}, election
 epoch 1, quorum 0 blue-compute
  osdmap e73: 2 osds: 1 up, 2 in
   pgmap v172: 192 pgs, 3 pools, 275 bytes data, 1 objects
 7552 kB used, 919 GB / 921 GB avail
  192 incomplete


 Both nodes have a WD Caviar Black 500 GB disk with a btrfs filesystem on it;
 the full disk is used.

 I cannot understand why it does not replicate to both nodes.

 Someone can help?

 Best regards,



 --
 Sent while moving
 Pardon my French and any spelling | grammar glitches


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph not replicating

2014-04-19 Thread Michael J. Kidd
 Can I safely remove the default pools?
Yes, as long as you're not using the default pools to store data, you can
delete them.

 Why is the total size about 1 TB? It should be about 500 GB, since there are
2 replicas.
I'm assuming that you're talking about the output of 'ceph df' or 'rados
df'. These commands report *raw* storage capacity; it's up to you to
divide the raw capacity by the number of replicas. It's this way
intentionally, since you could have multiple pools, each with a different
replica count.
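
Using the numbers from your earlier 'ceph -s' as an example: 921 GB of raw
capacity / 2 replicas = roughly 460 GB of usable space, which lines up with
the ~500 GB you were expecting.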


btw.. I'd strongly urge you to re-deploy your OSDs with XFS instead of
BTRFS. The last details I've seen show BTRFS slows drastically after only a
few hours with a high file count in the filesystem. Better to re-deploy now
than when you have data serving in production.

Thanks,

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services


On Sat, Apr 19, 2014 at 5:51 PM, Gonzalo Aguilar Delgado 
gagui...@aguilardelgado.com wrote:

 Hi Michael,

 It worked. I didn't realized of this because docs it installs two osd
 nodes and says that would become active+clean after installing them.
 (Something that didn't worked for me because the 3 replicas problem).

 http://ceph.com/docs/master/start/quick-ceph-deploy/

 Now I can shutdown second node and I can retrieve the data stored there.

 So last questions are:

 Can I safely remove the default pools?
 Why is the total size about 1 TB? It should be about 500 GB, since there are
 2 replicas.


 Thank you a lot for your help.

 PS: I will try now the openstack integration.


 El sáb, 19 de abr 2014 a las 6:53 , Michael J. Kidd 
 michael.k...@inktank.com escribió:

 You may also want to check your 'min_size'... if it's 2, then you'll be
 incomplete even with 1 complete copy.

 ceph osd dump | grep pool

 You can reduce the min size with the following syntax:

  ceph osd pool set <poolname> min_size 1

 Thanks,
 Michael J. Kidd

 Sent from my mobile device.  Please excuse brevity and typographical
 errors.
 On Apr 19, 2014 12:50 PM, Jean-Charles Lopez jc.lo...@inktank.com
 wrote:

 Hi again

 Looked at your ceph -s.

  You have only 2 OSDs, one on each node. The default replica count is 2, and
  the default crush map places each replica on a different host (or maybe you
  set it to 2 different OSDs). Anyway, when one of your OSDs goes down, Ceph
  can no longer find another OSD to host the second replica it must create.

 Looking at your crushmap we would know better.

 Recommendation: for testing efficiently and most options available,
 functionnally speaking, deploy a cluster with 3 nodes, 3 OSDs each is my
 best practice.

 Or make 1 node with 3 OSDs modifying your crushmap to choose type osd
 in your rulesets.

 JC


 On Saturday, April 19, 2014, Gonzalo Aguilar Delgado 
 gagui...@aguilardelgado.com wrote:

 Hi,

 I'm building a cluster where two nodes replicate objects inside. I found
 that shutting down just one of the nodes (the second one), makes everything
 incomplete.

 I cannot find why, since crushmap looks good to me.

 after shutting down one node

 cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
  health HEALTH_WARN 192 pgs incomplete; 96 pgs stuck inactive; 96
 pgs stuck unclean; 1/2 in osds are down
  monmap e9: 1 mons at {blue-compute=172.16.0.119:6789/0}, election
 epoch 1, quorum 0 blue-compute
  osdmap e73: 2 osds: 1 up, 2 in
   pgmap v172: 192 pgs, 3 pools, 275 bytes data, 1 objects
 7552 kB used, 919 GB / 921 GB avail
  192 incomplete


  Both nodes have a WD Caviar Black 500 GB disk with a btrfs filesystem on it;
  the full disk is used.

  I cannot understand why it does not replicate to both nodes.

 Someone can help?

 Best regards,



 --
 Sent while moving
 Pardon my French and any spelling | grammar glitches


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] No more Journals ?

2014-03-14 Thread Michael J. Kidd
Journals will default to being on-disk with the OSD if there is nothing
specified on the ceph-deploy line.  If you have a separate journal device,
then you should specify it per the original example syntax.
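
That is, something like the following, where the third component is the
journal device (host and device names here are only an example):

ceph-deploy osd prepare node2:/dev/sdb:/dev/sda5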

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services


On Fri, Mar 14, 2014 at 8:22 AM, Markus Goldberg goldb...@uni-hildesheim.de
 wrote:

 Sorry,
 I should have asked a little more clearly:
 Can ceph (or OSDs) be used without journals now?
 The journal parameter seems to be optional (because of the '[...]').

 Markus
 Am 14.03.2014 12:19, schrieb John Spray:

  Journals have not gone anywhere, and ceph-deploy still supports
 specifying them with exactly the same syntax as before.

 The page you're looking at is the simplified quick start, the detail
 on osd creation including journals is here:
 http://eu.ceph.com/docs/v0.77/rados/deployment/ceph-deploy-osd/

 Cheers,
 John

 On Fri, Mar 14, 2014 at 9:47 AM, Markus Goldberg
 goldb...@uni-hildesheim.de wrote:

 Hi,
 i'm a little bit surprised. I read through the new manuals of 0.77
 (http://eu.ceph.com/docs/v0.77/start/quick-ceph-deploy/)
 In the section of creating the osd the manual says:

 Then, from your admin node, use ceph-deploy to prepare the OSDs.

 ceph-deploy osd prepare {ceph-node}:/path/to/directory

 For example:

 ceph-deploy osd prepare node2:/var/local/osd0 node3:/var/local/osd1

 Finally, activate the OSDs.

 ceph-deploy osd activate {ceph-node}:/path/to/directory

 For example:

 ceph-deploy osd activate node2:/var/local/osd0 node3:/var/local/osd1


 In former versions the osd was created like:

 ceph-deploy -v --overwrite-conf osd --fs-type btrfs prepare
 bd-0:/dev/sdb:/dev/sda5

 ^^ Journal
 As I remember, defining and creating a journal for each OSD was a must.

 So the question is: are journals obsolete now?

 --
 MfG,
Markus Goldberg

 
 --
 Markus Goldberg   Universität Hildesheim
Rechenzentrum
 Tel +49 5121 88392822 Marienburger Platz 22, D-31141 Hildesheim, Germany
 Fax +49 5121 88392823 email goldb...@uni-hildesheim.de
 
 --


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 --
 MfG,
   Markus Goldberg

 --
 Markus Goldberg   Universität Hildesheim
   Rechenzentrum
 Tel +49 5121 88392822 Marienburger Platz 22, D-31141 Hildesheim, Germany
 Fax +49 5121 88392823 email goldb...@uni-hildesheim.de
 --


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very high latency values

2014-03-07 Thread Michael J. Kidd
Hello Dan,
  A couple of quick things...

* Latency counters are shown as a sum of all measured latencies over a period
of time, along with the count of operations included.  So to calculate the
average per-op latency, you must divide the sum by the count.  The sums are
expressed in seconds.

* The latency values you're showing there are from 'recoverystate_perf',
meaning they're not relevant to normal operations of the OSDs.  For that,
I'd recommend doing a perf dump against the OSD admin socket and looking at
the latency values under the 'osd' section (quick example below).

* I've not seen any documentation on each counter, aside from occasional
mailing list posts about specific counters..
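
For example, a rough way to pull an average op latency out of the admin
socket (this assumes the socket is at its default path and that jq is
installed; counter names can vary between releases):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | \
    jq '.osd.op_latency | .sum / .avgcount'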

Hope this helps!

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services


On Fri, Mar 7, 2014 at 11:39 AM, Dan Ryder (daryder) dary...@cisco.com wrote:

  Hello,



 I'm working with two different Ceph clusters, and in both clusters, I'm
 seeing very high latency values.



 Here's part of a sample perf dump:



  "recoverystate_perf": { "initial_latency": { "avgcount": 338,
        "sum": 0.069851000},
      "started_latency": { "avgcount": 1647,
        "sum": 322317122.940019000},
      "reset_latency": { "avgcount": 1985,
        "sum": 195.935076000},
      "start_latency": { "avgcount": 1985,
        "sum": 0.234355000},
      "primary_latency": { "avgcount": 266,
        "sum": 10819570.688122000},



 You can see both started latency and primary latency have extremely high
 values.



 Some info about the cluster:

 All nodes are on the same subnet - 2 VMs, 1 physical node

 VM1 is just a Monitor, VM2 is Monitor and OSD, Physical node is just an
 OSD.





 One additional question, are these latency values in milliseconds? Is
 there any documentation on the units for perf dump command?

 I've looked around but haven't seen anything.



 Thanks,

 Dan






  Dan Ryder
  ENGINEER.SOFTWARE ENGINEERING
  CSMTG Performance/Analytics
  dary...@cisco.com
  Phone: +1 919 392 7438

  Cisco Systems, Inc.
  7100-8 Kit Creek Road
  PO Box 14987
  27709-4987
  Research Triangle Park
  United States
  Cisco.com http://www.cisco.com/



 Think before you print.

 This email may contain confidential and privileged material for the sole
 use of the intended recipient. Any review, use, distribution or disclosure
 by others is strictly prohibited. If you are not the intended recipient (or
 authorized to receive for the recipient), please contact the sender by
 reply email and delete all copies of this message.

 For corporate legal information go to:
 http://www.cisco.com/web/about/doing_business/legal/cri/index.html



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pausing recovery when adding new machine

2014-03-07 Thread Michael J. Kidd
Hello Sid,
  You may try setting the 'noup' flag (instead of the 'noout' flag).  This
would prevent new OSDs from being set 'up' and therefore the data
rebalance shouldn't occur.  Once you add all OSDs, unset the 'noup'
flag and ensure they're marked 'up' automatically; if any are not, restarting
those OSD daemons should bring them up.
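
A minimal sketch of the sequence (assuming the OSDs are added with
ceph-deploy as below; adjust to taste):

ceph osd set noup
# ... ceph-deploy the new OSDs; they will stay 'down' while noup is set ...
ceph osd unset noup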

Hope this helps!

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services


On Fri, Mar 7, 2014 at 3:06 PM, Sidharta Mukerjee smukerje...@gmail.com wrote:

 When I use ceph-deploy to add a bunch of new OSDs (from a new machine),
 the ceph cluster starts rebalancing immediately; as a result, the first
 couple OSDs are started properly; but the last few can't start because I
 keep getting a timeout problem, as shown here:

 [root@ia6 ia_scripts]# service ceph start osd.24
 === osd.24 ===
 failed: 'timeout 10 /usr/bin/ceph --name=osd.24 
 --keyring=/var/lib/ceph/osd/ceph-24/keyring
 osd crush create-or-move -- 24 1.82 root=default host=ia6

 Is there a way I can pause the recovery so that the overall system
 behaves way faster and I can then start all the OSDs, make sure they're up
 and they look normal (via ceph osd tree) , and then unpause recovery?

 -Sid

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: files never stored on OSDs

2014-02-28 Thread Michael J. Kidd
Seems that you may also need to tell CephFS to use the new pool instead of
the default..

After CephFS is mounted, run:
# cephfs /mnt/ceph set_layout -p 4


Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services


On Fri, Feb 28, 2014 at 9:12 AM, Sage Weil s...@inktank.com wrote:

 Hi Florent,

 It sounds like the capability for the user you are authenticating as does
 not have access to the new OSD data pool.  Try doing

  ceph auth list

 and see if there is an osd cap that mentions the data pool but not the new
 pool you created; that would explain your symptoms.

 sage

 On Fri, 28 Feb 2014, Florent Bautista wrote:

  Hi all,
 
  Today I'm testing CephFS with client-side kernel drivers.
 
  My installation is composed of 2 nodes, each one with a monitor and an
 OSD.
  One of them is also MDS.
 
  root@test2:~# ceph -s
  cluster 42081905-1a6b-4b9e-8984-145afe0f22f6
   health HEALTH_OK
   monmap e2: 2 mons at {0=192.168.0.202:6789/0,1=192.168.0.200:6789/0
 },
  election epoch 18, quorum 0,1 0,1
   mdsmap e15: 1/1/1 up {0=0=up:active}
   osdmap e82: 2 osds: 2 up, 2 in
pgmap v4405: 384 pgs, 5 pools, 16677 MB data, 4328 objects
  43473 MB used, 2542 GB / 2584 GB avail
   384 active+clean
 
 
  I added data pool to MDS : ceph mds add_data_pool 4
 
  Then I created keyring for my client :
 
  ceph --id admin --keyring /etc/ceph/ceph.client.admin.keyring auth
  get-or-create client.test mds 'allow' osd 'allow * pool=CephFS' mon
 'allow
  *'  /etc/ceph/ceph.client.test.keyring
 
 
  And I mount FS with :
 
  mount -o
 name=test,secret=AQC9YhBT8CE9GhAAdgDiVLGIIgEleen4vkOp5w==,noatime
  -t ceph 192.168.0.200,192.168.0.202:/ /mnt/ceph
 
 
  The client could be Debian 7.4 (kernel 3.2) or Ubuntu 13.11 (kernel
 3.11).
 
  Mount is OK. I can write files to it. I can see files on every clients
  mounted.
 
  BUT...
 
  Where are stored my files ?
 
  My pool stays at 0 disk usage on rados df
 
  Disk usage of OSDs never grows...
 
  What did I miss ?
 
  When client A writes a file, I got Operation not permitted when client
 B
  reads the file, even if I sync FS.
 
  That sounds very strange to me, I think I missed something but I don't
 know
  what. Of course, no error in logs.
 
 

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RedHat ceph boot question

2014-01-25 Thread Michael J. Kidd
While clearly not optimal for long term flexibility, I've found that adding
my OSD's to fstab allows the OSDs to mount during boot, and they start
automatically when they're already mounted during boot.
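
For reference, a typical entry looks something like this (device path, OSD id
and mount options are just an illustration; XFS assumed):

/dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  noatime,inode64  0 0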

Hope this helps until a permanent fix is available.

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services


On Fri, Jan 24, 2014 at 9:08 PM, Derek Yarnell de...@umiacs.umd.edu wrote:

 So we have a test cluster, and two production clusters all running on
 RHEL6.5.  Two are running Emperor and one of them running Dumpling.  On
 all of them our OSDs do not start at boot it seems via the udev rules.
 The OSDs were created with ceph-deploy and are all GPT.  The OSDs are
 visable with `ceph-disk list` and running `/usr/sbin/ceph-disk-activate
 {device}` mounts and adds them.  Running a `partprobe {device}` does not
 seem to trigger the udev rule at all.

 I had found this issue[1] but we are definitely running code that was
 released after this ticket was closed.  Has there been anyone else that
 has problems with udev on RHEL mounting their OSDs?

 [1] - http://tracker.ceph.com/issues/5194

 Thanks,
 derek

 --
 Derek T. Yarnell
 University of Maryland
 Institute for Advanced Computer Studies
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] servers advise (dell r515 or supermicro ....)

2014-01-15 Thread Michael J. Kidd
It's also good to note that the m500 has built-in RAIN protection
(basically, diagonal parity at the NAND level).  It should be very good for
journal consistency.


Sent from my mobile device.  Please excuse brevity and typographical errors.
On Jan 15, 2014 9:07 AM, Stefan Priebe s.pri...@profihost.ag wrote:

 Am 15.01.2014 15:03, schrieb Robert van Leeuwen:

 Power-Loss Protection:  In the rare event that power fails while the
 drive is operating, power-loss protection helps ensure that data isn’t
 corrupted.


 Seems that not all power protected SSDs are created equal:
 http://lkcl.net/reports/ssd_analysis.html

 The m500 is not tested but the m4 is.

 Up to now it seems that only Intel seems to have done his homework.
 In general they *seem* to be the most reliable SSD provider.


 Testing the m4 is useless as it has no power loss protection. The result
 should have been known before the test was started.

 But yes, Intel is very reliable, but the 520 series and others from Intel
 aren't.

 Stefan
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] servers advise (dell r515 or supermicro ....)

2014-01-15 Thread Michael J. Kidd
actually, they're very inexpensive as far as SSD's go.  The 960gb m500 can
be had on Amazon for $499 US on prime (as of yesterday anyway).

Sent from my mobile device.  Please excuse brevity and typographical errors.
On Jan 15, 2014 9:50 AM, Sebastien Han sebastien@enovance.com wrote:

 However you have to get > 480GB, which is ridiculously large for a journal. I
 believe they are pretty expensive too.

 
 Sébastien Han
 Cloud Engineer

 Always give 100%. Unless you're giving blood.”

 Phone: +33 (0)1 49 70 99 72
 Mail: sebastien@enovance.com
 Address : 10, rue de la Victoire - 75009 Paris
 Web : www.enovance.com - Twitter : @enovance

 On 15 Jan 2014, at 15:49, Sebastien Han sebastien@enovance.com
 wrote:

  Sorry I was only looking at the 4K aligned results.
 
  
  Sébastien Han
  Cloud Engineer
 
  Always give 100%. Unless you're giving blood.”
 
  Phone: +33 (0)1 49 70 99 72
  Mail: sebastien@enovance.com
  Address : 10, rue de la Victoire - 75009 Paris
  Web : www.enovance.com - Twitter : @enovance
 
  On 15 Jan 2014, at 15:46, Stefan Priebe s.pri...@profihost.ag wrote:
 
  Am 15.01.2014 15:44, schrieb Mark Nelson:
  On 01/15/2014 08:39 AM, Stefan Priebe wrote:
 
  Am 15.01.2014 15:34, schrieb Sebastien Han:
  Hum the Crucial m500 is pretty slow. The biggest one doesn’t even
  reach 300MB/s.
  Intel DC S3700 100G showed around 200MB/sec for us.
 
   where did you get these values from? I've got some 960GB ones and they all
   have 450MB/s write speed. Also in tests like here you see > 450MB/s:
   http://www.tomshardware.com/reviews/crucial-m500-1tb-ssd,3551-5.html
 
  Looks like at least according to Anand's chart, you'll get full write
  speed once you buy the 480GB model, but not for the 120 or 240GB
 models:
 
 
 http://www.anandtech.com/show/6884/crucial-micron-m500-review-960gb-480gb-240gb-120gb
 
   that's correct, but the sentence was "The biggest one doesn’t even
   reach 300MB/s."
 
 
 
  Actually, I don’t know the price difference between the crucial and
  the intel but the intel looks more suitable for me. Especially after
  Mark’s comment.
 
  
  Sébastien Han
  Cloud Engineer
 
  Always give 100%. Unless you're giving blood.”
 
  Phone: +33 (0)1 49 70 99 72
  Mail: sebastien@enovance.com
  Address : 10, rue de la Victoire - 75009 Paris
  Web : www.enovance.com - Twitter : @enovance
 
  On 15 Jan 2014, at 15:28, Mark Nelson mark.nel...@inktank.com
 wrote:
 
  On 01/15/2014 08:03 AM, Robert van Leeuwen wrote:
  Power-Loss Protection:  In the rare event that power fails while
 the
  drive is operating, power-loss protection helps ensure that data
  isn’t
  corrupted.
 
  Seems that not all power protected SSDs are created equal:
  http://lkcl.net/reports/ssd_analysis.html
 
  The m500 is not tested but the m4 is.
 
  Up to now it seems that only Intel seems to have done his homework.
  In general they *seem* to be the most reliable SSD provider.
 
  Even at that, there has been some concern on the list (and lkml)
 that
  certain older Intel drives without super-capacitors are ignoring
  ATA_CMD_FLUSH, making them very fast (which I like!) but potentially
  dangerous (boo!).  The 520 in particular is a drive I've used for a
  lot of Ceph performance testing but I'm afraid that if it's not
  properly handling CMD FLUSH requests, it may not be indicative of
 the
  performance folks would see on other drives that do.
 
  On the third hand, if drives with supercaps like the Intel DC S3700
  can safely ignore CMD_FLUSH and maintain high performance (even when
  there are a lot of O_DSYNC calls, ala the journal), that potentially
  makes them even more attractive (and that drive already has
  relatively high sequential write performance and high write
 endurance).
 
 
  Cheers,
  Robert van Leeuwen
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com