Re: [ceph-users] Is there a limit for object size in CephFS?

2015-08-13 Thread Yan, Zheng
I just tried the 4.0 kernel and still do not encounter any problem. Please run the
test again; when the test hangs, check /sys/kernel/debug/ceph/*/mdsc
and /sys/kernel/debug/ceph/*/osdc
to find which request is hung.
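
For reference, a minimal way to do that check, assuming debugfs is mounted at
/sys/kernel/debug:

mount -t debugfs none /sys/kernel/debug   # only needed if not already mounted
cat /sys/kernel/debug/ceph/*/osdc         # in-flight OSD requests
cat /sys/kernel/debug/ceph/*/mdsc         # in-flight MDS requests

Any request that stays listed there while the test is hung is the one to look at.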

By the way, do you have CephFS mounted on a host which runs ceph-osd/ceph-mds?




On Wed, Aug 12, 2015 at 11:12 PM, Hadi Montakhabi h...@cs.uh.edu wrote:

 4.0.6-300.fc22.x86_64

 On Tue, Aug 11, 2015 at 10:24 PM, Yan, Zheng uker...@gmail.com wrote:

 On Wed, Aug 12, 2015 at 5:33 AM, Hadi Montakhabi h...@cs.uh.edu wrote:

 ​​
 [sequential read]
 readwrite=read
 size=2g
 directory=/mnt/mycephfs
 ioengine=libaio
 direct=1
 blocksize=${BLOCKSIZE}
 numjobs=1
 iodepth=1
 invalidate=1 # causes the kernel buffer and page cache to be invalidated
 #nrfiles=1
 [sequential write]
 readwrite=write # randread randwrite
 size=2g
 directory=/mnt/mycephfs
 ioengine=libaio
 direct=1
 blocksize=${BLOCKSIZE}
 numjobs=1
 iodepth=1
 invalidate=1
 [random read]
 readwrite=randread
 size=2g
 directory=/mnt/mycephfs
 ioengine=libaio
 direct=1
 blocksize=${BLOCKSIZE}
 numjobs=1
 iodepth=1
 invalidate=1
 [random write]
 readwrite=randwrite
 size=2g
 directory=/mnt/mycephfs
 ioengine=libaio
 direct=1
 blocksize=${BLOCKSIZE}
 numjobs=1
 iodepth=1
 invalidate=1
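
 For reference, a job file like this is typically driven by exporting the block size
 before each run; fio expands ${BLOCKSIZE} from the environment, and the file name
 below is just a placeholder:

 BLOCKSIZE=64k fio cephfs-bench.fio
 BLOCKSIZE=64m fio cephfs-bench.fio   # the 64MB case that reportedly hangs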


 I just tried the 4.2-rc kernel and everything went well. Which kernel version
 were you using?






 On Sun, Aug 9, 2015 at 9:27 PM, Yan, Zheng uker...@gmail.com wrote:


 On Sun, Aug 9, 2015 at 8:57 AM, Hadi Montakhabi h...@cs.uh.edu wrote:

 I am using fio.
 I use the kernel module to Mount CephFS.


 please send fio job file to us



 On Aug 8, 2015 10:52 AM, Ketor D d.ke...@gmail.com wrote:

 Hi Hadi,
   Which bench tool do you use? And how do you mount CephFS,
 ceph-fuse or kernel-cephfs?

 On Fri, Aug 7, 2015 at 11:50 PM, Hadi Montakhabi h...@cs.uh.edu
 wrote:

 Hello Cephers,

 I am benchmarking CephFS. In one of my experiments, I change the
 object size.
 I start from 64kb. Every time I do reads and writes with different block
 sizes.
 By increasing the object size to 64MB and increasing the block size
 to 64MB, CephFS crashes (shown in the chart below). What I mean by
 crash is that when I do ceph -s or ceph -w it constantly reports
 reads to me, but the operation never finishes (even after a few days!).
 I have repeated this experiment for different underlying file
 systems (xfs and btrfs), and the same thing happens in both cases.
 What could be the reason for crashing CephFS? Is there a limit for
 object size in CephFS?

 Thank you,
 Hadi

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier best practices

2015-08-13 Thread Vickey Singh
Thanks Nick for your suggestion.

Can you also tell me how I can reduce the RBD block size to 512K or 1M? Do I need
to put something in the clients' ceph.conf (what parameter do I need to set)?

Thanks once again

- Vickey

On Wed, Aug 12, 2015 at 4:49 PM, Nick Fisk n...@fisk.me.uk wrote:

  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
  Dominik Zalewski
  Sent: 12 August 2015 14:40
  To: ceph-us...@ceph.com
  Subject: [ceph-users] Cache tier best practices
 
  Hi,
 
  I would like to hear from people who use cache tier in Ceph about best
  practices and things I should avoid.
 
  I remember hearing that it wasn't that stable back then. Has it changed
 in
  Hammer release?

 It's not so much the stability, but the performance. If your working set
 will sit mostly in the cache tier and won't tend to change then you might
 be alright. Otherwise you will find that performance is very poor.

 Only tip I can really give is that I have found dropping the RBD block
 size down to 512kb-1MB helps quite a bit as it makes the cache more
 effective and also minimises the amount of data transferred on each
 promotion/flush.

 
  Any tips and tricks are much appreciated!
 
  Thanks
 
  Dominik




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds server(s) crashed

2015-08-13 Thread John Spray
On Thu, Aug 13, 2015 at 5:12 AM, Bob Ababurko b...@ababurko.net wrote:
  I am actually looking for the most stable way to implement cephfs at
  this
  point.   My cephfs cluster contains millions of small files, so many
  inodes
  if that needs to be taken into account.  Perhaps I should only be using
  one
  MDS node for stability at this point?  Is this the best way forward to
  get a
  handle on stability?  I'm also curious whether I should set my mds cache size
  to a number greater than the number of files I have in the cephfs cluster.  If you
  can give some key points on configuring cephfs to get the best stability and, if
  possible, availability, this would be helpful to me.

 One active MDS is the most stable setup. Adding a few standby MDS
 should not hurt stability.

 You can't set mds cache size to a number greater than the number of files in the
 fs; it would require lots of memory.



 I'm not sure what amount of RAM you consider to be 'lots' but I would really
 like to understand a bit more about this.  Perhaps a rule of thumb?  Is
 there an advantage to more RAM and a large mds cache size?  We plan on putting
 close to a billion small files in this pool via cephfs, so what should we be
 considering when sizing our MDS hosts OR changing the MDS config?
 Basically, what should we OR should not be doing when we have a cluster with
 this many files?  Thanks!

The advantage to setting up a larger cache is:
 * We can allow clients to hold more in cache (anything in client
cache must also be in MDS cache)
 * We are less likely to need to read from disk on a random metadata read
 * We are less likely to need to write to disk again if a file
was modified (can just journal + update in cache)

None of these outcomes is particularly relevant if your workload is a
stream of a billion creates.  The reason we're hitting the cache size
limit in this case is because of the size of the directories: some
operations during restart of the MDS are happening at a per-directory
level of granularity.

If you're running up to deploying a billion-file workload, it might be
worth doing some experiments on a smaller system with the same file
hierarchy structure.  You could experiment with enabling inline data,
tuning mds_bal_split_size (how large dirs grow before getting
fragmented), mds_cache_size and see what effect these options have on
the rate of file creates that we sustain.  For best results, also
periodically kill an MDS during a run, to check that the system
recovers correctly (i.e. check for bugs like the one you've just hit).
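
As a rough sketch only (values are illustrative, not recommendations), that kind of
tuning lives in ceph.conf on the MDS hosts:

[mds]
    mds cache size = 4000000     ; inodes to keep in cache (default 100000)
    mds bal split size = 10000   ; directory size at which fragmentation is considered
    mds bal frag = true          ; allow directory fragmentation (the riskier territory noted below)

mds_cache_size counts inodes, so budget very roughly a few KB of RAM per cached inode
when sizing the host.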

As for the most stable configuration, the CephFS for early adopters
page[1] is still current.  Enabling inline data and/or directory
fragmentation will put you in slightly riskier territory (aka less
comprehensively tested by us), but if you can check that the
filesystem is working correctly for your workload in a POC then that's
the most important measure of whether it's suitable for you to deploy.

John

1. http://ceph.com/docs/master/cephfs/early-adopters/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds server(s) crashed

2015-08-13 Thread John Spray
On Thu, Aug 13, 2015 at 3:29 AM, yangyongp...@bwstor.com.cn
yangyongp...@bwstor.com.cn wrote:
 I also encountered a problem: a standby mds cannot be promoted to active when
 the active mds service is stopped, which has bothered me for several days. Maybe an MDS
 cluster can solve this problem, but the ceph team hasn't released this feature yet.

That sounds like an unrelated issue -- can you give us more details,
like the output of ceph status? (possibly in a tracker.ceph.com
ticket)

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Geographical Replication and Disaster Recovery Support

2015-08-13 Thread Irek Fasikhov
Hi.
This document applies only to RadosGW.

You need to read this document instead:
https://wiki.ceph.com/Planning/Blueprints/Hammer/RBD%3A_Mirroring


Best regards, Фасихов Ирек Нургаязович
Mobile: +79229045757

2015-08-13 11:40 GMT+03:00 Özhan Rüzgar Karaman oruzgarkara...@gmail.com:

 Hi;
 I like to learn about Ceph's Geographical Replication and Disaster
 Recovery Options. I know that currently we do not have a built-in official
 Geo Replication or disaster recovery, there are some third party tools like
 drbd but they are not like a solution that business needs.

 I also read the RGW document at Ceph Wiki Site.


 https://wiki.ceph.com/Planning/Blueprints/Dumpling/RGW_Geo-Replication_and_Disaster_Recovery


 The document is from Dumpling Release nearly year 2013. Do we have any
 active works or efforts to achieve disaster recovery or geographical
 replication features to Ceph, is it on our current road map?

 Thanks
 Özhan KARAMAN

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Geographical Replication and Disaster Recovery Support

2015-08-13 Thread Özhan Rüzgar Karaman
Hi;
I would like to learn about Ceph's Geographical Replication and Disaster Recovery
options. I know that currently we do not have a built-in official Geo
Replication or disaster recovery feature; there are some third-party tools like
drbd, but they are not the kind of solution that a business needs.

I also read the RGW document at Ceph Wiki Site.

https://wiki.ceph.com/Planning/Blueprints/Dumpling/RGW_Geo-Replication_and_Disaster_Recovery


The document is from the Dumpling release, around 2013. Do we have any
active work or effort to add disaster recovery or geographical
replication features to Ceph? Is it on the current road map?

Thanks
Özhan KARAMAN
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH cache layer. Very slow

2015-08-13 Thread Voloshanenko Igor
So, after testing SSD (i wipe 1 SSD, and used it for tests)

root@ix-s2:~# sudo fio --filename=/dev/sda --direct=1 --sync=1 --rw=write
--bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
--name=journal-test
journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync,
iodepth=1
fio-2.1.3
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0KB/1152KB/0KB /s] [0/288/0 iops] [eta
00m:00s]
journal-test: (groupid=0, jobs=1): err= 0: pid=2849460: Thu Aug 13 10:46:42
2015
  write: io=68972KB, bw=1149.6KB/s, iops=287, runt= 60001msec
clat (msec): min=2, max=15, avg= 3.48, stdev= 1.08
 lat (msec): min=2, max=15, avg= 3.48, stdev= 1.08
clat percentiles (usec):
 |  1.00th=[ 2704],  5.00th=[ 2800], 10.00th=[ 2864], 20.00th=[ 2928],
 | 30.00th=[ 3024], 40.00th=[ 3088], 50.00th=[ 3280], 60.00th=[ 3408],
 | 70.00th=[ 3504], 80.00th=[ 3728], 90.00th=[ 3856], 95.00th=[ 4016],
 | 99.00th=[ 9024], 99.50th=[ 9280], 99.90th=[ 9792], 99.95th=[10048],
 | 99.99th=[14912]
bw (KB  /s): min= 1064, max= 1213, per=100.00%, avg=1150.07, stdev=34.31
lat (msec) : 4=94.99%, 10=4.96%, 20=0.05%
  cpu  : usr=0.13%, sys=0.57%, ctx=17248, majf=0, minf=7
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=17243/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=68972KB, aggrb=1149KB/s, minb=1149KB/s, maxb=1149KB/s,
mint=60001msec, maxt=60001msec

Disk stats (read/write):
  sda: ios=0/17224, merge=0/0, ticks=0/59584, in_queue=59576, util=99.30%

So, it's painful... the SSD does only 287 iops at 4K... 1.1 MB/s

I try to change cache mode :
echo temporary write through > /sys/class/scsi_disk/2:0:0:0/cache_type
echo temporary write through > /sys/class/scsi_disk/3:0:0:0/cache_type

No luck, still the same bad results. I also found this article:
https://lkml.org/lkml/2013/11/20/264 pointing to an old, very simple patch
which disables CMD_FLUSH:
https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba

Does anybody have better ideas on how to improve this? (Or a way to disable CMD_FLUSH
without recompiling the kernel? I use Ubuntu with kernel 4.0.4 for now, on the 4.x branch
because the SSD 850 Pro has an issue with NCQ TRIM and before 4.0.4 this
exception was not included in libata.)

2015-08-12 19:17 GMT+03:00 Pieter Koorts pieter.koo...@me.com:

 Hi Igor

 I suspect you have very much the same problem as me.

 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg22260.html

 Basically Samsung drives (like many SATA SSD's) are very much hit and miss
 so you will need to test them like described here to see if they are any
 good.
 http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

 To give you an idea my average performance went from 11MB/s (with Samsung
 SSD) to 30MB/s (without any SSD) on write performance. This is a very small
 cluster.

 Pieter

 On Aug 12, 2015, at 04:33 PM, Voloshanenko Igor 
 igor.voloshane...@gmail.com wrote:

 Hi all, we have setup CEPH cluster with 60 OSD (2 diff types) (5 nodes, 12
 disks on each, 10 HDD, 2 SSD)

 Also we cover this with custom crushmap with 2 root leaf

 ID   WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -100 5.0 root ssd
 -102 1.0 host ix-s2-ssd
2 1.0 osd.2   up  1.0  1.0
9 1.0 osd.9   up  1.0  1.0
 -103 1.0 host ix-s3-ssd
3 1.0 osd.3   up  1.0  1.0
7 1.0 osd.7   up  1.0  1.0
 -104 1.0 host ix-s5-ssd
1 1.0 osd.1   up  1.0  1.0
6 1.0 osd.6   up  1.0  1.0
 -105 1.0 host ix-s6-ssd
4 1.0 osd.4   up  1.0  1.0
8 1.0 osd.8   up  1.0  1.0
 -106 1.0 host ix-s7-ssd
0 1.0 osd.0   up  1.0  1.0
5 1.0 osd.5   up  1.0  1.0
   -1 5.0 root platter
   -2 1.0 host ix-s2-platter
   13 1.0 osd.13  up  1.0  1.0
   17 1.0 osd.17  up  1.0  1.0
   21 1.0 osd.21  up  1.0  1.0
   27 1.0 osd.27  up  1.0  1.0
   32 1.0 osd.32  up  1.0  1.0
   37 1.0 osd.37  up  1.0  1.0
   44 1.0 osd.44  up  1.0  1.0
   48 1.0 osd.48  up  1.0  1.0
   55 1.0 osd.55  up  1.0  1.0
  

Re: [ceph-users] CEPH cache layer. Very slow

2015-08-13 Thread Irek Fasikhov
Hi, Igor.
Try applying the patch here:
http://www.theirek.com/blog/2014/02/16/patch-dlia-raboty-s-enierghoniezavisimym-keshiem-ssd-diskov

P.S. I no longer track changes in this direction (kernel), because we
already use the recommended SSDs.

Best regards, Фасихов Ирек Нургаязович
Mobile: +79229045757

2015-08-13 11:56 GMT+03:00 Voloshanenko Igor igor.voloshane...@gmail.com:

 So, after testing SSD (i wipe 1 SSD, and used it for tests)

 root@ix-s2:~# sudo fio --filename=/dev/sda --direct=1 --sync=1 --rw=write
 --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
 --name=journal-test
 journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync,
 iodepth=1
 fio-2.1.3
 Starting 1 process
 Jobs: 1 (f=1): [W] [100.0% done] [0KB/1152KB/0KB /s] [0/288/0 iops] [eta
 00m:00s]
 journal-test: (groupid=0, jobs=1): err= 0: pid=2849460: Thu Aug 13
 10:46:42 2015
   write: io=68972KB, bw=1149.6KB/s, iops=287, runt= 60001msec
 clat (msec): min=2, max=15, avg= 3.48, stdev= 1.08
  lat (msec): min=2, max=15, avg= 3.48, stdev= 1.08
 clat percentiles (usec):
  |  1.00th=[ 2704],  5.00th=[ 2800], 10.00th=[ 2864], 20.00th=[ 2928],
  | 30.00th=[ 3024], 40.00th=[ 3088], 50.00th=[ 3280], 60.00th=[ 3408],
  | 70.00th=[ 3504], 80.00th=[ 3728], 90.00th=[ 3856], 95.00th=[ 4016],
  | 99.00th=[ 9024], 99.50th=[ 9280], 99.90th=[ 9792], 99.95th=[10048],
  | 99.99th=[14912]
 bw (KB  /s): min= 1064, max= 1213, per=100.00%, avg=1150.07,
 stdev=34.31
 lat (msec) : 4=94.99%, 10=4.96%, 20=0.05%
   cpu  : usr=0.13%, sys=0.57%, ctx=17248, majf=0, minf=7
   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
 =64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 =64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 =64=0.0%
  issued: total=r=0/w=17243/d=0, short=r=0/w=0/d=0

 Run status group 0 (all jobs):
   WRITE: io=68972KB, aggrb=1149KB/s, minb=1149KB/s, maxb=1149KB/s,
 mint=60001msec, maxt=60001msec

 Disk stats (read/write):
   sda: ios=0/17224, merge=0/0, ticks=0/59584, in_queue=59576, util=99.30%

 So, it's pain... SSD do only 287 iops on 4K... 1,1 MB/s

 I try to change cache mode :
 echo temporary write through > /sys/class/scsi_disk/2:0:0:0/cache_type
 echo temporary write through > /sys/class/scsi_disk/3:0:0:0/cache_type

 no luck, still same shit results, also i found this article:
 https://lkml.org/lkml/2013/11/20/264 pointed to old very simple patch,
 which disable CMD_FLUSH
 https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba

 Has everybody better ideas, how to improve this? (or disable CMD_FLUSH
 without recompile kernel, i used ubuntu and 4.0.4 for now (4.x branch
 because SSD 850 Pro have issue with NCQ TRIM and before 4.0.4 this
 exception was not included into libsata.c)

 2015-08-12 19:17 GMT+03:00 Pieter Koorts pieter.koo...@me.com:

 Hi Igor

 I suspect you have very much the same problem as me.

 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg22260.html

 Basically Samsung drives (like many SATA SSD's) are very much hit and
 miss so you will need to test them like described here to see if they are
 any good.
 http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

 To give you an idea my average performance went from 11MB/s (with Samsung
 SSD) to 30MB/s (without any SSD) on write performance. This is a very small
 cluster.

 Pieter

 On Aug 12, 2015, at 04:33 PM, Voloshanenko Igor 
 igor.voloshane...@gmail.com wrote:

 Hi all, we have setup CEPH cluster with 60 OSD (2 diff types) (5 nodes,
 12 disks on each, 10 HDD, 2 SSD)

 Also we cover this with custom crushmap with 2 root leaf

 ID   WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -100 5.0 root ssd
 -102 1.0 host ix-s2-ssd
2 1.0 osd.2   up  1.0  1.0
9 1.0 osd.9   up  1.0  1.0
 -103 1.0 host ix-s3-ssd
3 1.0 osd.3   up  1.0  1.0
7 1.0 osd.7   up  1.0  1.0
 -104 1.0 host ix-s5-ssd
1 1.0 osd.1   up  1.0  1.0
6 1.0 osd.6   up  1.0  1.0
 -105 1.0 host ix-s6-ssd
4 1.0 osd.4   up  1.0  1.0
8 1.0 osd.8   up  1.0  1.0
 -106 1.0 host ix-s7-ssd
0 1.0 osd.0   up  1.0  1.0
5 1.0 osd.5   up  1.0  1.0
   -1 5.0 root platter
   -2 1.0 host ix-s2-platter
   13 1.0 osd.13  up  1.0  1.0
   17 1.0 osd.17  up  1.0  1.0
   21 1.0 osd.21  up  1.0  1.0
   

[ceph-users] Change protection/profile from a erasure coded pool

2015-08-13 Thread Italo Santos
Hello everyone,  

Today I have a cluster with 4 hosts, and I created a pool that uses the erasure
code profile below:

##
directory=/usr/lib/ceph/erasure-code
k=3
m=1
plugin=jerasure
ruleset-failure-domain=host
technique=reed_sol_van

##

This cluster is used only for RGW and I'm planning to add a new host to it.
So, with the new node, I'd like to change this pool to use another
profile with k=3, m=2 to increase the protection of the data.

I keep in mind the need to create a new pool with a different profile and move all
objects from the previous pool to the new one, because a previously created pool
cannot be modified. I'd like to know if someone has already done that and if
someone has a recommendation for the best way to get this done.
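
For what it's worth, the usual shape of that migration (pool names and PG counts are
placeholders, and rados cppool has caveats, so test on a throwaway pool first) is
roughly:

ceph osd erasure-code-profile set ec-k3-m2 k=3 m=2 plugin=jerasure \
    technique=reed_sol_van ruleset-failure-domain=host
ceph osd pool create newpool 256 256 erasure ec-k3-m2
rados cppool oldpool newpool    # copy objects across, then repoint RGW at the new pool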

Regards.

Italo Santos
http://italosantos.com.br/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd map failed

2015-08-13 Thread Adir Lev
Hi,

I have a CEPH cluster running on 4 physical servers; the cluster is up and
healthy.
So far I have been unable to connect any client to the cluster using krbd or the
fio rbd plugin.
My clients can see and create images in the rbd pool but cannot map them:
root@r-dcs68 ~ # rbd ls
fio_test
foo
foo1
foo_test

root@r-dcs68 ~ # rbd map foo
rbd: sysfs write failed
rbd: map failed: (95) Operation not supported

Using strace I see that there are no write permissions to /sys/bus/rbd/add:
root@r-dcs68 ~ # echo 192.168.57.102:16789 name=admin,key=client.admin rbd foo - > /sys/bus/rbd/add
-bash: echo: write error: Operation not permitted

Any idea why I can't map?

Thanks
Adir

--
here's some info about my client:
root@r-dcs68 ~ # uname -r
4.1.0

root@r-dcs68 ~ # cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.0 (Maipo)

root@r-dcs68 ~ # lsmod |grep rbd
rbd62776  0
libceph   236956  1 rbd


and tail of log:
[   49.303330] Key type id_resolver registered
[   49.306638] Key type id_legacy registered
[   70.471743] Key type ceph registered
[   70.474482] libceph: loaded (mon/osd proto 15/24)
[   70.479873] rbd: loaded
[  114.968597] Loading iSCSI transport class v2.0-870.
[  114.975685] iscsi: registered transport (iser)
[  146.061478] ib0: sendonly multicast join failed for 
ff12:401b::::::0016, status -22
[  148.076070] ib0: sendonly multicast join failed for 
ff12:401b::::::0016, status -22
[  190.207159] libceph: client4407 fsid 0169d615-9b5b-432f-9241-4aadf71be9cc
[  190.213774] libceph: mon0 192.168.57.102:16789 session established
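
For reference, two hedged checks that often apply to this error: whether the image
uses format 2 features that the kernel client does not support, and whether the map
is done through the rbd CLI with proper credentials rather than by echoing into sysfs
(when writing to /sys/bus/rbd/add by hand, the options field expects
name=<user>,secret=<base64 key>, not the client entity name):

rbd info foo          # shows image format and features
rbd map foo --id admin --keyring /etc/ceph/ceph.client.admin.keyring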



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH cache layer. Very slow

2015-08-13 Thread Jan Schermer
I tested and can recommend the Samsung 845 DC PRO (make sure it is DC PRO and 
not just PRO or DC EVO!).
Those were very cheap but are out of stock at the moment (here).
Faster than Intels, cheaper, and a slightly different technology (3D V-NAND) 
which IMO makes them superior without needing many tricks to do their job.

Jan

 On 13 Aug 2015, at 14:40, Voloshanenko Igor igor.voloshane...@gmail.com 
 wrote:
 
 Tnx, Irek! Will try!
 
 but another question to all, which SSD good enough for CEPH now?
 
 I'm looking into S3500 240G (I have some S3500 120G which show great results. 
 Around 8x times better than Samsung)
 
 Possible you can give advice about other vendors/models with same or below 
 price level as S3500 240G?
 
 2015-08-13 12:11 GMT+03:00 Irek Fasikhov malm...@gmail.com 
 mailto:malm...@gmail.com:
 Hi, Igor.
 Try to roll the patch here:
 http://www.theirek.com/blog/2014/02/16/patch-dlia-raboty-s-enierghoniezavisimym-keshiem-ssd-diskov
  
 http://www.theirek.com/blog/2014/02/16/patch-dlia-raboty-s-enierghoniezavisimym-keshiem-ssd-diskov
 
 P.S. I am no longer tracks changes in this direction(kernel), because we use 
 already recommended SSD
 
 С уважением, Фасихов Ирек Нургаязович
 Моб.: +79229045757 tel:%2B79229045757
 
 2015-08-13 11:56 GMT+03:00 Voloshanenko Igor igor.voloshane...@gmail.com 
 mailto:igor.voloshane...@gmail.com:
 So, after testing SSD (i wipe 1 SSD, and used it for tests)
 
 root@ix-s2:~# sudo fio --filename=/dev/sda --direct=1 --sync=1 --rw=write 
 --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
 --name=journal-test
 journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
 fio-2.1.3
 Starting 1 process
 Jobs: 1 (f=1): [W] [100.0% done] [0KB/1152KB/0KB /s] [0/288/0 iops] [eta 
 00m:00s]
 journal-test: (groupid=0, jobs=1): err= 0: pid=2849460: Thu Aug 13 10:46:42 
 2015
   write: io=68972KB, bw=1149.6KB/s, iops=287, runt= 60001msec
 clat (msec): min=2, max=15, avg= 3.48, stdev= 1.08
  lat (msec): min=2, max=15, avg= 3.48, stdev= 1.08
 clat percentiles (usec):
  |  1.00th=[ 2704],  5.00th=[ 2800], 10.00th=[ 2864], 20.00th=[ 2928],
  | 30.00th=[ 3024], 40.00th=[ 3088], 50.00th=[ 3280], 60.00th=[ 3408],
  | 70.00th=[ 3504], 80.00th=[ 3728], 90.00th=[ 3856], 95.00th=[ 4016],
  | 99.00th=[ 9024], 99.50th=[ 9280], 99.90th=[ 9792], 99.95th=[10048],
  | 99.99th=[14912]
 bw (KB  /s): min= 1064, max= 1213, per=100.00%, avg=1150.07, stdev=34.31
 lat (msec) : 4=94.99%, 10=4.96%, 20=0.05%
   cpu  : usr=0.13%, sys=0.57%, ctx=17248, majf=0, minf=7
   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, =64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 =64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 =64=0.0%
  issued: total=r=0/w=17243/d=0, short=r=0/w=0/d=0
 
 Run status group 0 (all jobs):
   WRITE: io=68972KB, aggrb=1149KB/s, minb=1149KB/s, maxb=1149KB/s, 
 mint=60001msec, maxt=60001msec
 
 Disk stats (read/write):
   sda: ios=0/17224, merge=0/0, ticks=0/59584, in_queue=59576, util=99.30%
 
 So, it's pain... SSD do only 287 iops on 4K... 1,1 MB/s
 
 I try to change cache mode :
 echo temporary write through > /sys/class/scsi_disk/2:0:0:0/cache_type
 echo temporary write through > /sys/class/scsi_disk/3:0:0:0/cache_type
 
 no luck, still same shit results, also i found this article:
 https://lkml.org/lkml/2013/11/20/264 https://lkml.org/lkml/2013/11/20/264 
 pointed to old very simple patch, which disable CMD_FLUSH
 https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba 
 https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba
 
 Has everybody better ideas, how to improve this? (or disable CMD_FLUSH 
 without recompile kernel, i used ubuntu and 4.0.4 for now (4.x branch because 
 SSD 850 Pro have issue with NCQ TRIM and before 4.0.4 this exception was not 
 included into libsata.c)
 
 2015-08-12 19:17 GMT+03:00 Pieter Koorts pieter.koo...@me.com 
 mailto:pieter.koo...@me.com:
 Hi Igor
 
 I suspect you have very much the same problem as me.
 
 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg22260.html 
 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg22260.html
 
 Basically Samsung drives (like many SATA SSD's) are very much hit and miss so 
 you will need to test them like described here to see if they are any good. 
 http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
  
 http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
 
 To give you an idea my average performance went from 11MB/s (with Samsung 
 SSD) to 30MB/s (without any SSD) on write performance. This is a very small 
 cluster.
 
 Pieter
 
 On Aug 12, 2015, at 04:33 PM, Voloshanenko Igor igor.voloshane...@gmail.com 
 mailto:igor.voloshane...@gmail.com wrote:
 
 Hi all, we have setup CEPH cluster with 60 OSD (2 diff types) 

Re: [ceph-users] CEPH cache layer. Very slow

2015-08-13 Thread Voloshanenko Igor
So, good, but the price for the 845 DC PRO 400 GB is about 2x higher than the
Intel S3500 240G (((

Any other models? (((

2015-08-13 15:45 GMT+03:00 Jan Schermer j...@schermer.cz:

 I tested and can recommend the Samsung 845 DC PRO (make sure it is DC PRO
 and not just PRO or DC EVO!).
 Those were very cheap but are out of stock at the moment (here).
 Faster than Intels, cheaper, and slightly different technology (3D V-NAND)
 which IMO makes them superior without needing many tricks to do its job.

 Jan

 On 13 Aug 2015, at 14:40, Voloshanenko Igor igor.voloshane...@gmail.com
 wrote:

 Tnx, Irek! Will try!

 but another question to all, which SSD good enough for CEPH now?

 I'm looking into S3500 240G (I have some S3500 120G which show great
 results. Around 8x times better than Samsung)

 Possible you can give advice about other vendors/models with same or below
 price level as S3500 240G?

 2015-08-13 12:11 GMT+03:00 Irek Fasikhov malm...@gmail.com:

 Hi, Igor.
 Try to roll the patch here:

 http://www.theirek.com/blog/2014/02/16/patch-dlia-raboty-s-enierghoniezavisimym-keshiem-ssd-diskov

 P.S. I am no longer tracks changes in this direction(kernel), because we
 use already recommended SSD

 С уважением, Фасихов Ирек Нургаязович
 Моб.: +79229045757

 2015-08-13 11:56 GMT+03:00 Voloshanenko Igor igor.voloshane...@gmail.com
 :

 So, after testing SSD (i wipe 1 SSD, and used it for tests)

 root@ix-s2:~# sudo fio --filename=/dev/sda --direct=1 --sync=1
 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
 --group_reporting --name=journal-test
 journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync,
 iodepth=1
 fio-2.1.3
 Starting 1 process
 Jobs: 1 (f=1): [W] [100.0% done] [0KB/1152KB/0KB /s] [0/288/0 iops] [eta
 00m:00s]
 journal-test: (groupid=0, jobs=1): err= 0: pid=2849460: Thu Aug 13
 10:46:42 2015
   write: io=68972KB, bw=1149.6KB/s, iops=287, runt= 60001msec
 clat (msec): min=2, max=15, avg= 3.48, stdev= 1.08
  lat (msec): min=2, max=15, avg= 3.48, stdev= 1.08
 clat percentiles (usec):
  |  1.00th=[ 2704],  5.00th=[ 2800], 10.00th=[ 2864], 20.00th=[
 2928],
  | 30.00th=[ 3024], 40.00th=[ 3088], 50.00th=[ 3280], 60.00th=[
 3408],
  | 70.00th=[ 3504], 80.00th=[ 3728], 90.00th=[ 3856], 95.00th=[
 4016],
  | 99.00th=[ 9024], 99.50th=[ 9280], 99.90th=[ 9792],
 99.95th=[10048],
  | 99.99th=[14912]
 bw (KB  /s): min= 1064, max= 1213, per=100.00%, avg=1150.07,
 stdev=34.31
 lat (msec) : 4=94.99%, 10=4.96%, 20=0.05%
   cpu  : usr=0.13%, sys=0.57%, ctx=17248, majf=0, minf=7
   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
 =64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 =64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 =64=0.0%
  issued: total=r=0/w=17243/d=0, short=r=0/w=0/d=0

 Run status group 0 (all jobs):
   WRITE: io=68972KB, aggrb=1149KB/s, minb=1149KB/s, maxb=1149KB/s,
 mint=60001msec, maxt=60001msec

 Disk stats (read/write):
   sda: ios=0/17224, merge=0/0, ticks=0/59584, in_queue=59576, util=99.30%

 So, it's pain... SSD do only 287 iops on 4K... 1,1 MB/s

 I try to change cache mode :
 echo temporary write through > /sys/class/scsi_disk/2:0:0:0/cache_type
 echo temporary write through > /sys/class/scsi_disk/3:0:0:0/cache_type

 no luck, still same shit results, also i found this article:
 https://lkml.org/lkml/2013/11/20/264 pointed to old very simple patch,
 which disable CMD_FLUSH
 https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba

 Has everybody better ideas, how to improve this? (or disable CMD_FLUSH
 without recompile kernel, i used ubuntu and 4.0.4 for now (4.x branch
 because SSD 850 Pro have issue with NCQ TRIM and before 4.0.4 this
 exception was not included into libsata.c)

 2015-08-12 19:17 GMT+03:00 Pieter Koorts pieter.koo...@me.com:

 Hi Igor

 I suspect you have very much the same problem as me.

 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg22260.html

 Basically Samsung drives (like many SATA SSD's) are very much hit and
 miss so you will need to test them like described here to see if they are
 any good.
 http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

 To give you an idea my average performance went from 11MB/s (with
 Samsung SSD) to 30MB/s (without any SSD) on write performance. This is a
 very small cluster.

 Pieter

 On Aug 12, 2015, at 04:33 PM, Voloshanenko Igor 
 igor.voloshane...@gmail.com wrote:

 Hi all, we have setup CEPH cluster with 60 OSD (2 diff types) (5 nodes,
 12 disks on each, 10 HDD, 2 SSD)

 Also we cover this with custom crushmap with 2 root leaf

 ID   WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -100 5.0 root ssd
 -102 1.0 host ix-s2-ssd
2 1.0 osd.2   up  1.0  1.0
9 1.0 osd.9   up  1.0  1.0
 

[ceph-users] OSD space imbalance

2015-08-13 Thread Vedran Furač
Hello,

I'm having an issue where disk usages between OSDs aren't well balanced
thus causing disk space to be wasted. Ceph is latest 0.94.2, used
exclusively through cephfs. Re-weighting helps, but just slightly, and
it has to be done on a daily basis causing constant refills. In the end
I get OSD with 65% usage with some other going over 90%. I also set the
ceph osd crush tunables optimal, but I didn't notice any changes when
it comes to disk usage. Is there anything I can do to get them within
10% range at least?

 health HEALTH_OK
 mdsmap e2577: 1/1/1 up, 2 up:standby
 osdmap e25239: 48 osds: 48 up, 48 in
  pgmap v3188836: 5184 pgs, 3 pools, 18028 GB data, 6385 kobjects
36156 GB used, 9472 GB / 45629 GB avail
5184 active+clean


ID WEIGHT  REWEIGHT SIZE   USEAVAIL   %USE  VAR
37 0.92999  1.0   950G   625G324G 65.85 0.83
21 0.92999  1.0   950G   649G300G 68.35 0.86
32 0.92999  1.0   950G   670G279G 70.58 0.89
 7 0.92999  1.0   950G   676G274G 71.11 0.90
17 0.92999  1.0   950G   681G268G 71.73 0.91
40 0.92999  1.0   950G   689G260G 72.55 0.92
20 0.92999  1.0   950G   690G260G 72.62 0.92
25 0.92999  1.0   950G   691G258G 72.76 0.92
 2 0.92999  1.0   950G   694G256G 73.03 0.92
39 0.92999  1.0   950G   697G253G 73.35 0.93
18 0.92999  1.0   950G   703G247G 74.00 0.93
47 0.92999  1.0   950G   703G246G 74.05 0.93
23 0.92999  0.86693   950G   704G245G 74.14 0.94
 6 0.92999  1.0   950G   726G224G 76.39 0.96
 8 0.92999  1.0   950G   727G223G 76.54 0.97
 5 0.92999  1.0   950G   728G222G 76.62 0.97
35 0.92999  1.0   950G   728G221G 76.66 0.97
11 0.92999  1.0   950G   730G220G 76.82 0.97
43 0.92999  1.0   950G   730G219G 76.87 0.97
33 0.92999  1.0   950G   734G215G 77.31 0.98
38 0.92999  1.0   950G   736G214G 77.49 0.98
12 0.92999  1.0   950G   737G212G 77.61 0.98
31 0.92999  0.85184   950G   742G208G 78.09 0.99
28 0.92999  1.0   950G   745G205G 78.41 0.99
27 0.92999  1.0   950G   751G199G 79.04 1.00
10 0.92999  1.0   950G   754G195G 79.40 1.00
13 0.92999  1.0   950G   762G188G 80.21 1.01
 9 0.92999  1.0   950G   763G187G 80.29 1.01
16 0.92999  1.0   950G   764G186G 80.37 1.01
 0 0.92999  1.0   950G   778G171G 81.94 1.03
 3 0.92999  1.0   950G   780G170G 82.11 1.04
41 0.92999  1.0   950G   780G169G 82.13 1.04
34 0.92999  0.87303   950G   783G167G 82.43 1.04
14 0.92999  1.0   950G   784G165G 82.56 1.04
42 0.92999  1.0   950G   786G164G 82.70 1.04
46 0.92999  1.0   950G   788G162G 82.93 1.05
30 0.92999  1.0   950G   790G160G 83.12 1.05
45 0.92999  1.0   950G   804G146G 84.59 1.07
44 0.92999  1.0   950G   807G143G 84.92 1.07
 1 0.92999  1.0   950G   817G132G 86.05 1.09
22 0.92999  1.0   950G   825G125G 86.81 1.10
15 0.92999  1.0   950G   826G123G 86.97 1.10
19 0.92999  1.0   950G   829G120G 87.30 1.10
36 0.92999  1.0   950G   831G119G 87.48 1.10
24 0.92999  1.0   950G   831G118G 87.50 1.10
26 0.92999  1.0   950G   851G 101692M 89.55 1.13
29 0.92999  1.0   950G   851G 101341M 89.59 1.13
 4 0.92999  1.0   950G   860G  92164M 90.53 1.14
MIN/MAX VAR: 0.83/1.14  STDDEV: 5.94
  TOTAL 45629G 36156G   9473G 79.24

Thanks,
Vedran


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Can not active osds (old/different cluster instance?)

2015-08-13 Thread Vickie ch
Dear all,
  I tried to create OSDs and got an error message (old/different cluster
instance?).
The OSDs can be created but not activated. This server has had OSDs built on it before.
Please give me some advice.

OS:rhel7
ceph:0.80 firefly
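
For reference, the old/different cluster instance message usually points at leftover
state from the earlier cluster: old keys/config under /etc/ceph, old data under
/var/lib/ceph, or old partitions on the disks. A typical cleanup, assuming /dev/sdb is
the OSD disk and holds nothing you need, looks roughly like:

ceph-disk zap /dev/sdb                        # wipe the old partitions and metadata
ceph-deploy osd create <hostname>:/dev/sdb    # then prepare/activate the OSD again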


Best wishes,
Mika
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph distributed osd

2015-08-13 Thread gjprabu
Dear Team,



 We are using two ceph OSDs with replica 2 and it is working properly. 
Here my doubt is: Pool A's image size will be 10GB and it is replicated on the two 
OSDs. What will happen if the size reaches the limit? Is there any 
chance to make the data continue writing onto another two OSDs?



Regards

Prabu













___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped

2015-08-13 Thread Steve Dainard
I decided to set OSD 76 out and let the cluster shuffle the data off
that disk and then brought the OSD back in. For the most part this
seemed to be working, but then I had 1 object degraded and 88xxx
objects misplaced:

# ceph health detail
HEALTH_WARN 11 pgs stuck unclean; recovery 1/66089446 objects degraded
(0.000%); recovery 88844/66089446 objects misplaced (0.134%)
pg 2.e7f is stuck unclean for 88398.251351, current state
active+remapped, last acting [58,5]
pg 2.143 is stuck unclean for 13892.364101, current state
active+remapped, last acting [16,76]
pg 2.968 is stuck unclean for 13892.363521, current state
active+remapped, last acting [44,76]
pg 2.5f8 is stuck unclean for 13892.377245, current state
active+remapped, last acting [17,76]
pg 2.81c is stuck unclean for 13892.363443, current state
active+remapped, last acting [25,76]
pg 2.1a3 is stuck unclean for 13892.364400, current state
active+remapped, last acting [16,76]
pg 2.2cb is stuck unclean for 13892.374390, current state
active+remapped, last acting [14,76]
pg 2.d41 is stuck unclean for 13892.373636, current state
active+remapped, last acting [27,76]
pg 2.3f9 is stuck unclean for 13892.373147, current state
active+remapped, last acting [35,76]
pg 2.a62 is stuck unclean for 86283.741920, current state
active+remapped, last acting [2,38]
pg 2.1b0 is stuck unclean for 13892.363268, current state
active+remapped, last acting [3,76]
recovery 1/66089446 objects degraded (0.000%)
recovery 88844/66089446 objects misplaced (0.134%)

I say apparently because with one object degraded, none of the pg's
are showing degraded:
# ceph pg dump_stuck degraded
ok

# ceph pg dump_stuck unclean
ok
pg_stat state up up_primary acting acting_primary
2.e7f active+remapped [58] 58 [58,5] 58
2.143 active+remapped [16] 16 [16,76] 16
2.968 active+remapped [44] 44 [44,76] 44
2.5f8 active+remapped [17] 17 [17,76] 17
2.81c active+remapped [25] 25 [25,76] 25
2.1a3 active+remapped [16] 16 [16,76] 16
2.2cb active+remapped [14] 14 [14,76] 14
2.d41 active+remapped [27] 27 [27,76] 27
2.3f9 active+remapped [35] 35 [35,76] 35
2.a62 active+remapped [2] 2 [2,38] 2
2.1b0 active+remapped [3] 3 [3,76] 3

All of the OSD filesystems are below 85% full.

I then compared a 0.94.2 cluster that was new and had not been updated
(current cluster is 0.94.2 which had been updated a couple times) and
noticed the crush map had 'tunable straw_calc_version 1' so I added it
to the current cluster.

After the data moved around for about 8 hours or so I'm left with this state:

# ceph health detail
HEALTH_WARN 2 pgs stuck unclean; recovery 16357/66089446 objects
misplaced (0.025%)
pg 2.e7f is stuck unclean for 149422.331848, current state
active+remapped, last acting [58,5]
pg 2.782 is stuck unclean for 64878.002464, current state
active+remapped, last acting [76,31]
recovery 16357/66089446 objects misplaced (0.025%)

I attempted a pg repair on both of the pg's listed above, but it
doesn't look like anything is happening. The doc's reference an
inconsistent state as a use case for the repair command so that's
likely why.

These 2 pg's have been the issue throughout this process so how can I
dig deeper to figure out what the problem is?

# ceph pg 2.e7f query: http://pastebin.com/jMMsbsjS
# ceph pg 2.e7f query: http://pastebin.com/0ntBfFK5


On Wed, Aug 12, 2015 at 6:52 PM, yangyongp...@bwstor.com.cn
yangyongp...@bwstor.com.cn wrote:
 You can try ceph pg repair pg_idto repair the unhealth pg.ceph health
 detail command is very useful to detect unhealth pgs.

 
 yangyongp...@bwstor.com.cn


 From: Steve Dainard
 Date: 2015-08-12 23:48
 To: ceph-users
 Subject: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1
 active+remapped
 I ran a ceph osd reweight-by-utilization yesterday and partway through
 had a network interruption. After the network was restored the cluster
 continued to rebalance but this morning the cluster has stopped
 rebalance and status will not change from:

 # ceph status
 cluster af859ff1-c394-4c9a-95e2-0e0e4c87445c
  health HEALTH_WARN
 1 pgs degraded
 1 pgs stuck degraded
 2 pgs stuck unclean
 1 pgs stuck undersized
 1 pgs undersized
 recovery 8163/66089054 objects degraded (0.012%)
 recovery 8194/66089054 objects misplaced (0.012%)
  monmap e24: 3 mons at
 {mon1=10.0.231.53:6789/0,mon2=10.0.231.54:6789/0,mon3=10.0.231.55:6789/0}
 election epoch 250, quorum 0,1,2 mon1,mon2,mon3
  osdmap e184486: 100 osds: 100 up, 100 in; 1 remapped pgs
   pgmap v3010985: 4144 pgs, 7 pools, 125 TB data, 32270 kobjects
 251 TB used, 111 TB / 363 TB avail
 8163/66089054 objects degraded (0.012%)
 8194/66089054 objects misplaced (0.012%)
 4142 active+clean
1 active+undersized+degraded
1 active+remapped


 # ceph health detail
 HEALTH_WARN 1 pgs degraded; 

Re: [ceph-users] Cache tier best practices

2015-08-13 Thread Nick Fisk
Use the order parameter when creating an RBD 22=4MB, 20=1MB
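
For example (image names and sizes are placeholders):

rbd create --size 10240 --order 20 rbd/cache-test-1m    # 1MB objects
rbd create --size 10240 --order 19 rbd/cache-test-512k  # 512KB objects

The order is fixed at creation time, so existing images have to be recreated (or
copied into a new image) to change it.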

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Vickey 
Singh
Sent: 13 August 2015 09:31
To: Nick Fisk n...@fisk.me.uk
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] Cache tier best practices

 

Thanks Nick for your suggestion.

 

Can you also tell how i can reduce RBD block size to 512K or 1M , do i need to 
put something in clients ceph.conf  ( what parameter i need to set )

 

Thanks once again

 

- Vickey

 

On Wed, Aug 12, 2015 at 4:49 PM, Nick Fisk n...@fisk.me.uk 
mailto:n...@fisk.me.uk  wrote:

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
 mailto:ceph-users-boun...@lists.ceph.com ] On Behalf Of
 Dominik Zalewski
 Sent: 12 August 2015 14:40
 To: ceph-us...@ceph.com mailto:ceph-us...@ceph.com 
 Subject: [ceph-users] Cache tier best practices

 Hi,

 I would like to hear from people who use cache tier in Ceph about best
 practices and things I should avoid.

 I remember hearing that it wasn't that stable back then. Has it changed in
 Hammer release?

It's not so much the stability, but the performance. If your working set will 
sit mostly in the cache tier and won't tend to change then you might be 
alright. Otherwise you will find that performance is very poor.

Only tip I can really give is that I have found dropping the RBD block size 
down to 512kb-1MB helps quite a bit as it makes the cache more effective and 
also minimises the amount of data transferred on each promotion/flush.


 Any tips and tricks are much appreciated!

 Thanks

 Dominik




___
ceph-users mailing list
ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier best practices

2015-08-13 Thread Bill Sanders
I think you're looking for this.

http://ceph.com/docs/master/man/8/rbd/#cmdoption-rbd--order

It's used when you create the RBD images.  1MB is order=20, 512 is order=19.

Thanks,
Bill Sanders


On Thu, Aug 13, 2015 at 1:31 AM, Vickey Singh vickey.singh22...@gmail.com
wrote:

 Thanks Nick for your suggestion.

 Can you also tell how i can reduce RBD block size to 512K or 1M , do i
 need to put something in clients ceph.conf  ( what parameter i need to set )

 Thanks once again

 - Vickey

 On Wed, Aug 12, 2015 at 4:49 PM, Nick Fisk n...@fisk.me.uk wrote:

  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
 Of
  Dominik Zalewski
  Sent: 12 August 2015 14:40
  To: ceph-us...@ceph.com
  Subject: [ceph-users] Cache tier best practices
 
  Hi,
 
  I would like to hear from people who use cache tier in Ceph about best
  practices and things I should avoid.
 
  I remember hearing that it wasn't that stable back then. Has it changed
 in
  Hammer release?

 It's not so much the stability, but the performance. If your working set
 will sit mostly in the cache tier and won't tend to change then you might
 be alright. Otherwise you will find that performance is very poor.

 Only tip I can really give is that I have found dropping the RBD block
 size down to 512kb-1MB helps quite a bit as it makes the cache more
 effective and also minimises the amount of data transferred on each
 promotion/flush.

 
  Any tips and tricks are much appreciated!
 
  Thanks
 
  Dominik




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to improve single thread sequential reads?

2015-08-13 Thread Nick Fisk
Hi,

 

I'm trying to use an RBD to act as a staging area for some data before
pushing it down to some LTO6 tapes. As I cannot use striping with the kernel
client, I tend to max out at around 80MB/s reads testing with dd. Has
anyone got any clever suggestions for giving this a bit of a boost? I think I
need to get it up to around 200MB/s to make sure there is always a steady
flow of data to the tape drive.

 

Rbd-fuse seems to top out at 12MB/s, so there goes that option.

 

I'm thinking mapping multiple RBD's and then combining them into a mdadm
RAID0 stripe might work, but seems a bit messy.
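
A rough sketch of that idea, assuming four already-created and mapped images (names
and sizes are placeholders):

rbd map rbd/stage0; rbd map rbd/stage1; rbd map rbd/stage2; rbd map rbd/stage3
mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=4096 /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
echo 16384 > /sys/block/md0/queue/read_ahead_kb   # bigger read-ahead also helps single-stream reads

Raising read_ahead_kb on a plain /dev/rbdX device is worth trying on its own before
going the md route.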

 

Any suggestions?

 

Thanks,

Nick




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD space imbalance

2015-08-13 Thread Vedran Furač
On 13.08.2015 18:01, GuangYang wrote:
 Try 'ceph osd reweight-by-pg int' right after creating the pools?

Would it do any good now that the pool is in use and nearly full, as I can't
re-create it now? Also, what's the integer argument in the command
above? I failed to find a proper explanation in the docs.

 What is the typical object size in the cluster?

Around 50 MB.


Thanks,
Vedran

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped

2015-08-13 Thread GuangYang
I don't see anything obvious, sorry..

Looks like something with osd.{5, 76, 38}, which are absent from the *up* set 
even though they are up. How about increasing the log level ('debug_osd = 20') on osd.76 
and restarting the OSD?
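
For example, at runtime:

ceph tell osd.76 injectargs '--debug_osd 20'

or persistently by adding debug osd = 20 under [osd.76] in ceph.conf on that host
before restarting it, so the peering during startup is captured in the log.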

Thanks,
Guang



 Date: Thu, 13 Aug 2015 09:10:31 -0700
 Subject: Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 
 active+remapped
 From: sdain...@spd1.com
 To: yguan...@outlook.com
 CC: yangyongp...@bwstor.com.cn; ceph-users@lists.ceph.com

 OSD tree: http://pastebin.com/3z333DP4
 Crushmap: http://pastebin.com/DBd9k56m

 I realize these nodes are quite large, I have plans to break them out
 into 12 OSD's/node.

 On Thu, Aug 13, 2015 at 9:02 AM, GuangYang yguan...@outlook.com wrote:
 Could you share the 'ceph osd tree dump' and CRUSH map dump ?

 Thanks,
 Guang


 
 Date: Thu, 13 Aug 2015 08:16:09 -0700
 From: sdain...@spd1.com
 To: yangyongp...@bwstor.com.cn; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Cluster health_warn 1 
 active+undersized+degraded/1 active+remapped

 I decided to set OSD 76 out and let the cluster shuffle the data off
 that disk and then brought the OSD back in. For the most part this
 seemed to be working, but then I had 1 object degraded and 88xxx
 objects misplaced:

 # ceph health detail
 HEALTH_WARN 11 pgs stuck unclean; recovery 1/66089446 objects degraded
 (0.000%); recovery 88844/66089446 objects misplaced (0.134%)
 pg 2.e7f is stuck unclean for 88398.251351, current state
 active+remapped, last acting [58,5]
 pg 2.143 is stuck unclean for 13892.364101, current state
 active+remapped, last acting [16,76]
 pg 2.968 is stuck unclean for 13892.363521, current state
 active+remapped, last acting [44,76]
 pg 2.5f8 is stuck unclean for 13892.377245, current state
 active+remapped, last acting [17,76]
 pg 2.81c is stuck unclean for 13892.363443, current state
 active+remapped, last acting [25,76]
 pg 2.1a3 is stuck unclean for 13892.364400, current state
 active+remapped, last acting [16,76]
 pg 2.2cb is stuck unclean for 13892.374390, current state
 active+remapped, last acting [14,76]
 pg 2.d41 is stuck unclean for 13892.373636, current state
 active+remapped, last acting [27,76]
 pg 2.3f9 is stuck unclean for 13892.373147, current state
 active+remapped, last acting [35,76]
 pg 2.a62 is stuck unclean for 86283.741920, current state
 active+remapped, last acting [2,38]
 pg 2.1b0 is stuck unclean for 13892.363268, current state
 active+remapped, last acting [3,76]
 recovery 1/66089446 objects degraded (0.000%)
 recovery 88844/66089446 objects misplaced (0.134%)

 I say apparently because with one object degraded, none of the pg's
 are showing degraded:
 # ceph pg dump_stuck degraded
 ok

 # ceph pg dump_stuck unclean
 ok
 pg_stat state up up_primary acting acting_primary
 2.e7f active+remapped [58] 58 [58,5] 58
 2.143 active+remapped [16] 16 [16,76] 16
 2.968 active+remapped [44] 44 [44,76] 44
 2.5f8 active+remapped [17] 17 [17,76] 17
 2.81c active+remapped [25] 25 [25,76] 25
 2.1a3 active+remapped [16] 16 [16,76] 16
 2.2cb active+remapped [14] 14 [14,76] 14
 2.d41 active+remapped [27] 27 [27,76] 27
 2.3f9 active+remapped [35] 35 [35,76] 35
 2.a62 active+remapped [2] 2 [2,38] 2
 2.1b0 active+remapped [3] 3 [3,76] 3

 All of the OSD filesystems are below 85% full.

 I then compared a 0.94.2 cluster that was new and had not been updated
 (current cluster is 0.94.2 which had been updated a couple times) and
 noticed the crush map had 'tunable straw_calc_version 1' so I added it
 to the current cluster.

 After the data moved around for about 8 hours or so I'm left with this 
 state:

 # ceph health detail
 HEALTH_WARN 2 pgs stuck unclean; recovery 16357/66089446 objects
 misplaced (0.025%)
 pg 2.e7f is stuck unclean for 149422.331848, current state
 active+remapped, last acting [58,5]
 pg 2.782 is stuck unclean for 64878.002464, current state
 active+remapped, last acting [76,31]
 recovery 16357/66089446 objects misplaced (0.025%)

 I attempted a pg repair on both of the pg's listed above, but it
 doesn't look like anything is happening. The doc's reference an
 inconsistent state as a use case for the repair command so that's
 likely why.

 These 2 pg's have been the issue throughout this process so how can I
 dig deeper to figure out what the problem is?

 # ceph pg 2.e7f query: http://pastebin.com/jMMsbsjS
 # ceph pg 2.e7f query: http://pastebin.com/0ntBfFK5


 On Wed, Aug 12, 2015 at 6:52 PM, yangyongp...@bwstor.com.cn
 yangyongp...@bwstor.com.cn wrote:
 You can try ceph pg repair pg_idto repair the unhealth pg.ceph health
 detail command is very useful to detect unhealth pgs.

 
 yangyongp...@bwstor.com.cn


 From: Steve Dainard
 Date: 2015-08-12 23:48
 To: ceph-users
 Subject: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1
 active+remapped
 I ran a ceph osd reweight-by-utilization yesterday and partway 

Re: [ceph-users] OSD space imbalance

2015-08-13 Thread GuangYang
There are three factors that impact disk utilization of an OSD:
 1. number of PGs on the OSD (determined by CRUSH)
 2. number of objects within each PG (better to pick a power-of-two PG number to make 
this one more even)
 3. object size deviation

With 'ceph osd reweight-by-pg', you can tune (1). And if you would like to get 
a better understanding of the root cause in your cluster, you can find 
more information in 'ceph pg dump', which gives you the raw data for (1) and 
(2).

Once the cluster is filled, you probably have to go with 'ceph osd 
reweight-by-utilization'; be careful with that, since it could incur lots of data 
movement...
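
As a hedged example of the first knob (the integer is an oversubscription threshold
in percent, so 110 reweights OSDs that are more than 10% over the mean; the pool name
is a placeholder):

ceph osd reweight-by-pg 110
ceph osd reweight-by-pg 110 <poolname>    # restrict the calculation to specific pools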


 To: ceph-users@lists.ceph.com
 From: vedran.fu...@gmail.com
 Date: Fri, 14 Aug 2015 00:15:17 +0200
 Subject: Re: [ceph-users] OSD space imbalance

 On 13.08.2015 18:01, GuangYang wrote:
 Try 'ceph osd reweight-by-pg int' right after creating the pools?

 Would it do any good now when pool is in use and nearly full as I can't
 re-create it now. Also, what's the integer argument in the command
 above? I failed to find proper explanation in the docs.
Please check it out here - 
https://github.com/ceph/ceph/blob/master/src/mon/OSDMonitor.cc#L469

 What is the typical object size in the cluster?

 Around 50 MB.


 Thanks,
 Vedran

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph distributed osd

2015-08-13 Thread yangyongp...@bwstor.com.cn
You can add one or more OSDs, and ceph will balance the distribution of PGs. The data 
will not run out of room as long as you have big enough space in the cluster.



yangyongp...@bwstor.com.cn
 
From: gjprabu
Date: 2015-08-13 22:42
To: ceph-users
CC: Kamala Subramani; Siva Sokkumuthu
Subject: [ceph-users] ceph distributed osd
Dear Team,

 We are using two ceph OSD with replica 2 and it is working properly. 
Here my doubt is (Pool A -image size will be 10GB) and its replicated with two 
OSD, what will happen suppose if the size reached the limit, Is there any 
chance to make the data to continue writing in another two OSD's.

Regards
Prabu





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd map pool object question / bug?

2015-08-13 Thread Steven McDonald
Hi Goncalo,

On Fri, 14 Aug 2015 13:30:35 +1000
Goncalo Borges gonc...@physics.usyd.edu.au wrote:

 Is this expected? Are those PGs actually assigned to something it
 does not exists?

Objects are mapped to PGs algorithmically, based on their names. You can
think of that result as telling you where the object *would* be placed
if it were created.
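
A quick, hedged way to see the distinction (the second object name is hypothetical):

ceph osd map cephfs_dt thisobjectdoesnotexist   # computes placement from the name alone
rados -p cephfs_dt stat thisobjectdoesnotexist  # actually checks the pool; fails here
rados -p cephfs_dt stat someexistingobject      # reports size/mtime once an object exists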
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph osd map pool object question / bug?

2015-08-13 Thread Goncalo Borges

Hi Ceph gurus...

I am using 0.94.2 in all my Ceph / CephFS installations.

While trying to understand how files are translated into objects, it 
seems that 'ceph osd map' returns a valid answer even for objects that 
do not exist.



   # ceph osd map cephfs_dt thisobjectdoesnotexist
   osdmap e341 pool 'cephfs_dt' (5) object 'thisobjectdoesnotexist' ->
   pg 5.28aa7f5a (5.35a) -> up ([24,21,15], p24) acting ([24,21,15], p24)


Is this expected? Are those PGs actually assigned to something that does 
not exist?


Cheers
Goncalo

--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] teuthology: running create_nodes.py will be hanged

2015-08-13 Thread Songbo Wang
Hi,

When setting up teuthology in my own environment, I found a problem as
follows:
In the file teuthology/__init__.py, when gevent.monkey is imported, it
conflicts with paramiko, and if
create_nodes.py is used to connect to the paddles/pulpito node, it
hangs.

 root@ubunut4:~/src/teuthology_master# git diff
teuthology/__init__.py
 diff --git a/teuthology/__init__.py b/teuthology/__init__.py
 index d0bcfc0..b34cf4e 100644
 --- a/teuthology/__init__.py
 +++ b/teuthology/__init__.py
 @@ -1,5 +1,5 @@
 -from gevent import monkey
 -monkey.patch_all(dns=False)
 +#from gevent import monkey
 +#monkey.patch_all(dns=False)
 from .orchestra import monkey
 monkey.patch_all()

After this modification, everything looks fine, so I am wondering whether this is a bug.
Any reply will be highly appreciated.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped

2015-08-13 Thread Steve Dainard
OSD tree: http://pastebin.com/3z333DP4
Crushmap: http://pastebin.com/DBd9k56m

I realize these nodes are quite large, I have plans to break them out
into 12 OSD's/node.

On Thu, Aug 13, 2015 at 9:02 AM, GuangYang yguan...@outlook.com wrote:
 Could you share the 'ceph osd tree dump' and CRUSH map dump ?

 Thanks,
 Guang


 
 Date: Thu, 13 Aug 2015 08:16:09 -0700
 From: sdain...@spd1.com
 To: yangyongp...@bwstor.com.cn; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 
 active+remapped

 I decided to set OSD 76 out and let the cluster shuffle the data off
 that disk and then brought the OSD back in. For the most part this
 seemed to be working, but then I had 1 object degraded and 88xxx
 objects misplaced:

 # ceph health detail
 HEALTH_WARN 11 pgs stuck unclean; recovery 1/66089446 objects degraded
 (0.000%); recovery 88844/66089446 objects misplaced (0.134%)
 pg 2.e7f is stuck unclean for 88398.251351, current state
 active+remapped, last acting [58,5]
 pg 2.143 is stuck unclean for 13892.364101, current state
 active+remapped, last acting [16,76]
 pg 2.968 is stuck unclean for 13892.363521, current state
 active+remapped, last acting [44,76]
 pg 2.5f8 is stuck unclean for 13892.377245, current state
 active+remapped, last acting [17,76]
 pg 2.81c is stuck unclean for 13892.363443, current state
 active+remapped, last acting [25,76]
 pg 2.1a3 is stuck unclean for 13892.364400, current state
 active+remapped, last acting [16,76]
 pg 2.2cb is stuck unclean for 13892.374390, current state
 active+remapped, last acting [14,76]
 pg 2.d41 is stuck unclean for 13892.373636, current state
 active+remapped, last acting [27,76]
 pg 2.3f9 is stuck unclean for 13892.373147, current state
 active+remapped, last acting [35,76]
 pg 2.a62 is stuck unclean for 86283.741920, current state
 active+remapped, last acting [2,38]
 pg 2.1b0 is stuck unclean for 13892.363268, current state
 active+remapped, last acting [3,76]
 recovery 1/66089446 objects degraded (0.000%)
 recovery 88844/66089446 objects misplaced (0.134%)

 I say 'seemed' because, with one object degraded, none of the pgs
 are showing as degraded:
 # ceph pg dump_stuck degraded
 ok

 # ceph pg dump_stuck unclean
 ok
 pg_stat state up up_primary acting acting_primary
 2.e7f active+remapped [58] 58 [58,5] 58
 2.143 active+remapped [16] 16 [16,76] 16
 2.968 active+remapped [44] 44 [44,76] 44
 2.5f8 active+remapped [17] 17 [17,76] 17
 2.81c active+remapped [25] 25 [25,76] 25
 2.1a3 active+remapped [16] 16 [16,76] 16
 2.2cb active+remapped [14] 14 [14,76] 14
 2.d41 active+remapped [27] 27 [27,76] 27
 2.3f9 active+remapped [35] 35 [35,76] 35
 2.a62 active+remapped [2] 2 [2,38] 2
 2.1b0 active+remapped [3] 3 [3,76] 3

 All of the OSD filesystems are below 85% full.

 I then compared a 0.94.2 cluster that was new and had not been updated
 (current cluster is 0.94.2 which had been updated a couple times) and
 noticed the crush map had 'tunable straw_calc_version 1' so I added it
 to the current cluster.
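
 For anyone wanting to add that line themselves, a minimal sketch using
 the usual getcrushmap/crushtool round-trip (file names are just
 placeholders):

 # ceph osd getcrushmap -o crush.bin
 # crushtool -d crush.bin -o crush.txt
 (edit crush.txt, add "tunable straw_calc_version 1" to the tunables
 section at the top, then recompile and inject the map)
 # crushtool -c crush.txt -o crush.new
 # ceph osd setcrushmap -i crush.new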

 After the data moved around for about 8 hours or so I'm left with this state:

 # ceph health detail
 HEALTH_WARN 2 pgs stuck unclean; recovery 16357/66089446 objects
 misplaced (0.025%)
 pg 2.e7f is stuck unclean for 149422.331848, current state
 active+remapped, last acting [58,5]
 pg 2.782 is stuck unclean for 64878.002464, current state
 active+remapped, last acting [76,31]
 recovery 16357/66089446 objects misplaced (0.025%)

 I attempted a pg repair on both of the pgs listed above, but it
 doesn't look like anything is happening. The docs reference an
 inconsistent state as a use case for the repair command, so that's
 likely why.

 These 2 pgs have been the issue throughout this process, so how can I
 dig deeper to figure out what the problem is?

 # ceph pg 2.e7f query: http://pastebin.com/jMMsbsjS
 # ceph pg 2.782 query: http://pastebin.com/0ntBfFK5


 On Wed, Aug 12, 2015 at 6:52 PM, yangyongp...@bwstor.com.cn
 yangyongp...@bwstor.com.cn wrote:
 You can try 'ceph pg repair <pg_id>' to repair the unhealthy pg. The
 'ceph health detail' command is very useful for detecting unhealthy pgs.

 
 yangyongp...@bwstor.com.cn


 From: Steve Dainard
 Date: 2015-08-12 23:48
 To: ceph-users
 Subject: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1
 active+remapped
 I ran a ceph osd reweight-by-utilization yesterday and partway through
 had a network interruption. After the network was restored the cluster
 continued to rebalance, but this morning the cluster has stopped
 rebalancing and the status will not change from:

 # ceph status
 cluster af859ff1-c394-4c9a-95e2-0e0e4c87445c
 health HEALTH_WARN
 1 pgs degraded
 1 pgs stuck degraded
 2 pgs stuck unclean
 1 pgs stuck undersized
 1 pgs undersized
 recovery 8163/66089054 objects degraded (0.012%)
 recovery 8194/66089054 objects misplaced (0.012%)
 monmap e24: 3 mons at
 

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-13 Thread Nick Fisk
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Nick Fisk
 Sent: 13 August 2015 18:04
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] How to improve single thread sequential reads?
 
 Hi,
 
 I'm trying to use an RBD to act as a staging area for some data before
 pushing it down to some LTO6 tapes. As I cannot use striping with the
 kernel client, I tend to max out at around 80MB/s reads testing with dd.
 Has anyone got any clever suggestions for giving this a bit of a boost?
 I think I need to get it up to around 200MB/s to make sure there is
 always a steady flow of data to the tape drive.

I've just tried the testing kernel with the blk-mq fixes in it for full-size
IOs; this, combined with bumping readahead up to 4MB, is now getting me on
average 150MB/s to 200MB/s, so this might suffice.
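
In case it helps, a minimal sketch of the readahead change, assuming the
image is mapped as /dev/rbd0 (the device name is illustrative):

# echo 4096 > /sys/block/rbd0/queue/read_ahead_kb
(the value is in KB, so 4096 = 4MB; the equivalent via blockdev would be
"blockdev --setra 8192 /dev/rbd0", where the value is in 512-byte sectors)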

Out of personal interest, I would still like to know if anyone has ideas on
how to really push much higher bandwidth through an RBD.
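
Regarding the mdadm RAID0 idea in the original message below, a rough
sketch of what that could look like (pool, image and device names are
made up; the chunk size is in KB):

# rbd map rbd/stage1
# rbd map rbd/stage2
# rbd map rbd/stage3
# rbd map rbd/stage4
# mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=4096 \
    /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
# mkfs.xfs /dev/md0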

 
 Rbd-fuse seems to top out at 12MB/s, so there goes that option.
 
 I'm thinking mapping multiple RBDs and then combining them into an mdadm
 RAID0 stripe might work, but it seems a bit messy.
 
 Any suggestions?
 
 Thanks,
 Nick
 


 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD space imbalance

2015-08-13 Thread GuangYang
Try 'ceph osd reweight-by-pg <int>' right after creating the pools? What is
the typical object size in the cluster?
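
For example (the threshold here is illustrative; 120 is the default, and
OSDs holding more PGs than that percentage of the average are reweighted
down):

# ceph osd reweight-by-pg 110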


Thanks,
Guang



 To: ceph-users@lists.ceph.com
 From: vedran.fu...@gmail.com
 Date: Thu, 13 Aug 2015 14:58:11 +0200
 Subject: [ceph-users] OSD space imbalance

 Hello,

 I'm having an issue where disk usage between OSDs isn't well balanced,
 which causes disk space to be wasted. Ceph is the latest 0.94.2, used
 exclusively through CephFS. Re-weighting helps, but only slightly, and
 it has to be done on a daily basis, causing constant refills. In the end
 I get one OSD at 65% usage with some others going over 90%. I also set
 'ceph osd crush tunables optimal', but I didn't notice any change in
 disk usage. Is there anything I can do to get them within a 10% range
 at least?

 health HEALTH_OK
 mdsmap e2577: 1/1/1 up, 2 up:standby
 osdmap e25239: 48 osds: 48 up, 48 in
 pgmap v3188836: 5184 pgs, 3 pools, 18028 GB data, 6385 kobjects
 36156 GB used, 9472 GB / 45629 GB avail
 5184 active+clean


 ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR
 37 0.92999 1.0 950G 625G 324G 65.85 0.83
 21 0.92999 1.0 950G 649G 300G 68.35 0.86
 32 0.92999 1.0 950G 670G 279G 70.58 0.89
 7 0.92999 1.0 950G 676G 274G 71.11 0.90
 17 0.92999 1.0 950G 681G 268G 71.73 0.91
 40 0.92999 1.0 950G 689G 260G 72.55 0.92
 20 0.92999 1.0 950G 690G 260G 72.62 0.92
 25 0.92999 1.0 950G 691G 258G 72.76 0.92
 2 0.92999 1.0 950G 694G 256G 73.03 0.92
 39 0.92999 1.0 950G 697G 253G 73.35 0.93
 18 0.92999 1.0 950G 703G 247G 74.00 0.93
 47 0.92999 1.0 950G 703G 246G 74.05 0.93
 23 0.92999 0.86693 950G 704G 245G 74.14 0.94
 6 0.92999 1.0 950G 726G 224G 76.39 0.96
 8 0.92999 1.0 950G 727G 223G 76.54 0.97
 5 0.92999 1.0 950G 728G 222G 76.62 0.97
 35 0.92999 1.0 950G 728G 221G 76.66 0.97
 11 0.92999 1.0 950G 730G 220G 76.82 0.97
 43 0.92999 1.0 950G 730G 219G 76.87 0.97
 33 0.92999 1.0 950G 734G 215G 77.31 0.98
 38 0.92999 1.0 950G 736G 214G 77.49 0.98
 12 0.92999 1.0 950G 737G 212G 77.61 0.98
 31 0.92999 0.85184 950G 742G 208G 78.09 0.99
 28 0.92999 1.0 950G 745G 205G 78.41 0.99
 27 0.92999 1.0 950G 751G 199G 79.04 1.00
 10 0.92999 1.0 950G 754G 195G 79.40 1.00
 13 0.92999 1.0 950G 762G 188G 80.21 1.01
 9 0.92999 1.0 950G 763G 187G 80.29 1.01
 16 0.92999 1.0 950G 764G 186G 80.37 1.01
 0 0.92999 1.0 950G 778G 171G 81.94 1.03
 3 0.92999 1.0 950G 780G 170G 82.11 1.04
 41 0.92999 1.0 950G 780G 169G 82.13 1.04
 34 0.92999 0.87303 950G 783G 167G 82.43 1.04
 14 0.92999 1.0 950G 784G 165G 82.56 1.04
 42 0.92999 1.0 950G 786G 164G 82.70 1.04
 46 0.92999 1.0 950G 788G 162G 82.93 1.05
 30 0.92999 1.0 950G 790G 160G 83.12 1.05
 45 0.92999 1.0 950G 804G 146G 84.59 1.07
 44 0.92999 1.0 950G 807G 143G 84.92 1.07
 1 0.92999 1.0 950G 817G 132G 86.05 1.09
 22 0.92999 1.0 950G 825G 125G 86.81 1.10
 15 0.92999 1.0 950G 826G 123G 86.97 1.10
 19 0.92999 1.0 950G 829G 120G 87.30 1.10
 36 0.92999 1.0 950G 831G 119G 87.48 1.10
 24 0.92999 1.0 950G 831G 118G 87.50 1.10
 26 0.92999 1.0 950G 851G 101692M 89.55 1.13
 29 0.92999 1.0 950G 851G 101341M 89.59 1.13
 4 0.92999 1.0 950G 860G 92164M 90.53 1.14
 MIN/MAX VAR: 0.83/1.14 STDDEV: 5.94
 TOTAL 45629G 36156G 9473G 79.24

 Thanks,
 Vedran


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped

2015-08-13 Thread GuangYang
Could you share the 'ceph osd tree' dump and the CRUSH map dump?

Thanks,
Guang



 Date: Thu, 13 Aug 2015 08:16:09 -0700
 From: sdain...@spd1.com
 To: yangyongp...@bwstor.com.cn; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 
 active+remapped

 I decided to set OSD 76 out and let the cluster shuffle the data off
 that disk and then brought the OSD back in. For the most part this
 seemed to be working, but then I had 1 object degraded and 88xxx
 objects misplaced:

 # ceph health detail
 HEALTH_WARN 11 pgs stuck unclean; recovery 1/66089446 objects degraded
 (0.000%); recovery 88844/66089446 objects misplaced (0.134%)
 pg 2.e7f is stuck unclean for 88398.251351, current state
 active+remapped, last acting [58,5]
 pg 2.143 is stuck unclean for 13892.364101, current state
 active+remapped, last acting [16,76]
 pg 2.968 is stuck unclean for 13892.363521, current state
 active+remapped, last acting [44,76]
 pg 2.5f8 is stuck unclean for 13892.377245, current state
 active+remapped, last acting [17,76]
 pg 2.81c is stuck unclean for 13892.363443, current state
 active+remapped, last acting [25,76]
 pg 2.1a3 is stuck unclean for 13892.364400, current state
 active+remapped, last acting [16,76]
 pg 2.2cb is stuck unclean for 13892.374390, current state
 active+remapped, last acting [14,76]
 pg 2.d41 is stuck unclean for 13892.373636, current state
 active+remapped, last acting [27,76]
 pg 2.3f9 is stuck unclean for 13892.373147, current state
 active+remapped, last acting [35,76]
 pg 2.a62 is stuck unclean for 86283.741920, current state
 active+remapped, last acting [2,38]
 pg 2.1b0 is stuck unclean for 13892.363268, current state
 active+remapped, last acting [3,76]
 recovery 1/66089446 objects degraded (0.000%)
 recovery 88844/66089446 objects misplaced (0.134%)

 I say 'seemed' because, with one object degraded, none of the pgs
 are showing as degraded:
 # ceph pg dump_stuck degraded
 ok

 # ceph pg dump_stuck unclean
 ok
 pg_stat state up up_primary acting acting_primary
 2.e7f active+remapped [58] 58 [58,5] 58
 2.143 active+remapped [16] 16 [16,76] 16
 2.968 active+remapped [44] 44 [44,76] 44
 2.5f8 active+remapped [17] 17 [17,76] 17
 2.81c active+remapped [25] 25 [25,76] 25
 2.1a3 active+remapped [16] 16 [16,76] 16
 2.2cb active+remapped [14] 14 [14,76] 14
 2.d41 active+remapped [27] 27 [27,76] 27
 2.3f9 active+remapped [35] 35 [35,76] 35
 2.a62 active+remapped [2] 2 [2,38] 2
 2.1b0 active+remapped [3] 3 [3,76] 3

 All of the OSD filesystems are below 85% full.

 I then compared a 0.94.2 cluster that was new and had not been updated
 (current cluster is 0.94.2 which had been updated a couple times) and
 noticed the crush map had 'tunable straw_calc_version 1' so I added it
 to the current cluster.

 After the data moved around for about 8 hours or so I'm left with this state:

 # ceph health detail
 HEALTH_WARN 2 pgs stuck unclean; recovery 16357/66089446 objects
 misplaced (0.025%)
 pg 2.e7f is stuck unclean for 149422.331848, current state
 active+remapped, last acting [58,5]
 pg 2.782 is stuck unclean for 64878.002464, current state
 active+remapped, last acting [76,31]
 recovery 16357/66089446 objects misplaced (0.025%)

 I attempted a pg repair on both of the pgs listed above, but it
 doesn't look like anything is happening. The docs reference an
 inconsistent state as a use case for the repair command, so that's
 likely why.

 These 2 pgs have been the issue throughout this process, so how can I
 dig deeper to figure out what the problem is?

 # ceph pg 2.e7f query: http://pastebin.com/jMMsbsjS
 # ceph pg 2.782 query: http://pastebin.com/0ntBfFK5


 On Wed, Aug 12, 2015 at 6:52 PM, yangyongp...@bwstor.com.cn
 yangyongp...@bwstor.com.cn wrote:
 You can try 'ceph pg repair <pg_id>' to repair the unhealthy pg. The
 'ceph health detail' command is very useful for detecting unhealthy pgs.

 
 yangyongp...@bwstor.com.cn


 From: Steve Dainard
 Date: 2015-08-12 23:48
 To: ceph-users
 Subject: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1
 active+remapped
 I ran a ceph osd reweight-by-utilization yesterday and partway through
 had a network interruption. After the network was restored the cluster
 continued to rebalance, but this morning the cluster has stopped
 rebalancing and the status will not change from:

 # ceph status
 cluster af859ff1-c394-4c9a-95e2-0e0e4c87445c
 health HEALTH_WARN
 1 pgs degraded
 1 pgs stuck degraded
 2 pgs stuck unclean
 1 pgs stuck undersized
 1 pgs undersized
 recovery 8163/66089054 objects degraded (0.012%)
 recovery 8194/66089054 objects misplaced (0.012%)
 monmap e24: 3 mons at
 {mon1=10.0.231.53:6789/0,mon2=10.0.231.54:6789/0,mon3=10.0.231.55:6789/0}
 election epoch 250, quorum 0,1,2 mon1,mon2,mon3
 osdmap e184486: 100 osds: 100 up, 100 in; 1 remapped pgs
 pgmap v3010985: 4144 pgs, 7 pools, 125 TB data, 32270 kobjects
 251 TB used, 111 TB / 363 TB avail