Re: [ceph-users] Re: half performance with keyvalue backend in 0.87

2014-11-10 Thread 廖建锋
Haomai Wang,
   Do you have progress on this performance issue?



From: Haomai Wang <haomaiw...@gmail.com>
Sent: 2014-10-31 10:05
To: 廖建锋 <de...@f-club.cn>
Cc: ceph-users <ceph-users-boun...@lists.ceph.com>; 
ceph-users <ceph-users@lists.ceph.com>
Subject: Re: Re: [ceph-users] Re: half performance with keyvalue backend in 0.87

ok. I will explore it.

On Fri, Oct 31, 2014 at 10:03 AM, 廖建锋 de...@f-club.cn wrote:
 I am not sure if it is seq or random; I just use rsync to copy millions of small
 pic files from our PC servers to the Ceph cluster.

 From: Haomai Wang
 Sent: 2014-10-31 09:59
 To: 廖建锋
 Cc: ceph-users; ceph-users
 Subject: Re: Re: [ceph-users] Re: half performance with keyvalue backend in 0.87

 Thanks. Recently I have mainly focused on rbd performance for it (random small
 writes).

 I want to know about your test situation. Is it sequential write?

 On Fri, Oct 31, 2014 at 9:48 AM, 廖建锋 de...@f-club.cn wrote:
 What I can tell is:
   in 0.87, the OSDs are writing under 10 MB/s, but IO utilization is about 95%
   in 0.80.6, the OSDs are writing about 20 MB/s, but IO utilization is about 30%

 iostat  -mx 2 with 0.87

 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
 sdb 0.00 43.00 9.00 85.50 0.95 1.18 46.14 1.36 14.49 10.01 94.55
 sdc 0.00 37.50 6.00 99.00 0.62 10.01 207.31 2.24 21.31 9.33 97.95
 sda 0.00 3.50 0.00 1.00 0.00 0.02 36.00 0.02 17.50 17.50 1.75

 avg-cpu: %user %nice %system %iowait %steal %idle
 3.16 0.00 1.01 17.45 0.00 78.38

 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
 sdb 0.00 36.50 0.00 47.50 0.00 1.09 47.07 0.82 17.17 16.71 79.35
 sdc 0.00 25.00 15.00 77.50 1.26 0.65 42.34 1.73 18.72 10.70 99.00
 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

 From: Haomai Wang
 Sent: 2014-10-31 09:40
 To: 廖建锋
 Cc: ceph-users; ceph-users
 Subject: Re: [ceph-users] Re: half performance with keyvalue backend in 0.87

 Yes, there was a persistence problem in 0.80.6 and we fixed it in Giant.
 But in Giant, other performance optimizations have been applied. Could
 you tell us more about your tests?

 On Fri, Oct 31, 2014 at 8:27 AM, 廖建锋 de...@f-club.cn wrote:
 Also found another problem: the ceph osd directory has millions of small
 files, which will cause performance issues.

 1008 = # pwd
 /var/lib/ceph/osd/ceph-8/current

 1007 = # ls |wc -l
 21451

 From: ceph-users
 Sent: 2014-10-31 08:23
 To: ceph-users
 Subject: [ceph-users] half performance with keyvalue backend in 0.87
 Dear Ceph,
   I used the keyvalue backend in 0.80.6 and 0.80.7; the average speed with
 rsync on millions of small files is 10 MB/second.
 When I upgraded to 0.87 (Giant), the speed slowed down to 5 MB/second. I
 don't know why; is there any tuning option for this?
 Will the superblock cause this performance slowdown?




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




 --
 Best Regards,

 Wheat



 --
 Best Regards,

 Wheat



--
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds isn't working anymore after osd's running full

2014-11-10 Thread Jasper Siero
Hello Greg and John,

Thanks for solving the bug. I will compile the patch and make new rpm packages 
and test it on the Ceph cluster. I will let you know what the results are.

Kind regards,

Jasper

From: Gregory Farnum [g...@gregs42.com]
Sent: Friday, November 7, 2014 22:42
To: Jasper Siero
CC: ceph-users; John Spray
Subject: Re: [ceph-users] mds isn't working anymore after osd's running full

On Thu, Nov 6, 2014 at 11:49 AM, John Spray john.sp...@redhat.com wrote:
 This is still an issue on master, so a fix will be coming soon.
 Follow the ticket for updates:
 http://tracker.ceph.com/issues/10025

 Thanks for finding the bug!

John is off for a vacation, but he pushed a branch wip-10025-firefly;
if you install that (similar address to the other one) it should
work for you. You'll need to reset and undump again (I presume you
still have the journal-as-a-file). I'll be merging them into the
stable branches pretty shortly as well.
-Greg


 John

 On Thu, Nov 6, 2014 at 6:21 PM, John Spray john.sp...@redhat.com wrote:
 Jasper,

 Thanks for this -- I've reproduced this issue in a development
 environment.  We'll see if this is also an issue on giant, and
 backport a fix if appropriate.  I'll update this thread soon.

 Cheers,
 John

 On Mon, Nov 3, 2014 at 8:49 AM, Jasper Siero
 jasper.si...@target-holding.nl wrote:
 Hello Greg,

 I saw that the site behind the previous link to the logs uses a very short 
 expiry time, so I uploaded the logs to another one:

 http://www.mediafire.com/download/gikiy7cqs42cllt/ceph-mds.th1-mon001.log.tar.gz

 Thanks,

 Jasper

 
 From: gregory.far...@inktank.com [gregory.far...@inktank.com] on behalf of Gregory 
 Farnum [gfar...@redhat.com]
 Sent: Thursday, October 30, 2014 1:03
 To: Jasper Siero
 CC: John Spray; ceph-users
 Subject: Re: [ceph-users] mds isn't working anymore after osd's running 
 full

 On Wed, Oct 29, 2014 at 7:51 AM, Jasper Siero
 jasper.si...@target-holding.nl wrote:
 Hello Greg,

 I added the debug options which you mentioned and started the process 
 again:

 [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file 
 /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph 
 --reset-journal 0
 old journal was 9483323613~134233517
 new journal start will be 9621733376 (4176246 bytes past old end)
 writing journal head
 writing EResetJournal entry
 done
 [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 -c 
 /etc/ceph/ceph.conf --cluster ceph --undump-journal 0 
 journaldumptgho-mon001
 undump journaldumptgho-mon001
 start 9483323613 len 134213311
 writing header 200.
  writing 9483323613~1048576
  writing 9484372189~1048576
  writing 9485420765~1048576
  writing 9486469341~1048576
  writing 9487517917~1048576
  writing 9488566493~1048576
  writing 9489615069~1048576
  writing 9490663645~1048576
  writing 9491712221~1048576
  writing 9492760797~1048576
  writing 9493809373~1048576
  writing 9494857949~1048576
  writing 9495906525~1048576
  writing 9496955101~1048576
  writing 9498003677~1048576
  writing 9499052253~1048576
  writing 9500100829~1048576
  writing 9501149405~1048576
  writing 9502197981~1048576
  writing 9503246557~1048576
  writing 9504295133~1048576
  writing 9505343709~1048576
  writing 9506392285~1048576
  writing 9507440861~1048576
  writing 9508489437~1048576
  writing 9509538013~1048576
  writing 9510586589~1048576
  writing 9511635165~1048576
  writing 9512683741~1048576
  writing 9513732317~1048576
  writing 9514780893~1048576
  writing 9515829469~1048576
  writing 9516878045~1048576
  writing 9517926621~1048576
  writing 9518975197~1048576
  writing 9520023773~1048576
  writing 9521072349~1048576
  writing 9522120925~1048576
  writing 9523169501~1048576
  writing 9524218077~1048576
  writing 9525266653~1048576
  writing 9526315229~1048576
  writing 9527363805~1048576
  writing 9528412381~1048576
  writing 9529460957~1048576
  writing 9530509533~1048576
  writing 9531558109~1048576
  writing 9532606685~1048576
  writing 9533655261~1048576
  writing 9534703837~1048576
  writing 9535752413~1048576
  writing 9536800989~1048576
  writing 9537849565~1048576
  writing 9538898141~1048576
  writing 9539946717~1048576
  writing 9540995293~1048576
  writing 9542043869~1048576
  writing 9543092445~1048576
  writing 9544141021~1048576
  writing 9545189597~1048576
  writing 9546238173~1048576
  writing 9547286749~1048576
  writing 9548335325~1048576
  writing 9549383901~1048576
  writing 9550432477~1048576
  writing 9551481053~1048576
  writing 9552529629~1048576
  writing 9553578205~1048576
  writing 9554626781~1048576
  writing 9555675357~1048576
  writing 9556723933~1048576
  writing 9557772509~1048576
  writing 9558821085~1048576
  writing 9559869661~1048576
  writing 9560918237~1048576
  writing 9561966813~1048576
  writing 9563015389~1048576
  writing 9564063965~1048576
  writing 9565112541~1048576
  writing 9566161117~1048576
  

Re: [ceph-users] Cache Tier Statistics

2014-11-10 Thread Nick Fisk
Hi Jean-Charles,

Thanks for your response, I have found the following using ceph daemon 
osd.{id} perf dump.

  tier_promote: 1425,
  tier_flush: 0,
  tier_flush_fail: 0,
  tier_try_flush: 216,
  tier_try_flush_fail: 21,
  tier_evict: 1413,
  tier_whiteout: 201,
  tier_dirty: 671,
  tier_clean: 216,
  tier_delay: 16,

I'm guessing the tier_promote should increase every time there is a cache miss? 
If this is the case then I simply need to add this value from every OSD and 
divide by total reads to work out the percentage hit rate.

Nick
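
If that guess holds, a rough way to sum tier_promote across the OSDs on one host
(just a sketch: it assumes jq is installed and that the tier_* counters sit under
the "osd" section of the perf dump JSON, which may differ by release):

  # run on each OSD host; sums tier_promote over the local OSD admin sockets
  for sock in /var/run/ceph/ceph-osd.*.asok; do
    ceph --admin-daemon "$sock" perf dump | jq '.osd.tier_promote'
  done | paste -sd+ - | bc

Note that promotes divided by total reads is really the miss ratio; the hit ratio
would be one minus that.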

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Jean-Charles Lopez
Sent: 09 November 2014 01:43
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cache Tier Statistics

Hi Nick

If my brain doesn't fail me you can try
ceph daemon osd.{id} perf dump
ceph report (not 100% sure if the cache stats are in there)

Rgds
JC


On Saturday, November 8, 2014, Nick Fisk n...@fisk.me.uk wrote:
Hi,
 
Does anyone know if there are any statistics available specific to the cache tier 
functionality? I’m thinking along the lines of cache hit ratios. Or should I be 
pulling out the read statistics for the backing+cache pools and assuming that if a 
read happens from the backing pool it was a miss, and then calculating it from 
that?
 
Thanks,
Nick




-- 
Sent while moving




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Re: half performance with keyvalue backend in 0.87

2014-11-10 Thread Haomai Wang
Yep, be patient. Need more time

On Mon, Nov 10, 2014 at 9:33 AM, 廖建锋 de...@f-club.cn wrote:
 Haomai Wang,
Do you have progress on this performance issue?



 From: Haomai Wang
 Sent: 2014-10-31 10:05
 To: 廖建锋
 Cc: ceph-users; ceph-users
 Subject: Re: Re: [ceph-users] Re: half performance with keyvalue backend in 0.87

 ok. I will explore it.

 On Fri, Oct 31, 2014 at 10:03 AM, 廖建锋 de...@f-club.cn wrote:
 I am not sure if it is seq or random; I just use rsync to copy millions of small
 pic files from our PC servers to the Ceph cluster.

 From: Haomai Wang
 Sent: 2014-10-31 09:59
 To: 廖建锋
 Cc: ceph-users; ceph-users
 Subject: Re: Re: [ceph-users] Re: half performance with keyvalue backend in 0.87

 Thanks. Recently I have mainly focused on rbd performance for it (random small
 writes).

 I want to know about your test situation. Is it sequential write?

 On Fri, Oct 31, 2014 at 9:48 AM, 廖建锋 de...@f-club.cn wrote:
 What I can tell is:
   in 0.87, the OSDs are writing under 10 MB/s, but IO utilization is about 95%
   in 0.80.6, the OSDs are writing about 20 MB/s, but IO utilization is about 30%

 iostat  -mx 2 with 0.87

 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
 sdb 0.00 43.00 9.00 85.50 0.95 1.18 46.14 1.36 14.49 10.01 94.55
 sdc 0.00 37.50 6.00 99.00 0.62 10.01 207.31 2.24 21.31 9.33 97.95
 sda 0.00 3.50 0.00 1.00 0.00 0.02 36.00 0.02 17.50 17.50 1.75

 avg-cpu: %user %nice %system %iowait %steal %idle
 3.16 0.00 1.01 17.45 0.00 78.38

 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
 sdb 0.00 36.50 0.00 47.50 0.00 1.09 47.07 0.82 17.17 16.71 79.35
 sdc 0.00 25.00 15.00 77.50 1.26 0.65 42.34 1.73 18.72 10.70 99.00
 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

 From: Haomai Wang
 Sent: 2014-10-31 09:40
 To: 廖建锋
 Cc: ceph-users; ceph-users
 Subject: Re: [ceph-users] Re: half performance with keyvalue backend in 0.87

 Yes, there was a persistence problem in 0.80.6 and we fixed it in Giant.
 But in Giant, other performance optimizations have been applied. Could
 you tell us more about your tests?

 On Fri, Oct 31, 2014 at 8:27 AM, 廖建锋 de...@f-club.cn wrote:
 Also found another problem: the ceph osd directory has millions of small
 files, which will cause performance issues.

 1008 = # pwd
 /var/lib/ceph/osd/ceph-8/current

 1007 = # ls |wc -l
 21451

 From: ceph-users
 Sent: 2014-10-31 08:23
 To: ceph-users
 Subject: [ceph-users] half performance with keyvalue backend in 0.87
 Dear Ceph,
   I used the keyvalue backend in 0.80.6 and 0.80.7; the average speed with
 rsync on millions of small files is 10 MB/second.
 When I upgraded to 0.87 (Giant), the speed slowed down to 5 MB/second. I
 don't know why; is there any tuning option for this?
 Will the superblock cause this performance slowdown?




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




 --
 Best Regards,

 Wheat



 --
 Best Regards,

 Wheat



 --
 Best Regards,

 Wheat



-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on RHEL 7 using teuthology

2014-11-10 Thread Sarang G
Yes. I see similar package dependency errors when installing manually.

~Pras

On Mon, Nov 10, 2014 at 3:00 PM, Loic Dachary l...@dachary.org wrote:

 Hi,

 It looks like there are broken packages on the target machine even before
 teuthology tries to install new packages. Do you see similar errors when
 trying to install a package manually ?

 Cheers

 On 10/11/2014 09:59, Sarang G wrote:
  Hi,
 
  1. Created an instance on AWS using AMI: ami-99bef1a9
  2. All the settings related to root access to sudo, passwordless ssh was
 configured.
  3. started teuthology with basic yaml configuration:
 
  check-locks: false
  os_type: rhel
  os_version: '7.0'
  roles:
  - - mon.a
- mon.b
- mon.c
- osd.0
- osd.1
- osd.2
- client.0
  suite_path: /home/pras/ceph-qa-suite
  targets:
user@hostname: ssh-rsa ssh-key
 
  tasks:
  - install: null
  - ceph: null
  - interactive: null
 
  Teuthology log attached.
 
  ~Pras
 
 
 
 
  On Mon, Nov 10, 2014 at 1:03 PM, Loic Dachary l...@dachary.org wrote:
 
  [moving the thread to ceph-devel]
 
  Hi,
 
  It would be useful if you could upload the full log somewhere and
 provide details about what you did on the machine prior to seeing this
 error. That would help figure out what is wrong.
 
  Cheers
 
  On 10/11/2014 07:20, Sarang G wrote:
   Hi,
  
   I am trying to install ceph on RHEL 7 AWS instance using
 teuthology.
  
   I am facing some dependency issues:
  
   2014-11-09 22:10:57,269.269 DEBUG:teuthology.orchestra.run:Running
 [10.15.17.91]: 'sudo yum install ceph-radosgw-0.87 -y'
   2014-11-09 22:10:58,223.223
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: Loaded plugins: amazon-id,
 rhui-lb
   2014-11-09 22:10:59,008.008
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: Resolving Dependencies
   2014-11-09 22:10:59,010.010
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Running transaction
 check
   2014-11-09 22:10:59,010.010
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: --- Package
 ceph-radosgw.x86_64 1:0.87-665.gb8ec7d7.el7 will be installed
   2014-11-09 22:10:59,028.028
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency:
 librados2 = 1:0.87-665.gb8ec7d7.el7 for package:
 1:ceph-radosgw-0.87-665.gb8ec7d7.el7.x86_64
   2014-11-09 22:10:59,086.086
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency:
 ceph-common = 1:0.87-665.gb8ec7d7.el7 for package:
 1:ceph-radosgw-0.87-665.gb8ec7d7.el7.x86_64
   2014-11-09 22:10:59,088.088
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency:
 librados.so.2()(64bit) for package:
 1:ceph-radosgw-0.87-665.gb8ec7d7.el7.x86_64
   2014-11-09 22:10:59,089.089
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency:
 libfcgi.so.0()(64bit) for package:
 1:ceph-radosgw-0.87-665.gb8ec7d7.el7.x86_64
   2014-11-09 22:10:59,089.089
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Running transaction
 check
   2014-11-09 22:10:59,090.090
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: --- Package
 ceph-common.x86_64 1:0.87-665.gb8ec7d7.el7 will be installed
   2014-11-09 22:10:59,103.103
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency:
 python-ceph = 1:0.87-665.gb8ec7d7.el7 for package:
 1:ceph-common-0.87-665.gb8ec7d7.el7.x86_64
   2014-11-09 22:10:59,255.255
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency:
 librbd1 = 1:0.87-665.gb8ec7d7.el7 for package:
 1:ceph-common-0.87-665.gb8ec7d7.el7.x86_64
   2014-11-09 22:10:59,255.255
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency:
 libtcmalloc.so.4()(64bit) for package:
 1:ceph-common-0.87-665.gb8ec7d7.el7.x86_64
   2014-11-09 22:10:59,255.255
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency:
 librbd.so.1()(64bit) for package: 1:ceph-common-0.87-665.gb8ec7d7.el7.x86_64
   2014-11-09 22:10:59,255.255
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency:
 libboost_thread-mt.so.1.53.0()(64bit) for package:
 1:ceph-common-0.87-665.gb8ec7d7.el7.x86_64
   2014-11-09 22:10:59,256.256
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency:
 libboost_system-mt.so.1.53.0()(64bit) for package:
 1:ceph-common-0.87-665.gb8ec7d7.el7.x86_64
   2014-11-09 22:10:59,256.256
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: --- Package
 ceph-radosgw.x86_64 1:0.87-665.gb8ec7d7.el7 will be installed
   2014-11-09 22:10:59,256.256
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency:
 libfcgi.so.0()(64bit) for package:
 1:ceph-radosgw-0.87-665.gb8ec7d7.el7.x86_64
   2014-11-09 22:10:59,256.256
 INFO:teuthology.orchestra.run.out:[10.15.17.91]: --- Package
 librados2.x86_64 1:0.87-665.gb8ec7d7.el7 will be installed
   2014-11-09 22:10:59,256.256
 

Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem

2014-11-10 Thread Chad Seys
Hi Craig and list,

   If you create a real osd.20, you might want to leave it OUT until you
   get things healthy again.

I created a real osd.20 (and it turns out I needed an osd.21 also).  

ceph pg x.xx query no longer lists down osds for probing:
down_osds_we_would_probe: [],

But I cannot find the magic command line which will remove these incomplete 
PGs.

Anyone know how to remove incomplete PGs ?

Thanks!
Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Typical 10GbE latency

2014-11-10 Thread Wido den Hollander
On 08-11-14 02:42, Gary M wrote:
 Wido,
 
 Take the switch out of the path between nodes and remeasure.. ICMP-echo
 requests are very low priority traffic for switches and network stacks. 
 

I tried with a direct TwinAx and fiber cable. No difference.

 If you really want to know, place a network analyzer between the nodes
 to measure the request packet to response packet latency.. The ICMP
 traffic to the ping application is not accurate in the sub-millisecond
 range. And should only be used as a rough estimate.
 

True, I fully agree with you. But why is everybody showing a lower
latency here? My latencies are about 40% higher than what I see in this
setup and other setups.

 You also may want to install the high resolution timer patch, sometimes
 called HRT, to the kernel which may give you different results. 
 
 ICMP traffic takes a different path than the TCP traffic and should not
 be considered an indicator of defect. 
 

Yes, I'm aware. But it still doesn't explain to me why the latency on other
systems, which are in production, is lower than on this idle system.

 I believe the ping app calls the sendto system call (sorry, it's been a
 while since I last looked). System calls can take between 0.1 us and 0.2 us
 each. However, the ping application makes several of these calls and
 waits for a signal from the kernel. The wait for a signal means the ping
 application must wait to be rescheduled to report the time. Rescheduling
 will depend on a lot of other factors in the OS, e.g. timers, card
 interrupts, and other tasks with higher priorities. Reporting the time must
 add a few more system calls for this to happen. As the ping application
 loops to post the next ping request, it again requires a few system
 calls, which may cause a task switch while in each system call.
 
 For the above factors, the ping application is not a good representation
 of network performance due to factors in the application and network
 traffic shaping performed at the switch and the tcp stacks. 
 

I think that netperf is probably a better tool, but that also does TCP
latencies.

I want the real IP latency, so I assumed that ICMP would be the most
simple one.
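
For what it's worth, a minimal TCP round-trip test with netperf looks something
like this (a sketch; 10.0.0.2 is just a placeholder for the other host):

  $ netperf -H 10.0.0.2 -t TCP_RR

TCP_RR reports request/response transactions per second for a 1-byte payload, so
the mean round-trip latency is roughly 1 / (transactions per second).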

The other setups I have access to are in production and do not have any
special tuning, yet their latency is still lower than on this new
deployment.

That's what gets me confused.

Wido

 cheers,
 gary
 
 
 On Fri, Nov 7, 2014 at 4:32 PM, Łukasz Jagiełło
 jagiello.luk...@gmail.com wrote:
 
 Hi,
 
 rtt min/avg/max/mdev = 0.070/0.177/0.272/0.049 ms
 
 04:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit
 SFI/SFP+ Network Connection (rev 01)
 
 at both hosts and Arista 7050S-64 between.
 
 Both hosts were part of active ceph cluster.
 
 
 On Thu, Nov 6, 2014 at 5:18 AM, Wido den Hollander w...@42on.com wrote:
 
 Hello,
 
 While working at a customer I've ran into a 10GbE latency which
 seems
 high to me.
 
 I have access to a couple of Ceph cluster and I ran a simple
 ping test:
 
 $ ping -s 8192 -c 100 -n ip
 
 Two results I got:
 
 rtt min/avg/max/mdev = 0.080/0.131/0.235/0.039 ms
 rtt min/avg/max/mdev = 0.128/0.168/0.226/0.023 ms
 
 Both these environment are running with Intel 82599ES 10Gbit
 cards in
 LACP. One with Extreme Networks switches, the other with Arista.
 
 Now, on a environment with Cisco Nexus 3000 and Nexus 7000
 switches I'm
 seeing:
 
 rtt min/avg/max/mdev = 0.160/0.244/0.298/0.029 ms
 
 As you can see, the Cisco Nexus network has high latency
 compared to the
 other setup.
 
 You would say the switches are to blame, but we also tried with
 a direct
 TwinAx connection, but that didn't help.
 
 This setup also uses the Intel 82599ES cards, so the cards don't
 seem to
 be the problem.
 
 The MTU is set to 9000 on all these networks and cards.
 
 I was wondering, others with a Ceph cluster running on 10GbE,
 could you
 perform a simple network latency test like this? I'd like to
 compare the
 results.
 
 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant
 
 Phone: +31 (0)20 700 9902
 Skype: contact42on
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 
 -- 
 Łukasz Jagiełło
 lukaszatjagiellodotorg
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 
 

Re: [ceph-users] mds isn't working anymore after osd's running full

2014-11-10 Thread Jasper Siero
Hello John and Greg,

I used the new patch and now the undump succeeded and the mds is working fine 
and I can mount cephfs again!

I still have one placement group which keeps deep scrubbing even after 
restarting the ceph cluster:
dumped all in format plain
3.300   0   0   0   0   0   0   active+clean+scrubbing+deep 2014-11-10 17:21:15.866965  0'0  2414:418  [1,9]  1  [1,9]  1  631'3463  2014-08-21 15:14:45.430926  602'3131  2014-08-18 15:14:37.494913

Is there a way to solve this?

Kind regards,

Jasper

From: Gregory Farnum [g...@gregs42.com]
Sent: Friday, November 7, 2014 22:42
To: Jasper Siero
CC: ceph-users; John Spray
Subject: Re: [ceph-users] mds isn't working anymore after osd's running full

On Thu, Nov 6, 2014 at 11:49 AM, John Spray john.sp...@redhat.com wrote:
 This is still an issue on master, so a fix will be coming soon.
 Follow the ticket for updates:
 http://tracker.ceph.com/issues/10025

 Thanks for finding the bug!

John is off for a vacation, but he pushed a branch wip-10025-firefly;
if you install that (similar address to the other one) it should
work for you. You'll need to reset and undump again (I presume you
still have the journal-as-a-file). I'll be merging them into the
stable branches pretty shortly as well.
-Greg


 John

 On Thu, Nov 6, 2014 at 6:21 PM, John Spray john.sp...@redhat.com wrote:
 Jasper,

 Thanks for this -- I've reproduced this issue in a development
 environment.  We'll see if this is also an issue on giant, and
 backport a fix if appropriate.  I'll update this thread soon.

 Cheers,
 John

 On Mon, Nov 3, 2014 at 8:49 AM, Jasper Siero
 jasper.si...@target-holding.nl wrote:
 Hello Greg,

 I saw that the site behind the previous link to the logs uses a very short 
 expiry time, so I uploaded the logs to another one:

 http://www.mediafire.com/download/gikiy7cqs42cllt/ceph-mds.th1-mon001.log.tar.gz

 Thanks,

 Jasper

 
 From: gregory.far...@inktank.com [gregory.far...@inktank.com] on behalf of Gregory 
 Farnum [gfar...@redhat.com]
 Sent: Thursday, October 30, 2014 1:03
 To: Jasper Siero
 CC: John Spray; ceph-users
 Subject: Re: [ceph-users] mds isn't working anymore after osd's running 
 full

 On Wed, Oct 29, 2014 at 7:51 AM, Jasper Siero
 jasper.si...@target-holding.nl wrote:
 Hello Greg,

 I added the debug options which you mentioned and started the process 
 again:

 [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file 
 /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph 
 --reset-journal 0
 old journal was 9483323613~134233517
 new journal start will be 9621733376 (4176246 bytes past old end)
 writing journal head
 writing EResetJournal entry
 done
 [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 -c 
 /etc/ceph/ceph.conf --cluster ceph --undump-journal 0 
 journaldumptgho-mon001
 undump journaldumptgho-mon001
 start 9483323613 len 134213311
 writing header 200.
  writing 9483323613~1048576
  writing 9484372189~1048576
  writing 9485420765~1048576
  writing 9486469341~1048576
  writing 9487517917~1048576
  writing 9488566493~1048576
  writing 9489615069~1048576
  writing 9490663645~1048576
  writing 9491712221~1048576
  writing 9492760797~1048576
  writing 9493809373~1048576
  writing 9494857949~1048576
  writing 9495906525~1048576
  writing 9496955101~1048576
  writing 9498003677~1048576
  writing 9499052253~1048576
  writing 9500100829~1048576
  writing 9501149405~1048576
  writing 9502197981~1048576
  writing 9503246557~1048576
  writing 9504295133~1048576
  writing 9505343709~1048576
  writing 9506392285~1048576
  writing 9507440861~1048576
  writing 9508489437~1048576
  writing 9509538013~1048576
  writing 9510586589~1048576
  writing 9511635165~1048576
  writing 9512683741~1048576
  writing 9513732317~1048576
  writing 9514780893~1048576
  writing 9515829469~1048576
  writing 9516878045~1048576
  writing 9517926621~1048576
  writing 9518975197~1048576
  writing 9520023773~1048576
  writing 9521072349~1048576
  writing 9522120925~1048576
  writing 9523169501~1048576
  writing 9524218077~1048576
  writing 9525266653~1048576
  writing 9526315229~1048576
  writing 9527363805~1048576
  writing 9528412381~1048576
  writing 9529460957~1048576
  writing 9530509533~1048576
  writing 9531558109~1048576
  writing 9532606685~1048576
  writing 9533655261~1048576
  writing 9534703837~1048576
  writing 9535752413~1048576
  writing 9536800989~1048576
  writing 9537849565~1048576
  writing 9538898141~1048576
  writing 9539946717~1048576
  writing 9540995293~1048576
  writing 9542043869~1048576
  writing 9543092445~1048576
  writing 9544141021~1048576
  writing 9545189597~1048576
  writing 9546238173~1048576
  writing 9547286749~1048576
  writing 9548335325~1048576
  writing 9549383901~1048576
  writing 9550432477~1048576
  writing 9551481053~1048576
  writing 

Re: [ceph-users] Installing CephFs via puppet

2014-11-10 Thread Francois Charlier
- Original Message -
 From: JIten Shah jshah2...@me.com
 To: Jean-Charles LOPEZ jc.lo...@inktank.com
 Cc: ceph-users ceph-us...@ceph.com
 Sent: Friday, November 7, 2014 7:18:10 PM
 Subject: Re: [ceph-users] Installing CephFs via puppet
 
 Thanks JC and Loic but we HAVE to use puppet.  That’s how all of our
 configuration and deployment stuff works and I can’t sway away from it.
 
 Is https://github.com/enovance/puppet-ceph a good resource for cephFS? Has
 anyone used it successfully?
 

Hi,

This module currently doesn't provide any means to deploy CephFS.
-- 
François Charlier   Software Engineer
// eNovance SAS  http://www.enovance.com/
// ✉ francois.charl...@enovance.com  ☎ +33 1 49 70 99 81
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds isn't working anymore after osd's running full

2014-11-10 Thread Gregory Farnum
It's supposed to do that; deep scrubbing is an ongoing
consistency-check mechanism. If you really want to disable it you can
set an osdmap flag to prevent it, but you'll have to check the docs
for exactly what that is as I can't recall.
Glad things are working for you; sorry it took so long!
-Greg
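
For reference, the osdmap flag mentioned above is most likely nodeep-scrub (worth
confirming against the docs for your release):

  ceph osd set nodeep-scrub    # stop scheduling new deep scrubs
  ceph osd unset nodeep-scrub  # re-enable them afterwards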

On Mon, Nov 10, 2014 at 8:49 AM, Jasper Siero
jasper.si...@target-holding.nl wrote:
 Hello John and Greg,

 I used the new patch and now the undump succeeded and the mds is working fine 
 and I can mount cephfs again!

 I still have one placement group which keeps deep scrubbing even after 
 restarting the ceph cluster:
 dumped all in format plain
 3.300   0   0   0   0   0   0   active+clean+scrubbing+deep 2014-11-10 17:21:15.866965  0'0  2414:418  [1,9]  1  [1,9]  1  631'3463  2014-08-21 15:14:45.430926  602'3131  2014-08-18 15:14:37.494913

 Is there a way to solve this?

 Kind regards,

 Jasper
 
 From: Gregory Farnum [g...@gregs42.com]
 Sent: Friday, November 7, 2014 22:42
 To: Jasper Siero
 CC: ceph-users; John Spray
 Subject: Re: [ceph-users] mds isn't working anymore after osd's running full

 On Thu, Nov 6, 2014 at 11:49 AM, John Spray john.sp...@redhat.com wrote:
 This is still an issue on master, so a fix will be coming soon.
 Follow the ticket for updates:
 http://tracker.ceph.com/issues/10025

 Thanks for finding the bug!

 John is off for a vacation, but he pushed a branch wip-10025-firefly;
 if you install that (similar address to the other one) it should
 work for you. You'll need to reset and undump again (I presume you
 still have the journal-as-a-file). I'll be merging them into the
 stable branches pretty shortly as well.
 -Greg


 John

 On Thu, Nov 6, 2014 at 6:21 PM, John Spray john.sp...@redhat.com wrote:
 Jasper,

 Thanks for this -- I've reproduced this issue in a development
 environment.  We'll see if this is also an issue on giant, and
 backport a fix if appropriate.  I'll update this thread soon.

 Cheers,
 John

 On Mon, Nov 3, 2014 at 8:49 AM, Jasper Siero
 jasper.si...@target-holding.nl wrote:
 Hello Greg,

 I saw that the site behind the previous link to the logs uses a very short 
 expiry time, so I uploaded the logs to another one:

 http://www.mediafire.com/download/gikiy7cqs42cllt/ceph-mds.th1-mon001.log.tar.gz

 Thanks,

 Jasper

 
 From: gregory.far...@inktank.com [gregory.far...@inktank.com] on behalf of 
 Gregory Farnum [gfar...@redhat.com]
 Sent: Thursday, October 30, 2014 1:03
 To: Jasper Siero
 CC: John Spray; ceph-users
 Subject: Re: [ceph-users] mds isn't working anymore after osd's running 
 full

 On Wed, Oct 29, 2014 at 7:51 AM, Jasper Siero
 jasper.si...@target-holding.nl wrote:
 Hello Greg,

 I added the debug options which you mentioned and started the process 
 again:

 [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file 
 /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph 
 --reset-journal 0
 old journal was 9483323613~134233517
 new journal start will be 9621733376 (4176246 bytes past old end)
 writing journal head
 writing EResetJournal entry
 done
 [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 -c 
 /etc/ceph/ceph.conf --cluster ceph --undump-journal 0 
 journaldumptgho-mon001
 undump journaldumptgho-mon001
 start 9483323613 len 134213311
 writing header 200.
  writing 9483323613~1048576
  writing 9484372189~1048576
  writing 9485420765~1048576
  writing 9486469341~1048576
  writing 9487517917~1048576
  writing 9488566493~1048576
  writing 9489615069~1048576
  writing 9490663645~1048576
  writing 9491712221~1048576
  writing 9492760797~1048576
  writing 9493809373~1048576
  writing 9494857949~1048576
  writing 9495906525~1048576
  writing 9496955101~1048576
  writing 9498003677~1048576
  writing 9499052253~1048576
  writing 9500100829~1048576
  writing 9501149405~1048576
  writing 9502197981~1048576
  writing 9503246557~1048576
  writing 9504295133~1048576
  writing 9505343709~1048576
  writing 9506392285~1048576
  writing 9507440861~1048576
  writing 9508489437~1048576
  writing 9509538013~1048576
  writing 9510586589~1048576
  writing 9511635165~1048576
  writing 9512683741~1048576
  writing 9513732317~1048576
  writing 9514780893~1048576
  writing 9515829469~1048576
  writing 9516878045~1048576
  writing 9517926621~1048576
  writing 9518975197~1048576
  writing 9520023773~1048576
  writing 9521072349~1048576
  writing 9522120925~1048576
  writing 9523169501~1048576
  writing 9524218077~1048576
  writing 9525266653~1048576
  writing 9526315229~1048576
  writing 9527363805~1048576
  writing 9528412381~1048576
  writing 9529460957~1048576
  writing 9530509533~1048576
  writing 9531558109~1048576
  writing 9532606685~1048576
  writing 9533655261~1048576
  writing 9534703837~1048576
  writing 9535752413~1048576
  writing 9536800989~1048576
  writing 

Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem

2014-11-10 Thread Craig Lewis
If all of your PGs now have an empty down_osds_we_would_probe, I'd run
through this discussion again.  The commands to tell Ceph to give up on
lost data should have an effect now.

That's my experience anyway.  Nothing progressed until I took care of
down_osds_we_would_probe.
After that was empty, I was able to repair.  It wasn't immediate though.
It still took ~24 hours, and a few OSD restarts, for the cluster to get
itself healthy.  You might try sequentially restarting OSDs.  It shouldn't
be necessary, but it shouldn't make anything worse.
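
A rough sketch of doing that one OSD at a time (assuming the sysvinit-style
service script of that era; adjust for your init system, and lengthen the pause
if peering is still running):

  # run on each OSD host in turn
  for id in $(ls /var/lib/ceph/osd | sed 's/^ceph-//'); do
    service ceph restart osd.$id   # restart one local OSD
    sleep 60                       # give it time to re-peer before the next one
  done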



On Mon, Nov 10, 2014 at 7:17 AM, Chad Seys cws...@physics.wisc.edu wrote:

 Hi Craig and list,

If you create a real osd.20, you might want to leave it OUT until you
get things healthy again.

 I created a real osd.20 (and it turns out I needed an osd.21 also).

 ceph pg x.xx query no longer lists down osds for probing:
 down_osds_we_would_probe: [],

 But I cannot find the magic command line which will remove these incomplete
 PGs.

 Anyone know how to remove incomplete PGs ?

 Thanks!
 Chad.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pg's stuck in inactive/unclean state + Association from PG-OSD does not seem to be happening.

2014-11-10 Thread Prashanth Nednoor
Folks,

Now we are running into an issue where the PGs (192) are stuck in the creating 
state forever.
I have experimented with various PG settings (osd_pool_default_pg_num from 50 to 
400) for replicas and defaults, and it doesn't seem to help so far.
Just to give you a brief overview, I have 8 osd's.
I see "create_pg is pending" messages in the ceph monitor logs.
I have attached the following logs in the zip file.
1) crush map(crush.map)
2) ceph osd tree, (OSD_TREE.txt OSD's  1,2,3,4 belong to host octeon and  OSD's 
0,5,6,7 belong to host octeon1).
3) ceph pg dump, health details, etc. (dump_pgs, health_detail)
4) Attached the ceph.conf
5) ceph osd lspools.
0 data,1 metadata,2 rbd,

Here is the dump for ceph -w before any osd's were created:
ceph -w
cluster 3eda0199-93a9-428b-8209-caeff84d3d3f
 health HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds
 monmap e1: 1 mons at {essperf13=209.243.160.45:6789/0}, election epoch 1, 
quorum 0 essperf13
 osdmap e205: 0 osds: 0 up, 0 in
  pgmap v928: 192 pgs, 3 pools, 0 bytes data, 0 objects
0 kB used, 0 kB / 0 kB avail
 192 creating

2014-11-05 23:26:46.555348 mon.0 [INF] pgmap v928: 192 pgs: 192 creating; 0 
bytes data, 0 kB used, 0 kB / 0 kB avail

Here is the dump for ceph -w after  8 osd's were created:
ceph -w
cluster 3eda0199-93a9-428b-8209-caeff84d3d3f
 health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean
 monmap e1: 1 mons at {essperf13=209.243.160.45:6789/0}, election epoch 1, 
quorum 0 essperf13
 osdmap e213: 8 osds: 8 up, 8 in
  pgmap v958: 192 pgs, 3 pools, 0 bytes data, 0 objects
328 MB used, 14856 GB / 14856 GB avail
 192 creating

2014-11-05 23:46:25.461143 mon.0 [INF] pgmap v958: 192 pgs: 192 creating; 0 
bytes data, 328 MB used, 14856 GB / 14856 GB avail

Any pointers to resolve this issue will be helpful.

Thanks
Prashanth



-Original Message-
From: Prashanth Nednoor
Sent: Tuesday, October 28, 2014 9:26 PM
To: 'Sage Weil'
Cc: Philip Kufeldt; ceph-de...@vger.kernel.org
Subject: RE: cephx auth issues:Having issues trying to get the OSD up on a 
MIPS64, when the OSD tries to communicate with the monitor!!!

Sage,

As requested I set the debug setting in ceph.conf on both the sides.
Here are the logs for  the  OSD and  MONITOR attached.
1) OSD : IPADDRESS: 209.243.157.187. Logfile attached is: Ceph-0.log
2) MONITOR: IP ADDRESS: 209.243.160.45, Logfile attached is: 
Ceph-mon.essperf13.log  
 
Please Note that AUTHENTICATION IS DISABLED IN THE /etc/ceph/ceph.conf files on 
both OSD and monitor.
In addition to this, on the OSD side I bypassed the part of the authentication code 
that was causing trouble (monc->authenticate) in the osd_init function call. I hope 
this is ok.
Good news is my osd daemon is up now on the MIPS side, finally, but for some 
reason MONITOR is still not detecting the OSD.

It seems from the ceph mon log, it knows the  OSD is at 187 and it does 
exchange some information.

Thanks for your prompt response and help.

Thanks
Prashanth

-Original Message-
From: Sage Weil [mailto:s...@newdream.net]
Sent: Tuesday, October 28, 2014 4:59 PM
To: Prashanth Nednoor
Cc: Philip Kufeldt; ceph-de...@vger.kernel.org
Subject: Re: cephx auth issues:Having issues trying to get the OSD up on a 
MIPS64, when the OSD tries to communicate with the monitor!!!

Hi,

On Tue, 28 Oct 2014, Prashanth Nednoor wrote:
 Folks,
 
 I am trying to get the osd up and having an issue. OSD does exchange some 
 messages with the MONITOR before this error.
 Seems like an issue with authentication in my set up with MIPS based OSD and 
 Intel XEON MONITORS. I have attached the logs.
 The OSD(209.243.157.187) sends some request to MONITOR (209.243.160.45).
 I see this message No session security set, followed by the below message.
 The reply is coming back as auth_reply(proto 2 -1 (1) Operation not permitted.
 
 Is there an ENDIAN issue here between the MIPS-based OSD (BIG ENDIAN) and INTEL 
 XEONS (LITTLE ENDIAN)? My CEPH MONITORS are INTEL XEONS.
 
 I made sure the keyrings are all consistent. Here are the keys on OSD and 
 MONITOR.
 
 I tried disabling authentication by setting the following 
 auth_service_required = none, auth_client_required = none and 
 auth_cluster_required = none.
 Looks there was some issue with this in osd_init code, where it seems like 
 AUTHENTICATION IS MANDATORY.
 
 HERE IS THE INFORMATION ON MY KEYS ON OSD AND MONITOR.
 ON THE OSD:
 more /etc/ceph/ceph.client.admin.keyring
 [osd.0]
 key = AQCddYJv4JkxIhAApeqP7Ahp+uUXYrgmgQt+LA==
 [client.admin]
 key = AQA1jixUQAaWABAA1tAjhIbrmOCIqNAkeNVulQ==
 
 more /var/lib/ceph/bootstrap-osd/ceph.keyring
 [client.bootstrap-osd]
 key = AQA1jixUwGjoGxAASUUlYC2rGfH7Zl4rCfCylA==
 
 ON THE MONITOR:
 more /etc/ceph/ceph.client.admin.keyring
 [client.admin]
 key = AQA1jixUQAaWABAA1tAjhIbrmOCIqNAkeNVulQ==
 
 more /var/lib/ceph/bootstrap-osd/ceph.keyring
 

Re: [ceph-users] PG inconsistency

2014-11-10 Thread Craig Lewis
For #1, it depends what you mean by fast.  I wouldn't worry about it taking
15 minutes.

If you mark the old OSD out, ceph will start remapping data immediately,
including a bunch of PGs on unrelated OSDs.  Once you replace the disk, and
put the same OSDID back in the same host, the CRUSH map will be back to
what it was before you started.  All of those remaps on unrelated OSDs will
reverse.  They'll complete fairly quickly, because they only have to
backfill the data that was written during the remap.


I prefer #1.  ceph pg repair will just overwrite the replicas with whatever
the primary OSD has, which may copy bad data from your bad OSD over good
replicas.  So #2 has the potential to corrupt the data.  #1 will delete the
data you know is bad, leaving only good data behind to replicate.  Once
ceph pg repair gets more intelligent, I'll revisit this.

I also prefer the simplicity.  If it's dead or corrupt, they're treated the
same.
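
If you go with #1, a rough outline of the swap (a sketch only; the service syntax
depends on your init system, and osd.N / the rebuild step are placeholders):

  ceph osd set noout       # keep the cluster from marking the OSD out and remapping
  service ceph stop osd.N  # stop the OSD whose disk is being replaced
  # swap the disk, rebuild the OSD's filesystem/journal, and re-register it with
  # the same id so the CRUSH map stays unchanged
  service ceph start osd.N
  ceph osd unset noout     # let backfill repopulate the new disk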




On Sun, Nov 9, 2014 at 7:25 PM, GuangYang yguan...@outlook.com wrote:


 In terms of disk replacement, to avoid migrating data back and forth, are
 the below two approaches reasonable?
  1. Keep the OSD in and do an ad-hoc disk replacement and provision a new
 OSD (so as to keep the OSD id the same), and then trigger data migration.
 In this way the data migration only happens once, however, it does require
 operators to replace the disk very fast.
  2. Move the data on the broken disk to a new disk completely and use Ceph
 to repair bad objects.

 Thanks,
 Guang

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD commits suicide

2014-11-10 Thread Craig Lewis
Have you tuned any of the recovery or backfill parameters?  My ceph.conf
has:
[osd]
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1

Still, if it's running for a few hours, then failing, it sounds like there
might be something else at play.  OSDs use a lot of RAM during recovery.
How much RAM and how many OSDs do you have in these nodes?  What does
memory usage look like after a fresh restart, and what does it look like
when the problems start?  Even better if you know what it looks like 5
minutes before the problems start.

Is there anything interesting in the kernel logs?  OOM killers, or memory
deadlocks?



On Sat, Nov 8, 2014 at 11:19 AM, Erik Logtenberg e...@logtenberg.eu wrote:

 Hi,

 I have some OSD's that keep committing suicide. My cluster has ~1.3M
 misplaced objects, and it can't really recover, because OSD's keep
 failing before recovering finishes. The load on the hosts is quite high,
 but the cluster currently has no other tasks than just the
 backfilling/recovering.

 I attached the logfile from a failed OSD. It shows the suicide, the
 recent events and also me starting the OSD again after some time.

 It'll keep running for a couple of hours and then fail again, for the
 same reason.

 I noticed a lot of timeouts. Apparently ceph stresses the hosts to the
 limit with the recovery tasks, so much that they timeout and can't
 finish that task. I don't understand why. Can I somehow throttle ceph a
 bit so that it doesn't keep overrunning itself? I kinda feel like it
 should chill out a bit and simply recover one step at a time instead of
 full force and then fail.

 Thanks,

 Erik.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] An OSD always crashes a few minutes after start

2014-11-10 Thread Craig Lewis
You're running 0.87-6.  There were various fixes for this problem in
Firefly.  Were any of these snapshots created on early version of Firefly?

So far, every fix for this issue has gotten developers involved.  I'd see
if you can talk to some devs on IRC, or post to the ceph-devel mailing list.


My own experience is that I had to delete the affected PGs, and force
create them.  Hopefully there's a better answer now.



On Fri, Nov 7, 2014 at 8:10 PM, Chu Duc Minh chu.ducm...@gmail.com wrote:

 One of my OSDs has problems and can NOT be started. I tried to start it many
 times but it always crashes a few minutes after starting.
 I can think of two reasons that might make it crash:
 1. A read/write request to this OSD, which crashes it due to the corrupted
 volume/snapshot/parent-image/...
 2. The recovery process can NOT work properly due to the corrupted
 volumes/snapshot/parent-image/...

 After many retries and checking the logs, I guess reason (2) is the main cause,
 because if (1) were the main cause, other OSDs (containing the buggy
 volume/snapshot) would crash too.

 State of my ceph cluster (just few seconds before crash time):

   111/57706299 objects degraded (0.001%)
 14918 active+clean
1 active+clean+scrubbing+deep
   52 active+recovery_wait+degraded
2 active+recovering+degraded


 PS: I attached the crash-dump log of that OSD to this email for your information.

 Thank you!

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stuck in stale state

2014-11-10 Thread Craig Lewis
"nothing to send, going to standby" isn't necessarily bad, I see it from
time to time.  It shouldn't stay like that for long though.  If it's been 5
minutes, and the cluster still isn't doing anything, I'd restart that osd.

On Fri, Nov 7, 2014 at 1:55 PM, Jan Pekař jan.pe...@imatic.cz wrote:

 Hi,

 I was testing ceph cluster map changes and I got to stuck state which
 seems to be indefinite.
 First my description what I have done.

 I'm testing special case with only one copy of pg's (pool size = 1).

 All pg's was on one osd.0. I created second osd.1 and modified cluster map
 to transfer one pool (metadata) to the newly created osd.1
 PG's started to remap and objects degraded number was dropping - so
 everything looked normal.

 During that recovery process I restarted both osd daemons.
 After that I noticed that pg's that should be remapped had a stale state 
 - stale+active+remapped+backfilling - and other objects with a stale state.
 I tried to run ceph pg force_create_pg on one pg, that should be remapped,
 but nothing changed (that is 1 stuck / creating PG below in ceph health)

 Command rados -p metadata ls hangs so data are unavailable, but it should
 be there.

 What should I do in this state to get it working?

 ceph -s below:

 cluster 93418692-8e2e-4689-a237-ed5b47f39f72
  health HEALTH_WARN 52 pgs backfill; 1 pgs backfilling; 63 pgs stale;
 1 pgs stuck inactive; 63 pgs stuck stale; 54 pgs stuck unclean; recovery
 107232/1881806 objects degraded (5.698%); mon.imatic-mce low disk space
  monmap e1: 1 mons at {imatic-mce=192.168.11.165:6789/0}, election
 epoch 1, quorum 0 imatic-mce
  mdsmap e450: 1/1/1 up {0=imatic-mce=up:active}
  osdmap e275: 2 osds: 2 up, 2 in
   pgmap v51624: 448 pgs, 4 pools, 790 GB data, 1732 kobjects
 804 GB used, 2915 GB / 3720 GB avail
 107232/1881806 objects degraded (5.698%)
   52 stale+active+remapped+wait_backfill
1 creating
1 stale+active+remapped+backfilling
   10 stale+active+clean
  384 active+clean

 Last message in OSD log's:

 2014-11-07 22:17:45.402791 deb4db70  0 -- 192.168.11.165:6804/29564 
 192.168.11.165:6807/29939 pipe(0x9d52f00 sd=213 :53216 s=2 pgs=1 cs=1 l=0
 c=0x2c7f58c0).fault with nothing to send, going to standby

 Thank you for help
 With regards
 Jan Pekar, ceph fan
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem

2014-11-10 Thread Chad Seys
Hi Craig,

 If all of your PGs now have an empty down_osds_we_would_probe, I'd run
 through this discussion again.

Yep, looks to be true.

So I ran:

# ceph pg force_create_pg 2.5

and it has been creating for about 3 hours now. :/


# ceph health detail | grep creating
pg 2.5 is stuck inactive since forever, current state creating, last acting []
pg 2.5 is stuck unclean since forever, current state creating, last acting []

Then I restarted all OSDs.  The creating label disappears and I'm back with the 
same number of incomplete PGs.  :(

Is 'force_create_pg' the right command?  The 'mark_unfound_lost' command complains 
that 'pg has no unfound objects'.

I shall start the 'force_create_pg' again and wait longer, unless there is a 
different command to use?

Thanks!
Chad.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd down

2014-11-10 Thread Shain Miley
Craig,

Thanks for the info.

I ended up doing a zap and then a create via ceph-deploy.

One question that I still have is surrounding adding the failed osd back into 
the pool.

In this example...osd.70 was bad...when I added it back in via 
ceph-deploy...the disk was brought up as osd.108.

Only after osd.108 was up and running did I think to remove osd.70 from the 
crush map etc.

My question is this...had I removed it from the crush map prior to my 
ceph-deploy create...should/would Ceph have reused the osd number 70?

I would prefer to replace a failed disk with a new one and keep the old osd 
assignment, if possible...that is why I am asking.

Anyway...thanks again for all the help.

Shain

Sent from my iPhone

On Nov 7, 2014, at 2:09 PM, Craig Lewis 
cle...@centraldesktop.com wrote:

I'd stop that osd daemon, and run xfs_check / xfs_repair on that partition.

If you repair anything, you should probably force a deep-scrub on all the PGs 
on that disk.  I think ceph osd deep-scrub osdid will do that, but you might 
have to manually grep ceph pg dump .
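
For example (hedged; the pg dump output format differs a bit between releases):

  ceph osd deep-scrub 70
  # or per PG: list the PGs whose up/acting sets include osd.70 and scrub each one
  ceph pg dump pgs_brief | grep '\[70,\|,70\]\|,70,' | awk '{print $1}' | xargs -n1 ceph pg deep-scrub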


Or you could just treat it like a failed disk, but re-use the disk. 
ceph-disk-prepare --zap-disk should take care of you.


On Thu, Nov 6, 2014 at 5:06 PM, Shain Miley 
smi...@npr.org wrote:
I tried restarting all the osd's on that node, osd.70 was the only ceph process 
that did not come back online.

There is nothing in the ceph-osd log for osd.70.

However I do see over 13,000 of these messages in the kern.log:

Nov  6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1): xfs_log_force: 
error 5 returned.

Does anyone have any suggestions on how I might be able to get this HD back in 
the cluster (or whether or not it is worth even trying).

Thanks,

Shain

Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
smi...@npr.org | 202.513.3649


From: Shain Miley [smi...@npr.org]
Sent: Tuesday, November 04, 2014 3:55 PM
To: ceph-users@lists.ceph.com
Subject: osd down

Hello,

We are running ceph version 0.80.5 with 108 osd's.

Today I noticed that one of the osd's is down:

root@hqceph1:/var/log/ceph# ceph -s
 cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
  health HEALTH_WARN crush map has legacy tunables
  monmap e1: 3 mons at
{hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0},
election epoch 146, quorum 0,1,2 hqceph1,hqceph2,hqceph3
  osdmap e7119: 108 osds: 107 up, 107 in
   pgmap v6729985: 3208 pgs, 17 pools, 81193 GB data, 21631 kobjects
 216 TB used, 171 TB / 388 TB avail
 3204 active+clean
4 active+clean+scrubbing
   client io 4079 kB/s wr, 8 op/s


Using osd dump I determined that it is osd number 70:

osd.70 down out weight 0 up_from 2668 up_thru 6886 down_at 6913
last_clean_interval [488,2665) 10.35.1.217:6814/22440
10.35.1.217:6820/22440 10.35.1.217:6824/22440 10.35.1.217:6830/22440
autoout,exists 5dbd4a14-5045-490e-859b-15533cd67568


Looking at that node, the drive is still mounted and I did not see any
errors in any of the system logs, and the raid level status shows the
drive as up and healthy, etc.


root@hqosd6:~# df -h |grep 70
/dev/sdl1   3.7T  1.9T  1.9T  51% /var/lib/ceph/osd/ceph-70


I was hoping that someone might be able to advise me on the next course
of action (can I add the osd back in?, should I replace the drive
altogether, etc)

I have attached the osd log to this email.

Any suggestions would be great.

Thanks,

Shain















--
Shain Miley | Manager of Systems and Infrastructure, Digital Media |
smi...@npr.org | 202.513.3649
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pg's stuck in inactive/unclean state + Association from PG-OSD does not seem to be happening.

2014-11-10 Thread Jan Pekař

It is simple.
When you have this kind of problem (stuck), first look into crush map.

And here you are:

You have only one ruleset (0) with "step take default" (so it selects 
osd's from the default root subtree), but your root doesn't 
contain any osds. See below:


rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

root default {
id -1   # do not change unnecessarily
# weight 0.000
alg straw
hash 0  # rjenkins1
}

I recommend adding octeon1 and octeon as items under the default root, and it 
should work (or create another root and replace "step take default" with 
your new root's name).
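
Something like this should move the host buckets under the default root (a
sketch, assuming the host buckets are named octeon and octeon1 as in the osd
tree):

  ceph osd crush move octeon root=default
  ceph osd crush move octeon1 root=default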


JP

On 2014-11-10 20:21, Prashanth Nednoor wrote:

Folks,

Now we are running into an issue where the PGs (192) are stuck in the creating 
state forever.
I have experimented with various PG settings (osd_pool_default_pg_num from 50 to 
400) for replicas and defaults, and it doesn't seem to help so far.
Just to give you a brief overview, I have 8 osd's.
I see "create_pg is pending" messages in the ceph monitor logs.
I have attached the following logs in the zip file.
1) crush map(crush.map)
2) ceph osd tree, (OSD_TREE.txt OSD's  1,2,3,4 belong to host octeon and  OSD's 
0,5,6,7 belong to host octeon1).
3) ceph pg dump, health details, etc. (dump_pgs, health_detail)
4) Attached the ceph.conf
5) ceph osd lspools.
0 data,1 metadata,2 rbd,

Here is the dump for ceph -w before any osd's were created:
ceph -w
 cluster 3eda0199-93a9-428b-8209-caeff84d3d3f
  health HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds
  monmap e1: 1 mons at {essperf13=209.243.160.45:6789/0}, election epoch 1, 
quorum 0 essperf13
  osdmap e205: 0 osds: 0 up, 0 in
   pgmap v928: 192 pgs, 3 pools, 0 bytes data, 0 objects
 0 kB used, 0 kB / 0 kB avail
  192 creating

2014-11-05 23:26:46.555348 mon.0 [INF] pgmap v928: 192 pgs: 192 creating; 0 
bytes data, 0 kB used, 0 kB / 0 kB avail

Here is the dump for ceph -w after  8 osd's were created:
ceph -w
 cluster 3eda0199-93a9-428b-8209-caeff84d3d3f
  health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean
  monmap e1: 1 mons at {essperf13=209.243.160.45:6789/0}, election epoch 1, 
quorum 0 essperf13
  osdmap e213: 8 osds: 8 up, 8 in
   pgmap v958: 192 pgs, 3 pools, 0 bytes data, 0 objects
 328 MB used, 14856 GB / 14856 GB avail
  192 creating

2014-11-05 23:46:25.461143 mon.0 [INF] pgmap v958: 192 pgs: 192 creating; 0 
bytes data, 328 MB used, 14856 GB / 14856 GB avail

Any pointers to resolve this issue will be helpful.

Thanks
Prashanth



-Original Message-
From: Prashanth Nednoor
Sent: Tuesday, October 28, 2014 9:26 PM
To: 'Sage Weil'
Cc: Philip Kufeldt; ceph-de...@vger.kernel.org
Subject: RE: cephx auth issues:Having issues trying to get the OSD up on a 
MIPS64, when the OSD tries to communicate with the monitor!!!

Sage,

As requested I set the debug setting in ceph.conf on both the sides.
Here are the logs for  the  OSD and  MONITOR attached.
1) OSD : IPADDRESS: 209.243.157.187. Logfile attached is: Ceph-0.log
2) MONITOR: IP ADDRESS: 209.243.160.45, Logfile attached is: 
Ceph-mon.essperf13.log

Please Note that AUTHENTICATION IS DISABLED IN THE /etc/ceph/ceph.conf files on 
both OSD and monitor.
In addition to this, on the OSD side I bypassed the part of the authentication code 
that was causing trouble (monc->authenticate) in the osd_init function call. I hope 
this is ok.
Good news is my osd daemon is up now on the MIPS side, finally, but for some 
reason MONITOR is still not detecting the OSD.

It seems from the ceph mon log, it knows the  OSD is at 187 and it does 
exchange some information.

Thanks for your prompt response and help.

Thanks
Prashanth

-Original Message-
From: Sage Weil [mailto:s...@newdream.net]
Sent: Tuesday, October 28, 2014 4:59 PM
To: Prashanth Nednoor
Cc: Philip Kufeldt; ceph-de...@vger.kernel.org
Subject: Re: cephx auth issues:Having issues trying to get the OSD up on a 
MIPS64, when the OSD tries to communicate with the monitor!!!

Hi,

On Tue, 28 Oct 2014, Prashanth Nednoor wrote:

Folks,

I am trying to get the osd up and having an issue. OSD does exchange some 
messages with the MONITOR before this error.
Seems like an issue with authentication in my set up with MIPS based OSD and 
Intel XEON MONITORS. I have attached the logs.
The OSD(209.243.157.187) sends some request to MONITOR (209.243.160.45).
I see the message "No session security set", followed by the message below.
The reply is coming back as auth_reply(proto 2 -1 (1) Operation not permitted.

Is there an ENDIAN issue here between the MIPS based OSD (BIG ENDIAN) and the INTEL 
XEONS (LITTLE ENDIAN)? My CEPH-MONITORS are INTEL XEONS.

I made sure the keyrings are all consistent. Here are the keys on OSD and 
MONITOR.

I tried 

Re: [ceph-users] Stuck in stale state

2014-11-10 Thread Jan Pekař
Thank you, and sorry for bothering you; I was new to the ceph-users list and 
couldn't cancel my message. I found out what happened a few hours later.


The main problem was that I moved one OSD out of its host hostname {} crush map 
entry (I wanted to do so). Everything was OK, but restarting the OSD caused the 
osd to be placed back into the host hostname {} crush map section 
automatically.

I solved it with
 osd crush update on start = false
see ceph-crush-location hook
http://ceph.com/docs/master/rados/operations/crush-map/
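
For anyone hitting the same thing, a minimal ceph.conf sketch of that setting
(the section placement is the usual one; adjust to taste):

    [osd]
        # keep OSDs where I placed them in the crush map instead of moving
        # them back under their host bucket when the daemon starts
        osd crush update on start = false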

You can consider this solved; there is no problem with Ceph, only my poor 
knowledge caused it.


JP


On 2014-11-10 20:53, Craig Lewis wrote:

nothing to send, going to standby isn't necessarily bad, I see it from
time to time.  It shouldn't stay like that for long though.  If it's
been 5 minutes, and the cluster still isn't doing anything, I'd restart
that osd.

On Fri, Nov 7, 2014 at 1:55 PM, Jan Pekař jan.pe...@imatic.cz
mailto:jan.pe...@imatic.cz wrote:

Hi,

I was testing ceph cluster map changes and I got to stuck state
which seems to be indefinite.
First my description what I have done.

I'm testing special case with only one copy of pg's (pool size = 1).

All PGs were on one osd.0. I created a second osd.1 and modified the
cluster map to transfer one pool (metadata) to the newly created osd.1.
PGs started to remap and the number of degraded objects was dropping,
so everything looked normal.

During that recovery process I restarted both osd daemons.
After that I noticed that the PGs that should be remapped had a stale
state (stale+active+remapped+backfilling) and other PGs were in a
stale state as well.
I tried to run ceph pg force_create_pg on one PG that should be
remapped, but nothing changed (that is the 1 stuck / creating PG below
in ceph health).

Command rados -p metadata ls hangs so data are unavailable, but it
should be there.

What should I do in this state to get it working?

ceph -s below:

 cluster 93418692-8e2e-4689-a237-ed5b47f39f72
  health HEALTH_WARN 52 pgs backfill; 1 pgs backfilling; 63 pgs
stale; 1 pgs stuck inactive; 63 pgs stuck stale; 54 pgs stuck
unclean; recovery 107232/1881806 objects degraded (5.698%);
mon.imatic-mce low disk space
  monmap e1: 1 mons at {imatic-mce=192.168.11.165:6789/0},
election epoch 1, quorum 0 imatic-mce
  mdsmap e450: 1/1/1 up {0=imatic-mce=up:active}
  osdmap e275: 2 osds: 2 up, 2 in
   pgmap v51624: 448 pgs, 4 pools, 790 GB data, 1732 kobjects
 804 GB used, 2915 GB / 3720 GB avail
 107232/1881806 objects degraded (5.698%)
   52 stale+active+remapped+wait_backfill
1 creating
1 stale+active+remapped+backfilling
   10 stale+active+clean
  384 active+clean

Last message in OSD log's:

2014-11-07 22:17:45.402791 deb4db70  0 -- 192.168.11.165:6804/29564 >>
192.168.11.165:6807/29939 pipe(0x9d52f00 sd=213 :53216 s=2
pgs=1 cs=1 l=0 c=0x2c7f58c0).fault with nothing to send, going to
standby

Thank you for help
With regards
Jan Pekar, ceph fan
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--

Ing. Jan Pekař
jan.pe...@imatic.cz | +420603811737

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Node down question

2014-11-10 Thread Jason
I have searched the list archives, and have seen a couple of references
to this question, but no real solution, unfortunately...

We are running multiple ceph clusters, pretty much as media appliances. 
As such, the number of nodes is variable, and all of the nodes are
symmetric (i.e. same CPU power, memory, disk space).  As a result, we
are running a monitor and OSD (connected to an SSD RAID) on each of the
systems.  The number of nodes is typically small, on the order of five
to a dozen.  As the node count gets higher, we are planning not to run
monitors on all nodes.

Our pools are typically set up with a replication size of 2 or 3, with a
minsize of 1.

The problem occurs when a single node goes down, such that its monitor
and OSD stop at once.  For a client (especially a writer) on another
node, there is a pretty consistent 20 second delay until further
operations go through.  This is a delay that we cannot easily survive.

If I first bring down the OSD, then wait a few seconds, and then bring
down the monitor, the system behaves with only a few seconds of delay. 
However, we can't always guarantee the graceful shutdown (such as when a
node is rebooted, loses network connectivity, or power is lost).

Note that I get exactly the same behavior if I stop an OSD on one
system, while stopping a monitor on another...

Previous discussions similar to this have touched upon the osd
heartbeat grace setting, which is conspicuously set to 20 seconds.  I
have tried changing this, along with any other related settings, to no
avail; whatever I do, the delay remains at 20 seconds.
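
For reference, these are the kind of knobs involved; values here are purely
illustrative, a sketch of what we have been experimenting with rather than a
tested recommendation:

    [global]
        osd heartbeat interval     = 3    # default 6
        osd heartbeat grace        = 10   # default 20
        mon osd min down reporters = 1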

Anything else to try?

Jason

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd down

2014-11-10 Thread Craig Lewis
Yes, removing an OSD before re-creating it will give you the same OSD ID.
That's my preferred method, because it keeps the crushmap the same.  Only
PGs that existed on the replaced disk need to be backfilled.

I don't know if adding the replacement to the same host and then removing the
old OSD gives you the same CRUSH map as doing it the other way around.  I suspect
not, because the OSDs are re-ordered on that host.
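
For reference, a sketch of the usual removal sequence (osd.70 taken from the
example below; I haven't re-verified it on 0.80.5, so treat it as a guide rather
than a recipe):

    ceph osd out 70
    # stop the ceph-osd daemon for osd.70 on its host, then:
    ceph osd crush remove osd.70
    ceph auth del osd.70
    ceph osd rm 70
    # the next osd create / ceph-deploy osd create should hand out id 70 again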


On Mon, Nov 10, 2014 at 1:29 PM, Shain Miley smi...@npr.org wrote:

   Craig,

  Thanks for the info.

  I ended up doing a zap and then a create via ceph-deploy.

  One question that I still have is surrounding adding the failed osd back
 into the pool.

  In this example...osd.70 was bad; when I added it back in via
 ceph-deploy...the disk was brought up as osd.108.

  Only after osd.108 was up and running did I think to remove osd.70 from
 the crush map etc.

  My question is this...had I removed it from the crush map prior to my
 ceph-deploy create...should/would Ceph have reused the osd number 70?

  I would prefer to replace a failed disk with a new one and keep the old
 osd assignment, if possible; that is why I am asking.

  Anyway...thanks again for all the help.

  Shain

 Sent from my iPhone

 On Nov 7, 2014, at 2:09 PM, Craig Lewis cle...@centraldesktop.com wrote:

   I'd stop that osd daemon, and run xfs_check / xfs_repair on that
 partition.

  If you repair anything, you should probably force a deep-scrub on all
 the PGs on that disk.  I think ceph osd deep-scrub <osdid> will do that,
 but you might have to manually grep ceph pg dump.
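
 A rough sketch of that route (osd id and device taken from the thread below;
 init syntax varies by distro, so adjust accordingly):

     stop ceph-osd id=70              # or: service ceph stop osd.70
     umount /var/lib/ceph/osd/ceph-70
     xfs_repair -n /dev/sdl1          # dry run first
     xfs_repair /dev/sdl1
     mount /var/lib/ceph/osd/ceph-70  # assumes an fstab entry; otherwise mount the device explicitly
     start ceph-osd id=70
     ceph osd deep-scrub 70           # queue a deep scrub of every PG on that OSD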


  Or you could just treat it like a failed disk, but re-use the disk. 
 ceph-disk-prepare
 --zap-disk should take care of you.


 On Thu, Nov 6, 2014 at 5:06 PM, Shain Miley smi...@npr.org wrote:

 I tried restarting all the osd's on that node, osd.70 was the only ceph
 process that did not come back online.

 There is nothing in the ceph-osd log for osd.70.

 However I do see over 13,000 of these messages in the kern.log:

 Nov  6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1):
 xfs_log_force: error 5 returned.

 Does anyone have any suggestions on how I might be able to get this HD
 back in the cluster (or whether or not it is worth even trying).

 Thanks,

 Shain

 Shain Miley | Manager of Systems and Infrastructure, Digital Media |
 smi...@npr.org | 202.513.3649

 
 From: Shain Miley [smi...@npr.org]
 Sent: Tuesday, November 04, 2014 3:55 PM
 To: ceph-users@lists.ceph.com
 Subject: osd down

 Hello,

 We are running ceph version 0.80.5 with 108 osd's.

 Today I noticed that one of the osd's is down:

 root@hqceph1:/var/log/ceph# ceph -s
  cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
   health HEALTH_WARN crush map has legacy tunables
   monmap e1: 3 mons at
 {hqceph1=
 10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0
 },
 election epoch 146, quorum 0,1,2 hqceph1,hqceph2,hqceph3
   osdmap e7119: 108 osds: 107 up, 107 in
pgmap v6729985: 3208 pgs, 17 pools, 81193 GB data, 21631 kobjects
  216 TB used, 171 TB / 388 TB avail
  3204 active+clean
 4 active+clean+scrubbing
client io 4079 kB/s wr, 8 op/s


 Using osd dump I determined that it is osd number 70:

  osd.70 down out weight 0 up_from 2668 up_thru 6886 down_at 6913
  last_clean_interval [488,2665) 10.35.1.217:6814/22440
  10.35.1.217:6820/22440 10.35.1.217:6824/22440 10.35.1.217:6830/22440
  autoout,exists 5dbd4a14-5045-490e-859b-15533cd67568


 Looking at that node, the drive is still mounted and I did not see any
 errors in any of the system logs, and the raid level status shows the
 drive as up and healthy, etc.


 root@hqosd6:~# df -h |grep 70
 /dev/sdl1   3.7T  1.9T  1.9T  51% /var/lib/ceph/osd/ceph-70


 I was hoping that someone might be able to advise me on the next course
 of action (can I add the osd back in?, should I replace the drive
 altogether, etc)

 I have attached the osd log to this email.

 Any suggestions would be great.

 Thanks,

 Shain















 --
 Shain Miley | Manager of Systems and Infrastructure, Digital Media |
 smi...@npr.org | 202.513.3649
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Trying to figure out usable space on erasure coded pools

2014-11-10 Thread David Moreau Simard
Hi,

It's easy to calculate the amount of raw storage vs actual storage on 
replicated pools.
Example with 4x 2TB disks:
- 8TB raw
- 4TB usable (when using 2 replicas)

I understand how erasure coded pools reduces the overhead of storage required 
for data redundancy and resiliency and how it depends on the erasure coding 
profile you use.

Do you guys have an easy way to figure out the amount of usable storage ?

Thanks !
--
David Moreau Simard



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem

2014-11-10 Thread Craig Lewis
I had the same experience with force_create_pg too.

I ran it, and the PGs sat there in creating state.  I left the cluster
overnight, and sometime in the middle of the night, they created.  The
actual transition from creating to active+clean happened during the
recovery after a single OSD was kicked out.  I don't recall if that single
OSD was responsible for the creating PGs.  I really can't say what
un-jammed my creating PGs.


On Mon, Nov 10, 2014 at 12:33 PM, Chad Seys cws...@physics.wisc.edu wrote:

 Hi Craig,

  If all of your PGs now have an empty down_osds_we_would_probe, I'd run
  through this discussion again.

 Yep, looks to be true.

 So I ran:

 # ceph pg force_create_pg 2.5

 and it has been creating for about 3 hours now. :/


 # ceph health detail | grep creating
 pg 2.5 is stuck inactive since forever, current state creating, last
 acting []
 pg 2.5 is stuck unclean since forever, current state creating, last acting
 []

 Then I restart all OSDs.  The creating label disapears and I'm back with
 same number of incomplete PGs.  :(

 Is 'force_create_pg' the right command?  'mark_unfound_lost' complains
 that 'pg has no unfound objects'.

 I shall start the 'force_create_pg' again and wait longer, unless there
 is a different command to use?

 Thanks!
 Chad.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Trying to figure out usable space on erasure coded pools

2014-11-10 Thread Sage Weil
On Mon, 10 Nov 2014, David Moreau Simard wrote:
 Hi,
 
 It's easy to calculate the amount of raw storage vs actual storage on 
 replicated pools.
 Example with 4x 2TB disks:
 - 8TB raw
 - 4TB usable (when using 2 replicas)
 
 I understand how erasure coded pools reduces the overhead of storage required 
 for data redundancy and resiliency and how it depends on the erasure coding 
 profile you use.
 
 Do you guys have an easy way to figure out the amount of usable storage ?

The 'ceph df' command now has a 'MAX AVAIL' column that factors in either 
the replication factor or erasure k/(k+m) ratio.  It also takes into 
account the projected distribution of data across disks from the CRUSH 
rule and uses the 'first OSD to fill up' as the target.

What it doesn't take into account is the expected variation in utilization 
or the 'full_ratio' and 'near_full_ratio' which will stop writes sometime 
before that point.
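
For a quick back-of-the-envelope check, usable space is roughly raw * k / (k + m);
the numbers below are illustrative only, not from a real profile:

    raw_tb=8; k=4; m=2
    echo "scale=2; $raw_tb * $k / ($k + $m)" | bc    # ~5.33 TB, before full_ratio headroom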

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Trying to figure out usable space on erasure coded pools

2014-11-10 Thread David Moreau Simard
Oh, that's interesting - I didn't know that.

Thanks.
--
David Moreau Simard


 On Nov 10, 2014, at 6:06 PM, Sage Weil s...@newdream.net wrote:
 
 On Mon, 10 Nov 2014, David Moreau Simard wrote:
 Hi,
 
 It's easy to calculate the amount of raw storage vs actual storage on 
 replicated pools.
 Example with 4x 2TB disks:
 - 8TB raw
 - 4TB usable (when using 2 replicas)
 
 I understand how erasure coded pools reduces the overhead of storage 
 required for data redundancy and resiliency and how it depends on the 
 erasure coding profile you use.
 
 Do you guys have an easy way to figure out the amount of usable storage ?
 
 The 'ceph df' command now has a 'MAX AVAIL' column that factors in either 
 the replication factor or erasure k/(k+m) ratio.  It also takes into 
 account the projected distribution of data across disks from the CRUSH 
 rule and uses the 'first OSD to fill up' as the target.
 
 What it doesn't take into account is the expected variation in utilization 
 or the 'full_ratio' and 'near_full_ratio' which will stop writes sometime 
 before that point.
 
 sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG's incomplete after OSD failure

2014-11-10 Thread Matthew Anderson
Hi All,

We've had a string of very unfortunate failures and need a hand fixing
the incomplete PG's that we're now left with. We're configured with 3
replicas over different hosts with 5 in total.

The timeline goes -
-1 week  :: A full server goes offline with a failed backplane. Still
not working
-1 day  ::  OSD 190 fails
-1 day + 3 minutes :: OSD 121 in a different server fails, taking
out several PG's and blocking IO
Today  :: The first failed osd (osd.190) was cloned to a good drive
with xfs_dump | xfs_restore and now boots fine. The last failed osd
(osd.121) is completely unrecoverable and was marked as lost.

What we're left with now is 2 incomplete PG's that are preventing RBD
images from booting.

# ceph pg dump_stuck inactive
ok
pg_stat objects mip degr misp unf bytes       log   disklog state      state_stamp                v              reported       up        up_primary acting    acting_primary last_scrub   scrub_stamp                last_deep_scrub deep_scrub_stamp
8.ca    2440    0   0    0    0   10219748864 9205  9205    incomplete 2014-11-11 10:29:04.910512 160435'959618  161358:6071679 [190,111] 190        [190,111] 190            86417'207324 2013-09-09 12:58:10.749001 86229'196887    2013-09-02 12:57:58.162789
8.6ae   0       0   0    0    0   0           3176  3176    incomplete 2014-11-11 10:24:07.000373 160931'1935986 161358:267     [117,190] 117        [117,190] 117            86424'389748 2013-09-09 16:52:58.796650 86424'389748    2013-09-09 16:52:58.796650

We've tried doing a pg revert but it's saying 'no missing objects'
followed by not doing anything. I've also done the usual scrub,
deep-scrub, pg and osd repairs... so far nothing has helped.

I think it could be a similar situation to this post [
http://www.spinics.net/lists/ceph-users/msg11461.html ] where one of
the OSDs is holding a slightly newer but incomplete version of the PG
which needs to be removed. Is anyone able to shed some light on how I
might be able to use the objectstore tool to check if this is the
case?

If anyone has any suggestions it would be greatly appreciated.
Likewise if you need any more information about my problem just let me
know

Thanks all
-Matt
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osds fails to start with mismatch in id

2014-11-10 Thread Ramakrishna Nishtala (rnishtal)
Hi Greg,

Thanks for the pointer. I think you are right. The full story is like this.



After installation, everything works fine until I reboot. I do observe udevadm 
getting triggered in the logs, but the devices do not come up after reboot. It is the 
exact issue described in http://tracker.ceph.com/issues/5194, but that has been fixed 
a while back per the case details.

As a workaround, I copied the contents of /proc/mounts into fstab, and that's 
where I ran into the issue.



Following your suggestion, I defined the mounts by UUID in fstab, but hit a similar problem.

blkid.tab has now moved to tmpfs and also isn't consistent even after issuing blkid 
explicitly to get the UUIDs, which is in line with the ceph-disk comments.



I decided to reinstall, dd the partitions, zap the disks, etc. That did not help. It is 
very weird that the links below change in /dev/disk/by-uuid, /dev/disk/by-partuuid, etc.



Before reboot

lrwxrwxrwx 1 root root 10 Nov 10 06:31 11aca3e2-a9d5-4bcc-a5b0-441c53d473b6 -> ../../sdd2
lrwxrwxrwx 1 root root 10 Nov 10 06:31 89594989-90cb-4144-ac99-0ffd6a04146e -> ../../sde2
lrwxrwxrwx 1 root root 10 Nov 10 06:31 c17fe791-5525-4b09-92c4-f90eaaf80dc6 -> ../../sda2
lrwxrwxrwx 1 root root 10 Nov 10 06:31 c57541a1-6820-44a8-943f-94d68b4b03d4 -> ../../sdc2
lrwxrwxrwx 1 root root 10 Nov 10 06:31 da7030dd-712e-45e4-8d89-6e795d9f8011 -> ../../sdb2

After reboot

lrwxrwxrwx 1 root root 10 Nov 10 09:50 11aca3e2-a9d5-4bcc-a5b0-441c53d473b6 -> ../../sdd2
lrwxrwxrwx 1 root root 10 Nov 10 09:50 89594989-90cb-4144-ac99-0ffd6a04146e -> ../../sde2
lrwxrwxrwx 1 root root 10 Nov 10 09:50 c17fe791-5525-4b09-92c4-f90eaaf80dc6 -> ../../sda2
lrwxrwxrwx 1 root root 10 Nov 10 09:50 c57541a1-6820-44a8-943f-94d68b4b03d4 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Nov 10 09:50 da7030dd-712e-45e4-8d89-6e795d9f8011 -> ../../sdh2



Essentially, the transformation here is sdb2 -> sdh2 and sdc2 -> sdb2. In fact I 
had not partitioned my sdh at all before the test. The only difference from the 
standard procedure is probably that I pre-created the partitions for the 
journal and data with parted.



The osd rules in /lib/udev/rules.d have four different partition GUID codes:

45b0969e-9b03-4f30-b4c6-5ec00ceff106,

45b0969e-9b03-4f30-b4c6-b4b80ceff106,

4fbd7e29-9d25-41b8-afd0-062c0ceff05d,

4fbd7e29-9d25-41b8-afd0-5ec00ceff05d,



But all my journal/data partitions have ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 as 
their partition GUID code.
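
In case it is useful, a hedged sketch of re-tagging pre-created partitions with
the Ceph type codes from the udev rules above, so the rules can pick them up at
boot (device name and partition numbers are examples only, and the data/journal
mapping is my assumption, so double-check before running):

    sgdisk --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/sdb   # OSD data partition
    sgdisk --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb   # journal partition
    partprobe /dev/sdb   # or reboot, so the kernel and udev re-read the GPT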



Appreciate any help.



Regards,



Rama

=

-Original Message-
From: Gregory Farnum [mailto:g...@gregs42.com]
Sent: Sunday, November 09, 2014 3:36 PM
To: Ramakrishna Nishtala (rnishtal)
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] osds fails to start with mismatch in id



On Sun, Nov 9, 2014 at 3:21 PM, Ramakrishna Nishtala (rnishtal) 
rnish...@cisco.commailto:rnish...@cisco.com wrote:

 Hi



 I am on ceph 0.87, RHEL 7



 Out of 60 few osd’s start and the rest complain about mismatch about

 id’s as below.







 2014-11-09 07:09:55.501177 7f4633e01880 -1 OSD id 56 != my id 53



 2014-11-09 07:09:55.810048 7f636edf4880 -1 OSD id 57 != my id 54



 2014-11-09 07:09:56.122957 7f459a766880 -1 OSD id 58 != my id 55



 2014-11-09 07:09:56.429771 7f87f8e0c880 -1 OSD id 0 != my id 56



 2014-11-09 07:09:56.741329 7fadd9b91880 -1 OSD id 2 != my id 57







 Found one OSD ID in /var/lib/ceph/cluster-id/keyring. To check this

 out manually corrected it and turned authentication to none too, but

 did not help.







 Any clues, how it can be corrected?



It sounds like maybe the symlinks to data and journal aren't matching up with 
where they're supposed to be. This is usually a result of using unstable /dev 
links that don't always match to the same physical disks. Have you checked that?

-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG's incomplete after OSD failure

2014-11-10 Thread Matthew Anderson
Just an update, it appears that no data actually exists for those PG's
on osd.117 and osd.111 but it's showing as incomplete anyway.

So for the 8.ca PG, osd.111 has only an empty directory but osd 190 is
filled with data.
For 8.6ae, osd.117 has no data in the pg directory and osd.190 is
filled with data as before.

Since all of the required data is on OSD.190, would there be a way to
make osd.111 and osd.117 forget they have ever seen the two incomplete
PG's and therefore restart backfilling?


On Tue, Nov 11, 2014 at 10:37 AM, Matthew Anderson
manderson8...@gmail.com wrote:
 Hi All,

 We've had a string of very unfortunate failures and need a hand fixing
 the incomplete PG's that we're now left with. We're configured with 3
 replicas over different hosts with 5 in total.

 The timeline goes -
 -1 week  :: A full server goes offline with a failed backplane. Still
 not working
 -1 day  ::  OSD 190 fails
 -1 day + 3 minutes :: OSD 121 fails in a different server fails taking
 out several PG's and blocking IO
 Today  :: The first failed osd (osd.190) was cloned to a good drive
 with xfs_dump | xfs_restore and now boots fine. The last failed osd
 (osd.121) is completely unrecoverable and was marked as lost.

 What we're left with now is 2 incomplete PG's that are preventing RBD
 images from booting.

 # ceph pg dump_stuck inactive
 ok
 pg_statobjectsmipdegrmispunfbyteslog
 disklogstatestate_stampvreportedupup_primary
  actingacting_primarylast_scrubscrub_stamp
 last_deep_scrubdeep_scrub_stamp
 8.ca244000001021974886492059205
 incomplete2014-11-11 10:29:04.910512160435'959618
 161358:6071679[190,111]190[190,111]19086417'207324
2013-09-09 12:58:10.74900186229'1968872013-09-02
 12:57:58.162789
 8.6ae00000031763176incomplete
 2014-11-11 10:24:07.000373160931'1935986161358:267
 [117,190]117[117,190]11786424'3897482013-09-09
 16:52:58.79665086424'3897482013-09-09 16:52:58.796650

 We've tried doing a pg revert but it's saying 'no missing objects'
 followed by not doing anything. I've also done the usual scrub,
 deep-scrub, pg and osd repairs... so far nothing has helped.

 I think it could be a similar situation to this post [
 http://www.spinics.net/lists/ceph-users/msg11461.html ] where one of
 the osd's it holding a slightly newer but incomplete version of the PG
 which needs to be removed. Is anyone able to shed some light on how I
 might be able to use the objectstore tool to check if this is the
 case?

 If anyone has any suggestions it would be greatly appreciated.
 Likewise if you need any more information about my problem just let me
 know

 Thanks all
 -Matt
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] does anyone know what xfsaild and kworker are?they make osd disk busy. produce 100-200iops per osd disk?

2014-11-10 Thread duan . xufeng

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG's incomplete after OSD failure

2014-11-10 Thread Sage Weil
On Tue, 11 Nov 2014, Matthew Anderson wrote:
 Just an update, it appears that no data actually exists for those PG's
 on osd.117 and osd.111 but it's showing as incomplete anyway.
 
 So for the 8.ca PG, osd.111 has only an empty directory but osd 190 is
 filled with data.
 For 8.6ae, osd.117 has no data in the pg directory and osd.190 is
 filled with data as before.
 
 Since all of the required data is on OSD.190, would there be a way to
 make osd.111 and osd.117 forget they have ever seen the two incomplete
 PG's and therefore restart backfilling?

Ah, that's good news.  You should know that the copy on osd.190 is 
slightly out of date, but it is much better than losing the entire 
contents of the PG.  More specifically, for 8.6ae the latest version was 
1935986 but the osd.190 is 1935747, about 200 writes in the past.  You'll 
need to fsck the RBD images after this is all done.

I don't think we've tested this recovery scenario, but I think you'll be 
able to recovery with ceph_objectstore_tool, which has an import/export 
function and a delete function.  First, try removing the newer version of 
the pg on osd.117.  First export it for good measure (even tho it's 
empty):

stop the osd

ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-117  \
--journal-path /var/lib/ceph/osd/ceph-117/journal \
--op export --pgid 8.6ae --file osd.117.8.6ae

ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-117  \
--journal-path /var/lib/ceph/osd/ceph-117/journal \
--op remove --pgid 8.6ae

and restart.  If that doesn't peer, you can also try exporting the pg from 
osd.190 and importing it into osd.117.  I think just removing the 
newer empty pg on osd.117 will do the trick, though...
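
If it comes to that, a sketch of the export/import fallback, mirroring the flags 
above (stop osd.190 and osd.117 first; the --op import invocation is from memory, 
so check ceph_objectstore_tool --help on your build):

ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-190  \
--journal-path /var/lib/ceph/osd/ceph-190/journal \
--op export --pgid 8.6ae --file osd.190.8.6ae

ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-117  \
--journal-path /var/lib/ceph/osd/ceph-117/journal \
--op import --file osd.190.8.6ae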

sage



 
 
 On Tue, Nov 11, 2014 at 10:37 AM, Matthew Anderson
 manderson8...@gmail.com wrote:
  Hi All,
 
  We've had a string of very unfortunate failures and need a hand fixing
  the incomplete PG's that we're now left with. We're configured with 3
  replicas over different hosts with 5 in total.
 
  The timeline goes -
  -1 week  :: A full server goes offline with a failed backplane. Still
  not working
  -1 day  ::  OSD 190 fails
  -1 day + 3 minutes :: OSD 121 fails in a different server fails taking
  out several PG's and blocking IO
  Today  :: The first failed osd (osd.190) was cloned to a good drive
  with xfs_dump | xfs_restore and now boots fine. The last failed osd
  (osd.121) is completely unrecoverable and was marked as lost.
 
  What we're left with now is 2 incomplete PG's that are preventing RBD
  images from booting.
 
  # ceph pg dump_stuck inactive
  ok
  pg_statobjectsmipdegrmispunfbyteslog
  disklogstatestate_stampvreportedupup_primary
   actingacting_primarylast_scrubscrub_stamp
  last_deep_scrubdeep_scrub_stamp
  8.ca244000001021974886492059205
  incomplete2014-11-11 10:29:04.910512160435'959618
  161358:6071679[190,111]190[190,111]19086417'207324
 2013-09-09 12:58:10.74900186229'1968872013-09-02
  12:57:58.162789
  8.6ae00000031763176incomplete
  2014-11-11 10:24:07.000373160931'1935986161358:267
  [117,190]117[117,190]11786424'3897482013-09-09
  16:52:58.79665086424'3897482013-09-09 16:52:58.796650
 
  We've tried doing a pg revert but it's saying 'no missing objects'
  followed by not doing anything. I've also done the usual scrub,
  deep-scrub, pg and osd repairs... so far nothing has helped.
 
  I think it could be a similar situation to this post [
  http://www.spinics.net/lists/ceph-users/msg11461.html ] where one of
  the osd's it holding a slightly newer but incomplete version of the PG
  which needs to be removed. Is anyone able to shed some light on how I
  might be able to use the objectstore tool to check if this is the
  case?
 
  If anyone has any suggestions it would be greatly appreciated.
  Likewise if you need any more information about my problem just let me
  know
 
  Thanks all
  -Matt
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osds fails to start with mismatch in id

2014-11-10 Thread Irek Fasikhov
Hi, Ramakrishna.
I think you understand what the problem is:
[ceph@ceph05 ~]$ cat /var/lib/ceph/osd/ceph-56/whoami
56
[ceph@ceph05 ~]$ cat /var/lib/ceph/osd/ceph-57/whoami
57


Tue Nov 11 2014 at 6:01:40, Ramakrishna Nishtala (rnishtal) 
rnish...@cisco.com:

  Hi Greg,

 Thanks for the pointer. I think you are right. The full story is like this.



 After installation, everything works fine until I reboot. I do observe
 udevadm getting triggered in logs, but the devices do not come up after
 reboot. Exact issue as http://tracker.ceph.com/issues/5194. But this has
 been fixed a while back per the case details.

 As a workaround, I copied the contents from /proc/mounts to fstab and
 that’s where I landed into the issue.



 After your suggestion, defined as UUID in fstab, but similar problem.

 blkid.tab now moved to tmpfs and also isn’t consistent ever after issuing
 blkid explicitly to get the UUID’s. Goes in line with ceph-disk comments.



 Decided to reinstall, dd the partitions, zapdisks etc. Did not help. Very
 weird that links below change in /dev/disk/by-uuid and
 /dev/disk/by-partuuid etc.



 *Before reboot*

 lrwxrwxrwx 1 root root 10 Nov 10 06:31
 11aca3e2-a9d5-4bcc-a5b0-441c53d473b6 - ../../sdd2

 lrwxrwxrwx 1 root root 10 Nov 10 06:31
 89594989-90cb-4144-ac99-0ffd6a04146e - ../../sde2

 lrwxrwxrwx 1 root root 10 Nov 10 06:31
 c17fe791-5525-4b09-92c4-f90eaaf80dc6 - ../../sda2

 lrwxrwxrwx 1 root root 10 Nov 10 06:31
 c57541a1-6820-44a8-943f-94d68b4b03d4 - ../../sdc2

 lrwxrwxrwx 1 root root 10 Nov 10 06:31
 da7030dd-712e-45e4-8d89-6e795d9f8011 - ../../sdb2



 *After reboot*

 lrwxrwxrwx 1 root root 10 Nov 10 09:50
 11aca3e2-a9d5-4bcc-a5b0-441c53d473b6 - ../../sdd2

 lrwxrwxrwx 1 root root 10 Nov 10 09:50
 89594989-90cb-4144-ac99-0ffd6a04146e - ../../sde2

 lrwxrwxrwx 1 root root 10 Nov 10 09:50
 c17fe791-5525-4b09-92c4-f90eaaf80dc6 - ../../sda2

 lrwxrwxrwx 1 root root 10 Nov 10 09:50
 c57541a1-6820-44a8-943f-94d68b4b03d4 - ../../sdb2

 lrwxrwxrwx 1 root root 10 Nov 10 09:50
 da7030dd-712e-45e4-8d89-6e795d9f8011 - ../../sdh2



 Essentially, the transformation here is sdb2-sdh2 and sdc2- sdb2. In
 fact I haven’t partitioned my sdh at all before the test. The only
 difference probably from the standard procedure is I have pre-created the
 partitions for the journal and data, with parted.



 /lib/udev/rules.d  osd rules has four different partition GUID codes,

 45b0969e-9b03-4f30-b4c6-5ec00ceff106,

 45b0969e-9b03-4f30-b4c6-b4b80ceff106,

 4fbd7e29-9d25-41b8-afd0-062c0ceff05d,

 4fbd7e29-9d25-41b8-afd0-5ec00ceff05d,



 But all my partitions journal/data are having
 ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 as partition guid code.



 Appreciate any help.



 Regards,



 Rama

 =

 -Original Message-
 From: Gregory Farnum [mailto:g...@gregs42.com]
 Sent: Sunday, November 09, 2014 3:36 PM
 To: Ramakrishna Nishtala (rnishtal)
 Cc: ceph-us...@ceph.com
 Subject: Re: [ceph-users] osds fails to start with mismatch in id



 On Sun, Nov 9, 2014 at 3:21 PM, Ramakrishna Nishtala (rnishtal) 
 rnish...@cisco.com wrote:

  Hi

 

  I am on ceph 0.87, RHEL 7

 

  Out of 60 few osd’s start and the rest complain about mismatch about

  id’s as below.

 

 

 

  2014-11-09 07:09:55.501177 7f4633e01880 -1 OSD id 56 != my id 53

 

  2014-11-09 07:09:55.810048 7f636edf4880 -1 OSD id 57 != my id 54

 

  2014-11-09 07:09:56.122957 7f459a766880 -1 OSD id 58 != my id 55

 

  2014-11-09 07:09:56.429771 7f87f8e0c880 -1 OSD id 0 != my id 56

 

  2014-11-09 07:09:56.741329 7fadd9b91880 -1 OSD id 2 != my id 57

 

 

 

  Found one OSD ID in /var/lib/ceph/cluster-id/keyring. To check this

  out manually corrected it and turned authentication to none too, but

  did not help.

 

 

 

  Any clues, how it can be corrected?



 It sounds like maybe the symlinks to data and journal aren't matching up
 with where they're supposed to be. This is usually a result of using
 unstable /dev links that don't always match to the same physical disks.
 Have you checked that?

 -Greg
  ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osds fails to start with mismatch in id

2014-11-10 Thread Daniel Schwager
Hi Ramakrishna,

we use the physical path (containing the serial number) to a disk to prevent 
complexity and wrong mappings. This path will never change:
/etc/ceph/ceph.conf
[osd.16]
devs = /dev/disk/by-id/scsi-SATA_ST4000NM0033-9Z_Z1Z0SDCY-part1
osd_journal = 
/dev/disk/by-id/scsi-SATA_INTEL_SSDSC2BA1BTTV330609AU100FGN-part1
...
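
For reference, a quick way to look up the stable by-id name for a given kernel 
device (the device name here is just an example):

ls -l /dev/disk/by-id/ | grep -w sdb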

regards
Danny



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Irek 
Fasikhov
Sent: Tuesday, November 11, 2014 6:36 AM
To: Ramakrishna Nishtala (rnishtal); Gregory Farnum
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] osds fails to start with mismatch in id

Hi, Ramakrishna.
I think you understand what the problem is:
[ceph@ceph05 ~]$ cat /var/lib/ceph/osd/ceph-56/whoami
56
[ceph@ceph05 ~]$ cat /var/lib/ceph/osd/ceph-57/whoami
57


Tue Nov 11 2014 at 6:01:40, Ramakrishna Nishtala (rnishtal) 
rnish...@cisco.commailto:rnish...@cisco.com:

Hi Greg,

Thanks for the pointer. I think you are right. The full story is like this.



After installation, everything works fine until I reboot. I do observe udevadm 
getting triggered in logs, but the devices do not come up after reboot. Exact 
issue as http://tracker.ceph.com/issues/5194. But this has been fixed a while 
back per the case details.

As a workaround, I copied the contents from /proc/mounts to fstab and that’s 
where I landed into the issue.



After your suggestion, defined as UUID in fstab, but similar problem.

blkid.tab now moved to tmpfs and also isn’t consistent ever after issuing blkid 
explicitly to get the UUID’s. Goes in line with ceph-disk comments.



Decided to reinstall, dd the partitions, zapdisks etc. Did not help. Very weird 
that links below change in /dev/disk/by-uuid and /dev/disk/by-partuuid etc.



Before reboot

lrwxrwxrwx 1 root root 10 Nov 10 06:31 11aca3e2-a9d5-4bcc-a5b0-441c53d473b6 - 
../../sdd2

lrwxrwxrwx 1 root root 10 Nov 10 06:31 89594989-90cb-4144-ac99-0ffd6a04146e - 
../../sde2

lrwxrwxrwx 1 root root 10 Nov 10 06:31 c17fe791-5525-4b09-92c4-f90eaaf80dc6 - 
../../sda2

lrwxrwxrwx 1 root root 10 Nov 10 06:31 c57541a1-6820-44a8-943f-94d68b4b03d4 - 
../../sdc2

lrwxrwxrwx 1 root root 10 Nov 10 06:31 da7030dd-712e-45e4-8d89-6e795d9f8011 - 
../../sdb2



After reboot

lrwxrwxrwx 1 root root 10 Nov 10 09:50 11aca3e2-a9d5-4bcc-a5b0-441c53d473b6 - 
../../sdd2

lrwxrwxrwx 1 root root 10 Nov 10 09:50 89594989-90cb-4144-ac99-0ffd6a04146e - 
../../sde2

lrwxrwxrwx 1 root root 10 Nov 10 09:50 c17fe791-5525-4b09-92c4-f90eaaf80dc6 - 
../../sda2

lrwxrwxrwx 1 root root 10 Nov 10 09:50 c57541a1-6820-44a8-943f-94d68b4b03d4 - 
../../sdb2

lrwxrwxrwx 1 root root 10 Nov 10 09:50 da7030dd-712e-45e4-8d89-6e795d9f8011 - 
../../sdh2



Essentially, the transformation here is sdb2-sdh2 and sdc2- sdb2. In fact I 
haven’t partitioned my sdh at all before the test. The only difference probably 
from the standard procedure is I have pre-created the partitions for the 
journal and data, with parted.



/lib/udev/rules.d  osd rules has four different partition GUID codes,

45b0969e-9b03-4f30-b4c6-5ec00ceff106,

45b0969e-9b03-4f30-b4c6-b4b80ceff106,

4fbd7e29-9d25-41b8-afd0-062c0ceff05d,

4fbd7e29-9d25-41b8-afd0-5ec00ceff05d,



But all my partitions journal/data are having 
ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 as partition guid code.



Appreciate any help.



Regards,



Rama

=

-Original Message-
From: Gregory Farnum [mailto:g...@gregs42.commailto:g...@gregs42.com]
Sent: Sunday, November 09, 2014 3:36 PM
To: Ramakrishna Nishtala (rnishtal)
Cc: ceph-us...@ceph.commailto:ceph-us...@ceph.com
Subject: Re: [ceph-users] osds fails to start with mismatch in id



On Sun, Nov 9, 2014 at 3:21 PM, Ramakrishna Nishtala (rnishtal) 
rnish...@cisco.commailto:rnish...@cisco.com wrote:

 Hi



 I am on ceph 0.87, RHEL 7



 Out of 60 few osd’s start and the rest complain about mismatch about

 id’s as below.







 2014-11-09 07:09:55.501177 7f4633e01880 -1 OSD id 56 != my id 53



 2014-11-09 07:09:55.810048 7f636edf4880 -1 OSD id 57 != my id 54



 2014-11-09 07:09:56.122957 7f459a766880 -1 OSD id 58 != my id 55



 2014-11-09 07:09:56.429771 7f87f8e0c880 -1 OSD id 0 != my id 56



 2014-11-09 07:09:56.741329 7fadd9b91880 -1 OSD id 2 != my id 57







 Found one OSD ID in /var/lib/ceph/cluster-id/keyring. To check this

 out manually corrected it and turned authentication to none too, but

 did not help.







 Any clues, how it can be corrected?



It sounds like maybe the symlinks to data and journal aren't matching up with 
where they're supposed to be. This is usually a result of using unstable /dev 
links that don't always match to the same physical disks. Have you checked that?

-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.commailto:ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] Triggering shallow scrub on OSD where scrub is already in progress

2014-11-10 Thread Gregory Farnum
On Sun, Nov 9, 2014 at 9:29 PM, Mallikarjun Biradar
mallikarjuna.bira...@gmail.com wrote:
 Hi all,

 Triggering shallow scrub on OSD where scrub is already in progress, restarts
 scrub from beginning on that OSD.


 Steps:
 Triggered shallow scrub on an OSD (Cluster is running heavy IO)
 While scrub is in progress, triggered shallow scrub again on that OSD.

 The observed behavior is that the scrub restarted from the beginning on that OSD.

 Please let me know whether this is expected behaviour.

What version of Ceph are you seeing this on? How are you identifying
that scrub is restarting from the beginning? It sounds sort of
familiar to me, but I thought this was fixed so it was a no-op if you
issue another scrub. (That's not authoritative though; I might just be
missing a reason we want to restart it.)
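
For what it's worth, one way to check would be something like this (the osd id is
an example, and the exact log wording varies by version):

ceph osd scrub 12          # queue a shallow scrub of every PG on osd.12
sleep 30
ceph osd scrub 12          # issue it again while the first is still running
ceph -w | grep -i scrub    # watch whether the same PGs report scrub starting twice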
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Deep scrub, cache pools, replica 1

2014-11-10 Thread Christian Balzer

Hello,

One of my clusters has become busy enough (I'm looking at you, evil Windows
VMs that I shall banish elsewhere soon) to experience client-noticeable
performance impacts during deep scrub. 
Before this I instructed all OSDs to deep scrub in parallel on Saturday
night and that finished before Sunday morning.
So for now I'll fire them off one by one to reduce the load.
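
Roughly something like this (a sketch; the pacing is arbitrary and it does not
wait for each scrub to actually finish):

for osd in $(ceph osd ls); do
    ceph osd deep-scrub "$osd"
    sleep 1800
done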

Looking forward, that cluster doesn't need more space, so instead of adding
more hosts and OSDs I was thinking of a cache pool.

I suppose that will keep the clients happy while the slow pool gets
scrubbed. 
Has anybody tested cache pools with Firefly and compared the
performance to Giant?

For testing I'm currently playing with a single storage node and 8 SSD-backed
OSDs. 
Now what very much blew my mind is that a pool with a replication of 1
still does quite the impressive read orgy, clearly reading all the data in
the PGs. 
Why? And what is it comparing that data with, the cosmic background
radiation?

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com