Re: [ceph-users] Re: half performance with keyvalue backend in 0.87
Haomai Wang, do you have any progress on this performance issue?

From: Haomai Wang <haomaiw...@gmail.com>
Sent: 2014-10-31 10:05
To: 廖建锋 <de...@f-club.cn>
Cc: ceph-users <ceph-users-boun...@lists.ceph.com>; ceph-users <ceph-users@lists.ceph.com>
Subject: Re: Re: [ceph-users] Re: half performance with keyvalue backend in 0.87

OK, I will explore it.

On Fri, Oct 31, 2014 at 10:03 AM, 廖建锋 <de...@f-club.cn> wrote:
I am not sure if it is sequential or random; I just use rsync to copy millions of small picture files from our PC server to the Ceph cluster.

From: Haomai Wang
Sent: 2014-10-31 09:59
To: 廖建锋
Cc: ceph-users; ceph-users
Subject: Re: Re: [ceph-users] Re: half performance with keyvalue backend in 0.87

Thanks. Recently I have mainly focused on rbd performance for it (random small writes). I want to know your test situation. Is it sequential write?

On Fri, Oct 31, 2014 at 9:48 AM, 廖建锋 <de...@f-club.cn> wrote:
What I can tell is: in 0.87 the OSDs write under 10 MB/s, but I/O utilization is about 95%; in 0.80.6 the OSDs write about 20 MB/s, but I/O utilization is about 30%.

iostat -mx 2 with 0.87:

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 43.00 9.00 85.50 0.95 1.18 46.14 1.36 14.49 10.01 94.55
sdc 0.00 37.50 6.00 99.00 0.62 10.01 207.31 2.24 21.31 9.33 97.95
sda 0.00 3.50 0.00 1.00 0.00 0.02 36.00 0.02 17.50 17.50 1.75

avg-cpu: %user %nice %system %iowait %steal %idle
3.16 0.00 1.01 17.45 0.00 78.38

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 36.50 0.00 47.50 0.00 1.09 47.07 0.82 17.17 16.71 79.35
sdc 0.00 25.00 15.00 77.50 1.26 0.65 42.34 1.73 18.72 10.70 99.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

From: Haomai Wang
Sent: 2014-10-31 09:40
To: 廖建锋
Cc: ceph-users; ceph-users
Subject: Re: [ceph-users] Re: half performance with keyvalue backend in 0.87

Yes, there was a persistence problem in 0.80.6 and we fixed it in Giant. But in Giant other performance optimizations have been applied. Could you tell us more about your tests?

On Fri, Oct 31, 2014 at 8:27 AM, 廖建锋 <de...@f-club.cn> wrote:
Another problem I found is that the Ceph OSD directory holds millions of small files, which will cause performance issues:

1008 = # pwd
/var/lib/ceph/osd/ceph-8/current
1007 = # ls | wc -l
21451

From: ceph-users
Sent: 2014-10-31 08:23
To: ceph-users
Subject: [ceph-users] half performance with keyvalue backend in 0.87

Dear Ceph,

I used the keyvalue backend in 0.80.6 and 0.80.7; the average speed when rsyncing millions of small files is 10 MB/second. When I upgraded to 0.87 (Giant), the speed slowed down to 5 MB/second. I don't know why. Is there any tuning option for this? Will the superblock cause this slowdown?

--
Best Regards,
Wheat
Re: [ceph-users] mds isn't working anymore after osd's running full
Hello Greg and John,

Thanks for solving the bug. I will compile the patch, make new rpm packages and test it on the Ceph cluster. I will let you know what the results are.

Kind regards,
Jasper

From: Gregory Farnum [g...@gregs42.com]
Sent: Friday, 7 November 2014 22:42
To: Jasper Siero
CC: ceph-users; John Spray
Subject: Re: [ceph-users] mds isn't working anymore after osd's running full

On Thu, Nov 6, 2014 at 11:49 AM, John Spray <john.sp...@redhat.com> wrote:
This is still an issue on master, so a fix will be coming soon. Follow the ticket for updates: http://tracker.ceph.com/issues/10025

Thanks for finding the bug!

John is off for a vacation, but he pushed a branch wip-10025-firefly; if you install that (similar address to the other one) it should work for you. You'll need to reset and undump again (I presume you still have the journal-as-a-file). I'll be merging them into the stable branches pretty shortly as well.
-Greg

John

On Thu, Nov 6, 2014 at 6:21 PM, John Spray <john.sp...@redhat.com> wrote:
Jasper,

Thanks for this -- I've reproduced this issue in a development environment. We'll see if this is also an issue on giant, and backport a fix if appropriate. I'll update this thread soon.

Cheers,
John

On Mon, Nov 3, 2014 at 8:49 AM, Jasper Siero <jasper.si...@target-holding.nl> wrote:
Hello Greg,

I saw that the site behind the previous link to the logs uses a very short expiry time, so I uploaded them to another one:
http://www.mediafire.com/download/gikiy7cqs42cllt/ceph-mds.th1-mon001.log.tar.gz

Thanks,
Jasper

From: gregory.far...@inktank.com [gregory.far...@inktank.com] on behalf of Gregory Farnum [gfar...@redhat.com]
Sent: Thursday, 30 October 2014 1:03
To: Jasper Siero
CC: John Spray; ceph-users
Subject: Re: [ceph-users] mds isn't working anymore after osd's running full

On Wed, Oct 29, 2014 at 7:51 AM, Jasper Siero <jasper.si...@target-holding.nl> wrote:
Hello Greg,

I added the debug options which you mentioned and started the process again:

[root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph --reset-journal 0
old journal was 9483323613~134233517
new journal start will be 9621733376 (4176246 bytes past old end)
writing journal head
writing EResetJournal entry
done
[root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 -c /etc/ceph/ceph.conf --cluster ceph --undump-journal 0 journaldumptgho-mon001
undump journaldumptgho-mon001
start 9483323613 len 134213311
writing header 200.
writing 9483323613~1048576
writing 9484372189~1048576
writing 9485420765~1048576
[... continues in 1048576-byte steps ...]
writing 9565112541~1048576
writing 9566161117~1048576
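For context, the recovery Jasper is walking through here is a dump / reset / undump cycle on the rank-0 MDS journal. A minimal sketch of that sequence, using the paths and rank from this thread, and assuming the older ceph-mds --dump-journal / --reset-journal / --undump-journal flags that predate cephfs-journal-tool (check that your build has them):

    # 1. Dump the damaged journal to a local file (keep this file safe).
    /usr/bin/ceph-mds -i th1-mon001 -c /etc/ceph/ceph.conf --cluster ceph \
        --dump-journal 0 journaldumptgho-mon001

    # 2. Reset the in-RADOS journal so the MDS can start from a clean state.
    /usr/bin/ceph-mds -i th1-mon001 -c /etc/ceph/ceph.conf --cluster ceph \
        --reset-journal 0

    # 3. Write the dumped events back (undump), then start the MDS normally.
    /usr/bin/ceph-mds -i th1-mon001 -c /etc/ceph/ceph.conf --cluster ceph \
        --undump-journal 0 journaldumptgho-mon001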
Re: [ceph-users] Cache Tier Statistics
Hi Jean-Charles,

Thanks for your response. I have found the following using "ceph daemon osd.{id} perf dump":

tier_promote: 1425, tier_flush: 0, tier_flush_fail: 0, tier_try_flush: 216, tier_try_flush_fail: 21, tier_evict: 1413, tier_whiteout: 201, tier_dirty: 671, tier_clean: 216, tier_delay: 16

I'm guessing tier_promote should increase every time there is a cache miss? If that is the case then I simply need to sum this value across every OSD and divide by the total reads to work out the percentage hit rate.

Nick

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jean-Charles Lopez
Sent: 09 November 2014 01:43
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cache Tier Statistics

Hi Nick,

If my brain doesn't fail me you can try:
ceph daemon osd.{id} perf dump
ceph report (not 100% sure the cache stats are in there)

Rgds
JC

On Saturday, November 8, 2014, Nick Fisk <n...@fisk.me.uk> wrote:
Hi,

Does anyone know if there are any statistics available specific to the cache tier functionality? I'm thinking along the lines of cache hit ratios. Or should I be pulling out the read statistics for the backing and cache pools and assuming that if a read happens from the backing pool it was a miss, and then calculating it from that?

Thanks,
Nick

--
Sent while moving
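For anyone wanting to automate the calculation Nick describes, a rough sketch that sums tier_promote across the OSD admin sockets on one host is below. The socket path and the exact JSON layout of the counter are assumptions; check them against your own perf dump output before relying on the numbers.

    #!/bin/bash
    # Sum tier_promote across all OSD admin sockets on this host.
    # Assumes sockets live in /var/run/ceph and that the counter appears
    # as "tier_promote": N in the perf dump JSON.
    total=0
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        p=$(ceph --admin-daemon "$sock" perf dump 2>/dev/null \
            | grep -o '"tier_promote": *[0-9]*' | grep -o '[0-9]*$')
        total=$((total + ${p:-0}))
    done
    echo "tier_promote total on this host: $total"

Run it on each cache-tier host and divide the grand total by the total client reads over the same interval to approximate the hit rate.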
Re: [ceph-users] Re: half performance with keyvalue backend in 0.87
Yep, be patient. Need more time.

On Mon, Nov 10, 2014 at 9:33 AM, 廖建锋 <de...@f-club.cn> wrote:
Haomai Wang, do you have any progress on this performance issue?

From: Haomai Wang
Sent: 2014-10-31 10:05
To: 廖建锋
Cc: ceph-users; ceph-users
Subject: Re: Re: [ceph-users] Re: half performance with keyvalue backend in 0.87

OK, I will explore it.

On Fri, Oct 31, 2014 at 10:03 AM, 廖建锋 <de...@f-club.cn> wrote:
I am not sure if it is sequential or random; I just use rsync to copy millions of small picture files from our PC server to the Ceph cluster.

From: Haomai Wang
Sent: 2014-10-31 09:59
To: 廖建锋
Cc: ceph-users; ceph-users
Subject: Re: Re: [ceph-users] Re: half performance with keyvalue backend in 0.87

Thanks. Recently I have mainly focused on rbd performance for it (random small writes). I want to know your test situation. Is it sequential write?

On Fri, Oct 31, 2014 at 9:48 AM, 廖建锋 <de...@f-club.cn> wrote:
What I can tell is: in 0.87 the OSDs write under 10 MB/s, but I/O utilization is about 95%; in 0.80.6 the OSDs write about 20 MB/s, but I/O utilization is about 30%.

iostat -mx 2 with 0.87:

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 43.00 9.00 85.50 0.95 1.18 46.14 1.36 14.49 10.01 94.55
sdc 0.00 37.50 6.00 99.00 0.62 10.01 207.31 2.24 21.31 9.33 97.95
sda 0.00 3.50 0.00 1.00 0.00 0.02 36.00 0.02 17.50 17.50 1.75

avg-cpu: %user %nice %system %iowait %steal %idle
3.16 0.00 1.01 17.45 0.00 78.38

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 36.50 0.00 47.50 0.00 1.09 47.07 0.82 17.17 16.71 79.35
sdc 0.00 25.00 15.00 77.50 1.26 0.65 42.34 1.73 18.72 10.70 99.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

From: Haomai Wang
Sent: 2014-10-31 09:40
To: 廖建锋
Cc: ceph-users; ceph-users
Subject: Re: [ceph-users] Re: half performance with keyvalue backend in 0.87

Yes, there was a persistence problem in 0.80.6 and we fixed it in Giant. But in Giant other performance optimizations have been applied. Could you tell us more about your tests?

On Fri, Oct 31, 2014 at 8:27 AM, 廖建锋 <de...@f-club.cn> wrote:
Another problem I found is that the Ceph OSD directory holds millions of small files, which will cause performance issues:

1008 = # pwd
/var/lib/ceph/osd/ceph-8/current
1007 = # ls | wc -l
21451

From: ceph-users
Sent: 2014-10-31 08:23
To: ceph-users
Subject: [ceph-users] half performance with keyvalue backend in 0.87

Dear Ceph,

I used the keyvalue backend in 0.80.6 and 0.80.7; the average speed when rsyncing millions of small files is 10 MB/second. When I upgraded to 0.87 (Giant), the speed slowed down to 5 MB/second. I don't know why. Is there any tuning option for this? Will the superblock cause this slowdown?

--
Best Regards,
Wheat
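Since the thread never settles whether the rsync workload behaves as sequential or random I/O, a controlled small-write load is one way to compare the two releases directly. A sketch using rados bench follows; the pool name is a placeholder and the sizes/durations are only examples:

    # 4 KB object writes, 16 concurrent ops, 60 seconds, against a throwaway pool
    rados bench -p testpool 60 write -b 4096 -t 16 --no-cleanup
    # Sequential reads of the objects just written
    rados bench -p testpool 60 seq -t 16
    # Remove the benchmark objects afterwards
    rados -p testpool cleanup

Running the same bench on the 0.80.x and 0.87 clusters (or before and after an upgrade on a test pool) gives numbers that are easier to compare than rsync throughput.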
Re: [ceph-users] Ceph on RHEL 7 using teuthology
Yes. I see similar package dependency when installing manually. ~Pras On Mon, Nov 10, 2014 at 3:00 PM, Loic Dachary l...@dachary.org wrote: Hi, It looks like there are broken packages on the target machine even before teuthology tries to install new packages. Do you see similar errors when trying to install a package manually ? Cheers On 10/11/2014 09:59, Sarang G wrote: Hi, 1. Created an instance on AWS using AMI: ami-99bef1a9 2. All the settings related to root access to sudo, passwordless ssh was configured. 3. started teuthology with basic yaml configuration: check-locks: false os_type: rhel os_version: '7.0' roles: - - mon.a - mon.b - mon.c - osd.0 - osd.1 - osd.2 - client.0 suite_path: /home/pras/ceph-qa-suite targets: user@hostname: ssh-rsa ssh-key tasks: - install: null - ceph: null - interactive: null Teuthology log attached. ~Pras On Mon, Nov 10, 2014 at 1:03 PM, Loic Dachary l...@dachary.org mailto: l...@dachary.org wrote: [moving the thread to ceph-devel] Hi, It would be useful if you could upload the full log somewhere and provide details about what you did on the machine prior to seeing this error. That would help figure out what is wrong. Cheers On 10/11/2014 07:20, Sarang G wrote: Hi, I am trying to install ceph on RHEL 7 AWS instance using teuthology. I am facing some dependency issues: 2014-11-09 22:10:57,269.269 DEBUG:teuthology.orchestra.run:Running [10.15.17.91]: 'sudo yum install ceph-radosgw-0.87 -y' 2014-11-09 22:10:58,223.223 INFO:teuthology.orchestra.run.out:[10.15.17.91]: Loaded plugins: amazon-id, rhui-lb 2014-11-09 22:10:59,008.008 INFO:teuthology.orchestra.run.out:[10.15.17.91]: Resolving Dependencies 2014-11-09 22:10:59,010.010 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Running transaction check 2014-11-09 22:10:59,010.010 INFO:teuthology.orchestra.run.out:[10.15.17.91]: --- Package ceph-radosgw.x86_64 1:0.87-665.gb8ec7d7.el7 will be installed 2014-11-09 22:10:59,028.028 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency: librados2 = 1:0.87-665.gb8ec7d7.el7 for package: 1:ceph-radosgw-0.87-665.gb8ec7d7.el7.x86_64 2014-11-09 22:10:59,086.086 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency: ceph-common = 1:0.87-665.gb8ec7d7.el7 for package: 1:ceph-radosgw-0.87-665.gb8ec7d7.el7.x86_64 2014-11-09 22:10:59,088.088 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency: librados.so.2()(64bit) for package: 1:ceph-radosgw-0.87-665.gb8ec7d7.el7.x86_64 2014-11-09 22:10:59,089.089 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency: libfcgi.so.0()(64bit) for package: 1:ceph-radosgw-0.87-665.gb8ec7d7.el7.x86_64 2014-11-09 22:10:59,089.089 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Running transaction check 2014-11-09 22:10:59,090.090 INFO:teuthology.orchestra.run.out:[10.15.17.91]: --- Package ceph-common.x86_64 1:0.87-665.gb8ec7d7.el7 will be installed 2014-11-09 22:10:59,103.103 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency: python-ceph = 1:0.87-665.gb8ec7d7.el7 for package: 1:ceph-common-0.87-665.gb8ec7d7.el7.x86_64 2014-11-09 22:10:59,255.255 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency: librbd1 = 1:0.87-665.gb8ec7d7.el7 for package: 1:ceph-common-0.87-665.gb8ec7d7.el7.x86_64 2014-11-09 22:10:59,255.255 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency: libtcmalloc.so.4()(64bit) for package: 1:ceph-common-0.87-665.gb8ec7d7.el7.x86_64 2014-11-09 22:10:59,255.255 
INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency: librbd.so.1()(64bit) for package: 1:ceph-common-0.87-665.gb8ec7d7.el7.x86_64 2014-11-09 22:10:59,255.255 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency: libboost_thread-mt.so.1.53.0()(64bit) for package: 1:ceph-common-0.87-665.gb8ec7d7.el7.x86_64 2014-11-09 22:10:59,256.256 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency: libboost_system-mt.so.1.53.0()(64bit) for package: 1:ceph-common-0.87-665.gb8ec7d7.el7.x86_64 2014-11-09 22:10:59,256.256 INFO:teuthology.orchestra.run.out:[10.15.17.91]: --- Package ceph-radosgw.x86_64 1:0.87-665.gb8ec7d7.el7 will be installed 2014-11-09 22:10:59,256.256 INFO:teuthology.orchestra.run.out:[10.15.17.91]: -- Processing Dependency: libfcgi.so.0()(64bit) for package: 1:ceph-radosgw-0.87-665.gb8ec7d7.el7.x86_64 2014-11-09 22:10:59,256.256 INFO:teuthology.orchestra.run.out:[10.15.17.91]: --- Package librados2.x86_64 1:0.87-665.gb8ec7d7.el7 will be installed 2014-11-09 22:10:59,256.256
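A guess at the cause rather than a confirmed fix: the libraries yum cannot resolve in this log (libfcgi.so.0, libtcmalloc.so.4) are normally satisfied from EPEL rather than from the Ceph repository itself, so it may be worth enabling EPEL on the target and retrying the same install by hand before re-running teuthology:

    # Enable EPEL, which provides fcgi (libfcgi), gperftools-libs (libtcmalloc), leveldb, etc.
    sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    # Then retry the install that teuthology attempts
    sudo yum install -y ceph-radosgw-0.87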
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
Hi Craig and list,

"If you create a real osd.20, you might want to leave it OUT until you get things healthy again."

I created a real osd.20 (and it turns out I needed an osd.21 also). "ceph pg x.xx query" no longer lists down osds for probing:

down_osds_we_would_probe: [],

But I cannot find the magic command line which will remove these incomplete PGs.

Anyone know how to remove incomplete PGs?

Thanks!
Chad.
Re: [ceph-users] Typical 10GbE latency
On 08-11-14 02:42, Gary M wrote:
Wido,

Take the switch out of the path between nodes and remeasure. ICMP echo requests are very low priority traffic for switches and network stacks.

I tried with a direct TwinAx and fiber cable. No difference.

If you really want to know, place a network analyzer between the nodes to measure the request-packet to response-packet latency. The ICMP timing reported by the ping application is not accurate in the sub-millisecond range and should only be used as a rough estimate.

True, I fully agree with you. But why is everybody showing a lower latency here? My latencies are about 40% higher than what I see in this setup and other setups.

You also may want to install the high resolution timer patch, sometimes called HRT, to the kernel, which may give you different results. ICMP traffic takes a different path than the TCP traffic and should not be considered an indicator of defect.

Yes, I'm aware. But it still doesn't explain why the latency on other systems, which are in production, is lower than on this idle system.

I believe the ping app calls the sendto system call (sorry, it's been a while since I last looked). System calls can take between .1us and .2us each. However, the ping application makes several of these calls and waits for a signal from the kernel. The wait for a signal means the ping application must wait to be rescheduled to report the time. Rescheduling will depend on a lot of other factors in the OS, e.g. timers, card interrupts, other tasks with higher priorities. Reporting the time must add a few more system calls for this to happen. Then the ping application loops to post the next ping request, which again requires a few system calls and may cause a task switch while in each system call.

For the above factors, the ping application is not a good representation of network performance, due to factors in the application and the traffic shaping performed at the switch and in the TCP stacks.

I think that netperf is probably a better tool, but that also measures TCP latencies. I want the real IP latency, so I assumed that ICMP would be the simplest option. The other setups I have access to are in production and do not have any special tuning, yet their latency is still lower than on this new deployment. That's what gets me confused.

Wido

cheers,
gary

On Fri, Nov 7, 2014 at 4:32 PM, Łukasz Jagiełło <jagiello.luk...@gmail.com> wrote:
Hi,

rtt min/avg/max/mdev = 0.070/0.177/0.272/0.049 ms

04:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)

at both hosts, and an Arista 7050S-64 between them. Both hosts were part of an active ceph cluster.

On Thu, Nov 6, 2014 at 5:18 AM, Wido den Hollander <w...@42on.com> wrote:
Hello,

While working at a customer I've run into a 10GbE latency which seems high to me. I have access to a couple of Ceph clusters and I ran a simple ping test:

$ ping -s 8192 -c 100 -n ip

Two results I got:

rtt min/avg/max/mdev = 0.080/0.131/0.235/0.039 ms
rtt min/avg/max/mdev = 0.128/0.168/0.226/0.023 ms

Both these environments are running with Intel 82599ES 10Gbit cards in LACP, one with Extreme Networks switches, the other with Arista.

Now, on an environment with Cisco Nexus 3000 and Nexus 7000 switches I'm seeing:

rtt min/avg/max/mdev = 0.160/0.244/0.298/0.029 ms

As you can see, the Cisco Nexus network has high latency compared to the other setups. You would say the switches are to blame, but we also tried with a direct TwinAx connection, and that didn't help.
This setup also uses the Intel 82599ES cards, so the cards don't seem to be the problem. The MTU is set to 9000 on all these networks and cards.

I was wondering, others with a Ceph cluster running on 10GbE, could you perform a simple network latency test like this? I'd like to compare the results.

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant
Phone: +31 (0)20 700 9902
Skype: contact42on

--
Łukasz Jagiełło
lukaszatjagiellodotorg
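For measuring raw round-trip latency without ping's scheduling noise, qperf or netperf in request/response mode are common alternatives. A sketch follows; both tools must be installed on the two nodes, the server-side daemons are started by hand, and <server-ip> is a placeholder:

    # qperf: run the bare server on one node...
    qperf
    # ...then from the other node measure TCP and UDP round-trip latency
    qperf <server-ip> tcp_lat udp_lat

    # netperf: 1-byte request/response over TCP; the reported transaction
    # rate is roughly the inverse of the round-trip time
    netserver                                   # on the server node
    netperf -H <server-ip> -t TCP_RR -- -r 1,1  # on the client node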
Re: [ceph-users] mds isn't working anymore after osd's running full
Hello John and Greg,

I used the new patch and now the undump succeeded, the mds is working fine and I can mount cephfs again!

I still have one placement group which keeps deep scrubbing, even after restarting the ceph cluster:

dumped all in format plain
3.300 0 0 0 0 0 0 active+clean+scrubbing+deep 2014-11-10 17:21:15.866965 0'0 2414:418 [1,9] 1 [1,9] 1 631'3463 2014-08-21 15:14:45.430926 602'3131 2014-08-18 15:14:37.494913

Is there a way to solve this?

Kind regards,
Jasper

From: Gregory Farnum [g...@gregs42.com]
Sent: Friday, 7 November 2014 22:42
To: Jasper Siero
CC: ceph-users; John Spray
Subject: Re: [ceph-users] mds isn't working anymore after osd's running full

On Thu, Nov 6, 2014 at 11:49 AM, John Spray <john.sp...@redhat.com> wrote:
This is still an issue on master, so a fix will be coming soon. Follow the ticket for updates: http://tracker.ceph.com/issues/10025

Thanks for finding the bug!

John is off for a vacation, but he pushed a branch wip-10025-firefly; if you install that (similar address to the other one) it should work for you. You'll need to reset and undump again (I presume you still have the journal-as-a-file). I'll be merging them into the stable branches pretty shortly as well.
-Greg

John

On Thu, Nov 6, 2014 at 6:21 PM, John Spray <john.sp...@redhat.com> wrote:
Jasper,

Thanks for this -- I've reproduced this issue in a development environment. We'll see if this is also an issue on giant, and backport a fix if appropriate. I'll update this thread soon.

Cheers,
John

On Mon, Nov 3, 2014 at 8:49 AM, Jasper Siero <jasper.si...@target-holding.nl> wrote:
Hello Greg,

I saw that the site behind the previous link to the logs uses a very short expiry time, so I uploaded them to another one:
http://www.mediafire.com/download/gikiy7cqs42cllt/ceph-mds.th1-mon001.log.tar.gz

Thanks,
Jasper

From: gregory.far...@inktank.com [gregory.far...@inktank.com] on behalf of Gregory Farnum [gfar...@redhat.com]
Sent: Thursday, 30 October 2014 1:03
To: Jasper Siero
CC: John Spray; ceph-users
Subject: Re: [ceph-users] mds isn't working anymore after osd's running full

On Wed, Oct 29, 2014 at 7:51 AM, Jasper Siero <jasper.si...@target-holding.nl> wrote:
Hello Greg,

I added the debug options which you mentioned and started the process again:

[root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph --reset-journal 0
old journal was 9483323613~134233517
new journal start will be 9621733376 (4176246 bytes past old end)
writing journal head
writing EResetJournal entry
done
[root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 -c /etc/ceph/ceph.conf --cluster ceph --undump-journal 0 journaldumptgho-mon001
undump journaldumptgho-mon001
start 9483323613 len 134213311
writing header 200.
writing 9483323613~1048576
writing 9484372189~1048576
writing 9485420765~1048576
[... continues in 1048576-byte steps ...]
writing 9550432477~1048576
writing 9551481053~1048576
Re: [ceph-users] Installing CephFs via puppet
- Original Message -
From: JIten Shah <jshah2...@me.com>
To: Jean-Charles LOPEZ <jc.lo...@inktank.com>
Cc: ceph-users <ceph-us...@ceph.com>
Sent: Friday, November 7, 2014 7:18:10 PM
Subject: Re: [ceph-users] Installing CephFs via puppet

Thanks JC and Loic, but we HAVE to use puppet. That's how all of our configuration and deployment stuff works and I can't sway away from it.

Is https://github.com/enovance/puppet-ceph a good resource for cephFS? Has anyone used it successfully?

Hi,

This module currently doesn't provide any means to deploy CephFS.

--
François Charlier
Software Engineer // eNovance SAS
http://www.enovance.com/ // ✉ francois.charl...@enovance.com ☎ +33 1 49 70 99 81
Re: [ceph-users] mds isn't working anymore after osd's running full
It's supposed to do that; deep scrubbing is an ongoing consistency-check mechanism. If you really want to disable it you can set an osdmap flag to prevent it, but you'll have to check the docs for exactly what that is as I can't recall.

Glad things are working for you; sorry it took so long!
-Greg

On Mon, Nov 10, 2014 at 8:49 AM, Jasper Siero <jasper.si...@target-holding.nl> wrote:
Hello John and Greg,

I used the new patch and now the undump succeeded, the mds is working fine and I can mount cephfs again!

I still have one placement group which keeps deep scrubbing, even after restarting the ceph cluster:

dumped all in format plain
3.300 0 0 0 0 0 0 active+clean+scrubbing+deep 2014-11-10 17:21:15.866965 0'0 2414:418 [1,9] 1 [1,9] 1 631'3463 2014-08-21 15:14:45.430926 602'3131 2014-08-18 15:14:37.494913

Is there a way to solve this?

Kind regards,
Jasper

From: Gregory Farnum [g...@gregs42.com]
Sent: Friday, 7 November 2014 22:42
To: Jasper Siero
CC: ceph-users; John Spray
Subject: Re: [ceph-users] mds isn't working anymore after osd's running full

On Thu, Nov 6, 2014 at 11:49 AM, John Spray <john.sp...@redhat.com> wrote:
This is still an issue on master, so a fix will be coming soon. Follow the ticket for updates: http://tracker.ceph.com/issues/10025

Thanks for finding the bug!

John is off for a vacation, but he pushed a branch wip-10025-firefly; if you install that (similar address to the other one) it should work for you. You'll need to reset and undump again (I presume you still have the journal-as-a-file). I'll be merging them into the stable branches pretty shortly as well.
-Greg

John

On Thu, Nov 6, 2014 at 6:21 PM, John Spray <john.sp...@redhat.com> wrote:
Jasper,

Thanks for this -- I've reproduced this issue in a development environment. We'll see if this is also an issue on giant, and backport a fix if appropriate. I'll update this thread soon.

Cheers,
John

On Mon, Nov 3, 2014 at 8:49 AM, Jasper Siero <jasper.si...@target-holding.nl> wrote:
Hello Greg,

I saw that the site behind the previous link to the logs uses a very short expiry time, so I uploaded them to another one:
http://www.mediafire.com/download/gikiy7cqs42cllt/ceph-mds.th1-mon001.log.tar.gz

Thanks,
Jasper

From: gregory.far...@inktank.com [gregory.far...@inktank.com] on behalf of Gregory Farnum [gfar...@redhat.com]
Sent: Thursday, 30 October 2014 1:03
To: Jasper Siero
CC: John Spray; ceph-users
Subject: Re: [ceph-users] mds isn't working anymore after osd's running full

On Wed, Oct 29, 2014 at 7:51 AM, Jasper Siero <jasper.si...@target-holding.nl> wrote:
Hello Greg,

I added the debug options which you mentioned and started the process again:

[root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph --reset-journal 0
old journal was 9483323613~134233517
new journal start will be 9621733376 (4176246 bytes past old end)
writing journal head
writing EResetJournal entry
done
[root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 -c /etc/ceph/ceph.conf --cluster ceph --undump-journal 0 journaldumptgho-mon001
undump journaldumptgho-mon001
start 9483323613 len 134213311
writing header 200.
writing 9483323613~1048576
writing 9484372189~1048576
writing 9485420765~1048576
[... continues in 1048576-byte steps ...]
writing 9535752413~1048576
writing 9536800989~1048576
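The osdmap flags Greg refers to above are presumably the scrub flags; a minimal sketch (these disable scrubbing cluster-wide, so remember to unset them once you are done investigating):

    ceph osd set nodeep-scrub    # stop scheduling new deep scrubs
    ceph osd set noscrub         # optionally stop regular scrubs too
    # ...and to re-enable later:
    ceph osd unset nodeep-scrub
    ceph osd unset noscrub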
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
If all of your PGs now have an empty down_osds_we_would_probe, I'd run through this discussion again. The commands to tell Ceph to give up on lost data should have an effect now. That's my experience anyway. Nothing progressed until I took care of down_osds_we_would_probe. After that was empty, I was able to repair. It wasn't immediate though. It still took ~24 hours, and a few OSD restarts, for the cluster to get itself healthy. You might try sequentially restarting OSDs. It shouldn't be necessary, but it shouldn't make anything worse. On Mon, Nov 10, 2014 at 7:17 AM, Chad Seys cws...@physics.wisc.edu wrote: Hi Craig and list, If you create a real osd.20, you might want to leave it OUT until you get things healthy again. I created a real osd.20 (and it turns out I needed an osd.21 also). ceph pg x.xx query no longer lists down osds for probing: down_osds_we_would_probe: [], But I cannot find the magic command line which will remove these incomplete PGs. Anyone know how to remove incomplete PGs ? Thanks! Chad. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
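A quick way to confirm that every stuck PG really does have an empty down_osds_we_would_probe before retrying the lost/force commands; a sketch, where the awk pattern just grabs the PG ids from the first column of dump_stuck output (adjust as needed):

    for pg in $(ceph pg dump_stuck inactive 2>/dev/null | awk '$1 ~ /^[0-9]+\./ {print $1}'); do
        echo "== $pg =="
        ceph pg "$pg" query | grep -A1 down_osds_we_would_probe
    done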
Re: [ceph-users] Pg's stuck in inactive/unclean state + Association from PG-OSD does not seem to be happening.
Folks, Now, we are running into an issue where the PG's(192) are stuck in creating state forever. I have experimented with various PG settings(osd_pool_default_pg_num from 50 to 400) for replicas and default and doesn't seem to help so far. Just to give you a brief overview, I have 8 osd's. I see the create_pg is pending messages in ceph monitor logs. I have attached the following logs in the zip file. 1) crush map(crush.map) 2) ceph osd tree, (OSD_TREE.txt OSD's 1,2,3,4 belong to host octeon and OSD's 0,5,6,7 belong to host octeon1). 3) ceph pg dump, health details etcetc(dump_pgs, health_detail) 4) Attached the ceph.conf 5) ceph osd lspools. 0 data,1 metadata,2 rbd, Here is the dump for ceph -w before any osd's were created: ceph -w cluster 3eda0199-93a9-428b-8209-caeff84d3d3f health HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds monmap e1: 1 mons at {essperf13=209.243.160.45:6789/0}, election epoch 1, quorum 0 essperf13 osdmap e205: 0 osds: 0 up, 0 in pgmap v928: 192 pgs, 3 pools, 0 bytes data, 0 objects 0 kB used, 0 kB / 0 kB avail 192 creating 2014-11-05 23:26:46.555348 mon.0 [INF] pgmap v928: 192 pgs: 192 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail Here is the dump for ceph -w after 8 osd's were created: ceph -w cluster 3eda0199-93a9-428b-8209-caeff84d3d3f health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean monmap e1: 1 mons at {essperf13=209.243.160.45:6789/0}, election epoch 1, quorum 0 essperf13 osdmap e213: 8 osds: 8 up, 8 in pgmap v958: 192 pgs, 3 pools, 0 bytes data, 0 objects 328 MB used, 14856 GB / 14856 GB avail 192 creating 2014-11-05 23:46:25.461143 mon.0 [INF] pgmap v958: 192 pgs: 192 creating; 0 bytes data, 328 MB used, 14856 GB / 14856 GB avail Any pointers to resolve this issue will be helpful. Thanks Prashanth -Original Message- From: Prashanth Nednoor Sent: Tuesday, October 28, 2014 9:26 PM To: 'Sage Weil' Cc: Philip Kufeldt; ceph-de...@vger.kernel.org Subject: RE: cephx auth issues:Having issues trying to get the OSD up on a MIPS64, when the OSD tries to communicate with the monitor!!! Sage, As requested I set the debug setting in ceph.conf on both the sides. Here are the logs for the OSD and MONITOR attached. 1) OSD : IPADDRESS: 209.243.157.187. Logfile attached is: Ceph-0.log 2) MONITOR: IP ADDRESS: 209.243.160.45, Logfile attached is: Ceph-mon.essperf13.log Please Note that AUTHENTICATION IS DISABLED IN THE /etc/ceph/ceph.conf files on both OSD and monitor. In addition to this on the OSD side I by-passed part of the authentication code that was causing trouble(monc-authenticate) in osd_init function call. I hope this is ok. Good news is my osd daemon is up now on the MIPS side, finally, but for some reason MONITOR is still not detecting the OSD. It seems from the ceph mon log, it knows the OSD is at 187 and it does exchange some information. Thanks for your prompt response and help. Thanks Prashanth -Original Message- From: Sage Weil [mailto:s...@newdream.net] Sent: Tuesday, October 28, 2014 4:59 PM To: Prashanth Nednoor Cc: Philip Kufeldt; ceph-de...@vger.kernel.org Subject: Re: cephx auth issues:Having issues trying to get the OSD up on a MIPS64, when the OSD tries to communicate with the monitor!!! Hi, On Tue, 28 Oct 2014, Prashanth Nednoor wrote: Folks, I am trying to get the osd up and having an issue. OSD does exchange some messages with the MONITOR before this error. Seems like an issue with authentication in my set up with MIPS based OSD and Intel XEON MONITORS. I have attached the logs. 
The OSD(209.243.157.187) sends some request to MONITOR (209.243.160.45). I see this message No session security set, followed by the below message. The reply is coming back as auth_reply(proto 2 -1 (1) Operation not permitted. Is there an ENDIAN issue here between MIPS based OSD(BIGEENDIAN) and INTEL XEONS(LITTLE ENDIAN), my CEPH-MOINTORS are INTEL XEONS??? I made sure the keyrings are all consistent. Here are the keys on OSD and MONITOR. I tried disabling authentication by setting the following auth_service_required = none, auth_client_required = none and auth_cluster_required = none. Looks there was some issue with this in osd_init code, where it seems like AUTHENTICATION IS MANDATORY. HERE IS THE INFORMATION ON MY KEYS ON OSD AND MONITOR. ON THE OSD: more /etc/ceph/ceph.client.admin.keyring [osd.0] key = AQCddYJv4JkxIhAApeqP7Ahp+uUXYrgmgQt+LA== [client.admin] key = AQA1jixUQAaWABAA1tAjhIbrmOCIqNAkeNVulQ== more /var/lib/ceph/bootstrap-osd/ceph.keyring [client.bootstrap-osd] key = AQA1jixUwGjoGxAASUUlYC2rGfH7Zl4rCfCylA== ON THE MONITOR: more /etc/ceph/ceph.client.admin.keyring [client.admin] key = AQA1jixUQAaWABAA1tAjhIbrmOCIqNAkeNVulQ== more /var/lib/ceph/bootstrap-osd/ceph.keyring
Re: [ceph-users] PG inconsistency
For #1, it depends what you mean by fast. I wouldn't worry about it taking 15 minutes. If you mark the old OSD out, ceph will start remapping data immediately, including a bunch of PGs on unrelated OSDs. Once you replace the disk, and put the same OSDID back in the same host, the CRUSH map will be back to what it was before you started. All of those remaps on unrelated OSDs will reverse. They'll complete fairly quickly, because they only have to backfill the data that was written during the remap. I prefer #1. ceph pg repair will just overwrite the replicas with whatever the primary OSD has, which may copy bad data from your bad OSD over good replicas. So #2 has the potential to corrupt the data. #1 will delete the data you know is bad, leaving only good data behind to replicate. Once ceph pg repair gets more intelligent, I'll revisit this. I also prefer the simplicity. If it's dead or corrupt, they're treated the same. On Sun, Nov 9, 2014 at 7:25 PM, GuangYang yguan...@outlook.com wrote: In terms of disk replacement, to avoid migrating data back and forth, are the below two approaches reasonable? 1. Keep the OSD in and do an ad-hoc disk replacement and provision a new OSD (so that keep the OSD id as the same), and then trigger data migration. In this way the data migration only happens once, however, it does require operators to replace the disk very fast. 2. Move the data on the broken disk to a new disk completely and use Ceph to repair bad objects. Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD commits suicide
Have you tuned any of the recovery or backfill parameters? My ceph.conf has: [osd] osd max backfills = 1 osd recovery max active = 1 osd recovery op priority = 1 Still, if it's running for a few hours, then failing, it sounds like there might be something else at play. OSDs use a lot of RAM during recovery. How much RAM and how many OSDs do you have in these nodes? What does memory usage look like after a fresh restart, and what does it look like when the problems start? Even better if you know what it looks like 5 minutes before the problems start. Is there anything interesting in the kernel logs? OOM killers, or memory deadlocks? On Sat, Nov 8, 2014 at 11:19 AM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, I have some OSD's that keep committing suicide. My cluster has ~1.3M misplaced objects, and it can't really recover, because OSD's keep failing before recovering finishes. The load on the hosts is quite high, but the cluster currently has no other tasks than just the backfilling/recovering. I attached the logfile from a failed OSD. It shows the suicide, the recent events and also me starting the OSD again after some time. It'll keep running for a couple of hours and then fail again, for the same reason. I noticed a lot of timeouts. Apparently ceph stresses the hosts to the limit with the recovery tasks, so much that they timeout and can't finish that task. I don't understand why. Can I somehow throttle ceph a bit so that it doesn't keep overrunning itself? I kinda feel like it should chill out a bit and simply recover one step at a time instead of full force and then fail. Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
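If restarting the OSDs to pick up those ceph.conf settings is inconvenient mid-recovery, the same throttles can be injected at runtime; a sketch with the values mirroring the snippet above (tune to taste):

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'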
Re: [ceph-users] An OSD always crash few minutes after start
You're running 0.87-6. There were various fixes for this problem in Firefly. Were any of these snapshots created on an early version of Firefly?

So far, every fix for this issue has gotten developers involved. I'd see if you can talk to some devs on IRC, or post to the ceph-devel mailing list. My own experience is that I had to delete the affected PGs and force create them. Hopefully there's a better answer now.

On Fri, Nov 7, 2014 at 8:10 PM, Chu Duc Minh <chu.ducm...@gmail.com> wrote:
One of my OSDs has problems and can NOT be started. I tried to start it many times but it always crashes a few minutes after starting.

I can think of two reasons that would make it crash:
1. A read/write request to this OSD that makes it crash due to the corrupted volume/snapshot/parent-image/...
2. The recovery process can NOT work properly due to the corrupted volumes/snapshot/parent-image/...

After many retries and checking the logs, I guess reason (2) is the main cause. Because if (1) were the main cause, other OSDs (containing the buggy volume/snapshot) would crash too.

State of my ceph cluster (just a few seconds before crash time):
111/57706299 objects degraded (0.001%)
14918 active+clean
1 active+clean+scrubbing+deep
52 active+recovery_wait+degraded
2 active+recovering+degraded

PS: I attach the crash-dump log of that OSD in this email for your information. Thank you!
Re: [ceph-users] Stuck in stale state
nothing to send, going to standby isn't necessarily bad, I see it from time to time. It shouldn't stay like that for long though. If it's been 5 minutes, and the cluster still isn't doing anything, I'd restart that osd. On Fri, Nov 7, 2014 at 1:55 PM, Jan Pekař jan.pe...@imatic.cz wrote: Hi, I was testing ceph cluster map changes and I got to stuck state which seems to be indefinite. First my description what I have done. I'm testing special case with only one copy of pg's (pool size = 1). All pg's was on one osd.0. I created second osd.1 and modified cluster map to transfer one pool (metadata) to the newly created osd.1 PG's started to remap and objects degraded number was dropping - so everything looked normal. During that recovery process I restarted both osd daemons. After that I noticed, that pg's, that should be remapped had stale state - stale+active+remapped+backfilling and other object with stale state . I tried to run ceph pg force_create_pg on one pg, that should be remapped, but nothing changed (that is 1 stuck / creating PG below in ceph health) Command rados -p metadata ls hangs so data are unavailable, but it should be there. What should I do in this state to get it working? ceph -s below: cluster 93418692-8e2e-4689-a237-ed5b47f39f72 health HEALTH_WARN 52 pgs backfill; 1 pgs backfilling; 63 pgs stale; 1 pgs stuck inactive; 63 pgs stuck stale; 54 pgs stuck unclean; recovery 107232/1881806 objects degraded (5.698%); mon.imatic-mce low disk space monmap e1: 1 mons at {imatic-mce=192.168.11.165:6789/0}, election epoch 1, quorum 0 imatic-mce mdsmap e450: 1/1/1 up {0=imatic-mce=up:active} osdmap e275: 2 osds: 2 up, 2 in pgmap v51624: 448 pgs, 4 pools, 790 GB data, 1732 kobjects 804 GB used, 2915 GB / 3720 GB avail 107232/1881806 objects degraded (5.698%) 52 stale+active+remapped+wait_backfill 1 creating 1 stale+active+remapped+backfilling 10 stale+active+clean 384 active+clean Last message in OSD log's: 2014-11-07 22:17:45.402791 deb4db70 0 -- 192.168.11.165:6804/29564 192.168.11.165:6807/29939 pipe(0x9d52f00 sd=213 :53216 s=2 pgs=1 cs=1 l=0 c=0x2c7f58c0).fault with nothing to send, going to standby Thank you for help With regards Jan Pekar, ceph fan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
Hi Craig,

"If all of your PGs now have an empty down_osds_we_would_probe, I'd run through this discussion again."

Yep, looks to be true. So I ran:

# ceph pg force_create_pg 2.5

and it has been creating for about 3 hours now. :/

# ceph health detail | grep creating
pg 2.5 is stuck inactive since forever, current state creating, last acting []
pg 2.5 is stuck unclean since forever, current state creating, last acting []

Then I restarted all OSDs. The creating label disappears and I'm back with the same number of incomplete PGs. :(

Is 'force_create_pg' the right command? 'mark_unfound_lost' complains that 'pg has no unfound objects'.

I shall start the 'force_create_pg' again and wait longer, unless there is a different command to use?

Thanks!
Chad.
Re: [ceph-users] osd down
Craig,

Thanks for the info. I ended up doing a zap and then a create via ceph-deploy.

One question that I still have is surrounding adding the failed osd back into the pool. In this example, osd.70 was bad; when I added it back in via ceph-deploy, the disk was brought up as osd.108. Only after osd.108 was up and running did I think to remove osd.70 from the crush map etc.

My question is this: had I removed it from the crush map prior to my ceph-deploy create, should/would Ceph have reused the osd number 70? I would prefer to replace a failed disk with a new one and keep the old osd assignment, if possible; that is why I am asking.

Anyway, thanks again for all the help.

Shain

Sent from my iPhone

On Nov 7, 2014, at 2:09 PM, Craig Lewis <cle...@centraldesktop.com> wrote:

I'd stop that osd daemon, and run xfs_check / xfs_repair on that partition.

If you repair anything, you should probably force a deep-scrub on all the PGs on that disk. I think ceph osd deep-scrub osdid will do that, but you might have to manually grep ceph pg dump.

Or you could just treat it like a failed disk, but re-use the disk. ceph-disk-prepare --zap-disk should take care of you.

On Thu, Nov 6, 2014 at 5:06 PM, Shain Miley <smi...@npr.org> wrote:
I tried restarting all the osd's on that node; osd.70 was the only ceph process that did not come back online.

There is nothing in the ceph-osd log for osd.70. However I do see over 13,000 of these messages in the kern.log:

Nov 6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1): xfs_log_force: error 5 returned.

Does anyone have any suggestions on how I might be able to get this HD back in the cluster (or whether or not it is worth even trying)?

Thanks,
Shain

Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649

From: Shain Miley [smi...@npr.org]
Sent: Tuesday, November 04, 2014 3:55 PM
To: ceph-users@lists.ceph.com
Subject: osd down

Hello,

We are running ceph version 0.80.5 with 108 osd's.

Today I noticed that one of the osd's is down:

root@hqceph1:/var/log/ceph# ceph -s
cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
health HEALTH_WARN crush map has legacy tunables
monmap e1: 3 mons at {hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0}, election epoch 146, quorum 0,1,2 hqceph1,hqceph2,hqceph3
osdmap e7119: 108 osds: 107 up, 107 in
pgmap v6729985: 3208 pgs, 17 pools, 81193 GB data, 21631 kobjects
216 TB used, 171 TB / 388 TB avail
3204 active+clean
4 active+clean+scrubbing
client io 4079 kB/s wr, 8 op/s

Using osd dump I determined that it is osd number 70:

osd.70 down out weight 0 up_from 2668 up_thru 6886 down_at 6913 last_clean_interval [488,2665) 10.35.1.217:6814/22440 10.35.1.217:6820/22440 10.35.1.217:6824/22440 10.35.1.217:6830/22440 autoout,exists 5dbd4a14-5045-490e-859b-15533cd67568

Looking at that node, the drive is still mounted and I did not see any errors in any of the system logs, and the raid level status shows the drive as up and healthy, etc.
root@hqosd6:~# df -h | grep 70
/dev/sdl1 3.7T 1.9T 1.9T 51% /var/lib/ceph/osd/ceph-70

I was hoping that someone might be able to advise me on the next course of action (can I add the osd back in? should I replace the drive altogether? etc.)

I have attached the osd log to this email.

Any suggestions would be great.

Thanks,
Shain

--
Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649
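Following up on the deep-scrub suggestion quoted above, both variants look roughly like this. A sketch only: the acting-set match in the awk pattern is simple and worth double-checking against your pg dump output before scrubbing.

    # Ask osd.70 to deep-scrub all PGs it holds
    ceph osd deep-scrub 70

    # Or find each PG whose acting set contains osd.70 and scrub it explicitly
    for pg in $(ceph pg dump 2>/dev/null \
        | awk '$1 ~ /^[0-9]+\./ && $0 ~ /\[([0-9]+,)*70(,[0-9]+)*\]/ {print $1}'); do
        ceph pg deep-scrub "$pg"
    done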
Re: [ceph-users] Pg's stuck in inactive/unclean state + Association from PG-OSD does not seem to be happening.
It is simple. When you have this kind of problem (stuck), first look into crush map. And here you are: You have only one default ruleset 0 with step take default (so selecting osd's from default root subtree), but your root doesn't contain any osds. See below: rule replicated_ruleset { ruleset 0 type replicated min_size 1 max_size 10 step take default step chooseleaf firstn 0 type host step emit } root default { id -1 # do not change unnecessarily # weight 0.000 alg straw hash 0 # rjenkins1 } I recommend to add octeon1 and octeon as items into default root and it should work (or create another root and replace step take default with your new root name). JP On 2014-11-10 20:21, Prashanth Nednoor wrote: Folks, Now, we are running into an issue where the PG's(192) are stuck in creating state forever. I have experimented with various PG settings(osd_pool_default_pg_num from 50 to 400) for replicas and default and doesn't seem to help so far. Just to give you a brief overview, I have 8 osd's. I see the create_pg is pending messages in ceph monitor logs. I have attached the following logs in the zip file. 1) crush map(crush.map) 2) ceph osd tree, (OSD_TREE.txt OSD's 1,2,3,4 belong to host octeon and OSD's 0,5,6,7 belong to host octeon1). 3) ceph pg dump, health details etcetc(dump_pgs, health_detail) 4) Attached the ceph.conf 5) ceph osd lspools. 0 data,1 metadata,2 rbd, Here is the dump for ceph -w before any osd's were created: ceph -w cluster 3eda0199-93a9-428b-8209-caeff84d3d3f health HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds monmap e1: 1 mons at {essperf13=209.243.160.45:6789/0}, election epoch 1, quorum 0 essperf13 osdmap e205: 0 osds: 0 up, 0 in pgmap v928: 192 pgs, 3 pools, 0 bytes data, 0 objects 0 kB used, 0 kB / 0 kB avail 192 creating 2014-11-05 23:26:46.555348 mon.0 [INF] pgmap v928: 192 pgs: 192 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail Here is the dump for ceph -w after 8 osd's were created: ceph -w cluster 3eda0199-93a9-428b-8209-caeff84d3d3f health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean monmap e1: 1 mons at {essperf13=209.243.160.45:6789/0}, election epoch 1, quorum 0 essperf13 osdmap e213: 8 osds: 8 up, 8 in pgmap v958: 192 pgs, 3 pools, 0 bytes data, 0 objects 328 MB used, 14856 GB / 14856 GB avail 192 creating 2014-11-05 23:46:25.461143 mon.0 [INF] pgmap v958: 192 pgs: 192 creating; 0 bytes data, 328 MB used, 14856 GB / 14856 GB avail Any pointers to resolve this issue will be helpful. Thanks Prashanth -Original Message- From: Prashanth Nednoor Sent: Tuesday, October 28, 2014 9:26 PM To: 'Sage Weil' Cc: Philip Kufeldt; ceph-de...@vger.kernel.org Subject: RE: cephx auth issues:Having issues trying to get the OSD up on a MIPS64, when the OSD tries to communicate with the monitor!!! Sage, As requested I set the debug setting in ceph.conf on both the sides. Here are the logs for the OSD and MONITOR attached. 1) OSD : IPADDRESS: 209.243.157.187. Logfile attached is: Ceph-0.log 2) MONITOR: IP ADDRESS: 209.243.160.45, Logfile attached is: Ceph-mon.essperf13.log Please Note that AUTHENTICATION IS DISABLED IN THE /etc/ceph/ceph.conf files on both OSD and monitor. In addition to this on the OSD side I by-passed part of the authentication code that was causing trouble(monc-authenticate) in osd_init function call. I hope this is ok. Good news is my osd daemon is up now on the MIPS side, finally, but for some reason MONITOR is still not detecting the OSD. 
It seems from the ceph mon log, it knows the OSD is at 187 and it does exchange some information. Thanks for your prompt response and help. Thanks Prashanth -Original Message- From: Sage Weil [mailto:s...@newdream.net] Sent: Tuesday, October 28, 2014 4:59 PM To: Prashanth Nednoor Cc: Philip Kufeldt; ceph-de...@vger.kernel.org Subject: Re: cephx auth issues:Having issues trying to get the OSD up on a MIPS64, when the OSD tries to communicate with the monitor!!! Hi, On Tue, 28 Oct 2014, Prashanth Nednoor wrote: Folks, I am trying to get the osd up and having an issue. OSD does exchange some messages with the MONITOR before this error. Seems like an issue with authentication in my set up with MIPS based OSD and Intel XEON MONITORS. I have attached the logs. The OSD(209.243.157.187) sends some request to MONITOR (209.243.160.45). I see this message No session security set, followed by the below message. The reply is coming back as auth_reply(proto 2 -1 (1) Operation not permitted. Is there an ENDIAN issue here between MIPS based OSD(BIGEENDIAN) and INTEL XEONS(LITTLE ENDIAN), my CEPH-MOINTORS are INTEL XEONS??? I made sure the keyrings are all consistent. Here are the keys on OSD and MONITOR. I tried
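To make JP's suggestion above concrete, the existing host buckets can be moved under the default root with the CLI instead of editing and recompiling the crush map. A sketch, with the bucket names taken from this thread; verify the result with ceph osd tree afterwards:

    ceph osd crush move octeon root=default
    ceph osd crush move octeon1 root=default
    # Check that both hosts (and their OSDs) now sit under "default"
    ceph osd tree

Once the OSDs are reachable from the ruleset's "step take default", the 192 stuck-creating PGs should start peering.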
Re: [ceph-users] Stuck in stale state
Thank you, and sorry for bothering you; I was new to the ceph-users list and I couldn't cancel my message. I found out what happened a few hours later.

The main problem was that I moved one OSD out of its "host hostname {}" crush map entry (I wanted to do so). Everything was OK, but a restart of the OSD caused automatic placement back into the "host hostname {}" crush map section again.

I solved it with:

osd crush update on start = false

see the ceph-crush-location hook: http://ceph.com/docs/master/rados/operations/crush-map/

You can consider this solved; no problem with Ceph, only my poor knowledge caused it.

JP

On 2014-11-10 20:53, Craig Lewis wrote:
"nothing to send, going to standby" isn't necessarily bad, I see it from time to time. It shouldn't stay like that for long though.

If it's been 5 minutes, and the cluster still isn't doing anything, I'd restart that osd.

On Fri, Nov 7, 2014 at 1:55 PM, Jan Pekař <jan.pe...@imatic.cz> wrote:
Hi,

I was testing ceph cluster map changes and I got to a stuck state which seems to be indefinite.

First, a description of what I have done. I'm testing a special case with only one copy of pg's (pool size = 1). All pg's were on one osd.0. I created a second osd.1 and modified the cluster map to transfer one pool (metadata) to the newly created osd.1. PG's started to remap and the objects-degraded number was dropping, so everything looked normal. During that recovery process I restarted both osd daemons.

After that I noticed that pg's that should be remapped had a stale state - stale+active+remapped+backfilling - along with other objects in stale states. I tried to run ceph pg force_create_pg on one pg that should be remapped, but nothing changed (that is the 1 stuck / creating PG below in ceph health).

The command "rados -p metadata ls" hangs, so the data are unavailable, but they should be there.

What should I do in this state to get it working?

ceph -s below:

cluster 93418692-8e2e-4689-a237-ed5b47f39f72
health HEALTH_WARN 52 pgs backfill; 1 pgs backfilling; 63 pgs stale; 1 pgs stuck inactive; 63 pgs stuck stale; 54 pgs stuck unclean; recovery 107232/1881806 objects degraded (5.698%); mon.imatic-mce low disk space
monmap e1: 1 mons at {imatic-mce=192.168.11.165:6789/0}, election epoch 1, quorum 0 imatic-mce
mdsmap e450: 1/1/1 up {0=imatic-mce=up:active}
osdmap e275: 2 osds: 2 up, 2 in
pgmap v51624: 448 pgs, 4 pools, 790 GB data, 1732 kobjects
804 GB used, 2915 GB / 3720 GB avail
107232/1881806 objects degraded (5.698%)
52 stale+active+remapped+wait_backfill
1 creating
1 stale+active+remapped+backfilling
10 stale+active+clean
384 active+clean

Last message in the OSD logs:

2014-11-07 22:17:45.402791 deb4db70 0 -- 192.168.11.165:6804/29564 192.168.11.165:6807/29939 pipe(0x9d52f00 sd=213 :53216 s=2 pgs=1 cs=1 l=0 c=0x2c7f58c0).fault with nothing to send, going to standby

Thank you for help

With regards
Jan Pekar, ceph fan

--
Ing. Jan Pekař
jan.pe...@imatic.cz | +420603811737
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz
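For anyone hitting the same behaviour, the setting JP used goes into the [osd] section of ceph.conf on the OSD hosts. A minimal sketch; the hook line is optional and its path is only an example, for when you want a script to compute the location instead of pinning it:

    [osd]
    osd crush update on start = false
    # or, to compute a custom location at startup instead:
    # osd crush location hook = /usr/local/bin/my-crush-location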
[ceph-users] Node down question
I have searched the list archives, and have seen a couple of references to this question, but no real solution, unfortunately... We are running multiple ceph clusters, pretty much as media appliances. As such, the number of nodes is variable, and all of the nodes are symmetric (i.e. same CPU power, memory, disk space). As a result, we are running a monitor and an OSD (connected to an SSD RAID) on each of the systems. The number of nodes is typically small, on the order of five to a dozen. As the node count gets higher, we are planning not to run monitors on all nodes. Our pools are typically set up with a replication size of 2 or 3, with a min_size of 1. The problem occurs when a single node goes down, such that its monitor and OSD stop at once. For a client (especially a writer) on another node, there is a pretty consistent 20-second delay until further operations go through. This is a delay that we cannot easily survive. If I first bring down the OSD, then wait a few seconds, and then bring down the monitor, the system behaves with only a few seconds of delay. However, we can't always guarantee a graceful shutdown (such as when a node is rebooted, loses network connectivity, or power is lost). Note that I get exactly the same behavior if I stop an OSD on one system while stopping a monitor on another... Previous discussions similar to this have touched upon the osd heartbeat grace setting, which is conspicuously set to 20 seconds. I have tried changing this, along with any other related settings, to no avail -- no matter what I do, the delay remains at 20 seconds. Anything else to try? Jason ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
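One hedged suggestion for the 20-second wall: the heartbeat grace value is consulted by the monitors as well as the OSDs when failure reports come in, so setting it only under [osd] often appears to have no effect. A minimal sketch (to be verified against your release; the numbers are arbitrary examples):
[global]
# read by both OSDs and mons; put it in [global] or in both [osd] and [mon]
osd heartbeat grace = 10
# optionally stop the mon from stretching the grace for historically laggy OSDs
mon osd adjust heartbeat grace = false
All daemons need a restart (or an injectargs) for the change to take effect, and shorter grace values trade faster failover for more false positives under load.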
Re: [ceph-users] osd down
Yes, removing an OSD before re-creating it will give you the same OSD ID. That's my preferred method, because it keeps the crushmap the same. Only PGs that existed on the replaced disk need to be backfilled. I don't know if adding the replacement to the same host and then removing the old OSD gives you the same CRUSH map as the reverse. I suspect not, because the OSDs are re-ordered on that host. On Mon, Nov 10, 2014 at 1:29 PM, Shain Miley smi...@npr.org wrote: Craig, Thanks for the info. I ended up doing a zap and then a create via ceph-deploy. One question that I still have is surrounding adding the failed osd back into the pool. In this example, osd.70 was bad. When I added it back in via ceph-deploy, the disk was brought up as osd.108. Only after osd.108 was up and running did I think to remove osd.70 from the crush map etc. My question is this: had I removed it from the crush map prior to my ceph-deploy create, should/would Ceph have reused the osd number 70? I would prefer to replace a failed disk with a new one and keep the old osd assignment if possible; that is why I am asking. Anyway, thanks again for all the help. Shain Sent from my iPhone On Nov 7, 2014, at 2:09 PM, Craig Lewis cle...@centraldesktop.com wrote: I'd stop that osd daemon, and run xfs_check / xfs_repair on that partition. If you repair anything, you should probably force a deep-scrub on all the PGs on that disk. I think ceph osd deep-scrub <osdid> will do that, but you might have to manually grep ceph pg dump. Or you could just treat it like a failed disk, but re-use the disk. ceph-disk-prepare --zap-disk should take care of you. On Thu, Nov 6, 2014 at 5:06 PM, Shain Miley smi...@npr.org wrote: I tried restarting all the osd's on that node; osd.70 was the only ceph process that did not come back online. There is nothing in the ceph-osd log for osd.70. However I do see over 13,000 of these messages in the kern.log:
Nov 6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1): xfs_log_force: error 5 returned.
Does anyone have any suggestions on how I might be able to get this HD back in the cluster (or whether or not it is worth even trying)? Thanks, Shain Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649 From: Shain Miley [smi...@npr.org] Sent: Tuesday, November 04, 2014 3:55 PM To: ceph-users@lists.ceph.com Subject: osd down Hello, We are running ceph version 0.80.5 with 108 osd's. Today I noticed that one of the osd's is down:
root@hqceph1:/var/log/ceph# ceph -s
cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
health HEALTH_WARN crush map has legacy tunables
monmap e1: 3 mons at {hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0}, election epoch 146, quorum 0,1,2 hqceph1,hqceph2,hqceph3
osdmap e7119: 108 osds: 107 up, 107 in
pgmap v6729985: 3208 pgs, 17 pools, 81193 GB data, 21631 kobjects
216 TB used, 171 TB / 388 TB avail
3204 active+clean
4 active+clean+scrubbing
client io 4079 kB/s wr, 8 op/s
Using osd dump I determined that it is osd number 70:
osd.70 down out weight 0 up_from 2668 up_thru 6886 down_at 6913 last_clean_interval [488,2665) 10.35.1.217:6814/22440 10.35.1.217:6820/22440 10.35.1.217:6824/22440 10.35.1.217:6830/22440 autoout,exists 5dbd4a14-5045-490e-859b-15533cd67568
Looking at that node, the drive is still mounted and I did not see any errors in any of the system logs, and the raid level status shows the drive as up and healthy, etc.
root@hqosd6:~# df -h |grep 70
/dev/sdl1 3.7T 1.9T 1.9T 51% /var/lib/ceph/osd/ceph-70
I was hoping that someone might be able to advise me on the next course of action (can I add the osd back in? should I replace the drive altogether? etc.). I have attached the osd log to this email. Any suggestions would be great. Thanks, Shain -- Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
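For the id-reuse question above, the usual sequence is to remove the dead OSD completely before re-creating it, so the lowest free id (70 here) can be handed out again. A sketch using this thread's ids, per the firefly-era docs; adapt hostnames, devices, and init commands to your environment:
ceph osd out 70
stop ceph-osd id=70            # or: service ceph stop osd.70, depending on the distro
ceph osd crush remove osd.70
ceph auth del osd.70
ceph osd rm 70
# with id 70 free again, the next create should pick it up, e.g.:
ceph-deploy osd create hqosd6:sdl
Once the rebuilt OSD is in and backfilled, a ceph osd deep-scrub 70 is a reasonable sanity check on the replaced disk.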
[ceph-users] Trying to figure out usable space on erasure coded pools
Hi, It's easy to calculate the amount of raw storage vs actual storage on replicated pools. Example with 4x 2TB disks: - 8TB raw - 4TB usable (when using 2 replicas) I understand how erasure coded pools reduces the overhead of storage required for data redundancy and resiliency and how it depends on the erasure coding profile you use. Do you guys have an easy way to figure out the amount of usable storage ? Thanks ! -- David Moreau Simard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
I had the same experience with force_create_pg too. I ran it, and the PGs sat there in creating state. I left the cluster overnight, and sometime in the middle of the night, they created. The actual transition from creating to active+clean happened during the recovery after a single OSD was kicked out. I don't recall if that single OSD was responsible for the creating PGs. I really can't say what un-jammed my creating. On Mon, Nov 10, 2014 at 12:33 PM, Chad Seys cws...@physics.wisc.edu wrote: Hi Craig, If all of your PGs now have an empty down_osds_we_would_probe, I'd run through this discussion again. Yep, looks to be true. So I ran: # ceph pg force_create_pg 2.5 and it has been creating for about 3 hours now. :/ # ceph health detail | grep creating pg 2.5 is stuck inactive since forever, current state creating, last acting [] pg 2.5 is stuck unclean since forever, current state creating, last acting [] Then I restart all OSDs. The creating label disapears and I'm back with same number of incomplete PGs. :( is the 'force_create_pg' the right command? The 'mark_unfound_lost' complains that 'pg has no unfound objects' . I shall start the 'force_create_pg' again and wait longer. Unless there is a different command to use. ? Thanks! Chad. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
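For anyone else walking through the same procedure, the generic way to see what a stuck or creating PG is actually waiting on (these are standard commands, not a guaranteed fix for this particular cluster; pg 2.5 is just the id from Chad's mail):
ceph health detail | grep 2.5
ceph pg 2.5 query              # look at recovery_state, blocked_by and
                               # down_osds_we_would_probe near the end
ceph osd lost <id> --yes-i-really-mean-it   # only if a truly dead OSD is what
                                            # the PG still wants to probe
Marking an OSD lost is irreversible as far as that OSD's data is concerned, so it only makes sense once the disk is genuinely unrecoverable.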
Re: [ceph-users] Trying to figure out usable space on erasure coded pools
On Mon, 10 Nov 2014, David Moreau Simard wrote: Hi, It's easy to calculate the amount of raw storage vs actual storage on replicated pools. Example with 4x 2TB disks: - 8TB raw - 4TB usable (when using 2 replicas) I understand how erasure coded pools reduces the overhead of storage required for data redundancy and resiliency and how it depends on the erasure coding profile you use. Do you guys have an easy way to figure out the amount of usable storage ? The 'ceph df' command now has a 'MAX AVAIL' column that factors in either the replication factor or erasure k/(k+m) ratio. It also takes into account the projected distribution of data across disks from the CRUSH rule and uses the 'first OSD to fill up' as the target. What it doesn't take into account is the expected variation in utilization or the 'full_ratio' and 'near_full_ratio' which will stop writes sometime before that point. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Trying to figure out usable space on erasure coded pools
Oh, that's interesting - I didn't know that. Thanks. -- David Moreau Simard On Nov 10, 2014, at 6:06 PM, Sage Weil s...@newdream.net wrote: On Mon, 10 Nov 2014, David Moreau Simard wrote: Hi, It's easy to calculate the amount of raw storage vs actual storage on replicated pools. Example with 4x 2TB disks: - 8TB raw - 4TB usable (when using 2 replicas) I understand how erasure coded pools reduces the overhead of storage required for data redundancy and resiliency and how it depends on the erasure coding profile you use. Do you guys have an easy way to figure out the amount of usable storage ? The 'ceph df' command now has a 'MAX AVAIL' column that factors in either the replication factor or erasure k/(k+m) ratio. It also takes into account the projected distribution of data across disks from the CRUSH rule and uses the 'first OSD to fill up' as the target. What it doesn't take into account is the expected variation in utilization or the 'full_ratio' and 'near_full_ratio' which will stop writes sometime before that point. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
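As a back-of-envelope companion to the k/(k+m) ratio Sage describes, a quick sketch (the numbers are made up; real clusters also lose space to uneven CRUSH distribution and the full ratios, as noted above):
# usable ~= raw * k / (k + m); example: 40 TB raw with an 8+4 profile
awk 'BEGIN { raw = 40; k = 8; m = 4; printf "usable ~ %.1f TB\n", raw * k / (k + m) }'
# the k and m of an existing profile:
ceph osd erasure-code-profile get default
For replicated pools the same arithmetic is simply raw / size.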
[ceph-users] PG's incomplete after OSD failure
Hi All, We've had a string of very unfortunate failures and need a hand fixing the incomplete PG's that we're now left with. We're configured with 3 replicas over different hosts, with 5 hosts in total. The timeline goes:
-1 week :: A full server goes offline with a failed backplane. Still not working.
-1 day :: OSD 190 fails.
-1 day + 3 minutes :: OSD 121 in a different server fails, taking out several PG's and blocking IO.
Today :: The first failed osd (osd.190) was cloned to a good drive with xfs_dump | xfs_restore and now boots fine. The last failed osd (osd.121) is completely unrecoverable and was marked as lost.
What we're left with now is 2 incomplete PG's that are preventing RBD images from booting.
# ceph pg dump_stuck inactive
ok
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
8.ca 2440 0 0 0 0 10219748864 9205 9205 incomplete 2014-11-11 10:29:04.910512 160435'959618 161358:6071679 [190,111] 190 [190,111] 190 86417'207324 2013-09-09 12:58:10.749001 86229'196887 2013-09-02 12:57:58.162789
8.6ae 0 0 0 0 0 0 3176 3176 incomplete 2014-11-11 10:24:07.000373 160931'1935986 161358:267 [117,190] 117 [117,190] 117 86424'389748 2013-09-09 16:52:58.796650 86424'389748 2013-09-09 16:52:58.796650
We've tried doing a pg revert but it's saying 'no missing objects' and then not doing anything. I've also done the usual scrub, deep-scrub, pg and osd repairs... so far nothing has helped. I think it could be a similar situation to this post [ http://www.spinics.net/lists/ceph-users/msg11461.html ] where one of the osds is holding a slightly newer but incomplete version of the PG which needs to be removed. Is anyone able to shed some light on how I might be able to use the objectstore tool to check if this is the case? If anyone has any suggestions it would be greatly appreciated. Likewise, if you need any more information about my problem just let me know. Thanks all -Matt ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] osds fails to start with mismatch in id
Hi Greg, Thanks for the pointer. I think you are right. The full story is like this. After installation, everything works fine until I reboot. I do observe udevadm getting triggered in the logs, but the devices do not come up after reboot. It is the exact issue described in http://tracker.ceph.com/issues/5194, but that was fixed a while back per the case details. As a workaround, I copied the contents from /proc/mounts to fstab, and that's where I ran into the issue. Following your suggestion I defined the mounts by UUID in fstab, but I see a similar problem. blkid.tab has now moved to tmpfs and also isn't consistent even after issuing blkid explicitly to get the UUIDs, which is in line with the comments in ceph-disk. I decided to reinstall, dd the partitions, zap the disks, etc. It did not help. It is very weird that the links below change in /dev/disk/by-uuid, /dev/disk/by-partuuid, etc.
Before reboot
lrwxrwxrwx 1 root root 10 Nov 10 06:31 11aca3e2-a9d5-4bcc-a5b0-441c53d473b6 -> ../../sdd2
lrwxrwxrwx 1 root root 10 Nov 10 06:31 89594989-90cb-4144-ac99-0ffd6a04146e -> ../../sde2
lrwxrwxrwx 1 root root 10 Nov 10 06:31 c17fe791-5525-4b09-92c4-f90eaaf80dc6 -> ../../sda2
lrwxrwxrwx 1 root root 10 Nov 10 06:31 c57541a1-6820-44a8-943f-94d68b4b03d4 -> ../../sdc2
lrwxrwxrwx 1 root root 10 Nov 10 06:31 da7030dd-712e-45e4-8d89-6e795d9f8011 -> ../../sdb2
After reboot
lrwxrwxrwx 1 root root 10 Nov 10 09:50 11aca3e2-a9d5-4bcc-a5b0-441c53d473b6 -> ../../sdd2
lrwxrwxrwx 1 root root 10 Nov 10 09:50 89594989-90cb-4144-ac99-0ffd6a04146e -> ../../sde2
lrwxrwxrwx 1 root root 10 Nov 10 09:50 c17fe791-5525-4b09-92c4-f90eaaf80dc6 -> ../../sda2
lrwxrwxrwx 1 root root 10 Nov 10 09:50 c57541a1-6820-44a8-943f-94d68b4b03d4 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Nov 10 09:50 da7030dd-712e-45e4-8d89-6e795d9f8011 -> ../../sdh2
Essentially, the transformation here is sdb2 -> sdh2 and sdc2 -> sdb2. In fact I hadn't partitioned my sdh at all before the test. The only difference from the standard procedure is probably that I have pre-created the partitions for the journal and data with parted. The osd rules in /lib/udev/rules.d have four different partition GUID codes (45b0969e-9b03-4f30-b4c6-5ec00ceff106, 45b0969e-9b03-4f30-b4c6-b4b80ceff106, 4fbd7e29-9d25-41b8-afd0-062c0ceff05d, 4fbd7e29-9d25-41b8-afd0-5ec00ceff05d), but all of my journal/data partitions have ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 as their partition GUID code. Appreciate any help. Regards, Rama = -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Sunday, November 09, 2014 3:36 PM To: Ramakrishna Nishtala (rnishtal) Cc: ceph-us...@ceph.com Subject: Re: [ceph-users] osds fails to start with mismatch in id On Sun, Nov 9, 2014 at 3:21 PM, Ramakrishna Nishtala (rnishtal) rnish...@cisco.com wrote: Hi, I am on ceph 0.87, RHEL 7. Out of 60, a few osd's start and the rest complain about a mismatch in ids as below.
2014-11-09 07:09:55.501177 7f4633e01880 -1 OSD id 56 != my id 53
2014-11-09 07:09:55.810048 7f636edf4880 -1 OSD id 57 != my id 54
2014-11-09 07:09:56.122957 7f459a766880 -1 OSD id 58 != my id 55
2014-11-09 07:09:56.429771 7f87f8e0c880 -1 OSD id 0 != my id 56
2014-11-09 07:09:56.741329 7fadd9b91880 -1 OSD id 2 != my id 57
Found one OSD ID in /var/lib/ceph/cluster-id/keyring. To check this out, I manually corrected it and turned authentication to none too, but it did not help. Any clues how it can be corrected? It sounds like maybe the symlinks to data and journal aren't matching up with where they're supposed to be. This is usually a result of using unstable /dev links that don't always match to the same physical disks.
Have you checked that? -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
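Since all of Rama's partitions carry the generic data GUID (ebd0a0a2-...) rather than the Ceph type codes the udev rules match on, one hedged suggestion is to retag them with sgdisk. The type codes below are the non-dmcrypt data and journal GUIDs from the rules file quoted above; the device and partition numbers are examples only (assuming partition 1 is data and partition 2 is the journal; swap them if yours are laid out the other way), and the OSDs should be stopped first:
# tag partition 1 as a ceph data partition and partition 2 as a ceph journal
sgdisk --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/sdd
sgdisk --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdd
partprobe /dev/sdd
udevadm trigger --subsystem-match=block --action=add
With the type codes in place, the ceph udev rules (via ceph-disk activate) should mount and start each OSD by its own identity, which removes the need for fstab entries tied to unstable /dev names.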
Re: [ceph-users] PG's incomplete after OSD failure
Just an update, it appears that no data actually exists for those PG's on osd.117 and osd.111 but it's showing as incomplete anyway. So for the 8.ca PG, osd.111 has only an empty directory but osd 190 is filled with data. For 8.6ae, osd.117 has no data in the pg directory and osd.190 is filled with data as before. Since all of the required data is on OSD.190, would there be a way to make osd.111 and osd.117 forget they have ever seen the two incomplete PG's and therefore restart backfilling? On Tue, Nov 11, 2014 at 10:37 AM, Matthew Anderson manderson8...@gmail.com wrote: Hi All, We've had a string of very unfortunate failures and need a hand fixing the incomplete PG's that we're now left with. We're configured with 3 replicas over different hosts with 5 in total. The timeline goes - -1 week :: A full server goes offline with a failed backplane. Still not working -1 day :: OSD 190 fails -1 day + 3 minutes :: OSD 121 fails in a different server fails taking out several PG's and blocking IO Today :: The first failed osd (osd.190) was cloned to a good drive with xfs_dump | xfs_restore and now boots fine. The last failed osd (osd.121) is completely unrecoverable and was marked as lost. What we're left with now is 2 incomplete PG's that are preventing RBD images from booting. # ceph pg dump_stuck inactive ok pg_statobjectsmipdegrmispunfbyteslog disklogstatestate_stampvreportedupup_primary actingacting_primarylast_scrubscrub_stamp last_deep_scrubdeep_scrub_stamp 8.ca244000001021974886492059205 incomplete2014-11-11 10:29:04.910512160435'959618 161358:6071679[190,111]190[190,111]19086417'207324 2013-09-09 12:58:10.74900186229'1968872013-09-02 12:57:58.162789 8.6ae00000031763176incomplete 2014-11-11 10:24:07.000373160931'1935986161358:267 [117,190]117[117,190]11786424'3897482013-09-09 16:52:58.79665086424'3897482013-09-09 16:52:58.796650 We've tried doing a pg revert but it's saying 'no missing objects' followed by not doing anything. I've also done the usual scrub, deep-scrub, pg and osd repairs... so far nothing has helped. I think it could be a similar situation to this post [ http://www.spinics.net/lists/ceph-users/msg11461.html ] where one of the osd's it holding a slightly newer but incomplete version of the PG which needs to be removed. Is anyone able to shed some light on how I might be able to use the objectstore tool to check if this is the case? If anyone has any suggestions it would be greatly appreciated. Likewise if you need any more information about my problem just let me know Thanks all -Matt ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] does anyone know what xfsaild and kworker are? They make the osd disks busy and produce 100-200 iops per osd disk
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] PG's incomplete after OSD failure
On Tue, 11 Nov 2014, Matthew Anderson wrote: Just an update, it appears that no data actually exists for those PG's on osd.117 and osd.111 but it's showing as incomplete anyway. So for the 8.ca PG, osd.111 has only an empty directory but osd.190 is filled with data. For 8.6ae, osd.117 has no data in the pg directory and osd.190 is filled with data as before. Since all of the required data is on osd.190, would there be a way to make osd.111 and osd.117 forget they have ever seen the two incomplete PG's and therefore restart backfilling? Ah, that's good news. You should know that the copy on osd.190 is slightly out of date, but it is much better than losing the entire contents of the PG. More specifically, for 8.6ae the latest version was 1935986 but the one on osd.190 is 1935747, about 200 writes in the past. You'll need to fsck the RBD images after this is all done. I don't think we've tested this recovery scenario, but I think you'll be able to recover with ceph_objectstore_tool, which has an import/export function and a delete function. First, try removing the newer version of the pg on osd.117. First export it for good measure (even though it's empty):
stop the osd
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-117 \
 --journal-path /var/lib/ceph/osd/ceph-117/journal \
 --op export --pgid 8.6ae --file osd.117.8.7ae
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-117 \
 --journal-path /var/lib/ceph/osd/ceph-117/journal \
 --op remove --pgid 8.6ae
and restart. If that doesn't peer, you can also try exporting the pg from osd.190 and importing it into osd.117. I think just removing the newer empty pg on osd.117 will do the trick, though... sage On Tue, Nov 11, 2014 at 10:37 AM, Matthew Anderson manderson8...@gmail.com wrote: Hi All, We've had a string of very unfortunate failures and need a hand fixing the incomplete PG's that we're now left with. We're configured with 3 replicas over different hosts, with 5 hosts in total. The timeline goes:
-1 week :: A full server goes offline with a failed backplane. Still not working.
-1 day :: OSD 190 fails.
-1 day + 3 minutes :: OSD 121 in a different server fails, taking out several PG's and blocking IO.
Today :: The first failed osd (osd.190) was cloned to a good drive with xfs_dump | xfs_restore and now boots fine. The last failed osd (osd.121) is completely unrecoverable and was marked as lost.
What we're left with now is 2 incomplete PG's that are preventing RBD images from booting.
# ceph pg dump_stuck inactive
ok
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
8.ca 2440 0 0 0 0 10219748864 9205 9205 incomplete 2014-11-11 10:29:04.910512 160435'959618 161358:6071679 [190,111] 190 [190,111] 190 86417'207324 2013-09-09 12:58:10.749001 86229'196887 2013-09-02 12:57:58.162789
8.6ae 0 0 0 0 0 0 3176 3176 incomplete 2014-11-11 10:24:07.000373 160931'1935986 161358:267 [117,190] 117 [117,190] 117 86424'389748 2013-09-09 16:52:58.796650 86424'389748 2013-09-09 16:52:58.796650
We've tried doing a pg revert but it's saying 'no missing objects' and then not doing anything. I've also done the usual scrub, deep-scrub, pg and osd repairs... so far nothing has helped. I think it could be a similar situation to this post [ http://www.spinics.net/lists/ceph-users/msg11461.html ] where one of the osds is holding a slightly newer but incomplete version of the PG which needs to be removed.
Is anyone able to shed some light on how I might be able to use the objectstore tool to check if this is the case? If anyone has any suggestions it would be greatly appreciated. Likewise if you need any more information about my problem just let me know Thanks all -Matt ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
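If removing the newer empty pg is not enough to get 8.6ae peering, the export/import fallback Sage mentions would look roughly like this; it follows the same tool and flag pattern as his commands above, with both OSDs stopped, and is a sketch rather than a tested procedure (the export file name is arbitrary; keep every export until the data is verified):
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-190 \
 --journal-path /var/lib/ceph/osd/ceph-190/journal \
 --op export --pgid 8.6ae --file 8.6ae.from-osd190
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-117 \
 --journal-path /var/lib/ceph/osd/ceph-117/journal \
 --op import --file 8.6ae.from-osd190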
Re: [ceph-users] osds fails to start with mismatch in id
Hi, Ramakrishna. I think you understand what the problem is: [ceph@ceph05 ~]$ cat /var/lib/ceph/osd/ceph-56/whoami 56 [ceph@ceph05 ~]$ cat /var/lib/ceph/osd/ceph-57/whoami 57 Tue Nov 11 2014 at 6:01:40, Ramakrishna Nishtala (rnishtal) rnish...@cisco.com: Hi Greg, Thanks for the pointer. I think you are right. The full story is like this. After installation, everything works fine until I reboot. I do observe udevadm getting triggered in logs, but the devices do not come up after reboot. Exact issue as http://tracker.ceph.com/issues/5194. But this has been fixed a while back per the case details. As a workaround, I copied the contents from /proc/mounts to fstab and that’s where I landed into the issue. After your suggestion, defined as UUID in fstab, but similar problem. blkid.tab now moved to tmpfs and also isn’t consistent ever after issuing blkid explicitly to get the UUID’s. Goes in line with ceph-disk comments. Decided to reinstall, dd the partitions, zapdisks etc. Did not help. Very weird that links below change in /dev/disk/by-uuid and /dev/disk/by-partuuid etc. *Before reboot* lrwxrwxrwx 1 root root 10 Nov 10 06:31 11aca3e2-a9d5-4bcc-a5b0-441c53d473b6 - ../../sdd2 lrwxrwxrwx 1 root root 10 Nov 10 06:31 89594989-90cb-4144-ac99-0ffd6a04146e - ../../sde2 lrwxrwxrwx 1 root root 10 Nov 10 06:31 c17fe791-5525-4b09-92c4-f90eaaf80dc6 - ../../sda2 lrwxrwxrwx 1 root root 10 Nov 10 06:31 c57541a1-6820-44a8-943f-94d68b4b03d4 - ../../sdc2 lrwxrwxrwx 1 root root 10 Nov 10 06:31 da7030dd-712e-45e4-8d89-6e795d9f8011 - ../../sdb2 *After reboot* lrwxrwxrwx 1 root root 10 Nov 10 09:50 11aca3e2-a9d5-4bcc-a5b0-441c53d473b6 - ../../sdd2 lrwxrwxrwx 1 root root 10 Nov 10 09:50 89594989-90cb-4144-ac99-0ffd6a04146e - ../../sde2 lrwxrwxrwx 1 root root 10 Nov 10 09:50 c17fe791-5525-4b09-92c4-f90eaaf80dc6 - ../../sda2 lrwxrwxrwx 1 root root 10 Nov 10 09:50 c57541a1-6820-44a8-943f-94d68b4b03d4 - ../../sdb2 lrwxrwxrwx 1 root root 10 Nov 10 09:50 da7030dd-712e-45e4-8d89-6e795d9f8011 - ../../sdh2 Essentially, the transformation here is sdb2-sdh2 and sdc2- sdb2. In fact I haven’t partitioned my sdh at all before the test. The only difference probably from the standard procedure is I have pre-created the partitions for the journal and data, with parted. /lib/udev/rules.d osd rules has four different partition GUID codes, 45b0969e-9b03-4f30-b4c6-5ec00ceff106, 45b0969e-9b03-4f30-b4c6-b4b80ceff106, 4fbd7e29-9d25-41b8-afd0-062c0ceff05d, 4fbd7e29-9d25-41b8-afd0-5ec00ceff05d, But all my partitions journal/data are having ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 as partition guid code. Appreciate any help. Regards, Rama = -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Sunday, November 09, 2014 3:36 PM To: Ramakrishna Nishtala (rnishtal) Cc: ceph-us...@ceph.com Subject: Re: [ceph-users] osds fails to start with mismatch in id On Sun, Nov 9, 2014 at 3:21 PM, Ramakrishna Nishtala (rnishtal) rnish...@cisco.com wrote: Hi I am on ceph 0.87, RHEL 7 Out of 60 few osd’s start and the rest complain about mismatch about id’s as below. 2014-11-09 07:09:55.501177 7f4633e01880 -1 OSD id 56 != my id 53 2014-11-09 07:09:55.810048 7f636edf4880 -1 OSD id 57 != my id 54 2014-11-09 07:09:56.122957 7f459a766880 -1 OSD id 58 != my id 55 2014-11-09 07:09:56.429771 7f87f8e0c880 -1 OSD id 0 != my id 56 2014-11-09 07:09:56.741329 7fadd9b91880 -1 OSD id 2 != my id 57 Found one OSD ID in /var/lib/ceph/cluster-id/keyring. To check this out manually corrected it and turned authentication to none too, but did not help. 
Any clues, how it can be corrected? It sounds like maybe the symlinks to data and journal aren't matching up with where they're supposed to be. This is usually a result of using unstable /dev links that don't always match to the same physical disks. Have you checked that? -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
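A quick loop to spot this kind of mismatch on a node, comparing the id each OSD data directory thinks it has with the device actually mounted there (a small sketch; assumes the default /var/lib/ceph/osd mount layout):
for d in /var/lib/ceph/osd/ceph-*; do
  echo "$d whoami=$(cat $d/whoami) dev=$(df -P $d | awk 'NR==2 {print $1}')"
done
If the whoami value does not match the directory name, the wrong partition has been mounted there, which is exactly the unstable-/dev-name problem discussed above.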
Re: [ceph-users] osds fails to start with mismatch in id
Hi Ramakrishna, we use the phy. path (containing the serial number) to a disk to prevent complexity and wrong mapping... This path will never change: /etc/ceph/ceph.conf [osd.16] devs = /dev/disk/by-id/scsi-SATA_ST4000NM0033-9Z_Z1Z0SDCY-part1 osd_journal = /dev/disk/by-id/scsi-SATA_INTEL_SSDSC2BA1BTTV330609AU100FGN-part1 ... regards Danny From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Irek Fasikhov Sent: Tuesday, November 11, 2014 6:36 AM To: Ramakrishna Nishtala (rnishtal); Gregory Farnum Cc: ceph-us...@ceph.com Subject: Re: [ceph-users] osds fails to start with mismatch in id Hi, Ramakrishna. I think you understand what the problem is: [ceph@ceph05 ~]$ cat /var/lib/ceph/osd/ceph-56/whoami 56 [ceph@ceph05 ~]$ cat /var/lib/ceph/osd/ceph-57/whoami 57 Tue Nov 11 2014 at 6:01:40, Ramakrishna Nishtala (rnishtal) rnish...@cisco.commailto:rnish...@cisco.com: Hi Greg, Thanks for the pointer. I think you are right. The full story is like this. After installation, everything works fine until I reboot. I do observe udevadm getting triggered in logs, but the devices do not come up after reboot. Exact issue as http://tracker.ceph.com/issues/5194. But this has been fixed a while back per the case details. As a workaround, I copied the contents from /proc/mounts to fstab and that’s where I landed into the issue. After your suggestion, defined as UUID in fstab, but similar problem. blkid.tab now moved to tmpfs and also isn’t consistent ever after issuing blkid explicitly to get the UUID’s. Goes in line with ceph-disk comments. Decided to reinstall, dd the partitions, zapdisks etc. Did not help. Very weird that links below change in /dev/disk/by-uuid and /dev/disk/by-partuuid etc. Before reboot lrwxrwxrwx 1 root root 10 Nov 10 06:31 11aca3e2-a9d5-4bcc-a5b0-441c53d473b6 - ../../sdd2 lrwxrwxrwx 1 root root 10 Nov 10 06:31 89594989-90cb-4144-ac99-0ffd6a04146e - ../../sde2 lrwxrwxrwx 1 root root 10 Nov 10 06:31 c17fe791-5525-4b09-92c4-f90eaaf80dc6 - ../../sda2 lrwxrwxrwx 1 root root 10 Nov 10 06:31 c57541a1-6820-44a8-943f-94d68b4b03d4 - ../../sdc2 lrwxrwxrwx 1 root root 10 Nov 10 06:31 da7030dd-712e-45e4-8d89-6e795d9f8011 - ../../sdb2 After reboot lrwxrwxrwx 1 root root 10 Nov 10 09:50 11aca3e2-a9d5-4bcc-a5b0-441c53d473b6 - ../../sdd2 lrwxrwxrwx 1 root root 10 Nov 10 09:50 89594989-90cb-4144-ac99-0ffd6a04146e - ../../sde2 lrwxrwxrwx 1 root root 10 Nov 10 09:50 c17fe791-5525-4b09-92c4-f90eaaf80dc6 - ../../sda2 lrwxrwxrwx 1 root root 10 Nov 10 09:50 c57541a1-6820-44a8-943f-94d68b4b03d4 - ../../sdb2 lrwxrwxrwx 1 root root 10 Nov 10 09:50 da7030dd-712e-45e4-8d89-6e795d9f8011 - ../../sdh2 Essentially, the transformation here is sdb2-sdh2 and sdc2- sdb2. In fact I haven’t partitioned my sdh at all before the test. The only difference probably from the standard procedure is I have pre-created the partitions for the journal and data, with parted. /lib/udev/rules.d osd rules has four different partition GUID codes, 45b0969e-9b03-4f30-b4c6-5ec00ceff106, 45b0969e-9b03-4f30-b4c6-b4b80ceff106, 4fbd7e29-9d25-41b8-afd0-062c0ceff05d, 4fbd7e29-9d25-41b8-afd0-5ec00ceff05d, But all my partitions journal/data are having ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 as partition guid code. Appreciate any help. 
Regards, Rama = -Original Message- From: Gregory Farnum [mailto:g...@gregs42.commailto:g...@gregs42.com] Sent: Sunday, November 09, 2014 3:36 PM To: Ramakrishna Nishtala (rnishtal) Cc: ceph-us...@ceph.commailto:ceph-us...@ceph.com Subject: Re: [ceph-users] osds fails to start with mismatch in id On Sun, Nov 9, 2014 at 3:21 PM, Ramakrishna Nishtala (rnishtal) rnish...@cisco.commailto:rnish...@cisco.com wrote: Hi I am on ceph 0.87, RHEL 7 Out of 60 few osd’s start and the rest complain about mismatch about id’s as below. 2014-11-09 07:09:55.501177 7f4633e01880 -1 OSD id 56 != my id 53 2014-11-09 07:09:55.810048 7f636edf4880 -1 OSD id 57 != my id 54 2014-11-09 07:09:56.122957 7f459a766880 -1 OSD id 58 != my id 55 2014-11-09 07:09:56.429771 7f87f8e0c880 -1 OSD id 0 != my id 56 2014-11-09 07:09:56.741329 7fadd9b91880 -1 OSD id 2 != my id 57 Found one OSD ID in /var/lib/ceph/cluster-id/keyring. To check this out manually corrected it and turned authentication to none too, but did not help. Any clues, how it can be corrected? It sounds like maybe the symlinks to data and journal aren't matching up with where they're supposed to be. This is usually a result of using unstable /dev links that don't always match to the same physical disks. Have you checked that? -Greg ___ ceph-users mailing list ceph-users@lists.ceph.commailto:ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com smime.p7s Description: S/MIME cryptographic signature
Re: [ceph-users] Triggering shallow scrub on OSD where scrub is already in progress
On Sun, Nov 9, 2014 at 9:29 PM, Mallikarjun Biradar mallikarjuna.bira...@gmail.com wrote: Hi all, Triggering a shallow scrub on an OSD where a scrub is already in progress restarts the scrub from the beginning on that OSD. Steps: Triggered a shallow scrub on an OSD (the cluster is running heavy IO). While the scrub was in progress, triggered a shallow scrub again on that OSD. The observed behaviour is that the scrub restarted from the beginning on that OSD. Please let me know whether this is expected behaviour. What version of Ceph are you seeing this on? How are you identifying that scrub is restarting from the beginning? It sounds sort of familiar to me, but I thought this was fixed so it was a no-op if you issue another scrub. (That's not authoritative though; I might just be missing a reason we want to restart it.) -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
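A reproducible way to answer Greg's two questions (generic CLI only; the osd id is an example):
ceph -v                         # version on the admin host
ceph tell osd.3 version         # version the OSD in question is actually running
ceph pg dump | grep scrubbing   # which PGs are scrubbing right now; watch whether the set resets
Comparing the last_scrub_stamp values in ceph pg dump before and after the second trigger is another hedged way to tell a restart from a no-op.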
[ceph-users] Deep scrub, cache pools, replica 1
Hello, One of my clusters has become busy enough (I'm looking at you, evil Windows VMs that I shall banish elsewhere soon) to experience client-noticeable performance impacts during deep scrub. Before this I instructed all OSDs to deep scrub in parallel on Saturday night and that finished before Sunday morning. So for now I'll fire them off one by one to reduce the load. Looking forward, that cluster doesn't need more space, so instead of adding more hosts and OSDs I was thinking of a cache pool instead. I suppose that will keep the clients happy while the slow pool gets scrubbed. Has anybody tested cache pools with Firefly and compared the performance to Giant? For testing I'm currently playing with a single storage node and 8 SSD-backed OSDs. Now what very much blew my mind is that a pool with a replication of 1 still does quite the impressive read orgy, clearly reading all the data in the PGs. Why? And what is it comparing that data with, the cosmic background radiation? Christian -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
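A rough sketch of the one-OSD-at-a-time approach mentioned above (the wait condition is crude and the sleeps arbitrary; the scrub interval options such as osd_deep_scrub_interval are the longer-term knobs, so check the docs for your release):
for osd in $(ceph osd ls); do
  ceph osd deep-scrub "$osd"
  sleep 30                      # give the scrubs a moment to start
  while ceph pg dump 2>/dev/null | grep -q 'scrubbing+deep'; do
    sleep 60
  done
done
This serialises the load instead of letting every OSD hit its disks at once, at the cost of a much longer overall scrub window.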