[ceph-users] OSD space imbalance
Hello,

I'm having an issue where disk usage between OSDs isn't well balanced, which wastes disk space. Ceph is the latest 0.94.2, used exclusively through cephfs. Re-weighting helps, but only slightly, and it has to be done on a daily basis, causing constant backfills. In the end I get one OSD at 65% usage while others go over 90%. I also set 'ceph osd crush tunables optimal', but I didn't notice any change in disk usage. Is there anything I can do to get them within a 10% range at least?

     health HEALTH_OK
     mdsmap e2577: 1/1/1 up, 2 up:standby
     osdmap e25239: 48 osds: 48 up, 48 in
      pgmap v3188836: 5184 pgs, 3 pools, 18028 GB data, 6385 kobjects
            36156 GB used, 9472 GB / 45629 GB avail
                5184 active+clean

ID WEIGHT  REWEIGHT SIZE USE  AVAIL   %USE  VAR
37 0.92999 1.0      950G 625G 324G    65.85 0.83
21 0.92999 1.0      950G 649G 300G    68.35 0.86
32 0.92999 1.0      950G 670G 279G    70.58 0.89
 7 0.92999 1.0      950G 676G 274G    71.11 0.90
17 0.92999 1.0      950G 681G 268G    71.73 0.91
40 0.92999 1.0      950G 689G 260G    72.55 0.92
20 0.92999 1.0      950G 690G 260G    72.62 0.92
25 0.92999 1.0      950G 691G 258G    72.76 0.92
 2 0.92999 1.0      950G 694G 256G    73.03 0.92
39 0.92999 1.0      950G 697G 253G    73.35 0.93
18 0.92999 1.0      950G 703G 247G    74.00 0.93
47 0.92999 1.0      950G 703G 246G    74.05 0.93
23 0.92999 0.86693  950G 704G 245G    74.14 0.94
 6 0.92999 1.0      950G 726G 224G    76.39 0.96
 8 0.92999 1.0      950G 727G 223G    76.54 0.97
 5 0.92999 1.0      950G 728G 222G    76.62 0.97
35 0.92999 1.0      950G 728G 221G    76.66 0.97
11 0.92999 1.0      950G 730G 220G    76.82 0.97
43 0.92999 1.0      950G 730G 219G    76.87 0.97
33 0.92999 1.0      950G 734G 215G    77.31 0.98
38 0.92999 1.0      950G 736G 214G    77.49 0.98
12 0.92999 1.0      950G 737G 212G    77.61 0.98
31 0.92999 0.85184  950G 742G 208G    78.09 0.99
28 0.92999 1.0      950G 745G 205G    78.41 0.99
27 0.92999 1.0      950G 751G 199G    79.04 1.00
10 0.92999 1.0      950G 754G 195G    79.40 1.00
13 0.92999 1.0      950G 762G 188G    80.21 1.01
 9 0.92999 1.0      950G 763G 187G    80.29 1.01
16 0.92999 1.0      950G 764G 186G    80.37 1.01
 0 0.92999 1.0      950G 778G 171G    81.94 1.03
 3 0.92999 1.0      950G 780G 170G    82.11 1.04
41 0.92999 1.0      950G 780G 169G    82.13 1.04
34 0.92999 0.87303  950G 783G 167G    82.43 1.04
14 0.92999 1.0      950G 784G 165G    82.56 1.04
42 0.92999 1.0      950G 786G 164G    82.70 1.04
46 0.92999 1.0      950G 788G 162G    82.93 1.05
30 0.92999 1.0      950G 790G 160G    83.12 1.05
45 0.92999 1.0      950G 804G 146G    84.59 1.07
44 0.92999 1.0      950G 807G 143G    84.92 1.07
 1 0.92999 1.0      950G 817G 132G    86.05 1.09
22 0.92999 1.0      950G 825G 125G    86.81 1.10
15 0.92999 1.0      950G 826G 123G    86.97 1.10
19 0.92999 1.0      950G 829G 120G    87.30 1.10
36 0.92999 1.0      950G 831G 119G    87.48 1.10
24 0.92999 1.0      950G 831G 118G    87.50 1.10
26 0.92999 1.0      950G 851G 101692M 89.55 1.13
29 0.92999 1.0      950G 851G 101341M 89.59 1.13
 4 0.92999 1.0      950G 860G  92164M 90.53 1.14
MIN/MAX VAR: 0.83/1.14  STDDEV: 5.94
TOTAL 45629G 36156G 9473G 79.24

Thanks,
Vedran
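Besides manual per-OSD 'ceph osd reweight', Hammer also ships an automated helper that lowers the reweight of the most-filled OSDs in one pass. A sketch, assuming the 0.94 CLI (the argument is a threshold in percent of average utilization; check the usage output on your version first, and note that running it triggers backfills):

    # lower the reweight of every OSD more than 10% above
    # the cluster's average utilization
    ceph osd reweight-by-utilization 110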
Re: [ceph-users] OSD space imbalance
On 13.08.2015 18:01, GuangYang wrote:
> Try 'ceph osd reweight-by-pg <int>' right after creating the pools?

Would it do any good now, when the pool is in use and nearly full? I can't re-create it at this point. Also, what's the integer argument in the command above? I failed to find a proper explanation in the docs.

> What is the typical object size in the cluster?

Around 50 MB.

Thanks,
Vedran
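For what it's worth, in Hammer the integer appears to be an oversubscription threshold in percent of the average PG count per OSD: only OSDs carrying more PGs than that fraction of the average get their reweight lowered. A sketch (the threshold and pool name are placeholders, and on a nearly full cluster this will trigger backfills):

    # lower the reweight of OSDs holding more than 110% of the
    # average number of PGs, counting only PGs of pool "data"
    ceph osd reweight-by-pg 110 data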
Re: [ceph-users] Cephfs and ERESTARTSYS on writes
On 07/24/2015 03:29 PM, Ilya Dryomov wrote:
> ngx_write_fd() is just a write(), which, when interrupted by SIGALRM, fails with EINTR because SA_RESTART is not set. We can try digging further, but I think nginx should retry in this case.

Hello,

The culprit was the 'timer_resolution 50ms;' setting in nginx, which was interrupting syscalls every 50ms. It usually shouldn't interrupt writes, but it did in the case of cephfs, so just removing it fixed the problem. The nginx devs also supplied me with a patch which should make nginx retry write() when it's interrupted (though it may no longer be needed now). Thanks for helping me with this.

Regards,
Vedran
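For the archive, a minimal C sketch of the retry logic such a patch would presumably implement (an illustration of the usual EINTR-retry idiom, not the actual nginx code):

    /* Keep retrying a write() that gets interrupted by a signal.
       Illustrative sketch only, not the real nginx patch. */
    #include <errno.h>
    #include <unistd.h>

    static ssize_t write_retry(int fd, const void *buf, size_t len)
    {
        ssize_t n;
        do {
            n = write(fd, buf, len);
        } while (n == -1 && errno == EINTR);  /* interrupted: restart */
        return n;
    }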
[ceph-users] Cephfs and ERESTARTSYS on writes
Hello,

I'm having an issue with nginx writing to cephfs. Often I'm getting:

    writev() "/home/ceph/temp/44/94/1/119444" failed (4: Interrupted system call) while reading upstream

Looking with strace, this happens:

    ...
    write(65, "e\314\366\36\302"..., 65536) = ? ERESTARTSYS (To be restarted)

It happens after the first 4 MB (exactly) are written; the subsequent write gets ERESTARTSYS (sometimes, but more rarely, it fails after the first 32 or 64 MB, etc. are written). Apparently nginx doesn't expect this and doesn't handle it, so it cancels the writes and deletes the partial file. Is it possible Ceph cannot find the destination PG fast enough and returns ERESTARTSYS? Is there any way to fix this behavior or reduce it?

Regards,
Vedran
Re: [ceph-users] Cephfs and ERESTARTSYS on writes
On 07/23/2015 03:20 PM, Gregory Farnum wrote:
> On Thu, Jul 23, 2015 at 1:17 PM, Vedran Furač <vedran.fu...@gmail.com> wrote:
>> Is it possible Ceph cannot find the destination PG fast enough and returns ERESTARTSYS? Is there any way to fix this behavior or reduce it?
>
> That's...odd. Are you using the kernel client or ceph-fuse, and on which version?

Not seeing write errors with ceph-fuse, but it's slow.

Regards,
Vedran
Re: [ceph-users] Cephfs and ERESTARTSYS on writes
On 07/23/2015 03:20 PM, Gregory Farnum wrote:
> On Thu, Jul 23, 2015 at 1:17 PM, Vedran Furač <vedran.fu...@gmail.com> wrote:
>> Hello,
>>
>> I'm having an issue with nginx writing to cephfs. Often I'm getting:
>>
>> writev() "/home/ceph/temp/44/94/1/119444" failed (4: Interrupted system call) while reading upstream
>>
>> Looking with strace, this happens:
>>
>> ...
>> write(65, "e\314\366\36\302"..., 65536) = ? ERESTARTSYS (To be restarted)
>>
>> It happens after the first 4 MB (exactly) are written; the subsequent write gets ERESTARTSYS (sometimes, but more rarely, it fails after the first 32 or 64 MB, etc. are written). Apparently nginx doesn't expect this and doesn't handle it, so it cancels the writes and deletes the partial file. Is it possible Ceph cannot find the destination PG fast enough and returns ERESTARTSYS? Is there any way to fix this behavior or reduce it?
>
> That's...odd. Are you using the kernel client or ceph-fuse, and on which version?

Sorry, forgot to mention, it's the kernel client; tried both 3.10 and 4.1, but it's the same. Ceph is firefly. I'll also try fuse.

Regards,
Vedran
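For reference, the two clients being compared look roughly like this to mount (monitor address, mountpoint, and credentials are placeholders; exact options vary by version):

    # kernel client
    mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

    # userspace (FUSE) client
    ceph-fuse -m 192.168.0.1:6789 /mnt/cephfs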
Re: [ceph-users] Cephfs and ERESTARTSYS on writes
On 07/23/2015 04:19 PM, Ilya Dryomov wrote:
> On Thu, Jul 23, 2015 at 4:23 PM, Vedran Furač <vedran.fu...@gmail.com> wrote:
>> On 07/23/2015 03:20 PM, Gregory Farnum wrote:
>>> That's...odd. Are you using the kernel client or ceph-fuse, and on which version?
>>
>> Sorry, forgot to mention, it's the kernel client; tried both 3.10 and 4.1, but it's the same. Ceph is firefly.
>
> That's probably a wait_*() return value, meaning it timed out, so userspace logs might help understand what's going on. A separate issue is that we leak ERESTARTSYS to userspace - this needs to be fixed.

Hmm, what's the timeout value? This happens even when ceph is nearly idle. When you mention logs, do you mean the Ceph server logs? The MON logs don't have anything special, and the OSD logs are full of:

2015-07-23 16:31:35.535622 7ff3fe020700  0 -- x.x.x.x:6849/27688 >> x.x.x.x:6841/27679 pipe(0x241e58c0 sd=183 :6849 s=2 pgs=1240 cs=127 l=0 c=0x19855de0).fault with nothing to send, going to standby
2015-07-23 16:31:42.492520 7ff401a53700  0 -- x.x.x.x:6849/27688 >> x.x.x.x:6841/27679 pipe(0x241e5080 sd=226 :6849 s=0 pgs=0 cs=0 l=0 c=0x21b31860).accept connect_seq 128 vs existing 127 state standby
2015-07-23 16:32:02.989102 7ff401851700  0 -- x.x.x.x:6849/27688 >> x.x.x.x:6854/27690 pipe(0x1916a680 sd=33 :43507 s=2 pgs=1366 cs=131 l=0 c=0x177e8680).fault with nothing to send, going to standby
2015-07-23 16:32:12.339357 7ff40144d700  0 -- x.x.x.x:6849/27688 >> x.x.x.x:6823/27279 pipe(0x241e7c80 sd=249 :6849 s=2 pgs=1246 cs=155 l=0 c=0x16ea46e0).fault with nothing to send, going to standby
2015-07-23 16:32:13.279426 7ff3fe828700  0 -- x.x.x.x:6849/27688 >> 185.75.253.10:6810/9746 pipe(0x1c75e840 sd=72 :57221 s=2 pgs=1352 cs=149 l=0 c=0x147cbde0).fault with nothing to send, going to standby
2015-07-23 16:32:17.916440 7ff3fb3f4700  0 -- x.x.x.x:6849/27688 >> 185.75.253.10:6810/9746 pipe(0x241e4000 sd=34 :6849 s=0 pgs=0 cs=0 l=0 c=0x21b2e160).accept connect_seq 150 vs existing 149 state standby
2015-07-23 16:32:22.922462 7ff40154e700  0 -- x.x.x.x:6849/27688 >> x.x.x.x:6823/27279 pipe(0x241e5e40 sd=216 :6849 s=0 pgs=0 cs=0 l=0 c=0x10089b80).accept connect_seq 156 vs existing 155 state standby
...

Regards,
Vedran
Re: [ceph-users] Cephfs and ERESTARTSYS on writes
On 07/23/2015 04:45 PM, Ilya Dryomov wrote:
> Can you provide the full strace output?

This is pretty much all the relevant part:

4118  open("/home/ceph/temp/45/45/5/154545", O_RDWR|O_CREAT|O_EXCL, 0600) = 377
4118  writev(377, [{"\3\0\0\0\0"..., 4096}, {"\247\0\0\3\23"..., 4096}, {"\225\0\0\4\334"..., 4096}, {"\204\0\0\t\n"..., 4096}, {"9\0\0\v\322"..., 4096}, ...], 33) = 135168
4118  readv(1206, [{"\334\1\210C\315"..., 4096}, {"X\1\266\343\320"..., 4096}, {"\304\1\345k\226"..., 4096}, {"}\2\17\27\371"..., 4096}, {"\203\2:\0e"..., 4096}, ...], 33) = 135168
4118  writev(377, [{"\334\1\210C\315"..., 4096}, {"X\1\266\343\320"..., 4096}, {"\304\1\345k\226"..., 4096}, {"}\2\17\27\371"..., 4096}, {"\203\2:\0e"..., 4096}, ...], 33) = 135168
4118  readv(1206, [{"\206\0\0\1c"..., 4096}, {"\336\0\0\1\351"..., 4096}, {"\265\0\0\0\313"..., 4096}, {"K\0\0\1A"..., 4096}, {"\217\0\0\1l"..., 4096}, ...], 33) = 135168
4118  writev(377, [{"\206\0\0\1c"..., 4096}, {"\336\0\0\1\351"..., 4096}, {"\265\0\0\0\313"..., 4096}, {"K\0\0\1A"..., 4096}, {"\217\0\0\1l"..., 4096}, ...], 33) = 135168
4118  readv(1206, [{"\2\0\366\371\273"..., 4096}, {"\256\1\22\3015"..., 4096}, {"\252\1-\361\225"..., 4096}, {"{\1I\335\4"..., 4096}, {"V\1`{\303"..., 4096}, ...], 33) = 135168
4118  writev(377, [{"\2\0\366\371\273"..., 4096}, {"\256\1\22\3015"..., 4096}, {"\252\1-\361\225"..., 4096}, {"{\1I\335\4"..., 4096}, {"V\1`{\303"..., 4096}, ...], 33) = 135168
4118  readv(1206, [{"O\\U\377\210"..., 4096}, {"\354 Gww"..., 4096}, {"\356\357|\317\250"..., 4096}, {"\272J\231\222E"..., 4096}, {"w\35W\213\277"..., 4096}, ...], 33) = 135168
4118  writev(377, [{"O\\U\377\210"..., 4096}, {"\354 Gww"..., 4096}, {"\356\357|\317\250"..., 4096}, {"\272J\231\222E"..., 4096}, {"w\35W\213\277"..., 4096}, ...], 33) = 135168
4118  readv(1206, [{"O\30\256|\350"..., 4096}, {"\316f\21|"..., 4096}, {"\346\330\354YU"..., 4096}, {"\257{R\5\16"..., 4096}, {"_C\n\21w"..., 4096}, ...], 33) = 135168
4118  writev(377, [{"O\30\256|\350"..., 4096}, {"\316f\21|"..., 4096}, {"\346\330\354YU"..., 4096}, {"\257{R\5\16"..., 4096}, {"_C\n\21w"..., 4096}, ...], 33) = 135168
4118  readv(1206, [{"\233p\217\356["..., 4096}, {"m\264\323F\7"..., 4096}, {"q\5\362/\21"..., 4096}, {"\262\353z(\251"..., 4096}, {"of\365\245U"..., 4096}, ...], 33) = 135168
4118  writev(377, [{"\233p\217\356["..., 4096}, {"m\264\323F\7"..., 4096}, {"q\5\362/\21"..., 4096}, {"\262\353z(\251"..., 4096}, {"of\365\245U"..., 4096}, ...], 33) = 135168
4118  readv(1206, [{"\257\3335X\300"..., 4096}, {"\207\37BW\252"..., 4096}, {"U\331a)"..., 4096}, {"\323\33i\256`"..., 4096}, {"\271m\356\]"..., 4096}, ...], 33) = 135168
4118  writev(377, [{"\257\3335X\300"..., 4096}, {"\207\37BW\252"..., 4096}, {"U\331a)"..., 4096}, {"\323\33i\256`"..., 4096}, {"\271m\356\]"..., 4096}, ...], 33) = 135168
4118  readv(1206, [{"b\\\337Y\240"..., 4096}, {"\233\r\326o\372"..., 4096}, {"\346(.\32\252"..., 4096}, {"\252FpJW"..., 4096}, {"\3648\237\220\352"..., 4096}, ...], 33) = 135168
4118  writev(377, [{"b\\\337Y\240"..., 4096}, {"\233\r\326o\372"..., 4096}, {"\346(.\32\252"..., 4096}, {"\252FpJW"..., 4096}, {"\3648\237\220\352"..., 4096}, ...], 33) = 135168
4118  readv(1206, [{"\376\375\257'\310"..., 4096}, {"\352\256R\342"..., 4096}, {"\361\340\342Rq"..., 4096}, {"|7 \3017"..., 4096}, {"\224\256\356\353\312"..., 4096}, ...], 33) = 135168
4118  writev(377, [{"\376\375\257'\310"..., 4096}, {"\352\256R\342"..., 4096}, {"\361\340\342Rq"..., 4096}, {"|7 \3017"..., 4096}, {"\224\256\356\353\312"..., 4096}, ...], 33) = 135168
4118  readv(1206, [{"}y\\vJ"..., 4096}, {"0$\v\6\2"..., 4096}, {"\2135\357zy"..., 4096}, {"{\343N\352\215"..., 4096}, {"\347\321x\352\272"..., 4096}, ...], 33) = 135168
4118  writev(377, [{"}y\\vJ"..., 4096}, {"0$\v\6\2"..., 4096}, {"\2135\357zy"..., 4096}, {"{\343N\352\215"..., 4096}, {"\347\321x\352\272"..., 4096}, ...], 33) = 135168
4118  readv(1206, [{"\v\6\2\301\200"..., 4096}, {"C\276\232\207\210"..., 4096}, {"\21\0006\262\255"..., 4096}, {"\224\222\n\276{"..., 4096}, {"Ys\337w\357"..., 4096}, ...], 33) = 135168
4118  writev(377, [{"\v\6\2\301\200"..., 4096}, {"C\276\232\207\210"..., 4096}, {"\21\0006\262\255"..., 4096}, {"\224\222\n\276{"..., 4096}, {"Ys\337w\357"..., 4096}, ...], 33) = 135168
4118  readv(1206, [{"6Y\236W\345"..., 4096}, {"Q\207uu\252"..., 4096}, {"\32\346]\313i"..., 4096}, {"n\356\\-\336"..., 4096}, {"{y~]\247"..., 4096}, ...], 33) = 135168
4118  writev(377, [{"6Y\236W\345"..., 4096}, {"Q\207uu\252"..., 4096}, {"\32\346]\313i"..., 4096}, {"n\356\\-\336"..., 4096}, {"{y~]\247"..., 4096}, ...], 33) = 135168
4118  readv(1206, [{"H0\337\275\302"..., 4096}, {"g\177\225\316\333"..., 4096}, {"\364\212\374X\360"..., 4096}, {"\337\260\226XL"..., 4096}, {"Y\356\360\301r"..., 4096}, ...], 33) = 135168
4118  writev(377, [{"H0\337\275\302"..., 4096}, {"g\177\225\316\333"..., 4096}, {"\364\212\374X\360"..., 4096}, {"\337\260\226XL"..., 4096}, {"Y\356\360\301r"..., 4096}, ...], 33) = 135168
4118  readv(1206, [{"_'\255\374v"..., 4096}, {"\271\231/II"..., 4096}, {"\277]\274\200\253"..., 4096}, {"'\3Qe\244"..., 4096}, {"\341\361\210h\363"..., 4096}, ...], 33) = 135168
4118  writev(377, [{"_'\255\374v"..., 4096}, {"\271\231/II"..., 4096}, {"\277]\274\200\253"..., 4096}, {"'\3Qe\244"..., 4096},
Re: [ceph-users] Cephfs and ERESTARTSYS on writes
On 07/23/2015 05:25 PM, Ilya Dryomov wrote:
> On Thu, Jul 23, 2015 at 6:02 PM, Vedran Furač <vedran.fu...@gmail.com> wrote:
>> 4118  writev(377, [{"\5\356\307l\361"..., 4096}, {"\337\261\17\257"..., 4096}, {"\211;s\310"..., 4096}, {"\370N\372:\252"..., 4096}, {"\202\311/\347\260"..., 4096}, ...], 33) = ? ERESTARTSYS (To be restarted)
>> 4118  --- SIGALRM (Alarm clock) @ 0 (0) ---
>> 4118  rt_sigreturn(0xe) = -1 EINTR (Interrupted system call)
>> 4118  gettid() = 4118
>> 4118  write(4, "2015/"..., 520) = 520
>> 4118  close(1206) = 0
>> 4118  unlink("/home/ceph/temp/45/45/5/154545") = 0
>
> Sorry, I misread your original email and missed the nginx part entirely. Looks like Zheng, who commented on IRC, was right:
>
>   "the ERESTARTSYS is likely caused by some timeout mechanism in nginx"
>   "signal handler for SIGALARM does not want to restart the write syscall"

Knowing that this might be an nginx issue as well, I asked the same thing on their mailing list in parallel; their response was:

"It more looks like a bug in cephfs. writev() should never return ERESTARTSYS."
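To make Zheng's point concrete: whether the kernel transparently restarts an interrupted write() or fails it with EINTR depends on the SA_RESTART flag of the interrupting signal's handler. A minimal C sketch (my own illustration, not nginx source; per Ilya's earlier comment, nginx installs its SIGALRM handler without SA_RESTART):

    #include <signal.h>
    #include <string.h>

    static void on_alarm(int sig) { (void)sig; }  /* empty handler */

    static void install_alarm_handler(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_alarm;
        sa.sa_flags = 0;           /* without SA_RESTART, a write()
                                      interrupted by SIGALRM fails
                                      with EINTR */
        /* sa.sa_flags = SA_RESTART;  with it, the kernel restarts
                                      the write() instead */
        sigaction(SIGALRM, &sa, NULL);
    }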
Re: [ceph-users] Cephfs and ERESTARTSYS on writes
On 07/23/2015 06:47 PM, Ilya Dryomov wrote:
> To me this looks like a writev() interrupted by a SIGALRM. I think the nginx guys read your original email the same way I did, which is "write syscall *returned* ERESTARTSYS", but I'm pretty sure that is not the case here. ERESTARTSYS shows up in strace output but it is handled by the kernel; userspace doesn't see it (but strace has to be able to see it, otherwise you wouldn't know if your system call has been restarted or not). You cut the output short - I asked for the entire output for a reason, please paste it somewhere.

Might be, however I don't know why nginx would be interrupting it; all writes are done pretty fast and timeouts are set to 10 minutes. Here are 2 examples from 2 servers with slightly different configs (timestamps included):

http://pastebin.com/wUAAcdT7
http://pastebin.com/wHyWc9U5

Thanks,
Vedran
[ceph-users] Failures with Ceph without redundancy/replication
Hello,

I'm experimenting with ceph for caching. It's configured with size=1 (so no redundancy/replication) and exported via cephfs to clients. Now I'm wondering: what happens if an SSD dies and all of its data is lost? I'm seeing files stored as 4MB chunks in PGs. Do we know whether a whole file saved through cephfs (all its chunks) ends up in a single PG (or at least in multiple PGs within a single OSD), or whether it might be spread over multiple OSDs? In the latter case an SSD failure would effectively mean losing more data than fits on a single drive, or even worse, massive corruption potentially affecting most of the content.

Note that losing a single drive and all of its data (so 1% in the case of 100 drives) isn't an issue for me. However, losing much more, or files being silently corrupted with holes in them, is unacceptable. I would then have to go with some erasure coding.

Thanks,
Vedran
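One way to check this empirically, if I understand the data layout correctly: cephfs stores a file as objects named <inode-hex>.<chunk-index>, and each object is hashed to a PG independently, so consecutive chunks of one file generally land in different PGs and on different OSDs. A sketch (pool and object names here are made-up placeholders):

    # list the 4MB chunks of one file (the hex prefix is its inode number)
    rados -p data ls | grep '^10000000abc\.'

    # see which PG and OSDs each chunk maps to
    ceph osd map data 10000000abc.00000000
    ceph osd map data 10000000abc.00000001

If the chunks do map to different OSDs, then with size=1 a single failed drive would leave holes in many files rather than cleanly losing a subset of them.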