Re: page allocation failures on osd nodes
On Sun, Jan 27, 2013 at 2:52 PM, Andrey Korolyov <and...@xdel.ru> wrote:
> Ahem. Once, on an almost empty node, the same trace was produced by a
> qemu process (which was actually pinned to a specific NUMA node), so it
> seems this is some generic scheduler/mm bug, not directly related to the
> osd processes. In other words, the lower the percentage of memory that
> is actually RSS, the higher the probability of such an allocation
> failure.

This might be a known bug in xen for your kernel? The xen users list
might be able to help.
-sam
Re: RadosGW performance and disk space usage
On 01/27/2013 11:10 PM, Cesar Mello wrote:
> Hi,
>
> Just tried rest-bench. This little tool is wonderful, thanks! I still
> have to learn lots of things, so please don't spend much time
> explaining it to me; instead, please give me any pointers to
> documentation or source code that could be useful.
>
> As a curiosity, I'm pasting the results from my laptop. I'll repeat
> the same tests using my desktop as the server. Notice there is an
> assert being triggered, so I guess I'm running a build with debugging
> code?! I compiled using ./configure --with-radosgw --with-rest-bench
> followed by make.

Asserts are usually used to mark invariants in the code logic, and are
always built, regardless of debugging being enabled or disabled. Given
you are hitting one, it probably means something is not quite right
(might be a bug, or some invariant was broken for some reason).

> common/WorkQueue.cc: In function 'virtual ThreadPool::~ThreadPool()'
> thread 7f1211401780 time 2013-01-27 20:51:01.196525
> common/WorkQueue.cc: 59: FAILED assert(_threads.empty())
>  ceph version 0.56-464-gfa421cf (fa421cf5f52ca16fa1328dbea2f4bda85c56cd3f)
>  1: (ThreadPool::~ThreadPool()+0x10c) [0x43bf9c]
>  2: (RESTDispatcher::~RESTDispatcher()+0xf1) [0x42a021]
>  3: (main()+0x75b) [0x42521b]
>  4: (__libc_start_main()+0xed) [0x7f120f37576d]
>  5: rest-bench() [0x426079]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.

Looks like http://tracker.newdream.net/issues/3896

I'm not sure who should be made aware of this issue though. Maybe
Yehuda (cc'ing)?

  -Joao

> 2013-01-27 20:51:01.197253 7f1211401780 -1 common/WorkQueue.cc: In
> function 'virtual ThreadPool::~ThreadPool()' thread 7f1211401780 time
> 2013-01-27 20:51:01.196525
> common/WorkQueue.cc: 59: FAILED assert(_threads.empty())
>
>  ceph version 0.56-464-gfa421cf (fa421cf5f52ca16fa1328dbea2f4bda85c56cd3f)
>  1: (ThreadPool::~ThreadPool()+0x10c) [0x43bf9c]
>  2: (RESTDispatcher::~RESTDispatcher()+0xf1) [0x42a021]
>  3: (main()+0x75b) [0x42521b]
>  4: (__libc_start_main()+0xed) [0x7f120f37576d]
>  5: rest-bench() [0x426079]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
> --- begin dump of recent events ---
>    -11> 2013-01-27 20:49:09.292227 7f1211401780  5 asok(0x29dc270) register_command perfcounters_dump hook 0x29dc440
>    -10> 2013-01-27 20:49:09.292259 7f1211401780  5 asok(0x29dc270) register_command 1 hook 0x29dc440
>     -9> 2013-01-27 20:49:09.292262 7f1211401780  5 asok(0x29dc270) register_command perf dump hook 0x29dc440
>     -8> 2013-01-27 20:49:09.292271 7f1211401780  5 asok(0x29dc270) register_command perfcounters_schema hook 0x29dc440
>     -7> 2013-01-27 20:49:09.292275 7f1211401780  5 asok(0x29dc270) register_command 2 hook 0x29dc440
>     -6> 2013-01-27 20:49:09.292278 7f1211401780  5 asok(0x29dc270) register_command perf schema hook 0x29dc440
>     -5> 2013-01-27 20:49:09.292285 7f1211401780  5 asok(0x29dc270) register_command config show hook 0x29dc440
>     -4> 2013-01-27 20:49:09.292290 7f1211401780  5 asok(0x29dc270) register_command config set hook 0x29dc440
>     -3> 2013-01-27 20:49:09.292292 7f1211401780  5 asok(0x29dc270) register_command log flush hook 0x29dc440
>     -2> 2013-01-27 20:49:09.292296 7f1211401780  5 asok(0x29dc270) register_command log dump hook 0x29dc440
>     -1> 2013-01-27 20:49:09.292300 7f1211401780  5 asok(0x29dc270) register_command log reopen hook 0x29dc440
>      0> 2013-01-27 20:51:01.197253 7f1211401780 -1 common/WorkQueue.cc: In function 'virtual ThreadPool::~ThreadPool()' thread 7f1211401780 time 2013-01-27 20:51:01.196525
> common/WorkQueue.cc: 59: FAILED assert(_threads.empty())
>
>  ceph version 0.56-464-gfa421cf (fa421cf5f52ca16fa1328dbea2f4bda85c56cd3f)
>  1: (ThreadPool::~ThreadPool()+0x10c) [0x43bf9c]
>  2: (RESTDispatcher::~RESTDispatcher()+0xf1) [0x42a021]
>  3: (main()+0x75b) [0x42521b]
>  4: (__libc_start_main()+0xed) [0x7f120f37576d]
>  5: rest-bench() [0x426079]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    0/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/ 5 hadoop
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>   -2/-2 (syslog threshold)
>   99/99 (stderr threshold)
>   max_recent 10
>   max_new 1000
>   log_file
> --- end dump of recent events ---
>
> terminate called after throwing an instance of 'ceph::FailedAssertion'
> *** Caught signal (Aborted) **
>  in thread
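One practical note for anyone chasing the same crash: as Joao says above,
ceph builds its asserts unconditionally (they are not the NDEBUG-controlled
assert(3) of a typical release build), and the invariant enforced here is
that a ThreadPool must have been stopped - all worker threads joined -
before it is destroyed. A minimal userspace sketch of the same contract,
with hypothetical names (build with -pthread); only the invariant mirrors
common/WorkQueue.cc:

#include <assert.h>
#include <pthread.h>

/* A worker pool that, like ceph's ThreadPool, insists on being
 * stopped before it is torn down. */
struct pool {
	pthread_t thread;
	int started;
};

static void *worker(void *arg)
{
	return NULL;	/* real work would go here */
}

static void pool_start(struct pool *p)
{
	if (pthread_create(&p->thread, NULL, worker, NULL) == 0)
		p->started = 1;
}

static void pool_stop(struct pool *p)
{
	if (p->started) {
		pthread_join(p->thread, NULL);
		p->started = 0;
	}
}

static void pool_destroy(struct pool *p)
{
	/* the analogue of "FAILED assert(_threads.empty())": destroying
	 * a pool with live threads is a caller bug; note ceph's own
	 * assert stays on even with NDEBUG, unlike the standard one
	 * used here for illustration */
	assert(!p->started);
}

int main(void)
{
	struct pool p = { .started = 0 };

	pool_start(&p);
	pool_stop(&p);		/* omit this and pool_destroy() aborts */
	pool_destroy(&p);
	return 0;
}

The backtrace above suggests rest-bench reaches RESTDispatcher's
ThreadPool destructor without the stop step, which is apparently what the
tracker issue covers.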
[PATCH 1/3] configure: fix check for fuse_getgroups()
Check for fuse_getgroups() only in case we have already found libfuse:
move the check into the handling of --with-fuse. Small fix: fix the
string for NO_ATOMIC_OPS, don't use '.

Signed-off-by: Danny Al-Gaaf <danny.al-g...@bisect.de>
---
 configure.ac | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/configure.ac b/configure.ac
index ffbd150..d6cff80 100644
--- a/configure.ac
+++ b/configure.ac
@@ -241,6 +241,8 @@ AS_IF([test "x$with_fuse" != xno],
 	    AC_DEFINE([HAVE_LIBFUSE], [1],
 		      [Define if you have fuse])
 	    HAVE_LIBFUSE=1
+	    # look for fuse_getgroups and define FUSE_GETGROUPS if found
+	    AC_CHECK_FUNCS([fuse_getgroups])
 	   ],
 	  [AC_MSG_FAILURE(
 	     [no FUSE found (use --without-fuse to disable)])])])
@@ -391,7 +393,8 @@ AS_IF([test "x$with_libatomic_ops" != xno],
 	  ])])
 AS_IF([test "$HAVE_ATOMIC_OPS" = "1"],
 	[],
-	AC_DEFINE([NO_ATOMIC_OPS], [1], [Defined if you don't have atomic_ops]))
+	[AC_DEFINE([NO_ATOMIC_OPS], [1], [Defined if you do not have atomic_ops])])
+
 AM_CONDITIONAL(WITH_LIBATOMIC, [test "$HAVE_ATOMIC_OPS" = "1"])
 
 # newsyn? requires mpi.
@@ -417,9 +420,6 @@ AS_IF([test "x$with_system_leveldb" = xcheck],
 	    [AC_CHECK_LIB([leveldb], [leveldb_open], [with_system_leveldb=yes], [], [-lsnappy -lpthread])])
 AM_CONDITIONAL(WITH_SYSTEM_LEVELDB, [ test "$with_system_leveldb" = "yes" ])
 
-# look for fuse_getgroups and define FUSE_GETGROUPS if found
-AC_CHECK_FUNCS([fuse_getgroups])
-
 # use system libs3?
 AC_ARG_WITH([system-libs3],
 	    [AS_HELP_STRING([--with-system-libs3], [use system libs3])],
--
1.8.1.1
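For context on what the move buys: AC_CHECK_FUNCS([fuse_getgroups])
compiles and links a test program, so it can only succeed once libfuse is
known to be present and linkable, and on success it defines
HAVE_FUSE_GETGROUPS in config.h. A hypothetical consumer (not ceph's
actual code) would gate on that macro like this:

/* sketch: consuming the macro produced by AC_CHECK_FUNCS([fuse_getgroups]) */
#include <stdio.h>

#ifdef HAVE_CONFIG_H
#include "config.h"	/* generated by configure */
#endif

int main(void)
{
#ifdef HAVE_FUSE_GETGROUPS
	puts("fuse_getgroups() is available; supplementary groups can be resolved");
#else
	puts("no fuse_getgroups(); falling back to the primary gid only");
#endif
	return 0;
}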
[PATCH 0/3] fix some rbd-fuse related issues
Here are three patches to fix some issues with the new rbd-fuse code and
an issue with the fuse handling in configure.

Danny Al-Gaaf (3):
  configure: fix check for fuse_getgroups()
  rbd-fuse: fix usage of conn->want
  rbd-fuse: fix printf format for off_t and size_t

 configure.ac            |  8 ++++----
 src/rbd_fuse/rbd-fuse.c | 12 +++++++-----
 2 files changed, 11 insertions(+), 9 deletions(-)

--
1.8.1.1
[PATCH 3/3] rbd-fuse: fix printf format for off_t and size_t
Fix the printf formats for off_t and size_t to print the same on 32-
and 64-bit systems: use the PRI* macros from inttypes.h.

Signed-off-by: Danny Al-Gaaf <danny.al-g...@bisect.de>
---
 src/rbd_fuse/rbd-fuse.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/src/rbd_fuse/rbd-fuse.c b/src/rbd_fuse/rbd-fuse.c
index c204463..748976a 100644
--- a/src/rbd_fuse/rbd-fuse.c
+++ b/src/rbd_fuse/rbd-fuse.c
@@ -15,6 +15,7 @@
 #include <sys/types.h>
 #include <unistd.h>
 #include <getopt.h>
+#include <inttypes.h>
 
 #include "include/rbd/librbd.h"
 
@@ -321,7 +322,7 @@ static int rbdfs_write(const char *path, const char *buf, size_t size,
 
 	if (offset + size > rbdsize(fi->fh)) {
 		int r;
-		fprintf(stderr, "rbdfs_write resizing %s to 0x%lx\n",
+		fprintf(stderr, "rbdfs_write resizing %s to 0x%" PRIxMAX "\n",
 			path, offset+size);
 		r = rbd_resize(rbd->image, offset+size);
 		if (r < 0)
@@ -516,7 +517,7 @@ rbdfs_truncate(const char *path, off_t size)
 		return -ENOENT;
 	rbd = opentbl[fd];
 
-	fprintf(stderr, "truncate %s to %ld (0x%lx)\n", path, size, size);
+	fprintf(stderr, "truncate %s to %" PRIdMAX " (0x%" PRIxMAX ")\n", path, size, size);
 	r = rbd_resize(rbd->image, size);
 	if (r < 0)
 		return r;
@@ -559,7 +560,7 @@ rbdfs_setxattr(const char *path, const char *name, const char *value,
 	for (ap = attrs; ap->attrname != NULL; ap++) {
 		if (strcmp(name, ap->attrname) == 0) {
 			*ap->attrvalp = strtoull(value, NULL, 0);
-			fprintf(stderr, "rbd-fuse: %s set to 0x%lx\n",
+			fprintf(stderr, "rbd-fuse: %s set to 0x%" PRIx64 "\n",
 				ap->attrname, *ap->attrvalp);
 			return 0;
 		}
@@ -578,7 +579,7 @@ rbdfs_getxattr(const char *path, const char *name, char *value,
 
 	for (ap = attrs; ap->attrname != NULL; ap++) {
 		if (strcmp(name, ap->attrname) == 0) {
-			sprintf(buf, "%lu", *ap->attrvalp);
+			sprintf(buf, "%" PRIu64, *ap->attrvalp);
 			if (value != NULL && size >= strlen(buf))
 				strcpy(value, buf);
 			fprintf(stderr, "rbd-fuse: get %s\n", ap->attrname);
--
1.8.1.1
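Note that the patch above passes the off_t/size_t values straight
through; PRIdMAX and PRIxMAX formally expect intmax_t/uintmax_t
arguments, so a cast, as below, is the strictly portable form. A
standalone sketch of the pattern (values are illustrative):

/* sketch: printing off_t and size_t consistently on 32- and 64-bit
 * systems; the casts make the argument types match the PRI* macros */
#include <inttypes.h>
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
	off_t offset = 0x100000;
	size_t size = 4096;

	printf("resizing to 0x%" PRIxMAX "\n", (uintmax_t)(offset + size));
	printf("truncate to %" PRIdMAX " (0x%" PRIxMAX ")\n",
	       (intmax_t)offset, (uintmax_t)offset);
	return 0;
}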
[PATCH 2/3] rbd-fuse: fix usage of conn->want
Fix the usage of conn->want and FUSE_CAP_BIG_WRITES: both need libfuse
version >= 2.8. Encapsulate the related code line in a check for the
needed FUSE_VERSION, as already done in ceph-fuse in some cases.

Signed-off-by: Danny Al-Gaaf <danny.al-g...@bisect.de>
---
 src/rbd_fuse/rbd-fuse.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/rbd_fuse/rbd-fuse.c b/src/rbd_fuse/rbd-fuse.c
index b3e318f..c204463 100644
--- a/src/rbd_fuse/rbd-fuse.c
+++ b/src/rbd_fuse/rbd-fuse.c
@@ -461,8 +461,9 @@ rbdfs_init(struct fuse_conn_info *conn)
 	ret = rados_ioctx_create(cluster, pool_name, &ioctx);
 	if (ret < 0)
 		exit(91);
-
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(2, 8)
 	conn->want |= FUSE_CAP_BIG_WRITES;
+#endif
 	gotrados = 1;
 
 	// init's return value shows up in fuse_context.private_data,
--
1.8.1.1
Re: page allocation failures on osd nodes
On Mon, Jan 28, 2013 at 5:48 PM, Sam Lang <sam.l...@inktank.com> wrote:
> On Sun, Jan 27, 2013 at 2:52 PM, Andrey Korolyov <and...@xdel.ru> wrote:
>> Ahem. Once, on an almost empty node, the same trace was produced by a
>> qemu process (which was actually pinned to a specific NUMA node), so
>> it seems this is some generic scheduler/mm bug, not directly related
>> to the osd processes. In other words, the lower the percentage of
>> memory that is actually RSS, the higher the probability of such an
>> allocation failure.
>
> This might be a known bug in xen for your kernel? The xen users list
> might be able to help.
> -sam

It is vanilla 3.4; I really wonder where the paravirt bits in the trace
come from.
Re: RadosGW performance and disk space usage
On Mon, Jan 28, 2013 at 6:28 AM, Joao Eduardo Luis
<joao.l...@inktank.com> wrote:
> On 01/27/2013 11:10 PM, Cesar Mello wrote:
>> Hi,
>>
>> Just tried rest-bench. This little tool is wonderful, thanks! I still
>> have to learn lots of things, so please don't spend much time
>> explaining it to me; instead, please give me any pointers to
>> documentation or source code that could be useful.
>>
>> As a curiosity, I'm pasting the results from my laptop. I'll repeat
>> the same tests using my desktop as the server. Notice there is an
>> assert being triggered, so I guess I'm running a build with debugging
>> code?! I compiled using ./configure --with-radosgw --with-rest-bench
>> followed by make.
>
> Asserts are usually used to mark invariants in the code logic, and are
> always built, regardless of debugging being enabled or disabled. Given
> you are hitting one, it probably means something is not quite right
> (might be a bug, or some invariant was broken for some reason).
>
>> common/WorkQueue.cc: In function 'virtual ThreadPool::~ThreadPool()'
>> thread 7f1211401780 time 2013-01-27 20:51:01.196525
>> common/WorkQueue.cc: 59: FAILED assert(_threads.empty())
>>  ceph version 0.56-464-gfa421cf (fa421cf5f52ca16fa1328dbea2f4bda85c56cd3f)
>>  1: (ThreadPool::~ThreadPool()+0x10c) [0x43bf9c]
>>  2: (RESTDispatcher::~RESTDispatcher()+0xf1) [0x42a021]
>>  3: (main()+0x75b) [0x42521b]
>>  4: (__libc_start_main()+0xed) [0x7f120f37576d]
>>  5: rest-bench() [0x426079]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>
> Looks like http://tracker.newdream.net/issues/3896

Right, 3896. Probably some cleanup-before-shutdown issues.

Yehuda
Re: RadosGW performance and disk space usage
On Sun, Jan 27, 2013 at 3:10 PM, Cesar Mello <cme...@gmail.com> wrote:
> Hi,
>
> Just tried rest-bench. This little tool is wonderful, thanks! I still
> have to learn lots of things. So please don't spend much time
> explaining me, but instead please give me any pointers to
> documentation or source code that can be useful.
>
> As a curiosity, I'm pasting the results from my laptop. I'll repeat
> the same tests using my desktop as the server. Notice there is an
> assert being triggered, so I guess I'm running a build with debugging
> code?! I compiled using ./configure --with-radosgw --with-rest-bench
> followed by make.
>
> Thanks a lot for the attention. Best regards!
> Mello
>
> root@l3:/etc/init.d# rest-bench --api-host=localhost --bucket=test
> --access-key=JJABVJ3AWBS1ZOCML7NS
> --secret=A+ecBz2+Sdxj4Y8Mo+u3akIewGvJPkwOhwRgPKkX --protocol=http
> --uri_style=path write
> host=localhost
> Maintaining 16 concurrent writes of 4194304 bytes for at least 60 seconds.
> Object prefix: benchmark_data_l3_4032
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>     0       3         3         0         0         0         -         0
>     1      16        16         0         0         0         -         0
>     2      16        16         0         0         0         -         0
>     3      16        16         0         0         0         -         0
>     4      16        16         0         0         0         -         0
>     5      16        16         0         0         0         -         0
>     6      16        16         0         0         0         -         0
>     7      16        16         0         0         0         -         0
>     8      16        16         0         0         0         -         0
>     9      16        16         0         0         0         -         0
>    10      16        16         0         0         0         -         0
>    11      16        16         0         0         0         -         0
>    12      16        17         1  0.333265      0.33   11.2761   11.2761
>    13      16        18         2  0.615257         4   12.5964   11.9363
>    14      16        20         4   1.14262         8   13.1392   12.5365
>    15      16        23         7   1.86628        12   14.2273   13.2594
>    16      16        27        11   2.74944        16   15.0222   13.8968
>    17      16        32        16   3.76394        20   16.2604   14.6301
>    18      16        32        16   3.55483         0         -   14.6301
>    19      16        34        18    3.7887         4    6.2274   13.7695
> 2013-01-27 20:49:29.703509 min lat: 6.2274 max lat: 16.2604 avg lat: 13.7695
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>    20      16        34        18   3.59927         0         -   13.7695
>    21      16        34        18   3.42787         0         -   13.7695
>    22      16        34        18   3.27205         0         -   13.7695
>    23      16        35        19   3.30367         1   9.09053   13.5233
>    24      16        36        20   3.33264         4   9.09667    13.302
>    25      16        36        20   3.19933         0         -    13.302
>    26      16        36        20   3.07628         0         -    13.302
>    27      16        37        21   3.11047       1.3   11.2459    13.204
>    28      16        37        21   2.99938         0         -    13.204
>    29      16        37        21   2.89595         0         -    13.204
>    30      16        37        21   2.79942         0         -    13.204
>    31      16        37        21   2.70912         0         -    13.204
>    32      16        39        23   2.87441       1.6   14.9981   13.3602
>    33      16        39        23    2.7873         0         -   13.3602
>    34      16        39        23   2.70533         0         -   13.3602
>    35      16        40        24   2.74229       1.3   21.5365   13.7009
>    36      16        40        24   2.66612         0         -   13.7009
>    37      16        42        26   2.81023         4   22.6059   14.3855
>    38      16        42        26   2.73628         0         -   14.3855
>    39      16        45        29   2.97374         6   23.2615   15.3025
> 2013-01-27 20:49:49.707740 min lat: 6.2274 max lat: 23.4496 avg lat: 16.1307
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>    40      16        51        35   3.49927        24   21.0123   16.1307
>    41      16        51        35   3.41392         0         -   16.1307
>    42      16        52        36   3.42786         2   19.0243    16.211
>    43      16        52        36   3.34814         0         -    16.211
>    44      16        52        36   3.27204         0         -    16.211
>    45      16        52        36   3.19933         0         -    16.211
>    46      16        53        37   3.21672         1   11.0793   16.0723
>    47      16        53        37
Re: [PATCH 0/3] fix some rbd-fuse related issues
Thanks Danny, I'll look at these today.

On Jan 28, 2013, at 7:33 AM, Danny Al-Gaaf <danny.al-g...@bisect.de> wrote:

> Here are three patches to fix some issues with the new rbd-fuse code
> and an issue with the fuse handling in configure.
>
> Danny Al-Gaaf (3):
>   configure: fix check for fuse_getgroups()
>   rbd-fuse: fix usage of conn->want
>   rbd-fuse: fix printf format for off_t and size_t
>
>  configure.ac            |  8 ++++----
>  src/rbd_fuse/rbd-fuse.c | 12 +++++++-----
>  2 files changed, 11 insertions(+), 9 deletions(-)
>
> --
> 1.8.1.1
Re: can't download from radosgw
On Mon, Jan 28, 2013 at 3:55 AM, Gandalf Corvotempesta
<gandalf.corvotempe...@gmail.com> wrote:
> 2013/1/28 Gandalf Corvotempesta <gandalf.corvotempe...@gmail.com>:
>> 2013-01-28 12:22:27.759162 7fe8657c3700  0 NOTICE: failed to send response to client
>> 2013-01-28 12:22:27.759186 7fe8657c3700  0 ERROR: s->cio->print() returned err=-1
>> 2013-01-28 12:22:27.759206 7fe8657c3700  0 ERROR: s->cio->print() returned err=-1
>> 2013-01-28 12:22:27.759211 7fe8657c3700  0 ERROR: s->cio->print() returned err=-1
>> 2013-01-28 12:22:27.759216 7fe8657c3700  0 ERROR: s->cio->print() returned err=-1
>> 2013-01-28 12:22:27.759268 7fe8657c3700  2 req 128:0.051384:s3:GET /public2/shared/9780470398661.pdf:get_obj:http status=403
>> 2013-01-28 12:22:27.759336 7fe8657c3700  1 ====== req done req=0x3192980 http_status=403 ======
>
> This happens only with Google Chrome. Firefox, curl, wget and many
> others are able to download properly.

(resending to all)

It looks like the connection is closed early by the client (chrome).
Just a thought: maybe the content-type is not set correctly on the
object?

Yehuda
Re: RadosGW performance and disk space usage
Sure, I can later when I arrive home. With the end of my vacation, I'll
be able to devote a couple of hours after my 3-year-old sleeps. :-)

I guess my laptop hard disk has horrible seek times. I'll repeat the
tests on my desktop as soon as possible.

Thanks a lot for the attention!

Best regards
Mello

On Mon, Jan 28, 2013 at 3:35 PM, Yehuda Sadeh <yeh...@inktank.com> wrote:
> On Sun, Jan 27, 2013 at 3:10 PM, Cesar Mello <cme...@gmail.com> wrote:
>> Hi,
>>
>> Just tried rest-bench. This little tool is wonderful, thanks! [...]
Re: Geo-replication with RADOS GW
On Monday, January 28, 2013 at 9:54 AM, Ben Rowland wrote:
> Hi,
>
> I'm considering using Ceph to create a cluster across several data
> centres, with the strict requirement that writes should go to both
> DCs. This seems possible by specifying rules in the CRUSH map, with an
> understood latency hit resulting from purely synchronous writes.
>
> The part I'm unsure about is how the RADOS GW fits into this picture.
> For high availability (and to improve best-case latency on reads),
> we'd want to run a gateway in each data centre. However, the first
> paragraph of the following post suggests this is not possible:
>
> http://article.gmane.org/gmane.comp.file-systems.ceph.devel/12238
>
> Is there a hard restriction on how many radosgw instances can run
> across the cluster, or is the point of the above post more about a
> performance hit?

It's talking about the performance hit. Most people can't afford
data-center-level connectivity between two different buildings. ;) If
you did have a Ceph cluster split across two DCs (with the bandwidth to
support them), this would work fine. There aren't any strict limits on
the number of gateways you stick on a cluster, just the scaling costs
associated with cache invalidation notifications.

> It seems to me it should be possible to run more than one radosgw,
> particularly if each instance communicates with a local OSD which can
> proxy reads/writes to the primary (which may or may not be DC-local).

They aren't going to do this, though; each gateway will communicate
with the primaries directly.
-Greg
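For completeness, the CRUSH side of Ben's "writes go to both DCs"
requirement is a rule that places each replica under a different
datacenter bucket. A sketch only, assuming a bucket type named
datacenter exists in the hierarchy and a pool of size 2 (the names are
illustrative, not from the thread):

rule replicated_across_dcs {
	ruleset 1
	type replicated
	min_size 2
	max_size 2
	step take default
	step chooseleaf firstn 0 type datacenter
	step emit
}

With a replicated pool, the write is acknowledged only once all replicas
are written, which is where the synchronous-write latency hit Ben
mentions comes from.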
Re: [PATCH 0/2] fix some compiler warnings
I'd just noticed the utime one on my laptop 32-bit build and was trying
to figure out why our 32-bit nightly didn't see it. And Greg had seen
the system() build problem where I didn't, and I was isolating
differences there as well.

I purposely didn't spend time on the system() error handling because I
was thinking of those calls as best-effort; if they fail, the map will
likely fail anyway. But there's no harm in handling errors, particularly
if it'll shut the compiler up :)

On Jan 27, 2013, at 12:57 PM, Danny Al-Gaaf <danny.al-g...@bisect.de> wrote:

> Attached two patches to fix some compiler warnings.
>
> Danny Al-Gaaf (2):
>   utime: fix narrowing conversion compiler warning in sleep()
>   rbd: don't ignore return value of system()
>
>  src/include/utime.h |  2 +-
>  src/rbd.cc          | 36 ++++++++++++++++++++++++++++++------
>  2 files changed, 31 insertions(+), 7 deletions(-)
>
> --
> 1.8.1.1
Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
Gregory, I recreated the osd down problem again this morning on two
nodes (g13ct, g14ct). First, I created a 1-node cluster on g13ct (with
osd.0, 1, 2) and then added host g14ct (osd.3, 4, 5). osd.1 went down
for about a minute and a half after osd.3, 4, 5 were added. I have
included the routing table of each node at the time osd.1 went down.
The ceph.conf and ceph-osd.1.log files are attached. The crush map was
the default. Also, it could be a timing issue, because it does not
always fail when using the default crush map; it takes several trials
before you see it. Thank you.

[root@g13ct ~]# netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
default         133.164.98.250  0.0.0.0         UG        0 0          0 eth2
133.164.98.0    *               255.255.255.0   U         0 0          0 eth2
link-local      *               255.255.0.0     U         0 0          0 eth3
link-local      *               255.255.0.0     U         0 0          0 eth0
link-local      *               255.255.0.0     U         0 0          0 eth2
192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
192.0.0.0       *               255.0.0.0       U         0 0          0 eth0
192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
192.168.1.0     *               255.255.255.0   U         0 0          0 eth0

[root@g13ct ~]# ceph osd tree
# id    weight  type name       up/down reweight
-1      6       root default
-3      6               rack unknownrack
-2      3                       host g13ct
0       1                               osd.0   up      1
1       1                               osd.1   down    1
2       1                               osd.2   up      1
-4      3                       host g14ct
3       1                               osd.3   up      1
4       1                               osd.4   up      1
5       1                               osd.5   up      1

[root@g14ct ~]# ceph osd tree
(same output as above: osd.1 down, all other osds up)

[root@g14ct ~]# netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
default         133.164.98.250  0.0.0.0         UG        0 0          0 eth0
133.164.98.0    *               255.255.255.0   U         0 0          0 eth0
link-local      *               255.255.0.0     U         0 0          0 eth3
link-local      *               255.255.0.0     U         0 0          0 eth5
link-local      *               255.255.0.0     U         0 0          0 eth0
192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
192.0.0.0       *               255.0.0.0       U         0 0          0 eth5
192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
192.168.1.0     *               255.255.255.0   U         0 0          0 eth5

[root@g14ct ~]# ceph osd tree
(same output as above)

Isaac

----- Original Message -----
From: Isaac Otsiabah <zmoo...@yahoo.com>
To: Gregory Farnum <g...@inktank.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Sent: Friday, January 25, 2013 9:51 AM
Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

Gregory, the network physical layout is simple: the two networks are
separate; 192.168.0 and 192.168.1 are not subnets within one network.

Isaac

----- Original Message -----
From: Gregory Farnum <g...@inktank.com>
To: Isaac Otsiabah <zmoo...@yahoo.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Sent: Thursday, January 24, 2013 1:28 PM
Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

What's the physical layout of your networking? This additional log may
prove helpful as well, but I really need a bit more context in
evaluating the messages I see from the first one. :)
-Greg

On Thursday, January 24,
Re: [PATCH 07/25] mds: don't early reply rename
On Wed, 23 Jan 2013, Yan, Zheng wrote:
> From: Yan, Zheng <zheng.z@intel.com>
>
> _rename_finish() does not send dentry link/unlink messages to
> replicas. We should prevent dentries that are modified by the rename
> operation from getting new replicas while the rename operation is
> committing. So don't mark xlocks done and don't early reply for
> rename.

Can we change this to only skip the early reply if there are replicas?
Or change things so we do send those messages (or something similar)
early? As is, this will kill workloads like rsync that rename every
file.

Thanks!
s

> Signed-off-by: Yan, Zheng <zheng.z@intel.com>
> ---
>  src/mds/Server.cc | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/src/mds/Server.cc b/src/mds/Server.cc
> index eced76f..4492341 100644
> --- a/src/mds/Server.cc
> +++ b/src/mds/Server.cc
> @@ -796,6 +796,14 @@ void Server::early_reply(MDRequest *mdr, CInode *tracei, CDentry *tracedn)
>      return;
>    }
>
> +  // _rename_finish() does not send dentry link/unlink message to replicas.
> +  // so do not mark xlocks done, the xlocks prevent srcdn and destdn from
> +  // getting new replica.
> +  if (mdr->client_request->get_op() == CEPH_MDS_OP_RENAME) {
> +    dout(10) << "early_reply - rename, not allowed" << dendl;
> +    return;
> +  }
> +
>    MClientRequest *req = mdr->client_request;
>    entity_inst_t client_inst = req->get_source_inst();
>    if (client_inst.name.is_mds())
> --
> 1.7.11.7
[PATCH 0/2] rbd: manage racing opens/removes
A recent change to rbd prevented rbd devices from being unmapped while
they were in use. However, that change did not address a different but
related problem: it is possible for an open (the one that would bump the
open count from 0 to 1) to begin after a request to remove the rbd
device has decided it can proceed.

To fix this, define a new "removing" flag to prevent opens from
proceeding once removal of a device has begun. The first patch in this
series defines a new flags field, and uses it for this as well as for
the "exists" flag for snapshot mappings.

					-Alex

[PATCH 1/2] rbd: define flags field, use it for exists flag
[PATCH 2/2] rbd: prevent open for image being removed
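The protocol the two patches implement can be sketched in userspace form
(hypothetical names; the kernel code uses a spinlock and flag bits rather
than a pthread mutex and a bool):

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

struct dev {
	pthread_mutex_t lock;
	unsigned long open_count;
	bool removing;
};

static int dev_open(struct dev *d)
{
	int ret = 0;

	pthread_mutex_lock(&d->lock);
	if (d->removing)
		ret = -ENOENT;		/* removal under way: refuse */
	else
		d->open_count++;	/* counted before the lock drops */
	pthread_mutex_unlock(&d->lock);
	return ret;
}

static int dev_remove(struct dev *d)
{
	int ret = 0;

	pthread_mutex_lock(&d->lock);
	if (d->open_count)
		ret = -EBUSY;		/* still open somewhere */
	else
		d->removing = true;	/* no open can sneak in after this */
	pthread_mutex_unlock(&d->lock);
	return ret;
}

int main(void)
{
	struct dev d = { PTHREAD_MUTEX_INITIALIZER, 0, false };

	(void) dev_open(&d);			/* succeeds: not removing */
	return dev_remove(&d) == -EBUSY ? 0 : 1; /* busy while open */
}

Because the open-count check and the removing-flag update happen under
the same lock, the 0-to-1 window described above is closed.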
[PATCH 1/2] rbd: define flags field, use it for exists flag
Define a new rbd device "flags" field, manipulated using bit operations.
Replace the use of the current "exists" flag with a bit in this new
flags field. Add a little commentary about the "exists" flag, which does
not need to be manipulated atomically.

Signed-off-by: Alex Elder <el...@inktank.com>
---
 drivers/block/rbd.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 177ba0c..107df40 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -262,7 +262,7 @@ struct rbd_device {
 	spinlock_t		lock;		/* queue lock */
 
 	struct rbd_image_header	header;
-	atomic_t		exists;
+	unsigned long		flags;
 	struct rbd_spec		*spec;
 
 	char			*header_name;
@@ -291,6 +291,12 @@ struct rbd_device {
 	unsigned long		open_count;
 };
 
+/* Flag bits for rbd_dev->flags */
+
+enum rbd_dev_flags {
+	rbd_dev_flag_exists,	/* mapped snapshot has not been deleted */
+};
+
 static DEFINE_MUTEX(ctl_mutex);	  /* Serialize open/close/setup/teardown */
 
 static LIST_HEAD(rbd_dev_list);    /* devices */
@@ -790,7 +796,8 @@ static int rbd_dev_set_mapping(struct rbd_device *rbd_dev)
 			goto done;
 		rbd_dev->mapping.read_only = true;
 	}
-	atomic_set(&rbd_dev->exists, 1);
+	set_bit(rbd_dev_flag_exists, &rbd_dev->flags);
+
 done:
 	return ret;
 }
@@ -1886,9 +1893,14 @@ static void rbd_request_fn(struct request_queue *q)
 			rbd_assert(rbd_dev->spec->snap_id == CEPH_NOSNAP);
 		}
 
-		/* Quit early if the snapshot has disappeared */
-
-		if (!atomic_read(&rbd_dev->exists)) {
+		/*
+		 * Quit early if the mapped snapshot no longer
+		 * exists.  It's still possible the snapshot will
+		 * have disappeared by the time our request arrives
+		 * at the osd, but there's no sense in sending it if
+		 * we already know.
+		 */
+		if (!test_bit(rbd_dev_flag_exists, &rbd_dev->flags)) {
 			dout("request for non-existent snapshot");
 			rbd_assert(rbd_dev->spec->snap_id != CEPH_NOSNAP);
 			result = -ENXIO;
@@ -2578,7 +2590,7 @@ struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
 		return NULL;
 
 	spin_lock_init(&rbd_dev->lock);
-	atomic_set(&rbd_dev->exists, 0);
+	rbd_dev->flags = 0;
 	INIT_LIST_HEAD(&rbd_dev->node);
 	INIT_LIST_HEAD(&rbd_dev->snaps);
 	init_rwsem(&rbd_dev->header_rwsem);
@@ -3207,10 +3219,17 @@ static int rbd_dev_snaps_update(struct rbd_device *rbd_dev)
 		if (snap_id == CEPH_NOSNAP || (snap && snap->id > snap_id)) {
 			struct list_head *next = links->next;
 
-			/* Existing snapshot not in the new snap context */
-
+			/*
+			 * A previously-existing snapshot is not in
+			 * the new snap context.
+			 *
+			 * If the now missing snapshot is the one the
+			 * image is mapped to, clear its exists flag
+			 * so we can avoid sending any more requests
+			 * to it.
+			 */
 			if (rbd_dev->spec->snap_id == snap->id)
-				atomic_set(&rbd_dev->exists, 0);
+				clear_bit(rbd_dev_flag_exists, &rbd_dev->flags);
 			rbd_remove_snap_dev(snap);
 			dout("%ssnap id %llu has been removed\n",
 			     rbd_dev->spec->snap_id == snap->id ?
--
1.7.9.5
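For readers less familiar with the kernel helpers: set_bit() and
clear_bit() are atomic read-modify-write operations on a bit of an
unsigned long, and test_bit() is a plain read, so several independent
flags can share one word; that is what lets the patch fold the old
atomic_t into a bit. A rough userspace analogue using C11 atomics (the
kernel helpers take a bit number and a word pointer; this is
illustration, not kernel code):

#include <stdatomic.h>
#include <stdio.h>

enum { FLAG_EXISTS };	/* bit number, like rbd_dev_flag_exists */

static atomic_ulong flags;

int main(void)
{
	atomic_fetch_or(&flags, 1UL << FLAG_EXISTS);	/* ~ set_bit()   */
	if (atomic_load(&flags) & (1UL << FLAG_EXISTS))	/* ~ test_bit()  */
		printf("mapped snapshot exists\n");
	atomic_fetch_and(&flags, ~(1UL << FLAG_EXISTS)); /* ~ clear_bit() */
	return 0;
}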
[PATCH 2/2] rbd: prevent open for image being removed
An open request for a mapped rbd image can arrive while removal of that
mapping is underway. We need to prevent such an open request from
succeeding. (It appears that Maciej Galkiewicz ran into this problem.)

Define and use a "removing" flag to indicate a mapping is getting
removed. Set it in the remove path after verifying nothing holds the
device open. And check it in the open path before allowing the open to
proceed. Acquire the rbd device's lock around each of these spots to
avoid any races accessing the flags and open_count fields.

This addresses: http://tracker.newdream.net/issues/3427

Reported-by: Maciej Galkiewicz <maciejgalkiew...@ragnarson.com>
Signed-off-by: Alex Elder <el...@inktank.com>
---
 drivers/block/rbd.c | 42 +++++++++++++++++++++++++++++++++---------
 1 file changed, 33 insertions(+), 9 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 107df40..03b15b8 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -259,10 +259,10 @@ struct rbd_device {
 
 	char			name[DEV_NAME_LEN]; /* blkdev name, e.g. rbd3 */
 
-	spinlock_t		lock;		/* queue lock */
+	spinlock_t		lock;		/* queue, flags, open_count */
 
 	struct rbd_image_header	header;
-	unsigned long		flags;
+	unsigned long		flags;		/* possibly lock protected */
 	struct rbd_spec		*spec;
 
 	char			*header_name;
@@ -288,13 +288,20 @@ struct rbd_device {
 	/* sysfs related */
 	struct device		dev;
 
-	unsigned long		open_count;
+	unsigned long		open_count;	/* protected by lock */
 };
 
-/* Flag bits for rbd_dev->flags */
+/*
+ * Flag bits for rbd_dev->flags.  If atomicity is required,
+ * rbd_dev->lock is used to protect access.
+ *
+ * Currently, only the removing flag (which is coupled with the
+ * open_count field) requires atomic access.
+ */
 
 enum rbd_dev_flags {
 	rbd_dev_flag_exists,	/* mapped snapshot has not been deleted */
+	rbd_dev_flag_removing,	/* this mapping is being removed */
 };
 
 static DEFINE_MUTEX(ctl_mutex);	  /* Serialize open/close/setup/teardown */
@@ -383,14 +390,23 @@ static int rbd_dev_v2_refresh(struct rbd_device *rbd_dev, u64 *hver);
 static int rbd_open(struct block_device *bdev, fmode_t mode)
 {
 	struct rbd_device *rbd_dev = bdev->bd_disk->private_data;
+	bool removing = false;
 
 	if ((mode & FMODE_WRITE) && rbd_dev->mapping.read_only)
 		return -EROFS;
 
+	spin_lock(&rbd_dev->lock);
+	if (test_bit(rbd_dev_flag_removing, &rbd_dev->flags))
+		removing = true;
+	else
+		rbd_dev->open_count++;
+	spin_unlock(&rbd_dev->lock);
+	if (removing)
+		return -ENOENT;
+
 	mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
 	(void) get_device(&rbd_dev->dev);
 	set_device_ro(bdev, rbd_dev->mapping.read_only);
-	rbd_dev->open_count++;
 	mutex_unlock(&ctl_mutex);
 
 	return 0;
@@ -399,10 +415,14 @@ static int rbd_open(struct block_device *bdev, fmode_t mode)
 static int rbd_release(struct gendisk *disk, fmode_t mode)
 {
 	struct rbd_device *rbd_dev = disk->private_data;
+	unsigned long open_count_before;
+
+	spin_lock(&rbd_dev->lock);
+	open_count_before = rbd_dev->open_count--;
+	spin_unlock(&rbd_dev->lock);
+	rbd_assert(open_count_before > 0);
 
 	mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
-	rbd_assert(rbd_dev->open_count > 0);
-	rbd_dev->open_count--;
 	put_device(&rbd_dev->dev);
 	mutex_unlock(&ctl_mutex);
 
@@ -4135,10 +4155,14 @@ static ssize_t rbd_remove(struct bus_type *bus,
 		goto done;
 	}
 
-	if (rbd_dev->open_count) {
+	spin_lock(&rbd_dev->lock);
+	if (rbd_dev->open_count)
 		ret = -EBUSY;
+	else
+		set_bit(rbd_dev_flag_removing, &rbd_dev->flags);
+	spin_unlock(&rbd_dev->lock);
+	if (ret < 0)
 		goto done;
-	}
 
 	while (rbd_dev->parent_spec) {
 		struct rbd_device *first = rbd_dev;
--
1.7.9.5
Re: [PATCH 0/3] fix some rbd-fuse related issues
Actually, Sage merged them into master. Thanks again.

On 01/28/2013 09:45 AM, Dan Mick wrote:
> Thanks Danny, I'll look at these today.
>
> On Jan 28, 2013, at 7:33 AM, Danny Al-Gaaf <danny.al-g...@bisect.de> wrote:
>
>> Here are three patches to fix some issues with the new rbd-fuse code
>> and an issue with the fuse handling in configure.
>>
>> Danny Al-Gaaf (3):
>>   configure: fix check for fuse_getgroups()
>>   rbd-fuse: fix usage of conn->want
>>   rbd-fuse: fix printf format for off_t and size_t
>>
>>  configure.ac            |  8 ++++----
>>  src/rbd_fuse/rbd-fuse.c | 12 +++++++-----
>>  2 files changed, 11 insertions(+), 9 deletions(-)
>>
>> --
>> 1.8.1.1
Re: [PATCH 0/2] fix some compiler warnings
Sage merged these into master. Thanks!

On 01/27/2013 12:57 PM, Danny Al-Gaaf wrote:
> Attached two patches to fix some compiler warnings.
>
> Danny Al-Gaaf (2):
>   utime: fix narrowing conversion compiler warning in sleep()
>   rbd: don't ignore return value of system()
>
>  src/include/utime.h |  2 +-
>  src/rbd.cc          | 36 ++++++++++++++++++++++++++++++------
>  2 files changed, 31 insertions(+), 7 deletions(-)
Re: [PATCH 07/25] mds: don't early reply rename
On 01/29/2013 05:44 AM, Sage Weil wrote:
> On Wed, 23 Jan 2013, Yan, Zheng wrote:
>> From: Yan, Zheng <zheng.z@intel.com>
>>
>> _rename_finish() does not send dentry link/unlink messages to
>> replicas. We should prevent dentries that are modified by the rename
>> operation from getting new replicas while the rename operation is
>> committing. So don't mark xlocks done and don't early reply for
>> rename.
>
> Can we change this to only skip the early reply if there are replicas?
> Or change things so we do send those messages (or something similar)
> early? As is, this will kill workloads like rsync that rename every
> file.

How about not marking xlocks on dentries done?

Regards
Yan, Zheng

> Thanks!
> s
>
>> Signed-off-by: Yan, Zheng <zheng.z@intel.com>
>> [...]
Re: [PATCH 07/25] mds: don't early reply rename
On 01/29/2013 10:23 AM, Sage Weil wrote:
> On Tue, 29 Jan 2013, Yan, Zheng wrote:
>> On 01/29/2013 05:44 AM, Sage Weil wrote:
>>> On Wed, 23 Jan 2013, Yan, Zheng wrote:
>>>> From: Yan, Zheng <zheng.z@intel.com>
>>>>
>>>> _rename_finish() does not send dentry link/unlink messages to
>>>> replicas. We should prevent dentries that are modified by the
>>>> rename operation from getting new replicas while the rename
>>>> operation is committing. So don't mark xlocks done and don't early
>>>> reply for rename.
>>>
>>> Can we change this to only skip the early reply if there are
>>> replicas? Or change things so we do send those messages (or
>>> something similar) early? As is, this will kill workloads like rsync
>>> that rename every file.
>>
>> How about not marking xlocks on dentries done?
>
> Yeah, I like that, if we do that just in the rename case. The other
> patches look okay to me (from a quick review). With that change I'd
> like to pull the whole branch in. I assume your current wip-mds branch
> includes the fix or squashes the problem from the previous series?

Just force-updated my wip-mds branch. That patch is renamed to "mds:
don't set xlocks on dentries done when early reply rename". I also
updated "mds: preserve non-auth/unlinked objects until slave commit"
and "mds: fix slave rename rollback". The new patches trim non-auth
subtrees more actively.

Regards
Yan, Zheng
Fwd: Ceph Production Environment Setup and Configurations?
Hi,

With regard to my questions on a Ceph production environment, I would
like to give you these details. I would like to test write, read and
delete operations on a Ceph storage cluster in a production environment.
I would also like to check the self-healing and management
functionality.

I would like to know: in the production setup, are gateways required for
any of the three methods of accessing the Ceph cluster? Or should the
setup just be that all the servers are storage nodes with mon, mds and
osd running on each of them, while I access these storage nodes through
a single computer one could call a client, just like you described in
the 5-minute setup?

---------- Forwarded message ----------
Date: Tue, Jan 29, 2013 at 2:56 AM
Subject: Ceph Production Environment Setup and Configurations?
To: ceph-devel@vger.kernel.org

Please can anyone advise on how exactly a Ceph production environment
should look, and what the configuration files should be?

My hardware includes the following:

Server A, B, C configuration:
  CPU - Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
  RAM - 16GB
  Hard drive - 500GB
  SSD - 120GB

Server D, E, F, G, H, J configuration:
  CPU - Intel(R) Atom(TM) CPU D525 @ 1.80GHz
  RAM - 4 GB
  Boot drive - 320GB
  SSD - 120 GB
  Storage drives - 16 x 2 TB

I am thinking of this layout, but I am not sure:

  Server A - MDS and MON
  Server B - MON
  Server C - MON
  Server D, E, F, G, H, J - OSDs

Regards.
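Nobody in the thread posted a configuration, but for orientation, a
bobtail-era ceph.conf for the proposed layout might be shaped roughly
like this. This is a sketch only: the hostnames, addresses and paths are
invented, and it covers just the daemon-placement question (mons on
A/B/C, an mds on A, osds on D through J):

[global]
	auth supported = cephx

[mon.a]
	host = server-a
	mon addr = 192.168.1.1:6789

[mon.b]
	host = server-b
	mon addr = 192.168.1.2:6789

[mon.c]
	host = server-c
	mon addr = 192.168.1.3:6789

[mds.a]
	host = server-a

[osd.0]
	host = server-d
	osd journal = /dev/sdq1   # journal on the node's SSD

On the gateway question: a radosgw daemon is only needed for the
S3/Swift REST method of access; librados/rbd and CephFS clients talk to
the mons and osds directly, so no gateway is required for those.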
Re: [PATCH 0/2] two small patches for CEPH wireshark plugin
You could look at the wip-wireshark-zafman branch. I rebased it and
force-pushed it. It has changes to wireshark.patch and a minor change I
needed to get it to build. I'm surprised the recent check-in didn't
include the change to packet-ceph.c which I needed to get it to build.

David Zafman
Senior Developer
david.zaf...@inktank.com

On Jan 24, 2013, at 12:49 PM, Danny Al-Gaaf <danny.al-g...@bisect.de> wrote:

> On 24.01.2013 19:31, Sage Weil wrote:
>> Hi Danny!
>> [...]
>> Since you brought up wireshark... We would LOVE LOVE LOVE it if this
>> plugin could get upstream into wireshark.
>
> Yes, this would be great.
>
>> IIRC, the problem (last time we checked, ages ago) was that there
>> were strict coding guidelines for that project that weren't followed.
>> I'm not sure if that is still the case, or even if that is accurate.
>> It would be great if someone on this list who is looking for a way to
>> contribute could take the lead on trying to make this happen... :-)
>
> I'll take a look at it maybe ... if I find some free time for it.
>
> What about the patches? Can we apply them to the ceph git tree until
> we have another solution for the wireshark code?
>
> Danny
[ceph] locking fun with d_materialise_unique()
There's a fun potential problem with CEPH_MDS_OP_LOOKUPSNAP handling in
ceph_fill_trace(). Consider the following scenario:

Process calls stat(2). Lookup locks the parent, allocates a dentry and
calls ->lookup(). A request is created and sent over the wire. Then we
sit and wait for completion. Just as the reply has arrived, the process
gets SIGKILL. OK, we get to

	/*
	 * ensure we aren't running concurrently with
	 * ceph_fill_trace or ceph_readdir_prepopulate, which
	 * rely on locks (dir mutex) held by our caller.
	 */
	mutex_lock(&req->r_fill_mutex);
	req->r_err = err;
	req->r_aborted = true;
	mutex_unlock(&req->r_fill_mutex);

and we got there before handle_reply() grabbed ->r_fill_mutex. Then we
return to ceph_lookup(), drop the reference to the request and bugger
off. The parent is unlocked by the caller.

In the meanwhile, there's another thread sitting in handle_reply(). It
got ->r_fill_mutex and called ceph_fill_trace(). Had that been something
like a rename request, ceph_fill_trace() would've checked
req->r_aborted and that would've been it. However, we hit this:

	} else if (req->r_op == CEPH_MDS_OP_LOOKUPSNAP ||
		   req->r_op == CEPH_MDS_OP_MKSNAP) {
		struct dentry *dn = req->r_dentry;

and proceed to

		dout(" linking snapped dir %p to dn %p\n", in, dn);
		dn = splice_dentry(dn, in, NULL, true);

which does

	realdn = d_materialise_unique(dn, in);

and we are in trouble - d_materialise_unique() assumes that ->i_mutex on
the parent is held, which isn't guaranteed anymore. Not that the
d_delete() done a couple of lines earlier was any better...

I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
get to its splice_dentry() and d_delete() calls in similar situations -
I hadn't checked that one yet. If it isn't guaranteed, we have a problem
there as well.

I might very well be missing something - that code is seriously
convoluted, and the race wouldn't be easy to hit, so I don't have
anything resembling a candidate reproducer ;-/ IOW, this is just from
RTFS and I'd really appreciate comments from folks familiar with ceph.

The VFS side of the requirements is fairly simple:

* d_splice_alias(d, _), d_add_ci(d, _), d_add(d, _),
  d_materialise_unique(d, _), d_delete(d), d_move(_, d) should be called
  only with ->i_mutex held on the parent of d.

* d_move(d, _), d_add_unique(d, _), d_instantiate_unique(d, _),
  d_instantiate(d, _) should be called only with d being parentless
  (i.e. d->d_parent == d, aka. IS_ROOT(d)) or with ->i_mutex held on the
  parent of d.

* with the exception of "prepopulate dentry tree at ->get_sb() time"
  kinds of situations, d_alloc(d, _) and d_alloc_name(d, _) should be
  called only with d->d_inode->i_mutex held (and it won't be too hard to
  get rid of those exceptions, actually).

* lookup_one_len(_, d, _) should only be called with ->i_mutex held on
  d->d_inode.

* d_move(d1, d2) in the case when d1 and d2 have different parents
  should only be called with ->s_vfs_rename_mutex held on d1->d_sb
  (== d2->d_sb).

We are guaranteed that ->i_mutex is held on (the inode of) the parent
of d in

	->lookup(_, d, _)
	->atomic_open(_, d, _, _, _, _)
	->mkdir(_, d, _)
	->symlink(_, d, _)
	->create(_, d, _, _)
	->mknod(_, d, _, _)
	->link(_, _, d)
	->unlink(_, d)
	->rmdir(_, d)
	->rename(_, d, _, _)
	->rename(_, _, _, d)

Note that this is *not* guaranteed for the other argument of ->link() -
the inode we are linking has ->i_mutex held, but nothing of that kind is
promised for its parent directory. We also are guaranteed that ->i_mutex
is held on the inode of an opened directory passed to ->readdir() and on
the victims of ->unlink(), ->rmdir() and overwriting ->rename().
FWIW, I went through that stuff this weekend and we are fairly close to
having those requirements satisfied - I'll push a branch with the
accumulated fixes in a few, and after that we should be down to very few
remaining violations and dubious places (the ceph issues above being one
of those).

And yes, this stuff really needs to be in Documentation/filesystems
somewhere, along with a full description of the locking rules for
->d_parent and ->d_name accesses. I'm trying to put that together right
now...
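The abort handshake Al describes (the r_fill_mutex dance) is worth
seeing in isolation, since the bug is precisely a path that skips the
r_aborted check after winning the mutex. A userspace rendering of the
handshake - the field names are borrowed from the ceph code, everything
else is hypothetical:

#include <pthread.h>
#include <stdbool.h>

struct request {
	pthread_mutex_t r_fill_mutex;
	bool r_aborted;
	int r_err;
};

static void abort_request(struct request *req, int err)
{
	pthread_mutex_lock(&req->r_fill_mutex);
	req->r_err = err;
	req->r_aborted = true;	/* the caller's dir locks may be dropped
				 * any time after this */
	pthread_mutex_unlock(&req->r_fill_mutex);
}

static void handle_reply(struct request *req)
{
	pthread_mutex_lock(&req->r_fill_mutex);
	if (!req->r_aborted) {
		/* safe: the waiter is still blocked, so the locks that
		 * fill_trace relies on are still held by its caller */
	}
	/* an aborted request must not touch dentries at all */
	pthread_mutex_unlock(&req->r_fill_mutex);
}

int main(void)
{
	struct request req = { PTHREAD_MUTEX_INITIALIZER, false, 0 };

	abort_request(&req, -4);	/* -EINTR, as for a SIGKILLed waiter */
	handle_reply(&req);		/* sees r_aborted; must not splice */
	return 0;
}

The LOOKUPSNAP/MKSNAP arm of ceph_fill_trace() reaches splice_dentry()
without ever consulting r_aborted, which is exactly the hole described
above.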
Re: [ceph] locking fun with d_materialise_unique()
Hi Al,

On Tue, 29 Jan 2013, Al Viro wrote:
> There's a fun potential problem with CEPH_MDS_OP_LOOKUPSNAP handling
> in ceph_fill_trace(). Consider the following scenario:
> [...]
> and we are in trouble - d_materialise_unique() assumes that ->i_mutex
> on the parent is held, which isn't guaranteed anymore. Not that the
> d_delete() done a couple of lines earlier was any better...

Yep, that is indeed a problem. I think we just need to do the r_aborted
and/or r_locked_dir check in the else if condition...

> I'm not sure if we are guaranteed that ceph_readdir_prepopulate()
> won't get to its splice_dentry() and d_delete() calls in similar
> situations - I hadn't checked that one yet. If it isn't guaranteed, we
> have a problem there as well.

...and the condition guarding readdir_prepopulate(). :)

> I might very well be missing something - that code is seriously
> convoluted, and the race wouldn't be easy to hit, so I don't have
> anything resembling a candidate reproducer ;-/ IOW, this is just from
> RTFS and I'd really appreciate comments from folks familiar with ceph.

I think you're reading it correctly. The main thing to keep in mind here
is that we *do* need to call fill_inode() for the inode metadata on
these requests to keep the mds and client state in sync. The dentry
state is safe to ignore.

It would be great to have the dir i_mutex rules summarized somewhere,
even if it is just a copy of the below. It took a fair bit of trial and
error to infer what was going on when writing this code. :)

Ping me when you've pushed that branch and I'll take a look...

Thanks!
sage
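A sketch of the shape Sage's suggestion implies for the LOOKUPSNAP/
MKSNAP arm of ceph_fill_trace() - this shows the direction of the fix,
not an actual commit, and it is a code fragment rather than a standalone
program:

	} else if ((req->r_op == CEPH_MDS_OP_LOOKUPSNAP ||
		    req->r_op == CEPH_MDS_OP_MKSNAP) &&
		   !req->r_aborted && req->r_locked_dir) {
		/*
		 * Only splice the snapdir dentry when the caller still
		 * holds the parent's i_mutex (r_locked_dir) and the
		 * request hasn't been aborted; per Sage's note above,
		 * fill_inode() for the inode metadata still has to
		 * happen regardless.
		 */
		struct dentry *dn = req->r_dentry;
		/* ... existing splice_dentry()/d_delete() logic ... */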