Re: crush changes via cli
On Fri, 22 Mar 2013, Gregory Farnum wrote:
> I suspect users are going to easily get in trouble without a more
> rigid separation between multi-linked and single-linked buckets. It's
> probably best if anybody who's gone to the trouble of setting up a DAG
> can't wipe it out without being very explicit -- so for instance "move"
> should only work against a bucket with a single parent.

Good idea; I'll add that.

> Rather than
> defaulting to all ancestors, removals should (for multiply-linked
> buckets) require users to either specify a set of ancestors or to pass
> in a "--all" flag.

'rm' only works on an empty bucket, so I'm not sure there is much danger in removing all links (and the bucket) in that case?

> Also, I suspect that "rm" actually deletes the bucket while "unlink"
> simply removes it from all parents (but leaves it in the tree); that
> distinction might need to be a little stronger (or is possibly not
> appropriate to leave in the CLI?).

That's right. The "remove" versus "unlink" verbs make that pretty clear to me, at least... Are you suggesting this be clarified in the docs, or that the command set change? I think once we settle on the CLI, John can make a pass through the crush docs and make sure these commands are explained.

> You mention that one of the commands "does nothing" under some
> circumstances -- does that mean there's no error? If a command can't be
> logically completed it should complain to the user, not just fail
> silently.

It returns -ENOTEMPTY; sorry, poor choice of words. :)

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: crush changes via cli
On Fri, Mar 22, 2013 at 3:38 PM, Sage Weil wrote:
> There's a branch pending that lets you do the remainder of the most common
> crush map changes via the cli. The command set breaks down like so:
>
> Updating leaves (devices):
>
>  ceph osd crush set <id> <weight> <loc ...>
>  ceph osd crush add <id> <weight> <loc ...>
>  ceph osd crush create-or-move <id> <weight> <loc ...>
>
> These let you create, add, and move devices in the map. The difference
> between add and set is that add will create an additional instance of the
> osd (leaf), while set will move the old instance. This is useful for some
> configurations.
>
> The <loc ...> bits let you specify the 'where' part in the form of key/value
> pairs, like 'host=foo rack=bar root=default'. It will find the
> most-specific pair that matches an existing item, and create any
> intervening ancestors. For example, if my map has only a root=default
> node (nothing else) and I do
>
>  ceph osd crush set osd.0 1 host=foo rack=myrack row=first root=default
>
> it will create the row, rack, and host nodes, and then stick osd.0 inside
> host=foo.
>
> Create-or-move is similar to set except that it won't ever change the
> weight of the device; it only sets the initial weight if it has to create it.
> This is used by the upstart hook so that it doesn't inadvertently clobber
> changes the admin has made.
>
> The next set of commands adjust the map structure. Although people usually
> create a tree structure, in reality the crush map is a DAG (directed
> acyclic graph).
>
>  ceph osd crush rm <name> [ancestor]
>
> will remove an osd or internal node from the map, assuming there are no
> children. With the optional ancestor arg, it will remove only instances
> under the given ancestor. Otherwise, all instances are removed. If it is
> a bucket and non-empty, it does nothing.
>
>  ceph osd crush unlink <name> [ancestor]
>
> is similar, but will let you remove a (or all) link(s) to a bucket even if
> it is non-empty.
>
>  ceph osd crush move <name> <loc ...>
>
> will unlink the bucket from its existing location(s) and link it in a new
> position.
>
>  ceph osd crush link <name> <loc ...>
>
> doesn't touch existing links, only adds a new one.
>
> Finally,
>
>  ceph osd crush add-bucket <name> <type>
>
> is the one command that will create an internal node with no parent.
> Normally this is just used to create the root of the tree (e.g.,
> root=default). Once it is there, then devices can be added beneath it with
> set, add, link, etc., and the <loc ...> bit will add any intervening
> ancestors that are missing.
>
> This maps cleanly on to the internal data model that CRUSH is using. As
> long as it doesn't bend everyone's mind in uncomfortable ways, I'd like to
> stick with it (or something like it)... but if there is something here
> that seems wrong, let me know!

I suspect users are going to easily get in trouble without a more rigid separation between multi-linked and single-linked buckets. It's probably best if anybody who's gone to the trouble of setting up a DAG can't wipe it out without being very explicit -- so for instance "move" should only work against a bucket with a single parent. Rather than defaulting to all ancestors, removals should (for multiply-linked buckets) require users to either specify a set of ancestors or to pass in a "--all" flag.

Also, I suspect that "rm" actually deletes the bucket while "unlink" simply removes it from all parents (but leaves it in the tree); that distinction might need to be a little stronger (or is possibly not appropriate to leave in the CLI?).

You mention that one of the commands "does nothing" under some circumstances -- does that mean there's no error? If a command can't be logically completed it should complain to the user, not just fail silently.

-Greg
crush changes via cli
There's a branch pending that lets you do the remainder of the most common crush map changes via the cli. The command set breaks down like so:

Updating leaves (devices):

 ceph osd crush set <id> <weight> <loc ...>
 ceph osd crush add <id> <weight> <loc ...>
 ceph osd crush create-or-move <id> <weight> <loc ...>

These let you create, add, and move devices in the map. The difference between add and set is that add will create an additional instance of the osd (leaf), while set will move the old instance. This is useful for some configurations.

The <loc ...> bits let you specify the 'where' part in the form of key/value pairs, like 'host=foo rack=bar root=default'. It will find the most-specific pair that matches an existing item, and create any intervening ancestors. For example, if my map has only a root=default node (nothing else) and I do

 ceph osd crush set osd.0 1 host=foo rack=myrack row=first root=default

it will create the row, rack, and host nodes, and then stick osd.0 inside host=foo.

Create-or-move is similar to set except that it won't ever change the weight of the device; it only sets the initial weight if it has to create it. This is used by the upstart hook so that it doesn't inadvertently clobber changes the admin has made.

The next set of commands adjust the map structure. Although people usually create a tree structure, in reality the crush map is a DAG (directed acyclic graph).

 ceph osd crush rm <name> [ancestor]

will remove an osd or internal node from the map, assuming there are no children. With the optional ancestor arg, it will remove only instances under the given ancestor. Otherwise, all instances are removed. If it is a bucket and non-empty, it does nothing.

 ceph osd crush unlink <name> [ancestor]

is similar, but will let you remove a (or all) link(s) to a bucket even if it is non-empty.

 ceph osd crush move <name> <loc ...>

will unlink the bucket from its existing location(s) and link it in a new position.

 ceph osd crush link <name> <loc ...>

doesn't touch existing links, only adds a new one.

Finally,

 ceph osd crush add-bucket <name> <type>

is the one command that will create an internal node with no parent. Normally this is just used to create the root of the tree (e.g., root=default). Once it is there, then devices can be added beneath it with set, add, link, etc., and the <loc ...> bit will add any intervening ancestors that are missing.

This maps cleanly on to the internal data model that CRUSH is using. As long as it doesn't bend everyone's mind in uncomfortable ways, I'd like to stick with it (or something like it)... but if there is something here that seems wrong, let me know!

sage
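[Editor's note] The rm/unlink/move/link semantics described above can be modeled on a toy DAG. The following Python sketch is purely illustrative -- the class and method names are invented for this example and are not ceph's implementation; it only mirrors the behavior the message describes (rm refuses non-empty buckets with -ENOTEMPTY, unlink works even on non-empty buckets, move is unlink-everywhere plus link).

```python
# Toy model of the CRUSH bucket DAG semantics described above.
# Illustrative only; NOT ceph's implementation.

class CrushMap:
    def __init__(self):
        self.parents = {}   # node -> set of parent buckets (DAG links)
        self.children = {}  # bucket -> set of child nodes

    def add_bucket(self, name):
        # like 'ceph osd crush add-bucket': create a node with no parent
        self.parents.setdefault(name, set())
        self.children.setdefault(name, set())

    def link(self, name, parent):
        # 'link' adds a new link, leaving existing links untouched
        self.add_bucket(name)
        self.add_bucket(parent)
        self.parents[name].add(parent)
        self.children[parent].add(name)

    def unlink(self, name, ancestor=None):
        # remove one link (or all links); allowed even if non-empty
        targets = {ancestor} if ancestor else set(self.parents[name])
        for p in targets:
            self.parents[name].discard(p)
            self.children[p].discard(name)

    def rm(self, name, ancestor=None):
        # 'rm' refuses to act on a non-empty bucket (-ENOTEMPTY)
        if self.children.get(name):
            raise OSError("ENOTEMPTY")
        self.unlink(name, ancestor)
        if not self.parents[name]:  # last link gone: delete the node
            del self.parents[name], self.children[name]

    def move(self, name, parent):
        # unlink from all existing locations, then link in new position
        self.unlink(name)
        self.link(name, parent)
```

For example, after linking osd.0 under rack1 under default, `rm("rack1")` raises while `unlink("rack1")` succeeds and leaves rack1 in the map with no parents.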
Re: corruption of active mmapped files in btrfs snapshots
Quoting Chris Mason (2013-03-22 14:07:05)
> [ mmap corruptions with leveldb and btrfs compression ]
>
> I ran this a number of times with compression off and wasn't able to
> trigger problems. With compress=lzo, I see errors on every run.
>
> Compile: gcc -Wall -o mmap-trunc mmap-trunc.c
> Run: ./mmap-trunc file_name
>
> The basic idea is to create a 256MB file in steps. Each step ftruncates
> the file larger, and then mmaps a region for writing. It dirties some
> unaligned bytes (a little more than 8K), and then munmaps.
>
> Then a verify stage goes back through the file to make sure the data we
> wrote is really there. I'm using a simple rotating pattern of chars
> that compress very well.

Going through the code here, when I change the test to truncate once in the very beginning, I still get errors. So, it isn't an interaction between mmap and truncate. It must be a problem between lzo and mmap.

-chris
Re: Latest 0.56.3 and qemu-1.4.0 and cloned VM-image producing massive fs-corruption, not crashing
On 03/22/2013 12:09 PM, Oliver Francke wrote:
> Hi Josh, all,
>
> I did not want to hijack the thread dealing with a crashing VM, but
> perhaps there are some common things.
>
> Today I installed a fresh cluster with mkcephfs, went fine, imported a
> "master" debian 6.0 image with "format 2", made a snapshot, protected it,
> and made some clones. Clones mounted with qemu-nbd, fiddled a bit with
> IP/interfaces/hosts/net.rules…etc and cleanly unmounted, VM started,
> took 2 secs and the VM was up and running. Cool.
>
> Now an ordinary shutdown was performed, made a snapshot of this image.
> Started again, did some "apt-get update… install s/t…". Shutdown -> rbd
> rollback -> startup again -> login -> install s/t else… filesystem showed
> "many" ext3-errors, fell into read-only mode, massive corruption.

This sounds like it might be a bug in rollback. Could you try cloning and snapshotting again, but export the image before booting, and after rolling back, and compare the md5sums? Running the rollback with:

 --debug-ms 1 --debug-rbd 20 --log-file rbd-rollback.log

might help too. Does your ceph.conf where you ran the rollback have anything related to rbd_cache in it?

> qemu config was with ":rbd_cache=false" if it matters.
>
> Above scenario is reproducible, and as I stated, no crash detected.
> Perhaps it is in the same area as in the crash-thread; otherwise I will
> provide logfiles as needed.

It's unrelated; the other thread is an issue with the cache, which does not cause corruption but triggers a crash.

Josh
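[Editor's note] The check Josh suggests -- export the image before booting and again after rolling back, then compare checksums -- can be sketched as below. This is a hypothetical helper, not part of any ceph tool; in a real workflow the two files would come from `rbd export` of the image before the boot and after the rollback.

```python
# Hypothetical sketch of the verification Josh suggests: checksum two
# exported image files and compare. The files would come from
# 'rbd export' in practice; here we just hash arbitrary local files.
import hashlib

def md5sum(path, chunk=1 << 20):
    # stream the file so large images don't need to fit in memory
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            h.update(buf)
    return h.hexdigest()

def images_match(before_export, after_export):
    # True if the rollback restored the image bit-for-bit
    return md5sum(before_export) == md5sum(after_export)
```

If the two digests differ, that points at rollback (or something under it) rather than at the guest filesystem.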
Latest 0.56.3 and qemu-1.4.0 and cloned VM-image producing massive fs-corruption, not crashing
Hi Josh, all,

I did not want to hijack the thread dealing with a crashing VM, but perhaps there are some common things.

Today I installed a fresh cluster with mkcephfs, went fine, imported a "master" debian 6.0 image with "format 2", made a snapshot, protected it, and made some clones. Clones mounted with qemu-nbd, fiddled a bit with IP/interfaces/hosts/net.rules…etc and cleanly unmounted, VM started, took 2 secs and the VM was up and running. Cool.

Now an ordinary shutdown was performed, made a snapshot of this image. Started again, did some "apt-get update… install s/t…". Shutdown -> rbd rollback -> startup again -> login -> install s/t else… filesystem showed "many" ext3-errors, fell into read-only mode, massive corruption.

qemu config was with ":rbd_cache=false" if it matters.

Above scenario is reproducible, and as I stated, no crash detected. Perhaps it is in the same area as in the crash-thread; otherwise I will provide logfiles as needed.

Kind regards,

Oliver.
Re: github pull requests
On Fri, Mar 22, 2013 at 12:15 AM, Gregory Farnum wrote:
> I'm not sure that we handle enough incoming yet that the extra process
> weight of something like Gerrit or Launchpad is necessary over Github.
> What are you looking for in that system which Github doesn't provide?
> -Greg

Automated regression tests and gated commits come to mind. Gerrit alone of course doesn't help with that; you'd probably want to consider either running Jenkins, or hooking the master merges up with automatic teuthology runs. Just my two cents, though.

Cheers,
Florian
Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()
That's awesome Josh. Thanks for looking into it. Good luck with the fix!

- Travis

On Fri, Mar 22, 2013 at 1:11 PM, Josh Durgin wrote:
> I think I found the root cause based on your logs:
>
> http://tracker.ceph.com/issues/4531
>
> Josh
>
> On 03/20/2013 02:47 PM, Travis Rhoden wrote:
>> Didn't take long to re-create with the detailed debugging (ms = 20).
>> I'm sending Josh a link to the gzip'd log off-list; I'm not sure if
>> the log will contain any CephX keys or anything like that.
>>
>> On Wed, Mar 20, 2013 at 4:39 PM, Travis Rhoden wrote:
>>> Thanks Josh. I will respond when I have something useful!
>>>
>>> On Wed, Mar 20, 2013 at 4:32 PM, Josh Durgin wrote:
>>>> On 03/20/2013 01:19 PM, Josh Durgin wrote:
>>>>> On 03/20/2013 01:14 PM, Stefan Priebe wrote:
>>>>>> Hi,
>>>>>>
>>>>>>> In this case, they are format 2. And they are from cloned snapshots.
>>>>>>> Exactly like the following:
>>>>>>>
>>>>>>> # rbd ls -l -p volumes
>>>>>>> NAME SIZE PARENT FMT PROT LOCK
>>>>>>> volume-099a6d74-05bd-4f00-a12e-009d60629aa8 5120M images/b8bdda90-664b-4906-86d6-dd33735441f2@snap 2
>>>>>>>
>>>>>>> I'm doing an OpenStack boot-from-volume setup.
>>>>>>
>>>>>> OK i've never used cloned snapshots so maybe this is the reason.
>>>>>> strange i've never seen this. Which qemu version?
>>>>>>
>>>>>>> # qemu-x86_64 -version
>>>>>>> qemu-x86_64 version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008 Fabrice Bellard
>>>>>>>
>>>>>>> that's coming from Ubuntu 12.04 apt repos.
>>>>>>
>>>>>> maybe you should try qemu 1.4 there are a LOT of bugfixes. qemu-kvm does
>>>>>> not exist anymore it was merged into qemu with 1.3 or 1.4.
>>>>>
>>>>> This particular problem won't be solved by upgrading qemu. It's a ceph
>>>>> bug. Disabling caching would work around the issue.
>>>>>
>>>>> Travis, could you get a log from qemu of this happening with:
>>>>>
>>>>> debug ms = 20
>>>>> debug objectcacher = 20
>>>>> debug rbd = 20
>>>>> log file = /path/writeable/by/qemu
>>>>
>>>> If it doesn't reproduce with those settings, try changing debug ms to 1
>>>> instead of 20.
>>>>
>>>>> From those we can tell whether the issue is on the client side at
>>>>> least, and hopefully what's causing it.
>>>>>
>>>>> Thanks!
>>>>> Josh
Re: docs
My meeting got cancelled today, so I'll work with Gary to get this resolved.

On Fri, Mar 22, 2013 at 11:18 AM, Dan Mick wrote:
> On 03/22/2013 05:37 AM, Jerker Nyberg wrote:
>> There seems to be a missing argument to ceph osd lost (also in help for
>> the command).
>>
>> http://ceph.com/docs/master/rados/operations/control/#osd-subsystem
>
> Indeed, it seems to be missing the id. The CLI is getting a big rework
> right now, but the docs should be corrected. Patch or file an issue,
> either way.
>
>> src/tools/ceph.cc
>> src/test/cli/ceph/help.t
>> doc/rados/operations/control.rst
>>
>> The documentation for development release packages is slightly confused.
>> Should it not refer to http://ceph.com/rpm-testing for development
>> release packages? (Also, the ceph-release package in the development
>> release does not refer to itself (in /etc/yum.repos.d/ceph.repo) but to
>> (http://ceph.com/rpms) packages.)
>>
>> http://ceph.com/docs/master/install/rpm/
>>
>> Do you want patches?
>>
>> --jerker

--
John Wilkins
Senior Technical Writer
Inktank
john.wilk...@inktank.com
(415) 425-9599
http://inktank.com
Re: docs
On 03/22/2013 05:37 AM, Jerker Nyberg wrote:
> There seems to be a missing argument to ceph osd lost (also in help for
> the command).
>
> http://ceph.com/docs/master/rados/operations/control/#osd-subsystem

Indeed, it seems to be missing the id. The CLI is getting a big rework right now, but the docs should be corrected. Patch or file an issue, either way.

> src/tools/ceph.cc
> src/test/cli/ceph/help.t
> doc/rados/operations/control.rst
>
> The documentation for development release packages is slightly confused.
> Should it not refer to http://ceph.com/rpm-testing for development
> release packages? (Also, the ceph-release package in the development
> release does not refer to itself (in /etc/yum.repos.d/ceph.repo) but to
> (http://ceph.com/rpms) packages.)
>
> http://ceph.com/docs/master/install/rpm/
>
> Do you want patches?
>
> --jerker
Re: corruption of active mmapped files in btrfs snapshots
[ mmap corruptions with leveldb and btrfs compression ]

I ran this a number of times with compression off and wasn't able to trigger problems. With compress=lzo, I see errors on every run.

Compile: gcc -Wall -o mmap-trunc mmap-trunc.c
Run: ./mmap-trunc file_name

The basic idea is to create a 256MB file in steps. Each step ftruncates the file larger, and then mmaps a region for writing. It dirties some unaligned bytes (a little more than 8K), and then munmaps. Then a verify stage goes back through the file to make sure the data we wrote is really there. I'm using a simple rotating pattern of chars that compress very well.

I run it in batches of 100 with some memory pressure on the side:

for x in `seq 1 100` ; do (mmap-trunc f$x &) ; done

#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

#define FILE_SIZE ((loff_t)256 * 1024 * 1024)

/* make a painfully unaligned chunk size */
#define CHUNK_SIZE (8192 + 932)
#define mmap_align(x) (((x) + 4095) & ~4095)

char *file_name = NULL;

void mmap_one_chunk(int fd, loff_t *cur_size, unsigned char *file_buf)
{
	int ret;
	loff_t new_size = *cur_size + CHUNK_SIZE;
	loff_t pos = *cur_size;
	unsigned long map_size = mmap_align(CHUNK_SIZE) + 4096;
	char val = file_buf[0];
	char *p;
	int extra;

	/* step one, truncate out a hole */
	ret = ftruncate(fd, new_size);
	if (ret) {
		perror("truncate");
		exit(1);
	}
	if (val == 0 || val == 'z')
		val = 'a';
	else
		val++;
	memset(file_buf, val, CHUNK_SIZE);
	extra = pos & 4095;
	p = mmap(0, map_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		 fd, pos - extra);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	memcpy(p + extra, file_buf, CHUNK_SIZE);
	ret = munmap(p, map_size);
	if (ret) {
		perror("munmap");
		exit(1);
	}
	*cur_size = new_size;
}

void check_chunks(int fd)
{
	char *p;
	loff_t checked = 0;
	char val = 'a';
	int i;
	int errors = 0;
	int ret;
	int extra;
	unsigned long map_size = mmap_align(CHUNK_SIZE) + 4096;

	fprintf(stderr, "checking chunks\n");
	while (checked < FILE_SIZE) {
		extra = checked & 4095;
		p = mmap(0, map_size, PROT_READ, MAP_SHARED,
			 fd, checked - extra);
		if (p == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		for (i = 0; i < CHUNK_SIZE; i++) {
			if (p[i + extra] != val) {
				fprintf(stderr,
					"%s: bad val %x wanted %x offset 0x%llx\n",
					file_name, p[i + extra], val,
					(unsigned long long)checked + i);
				errors++;
			}
		}
		if (val == 'z')
			val = 'a';
		else
			val++;
		ret = munmap(p, map_size);
		if (ret) {
			perror("munmap");
			exit(1);
		}
		checked += CHUNK_SIZE;
	}
	printf("%s found %d errors\n", file_name, errors);
	if (errors)
		exit(1);
}

int main(int ac, char **av)
{
	unsigned char *file_buf;
	loff_t pos = 0;
	int ret;
	int fd;

	if (ac < 2) {
		fprintf(stderr, "usage: mmap-trunc filename\n");
		exit(1);
	}
	ret = posix_memalign((void **)&file_buf, 4096, CHUNK_SIZE);
	if (ret) {
		perror("cannot allocate memory\n");
		exit(1);
	}
	file_buf[0] = 0;
	file_name = av[1];
	fprintf(stderr, "running test on %s\n", file_name);
	unlink(file_name);
	fd = open(file_name, O_RDWR | O_CREAT, 0600);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	fprintf(stderr, "writing chunks\n");
	while (pos < FILE_SIZE) {
		mmap_one_chunk(fd, &pos, file_buf);
	}
	check_chunks(fd);
	return 0;
}
Re: corruption of active mmapped files in btrfs snapshots
On Fri, 22 Mar 2013, Chris Mason wrote:
> Quoting Alexandre Oliva (2013-03-22 10:17:30)
> > On Mar 22, 2013, Chris Mason wrote:
> >
> > > Are you using compression in btrfs or just in leveldb?
> >
> > btrfs lzo compression.
>
> Perfect, I'll focus on that part of things.
>
> > > I'd like to take snapshots out of the picture for a minute.
> >
> > That's understandable, I guess, but I don't know that anyone has ever
> > got the problem without snapshots. I mean, even when the master copy of
> > the database got corrupted, snapshots of the subvol containing it were
> > being taken every now and again, because that's the way ceph works.
>
> Hopefully Sage can comment, but the basic idea is that if you snapshot a
> database file the db must participate. If it doesn't, it really is the
> same effect as crashing the box.
>
> Something is definitely broken if we're corrupting the source files
> (either with or without snapshots), but avoiding incomplete writes in
> the snapshot files requires synchronization with the db.

In this case, we quiesce write activity, call leveldb's sync(), take the snapshot, and then continue.

(FWIW, this isn't the first time we've heard about leveldb corruption, but in each case we've looked into, the user had the btrfs compression enabled, so I suspect that's the right avenue of investigation!)

sage
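[Editor's note] The ordering Sage describes (quiesce writers, sync the db, snapshot, resume) can be sketched with a toy in-memory "database". Everything below is illustrative and invented for this example -- a real deployment snapshots a btrfs subvolume, not a dict -- but it shows why the snapshot sees a complete, synced state rather than in-flight writes.

```python
# Illustrative sketch of: quiesce writes -> sync -> snapshot -> resume.
# NOT ceph's or leveldb's code; the "snapshot" is just a point-in-time
# copy of the synced state.
import threading

class ToyDB:
    def __init__(self):
        self.lock = threading.Lock()  # stands in for quiescing writers
        self.memtable = {}            # writes not yet on "disk"
        self.disk = {}                # durable state

    def put(self, k, v):
        with self.lock:
            self.memtable[k] = v

    def sync(self):
        # flush pending writes so the on-disk state is complete
        self.disk.update(self.memtable)
        self.memtable.clear()

    def snapshot(self):
        # hold the lock (quiesce), sync, then copy the durable state
        with self.lock:
            self.sync()
            return dict(self.disk)
```

Without the sync() inside the lock, the snapshot could capture a state with writes missing or half-applied -- the "same effect as crashing the box" that Chris describes.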
Re: corruption of active mmapped files in btrfs snapshots
In this case, I think Alexandre is scanning for zeros in the file. The incomplete writes will definitely show that.

-chris

Quoting Samuel Just (2013-03-22 13:06:41)
> Incomplete writes for leveldb should just result in lost updates, not
> corruption. Also, we do stop writes before the snapshot is initiated,
> so there should be no in-progress writes to leveldb other than leveldb
> compaction (though that might be something to investigate).
> -Sam
>
> On Fri, Mar 22, 2013 at 7:26 AM, Chris Mason wrote:
> > Quoting Alexandre Oliva (2013-03-22 10:17:30)
> >> On Mar 22, 2013, Chris Mason wrote:
> >>
> >> > Are you using compression in btrfs or just in leveldb?
> >>
> >> btrfs lzo compression.
> >
> > Perfect, I'll focus on that part of things.
> >
> >> > I'd like to take snapshots out of the picture for a minute.
> >>
> >> That's understandable, I guess, but I don't know that anyone has ever
> >> got the problem without snapshots. I mean, even when the master copy of
> >> the database got corrupted, snapshots of the subvol containing it were
> >> being taken every now and again, because that's the way ceph works.
> >
> > Hopefully Sage can comment, but the basic idea is that if you snapshot a
> > database file the db must participate. If it doesn't, it really is the
> > same effect as crashing the box.
> >
> > Something is definitely broken if we're corrupting the source files
> > (either with or without snapshots), but avoiding incomplete writes in
> > the snapshot files requires synchronization with the db.
> >
> > -chris
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()
I think I found the root cause based on your logs:

http://tracker.ceph.com/issues/4531

Josh

On 03/20/2013 02:47 PM, Travis Rhoden wrote:

Didn't take long to re-create with the detailed debugging (ms = 20). I'm sending Josh a link to the gzip'd log off-list; I'm not sure if the log will contain any CephX keys or anything like that.

On Wed, Mar 20, 2013 at 4:39 PM, Travis Rhoden wrote:

Thanks Josh. I will respond when I have something useful!

On Wed, Mar 20, 2013 at 4:32 PM, Josh Durgin wrote:

On 03/20/2013 01:19 PM, Josh Durgin wrote:

On 03/20/2013 01:14 PM, Stefan Priebe wrote:

Hi,

In this case, they are format 2. And they are from cloned snapshots. Exactly like the following:

# rbd ls -l -p volumes
NAME SIZE PARENT FMT PROT LOCK
volume-099a6d74-05bd-4f00-a12e-009d60629aa8 5120M images/b8bdda90-664b-4906-86d6-dd33735441f2@snap 2

I'm doing an OpenStack boot-from-volume setup.

OK i've never used cloned snapshots so maybe this is the reason. strange i've never seen this. Which qemu version?

# qemu-x86_64 -version
qemu-x86_64 version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008 Fabrice Bellard

that's coming from Ubuntu 12.04 apt repos.

maybe you should try qemu 1.4 there are a LOT of bugfixes. qemu-kvm does not exist anymore it was merged into qemu with 1.3 or 1.4.

This particular problem won't be solved by upgrading qemu. It's a ceph bug. Disabling caching would work around the issue.

Travis, could you get a log from qemu of this happening with:

debug ms = 20
debug objectcacher = 20
debug rbd = 20
log file = /path/writeable/by/qemu

If it doesn't reproduce with those settings, try changing debug ms to 1 instead of 20. From those we can tell whether the issue is on the client side at least, and hopefully what's causing it.

Thanks!
Josh
Re: corruption of active mmapped files in btrfs snapshots
On Fri, Mar 22, 2013 at 10:26:59AM -0400, Chris Mason wrote:
> Quoting Alexandre Oliva (2013-03-22 10:17:30)
> > On Mar 22, 2013, Chris Mason wrote:
> >
> > > Are you using compression in btrfs or just in leveldb?
> >
> > btrfs lzo compression.
>
> Perfect, I'll focus on that part of things.
>
> > I'd like to take snapshots out of the picture for a minute.

I've reproduced this without compression, with autodefrag on. The test was using snapshots (ie. the unmodified version) and ended with

1087 blocks, 4316779 total size
snaptest.268/ca snaptest.268/db differ: char 4245170, line 16

after a few minutes. Before that, I was running the NOSNAPS mode for many minutes (up to 50k rounds) without a reported problem. There was the same 'make clean && make -j 32' kernel compilation running in parallel; the box has 8 cpus, 4GB ram. Watching 'free' showed the memory going up to a few gigs and down to ~130MB.

david
Re: corruption of active mmapped files in btrfs snapshots
Incomplete writes for leveldb should just result in lost updates, not corruption. Also, we do stop writes before the snapshot is initiated, so there should be no in-progress writes to leveldb other than leveldb compaction (though that might be something to investigate).
-Sam

On Fri, Mar 22, 2013 at 7:26 AM, Chris Mason wrote:
> Quoting Alexandre Oliva (2013-03-22 10:17:30)
>> On Mar 22, 2013, Chris Mason wrote:
>>
>> > Are you using compression in btrfs or just in leveldb?
>>
>> btrfs lzo compression.
>
> Perfect, I'll focus on that part of things.
>
>> > I'd like to take snapshots out of the picture for a minute.
>>
>> That's understandable, I guess, but I don't know that anyone has ever
>> got the problem without snapshots. I mean, even when the master copy of
>> the database got corrupted, snapshots of the subvol containing it were
>> being taken every now and again, because that's the way ceph works.
>
> Hopefully Sage can comment, but the basic idea is that if you snapshot a
> database file the db must participate. If it doesn't, it really is the
> same effect as crashing the box.
>
> Something is definitely broken if we're corrupting the source files
> (either with or without snapshots), but avoiding incomplete writes in
> the snapshot files requires synchronization with the db.
>
> -chris
Re: corruption of active mmapped files in btrfs snapshots
Quoting Alexandre Oliva (2013-03-22 10:17:30)
> On Mar 22, 2013, Chris Mason wrote:
>
> > Are you using compression in btrfs or just in leveldb?
>
> btrfs lzo compression.

Perfect, I'll focus on that part of things.

> > I'd like to take snapshots out of the picture for a minute.
>
> That's understandable, I guess, but I don't know that anyone has ever
> got the problem without snapshots. I mean, even when the master copy of
> the database got corrupted, snapshots of the subvol containing it were
> being taken every now and again, because that's the way ceph works.

Hopefully Sage can comment, but the basic idea is that if you snapshot a database file the db must participate. If it doesn't, it really is the same effect as crashing the box.

Something is definitely broken if we're corrupting the source files (either with or without snapshots), but avoiding incomplete writes in the snapshot files requires synchronization with the db.

-chris
Re: corruption of active mmapped files in btrfs snapshots
On Mar 22, 2013, Chris Mason wrote:

> Are you using compression in btrfs or just in leveldb?

btrfs lzo compression.

> I'd like to take snapshots out of the picture for a minute.

That's understandable, I guess, but I don't know that anyone has ever got the problem without snapshots. I mean, even when the master copy of the database got corrupted, snapshots of the subvol containing it were being taken every now and again, because that's the way ceph works. Even back when I noticed corruption of firefox _CACHE_* files, snapshots taken for archival were involved. So, unless the program happens to trigger the problem with the -DNOSNAPS option about as easily as it did without it, I guess we may not have a choice but to keep snapshots in the picture.

> We need some way to synchronize the leveldb with snapshotting

I purposefully refrained from doing that, because AFAICT ceph doesn't do that. Once I failed to trigger the problem with Sync calls, and determined ceph only syncs the leveldb logs before taking its snapshots, I went without syncing and finally succeeded in triggering the bug in snapshots, by simulating very similar snapshotting and mmaping conditions to those generated by ceph.

I haven't managed to trigger the corruption of the master subvol yet with the test program, but I already knew its corruption didn't occur as often as that of the snapshots, and since it smells like two slightly different symptoms of the same bug, I decided to leave the test program at that.

--
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer
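The symptom being checked for in these runs is zeros at the end of one page followed by nonzero data at the start of the next, where the writer filled everything with (char)1. That signature can be detected with a short scan like this (an illustrative Python sketch, not the actual GPLv3+ C test program; the 4 KiB page size is assumed):

```python
PAGE = 4096  # assumed page size, as on x86

def find_zero_tail_pages(data):
    """Scan file contents for the corruption signature: a zero byte
    at the end of one page followed by a nonzero byte at the start
    of the next.  Because the writer fills its buffers with (char)1,
    any zero tail means data was lost during writeback."""
    hits = []
    for off in range(PAGE, len(data), PAGE):
        # last byte of the previous page is zero, first of this one isn't
        if data[off - 1] == 0 and data[off] != 0:
            hits.append(off)
    return hits
```

A real checker would compare the snapshot copy against the master file as well, since the zeros may legitimately appear past the end of the written region.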
Re: [ceph-users] ceph 0.59 cephx problem
(Re-CC'ing the list)

On 03/22/2013 01:36 PM, Steffen Thorhauer wrote:
> I was upgrading from 0.58 to ceph version 0.59
> (cbae6a435c62899f857775f66659de052fb0e759). Upgrading from 0.57 to 0.58
> was an easy one, so I was surprised with the problems.

v0.59 is the first dev release with a major monitor rework. We've tested it thoroughly over the past weeks, but different usages tend to trigger different behaviours, so you might just have hit one of those buggers.

> It seems to me that I made a fatal error that I don't understand. I had
> 5 working mons (mon.{0-4}). After the upgrade of the first node I lost
> mon.4 with the cephx error. Then I upgraded all of the nodes and I lost
> mon.0 with the starting error.

The v0.59 monitors are unable to communicate with the <=0.58 monitors, so that's likely why the monitor appeared to be lost: you would need at least a majority of monitors on v0.59 so they could form a quorum.

> After some restarts it looks like the other mons lost any quorum, so
> ceph -s or any kind of ceph commands didn't work anymore.

As long as you have a majority of monitors running v0.59, they ought to be able to form a quorum. If they didn't, then something weird must have happened and logs would be much appreciated!

> So I made today the decision to reinstall the test "cluster".

You decided to go back to v0.58, is that it? Regardless, if you have logs that could provide some insight into what happened, we'd really appreciate it.

Thanks!

  -Joao

> -Steffen
>
> Btw. ceph rbd, adding/removing osds works great.
>
> On Fri, Mar 22, 2013 at 10:01:10AM +0000, Joao Eduardo Luis wrote:
> > On 03/21/2013 03:47 PM, Steffen Thorhauer wrote:
> > > I think I was impatient and should wait for the v0.59 announcement.
> > > It seems I should upgrade all monitors. After upgrading all nodes I
> > > have on 2 monitors errors like:
> > >
> > > === mon.0 ===
> > > Starting Ceph mon.0 on u124-161-ceph...
> > > mon fs missing 'monmap/latest' and 'mkfs/monmap'
> > > failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i 0 --pid-file /var/run/ceph/mon.0.pid -c /etc/ceph/ceph.conf '
> > >
> > > Steffen
> >
> > Which version are you upgrading from?
> >
> > Also, could you provide us with some logs of those monitors with
> > 'debug mon = 20' ?
> >
> >   -Joao
> >
> > > On 03/21/2013 02:22 PM, Steffen Thorhauer wrote:
> > > > Hi,
> > > > I just upgraded one node of my ceph "cluster". I wanted upgrade
> > > > node after node. osd on this node has no problem, but the mon
> > > > (mon.4) has authorization problems. I didn't change any config,
> > > > just made an apt-get upgrade.
> > > >
> > > > ceph -s
> > > >   health HEALTH_WARN 1 mons down, quorum 0,1,2,3 0,1,2,3
> > > >   monmap e2: 5 mons at {0=10.37.124.161:6789/0,1=10.37.124.162:6789/0,2=10.37.124.163:6789/0,3=10.37.124.164:6789/0,4=10.37.124.167:6789/0}, election epoch 162, quorum 0,1,2,3 0,1,2,3
> > > >   osdmap e4839: 16 osds: 16 up, 16 in
> > > >   pgmap v195213: 3144 pgs: 3144 active+clean; 255 GB data, 820 GB used, 778 GB / 1599 GB avail
> > > >   mdsmap e54723: 1/1/1 up {0=0=up:active}, 3 up:standby
> > > >
> > > > but the mon.4 log file looks like:
> > > >
> > > > 2013-03-21 12:45:15.701747 7f45412c6780  2 mon.4@-1(probing) e2 init
> > > > 2013-03-21 12:45:15.702051 7f45412c6780 10 mon.4@-1(probing) e2 bootstrap
> > > > 2013-03-21 12:45:15.702094 7f45412c6780 10 mon.4@-1(probing) e2 unregister_cluster_logger - not registered
> > > > 2013-03-21 12:45:15.702121 7f45412c6780 10 mon.4@-1(probing) e2 cancel_probe_timeout (none scheduled)
> > > > 2013-03-21 12:45:15.702147 7f45412c6780  0 mon.4@-1(probing) e2 my rank is now 4 (was -1)
> > > > 2013-03-21 12:45:15.702190 7f45412c6780 10 mon.4@4(probing) e2 reset_sync
> > > > 2013-03-21 12:45:15.702213 7f45412c6780 10 mon.4@4(probing) e2 reset
> > > > 2013-03-21 12:45:15.702238 7f45412c6780 10 mon.4@4(probing) e2 timecheck_finish
> > > > 2013-03-21 12:45:15.702286 7f45412c6780 10 mon.4@4(probing) e2 cancel_probe_timeout (none scheduled)
> > > > 2013-03-21 12:45:15.702312 7f45412c6780 10 mon.4@4(probing) e2 reset_probe_timeout 0x24d6580 after 2 seconds
> > > > 2013-03-21 12:45:15.702387 7f45412c6780 10 mon.4@4(probing) e2 probing other monitors
> > > > 2013-03-21 12:45:15.703459 7f453a15f700 10 mon.4@4(probing) e2 ms_get_authorizer for mon
> > > > 2013-03-21 12:45:15.703641 7f453a15f700 10 cephx: build_service_ticket service mon secret_id 18446744073709551615 ticket_info.ticket.name=mon.
> > > > 2013-03-21 12:45:15.703642 7f453a361700 10 mon.4@4(probing) e2 ms_get_authorizer for mon
> > > > 2013-03-21 12:45:15.703694 7f453a361700 10 cephx: build_service_ticket service mon secret_id 18446744073709551615 ticket_info.ticket.name=mon.
> > > > 2013-03-21 12:45:15.703869 7f453a260700 10 mon.4@4(probing) e2 ms_get_authorizer for mon
> > > > 2013-03-21 12:45:15.703957 7f453a260700 10 cephx: build_service_ticket service mon secret_id 18446744073709551615 ticket_info.ticket.name=mon.
> > > > 2013-03-21 12:45:15.704244 7f453a05e700 10 mon.4@4(probing) e2 ms_get_authorizer for mon
> > > > 2013-03-21 12:45:15.704306 7f453a05e700 10 cephx: build_service_ticket service mon secret_id 18446744073709551615 ticket_info.ticket.name=mon.
> > > > 2013-03-21 12:45:15.704323 7f453a361700  0 cephx: verify_reply coudln't decrypt with error: error decoding block for decryption
> > > > 2013-0
docs
There seems to be a missing argument to "ceph osd lost" (also in the help for the command).

http://ceph.com/docs/master/rados/operations/control/#osd-subsystem
src/tools/ceph.cc
src/test/cli/ceph/help.t
doc/rados/operations/control.rst

The documentation for development release packages is slightly confused. Should it not refer to http://ceph.com/rpm-testing for development release packages? (Also, the ceph-release package in the development release does not refer to itself (in /etc/yum.repos.d/ceph.repo) but to the http://ceph.com/rpms packages.)

http://ceph.com/docs/master/install/rpm/

Do you want patches?

--jerker
Re: corruption of active mmapped files in btrfs snapshots
Quoting Alexandre Oliva (2013-03-22 01:27:42)
> On Mar 21, 2013, Chris Mason wrote:
>
> > Quoting Chris Mason (2013-03-21 14:06:14)
> >> With mmap the kernel can pick any given time to start writing out dirty
> >> pages. The idea is that if the application makes more changes the page
> >> becomes dirty again and the kernel writes it again.
>
> That's the theory. But what if there's some race between the time the
> page is frozen for compressing and the time it's marked as clean, or
> it's marked as clean after it's further modified, or a subsequent write
> to the same page ends up overridden by the background compression of the
> old contents of the page? These are all possibilities that come to mind
> without knowing much about btrfs inner workings.

Definitely, there is a lot of room for racing. Are you using compression in btrfs or just in leveldb?

> >> So the question is, can you trigger this without snapshots being done
> >> at all?
>
> I haven't tried, but I now have a program that hit the error condition
> while taking snapshots in background with small time perturbations to
> increase the likelihood of hitting a race condition at the exact time.
> It uses leveldb's infrastructure for the mmapping, but it shouldn't be
> too hard to adapt it so that it doesn't.
>
> > So my test program creates an 8GB file in chunks of 1MB each.
>
> That's probably too large a chunk to write at a time. The bug is
> exercised with writes slightly smaller than a single page (although
> straddling across two consecutive pages).
>
> This half-baked test program (hereby provided under the terms of the GNU
> GPLv3+) creates a btrfs subvolume and two files in it: one in which I/O
> will be performed with write()s, another that will get the same data
> appended with leveldb's mmap-based output interface. Random block sizes,
> as well as milli and microsecond timing perturbations, are read from
> /dev/urandom, and the rest of the output buffer is filled with (char)1.
>
> The test that actually failed (on the first try, after some other
> variations that didn't fail!) didn't have any of the #ifdef options
> enabled (i.e., no -D* flags during compilation), but it triggered the
> exact failure observed with ceph: zeros at the end of a page where there
> should have been nonzero data, followed by nonzero data on the following
> page! That was within snapshots, not in the main subvol, but hopefully
> it's the same problem, just a bit harder to trigger.

I'd like to take snapshots out of the picture for a minute. We need some way to synchronize the leveldb with snapshotting because the snapshot is basically the same thing as a crash from a db point of view.

Corrupting the main database file is a much different (and bigger) problem.

-chris
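The write pattern under discussion, mmap-based appends slightly smaller than a page so that each record straddles two consecutive pages, looks roughly like this. A minimal Python sketch (the real test program is C, using leveldb's mmap output interface; the function name and the PAGE - 100 record size are illustrative assumptions):

```python
import mmap
import os
import tempfile

PAGE = 4096  # assumed page size


def mmap_append(path, records):
    """Sketch of leveldb-style mmap output: size the file up front,
    map it, and copy records in with stores whose sizes are just
    under one page, so each store straddles a page boundary."""
    total = sum(len(r) for r in records)
    with open(path, "wb") as f:
        f.truncate(total)                  # extend the file first
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), total) as m:
        off = 0
        for r in records:
            m[off:off + len(r)] = r        # store may span two pages
            off += len(r)
        m.flush()                          # msync, like leveldb's Sync()


# demo: eight records of PAGE - 100 bytes, each filled with 0x01
records = [b"\x01" * (PAGE - 100)] * 8
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "mmap_demo.bin")
    mmap_append(path, records)
    with open(path, "rb") as f:
        written = f.read()
```

On btrfs the interesting part is what the kernel writes back for the page each record only partially fills; the actual test program then snapshots the subvolume in the background and compares the copies.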