Re: corruption of active mmapped files in btrfs snapshots
Quoting Chris Mason (2013-03-22 16:31:42)

Going through the code here, when I change the test to truncate once in the very beginning, I still get errors. So, it isn't an interaction between mmap and truncate. It must be a problem between lzo and mmap.

With compression off, we use clear_page_dirty_for_io to create a wall between applications using mmap and our crc code. Once we call clear_page_dirty_for_io, it means we're in the process of writing the page and anyone using mmap must wait (by calling page_mkwrite) before they are allowed to change the page.

We use it with compression on as well, but it only ends up protecting the crcs. It gets called after the compression is done, which allows applications to race in and modify the pages while we are compressing them.

This patch changes our compression code to call clear_page_dirty_for_io before we compress, and then redirty the pages if the compression fails.

Alexandre, many thanks for tracking this down into a well defined use case.

-chris

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f173c5a..cdee391 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1257,6 +1257,39 @@ int unlock_extent(struct extent_io_tree *tree, u64 start, u64 end)
 			GFP_NOFS);
 }
 
+int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
+{
+	unsigned long index = start >> PAGE_CACHE_SHIFT;
+	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
+	struct page *page;
+
+	while (index <= end_index) {
+		page = find_get_page(inode->i_mapping, index);
+		BUG_ON(!page); /* Pages should be in the extent_io_tree */
+		clear_page_dirty_for_io(page);
+		page_cache_release(page);
+		index++;
+	}
+	return 0;
+}
+
+int extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end)
+{
+	unsigned long index = start >> PAGE_CACHE_SHIFT;
+	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
+	struct page *page;
+
+	while (index <= end_index) {
+		page = find_get_page(inode->i_mapping, index);
+		BUG_ON(!page); /* Pages should be in the extent_io_tree */
+		account_page_redirty(page);
+		__set_page_dirty_nobuffers(page);
+		page_cache_release(page);
+		index++;
+	}
+	return 0;
+}
+
 /*
  * helper function to set both pages and extents in the tree writeback
  */
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 6068a19..258c921 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -325,6 +325,8 @@ int map_private_extent_buffer(struct extent_buffer *eb, unsigned long offset,
 			      unsigned long *map_len);
 int extent_range_uptodate(struct extent_io_tree *tree,
 			  u64 start, u64 end);
+int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end);
+int extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end);
 int extent_clear_unlock_delalloc(struct inode *inode,
 				struct extent_io_tree *tree,
 				u64 start, u64 end, struct page *locked_page,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ca1b767..88d4a18 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -353,6 +353,7 @@ static noinline int compress_file_range(struct inode *inode,
 	int i;
 	int will_compress;
 	int compress_type = root->fs_info->compress_type;
+	int redirty = 0;
 
 	/* if this is a small write inside eof, kick off a defrag */
 	if ((end - start + 1) < 16 * 1024 &&
@@ -415,6 +416,8 @@ again:
 		if (BTRFS_I(inode)->force_compress)
 			compress_type = BTRFS_I(inode)->force_compress;
 
+		extent_range_clear_dirty_for_io(inode, start, end);
+		redirty = 1;
 		ret = btrfs_compress_pages(compress_type,
 					   inode->i_mapping, start,
 					   total_compressed, pages,
@@ -554,6 +557,8 @@ cleanup_and_bail_uncompressed:
 		__set_page_dirty_nobuffers(locked_page);
 		/* unlocked later on in the async handlers */
 	}
+	if (redirty)
+		extent_range_redirty_for_io(inode, start, end);
 	add_async_extent(async_cow, start, end - start + 1,
 			 0, NULL, 0, BTRFS_COMPRESS_NONE);
 	*num_added += 1;

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: corruption of active mmapped files in btrfs snapshots
Quoting Alexandre Oliva (2013-03-22 01:27:42)
> On Mar 21, 2013, Chris Mason <chris.ma...@fusionio.com> wrote:
>> Quoting Chris Mason (2013-03-21 14:06:14)
>> With mmap the kernel can pick any given time to start writing out dirty pages. The idea is that if the application makes more changes the page becomes dirty again and the kernel writes it again.
>
> That's the theory. But what if there's some race between the time the page is frozen for compressing and the time it's marked as clean, or it's marked as clean after it's further modified, or a subsequent write to the same page ends up overridden by the background compression of the old contents of the page? These are all possibilities that come to mind without knowing much about btrfs inner workings.

Definitely, there is a lot of room for racing. Are you using compression in btrfs or just in leveldb?

>> So the question is, can you trigger this without snapshots being done at all?
>
> I haven't tried, but I now have a program that hit the error condition while taking snapshots in background, with small time perturbations to increase the likelihood of hitting a race condition at the exact time. It uses leveldb's infrastructure for the mmapping, but it shouldn't be too hard to adapt it so that it doesn't.

>> So my test program creates an 8GB file in chunks of 1MB each.
>
> That's probably too large a chunk to write at a time. The bug is exercised with writes slightly smaller than a single page (although straddling across two consecutive pages).
>
> This half-baked test program (hereby provided under the terms of the GNU GPLv3+) creates a btrfs subvolume and two files in it: one in which I/O will be performed with write()s, another that will get the same data appended with leveldb's mmap-based output interface. Random block sizes, as well as milli- and microsecond timing perturbations, are read from /dev/urandom, and the rest of the output buffer is filled with (char)1.
>
> The test that actually failed (on the first try!, after some other variations that didn't fail) didn't have any of the #ifdef options enabled (i.e., no -D* flags during compilation), but it triggered the exact failure observed with ceph: zeros at the end of a page where there should have been nonzero data, followed by nonzero data on the following page! That was within snapshots, not in the main subvol, but hopefully it's the same problem, just a bit harder to trigger.

I'd like to take snapshots out of the picture for a minute. We need some way to synchronize the leveldb with snapshotting because the snapshot is basically the same thing as a crash from a db point of view. Corrupting the main database file is a much different (and bigger) problem.

-chris
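Alexandre's observation that the bug needs writes slightly smaller than a page, placed so they straddle two consecutive pages, suggests a minimal shape for the failing access pattern. The sketch below is not Alexandre's program; the helper name, sizes, and file name are invented for illustration. It performs exactly one such write through a shared mapping:

```c
/* Demonstrates the write shape described above: a chunk a bit smaller
 * than one page, placed so it straddles two consecutive pages of an
 * mmap()ed file.  Names and sizes are illustrative only. */
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE 4096

/* write "len" bytes at unaligned offset "pos" through a shared map */
static int straddle_write(int fd, off_t pos, const char *buf, size_t len)
{
	off_t aligned = pos & ~(off_t)(PAGE - 1);	/* page containing pos */
	size_t extra = pos - aligned;
	size_t map_len = extra + len;
	char *p;

	/* grow the file so the mapped range is fully backed */
	if (ftruncate(fd, pos + len))
		return -1;
	p = mmap(0, map_len, PROT_READ | PROT_WRITE, MAP_SHARED,
		 fd, aligned);
	if (p == MAP_FAILED)
		return -1;
	memcpy(p + extra, buf, len);	/* dirties two pages */
	return munmap(p, map_len);
}
```

For example, straddle_write(fd, 3500, buf, 4000) writes bytes 3500..7499, dirtying the tail of page 0 and the head of page 1, which is the boundary where the NULs were showing up.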
Re: corruption of active mmapped files in btrfs snapshots
Quoting Alexandre Oliva (2013-03-22 10:17:30)
> On Mar 22, 2013, Chris Mason <clma...@fusionio.com> wrote:
>> Are you using compression in btrfs or just in leveldb?
>
> btrfs lzo compression.

Perfect, I'll focus on that part of things.

>> I'd like to take snapshots out of the picture for a minute.
>
> That's understandable, I guess, but I don't know that anyone has ever got the problem without snapshots. I mean, even when the master copy of the database got corrupted, snapshots of the subvol containing it were being taken every now and again, because that's the way ceph works.

Hopefully Sage can comment, but the basic idea is that if you snapshot a database file the db must participate. If it doesn't, it really is the same effect as crashing the box.

Something is definitely broken if we're corrupting the source files (either with or without snapshots), but avoiding incomplete writes in the snapshot files requires synchronization with the db.

-chris
Re: corruption of active mmapped files in btrfs snapshots
In this case, I think Alexandre is scanning for zeros in the file. The incomplete writes will definitely show that.

-chris

Quoting Samuel Just (2013-03-22 13:06:41)
> Incomplete writes for leveldb should just result in lost updates, not corruption. Also, we do stop writes before the snapshot is initiated, so there should be no in-progress writes to leveldb other than leveldb compaction (though that might be something to investigate).
> -Sam
Re: corruption of active mmapped files in btrfs snapshots
[ mmap corruptions with leveldb and btrfs compression ]

I ran this a number of times with compression off and wasn't able to trigger problems. With compress=lzo, I see errors on every run.

Compile: gcc -Wall -o mmap-trunc mmap-trunc.c
Run: ./mmap-trunc file_name

The basic idea is to create a 256MB file in steps. Each step ftruncates the file larger, and then mmaps a region for writing. It dirties some unaligned bytes (a little more than 8K), and then munmaps. Then a verify stage goes back through the file to make sure the data we wrote is really there. I'm using a simple rotating pattern of chars that compress very well.

I run it in batches of 100 with some memory pressure on the side:

for x in `seq 1 100` ; do (mmap-trunc f$x &) ; done

#define _FILE_OFFSET_BITS 64
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define FILE_SIZE ((loff_t)256 * 1024 * 1024)

/* make a painfully unaligned chunk size */
#define CHUNK_SIZE (8192 + 932)

#define mmap_align(x) (((x) + 4095) & ~4095)

char *file_name = NULL;

void mmap_one_chunk(int fd, loff_t *cur_size, unsigned char *file_buf)
{
	int ret;
	loff_t new_size = *cur_size + CHUNK_SIZE;
	loff_t pos = *cur_size;
	unsigned long map_size = mmap_align(CHUNK_SIZE) + 4096;
	char val = file_buf[0];
	char *p;
	int extra;

	/* step one, truncate out a hole */
	ret = ftruncate(fd, new_size);
	if (ret) {
		perror("truncate");
		exit(1);
	}
	if (val == 0 || val == 'z')
		val = 'a';
	else
		val++;
	memset(file_buf, val, CHUNK_SIZE);

	extra = pos & 4095;
	p = mmap(0, map_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		 fd, pos - extra);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	memcpy(p + extra, file_buf, CHUNK_SIZE);
	ret = munmap(p, map_size);
	if (ret) {
		perror("munmap");
		exit(1);
	}
	*cur_size = new_size;
}

void check_chunks(int fd)
{
	char *p;
	loff_t checked = 0;
	char val = 'a';
	int i;
	int errors = 0;
	int ret;
	int extra;
	unsigned long map_size = mmap_align(CHUNK_SIZE) + 4096;

	fprintf(stderr, "checking chunks\n");
	while (checked < FILE_SIZE) {
		extra = checked & 4095;
		p = mmap(0, map_size, PROT_READ, MAP_SHARED,
			 fd, checked - extra);
		if (p == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		for (i = 0; i < CHUNK_SIZE; i++) {
			if (p[i + extra] != val) {
				fprintf(stderr,
					"%s: bad val %x wanted %x offset 0x%llx\n",
					file_name, p[i + extra], val,
					(unsigned long long)checked + i);
				errors++;
			}
		}
		if (val == 'z')
			val = 'a';
		else
			val++;
		ret = munmap(p, map_size);
		if (ret) {
			perror("munmap");
			exit(1);
		}
		checked += CHUNK_SIZE;
	}
	printf("%s found %d errors\n", file_name, errors);
	if (errors)
		exit(1);
}

int main(int ac, char **av)
{
	unsigned char *file_buf;
	loff_t pos = 0;
	int ret;
	int fd;

	if (ac < 2) {
		fprintf(stderr, "usage: mmap-trunc filename\n");
		exit(1);
	}
	ret = posix_memalign((void **)&file_buf, 4096, CHUNK_SIZE);
	if (ret) {
		perror("cannot allocate memory");
		exit(1);
	}
	file_buf[0] = 0;
	file_name = av[1];

	fprintf(stderr, "running test on %s\n", file_name);

	unlink(file_name);
	fd = open(file_name, O_RDWR | O_CREAT, 0600);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	fprintf(stderr, "writing chunks\n");
	while (pos < FILE_SIZE) {
		mmap_one_chunk(fd, &pos, file_buf);
	}
	check_chunks(fd);
	return 0;
}
Re: corruption of active mmapped files in btrfs snapshots
Quoting Chris Mason (2013-03-22 14:07:05)
> [ mmap corruptions with leveldb and btrfs compression ]
>
> I ran this a number of times with compression off and wasn't able to trigger problems. With compress=lzo, I see errors on every run.
>
> Compile: gcc -Wall -o mmap-trunc mmap-trunc.c
> Run: ./mmap-trunc file_name
>
> The basic idea is to create a 256MB file in steps. Each step ftruncates the file larger, and then mmaps a region for writing. It dirties some unaligned bytes (a little more than 8K), and then munmaps. Then a verify stage goes back through the file to make sure the data we wrote is really there. I'm using a simple rotating pattern of chars that compress very well.

Going through the code here, when I change the test to truncate once in the very beginning, I still get errors. So, it isn't an interaction between mmap and truncate. It must be a problem between lzo and mmap.

-chris
Re: corruption of active mmapped files in btrfs snapshots
Quoting Chris Mason (2013-03-21 14:06:14)
> Quoting Alexandre Oliva (2013-03-21 03:14:02)
>> On Mar 19, 2013, Alexandre Oliva <ol...@gnu.org> wrote:
>>> On Mar 19, 2013, Alexandre Oliva <ol...@gnu.org> wrote:
>>>> that is being processed inside the snapshot. This doesn't explain why the master database occasionally gets similarly corrupted, does it?
>>>
>>> Actually, scratch this bit for now. I don't really have proof that the master database actually gets corrupted while it's in use.
>>
>> Scratch the "scratch this". The master database actually gets corrupted, and it's with recently-created files, created after earlier known-good snapshots. So, it can't really be orphan processing, can it?
>
> Right, it can't be orphan processing.
>
>> Some more info from the errors and instrumentation:
>>
>> - no data syncing on the affected files is taking place. it's just memcpy()ing data in 4KiB-sized chunks onto mmap()ed areas, munmap()ing it, growing the file with ftruncate and mapping a subsequent chunk for further output
>>
>> - the NULs at the end of pages do NOT occur at munmap/mmap boundaries as I suspected at first, but they do coincide with the end of extents that are smaller than the maximum compressed extent size. So, something's making btrfs flush pages to disk before the pages are completely written (which is fine in principle), but apparently failing to pick up subsequent changes to the pages (eek!)
>
> With mmap the kernel can pick any given time to start writing out dirty pages. The idea is that if the application makes more changes the page becomes dirty again and the kernel writes it again.
>
> So the question is, can you trigger this without snapshots being done at all? I'll try to make an mmap tester here that hammers on the related code. We usually test this with fsx, which catches all kinds of horrors.

So my test program creates an 8GB file in chunks of 1MB each. Using truncate to extend the file and then mmap to write into the new hole. It is writing in 1MB chunks, ever so slightly not aligned.

After creating the whole file, it reads it back to look for errors. I'm running this with heavy memory pressure, but no snapshots. No corruptions yet, but I'll let it run a while longer.

-chris
Re: corruption of active mmapped files in btrfs snapshots
Quoting Alexandre Oliva (2013-03-19 01:20:10)
> On Mar 18, 2013, Chris Mason <chris.ma...@fusionio.com> wrote:
>> A few questions. Does leveldb use O_DIRECT and mmap together?
>
> No, it doesn't use O_DIRECT at all. Its I/O interface is very simplified: it just opens each new file (database chunks limited to 2MB) with O_CREAT|O_RDWR|O_TRUNC, and then uses ftruncate, mmap, msync, munmap and fdatasync. It doesn't seem to modify data once it's written; it only appends. Reading data back from it uses a completely different class interface, using separate descriptors and using pread only.
>
>> (the source of a write being pages that are mmap'd from somewhere else) That's the most likely place for this kind of problem.
>
> AFAICT the source of the memcpy()s that append to the file are malloc()ed memory.
>
>> Also, you mention crc errors. Are those reported by btrfs or are they application level crcs.
>
> These are CRCs leveldb computes and writes out after each db block. No btrfs CRC errors are reported in this process.

Ok, so we have three moving pieces here:

1) leveldb truncating the files
2) leveldb using mmap to write
3) btrfs snapshots

My guess is the truncate is creating an orphan item that is being processed inside the snapshot.

Is it possible to create a smaller leveldb unit test that we might use to exercise all of this?

-chris
Re: corruption of active mmapped files in btrfs snapshots
A few questions. Does leveldb use O_DIRECT and mmap together? (the source of a write being pages that are mmap'd from somewhere else) That's the most likely place for this kind of problem.

Also, you mention crc errors. Are those reported by btrfs or are they application level crcs?

Thanks for all the time you spent tracking it down this far.

-chris

Quoting Alexandre Oliva (2013-03-18 17:14:41)
> For quite a while, I've experienced oddities with snapshotted Firefox _CACHE_00?_ files, whose checksums (and contents) would change after the btrfs snapshot was taken, and would even change depending on how the file was brought to memory (e.g., rsyncing it to backup storage vs checking its md5sum before or after the rsync). This only affected these cache files, so I didn't give it too much attention.
>
> A similar problem seems to affect the leveldb databases maintained by ceph within the periodic snapshots it takes of its object storage volumes. I'm told others using ceph on filesystems other than btrfs are not observing this problem, which makes me think it's not memory corruption within ceph itself. I've looked into this for a bit, and I'm now inclined to believe it has to do with some bad interaction of mmap and snapshots; I'm not sure the fact that the filesystem has compression enabled has any effect, but that's certainly a possibility.
>
> leveldb does not modify file contents once they're initialized, it only appends to files, ftruncate()ing them to about a MB early on, mmap()ping that in and memcpy()ing blocks of various sizes to the end of the output buffer, occasionally msync()ing the maps, or running fdatasync if it didn't msync a map before munmap()ping it. If it runs out of space in a map, it munmap()s the previously mapped range, truncates the file to a larger size, then maps in the new tail of the file, starting at the page it should append to next.
>
> What I'm observing is that some btrfs snapshots taken by ceph osds, containing the leveldb database, are corrupted, causing crashes during the use of the database. I've scripted regular checks of osd snapshots, saving the last-known-good database along with the first one that displays the corruption. Studying about two dozen failures over the weekend, that took place on all of 13 btrfs-based osds on 3 servers running btrfs as in 3.8.3(-gnu), I noticed that all of the corrupted databases had a similar pattern: a stream of NULs of varying sizes at the end of a page, starting at a block boundary (leveldb doesn't do page-sized blocking, so blocks can start anywhere in a page), and ending close to the beginning of the next page, although not exactly at the page boundary; 20 bytes past the page boundary seemed to be the most common size, but the occasional presence of NULs in the database contents makes it harder to tell for sure.
>
> The stream of NULs ended in the middle of a database block (meaning it was not the beginning of a subsequent database block written later; the beginning of the database block was partially replaced with NULs). Furthermore, the checksum fails to match on this one partially-NULed block. Since the checksum is computed just before the block and the checksum trailer are memcpy()ed to the mmap()ed area, it is a certainty that the block was copied entirely to the right place at some point, and if part of it became zeros, it's either because the modification was partially lost, or because the mmapped buffer was partially overwritten.
>
> The fact that all instances of corruption I looked at were correct right to the end of one block boundary, and then all zeros instead of the beginning of the subsequent block to the end of that page, makes a failure to write that modified page seem more likely in my mind (more so given the Firefox _CACHE_ file oddities in snapshots); intense memory pressure at the time of the corruption also seems to favor this possibility.
>
> Now, it could be that btrfs requires those who modify SHARED mmap()ed files to take extra precautions to make sure that data makes it to a subsequent snapshot, along the lines of msync MS_ASYNC, and leveldb does not take this sort of precaution. However, I noticed that the unexpected stream of zeros after a prior block and before the rest of the subsequent block *remains* in subsequent snapshots, which to me indicates the page update is effectively lost. This explains why even the running osd, that operates on the "current" subvolumes from which snapshots for recovery are taken, occasionally crashes because of database corruption, and will later fail to restart from an earlier snapshot due to that same corruption.
>
> Does this problem sound familiar to anyone else?
>
> Should mmaped-file writers in general do more than umount or msync to ensure changes make it to subsequent snapshots that are supposed to be consistent?
>
> Any tips on where to start looking so as to fix the problem, or even to confirm that the problem is indeed
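The append path Alexandre describes (grow the file with ftruncate, mmap the tail starting at the page to append to next, memcpy records in, msync/munmap, repeat) can be sketched as below. This is an illustration of the pattern, not leveldb's actual code; the function names, the 1MB grow size, and the trim-on-finish step are invented for the example, and records are assumed smaller than the grow size.

```c
/* Sketch of leveldb-style append-only output via mmap, as described
 * above: extend the file, map the tail, memcpy() records into it,
 * then msync()/munmap() and map a new tail.  Illustrative only. */
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define MAP_CHUNK (1024 * 1024)	/* grow the file 1MB at a time */
#define PAGE_MASK 4095UL

static char *map;		/* current mapped tail */
static size_t map_len;
static size_t map_used;		/* bytes of the map already written */
static off_t file_end;		/* logical end of the appended data */

static int remap_tail(int fd)
{
	if (map && munmap(map, map_len))
		return -1;
	/* start the new map at the page containing the append offset */
	off_t aligned = file_end & ~(off_t)PAGE_MASK;
	map_used = file_end - aligned;
	map_len = map_used + MAP_CHUNK;
	if (ftruncate(fd, aligned + map_len))
		return -1;
	map = mmap(0, map_len, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fd, aligned);
	return map == MAP_FAILED ? -1 : 0;
}

static int append_record(int fd, const void *rec, size_t len)
{
	if (!map || map_used + len > map_len) {
		if (remap_tail(fd))
			return -1;
	}
	memcpy(map + map_used, rec, len);
	map_used += len;
	file_end += len;
	return 0;
}

static int finish(int fd)
{
	if (map) {
		if (msync(map, map_len, MS_SYNC) || munmap(map, map_len))
			return -1;
		map = NULL;
	}
	return ftruncate(fd, file_end);	/* trim the preallocated tail */
}
```

Every append here dirties pages purely through the mapping, with no write() in sight, which is exactly the traffic the compressed-writeback race acts on.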
Re: ceph-on-btrfs inline-cow regression fix for 3.4.3
On Tue, Jun 12, 2012 at 09:46:26PM -0600, Alexandre Oliva wrote:
> Hi, Greg,
>
> There's a btrfs regression in 3.4 that's causing a lot of grief to ceph-on-btrfs users like myself. This small and nice patch cures it. It's in Linus' master already. I've been running it on top of 3.4.2, and it would be very convenient for me if this could be in 3.4.3.

Ack, this can definitely go to 3.4-stable. Thanks Alexandre.

-chris
Re: Btrfs slowdown with ceph (how to reproduce)
On Tue, Jan 24, 2012 at 08:15:58PM +0100, Martin Mailand wrote:
> Hi,
> I tried the branch on one of my ceph osds, and there is a big difference in the performance. The average request size stayed high, but after around an hour the kernel crashed.
>
> IOstat: http://pastebin.com/xjuriJ6J
> Kernel trace: http://pastebin.com/SYE95GgH

Aha, this I know how to fix. Thanks for trying it out.

-chris
Re: Btrfs slowdown with ceph (how to reproduce)
On Mon, Jan 23, 2012 at 01:19:29PM -0500, Josef Bacik wrote:
> On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
>> As you might know, I have been seeing btrfs slowdowns in our ceph cluster for quite some time. Even with the latest btrfs code for 3.3 I'm still seeing these problems. To make things reproducible, I've now written a small test that imitates ceph's behavior:
>>
>> On a freshly created btrfs filesystem (2 TB size, mounted with noatime,nodiratime,compress=lzo,space_cache,inode_cache) I'm opening 100 files. After that I'm doing random writes on these files with a sync_file_range after each write (each write has a size of 100 bytes) and an ioctl(BTRFS_IOC_SYNC) after every 100 writes.
>>
>> After approximately 20 minutes, write activity suddenly increases fourfold and the average request size decreases (see chart in the attachment). You can find IOstat output here: http://pastebin.com/Smbfg1aG
>>
>> I hope that you are able to trace down the problem with the test program in the attachment.
>
> Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's tree and formatted the fs with 64k node and leaf sizes, and the problem appeared to go away. So surprise surprise, fragmentation is biting us in the ass. If you can, try running that branch with 64k node and leaf sizes with your ceph cluster and see how that works out. Of course you should only do that if you don't mind losing everything :). Thanks,

Please keep in mind this branch is only out there for development, and it really might have huge flaws. scrub doesn't work with it correctly right now, and the IO error recovery code is probably broken too.

Long term though, I think the bigger block sizes are going to make a huge difference in these workloads.

If you use the very dangerous code:

mkfs.btrfs -l 64k -n 64k /dev/xxx

(-l is leaf size, -n is node size). 64K is the max right now; 32K may help just as much at a lower CPU cost.

-chris
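Christian's reproducer (100 open files, 100-byte random writes, a sync_file_range() after each write, a btrfs transaction sync every 100 writes) can be sketched roughly as below. This is a reconstruction from his description, not his attached program: the constants, the file-span size, and the local BTRFS_IOC_SYNC definition (taken from btrfs's ioctl ABI) are assumptions and should be double-checked; the ioctl simply fails harmlessly when the directory is not on btrfs.

```c
/* Repro sketch: small random pwrite()s across a set of files, each
 * followed by sync_file_range(), with a btrfs transaction sync every
 * SYNC_EVERY writes.  Constants are illustrative, not Christian's. */
#define _GNU_SOURCE		/* for sync_file_range() */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define BTRFS_IOC_SYNC	_IO(0x94, 8)	/* assumed from btrfs's ioctl ABI */
#define NFILES		100
#define WRITES		1000
#define WRITE_SIZE	100
#define SYNC_EVERY	100
#define FILE_SPAN	(16 * 1024 * 1024)	/* spread writes over 16MB */

int hammer(const char *dir)
{
	int fds[NFILES];
	char buf[WRITE_SIZE];
	char path[4096];
	int i;

	memset(buf, 'x', sizeof(buf));
	for (i = 0; i < NFILES; i++) {
		snprintf(path, sizeof(path), "%s/f%d", dir, i);
		fds[i] = open(path, O_CREAT | O_RDWR, 0600);
		if (fds[i] < 0)
			return -1;
	}
	srandom(1);
	for (i = 0; i < WRITES; i++) {
		int f = random() % NFILES;
		off_t pos = random() % FILE_SPAN;

		if (pwrite(fds[f], buf, WRITE_SIZE, pos) != WRITE_SIZE)
			return -1;
		sync_file_range(fds[f], pos, WRITE_SIZE,
				SYNC_FILE_RANGE_WRITE);
		if (i % SYNC_EVERY == 0)
			ioctl(fds[f], BTRFS_IOC_SYNC);	/* no-op off btrfs */
	}
	for (i = 0; i < NFILES; i++)
		close(fds[i]);
	return 0;
}
```

Run long enough on a compress=lzo filesystem, the interesting part is watching iostat for the point where request sizes collapse, as in Christian's chart.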
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 04:22:48PM -0400, Josef Bacik wrote:
> On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
>> On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
>>> On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
>>>> Attached is a perf-report. I have included the whole report, so that you can see the difference between the good and the bad btrfs-endio-wri.
>>>
>>> We also shouldn't be running run_ordered_operations, man this is screwed up, thanks so much for this, I should be able to nail this down pretty easily. Thanks,
>>
>> Looks like we're getting there from reserve_metadata_bytes when we join the transaction?
>
> We don't do reservations in the endio stuff, we assume you've reserved all the space you need in delalloc, plus we would have seen reserve_metadata_bytes in the trace. Though it does look like perf is lying to us in at least one case, since btrfs_alloc_logged_file_extent is only called from log replay and not during normal runtime, so it definitely shouldn't be showing up. Thanks,

Whoops, I should have read that num_items > 0 check harder. btrfs_end_transaction is doing it by setting ->blocked = 1:

	if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) &&
	    should_end_transaction(trans, root)) {
		trans->transaction->blocked = 1;
		smp_wmb();
	}

	if (lock && cur_trans->blocked && !cur_trans->in_commit) {
		if (throttle) {
			/*
			 * We may race with somebody else here so end up having
			 * to call end_transaction on ourselves again, so inc
			 * our use_count.
			 */
			trans->use_count++;
			return btrfs_commit_transaction(trans, root);
		} else {
			wake_up_process(info->transaction_kthread);
		}
	}

perf is definitely lying a little bit about the trace ;)

-chris
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
>> Attached is a perf-report. I have included the whole report, so that you can see the difference between the good and the bad btrfs-endio-wri.
>
> We also shouldn't be running run_ordered_operations, man this is screwed up, thanks so much for this, I should be able to nail this down pretty easily. Thanks,

Looks like we're getting there from reserve_metadata_bytes when we join the transaction?

-chris
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Mon, Oct 24, 2011 at 03:51:47PM -0400, Josef Bacik wrote:
> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
>> [adding linux-btrfs to cc]
>>
>> Josef, Chris, any ideas on the below issues?
>>
>> On Mon, 24 Oct 2011, Christian Brunner wrote:
>>> Thanks for explaining this. I don't have any objections against btrfs as an osd filesystem. Even the fact that there is no btrfs-fsck doesn't scare me, since I can use the ceph replication to recover a lost btrfs-filesystem. The only problem I have is that btrfs is not stable on our side, and I wonder what you are doing to make it work. (Maybe it's related to the load pattern of using ceph as a backend store for qemu.)
>>>
>>> Here is a list of the btrfs problems I'm having:
>>>
>>> - When I run ceph with the default configuration (btrfs snaps enabled) I can see a rapid increase in Disk-I/O after a few hours of uptime. Btrfs-cleaner is using more and more time in btrfs_clean_old_snapshots().
>>
>> In theory, there shouldn't be any significant difference between taking a snapshot and removing it a few commits later, and the prior root refs that btrfs holds on to internally until the new commit is complete. That's clearly not quite the case, though. In any case, we're going to try to reproduce this issue in our environment.
>
> I've noticed this problem too, clean_old_snapshots is taking quite a while in cases where it really shouldn't. I will see if I can come up with a reproducer that doesn't require setting up ceph ;).

This sounds familiar though, I thought we had fixed a similar regression. Either way, Arne's readahead code should really help. Which kernel version were you running?

[ ack on the rest of Josef's comments ]

-chris
Re: Btrfs slowdown
Excerpts from Christian Brunner's message of 2011-07-25 03:54:47 -0400:
> Hi,
>
> we are running a ceph cluster with btrfs as its base filesystem (kernel 3.0). At the beginning everything worked very well, but after a few days (2-3) things are getting very slow.
>
> When I look at the object store servers I see heavy disk I/O on the btrfs filesystems (disk utilization is between 60% and 100%). I also did some tracing on the Ceph Object-Store Daemon, but I'm quite certain that the majority of the disk I/O is not caused by ceph or any other userland process.
>
> When I reboot the system(s) the problems go away for another 2-3 days, but after that, it starts again. I'm not sure if the problem is related to the kernel warning I've reported last week. At least there is no temporal relationship between the warning and the slowdown.
>
> Any hints on how to trace this would be welcome.

The easiest way to trace this is with latencytop. Apply this patch:

http://oss.oracle.com/~mason/latencytop.patch

And then use latencytop -c for a few minutes while the system is slow. Send the output here and hopefully we'll be able to figure it out.

-chris
Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected
Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:

[ two different btrfs crashes ]

I think your two crashes in btrfs were from the uninitialized variables, and those should be fixed in rc2.

When I did my bisection, my criterion for success/failure was "did mkcephfs succeed?". When I apply this criterion to a recent Linus kernel (e.g. 06e86849cf4019), which includes the fix you mentioned (aa0467d8d2a00e), I get still a different failure mode, which doesn't actually reference btrfs:

[  276.364178] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
[  276.365127] IP: [<ffffffffa05434b1>] journal_start+0x3e/0x9c [jbd]

Looking at the resulting code in the oops, we're here in journal_start:

	if (handle) {
		J_ASSERT(handle->h_transaction->t_journal == journal);

handle comes from current->journal_info, and we're doing a deref on handle->h_transaction, which is probably 0xa. So, we're leaving crud in current->journal_info and ext3 is finding it. Perhaps it's from ceph starting a transaction but leaving it running? The bug came with Josef's transaction performance fixes, but it is probably a mixture of his code with the ioctls ceph is using.
[ rest of the oops below for context ]

-chris

[  276.365127] PGD 1e4469067 PUD 1e1658067 PMD 0
[  276.365127] Oops: 0000 [#1] SMP
[  276.365127] CPU 2
[  276.365127] Modules linked in: btrfs zlib_deflate lzo_compress ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp i2c_dev i2c_core ext3 jbd scsi_transport_iscsi rds ib_ipoib rdma_ucm rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod video sbs sbshc pci_slot battery acpi_pad ac kvm sg ses sd_mod enclosure megaraid_sas ide_cd_mod cdrom ib_mthca ib_mad qla2xxx button ib_core serio_raw scsi_transport_fc scsi_tgt dcdbas ata_piix libata tpm_tis tpm i5k_amb ioatdma tpm_bios hwmon iTCO_wdt scsi_mod i5000_edac iTCO_vendor_support ehci_hcd dca edac_core uhci_hcd pcspkr rtc nfs nfs_acl auth_rpcgss fscache lockd sunrpc tg3 bnx2 e1000 [last unloaded: freq_table]
[  276.365127]
[  276.365127] Pid: 6076, comm: cosd Not tainted 3.0.0-rc2-00196-g06e8684 #26 Dell Inc. PowerEdge 1950/0DT097
[  276.365127] RIP: 0010:[<ffffffffa05434b1>]  [<ffffffffa05434b1>] journal_start+0x3e/0x9c [jbd]
[  276.365127] RSP: 0018:ffff8801e2897b28  EFLAGS: 00010286
[  276.365127] RAX: 000000000000000a RBX: ffff8801de8e1090 RCX: 0000000000000002
[  276.365127] RDX: 0000000019b2d000 RSI: 000000000000000e RDI: 000000000000000e
[  276.365127] RBP: ffff8801e2897b48 R08: 0000000000000003 R09: ffff8801e2897c38
[  276.365127] R10: ffff8801e2897ed8 R11: 0000000000000001 R12: ffff880223ff4400
[  276.365127] R13: ffff880218522d60 R14: 0000000000000ec6 R15: ffff88021f54d878
[  276.365127] FS:  00007f8ff0bbb710() GS:88022fc8() knlGS:0000000000000000
[  276.365127] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  276.365127] CR2: 000000000000000a CR3: 000000021744f000 CR4: 00000000000006e0
[  276.365127] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  276.365127] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  276.365127] Process cosd (pid: 6076, threadinfo ffff8801e2896000, task ffff880218522d60)
[  276.365127] Stack:
[  276.365127]  ffff8801e2897b68 ffffea000756e788 ffff88021f54d728 ffff8801e2897c78
[  276.365127]  ffff8801e2897b58 ffffffffa05670ce ffff8801e2897b68 ffffffffa055c72d
[  276.365127]  ffff8801e2897be8 ffffffffa055f044 ffff8801e2897c38 0000000000000074
[  276.365127] Call Trace:
[  276.365127]  [<ffffffffa05670ce>] ext3_journal_start_sb+0x4f/0x51 [ext3]
[  276.365127]  [<ffffffffa055c72d>] ext3_journal_start+0x12/0x14 [ext3]
[  276.365127]  [<ffffffffa055f044>] ext3_write_begin+0x93/0x1a1 [ext3]
[  276.365127]  [<ffffffff810c6f0e>] ? __kunmap_atomic+0xe/0x10
[  276.365127]  [<ffffffff810c75e5>] generic_perform_write+0xb1/0x172
[  276.365127]  [<ffffffff81036a33>] ? need_resched+0x23/0x2d
[  276.365127]  [<ffffffff810c76ea>] generic_file_buffered_write+0x44/0x6f
[  276.365127]  [<ffffffff810c91f5>] __generic_file_aio_write+0x253/0x2a8
[  276.365127]  [<ffffffff810c92ad>] generic_file_aio_write+0x63/0xb8
[  276.365127]  [<ffffffff81113b26>] do_sync_write+0xc7/0x10b
[  276.365127]  [<ffffffff81036a4b>] ? should_resched+0xe/0x2f
[  276.365127]  [<ffffffff813b0faf>] ? _cond_resched+0xe/0x22
[  276.365127]  [<ffffffff811986c3>] ? security_file_permission+0x2c/0x31
[  276.365127]  [<ffffffff81113d21>] ? rw_verify_area+0xac/0xdb
[  276.365127]  [<ffffffff81114253>] vfs_write+0xac/0xe4
[  276.365127]  [<ffffffff8111434f>] sys_write+0x4c/0x71
[  276.365127]  [<ffffffff813b8beb>] system_call_fastpath+0x16/0x1b
[  276.365127] Code: 89 fc 48 c7 c3 e2 ff ff ff 89 f7 65 4c 8b 2c 25 c0 b5 00 00 4d 85 e4 49 8b 85 48 06 00 00 74