Re: corruption of active mmapped files in btrfs snapshots

2013-03-25 Thread Chris Mason
Quoting Chris Mason (2013-03-22 16:31:42)
 Going through the code here, when I change the test to truncate once in
 the very beginning, I still get errors.  So, it isn't an interaction
 between mmap and truncate.  It must be a problem between lzo and mmap.

With compression off, we use clear_page_dirty_for_io to create a wall
between applications using mmap and our crc code.  Once we call
clear_page_dirty_for_io, it means we're in the process of writing the
page and anyone using mmap must wait (by calling page_mkwrite) before
they are allowed to change the page.

We use it with compression on as well, but it only ends up protecting
the crcs.  It gets called after the compression is done, which allows
applications to race in and modify the pages while we are compressing
them.

This patch changes our compression code to call clear_page_dirty_for_io
before we compress, and then redirty the pages if the compression fails.

Alexandre, many thanks for tracking this down into a well-defined use
case.

-chris

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f173c5a..cdee391 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1257,6 +1257,39 @@ int unlock_extent(struct extent_io_tree *tree, u64 start, u64 end)
 				GFP_NOFS);
 }
 
+int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
+{
+	unsigned long index = start >> PAGE_CACHE_SHIFT;
+	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
+	struct page *page;
+
+	while (index <= end_index) {
+		page = find_get_page(inode->i_mapping, index);
+		BUG_ON(!page); /* Pages should be in the extent_io_tree */
+		clear_page_dirty_for_io(page);
+		page_cache_release(page);
+		index++;
+	}
+	return 0;
+}
+
+int extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end)
+{
+	unsigned long index = start >> PAGE_CACHE_SHIFT;
+	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
+	struct page *page;
+
+	while (index <= end_index) {
+		page = find_get_page(inode->i_mapping, index);
+		BUG_ON(!page); /* Pages should be in the extent_io_tree */
+		account_page_redirty(page);
+		__set_page_dirty_nobuffers(page);
+		page_cache_release(page);
+		index++;
+	}
+	return 0;
+}
+
 /*
  * helper function to set both pages and extents in the tree writeback
  */
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 6068a19..258c921 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -325,6 +325,8 @@ int map_private_extent_buffer(struct extent_buffer *eb, unsigned long offset,
  unsigned long *map_len);
 int extent_range_uptodate(struct extent_io_tree *tree,
  u64 start, u64 end);
+int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end);
+int extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end);
 int extent_clear_unlock_delalloc(struct inode *inode,
struct extent_io_tree *tree,
u64 start, u64 end, struct page *locked_page,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ca1b767..88d4a18 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -353,6 +353,7 @@ static noinline int compress_file_range(struct inode *inode,
int i;
int will_compress;
	int compress_type = root->fs_info->compress_type;
+   int redirty = 0;
 
/* if this is a small write inside eof, kick off a defrag */
	if ((end - start + 1) < 16 * 1024 &&
@@ -415,6 +416,8 @@ again:
	if (BTRFS_I(inode)->force_compress)
		compress_type = BTRFS_I(inode)->force_compress;
 
+   extent_range_clear_dirty_for_io(inode, start, end);
+   redirty = 1;
ret = btrfs_compress_pages(compress_type,
   inode->i_mapping, start,
   total_compressed, pages,
@@ -554,6 +557,8 @@ cleanup_and_bail_uncompressed:
__set_page_dirty_nobuffers(locked_page);
/* unlocked later on in the async handlers */
}
+   if (redirty)
+   extent_range_redirty_for_io(inode, start, end);
add_async_extent(async_cow, start, end - start + 1,
 0, NULL, 0, BTRFS_COMPRESS_NONE);
*num_added += 1;


Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread Chris Mason
Quoting Alexandre Oliva (2013-03-22 01:27:42)
 On Mar 21, 2013, Chris Mason <chris.ma...@fusionio.com> wrote:
 
  Quoting Chris Mason (2013-03-21 14:06:14)
  With mmap the kernel can pick any given time to start writing out dirty
  pages.  The idea is that if the application makes more changes the page
  becomes dirty again and the kernel writes it again.
 
 That's the theory.  But what if there's some race between the time the
 page is frozen for compressing and the time it's marked as clean, or
 it's marked as clean after it's further modified, or a subsequent write
 to the same page ends up overridden by the background compression of the
 old contents of the page?  These are all possibilities that come to mind
 without knowing much about btrfs inner workings.

Definitely, there is a lot of room for racing.  Are you using
compression in btrfs or just in leveldb?

 
  So the question is, can you trigger this without snapshots being done
  at all?
 
 I haven't tried, but I now have a program that hit the error condition
 while taking snapshots in background with small time perturbations to
 increase the likelihood of hitting a race condition at the exact time.
 It uses leveldb's infrastructure for the mmapping, but it shouldn't be
 too hard to adapt it so that it doesn't.
 
  So my test program creates an 8GB file in chunks of 1MB each.
 
 That's probably too large a chunk to write at a time.  The bug is
 exercised with writes slightly smaller than a single page (although
 straddling across two consecutive pages).
 
 This half-baked test program (hereby provided under the terms of the GNU
 GPLv3+) creates a btrfs subvolume and two files in it: one in which I/O
 will be performed with write()s, another that will get the same data
 appended with leveldb's mmap-based output interface.  Random block
 sizes, as well as milli and microsecond timing perturbations, are read
 from /dev/urandom, and the rest of the output buffer is filled with
 (char)1.
 
 The test that actually failed (on the first try!, after some other
 variations that didn't fail) didn't have any of the #ifdef options
 enabled (i.e., no -D* flags during compilation), but it triggered the
 exact failure observed with ceph: zeros at the end of a page where there
 should have been nonzero data, followed by nonzero data on the following
 page!  That was within snapshots, not in the main subvol, but hopefully
 it's the same problem, just a bit harder to trigger.

I'd like to take snapshots out of the picture for a minute.  We need
some way to synchronize the leveldb with snapshotting because the
snapshot is basically the same thing as a crash from a db point of view.

Corrupting the main database file is a much different (and bigger)
problem.

-chris



Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread Chris Mason
Quoting Alexandre Oliva (2013-03-22 10:17:30)
 On Mar 22, 2013, Chris Mason <clma...@fusionio.com> wrote:
 
  Are you using compression in btrfs or just in leveldb?
 
 btrfs lzo compression.

Perfect, I'll focus on that part of things.

 
  I'd like to take snapshots out of the picture for a minute.
 
 That's understandable, I guess, but I don't know that anyone has ever
 got the problem without snapshots.  I mean, even when the master copy of
 the database got corrupted, snapshots of the subvol containing it were
 being taken every now and again, because that's the way ceph works.

Hopefully Sage can comment, but the basic idea is that if you snapshot a
database file the db must participate.  If it doesn't, it really is the
same effect as crashing the box.

Something is definitely broken if we're corrupting the source files
(either with or without snapshots), but avoiding incomplete writes in
the snapshot files requires synchronization with the db.

-chris


Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread Chris Mason
In this case, I think Alexandre is scanning for zeros in the file.   The
incomplete writes will definitely show that.

-chris

Quoting Samuel Just (2013-03-22 13:06:41)
 Incomplete writes for leveldb should just result in lost updates, not
 corruption.  Also, we do stop writes before the snapshot is initiated
 so there should be no in-progress writes to leveldb other than leveldb
 compaction (though that might be something to investigate).
 -Sam
 
 On Fri, Mar 22, 2013 at 7:26 AM, Chris Mason <clma...@fusionio.com> wrote:
  Quoting Alexandre Oliva (2013-03-22 10:17:30)
  On Mar 22, 2013, Chris Mason <clma...@fusionio.com> wrote:
 
   Are you using compression in btrfs or just in leveldb?
 
  btrfs lzo compression.
 
  Perfect, I'll focus on that part of things.
 
 
   I'd like to take snapshots out of the picture for a minute.
 
  That's understandable, I guess, but I don't know that anyone has ever
  got the problem without snapshots.  I mean, even when the master copy of
  the database got corrupted, snapshots of the subvol containing it were
  being taken every now and again, because that's the way ceph works.
 
  Hopefully Sage can comment, but the basic idea is that if you snapshot a
  database file the db must participate.  If it doesn't, it really is the
  same effect as crashing the box.
 
  Something is definitely broken if we're corrupting the source files
  (either with or without snapshots), but avoiding incomplete writes in
  the snapshot files requires synchronization with the db.
 
  -chris


Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread Chris Mason
[ mmap corruptions with leveldb and btrfs compression ]

I ran this a number of times with compression off and wasn't able to
trigger problems.  With compress=lzo, I see errors on every run.

Compile: gcc -Wall -o mmap-trunc mmap-trunc.c
Run: ./mmap-trunc file_name

The basic idea is to create a 256MB file in steps.  Each step ftruncates
the file larger, and then mmaps a region for writing.  It dirties some
unaligned bytes (a little more than 8K), and then munmaps.

Then a verify stage goes back through the file to make sure the data we
wrote is really there.  I'm using a simple rotating pattern of chars
that compress very well.

I run it in batches of 100 with some memory pressure on the side:

for x in `seq 1 100` ; do (mmap-trunc f$x &) ; done

#define _FILE_OFFSET_BITS 64
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define FILE_SIZE ((loff_t)256 * 1024 * 1024)
/* make a painfully unaligned chunk size */
#define CHUNK_SIZE (8192 + 932)

#define mmap_align(x) (((x) + 4095) & ~4095)

char *file_name = NULL;

void mmap_one_chunk(int fd, loff_t *cur_size, unsigned char *file_buf)
{
	int ret;
	loff_t new_size = *cur_size + CHUNK_SIZE;
	loff_t pos = *cur_size;
	unsigned long map_size = mmap_align(CHUNK_SIZE) + 4096;
	char val = file_buf[0];
	char *p;
	int extra;

	/* step one, truncate out a hole */
	ret = ftruncate(fd, new_size);
	if (ret) {
		perror("truncate");
		exit(1);
	}

	if (val == 0 || val == 'z')
		val = 'a';
	else
		val++;

	memset(file_buf, val, CHUNK_SIZE);

	extra = pos & 4095;
	p = mmap(0, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
		 pos - extra);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	memcpy(p + extra, file_buf, CHUNK_SIZE);

	ret = munmap(p, map_size);
	if (ret) {
		perror("munmap");
		exit(1);
	}
	*cur_size = new_size;
}

void check_chunks(int fd)
{
	char *p;
	loff_t checked = 0;
	char val = 'a';
	int i;
	int errors = 0;
	int ret;
	int extra;
	unsigned long map_size = mmap_align(CHUNK_SIZE) + 4096;

	fprintf(stderr, "checking chunks\n");
	while (checked < FILE_SIZE) {
		extra = checked & 4095;
		p = mmap(0, map_size, PROT_READ,
			 MAP_SHARED, fd, checked - extra);
		if (p == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		for (i = 0; i < CHUNK_SIZE; i++) {
			if (p[i + extra] != val) {
				fprintf(stderr, "%s: bad val %x wanted %x "
					"offset 0x%llx\n",
					file_name, p[i + extra], val,
					(unsigned long long)checked + i);
				errors++;
			}
		}
		if (val == 'z')
			val = 'a';
		else
			val++;
		ret = munmap(p, map_size);
		if (ret) {
			perror("munmap");
			exit(1);
		}
		checked += CHUNK_SIZE;
	}
	printf("%s found %d errors\n", file_name, errors);
	if (errors)
		exit(1);
}

int main(int ac, char **av)
{
	unsigned char *file_buf;
	loff_t pos = 0;
	int ret;
	int fd;

	if (ac < 2) {
		fprintf(stderr, "usage: mmap-trunc filename\n");
		exit(1);
	}

	ret = posix_memalign((void **)&file_buf, 4096, CHUNK_SIZE);
	if (ret) {
		perror("cannot allocate memory\n");
		exit(1);
	}

	file_buf[0] = 0;

	file_name = av[1];

	fprintf(stderr, "running test on %s\n", file_name);

	unlink(file_name);
	fd = open(file_name, O_RDWR | O_CREAT, 0600);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	fprintf(stderr, "writing chunks\n");
	while (pos < FILE_SIZE) {
		mmap_one_chunk(fd, &pos, file_buf);
	}
	check_chunks(fd);
	return 0;
}


Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread Chris Mason
Quoting Chris Mason (2013-03-22 14:07:05)
 [ mmap corruptions with leveldb and btrfs compression ]
 
 I ran this a number of times with compression off and wasn't able to
 trigger problems.  With compress=lzo, I see errors on every run.
 
 Compile: gcc -Wall -o mmap-trunc mmap-trunc.c
 Run: ./mmap-trunc file_name
 
 The basic idea is to create a 256MB file in steps.  Each step ftruncates
 the file larger, and then mmaps a region for writing.  It dirties some
 unaligned bytes (a little more than 8K), and then munmaps.
 
 Then a verify stage goes back through the file to make sure the data we
 wrote is really there.  I'm using a simple rotating pattern of chars
 that compress very well.

Going through the code here, when I change the test to truncate once in
the very beginning, I still get errors.  So, it isn't an interaction
between mmap and truncate.  It must be a problem between lzo and mmap.

-chris


Re: corruption of active mmapped files in btrfs snapshots

2013-03-21 Thread Chris Mason
Quoting Chris Mason (2013-03-21 14:06:14)
 Quoting Alexandre Oliva (2013-03-21 03:14:02)
  On Mar 19, 2013, Alexandre Oliva <ol...@gnu.org> wrote:
  
   On Mar 19, 2013, Alexandre Oliva <ol...@gnu.org> wrote:
   that is being processed inside the snapshot.
  
   This doesn't explain why the master database occasionally gets similarly
   corrupted, does it?
  
   Actually, scratch this bit for now.  I don't really have proof that the
   master database actually gets corrupted while it's in use
  
  Scratch the “scratch this”.  The master database actually gets
  corrupted, and it's with recently-created files, created after earlier
  known-good snapshots.  So, it can't really be orphan processing, can it?
 
 Right, it can't be orphan processing.
 
  
  Some more info from the errors and instrumentation:
  
  - no data syncing on the affected files is taking place.  it's just
memcpy()ing data in 4KiB-sized chunks onto mmap()ed areas,
munmap()ing it, growing the file with ftruncate and mapping a
subsequent chunk for further output
  
  - the NULs at the end of pages do NOT occur at munmap/mmap boundaries as
I suspected at first, but they do coincide with the end of extents
that are smaller than the maximum compressed extent size.  So,
something's making btrfs flush pages to disk before the pages are
completely written (which is fine in principle), but apparently
failing to pick up subsequent changes to the pages (eek!)
 
 With mmap the kernel can pick any given time to start writing out dirty
 pages.  The idea is that if the application makes more changes the page
 becomes dirty again and the kernel writes it again.
 
 So the question is, can you trigger this without snapshots being done
 at all?  I'll try to make an mmap tester here that hammers on the
 related code.  We usually test this with fsx, which catches all kinds of
 horrors.

So my test program creates an 8GB file in chunks of 1MB each.  Using
truncate to extend the file and then mmap to write into the new hole.
It is writing in 1MB chunks, ever so slightly not aligned.  After
creating the whole file, it reads it back to look for errors.

I'm running this with heavy memory pressure, but no snapshots.  No
corruptions yet, but I'll let it run a while longer.

-chris



Re: corruption of active mmapped files in btrfs snapshots

2013-03-19 Thread Chris Mason
Quoting Alexandre Oliva (2013-03-19 01:20:10)
 On Mar 18, 2013, Chris Mason <chris.ma...@fusionio.com> wrote:
 
  A few questions.  Does leveldb use O_DIRECT and mmap together?
 
 No, it doesn't use O_DIRECT at all.  Its I/O interface is very
 simplified: it just opens each new file (database chunks limited to 2MB)
 with O_CREAT|O_RDWR|O_TRUNC, and then uses ftruncate, mmap, msync,
 munmap and fdatasync.  It doesn't seem to modify data once it's written;
 it only appends.  Reading data back from it uses a completely different
 class interface, using separate descriptors and using pread only.
 
  (the source of a write being pages that are mmap'd from somewhere
  else)
 
 AFAICT the source of the memcpy()s that append to the file are
 malloc()ed memory.
 
  That's the most likely place for this kind of problem.  Also, you
  mention crc errors.  Are those reported by btrfs or are they application
  level crcs.
 
 These are CRCs leveldb computes and writes out after each db block.  No
 btrfs CRC errors are reported in this process.

Ok, so we have three moving pieces here.

1) leveldb truncating the files
2) leveldb using mmap to write
3) btrfs snapshots

My guess is the truncate is creating an orphan item that is being
processed inside the snapshot.

Is it possible to create a smaller leveldb unit test that we might use
to exercise all of this?
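
[ For reference, a minimal, self-contained sketch of the append pattern
  described above; this is not leveldb code, and the file name, window
  size, record size and helper names are illustrative only. ]

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WINDOW (1024 * 1024)		/* truncate-ahead window, roughly 1MB */

static char *map;			/* currently mapped tail of the file */
static size_t map_used;			/* bytes of the window already written */
static off_t map_file_off;		/* file offset where the window begins */

static void remap_tail(int fd, off_t append_off)
{
	long pg = sysconf(_SC_PAGESIZE);

	if (map) {
		msync(map, map_used, MS_ASYNC);	/* the occasional msync */
		munmap(map, WINDOW);
	}
	/* map the new tail starting at the page the next append lands in */
	map_file_off = append_off & ~((off_t)pg - 1);
	if (ftruncate(fd, map_file_off + WINDOW)) {	/* grow the file first */
		perror("ftruncate");
		exit(1);
	}
	map = mmap(NULL, WINDOW, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fd, map_file_off);
	if (map == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	map_used = append_off - map_file_off;
}

static void append_record(int fd, const void *buf, size_t len, off_t *append_off)
{
	if (!map || map_used + len > WINDOW)
		remap_tail(fd, *append_off);
	memcpy(map + map_used, buf, len);	/* the write goes through the mapping */
	map_used += len;
	*append_off += len;
}

int main(void)
{
	char rec[932];			/* deliberately unaligned record size */
	off_t off = 0;
	int i, fd;

	fd = open("append-test.db", O_CREAT | O_RDWR | O_TRUNC, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(rec, 'a', sizeof(rec));
	for (i = 0; i < 4096; i++)
		append_record(fd, rec, sizeof(rec), &off);
	munmap(map, WINDOW);
	ftruncate(fd, off);		/* trim the truncate-ahead slack */
	fdatasync(fd);
	close(fd);
	return 0;
}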

-chris



Re: corruption of active mmapped files in btrfs snapshots

2013-03-18 Thread Chris Mason
A few questions.  Does leveldb use O_DIRECT and mmap together? (the
source of a write being pages that are mmap'd from somewhere else)

That's the most likely place for this kind of problem.  Also, you
mention crc errors.  Are those reported by btrfs or are they application
level crcs.

Thanks for all the time you spent tracking it down this far.

-chris

Quoting Alexandre Oliva (2013-03-18 17:14:41)
 For quite a while, I've experienced oddities with snapshotted Firefox
 _CACHE_00?_ files, whose checksums (and contents) would change after the
 btrfs snapshot was taken, and would even change depending on how the
 file was brought to memory (e.g., rsyncing it to backup storage vs
 checking its md5sum before or after the rsync).  This only affected
 these cache files, so I didn't give it too much attention.
 
 A similar problem seems to affect the leveldb databases maintained by
 ceph within the periodic snapshots it takes of its object storage
 volumes.  I'm told others using ceph on filesystems other than btrfs are
 not observing this problem, which makes me think it's not memory
 corruption within ceph itself.  I've looked into this for a bit, and I'm
 now inclined to believe it has to do with some bad interaction of mmap
 and snapshots; I'm not sure the fact that the filesystem has compression
 enabled has any effect, but that's certainly a possibility.
 
 leveldb does not modify file contents once they're initialized, it only
 appends to files, ftruncate()ing them to about a MB early on, mmap()ping
 that in and memcpy()ing blocks of various sizes to the end of the output
 buffer, occasionally msync()ing the maps, or running fdatasync if it
 didn't msync a map before munmap()ping it.  If it runs out of space in a
 map, it munmap()s the previously mapped range, truncates the file to a
 larger size, then maps in the new tail of the file, starting at the page
 it should append to next.
 
 What I'm observing is that some btrfs snapshots taken by ceph osds,
 containing the leveldb database, are corrupted, causing crashes during
 the use of the database.
 
 I've scripted regular checks of osd snapshots, saving the
 last-known-good database along with the first one that displays the
 corruption.  Studying about two dozen failures over the weekend, that
 took place on all of 13 btrfs-based osds on 3 servers running btrfs as
 in 3.8.3(-gnu), I noticed that all of the corrupted databases had a
 similar pattern: a stream of NULs of varying sizes at the end of a page,
 starting at a block boundary (leveldb doesn't do page-sized blocking, so
 blocks can start anywhere in a page), and ending close to the beginning
 of the next page, although not exactly at the page boundary; 20 bytes
 past the page boundary seemed to be the most common size, but the
 occasional presence of NULs in the database contents makes it harder to
 tell for sure.
 
 The stream of NULs ended in the middle of a database block (meaning it
 was not the beginning of a subsequent database block written later; the
 beginning of the database block was partially replaced with NULs).
 Furthermore, the checksum fails to match on this one partially-NULed
 block.  Since the checksum is computed just before the block and the
 checksum trailer are memcpy()ed to the mmap()ed area, it is a certainty
 that the block was copied entirely to the right place at some point, and
 if part of it became zeros, it's either because the modification was
 partially lost, or because the mmapped buffer was partially overwritten.
 The fact that all instances of corruption I looked at were correct right
 to the end of one block boundary, and then all zeros instead of the
 beginning of the subsequent block to the end of that page, makes a
 failure to write that modified page seem more likely in my mind (more so
 given the Firefox _CACHE_ file oddities in snapshots); intense memory
 pressure at the time of the corruption also seems to favor this
 possibility.
 
 Now, it could be that btrfs requires those who modify SHARED mmap()ed
 files so as to make sure that data makes it to a subsequent snapshot,
 along the lines of msync MS_ASYNC, and leveldb does not take this sort
 of precaution.  However, I noticed that the unexpected stream of zeros
 after a prior block and before the rest of the subsequent block
 *remains* in subsequent snapshots, which to me indicates the page update
 is effectively lost.  This explains why even the running osd, which
 operates on the “current” subvolumes from which snapshots for recovery
 are taken, occasionally crashes because of database corruption, and will
 later fail to restart from an earlier snapshot due to that same
 corruption.
 
 
 Does this problem sound familiar to anyone else?
 
 Should mmaped-file writers in general do more than umount or msync to
 ensure changes make it to subsequent snapshots that are supposed to be
 consistent?
 
 Any tips on where to start looking so as to fix the problem, or even to
 confirm that the problem is indeed 

Re: ceph-on-btrfs inline-cow regression fix for 3.4.3

2012-06-13 Thread Chris Mason
On Tue, Jun 12, 2012 at 09:46:26PM -0600, Alexandre Oliva wrote:
 Hi, Greg,
 
 There's a btrfs regression in 3.4 that's causing a lot of grief to
 ceph-on-btrfs users like myself.  This small and nice patch cures it.
 It's in Linus' master already.  I've been running it on top of 3.4.2,
 and it would be very convenient for me if this could be in 3.4.3.

Ack, this can definitely to go 3.4-stable.  Thanks Alexandre.

-chris


Re: Btrfs slowdown with ceph (how to reproduce)

2012-01-24 Thread Chris Mason
On Tue, Jan 24, 2012 at 08:15:58PM +0100, Martin Mailand wrote:
 Hi
 I tried the branch on one of my ceph osd, and there is a big
 difference in the performance.
 The average request size stayed high, but after around a hour the
 kernel crashed.
 
 IOstat
 http://pastebin.com/xjuriJ6J
 
 Kernel trace
 http://pastebin.com/SYE95GgH

Aha, this I know how to fix.  Thanks for trying it out.

-chris


Re: Btrfs slowdown with ceph (how to reproduce)

2012-01-23 Thread Chris Mason
On Mon, Jan 23, 2012 at 01:19:29PM -0500, Josef Bacik wrote:
 On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
  As you might know, I have been seeing btrfs slowdowns in our ceph
  cluster for quite some time. Even with the latest btrfs code for 3.3
  I'm still seeing these problems. To make things reproducible, I've now
   written a small test that imitates ceph's behavior:
  
  On a freshly created btrfs filesystem (2 TB size, mounted with
  noatime,nodiratime,compress=lzo,space_cache,inode_cache) I'm opening
  100 files. After that I'm doing random writes on these files with a
  sync_file_range after each write (each write has a size of 100 bytes)
  and ioctl(BTRFS_IOC_SYNC) after every 100 writes.
  
  After approximately 20 minutes, write activity suddenly increases
  fourfold and the average request size decreases (see chart in the
  attachment).
  
  You can find IOstat output here: http://pastebin.com/Smbfg1aG
  
  I hope that you are able to trace down the problem with the test
  program in the attachment.
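
[ The test program attachment is not reproduced in this archive.  A minimal
  sketch of the workload described above might look like the following; the
  file names, offset range and loop bounds are assumptions, and BTRFS_IOC_SYNC
  is defined inline to keep the sketch self-contained. ]

#define _GNU_SOURCE			/* for sync_file_range() */
#include <sys/ioctl.h>
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BTRFS_IOC_SYNC	_IO(0x94, 8)	/* as in the btrfs ioctl header */

#define NUM_FILES	100
#define WRITE_SIZE	100
#define FILE_SPAN	(64 * 1024 * 1024)	/* illustrative offset range */

int main(void)
{
	char name[64], buf[WRITE_SIZE] = { 0 };
	int fd[NUM_FILES], i;
	long n;

	for (i = 0; i < NUM_FILES; i++) {
		snprintf(name, sizeof(name), "testfile.%03d", i);
		fd[i] = open(name, O_CREAT | O_RDWR, 0600);
		if (fd[i] < 0) {
			perror("open");
			exit(1);
		}
	}
	/* run until interrupted; the slowdown reportedly shows up after ~20 min */
	for (n = 0; ; n++) {
		off_t off = (off_t)(rand() % (FILE_SPAN / WRITE_SIZE)) * WRITE_SIZE;

		i = rand() % NUM_FILES;
		if (pwrite(fd[i], buf, WRITE_SIZE, off) != WRITE_SIZE) {
			perror("pwrite");
			exit(1);
		}
		/* flush this write, then a full btrfs sync every 100 writes */
		sync_file_range(fd[i], off, WRITE_SIZE, SYNC_FILE_RANGE_WRITE);
		if (n % 100 == 99)
			ioctl(fd[i], BTRFS_IOC_SYNC);
	}
	return 0;
}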
  
 Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's tree and
 formatted the fs with 64k node and leaf sizes and the problem appeared to go
 away.  So surprise surprise fragmentation is biting us in the ass.  If you can
 try running that branch with 64k node and leaf sizes with your ceph cluster and
 see how that works out.  'Course you should only do that if you don't mind if
 you lose everything :).  Thanks,
 

Please keep in mind this branch is only out there for development, and
it really might have huge flaws.  scrub doesn't work with it correctly
right now, and the IO error recovery code is probably broken too.

Long term though, I think the bigger block sizes are going to make a
huge difference in these workloads.

If you use the very dangerous code:

mkfs.btrfs -l 64k -n 64k /dev/xxx

(-l is leaf size, -n is node size).

64K is the max right now, 32K may help just as much at a lower CPU cost.
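
[ The lower-CPU variant mentioned above would presumably be created the same
  way, just with the smaller sizes: ]

mkfs.btrfs -l 32k -n 32k /dev/xxx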

-chris



Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-26 Thread Chris Mason
On Tue, Oct 25, 2011 at 04:22:48PM -0400, Josef Bacik wrote:
 On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
  On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
   On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:

Attached is a perf-report. I have included the whole report, so that
you can see the difference between the good and the bad
btrfs-endio-wri.
   
   
    We also shouldn't be running run_ordered_operations, man this is screwed up,
    thanks so much for this, I should be able to nail this down pretty easily.
    Thanks,
  
  Looks like we're getting there from reserve_metadata_bytes when we join
  the transaction?
 
 
 We don't do reservations in the endio stuff, we assume you've reserved all the
 space you need in delalloc, plus we would have seen reserve_metadata_bytes in
 the trace.  Though it does look like perf is lying to us in at least one case
 since btrfs_alloc_logged_file_extent is only called from log replay and not
 during normal runtime, so it definitely shouldn't be showing up.  Thanks,

Whoops, I should have read that num_items > 0 check harder.

btrfs_end_transaction is doing it by setting ->blocked = 1

if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) &&
should_end_transaction(trans, root)) {
trans->transaction->blocked = 1;
^
smp_wmb();
}

   if (lock && cur_trans->blocked && !cur_trans->in_commit) {
   ^^^
if (throttle) {
/*
 * We may race with somebody else here so end up having
 * to call end_transaction on ourselves again, so inc
 * our use_count.
 */
trans->use_count++;
return btrfs_commit_transaction(trans, root);
} else {
wake_up_process(info->transaction_kthread);
}
}

perf is definitely lying a little bit about the trace ;)

-chris



Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Chris Mason
On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
 On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
  
  Attached is a perf-report. I have included the whole report, so that
  you can see the difference between the good and the bad
  btrfs-endio-wri.
 
 
 We also shouldn't be running run_ordered_operations, man this is screwed up,
 thanks so much for this, I should be able to nail this down pretty easily.
 Thanks,

Looks like we're getting there from reserve_metadata_bytes when we join
the transaction?

-chris


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-24 Thread Chris Mason
On Mon, Oct 24, 2011 at 03:51:47PM -0400, Josef Bacik wrote:
 On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
  [adding linux-btrfs to cc]
  
  Josef, Chris, any ideas on the below issues?
  
  On Mon, 24 Oct 2011, Christian Brunner wrote:
   Thanks for explaining this. I don't have any objections against btrfs
   as a osd filesystem. Even the fact that there is no btrfs-fsck doesn't
   scare me, since I can use the ceph replication to recover a lost
   btrfs-filesystem. The only problem I have is, that btrfs is not stable
   on our side and I wonder what you are doing to make it work. (Maybe
   it's related to the load pattern of using ceph as a backend store for
   qemu).
   
   Here is a list of the btrfs problems I'm having:
   
   - When I run ceph with the default configuration (btrfs snaps enabled)
   I can see a rapid increase in Disk-I/O after a few hours of uptime.
   Btrfs-cleaner is using more and more time in
   btrfs_clean_old_snapshots().
  
  In theory, there shouldn't be any significant difference between taking a 
  snapshot and removing it a few commits later, and the prior root refs that 
  btrfs holds on to internally until the new commit is complete.  That's 
  clearly not quite the case, though.
  
  In any case, we're going to try to reproduce this issue in our 
  environment.
  
 
 I've noticed this problem too, clean_old_snapshots is taking quite a while in
 cases where it really shouldn't.  I will see if I can come up with a 
 reproducer
 that doesn't require setting up ceph ;).

This sounds familiar though, I thought we had fixed a similar
regression.  Either way, Arne's readahead code should really help.

Which kernel version were you running?

[ ack on the rest of Josef's comments ]

-chris


Re: Btrfs slowdown

2011-07-25 Thread Chris Mason
Excerpts from Christian Brunner's message of 2011-07-25 03:54:47 -0400:
 Hi,
 
 we are running a ceph cluster with btrfs as its base filesystem
 (kernel 3.0). At the beginning everything worked very well, but after
 a few days (2-3) things are getting very slow.
 
 When I look at the object store servers I see heavy disk-i/o on the
 btrfs filesystems (disk utilization is between 60% and 100%). I also
 did some tracing on the Ceph-Object-Store-Daemon, but I'm quite
 certain that the majority of the disk I/O is not caused by ceph or
 any other userland process.
 
 When I reboot the system(s), the problems go away for another 2-3 days,
 but after that, it starts again. I'm not sure if the problem is
 related to the kernel warning I've reported last week. At least there
 is no temporal relationship between the warning and the slowdown.
 
 Any hints on how to trace this would be welcome.

The easiest way to trace this is with latencytop.

Apply this patch:

http://oss.oracle.com/~mason/latencytop.patch

And then use latencytop -c for a few minutes while the system is slow.
Send the output here and hopefully we'll be able to figure it out.

-chris


Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Chris Mason
Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:

[ two different btrfs crashes ]

I think your two crashes in btrfs were from the uninit variables and
those should be fixed in rc2.

 When I did my bisection, my criterion for success/failure was
 "did mkcephfs succeed?".  When I apply this criterion to a recent
 linus kernel (e.g. 06e86849cf4019), which includes the fix you
 mentioned (aa0467d8d2a00e), I get still a different failure mode,
 which doesn't actually reference btrfs:
 
 [  276.364178] BUG: unable to handle kernel NULL pointer dereference at 
 000a
 [  276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd]

Looking at the resulting code in the oops, we're here in journal_start:

if (handle) {
J_ASSERT(handle->h_transaction->t_journal == journal);

handle comes from current->journal_info, and we're doing a deref on
handle->h_transaction, which is probably 0xa.

So, we're leaving crud in current->journal_info and ext3 is finding it.

Perhaps it's from ceph starting a transaction but leaving it running?
The bug came with Josef's transaction performance fixes, but it is
probably a mixture of his code with the ioctls ceph is using.

[ rest of the oops below for context ]

-chris

 [  276.365127] PGD 1e4469067 PUD 1e1658067 PMD 0
 [  276.365127] Oops:  [#1] SMP
 [  276.365127] CPU 2
 [  276.365127] Modules linked in: btrfs zlib_deflate lzo_compress 
 ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state 
 nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge 
 stp i2c_dev i2c_core ext3 jbd scsi_transport_iscsi rds ib_ipoib rdma_ucm 
 rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror 
 dm_region_hash dm_log dm_multipath scsi_dh dm_mod video sbs sbshc pci_slot 
 battery acpi_pad ac kvm sg ses sd_mod enclosure megaraid_sas ide_cd_mod cdrom 
 ib_mthca ib_mad qla2xxx button ib_core serio_raw scsi_transport_fc scsi_tgt 
 dcdbas ata_piix libata tpm_tis tpm i5k_amb ioatdma tpm_bios hwmon iTCO_wdt 
 scsi_mod i5000_edac iTCO_vendor_support ehci_hcd dca edac_core uhci_hcd 
 pcspkr rtc nfs nfs_acl auth_rpcgss fscache lockd sunrpc tg3 bnx2 e1000 [last 
 unloaded: freq_table]
 [  276.365127]
 [  276.365127] Pid: 6076, comm: cosd Not tainted 3.0.0-rc2-00196-g06e8684 #26 
 Dell Inc. PowerEdge 1950/0DT097
 [  276.365127] RIP: 0010:[a05434b1]  [a05434b1] 
 journal_start+0x3e/0x9c [jbd]
 [  276.365127] RSP: 0018:8801e2897b28  EFLAGS: 00010286
 [  276.365127] RAX: 000a RBX: 8801de8e1090 RCX: 
 0002
 [  276.365127] RDX: 19b2d000 RSI: 000e RDI: 
 000e
 [  276.365127] RBP: 8801e2897b48 R08: 0003 R09: 
 8801e2897c38
 [  276.365127] R10: 8801e2897ed8 R11: 0001 R12: 
 880223ff4400
 [  276.365127] R13: 880218522d60 R14: 0ec6 R15: 
 88021f54d878
 [  276.365127] FS:  7f8ff0bbb710() GS:88022fc8() 
 knlGS:
 [  276.365127] CS:  0010 DS:  ES:  CR0: 8005003b
 [  276.365127] CR2: 000a CR3: 00021744f000 CR4: 
 06e0
 [  276.365127] DR0:  DR1:  DR2: 
 
 [  276.365127] DR3:  DR6: 0ff0 DR7: 
 0400
 [  276.365127] Process cosd (pid: 6076, threadinfo 8801e2896000, task 
 880218522d60)
 [  276.365127] Stack:
 [  276.365127]  8801e2897b68 ea000756e788 88021f54d728 
 8801e2897c78
 [  276.365127]  8801e2897b58 a05670ce 8801e2897b68 
 a055c72d
 [  276.365127]  8801e2897be8 a055f044 8801e2897c38 
 0074
 [  276.365127] Call Trace:
 [  276.365127]  [a05670ce] ext3_journal_start_sb+0x4f/0x51 [ext3]
 [  276.365127]  [a055c72d] ext3_journal_start+0x12/0x14 [ext3]
 [  276.365127]  [a055f044] ext3_write_begin+0x93/0x1a1 [ext3]
 [  276.365127]  [810c6f0e] ? __kunmap_atomic+0xe/0x10
 [  276.365127]  [810c75e5] generic_perform_write+0xb1/0x172
 [  276.365127]  [81036a33] ? need_resched+0x23/0x2d
 [  276.365127]  [810c76ea] generic_file_buffered_write+0x44/0x6f
 [  276.365127]  [810c91f5] __generic_file_aio_write+0x253/0x2a8
 [  276.365127]  [810c92ad] generic_file_aio_write+0x63/0xb8
 [  276.365127]  [81113b26] do_sync_write+0xc7/0x10b
 [  276.365127]  [81036a4b] ? should_resched+0xe/0x2f
 [  276.365127]  [813b0faf] ? _cond_resched+0xe/0x22
 [  276.365127]  [811986c3] ? security_file_permission+0x2c/0x31
 [  276.365127]  [81113d21] ? rw_verify_area+0xac/0xdb
 [  276.365127]  [81114253] vfs_write+0xac/0xe4
 [  276.365127]  [8111434f] sys_write+0x4c/0x71
 [  276.365127]  [813b8beb] system_call_fastpath+0x16/0x1b
 [  276.365127] Code: 89 fc 48 c7 c3 e2 ff ff ff 89 f7 65 4c 8b 2c 25 c0 b5 00 
 00 4d 85 e4 49 8b 85 48 06 00 00 74