[radosgw] Race condition corrupting data on COPY ?
Hi,

I've just noticed something rather worrying on our cluster. Some files are apparently truncated. From the first look I had at it, it happened on files where there was a metadata update right after the file was stored. The exact sequence was:

 - PUT to store the file
 - GET to get the file (which at that point is still correct and has the proper length)
 - PUT using a 'copy source' over itself to update the metadata

all of these happening sequentially in the same second, very quickly. Then subsequent GETs return a truncated file.

I'm looking into it to narrow down the issue, but I wanted to know if anyone had seen something similar?

Cheers,

   Sylvain
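For illustration, here is a minimal boto (Python) sketch of the request sequence described above. The endpoint, credentials, bucket/key names and payload size are hypothetical stand-ins, not taken from the report; the last call issues a PUT with x-amz-copy-source pointing at the object itself, which is the usual way to rewrite only the metadata.

    import boto
    import boto.s3.connection

    # Hypothetical endpoint and credentials, for illustration only.
    conn = boto.connect_s3(
        aws_access_key_id='ACCESS', aws_secret_access_key='SECRET',
        host='s3.example.com', is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat())

    bucket = conn.get_bucket('some-bucket')
    key = bucket.new_key('some-object')
    payload = 'x' * (600 * 1024)                  # arbitrary payload

    key.set_contents_from_string(payload)         # 1. PUT: store the file
    assert len(key.get_contents_as_string()) == len(payload)   # 2. GET: still full length

    # 3. PUT with 'copy source' = the object itself, replacing only the metadata.
    key.copy(bucket.name, key.name,
             metadata={'x-example': 'new-value'}, preserve_acl=True)

In the reported failures, later GETs on objects that went through step 3 returned fewer bytes than were originally stored.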
Re: [radosgw] Race condition corrupting data on COPY ?
On Mon, Mar 18, 2013 at 2:50 AM, Sylvain Munaut s.mun...@whatever-company.com wrote:
> Hi,
>
> I've just noticed something rather worrying on our cluster. Some files are
> apparently truncated. From the first look I had at it, it happened on files
> where there was a metadata update right after the file was stored. The
> exact sequence was:
>
>  - PUT to store the file
>  - GET to get the file (which at that point is still correct and has the
>    proper length)
>  - PUT using a 'copy source' over itself to update the metadata
>
> all of these happening sequentially in the same second, very quickly. Then
> subsequent GETs return a truncated file.
>
> I'm looking into it to narrow down the issue, but I wanted to know if
> anyone had seen something similar?

What version are you using? Do you have logs?

Thanks,
Yehuda
Re: CephFS Space Accounting and Quotas
On 03/15/2013 05:17 PM, Greg Farnum wrote:
> [Putting list back on cc]
>
> On Friday, March 15, 2013 at 4:11 PM, Jim Schutt wrote:
>> On 03/15/2013 04:23 PM, Greg Farnum wrote:
>>> As I come back and look at these again, I'm not sure what the context
>>> for these logs is. Which test did they come from, and which behavior
>>> (slow or not slow, etc) did you see? :)
>>> -Greg
>>
>> They come from a test where I had debug mds = 20 and debug ms = 1 on the
>> MDS while writing files from 198 clients. It turns out that for some
>> reason I need debug mds = 20 during writing to reproduce the slow stat
>> behavior later.
>>
>> strace.find.dirs.txt.bz2 contains the log of running
>>   strace -tt -o strace.find.dirs.txt find /mnt/ceph/stripe-4M -type d -exec ls -lhd {} \;
>>
>> From that output, I believe that the stat of at least these files is slow:
>>   zero0.rc11 zero0.rc30 zero0.rc46 zero0.rc8 zero0.tc103 zero0.tc105 zero0.tc106
>> I believe that log shows slow stats on more files, but those are the
>> first few.
>>
>> mds.cs28.slow-stat.partial.bz2 contains the MDS log from just before the
>> find command started, until just after the fifth or sixth slow stat from
>> the list above.
>>
>> I haven't yet tried to find other ways of reproducing this, but so far it
>> appears that something happens during the writing of the files that ends
>> up causing the condition that results in slow stat commands.
>>
>> I have the full MDS log from the writing of the files, as well, but it's
>> big.
>>
>> Is that what you were after? Thanks for taking a look!
>> -- Jim
>
> I just was coming back to these to see what new information was available,
> but I realized we'd discussed several tests and I wasn't sure what these
> ones came from. That information is enough, yes. If in fact you believe
> you've only seen this with high-level MDS debugging, I believe the cause
> is as I mentioned last time: the MDS is flapping a bit and so some files
> get marked as needsrecover, but they aren't getting recovered
> asynchronously, and the first thing that pokes them into doing a recover
> is the stat.

OK, that makes sense.

> That's definitely not the behavior we want, and so I'll be poking around
> the code a bit and generating bugs, but given that explanation it's a bit
> less scary than random slow stats are, so it's not such a high priority. :)
> Do let me know if you come across it without the MDS and clients having
> had connection issues!

No problem - thanks!

-- Jim

> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [radosgw] Race condition corrupting data on COPY ?
Hi,

> What version are you using? Do you have logs?

I'm running a custom build of 0.56.3 + some patches (basically up to 7889c5412 + fixes for #4150 and #4177).

I don't have any radosgw log (debug level is set to 0 and it didn't output anything). I have the HTTP logs:

10.0.0.253 s3.svc - [14/Mar/2013:09:23:14 +] PUT /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34 HTTP/1.1 200 0 - Boto/2.6.0 (linux2)
10.0.0.74 s3.svc - [14/Mar/2013:09:23:14 +] GET /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34?Signature=XXX%3D&Expires=1363256594&AWSAccessKeyId=XXX HTTP/1.1 200 622080 - python-requests
10.0.0.253 s3.svc - [14/Mar/2013:09:23:14 +] PUT /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34 HTTP/1.1 200 146 - Boto/2.6.0 (linux2)
10.0.0.74 s3.svc - [14/Mar/2013:10:14:53 +] GET /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34?Signature=XXX%3D&Expires=1363258236&AWSAccessKeyId=XXX HTTP/1.1 200 461220 - python-requests

Cheers,

   Sylvain
Re: Ceph availability test recovering question
Hello,

I'm experiencing the same long-lasting problem - during recovery ops, some percentage of read I/O remains in flight for seconds, rendering the upper-level filesystem on the qemu client very slow and almost unusable. Different striping has almost no effect on the visible delays, and reads may be non-intensive at all but they are still very slow.

Here are some fio results for randread with small blocks, so it is not affected by readahead the way a linear read would be:

Intensive reads during recovery:
 lat (msec) : 2=0.01%, 4=0.08%, 10=1.87%, 20=4.17%, 50=8.34%
 lat (msec) : 100=13.93%, 250=2.77%, 500=1.19%, 750=25.13%, 1000=0.41%
 lat (msec) : 2000=15.45%, >=2000=26.66%

Same on a healthy cluster:
 lat (msec) : 20=0.33%, 50=9.17%, 100=23.35%, 250=25.47%, 750=6.53%
 lat (msec) : 1000=0.42%, 2000=34.17%, >=2000=0.56%

On Sun, Mar 17, 2013 at 8:18 AM, kelvin_hu...@wiwynn.com wrote:
> Hi, all
>
> I have some problems after an availability test.
>
> Setup:
> Linux kernel: 3.2.0
> OS: Ubuntu 12.04
> Storage server: 11 HDD (each storage server has 11 osd, 7200 rpm, 1T) + 10GbE NIC
> RAID card: LSI MegaRAID SAS 9260-4i
> For every HDD: RAID0, Write Policy: Write Back with BBU, Read Policy: ReadAhead, IO Policy: Direct
> Storage server number: 2
> Ceph version: 0.48.2
> Replicas: 2
> Monitor number: 3
>
> We have two storage servers as a cluster, and use a ceph client to create
> a 1T RBD image for testing. The client also has a 10GbE NIC, Linux kernel
> 3.2.0, Ubuntu 12.04. We also use fio to produce the workload.
>
> fio command:
> [Sequential Read]
> fio --iodepth=32 --numjobs=1 --runtime=120 --bs=65536 --rw=read --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10 --thinktime=10
> [Sequential Write]
> fio --iodepth=32 --numjobs=1 --runtime=120 --bs=65536 --rw=write --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10 --thinktime=10
>
> Now I want to observe the ceph state when one storage server crashes, so I
> turn off one storage server's networking. We expect that data write and
> read operations can resume quickly, or even not be suspended, during ceph
> recovery, but the experimental results show that write and read operations
> pause for about 20~30 seconds while ceph is recovering.
>
> My questions are:
> 1. Is the I/O pause normal while ceph is recovering?
> 2. Can the I/O pause not be avoided while ceph is recovering?
> 3. How can the I/O pause time be reduced?
>
> Thanks!!
Re: [radosgw] Race condition corrupting data on COPY ?
On Mon, Mar 18, 2013 at 7:40 AM, Sylvain Munaut s.mun...@whatever-company.com wrote:
> Hi,
>
>> What version are you using? Do you have logs?
>
> I'm running a custom build of 0.56.3 + some patches (basically up to
> 7889c5412 + fixes for #4150 and #4177).
>
> I don't have any radosgw log (debug level is set to 0 and it didn't output
> anything). I have the HTTP logs:
>
> 10.0.0.253 s3.svc - [14/Mar/2013:09:23:14 +] PUT /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34 HTTP/1.1 200 0 - Boto/2.6.0 (linux2)
> 10.0.0.74 s3.svc - [14/Mar/2013:09:23:14 +] GET /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34?Signature=XXX%3D&Expires=1363256594&AWSAccessKeyId=XXX HTTP/1.1 200 622080 - python-requests
> 10.0.0.253 s3.svc - [14/Mar/2013:09:23:14 +] PUT /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34 HTTP/1.1 200 146 - Boto/2.6.0 (linux2)
> 10.0.0.74 s3.svc - [14/Mar/2013:10:14:53 +] GET /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34?Signature=XXX%3D&Expires=1363258236&AWSAccessKeyId=XXX HTTP/1.1 200 461220 - python-requests

Can't make much out of it; will probably need rgw logs (and preferably with 'debug ms = 1' as well) for this issue.

Yehuda
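As a side note, a minimal ceph.conf sketch of how that kind of radosgw logging might be turned up; the section name, log path and debug level chosen here are illustrative assumptions, not settings taken from this thread:

    [client.radosgw.gateway]
        debug rgw = 20                         ; verbose radosgw logging (assumed level)
        debug ms = 1                           ; messenger-level logging, as requested above
        log file = /var/log/ceph/radosgw.log   ; assumed path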
Re: [radosgw] Race condition corrupting data on COPY ?
Hi,

> Can't make much out of it; will probably need rgw logs (and preferably
> with 'debug ms = 1' as well) for this issue.

Well, the problem is that I can't make it happen again ... it happened 4 times during an import of ~3000 files ... I'm trying to reproduce this on a test cluster but so far, no luck. I'll give it another shot tomorrow.

And I can't enable debug on prod for long periods; the space for logs is limited and would be filled in minutes with all the requests.

I also disabled the use of copy in production anyway, because I can't have it corrupting random customer files.

Cheers,

   Sylvain
corruption of active mmapped files in btrfs snapshots
For quite a while, I've experienced oddities with snapshotted Firefox _CACHE_00?_ files, whose checksums (and contents) would change after the btrfs snapshot was taken, and would even change depending on how the file was brought into memory (e.g., rsyncing it to backup storage vs checking its md5sum before or after the rsync). This only affected these cache files, so I didn't give it too much attention.

A similar problem seems to affect the leveldb databases maintained by ceph within the periodic snapshots it takes of its object storage volumes. I'm told others using ceph on filesystems other than btrfs are not observing this problem, which makes me think it's not memory corruption within ceph itself.

I've looked into this for a bit, and I'm now inclined to believe it has to do with some bad interaction of mmap and snapshots; I'm not sure whether the fact that the filesystem has compression enabled has any effect, but that's certainly a possibility.

leveldb does not modify file contents once they're initialized, it only appends to files, ftruncate()ing them to about a MB early on, mmap()ping that in and memcpy()ing blocks of various sizes to the end of the output buffer, occasionally msync()ing the maps, or running fdatasync if it didn't msync a map before munmap()ping it. If it runs out of space in a map, it munmap()s the previously mapped range, truncates the file to a larger size, then maps in the new tail of the file, starting at the page it should append to next.

What I'm observing is that some btrfs snapshots taken by ceph osds, containing the leveldb database, are corrupted, causing crashes during the use of the database.

I've scripted regular checks of osd snapshots, saving the last-known-good database along with the first one that displays the corruption. Studying about two dozen failures over the weekend, that took place on all of 13 btrfs-based osds on 3 servers running btrfs as in 3.8.3(-gnu), I noticed that all of the corrupted databases had a similar pattern: a stream of NULs of varying sizes at the end of a page, starting at a block boundary (leveldb doesn't do page-sized blocking, so blocks can start anywhere in a page), and ending close to the beginning of the next page, although not exactly at the page boundary; 20 bytes past the page boundary seemed to be the most common size, but the occasional presence of NULs in the database contents makes it harder to tell for sure.

The stream of NULs ended in the middle of a database block (meaning it was not the beginning of a subsequent database block written later; the beginning of the database block was partially replaced with NULs). Furthermore, the checksum fails to match on this one partially-NULed block. Since the checksum is computed just before the block and the checksum trailer are memcpy()ed to the mmap()ed area, it is a certainty that the block was copied entirely to the right place at some point, and if part of it became zeros, it's either because the modification was partially lost, or because the mmapped buffer was partially overwritten.

The fact that all instances of corruption I looked at were correct right to the end of one block boundary, and then all zeros instead of the beginning of the subsequent block to the end of that page, makes a failure to write that modified page seem more likely in my mind (more so given the Firefox _CACHE_ file oddities in snapshots); intense memory pressure at the time of the corruption also seems to favor this possibility.
Now, it could be that btrfs requires those who modify SHARED mmap()ed files to take extra steps to make sure that data makes it to a subsequent snapshot, along the lines of msync MS_ASYNC, and leveldb does not take this sort of precaution. However, I noticed that the unexpected stream of zeros after a prior block and before the rest of the subsequent block *remains* in subsequent snapshots, which to me indicates the page update is effectively lost. This explains why even the running osd, which operates on the “current” subvolumes from which snapshots for recovery are taken, occasionally crashes because of database corruption, and will later fail to restart from an earlier snapshot due to that same corruption.

Does this problem sound familiar to anyone else?

Should mmaped-file writers in general do more than umount or msync to ensure changes make it to subsequent snapshots that are supposed to be consistent?

Any tips on where to start looking so as to fix the problem, or even to confirm that the problem is indeed in btrfs?

TIA,

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer
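For readers unfamiliar with the append pattern described in the message above, here is a rough Python sketch of the same scheme. leveldb itself is C++; the file name, sizes, and the remap-the-whole-file simplification are assumptions made for illustration (the real code remaps only the tail of the file, starting at the next append page):

    import mmap, os

    fd = os.open("dbfile", os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o644)
    size = 1 << 20                      # ftruncate to about a MB early on
    os.ftruncate(fd, size)
    buf = mmap.mmap(fd, size)
    offset = 0                          # next append position

    def append(block):
        global buf, size, offset
        if offset + len(block) > size:  # ran out of space in the current map
            buf.flush()                 # msync the map
            buf.close()                 # munmap the previously mapped range
            size *= 2
            os.ftruncate(fd, size)      # truncate the file to a larger size
            buf = mmap.mmap(fd, size)   # map the grown file back in
        buf[offset:offset + len(block)] = block   # memcpy the block into the map
        offset += len(block)

    append(b"record bytes followed by a checksum trailer")

The point of the sketch is simply that all writes land in a shared, writable mapping and only reach the file through the kernel writing back those dirty pages; the corruption reported above looks like one such page update never making it into the snapshot.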
Re: Direct IO on CephFS for blocks larger than 8MB
On Saturday, March 16, 2013 at 5:38 AM, Henry C Chang wrote:
> The following patch should fix the problem.
>
> -Henry
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index e51558f..4bcbcb6 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -608,7 +608,7 @@ out:
>  	pos += len;
>  	written += len;
>  	left -= len;
> -	data += written;
> +	data += len;
>  	if (left)
>  		goto more;

This looks good to me. If you'd like to submit it as a proper patch with a sign-off I'll pull it into our tree. :)

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: Direct IO on CephFS for blocks larger than 8MB
On Mon, 18 Mar 2013, Greg Farnum wrote:
> On Saturday, March 16, 2013 at 5:38 AM, Henry C Chang wrote:
>> The following patch should fix the problem.
>>
>> -Henry
>>
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index e51558f..4bcbcb6 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -608,7 +608,7 @@ out:
>>  	pos += len;
>>  	written += len;
>>  	left -= len;
>> -	data += written;
>> +	data += len;
>>  	if (left)
>>  		goto more;
>
> This looks good to me. If you'd like to submit it as a proper patch with a
> sign-off I'll pull it into our tree. :)
> -Greg

I just added a quick test and it fixes it up. :)

sage
Re: corruption of active mmapped files in btrfs snapshots
While I wrote the previous email, a smoking gun formed on one of my servers: a snapshot that had passed a database consistency check turned out to be corrupted when I tried to roll back to it! Since the snapshot was not modified in any way between the initial scripted check and the later manual check, the problem must be in btrfs.

On Mar 18, 2013, Alexandre Oliva ol...@gnu.org wrote:
> I've scripted regular checks of osd snapshots, saving the last-known-good
> database along with the first one that displays the corruption. Studying
> about two dozen failures over the weekend, that took place on all of 13
> btrfs-based osds on 3 servers running btrfs as in 3.8.3(-gnu), I noticed
> that all of the corrupted databases had a similar pattern: a stream of
> NULs of varying sizes at the end of a page, starting at a block boundary
> (leveldb doesn't do page-sized blocking, so blocks can start anywhere in a
> page), and ending close to the beginning of the next page, although not
> exactly at the page boundary; 20 bytes past the page boundary seemed to be
> the most common size, but the occasional presence of NULs in the database
> contents makes it harder to tell for sure.

Additional corrupted snapshots collected today have confirmed this pattern, except that today I got several corrupted files with non-NULs right at the beginning of the page following the one that marked the beginning of the corrupted database block.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer
Re: corruption of active mmapped files in btrfs snapshots
A few questions. Does leveldb use O_DIRECT and mmap together? (The source of a write being pages that are mmap'd from somewhere else.) That's the most likely place for this kind of problem.

Also, you mention crc errors. Are those reported by btrfs, or are they application-level crcs?

Thanks for all the time you spent tracking it down this far.

-chris

Quoting Alexandre Oliva (2013-03-18 17:14:41)
> For quite a while, I've experienced oddities with snapshotted Firefox
> _CACHE_00?_ files, whose checksums (and contents) would change after the
> btrfs snapshot was taken, and would even change depending on how the file
> was brought into memory (e.g., rsyncing it to backup storage vs checking
> its md5sum before or after the rsync). This only affected these cache
> files, so I didn't give it too much attention.
>
> [...]
[PATCH] ceph: fix buffer pointer advance in ceph_sync_write
We should advance the user data pointer by _len_ instead of _written_. _len_ is the data length written in each iteration, while _written_ is the accumulated data length we have written out.

Signed-off-by: Henry C Chang henry.cy.ch...@gmail.com
---
 fs/ceph/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index e51558f..4bcbcb6 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -608,7 +608,7 @@ out:
 	pos += len;
 	written += len;
 	left -= len;
-	data += written;
+	data += len;
 	if (left)
 		goto more;
-- 
1.7.9.5
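To see why the one-line change matters, here is a small Python sketch (not the kernel code; the variable names merely mirror the C loop) of what happens to the source offset when it is advanced by the running total instead of the per-iteration length:

    def sync_write(data, chunk, buggy):
        out = b""
        pos = 0        # offset into the source buffer ("data" pointer in C)
        written = 0    # accumulated bytes written so far
        left = len(data)
        while left > 0:
            length = min(chunk, left)
            out += data[pos:pos + length]   # write out one chunk
            written += length
            left -= length
            # the bug: advancing by the running total skips over source bytes
            pos += written if buggy else length
        return out

    src = b"abcdefghij"
    assert sync_write(src, 3, buggy=False) == src   # fixed behaviour
    assert sync_write(src, 3, buggy=True) != src    # bug: wrong bytes go out

In the kernel the effect is worse than this sketch suggests: Python slicing just truncates, whereas the C code ends up copying from beyond the data the caller intended to write.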
Re: Direct IO on CephFS for blocks larger than 8MB
I just sent out the patch with sign-off. Thanks for testing.

2013/3/19 Sage Weil s...@inktank.com:
> On Mon, 18 Mar 2013, Greg Farnum wrote:
>> On Saturday, March 16, 2013 at 5:38 AM, Henry C Chang wrote:
>>> The following patch should fix the problem.
>>>
>>> -Henry
>>>
>>> [...]
>>
>> This looks good to me. If you'd like to submit it as a proper patch with
>> a sign-off I'll pull it into our tree. :)
>> -Greg
>
> I just added a quick test and it fixes it up. :)
>
> sage
Re: Ceph availability test recovering question
On 03/17/2013 05:18 AM, kelvin_hu...@wiwynn.com wrote:
> Hi, all

Hi,

> ...
> My questions are:
> 1. Is the I/O pause normal while ceph is recovering?

I have experienced the same issue. This works as designed, and is probably because of the heartbeat timeout: the 'osd heartbeat grace' period is set to 20 secs by default - see:
http://ceph.com/docs/master/rados/configuration/mon-osd-interaction/

> 2. Can the I/O pause not be avoided while ceph is recovering?

You can always lower the grace period and heartbeat time, though I don't know if this is a wise idea. Short networking interruptions might then mark your OSDs out very quickly.

> 3. How can the I/O pause time be reduced?

See the link above, or this one:
http://ceph.com/docs/master/rados/configuration/osd-config-ref/#monitor-osd-interaction

> Thanks!!

-- 
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz
IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbich...@risc-software.at
http://www.risc-software.at
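For reference, a minimal ceph.conf sketch of the knobs discussed above, per the linked documentation for that era; apart from the 20-second default grace mentioned above, the values shown are illustrative assumptions rather than recommendations from this thread:

    [osd]
        osd heartbeat grace = 20          ; default; how long before a silent peer is reported down
        osd heartbeat interval = 6        ; how often OSDs ping their peers
    [mon]
        mon osd down out interval = 300   ; how long before a down OSD is marked out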
Re: corruption of active mmapped files in btrfs snapshots
On Mar 18, 2013, Chris Mason chris.ma...@fusionio.com wrote:
> A few questions. Does leveldb use O_DIRECT and mmap together?

No, it doesn't use O_DIRECT at all. Its I/O interface is very simplified: it just opens each new file (database chunks limited to 2MB) with O_CREAT|O_RDWR|O_TRUNC, and then uses ftruncate, mmap, msync, munmap and fdatasync. It doesn't seem to modify data once it's written; it only appends. Reading data back from it uses a completely different class interface, using separate descriptors and pread only.

> (the source of a write being pages that are mmap'd from somewhere else)

AFAICT the source of the memcpy()s that append to the file is malloc()ed memory.

> That's the most likely place for this kind of problem. Also, you mention
> crc errors. Are those reported by btrfs or are they application-level crcs?

These are CRCs leveldb computes and writes out after each db block. No btrfs CRC errors are reported in this process.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer