Re: Modifying RBD image changes its snapshot
On Wednesday 26 of June 2013 14:42:27 Josh Durgin wrote:
> On 06/26/2013 11:15 AM, Josh Durgin wrote:
>> On 06/26/2013 05:40 AM, Karol Jurak wrote:
>>> Hi, I'm using ceph 0.56.6 and kernel 3.9.7 and it looks like modifying
>>> an RBD image also changes its snapshot. I can reproduce this as
>>> follows:
>>
>> Just reproduced on 3.10-rc7 as well. It seems the snapshot context
>> loading is broken for format 1 at least, since unmapping and mapping
>> after the snapshot exists still has the same problem. I added
>> http://tracker.ceph.com/issues/5464 to track this. Apparently the
>> regression test for this hasn't been running, or we would've caught
>> this sooner.
>
> There's a fix for this on the stable kernel in the wip-snapc-3.9.y
> branch of ceph-client.git.

Thanks for the fix. Snapshots work as expected now.

--
Karol Jurak
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Modifying RBD image changes its snapshot
Hi,

I'm using ceph 0.56.6 and kernel 3.9.7 and it looks like modifying an RBD
image also changes its snapshot. I can reproduce this as follows:

# create and map an image
rbd create --size 128 test-1
rbd map test-1

# write some data to the image
dd if=/dev/zero of=/dev/rbd/rbd/test-1 bs=1M count=128

# create and map a snapshot
rbd snap create test-1@snap-1
rbd map test-1@snap-1

# verify that image and snapshot are identical
md5sum /dev/rbd/rbd/test-1 /dev/rbd/rbd/test-1@snap-1
fde9e0818281836e4fc0edfede2b8762  /dev/rbd/rbd/test-1
fde9e0818281836e4fc0edfede2b8762  /dev/rbd/rbd/test-1@snap-1

# modify the image
dd if=/dev/urandom of=/dev/rbd/rbd/test-1 bs=512 count=1

# compare checksums again
md5sum /dev/rbd/rbd/test-1 /dev/rbd/rbd/test-1@snap-1
1d942c8a5bc7480cecb945ea0d020eed  /dev/rbd/rbd/test-1
1d942c8a5bc7480cecb945ea0d020eed  /dev/rbd/rbd/test-1@snap-1

Checksums are identical although the snapshot isn't supposed to be
modified.

--
Karol Jurak
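For readers unfamiliar with the expected semantics: a minimal toy model of
copy-on-write snapshots, not ceph code. The `CowImage` class and its names
are invented for illustration; the point is that a write performed under
the correct snapshot context clones the old data first, whereas a write
performed under a stale (empty) context clobbers the snapshot, which is
what the checksums above show.

```python
# Toy model of copy-on-write snapshot semantics (illustration only,
# not the rbd implementation).

class CowImage:
    def __init__(self, data: bytes):
        self.head = bytearray(data)
        self.snaps = {}          # snap name -> frozen clone (made on first write)
        self.snap_context = []   # snaps a write must preserve

    def snap_create(self, name: str):
        self.snaps[name] = None          # no clone yet; head is still shared
        self.snap_context.append(name)

    def write(self, offset: int, data: bytes):
        # Clone head for any snapshot in the context not yet cloned.
        for name in self.snap_context:
            if self.snaps[name] is None:
                self.snaps[name] = bytes(self.head)
        self.head[offset:offset + len(data)] = data

    def read_snap(self, name: str) -> bytes:
        clone = self.snaps[name]
        return clone if clone is not None else bytes(self.head)


# Expected behavior: the snapshot is preserved across a write.
img = CowImage(b"\x00" * 8)
img.snap_create("snap-1")
img.write(0, b"\xff")
assert img.read_snap("snap-1") == b"\x00" * 8
assert bytes(img.head) != img.read_snap("snap-1")

# The reported bug behaves as if the client wrote with a stale, empty
# snapshot context: no clone is made, so image and snapshot stay identical.
buggy = CowImage(b"\x00" * 8)
buggy.snap_create("snap-1")
buggy.snap_context = []          # simulate the stale context
buggy.write(0, b"\xff")
assert bytes(buggy.head) == buggy.read_snap("snap-1")
```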
Re: Writing to RBD image while its snapshot is being created causes I/O errors
On Monday 17 of June 2013 21:25:49 Sage Weil wrote:
> What kernel version are you using? I'm not able to reproduce this with
> ext4 or reiserfs and many snapshots over several minutes of write
> workload.

I'm using vanilla 3.4.48. However, this bug must already have been fixed
in 3.6.11 because everything worked fine when I tested this version today.

--
Karol Jurak
Re: Writing to RBD image while its snapshot is being created causes I/O errors
On Friday 14 of June 2013 08:56:55 Sage Weil wrote:
> On Fri, 14 Jun 2013, Karol Jurak wrote:
>> I noticed that writing to an RBD image using the kernel driver while
>> its snapshot is being created causes I/O errors, and the filesystem
>> (reiserfs) eventually aborts and remounts itself in read-only mode:
>
> This is definitely a bug; you should be able to create a snapshot at
> any time. After a rollback, it should look (to the fs) like a crash or
> power cycle.
>
> How easy is this to reproduce? Does it happen every time?

I can reproduce it in the following way:

# rbd create -s 10240 test
# rbd map test
# mkfs -t reiserfs /dev/rbd/rbd/test
# mount /dev/rbd/rbd/test /mnt/test
# dd if=/dev/zero of=/mnt/test/a bs=1M count=1024

and in another shell while dd is running:

# rbd snap create test@snap1

After 2 or 3 seconds dmesg shows I/O errors:

[429532.259910] end_request: I/O error, dev rbd1, sector 1384448
[429532.272554] end_request: I/O error, dev rbd1, sector 872
[429532.275556] REISERFS abort (device rbd1): Journal write error in flush_commit_list

and dd fails:

dd: writing `/mnt/test/a': Cannot allocate memory
590+0 records in
589+0 records out
618225664 bytes (618 MB) copied, 15.8701 s, 39.0 MB/s

This happens every time I repeat the test.

--
Karol Jurak
Writing to RBD image while its snapshot is being created causes I/O errors
Hi,

I noticed that writing to an RBD image using the kernel driver while its
snapshot is being created causes I/O errors, and the filesystem (reiserfs)
eventually aborts and remounts itself in read-only mode:

[192507.327359] end_request: I/O error, dev rbd7, sector 818528
[192507.331510] end_request: I/O error, dev rbd7, sector 819200
[192507.348478] end_request: I/O error, dev rbd7, sector 408
[192507.352647] REISERFS abort (device rbd7): Journal write error in flush_commit_list

Is this happening by design or is it a bug? I know that I should freeze
the filesystem before attempting to create a snapshot, but these I/O
errors seem unnecessary to me. If it's really impossible to write to an
image while its snapshot is being created, couldn't write requests be
blocked until snapshotting completes?

I'm using ceph 0.56.6 and kernel 3.4.48.

--
Karol Jurak
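The suggestion above (block writes until the snapshot completes, instead
of failing them) can be sketched as a small gate. This is a hypothetical
toy model, not the rbd driver; `SnapshotGate` and its methods are invented
names for illustration.

```python
# Toy model: hold write requests while a snapshot is in progress,
# instead of returning I/O errors (illustration only).

import threading

class SnapshotGate:
    def __init__(self):
        self._cond = threading.Condition()
        self._snapshotting = False

    def begin_snapshot(self):
        with self._cond:
            self._snapshotting = True

    def end_snapshot(self):
        with self._cond:
            self._snapshotting = False
            self._cond.notify_all()   # release any writes that were held

    def write(self, do_write):
        # Block (rather than error out) while a snapshot is being created.
        with self._cond:
            while self._snapshotting:
                self._cond.wait()
        do_write()


gate = SnapshotGate()
results = []
gate.begin_snapshot()
t = threading.Thread(target=gate.write, args=(lambda: results.append("w"),))
t.start()
t.join(timeout=0.2)
assert results == []          # write held while the snapshot is in progress
gate.end_snapshot()
t.join()
assert results == ["w"]       # write completes once the snapshot is done
```

The real driver would also need to drain writes already in flight before
declaring the snapshot point, which this sketch omits.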
Re: Journal too small
On Thursday 17 of May 2012 20:59:52 Josh Durgin wrote:
> On 05/17/2012 03:59 AM, Karol Jurak wrote:
>> How serious is such a situation? Do the OSDs know how to handle it
>> correctly? Or could this result in some data loss or corruption? After
>> the recovery finished (ceph -w showed that all PGs are in active+clean
>> state) I noticed that a few rbd images were corrupted.
>
> As Sage mentioned, the OSDs know how to handle full journals correctly.
> I'd like to figure out how your rbd images got corrupted, if possible.
> How did you notice the corruption? Has your cluster always run 0.46, or
> did you upgrade from earlier versions? What happened to the cluster
> between your last check for corruption and now? Did your use of it or
> any ceph client or server configuration change?

My question about the journal is actually connected to a larger case I'm
currently trying to investigate. The cluster initially ran v0.45 but I
upgraded it to v0.46 because of the issue I described in this bug report
(the upgrade didn't resolve it):

http://tracker.newdream.net/issues/2446

The cluster consisted of 26 OSDs and used a crushmap with a structure
identical to that of the default crushmap constructed during cluster
creation. It had the unknownrack which contained 26 hosts, and every host
contained one OSD. Problems started when one of my colleagues created and
installed into the cluster a new crush map which introduced a couple of
new racks, changed the placement rule to 'step chooseleaf firstn 0 type
rack' and changed the weights of most of the OSDs to 0 (they were meant to
be removed from the cluster). I don't have an exact copy of that crushmap
but my colleague reconstructed it from memory as best he could. It's
attached as new-crushmap.txt.

The OSDs reacted to the new crushmap by allocating large amounts of
memory. Most of them had only 1 or 2 GB of RAM. That proved to be not
enough and the Xen VMs hosting the OSDs crashed. It turned out later that
most of the OSDs required as much as 6 to 10 GB of memory to complete the
peering phase (ceph -w showed a large number of PGs in that state while
the OSDs were allocating memory). One factor which I think might have
played a significant role in this situation was the large number of
PGs - 2. Our idea was to incrementally build a cluster consisting of
approximately 200 OSDs, hence the 2 PGs.

I see some items in your issue tracker that look like they may be
addressing this large memory consumption issue:

http://tracker.newdream.net/issues/2321
http://tracker.newdream.net/issues/2041

I reverted to the default crushmap, changed the replication level to 1 and
marked all OSDs but 2 out. That allowed me to finally recover the cluster
and bring it online, but in the process all the OSDs crashed numerous
times. They were either killed by the OOM killer, or the whole VMs were
destroyed by me because they were unresponsive, or the OSDs crashed due to
failed asserts such as:

2012-05-10 13:07:39.869811 7f878645a700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f878645a700 time 2012-05-10 13:07:38.816680
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
 ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x270) [0x7a32e0]
 2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x7a34f7]
 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x7a3748]
 4: (CephContextServiceThread::entry()+0x5c) [0x64c27c]
 5: (()+0x68ba) [0x7f87888be8ba]
 6: (clone()+0x6d) [0x7f8786f4302d]

or

2012-05-10 16:33:30.437730 7f062e9c1700 -1 osd/PG.cc: In function 'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)' thread 7f062e9c1700 time 2012-05-10 16:33:30.369211
osd/PG.cc: 369: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
 ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
 1: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)+0x1f14) [0x77d894]
 2: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec const&)+0x2c5) [0x77dba5]
 3: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x213) [0x794d93]
 4: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x78c3cb]
 5: (PG::RecoveryState::handle_log(int, MOSDPGLog*, PG::RecoveryCtx*)+0x1a6) [0x745b76]
 6: (OSD::handle_pg_log(std
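For reference, a crush rule of the kind described above would look roughly
like the following. This is a hedged sketch, not the actual attached
new-crushmap.txt (which the author says was reconstructed from memory);
the rule name and the min/max sizes are assumptions.

```
rule replicated_by_rack {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        # firstn 0 means "choose as many leaves as the pool's replica
        # count", one per rack:
        step chooseleaf firstn 0 type rack
        step emit
}
```

With only a couple of racks defined and most OSD weights set to 0, such a
rule forces nearly every PG to remap at once, which is consistent with the
peering storm described above.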
Re: Journal too small
On Thursday 17 of May 2012 18:01:55 Sage Weil wrote:
> On Thu, 17 May 2012, Karol Jurak wrote:
>> Hi,
>>
>> During an ongoing recovery in one of my clusters a couple of OSDs
>> complained about a too small journal. For instance:
>>
>> 2012-05-12 13:31:04.034144 7f491061d700 1 journal check_for_full at 863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000 start 863363072)
>> 2012-05-12 13:31:04.034680 7f491061d700 0 journal JOURNAL TOO SMALL: item 1693745152 > journal 1048571904 (usable)
>>
>> I was under the impression that the OSDs stopped participating in
>> recovery after this event. (ceph -w showed that the number of PGs in
>> state active+clean no longer increased.) They resumed recovery after I
>> enlarged their journals (stop osd, --flush-journal, --mkjournal, start
>> osd). How serious is such a situation? Do the OSDs know how to handle
>> it correctly? Or could this result in some data loss or corruption?
>> After the recovery finished (ceph -w showed that all PGs are in
>> active+clean state) I noticed that a few rbd images were corrupted.
>
> The osds tolerate the full journal. There will be a big latency spike,
> but they'll recover without risking data. You should definitely
> increase the journal size if this happens regularly, though.

Thank you for the clarification and advice.

Karol
Journal too small
Hi,

During an ongoing recovery in one of my clusters a couple of OSDs
complained about a too small journal. For instance:

2012-05-12 13:31:04.034144 7f491061d700 1 journal check_for_full at 863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000 start 863363072)
2012-05-12 13:31:04.034680 7f491061d700 0 journal JOURNAL TOO SMALL: item 1693745152 > journal 1048571904 (usable)

I was under the impression that the OSDs stopped participating in recovery
after this event. (ceph -w showed that the number of PGs in state
active+clean no longer increased.) They resumed recovery after I enlarged
their journals (stop osd, --flush-journal, --mkjournal, start osd).

How serious is such a situation? Do the OSDs know how to handle it
correctly? Or could this result in some data loss or corruption? After the
recovery finished (ceph -w showed that all PGs are in active+clean state)
I noticed that a few rbd images were corrupted.

The cluster runs v0.46. The OSDs use ext4. I'm pretty sure that during the
recovery no clients were accessing the cluster.

Best regards,
Karol
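Reading the two log lines above: "JOURNAL FULL" fires when an item does
not fit in the space currently free, while "JOURNAL TOO SMALL" means the
item (1693745152 bytes here) exceeds the journal's total usable size
(1048571904 bytes), so no amount of flushing will ever make it fit. A
minimal sketch of that distinction, with an invented `journal_status`
helper rather than the actual FileJournal logic:

```python
# Toy model of the two journal conditions in the log above
# (illustration only, not ceph's FileJournal code).

def journal_status(item_size: int, free_bytes: int, usable_size: int) -> str:
    if item_size > usable_size:
        # The item can never fit: the journal must be recreated larger
        # (stop osd, --flush-journal, --mkjournal with a bigger size).
        return "JOURNAL TOO SMALL"
    if item_size > free_bytes:
        # Transient: wait for the filestore to catch up and free space.
        return "JOURNAL FULL"
    return "OK"

# Numbers from the log: a ~1.7 GB item vs a ~1 GB usable journal.
assert journal_status(1693745152, 0, 1048571904) == "JOURNAL TOO SMALL"
```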