Re: Modifying RBD image changes its snapshot

2013-06-27 Thread Karol Jurak
On Wednesday 26 of June 2013 14:42:27 Josh Durgin wrote:
 On 06/26/2013 11:15 AM, Josh Durgin wrote:
  On 06/26/2013 05:40 AM, Karol Jurak wrote:
  Hi,
  
  I'm using ceph 0.56.6 and kernel 3.9.7 and it looks like modifying an RBD 
  image also changes its snapshot. I can reproduce this as follows:
  
  Just reproduced on 3.10-rc7 as well. It seems the snapshot context
  loading is broken for format 1 at least, since unmapping and mapping
  after the snapshot exists still has the same problem. I added
  http://tracker.ceph.com/issues/5464 to track this.
  
  Apparently the regression test for this hasn't been running, or we
  would've caught this sooner.
 
 There's a fix for this on the stable kernel in the wip-snapc-3.9.y
 branch of ceph-client.git.

Thanks for the fix. Snapshots work as expected now.
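
In case it is useful to others, fetching and building a kernel from that 
branch could look roughly like the following. The repository URL and the 
build steps are assumptions (the usual custom-kernel routine), not something 
spelled out in the thread:

git clone https://github.com/ceph/ceph-client.git
cd ceph-client
git checkout wip-snapc-3.9.y
cp /boot/config-$(uname -r) .config   # start from the running kernel's config, if the distro ships it
make olddefconfig                     # accept defaults for any new options
make -j4 && make modules_install && make install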

-- 
Karol Jurak


Modifying RBD image changes its snapshot

2013-06-26 Thread Karol Jurak
Hi,

I'm using ceph 0.56.6 and kernel 3.9.7 and it looks like modifying an RBD image 
also changes its snapshot. I can reproduce this as follows:

# create and map an image
rbd create --size 128 test-1
rbd map test-1

# write some data to the image
dd if=/dev/zero of=/dev/rbd/rbd/test-1 bs=1M count=128

# create and map a snapshot
rbd snap create test-1@snap-1
rbd map test-1@snap-1

# verify that image and snapshot are identical
md5sum /dev/rbd/rbd/test-1 /dev/rbd/rbd/test-1@snap-1 
fde9e0818281836e4fc0edfede2b8762  /dev/rbd/rbd/test-1
fde9e0818281836e4fc0edfede2b8762  /dev/rbd/rbd/test-1@snap-1

# modify the image
dd if=/dev/urandom of=/dev/rbd/rbd/test-1 bs=512 count=1

# compare checksums again
md5sum /dev/rbd/rbd/test-1 /dev/rbd/rbd/test-1@snap-1 
1d942c8a5bc7480cecb945ea0d020eed  /dev/rbd/rbd/test-1
1d942c8a5bc7480cecb945ea0d020eed  /dev/rbd/rbd/test-1@snap-1

Checksums are identical although the snapshot isn't supposed to be modified.
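
As an additional cross-check (not part of the report above; the output file 
name is arbitrary), exporting the snapshot through userspace and comparing 
it with the kernel client's view should show whether the snapshot data 
itself is intact on the OSDs:

# export the snapshot via librbd and compare with the kernel device
rbd export test-1@snap-1 /tmp/snap-1.img
md5sum /tmp/snap-1.img /dev/rbd/rbd/test-1@snap-1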

-- 
Karol Jurak


Re: Writing to RBD image while its snapshot is being created causes I/O errors

2013-06-18 Thread Karol Jurak
On Monday 17 of June 2013 21:25:49 Sage Weil wrote:
 What kernel version are you using?  I'm not able to reproduce this with
 ext4 or reiserfs and many snapshots over several minutes of write
 workload.

I'm using vanilla 3.4.48. However, this bug must already have been fixed in 
3.6.11 because everything worked fine when I tested this version today.

-- 
Karol Jurak


Re: Writing to RBD image while its snapshot is being created causes I/O errors

2013-06-17 Thread Karol Jurak
On Friday 14 of June 2013 08:56:55 Sage Weil wrote:
 On Fri, 14 Jun 2013, Karol Jurak wrote:
  I noticed that writing to an RBD image using the kernel driver while its
  snapshot is being created causes I/O errors and the filesystem
  (reiserfs) eventually aborts and remounts itself in read-only mode:
 
 This is definitely a bug; you should be able to create a snapshot at any
 time.  After a rollback, it should look (to the fs) like a crash or power
 cycle.
 
 How easy is this to reproduce?  Does it happen every time?

I can reproduce it in the following way:

# rbd create -s 10240 test
# rbd map test
# mkfs -t reiserfs /dev/rbd/rbd/test
# mount /dev/rbd/rbd/test /mnt/test
# dd if=/dev/zero of=/mnt/test/a bs=1M count=1024

and in another shell while dd is running:

# rbd snap create test@snap1

After 2 or 3 seconds dmesg shows I/O errors:

[429532.259910] end_request: I/O error, dev rbd1, sector 1384448
[429532.272554] end_request: I/O error, dev rbd1, sector 872
[429532.275556] REISERFS abort (device rbd1): Journal write error in 
flush_commit_list

and dd fails:

dd: writing `/mnt/test/a': Cannot allocate memory
590+0 records in
589+0 records out
618225664 bytes (618 MB) copied, 15.8701 s, 39.0 MB/s

This happens every time I repeat the test.
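
For convenience, the two shells can be combined into a single sequence 
(sizes and timing as in the steps above; mkfs may ask for confirmation):

# rbd create -s 10240 test
# rbd map test
# mkfs -t reiserfs /dev/rbd/rbd/test
# mount /dev/rbd/rbd/test /mnt/test
# dd if=/dev/zero of=/mnt/test/a bs=1M count=1024 &
# sleep 2; rbd snap create test@snap1
# wait; dmesg | tail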

-- 
Karol Jurak



Writing to RBD image while its snapshot is being created causes I/O errors

2013-06-14 Thread Karol Jurak
Hi,

I noticed that writing to an RBD image using the kernel driver while its 
snapshot is being created causes I/O errors and the filesystem (reiserfs) 
eventually aborts and remounts itself in read-only mode:

[192507.327359] end_request: I/O error, dev rbd7, sector 818528
[192507.331510] end_request: I/O error, dev rbd7, sector 819200
[192507.348478] end_request: I/O error, dev rbd7, sector 408
[192507.352647] REISERFS abort (device rbd7): Journal write error in 
flush_commit_list

Is this happening by design or is it a bug? I know that I should freeze the 
filesystem before attempting to create a snapshot, but these I/O errors seem 
unnecessary to me. If it's really impossible to write to an image while its 
snapshot is being created, couldn't write requests be blocked until 
snapshotting completes?
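
For completeness, the freeze-based workaround mentioned above might look 
roughly like this, assuming the filesystem and kernel support fsfreeze(8); 
the image and mount point names are only examples:

fsfreeze -f /mnt/test          # quiesce the filesystem and flush dirty data
rbd snap create test@snap1     # take the snapshot while no writes are in flight
fsfreeze -u /mnt/test          # resume writes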

I'm using ceph 0.56.6 and kernel 3.4.48.

-- 
Karol Jurak


Re: Journal too small

2012-05-18 Thread Karol Jurak
On Thursday 17 of May 2012 20:59:52 Josh Durgin wrote:
 On 05/17/2012 03:59 AM, Karol Jurak wrote:
  How serious is such a situation? Do the OSDs know how to handle it
  correctly? Or could this result in some data loss or corruption?
  After the recovery finished (ceph -w showed that all PGs are in
  active+clean state) I noticed that a few rbd images were corrupted.
 
 As Sage mentioned, the OSDs know how to handle full journals correctly.
 
 I'd like to figure out how your rbd images got corrupted, if possible.
 
 How did you notice the corruption?
 
 Has your cluster always run 0.46, or did you upgrade from earlier
 versions?
 
 What happened to the cluster between your last check for corruption and
 now? Did your use of it or any ceph client or server configuration
 change?

My question about the journal is actually connected to a larger issue I'm 
currently trying to investigate.

The cluster initially ran v0.45, but I upgraded it to v0.46 because of the 
issue I described in this bug report (the upgrade didn't resolve it):

http://tracker.newdream.net/issues/2446

The cluster consisted of 26 OSDs and used a crushmap whose structure was 
identical to the default crushmap constructed during cluster creation. It 
had the 'unknownrack' bucket, which contained 26 hosts, and every host 
contained one OSD.

Problems started when one of my colleagues created and installed into the 
cluster a new crush map which introduced a couple of new racks, changed 
the placement rule to 'step chooseleaf firstn 0 type rack' and changed the 
weights of most of the OSDs to 0 (they were meant to be removed from the 
cluster). I don't have an exact copy of that crushmap, but my colleague 
reconstructed it from memory as best he could. It's attached as 
new-crushmap.txt.
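
For illustration only (the real map is the attached new-crushmap.txt, and 
the rule and bucket names here are guesses), a rule using that placement 
step would look something like this in decompiled crushmap form:

rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}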

The OSDs reacted to the new crushmap by allocating large amounts of 
memory. Most of them had only 1 or 2 GB of RAM. That proved not to be 
enough, and the Xen VMs hosting the OSDs crashed. It turned out later that 
most of the OSDs required as much as 6 to 10 GB of memory to complete the 
peering phase (ceph -w showed a large number of PGs in that state while 
the OSDs were allocating memory).

One factor which I think might have played a significant role in this 
situation was the large number of PGs. Our idea was to incrementally build 
a cluster of approximately 200 OSDs, hence the high PG count.

I see some items in your issue tracker that look like they may be 
addressing this large memory consumption issue:

http://tracker.newdream.net/issues/2321
http://tracker.newdream.net/issues/2041

I reverted to the default crushmap, changed the replication level to 1, and 
marked all OSDs but 2 out. That allowed me to finally recover the cluster 
and bring it online, but in the process all the OSDs crashed numerous 
times. They were either killed by the OOM killer, or I destroyed the whole 
VMs because they were unresponsive, or the OSDs crashed due to failed 
asserts such as:


2012-05-10 13:07:39.869811 7f878645a700 -1 common/HeartbeatMap.cc: In 
function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, 
const char*, time_t)' thread 7f878645a700 time 2012-05-10 13:07:38.816680
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, 
long)+0x270) [0x7a32e0]
 2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x7a34f7]
 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x7a3748]
 4: (CephContextServiceThread::entry()+0x5c) [0x64c27c]
 5: (()+0x68ba) [0x7f87888be8ba]
 6: (clone()+0x6d) [0x7f8786f4302d]


or


2012-05-10 16:33:30.437730 7f062e9c1700 -1 osd/PG.cc: In function 'void 
PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)' 
thread 7f062e9c1700 time 2012-05-10 16:33:30.369211
osd/PG.cc: 369: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

 ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
 1: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, 
int)+0x1f14) [0x77d894]
 2: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec 
const&)+0x2c5) [0x77dba5]
 3: (boost::statechart::simple_state<PG::RecoveryState::Stray, 
PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na>, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x213) [0x794d93]
 4: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, 
PG::RecoveryState::Initial, std::allocator<void>, 
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base 
const&)+0x6b) [0x78c3cb]
 5: (PG::RecoveryState::handle_log(int, MOSDPGLog*, 
PG::RecoveryCtx*)+0x1a6) [0x745b76]
 6: (OSD::handle_pg_log(std

Re: Journal too small

2012-05-18 Thread Karol Jurak
On Thursday 17 of May 2012 18:01:55 Sage Weil wrote:
 On Thu, 17 May 2012, Karol Jurak wrote:
  Hi,
  
  During an ongoing recovery in one of my clusters a couple of OSDs
  complained about a too-small journal. For instance:
  
  2012-05-12 13:31:04.034144 7f491061d700  1 journal check_for_full at
  863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000
  start 863363072)
  2012-05-12 13:31:04.034680 7f491061d700  0 journal JOURNAL TOO SMALL:
  item 1693745152 > journal 1048571904 (usable)
  
  I was under the impression that the OSDs stopped participating in
  recovery after this event. (ceph -w showed that the number of PGs in
  state active+clean no longer increased.) They resumed recovery after
  I enlarged their journals (stop osd, --flush-journal, --mkjournal,
  start osd).
  
  How serious is such a situation? Do the OSDs know how to handle it
  correctly? Or could this result in some data loss or corruption?
  After the recovery finished (ceph -w showed that all PGs are in
  active+clean state) I noticed that a few rbd images were corrupted.
 
 The osds tolerate the full journal.  There will be a big latency spike,
 but they'll recover without risking data.  You should definitely
 increase the journal size if this happens regularly, though.

Thank you for the clarification and advice.
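
For reference, the enlarge-and-recreate procedure described above might look 
roughly like this, assuming a file-based journal whose size is set in 
ceph.conf (osd.0 and the size value are only examples; 'osd journal size' 
is in MB):

# in ceph.conf, under [osd]:
#   osd journal size = 2048
/etc/init.d/ceph stop osd.0
ceph-osd -i 0 --flush-journal
ceph-osd -i 0 --mkjournal
/etc/init.d/ceph start osd.0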

Karol


Journal too small

2012-05-17 Thread Karol Jurak
Hi,

During an ongoing recovery in one of my clusters a couple of OSDs 
complained about a too-small journal. For instance:

2012-05-12 13:31:04.034144 7f491061d700  1 journal check_for_full at 
863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000 
start 863363072)
2012-05-12 13:31:04.034680 7f491061d700  0 journal JOURNAL TOO SMALL: item 
1693745152 > journal 1048571904 (usable)

I was under the impression that the OSDs stopped participating in recovery 
after this event. (ceph -w showed that the number of PGs in state 
active+clean no longer increased.) They resumed recovery after I enlarged 
their journals (stop osd, --flush-journal, --mkjournal, start osd).

How serious is such a situation? Do the OSDs know how to handle it 
correctly? Or could this result in some data loss or corruption? After the 
recovery finished (ceph -w showed that all PGs are in active+clean state) 
I noticed that a few rbd images were corrupted.

The cluster runs v0.46. The OSDs use ext4. I'm pretty sure that during the 
recovery no clients were accessing the cluster.

Best regards,
Karol