Re: Journal too small

2012-05-18 Thread Karol Jurak
On Thursday 17 of May 2012 20:59:52 Josh Durgin wrote:
 On 05/17/2012 03:59 AM, Karol Jurak wrote:
  How serious is such a situation? Do the OSDs know how to handle it
  correctly? Or could this result in some data loss or corruption?
  After the recovery finished (ceph -w showed that all PGs are in
  active+clean state) I noticed that a few rbd images were corrupted.
 
 As Sage mentioned, the OSDs know how to handle full journals correctly.
 
 I'd like to figure out how your rbd images got corrupted, if possible.
 
 How did you notice the corruption?
 
 Has your cluster always run 0.46, or did you upgrade from earlier
 versions?
 
 What happened to the cluster between your last check for corruption and
 now? Did your use of it or any ceph client or server configuration
 change?

My question about the journal is actually connected to a larger issue I'm 
currently trying to investigate.

The cluster initially ran v0.45, but I upgraded it to v0.46 because of the 
issue I described in this bug report (the upgrade didn't resolve it):

http://tracker.newdream.net/issues/2446

The cluster consisted of 26 OSDs and used a crushmap whose structure was 
identical to that of the default crushmap constructed during cluster 
creation. It had the unknownrack bucket, which contained 26 hosts, and 
every host contained one OSD.

Problems started when one of my colleagues created a new crushmap and 
installed it into the cluster. It introduced a couple of new racks, changed 
the placement rule to 'step chooseleaf firstn 0 type rack' and changed the 
weights of most of the OSDs to 0 (they were meant to be removed from the 
cluster). I don't have an exact copy of that crushmap but my colleague 
reconstructed it from memory as best he could. It's attached as 
new-crushmap.txt.
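
For reference, a change like that is normally made by decompiling the 
installed crushmap, editing the text form and injecting the recompiled 
result; a rough sketch of that round trip (the file names are placeholders, 
not from the thread):

# grab and decompile the current crushmap
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: add the rack buckets, adjust the weights, and change
# the rule to 'step chooseleaf firstn 0 type rack'
crushtool -c crushmap.txt -o crushmap.new
# inject the recompiled map into the cluster
ceph osd setcrushmap -i crushmap.new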

The OSDs reacted to the new crushmap by allocating large amounts of 
memory. Most of them had only 1 or 2 GB of RAM. That proved not to be 
enough and the Xen VMs hosting the OSDs crashed. It turned out later that 
most of the OSDs required as much as 6 to 10 GB of memory to complete the 
peering phase (ceph -w showed a large number of PGs in that state while 
the OSDs were allocating memory).

One factor which I think might have played a significant role in this 
situation was the large number of PGs - 2. Our idea was to 
incrementally build a cluster consisting of approximately 200 OSDs, 
hence the 2 PGs.

I see some items in your issue tracker that look like they may be 
addressing this large memory consumption issue:

http://tracker.newdream.net/issues/2321
http://tracker.newdream.net/issues/2041

I reverted to the default crushmap, changed the replication level to 1 and 
marked all OSDs but 2 out. That allowed me to finally recover the cluster 
and bring it online, but in the process all the OSDs crashed numerous 
times. They were either killed by the OOM Killer, or I destroyed the whole 
VMs because they were unresponsive, or the OSDs crashed due to failed 
asserts such as:


2012-05-10 13:07:39.869811 7f878645a700 -1 common/HeartbeatMap.cc: In 
function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_
handle_d*, const char*, time_t)' thread 7f878645a700 time 2012-05-10 
13:07:38.816680
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, 
long)+0x270) [0x7a32e0]
 2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x7a34f7]
 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x7a3748]
 4: (CephContextServiceThread::entry()+0x5c) [0x64c27c]
 5: (()+0x68ba) [0x7f87888be8ba]
 6: (clone()+0x6d) [0x7f8786f4302d]


or


2012-05-10 16:33:30.437730 7f062e9c1700 -1 osd/PG.cc: In function 'void 
PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_
log_t&, int)' thread 7f062e9c1700 time 2012-05-10 16:33:30.369211
osd/PG.cc: 369: FAILED assert(log.head >= olog.tail && olog.head >= 
log.tail)

 ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
 1: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, 
int)+0x1f14) [0x77d894]
 2: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec 
const&)+0x2c5) [0x77dba5]
 3: (boost::statechart::simple_state<PG::RecoveryState::Stray, 
PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na>, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x213) [0x794d93]
 4: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, 
PG::RecoveryState::Initial, std::allocator<void>, 
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base 
const&)+0x6b) [0x78c3cb]
 5: (PG::RecoveryState::handle_log(int, MOSDPGLog*, 
PG::RecoveryCtx*)+0x1a6) [0x745b76]
 6: 

Re: Journal too small

2012-05-18 Thread Karol Jurak
On Thursday 17 of May 2012 18:01:55 Sage Weil wrote:
 On Thu, 17 May 2012, Karol Jurak wrote:
  Hi,
  
  During an ongoing recovery in one of my clusters a couple of OSDs
  complained about too small journal. For instance:
  
  2012-05-12 13:31:04.034144 7f491061d700  1 journal check_for_full at
  863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000
  start 863363072)
  2012-05-12 13:31:04.034680 7f491061d700  0 journal JOURNAL TOO SMALL:
  item 1693745152 > journal 1048571904 (usable)
  
  I was under the impression that the OSDs stopped participating in
  recovery after this event. (ceph -w showed that the number of PGs in
  state active+clean no longer increased.) They resumed recovery after
  I enlarged their journals (stop osd, --flush-journal, --mkjournal,
  start osd).
  
  How serious is such a situation? Do the OSDs know how to handle it
  correctly? Or could this result in some data loss or corruption?
  After the recovery finished (ceph -w showed that all PGs are in
  active+clean state) I noticed that a few rbd images were corrupted.
 
 The osds tolerate the full journal.  There will be a big latency spike,
 but they'll recover without risking data.  You should definitely
 increase the journal size if this happens regularly, though.

Thank you for the clarification and advice.

Karol


Re: Journal too small

2012-05-18 Thread Josh Durgin

On 05/18/2012 03:56 AM, Karol Jurak wrote:

On Thursday 17 of May 2012 20:59:52 Josh Durgin wrote:

On 05/17/2012 03:59 AM, Karol Jurak wrote:

How serious is such a situation? Do the OSDs know how to handle it
correctly? Or could this result in some data loss or corruption?
After the recovery finished (ceph -w showed that all PGs are in
active+clean state) I noticed that a few rbd images were corrupted.


As Sage mentioned, the OSDs know how to handle full journals correctly.

I'd like to figure out how your rbd images got corrupted, if possible.

How did you notice the corruption?

Has your cluster always run 0.46, or did you upgrade from earlier
versions?

What happened to the cluster between your last check for corruption and
now? Did your use of it or any ceph client or server configuration
change?


My question about the journal is actually connected to a larger issue I'm
currently trying to investigate.

The cluster initially ran v0.45, but I upgraded it to v0.46 because of the
issue I described in this bug report (the upgrade didn't resolve it):

http://tracker.newdream.net/issues/2446


Could you attach an archive of all the osdmaps to that bug?
You can extract them with something like:

for epoch in $(seq 1 2000)
do
  ceph osd getmap $epoch -o osdmap_$epoch
done


The cluster consisted of 26 OSDs and used a crushmap whose structure was
identical to that of the default crushmap constructed during cluster
creation. It had the unknownrack bucket, which contained 26 hosts, and
every host contained one OSD.

Problems started when one of my colleagues created a new crushmap and
installed it into the cluster. It introduced a couple of new racks, changed
the placement rule to 'step chooseleaf firstn 0 type rack' and changed the
weights of most of the OSDs to 0 (they were meant to be removed from the
cluster). I don't have an exact copy of that crushmap but my colleague
reconstructed it from memory as best he could. It's attached as
new-crushmap.txt.

The OSDs reacted to the new crushmap by allocating large amounts of
memory. Most of them had only 1 or 2 GB of RAM. That proved not to be
enough and the Xen VMs hosting the OSDs crashed. It turned out later that
most of the OSDs required as much as 6 to 10 GB of memory to complete the
peering phase (ceph -w showed a large number of PGs in that state while
the OSDs were allocating memory).

One factor which I think might have played a significant role in this
situation was the large number of PGs - 2. Our idea was to
incrementally build a cluster consisting of approximately 200 OSDs,
hence the 2 PGs.


Large numbers of PGs per OSD are problematic because memory usage is
linear in the number of PGs, and it increases further during peering and
recovery. We recommend keeping the number of PGs per OSD on the order of
100s.
In the future, it'll be possible to split PGs to increase their number
when your cluster grows, or merge them when it shrinks. For now you
should probably wait to create a pool with a large number of PGs until
you have enough OSDs up and in to handle them.
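
As a rough illustration of that rule of thumb (the replica count and the
per-OSD target below are assumptions, not figures from the thread):

# back-of-the-envelope pg_num for a planned pool
osds=200       # planned cluster size mentioned in the thread
replicas=2     # assumed replication level
per_osd=100    # "on the order of 100s" of PGs per OSD
pgs=$(( osds * per_osd / replicas ))
# round up to the next power of two
p=1; while [ "$p" -lt "$pgs" ]; do p=$(( p * 2 )); done
echo "roughly $p PGs for this pool"   # prints 16384 for these inputs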

PG splitting is http://tracker.newdream.net/issues/1515

Your crushmap with many devices with weight 0 might also have
contributed to the problem due to an issue with local retries.
See:

http://tracker.newdream.net/issues/2047
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/6244

A workaround in the meantime is to remove devices in deep hierarchies
from the crush map.
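
A minimal sketch of that workaround, assuming your version has the
'ceph osd crush remove' command (otherwise the same effect can be had by
deleting the device from the decompiled crushmap and re-injecting it):

# drop a zero-weight device from the crush hierarchy entirely
# (osd.12 is a placeholder id)
ceph osd crush remove osd.12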


I see some items in your issue tracker that look like they may be
addressing this large memory consumption issue:

http://tracker.newdream.net/issues/2321
http://tracker.newdream.net/issues/2041


Those and the recent improvements in OSD map processing will help.


I reverted to the default crushmap, changed the replication level to 1 and
marked all OSDs but 2 out. That allowed me to finally recover the cluster
and bring it online, but in the process all the OSDs crashed numerous
times. They were either killed by the OOM Killer, or I destroyed the whole
VMs because they were unresponsive, or the OSDs crashed due to failed
asserts such as:


2012-05-10 13:07:39.869811 7f878645a700 -1 common/HeartbeatMap.cc: In
function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_
handle_d*, const char*, time_t)' thread 7f878645a700 time 2012-05-10
13:07:38.816680
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

  ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
long)+0x270) [0x7a32e0]
  2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x7a34f7]
  3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x7a3748]
  4: (CephContextServiceThread::entry()+0x5c) [0x64c27c]
  5: (()+0x68ba) [0x7f87888be8ba]
  6: (clone()+0x6d) [0x7f8786f4302d]



This is the unresponsiveness again: the internal heartbeat hit its suicide
timeout because the OSD stopped making progress.



2012-05-10 16:33:30.437730 7f062e9c1700 -1 osd/PG.cc: In function 'void
PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_
log_t&, int)' thread 

Journal too small

2012-05-17 Thread Karol Jurak
Hi,

During an ongoing recovery in one of my clusters a couple of OSDs 
complained about too small journal. For instance:

2012-05-12 13:31:04.034144 7f491061d700  1 journal check_for_full at 
863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000 
start 863363072)
2012-05-12 13:31:04.034680 7f491061d700  0 journal JOURNAL TOO SMALL: item 
1693745152 > journal 1048571904 (usable)

I was under the impression that the OSDs stopped participating in recovery 
after this event. (ceph -w showed that the number of PGs in state 
active+clean no longer increased.) They resumed recovery after I enlarged 
their journals (stop osd, --flush-journal, --mkjournal, start osd).
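
For reference, that sequence for a single OSD might look roughly like this 
(the OSD id and the init-script invocation are placeholders, not taken from 
the thread):

# stop the OSD, flush and rebuild its journal with a larger size, restart it
/etc/init.d/ceph stop osd.3
ceph-osd -i 3 --flush-journal   # write everything still in the journal to the store
# raise 'osd journal size' (in MB) in ceph.conf, or point the OSD at a
# bigger journal file or device
ceph-osd -i 3 --mkjournal       # create the new, larger journal
/etc/init.d/ceph start osd.3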

How serious is such a situation? Do the OSDs know how to handle it 
correctly? Or could this result in some data loss or corruption? After the 
recovery finished (ceph -w showed that all PGs are in active+clean state) 
I noticed that a few rbd images were corrupted.

The cluster runs v0.46. The OSDs use ext4. I'm pretty sure that during the 
recovery no clients were accessing the cluster.

Best regards,
Karol


Re: Journal too small

2012-05-17 Thread Sage Weil
On Thu, 17 May 2012, Karol Jurak wrote:
 Hi,
 
 During an ongoing recovery in one of my clusters a couple of OSDs 
 complained about too small journal. For instance:
 
 2012-05-12 13:31:04.034144 7f491061d700  1 journal check_for_full at 
 863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000 
 start 863363072)
 2012-05-12 13:31:04.034680 7f491061d700  0 journal JOURNAL TOO SMALL: item 
 1693745152 > journal 1048571904 (usable)
 
 I was under the impression that the OSDs stopped participating in recovery 
 after this event. (ceph -w showed that the number of PGs in state 
 active+clean no longer increased.) They resumed recovery after I enlarged 
 their journals (stop osd, --flush-journal, --mkjournal, start osd).
 
 How serious is such a situation? Do the OSDs know how to handle it 
 correctly? Or could this result in some data loss or corruption? After the 
 recovery finished (ceph -w showed that all PGs are in active+clean state) 
 I noticed that a few rbd images were corrupted.

The osds tolerate the full journal.  There will be a big latency spike, 
but they'll recover without risking data.  You should definitely increase 
the journal size if this happens regularly, though.
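
For reference, the journal size is controlled by the 'osd journal size' 
option (in MB) in ceph.conf; a minimal sketch, with the value chosen only 
as an example (the log quoted above shows a roughly 1 GB journal):

[osd]
        osd journal size = 2048    ; in MB, i.e. a 2 GB journal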

sage

 
 The cluster runs v0.46. The OSDs use ext4. I'm pretty sure that during the 
 recovery no clients were accessing the cluster.
 
 Best regards,
 Karol


Re: Journal too small

2012-05-17 Thread Tommi Virtanen
On Thu, May 17, 2012 at 9:01 AM, Sage Weil s...@inktank.com wrote:
 2012-05-12 13:31:04.034144 7f491061d700  1 journal check_for_full at
 863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000
 start 863363072)
 2012-05-12 13:31:04.034680 7f491061d700  0 journal JOURNAL TOO SMALL: item
 1693745152 > journal 1048571904 (usable)

 The osds tolerate the full journal.  There will be a big latency spike,
 but they'll recover without risking data.  You should definitely increase
 the journal size if this happens regularly, though.

I propose for your merging pleasure:
https://github.com/ceph/ceph/commits/journal-too-small
https://github.com/ceph/ceph/commit/62db60bede8b187e25acb715f6616d2ce7cfc97f


Re: Journal too small

2012-05-17 Thread Sage Weil
On Thu, 17 May 2012, Tommi Virtanen wrote:
 On Thu, May 17, 2012 at 9:01 AM, Sage Weil s...@inktank.com wrote:
  2012-05-12 13:31:04.034144 7f491061d700  1 journal check_for_full at
  863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000
  start 863363072)
  2012-05-12 13:31:04.034680 7f491061d700  0 journal JOURNAL TOO SMALL: item
  1693745152 > journal 1048571904 (usable)
 
  The osds tolerate the full journal.  There will be a big latency spike,
  but they'll recover without risking data.  You should definitely increase
  the journal size if this happens regularly, though.
 
 I propose for your merging pleasure:
 https://github.com/ceph/ceph/commits/journal-too-small
 https://github.com/ceph/ceph/commit/62db60bede8b187e25acb715f6616d2ce7cfc97f

Perfect, merged.

Re: Journal too small

2012-05-17 Thread Josh Durgin

On 05/17/2012 03:59 AM, Karol Jurak wrote:

How serious is such a situation? Do the OSDs know how to handle it
correctly? Or could this result in some data loss or corruption? After the
recovery finished (ceph -w showed that all PGs are in active+clean state)
I noticed that a few rbd images were corrupted.


As Sage mentioned, the OSDs know how to handle full journals correctly.

I'd like to figure out how your rbd images got corrupted, if possible.

How did you notice the corruption?

Has your cluster always run 0.46, or did you upgrade from earlier
versions?

What happened to the cluster between your last check for corruption and
now? Did your use of it or any ceph client or server configuration
change?