slow requests, hunting for new mon

2013-02-11 Thread Chris Dunlop
Hi,

What are likely causes for "slow requests" and "monclient: hunting for new
mon" messages? E.g.:

2013-02-12 16:27:07.318943 7f9c0bc16700  0 monclient: hunting for new mon
...
2013-02-12 16:27:45.892314 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 
included below; oldest blocked for > 30.383883 secs
2013-02-12 16:27:45.892323 7f9c13c26700  0 log [WRN] : slow request 30.383883 
seconds old, received at 2013-02-12 16:27:15.508374: 
osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.0120 [write 921600~4096] 
2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892328 7f9c13c26700  0 log [WRN] : slow request 30.383782 
seconds old, received at 2013-02-12 16:27:15.508475: 
osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.0120 [write 987136~4096] 
2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892334 7f9c13c26700  0 log [WRN] : slow request 30.383720 
seconds old, received at 2013-02-12 16:27:15.508537: 
osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.0120 [write 
1036288~8192] 2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892338 7f9c13c26700  0 log [WRN] : slow request 30.383684 
seconds old, received at 2013-02-12 16:27:15.508573: 
osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.0122 [write 
1454080~4096] 2.fff29a9a) v4 currently no flag points reached
2013-02-12 16:27:45.892341 7f9c13c26700  0 log [WRN] : slow request 30.328986 
seconds old, received at 2013-02-12 16:27:15.563271: 
osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.0122 [write 
1482752~4096] 2.fff29a9a) v4 currently no flag points reached

I have a ceph 0.56.2 system with 3 boxes: two boxes each have a mon and a
single osd, and the 3rd box has just a mon - see ceph.conf below. The boxes
are running an eclectic mix of self-compiled kernels: b2 is linux-3.4.6, b4
is linux-3.7.3 and b5 is linux-3.6.10.

On b5 / osd.1 the 'hunting' message appears in the osd log regularly, e.g.
190 times yesterday. The message doesn't appear at all on b4 / osd.0.

Both osd logs show the 'slow requests' messages. Generally these come in
waves, with 30-50 of the associated individual 'slow request' messages
coming in just after the initial 'slow requests' message. Each box saw
around 30 waves yesterday, with no obvious time correlation between the two.

The osd disks are generally cruising along at around 400-800 KB/s, with
occasional spikes up to 1.2-2 MB/s, under a mostly-write load.

The gigabit network interfaces (2 per box for public vs cluster) are
also cruising, with the busiest interface at less than 5% utilisation.

CPU utilisation is likewise small, with 90% or more idle and less than 3%
iowait. There's plenty of free memory: 19 GB on b4 and 6 GB on b5.

Cheers,

Chris


ceph.conf

[global]
auth supported = cephx
[mon]
[mon.b2]
host = b2
mon addr = 10.200.63.130:6789
[mon.b4]
host = b4
mon addr = 10.200.63.132:6789
[mon.b5]
host = b5
mon addr = 10.200.63.133:6789
[osd]
osd journal size = 1000
public network = 10.200.63.0/24
cluster network = 192.168.254.0/24
[osd.0]
host = b4
public addr = 10.200.63.132
cluster addr = 192.168.254.132
[osd.1]
host = b5
public addr = 10.200.63.133
cluster addr = 192.168.254.133



Re: slow requests, hunting for new mon

2013-02-12 Thread Chris Dunlop
On Tue, Feb 12, 2013 at 06:28:15PM +1100, Chris Dunlop wrote:
> Hi,
> 
> What are likely causes for "slow requests" and "monclient: hunting for new
> mon" messages? E.g.:
> 
> 2013-02-12 16:27:07.318943 7f9c0bc16700  0 monclient: hunting for new mon
> ...
> 2013-02-12 16:27:45.892314 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 
> included below; oldest blocked for > 30.383883 secs
> 2013-02-12 16:27:45.892323 7f9c13c26700  0 log [WRN] : slow request 30.383883 
> seconds old, received at 2013-02-12 16:27:15.508374: 
> osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.0120 [write 
> 921600~4096] 2.981cf6bc) v4 currently no flag points reached
> 2013-02-12 16:27:45.892328 7f9c13c26700  0 log [WRN] : slow request 30.383782 
> seconds old, received at 2013-02-12 16:27:15.508475: 
> osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.0120 [write 
> 987136~4096] 2.981cf6bc) v4 currently no flag points reached
> 2013-02-12 16:27:45.892334 7f9c13c26700  0 log [WRN] : slow request 30.383720 
> seconds old, received at 2013-02-12 16:27:15.508537: 
> osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.0120 [write 
> 1036288~8192] 2.981cf6bc) v4 currently no flag points reached
> 2013-02-12 16:27:45.892338 7f9c13c26700  0 log [WRN] : slow request 30.383684 
> seconds old, received at 2013-02-12 16:27:15.508573: 
> osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.0122 [write 
> 1454080~4096] 2.fff29a9a) v4 currently no flag points reached
> 2013-02-12 16:27:45.892341 7f9c13c26700  0 log [WRN] : slow request 30.328986 
> seconds old, received at 2013-02-12 16:27:15.563271: 
> osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.0122 [write 
> 1482752~4096] 2.fff29a9a) v4 currently no flag points reached

Sorry, I forgot to add what led me to look into this in the first place...

The b2 machine is running a number of libvirt / kvm virtuals on rbd via
librados (the libvirt <disk> XML didn't survive the list archive, so it's
omitted here).

Around the time the "slow requests" messages pop up, the virtual machines
experience a spike in disk latency, e.g. from inside one of the virtuals:

                     load %user %nice %sys %iow %stl %idle  dev rrqm/s wrqm/s  r/s  w/s rkB/s  wkB/s arq-sz aqu-sz    await rwait   wwait %util
2013-02-12-18:55:23   1.7   0.3   0.0  0.9 42.2   0.0  18.5  vdb    0.0   25.8  0.0  9.6  0.00 136.53  28.44   1.34   139.56  0.00   139.6  70.7
2013-02-12-18:55:38   1.5   0.9   0.0  0.4 35.8   0.0  42.8  vdb    0.0   26.6  0.0 10.8  0.00 143.20  26.52   0.46    41.98  0.00    42.0  40.0
2013-02-12-18:55:53   1.4   0.7   0.0  0.4 38.0   0.0  51.7  vdb    0.0    7.5  0.0  6.8  0.00  53.33  15.69   0.46    68.08  0.00    68.1  38.8
2013-02-12-18:56:10   2.2   0.0   0.0  0.1  8.0   0.0  15.8  vdb    0.0    2.1  0.0  0.2  0.00   9.18  78.00   0.98  4164.00  0.00  4164.0  70.6
2013-02-12-18:56:32   3.7   0.0   0.0  0.1  0.0   0.0   0.0  vdb    0.0    0.1  0.0  0.8  0.00   3.27   8.00  24.12 14519.78  0.00 14519.8 100.0
2013-02-12-18:56:47   5.1   0.5   0.0  0.6  0.0   0.0   0.0  vdb    0.0   11.5  0.0  5.4  0.00  65.87  24.40   1.60  3620.15  0.00  3620.1  88.4
2013-02-12-18:57:03   5.2   0.6   0.0  1.1  3.5   0.0   0.0  vdb    0.0   19.6  0.0  6.9  0.00 101.25  29.19   1.13   162.41  0.00   162.4  87.4
2013-02-12-18:57:20   5.2   0.6   0.0  0.9  9.7   0.0   0.0  vdb    0.0   44.6  0.0  9.7  0.00 214.82  44.27   2.41   248.22  0.00   248.2  85.3
2013-02-12-18:57:36   4.4   0.4   0.0  0.5 28.7   0.0  46.5  vdb    0.0   17.6  0.0  5.2  0.00  87.50  33.33   0.56   107.14  0.00   107.1  48.5

...corresponding to this in the b5 / osd.1 log:

2013-02-12 18:52:08.812880 7f9c0bc16700  0 monclient: hunting for new mon
2013-02-12 18:55:18.851791 7f9c0bc16700  0 monclient: hunting for new mon
2013-02-12 18:56:25.414948 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 
included below; oldest blocked for > 30.372124 secs
2013-02-12 18:56:25.414958 7f9c13c26700  0 log [WRN] : slow request 30.372124 
seconds old, received at 2013-02-12 18:55:55.042767: 
osd_op(client.9821.0:144779 rb.0.209f.74b0dc51.0023 [write 
1593344~4096] 2.1882ddb7) v4 currently no flag points reached
...
2013-02-12 18:57:13.427008 7f9c13c26700  0 log [WRN] : slow request 40.721769 
seconds old, received at 2013-02-12 18:56:32.705190: 
osd_op(client.9821.0:146756 rb.0.209f.74b0dc51.0128 [write 819200~8192] 
2.b4390173) v4 currently commit sent
2013-02-12 18:59:43.886517 7f9c0bc16700  0 monclient: hunting for new mon
2013-02-12 19:02:53.911641 7f9c0bc16700  0 monclient: hunting for new mon


Cheers,

Chris.


Re: slow requests, hunting for new mon

2013-02-14 Thread Chris Dunlop
On 2013-02-12, Chris Dunlop  wrote:
> Hi,
>
> What are likely causes for "slow requests" and "monclient: hunting for new
> mon" messages? E.g.:
>
> 2013-02-12 16:27:07.318943 7f9c0bc16700  0 monclient: hunting for new mon
> ...
> 2013-02-12 16:27:45.892314 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 
> included below; oldest blocked for > 30.383883 secs
> 2013-02-12 16:27:45.892323 7f9c13c26700  0 log [WRN] : slow request 30.383883 
> seconds old, received at 2013-02-12 16:27:15.508374: 
> osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.0120 [write 
> 921600~4096] 2.981cf6bc) v4 currently no flag points reached
> 2013-02-12 16:27:45.892328 7f9c13c26700  0 log [WRN] : slow request 30.383782 
> seconds old, received at 2013-02-12 16:27:15.508475: 
> osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.0120 [write 
> 987136~4096] 2.981cf6bc) v4 currently no flag points reached
> 2013-02-12 16:27:45.892334 7f9c13c26700  0 log [WRN] : slow request 30.383720 
> seconds old, received at 2013-02-12 16:27:15.508537: 
> osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.0120 [write 
> 1036288~8192] 2.981cf6bc) v4 currently no flag points reached
> 2013-02-12 16:27:45.892338 7f9c13c26700  0 log [WRN] : slow request 30.383684 
> seconds old, received at 2013-02-12 16:27:15.508573: 
> osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.0122 [write 
> 1454080~4096] 2.fff29a9a) v4 currently no flag points reached
> 2013-02-12 16:27:45.892341 7f9c13c26700  0 log [WRN] : slow request 30.328986 
> seconds old, received at 2013-02-12 16:27:15.563271: 
> osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.0122 [write 
> 1482752~4096] 2.fff29a9a) v4 currently no flag points reached

OK, for the sake of anyone who might come across this thread when
searching for similar issues...

http://ceph.com/docs/master/rados/operations/troubleshooting-osd/#slow-or-unresponsive-osd

...unfortunately the error message in the link above says "old request"
rather than "slow request" (old code?), so that page doesn't come up
when googling for the "slow request" message. The page needs
updating.

The underlying problem in our case seems to have been spikes in the
number of IOPS going to the disks (e.g. watch 'iostat -x' output). Whilst
the disks were coping with steady state load, occasionally something (in
this case, activity in a vm running on rbd) would cause a spike in
activity and the disks couldn't cope. I'd initially looked at the amount
of data going to the disks and thought it was well within the disks'
capabilities, however I hadn't considered the IOPS.
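
(If anyone wants to watch for the same thing, plain iostat is enough -- it's
the IOPS columns (r/s + w/s) and %util that give it away, not the throughput;
column names vary a little between sysstat versions:)

  iostat -x 2    # extended per-device stats, refreshed every 2 seconds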

The (partial?) solution was to move the journals onto a separate device,
halving the IOPS going to the data disk (write journal, write data) as
well as avoiding having the heads slamming back and forth between the
data and journal. We're continuing to watch the IOPS and will add more
OSDs to spread the load further if necessary.
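
(For reference, the journal move itself is just a config change plus
recreating the journal. The device path below is made up and the
flush/mkjournal steps are from memory, so double-check before copying:)

  service ceph stop osd
  ceph-osd -i 0 --flush-journal     # flush the old journal to the data disk
  # in ceph.conf, under [osd.0]:
  #   osd journal = /dev/disk/by-partlabel/osd0-journal   (hypothetical partition)
  ceph-osd -i 0 --mkjournal         # initialise the journal on the new device
  service ceph start osd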

I still don't know what the "hunting" messages actually indicate, but
they've also disappeared since fixing the "slow request" messages.

Incidentally, it strikes me that there is a significant amount of write
amplification going on when running vms with a file system such as xfs
or ext4 (with journal) on top of rbd/rados (with journal) on top of xfs
(with journal). I.e. a single write from a vm can turn into up to 8
separate writes by the time it hits the underlying xfs filesystem. I
think this is why our ceph setup is struggling at far less load on the
same hardware compared to the drbd setup we're wanting to replace.
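
That is, in the worst case each journalled layer doubles the write:

  guest fs (ext4/xfs):  journal write + data write   x2
  ceph osd:             journal write + data write   x2
  backing xfs:          journal write + data write   x2
                                            2 x 2 x 2 = up to 8 writes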

Cheers!

Chris.



Mon losing touch with OSDs

2013-02-14 Thread Chris Dunlop
G'day,

In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
mons to lose touch with the osds?

I imagine a network glitch could cause it, but I can't see any issues in any
other system logs on any of the machines on the network.

Having (mostly?) resolved my previous "slow requests" issue
(http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13076) at around
13:45, there were no problems until the mon lost osd.0 at 20:26 and lost osd.1
5 seconds later:

ceph-mon.b2.log:
2013-02-14 20:11:19.892060 7fa48d4f8700  0 log [INF] : pgmap v2822096: 576 pgs: 
576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:11:21.719513 7fa48d4f8700  0 log [INF] : pgmap v2822097: 576 pgs: 
576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:26:20.656162 7fa48dcf9700 -1 mon.b2@0(leader).osd e768 no osd or 
pg stats from osd.0 since 2013-02-14 20:11:19.720812, 900.935345 seconds ago.  
marking down
2013-02-14 20:26:20.780244 7fa48d4f8700  1 mon.b2@0(leader).osd e769 e769: 2 
osds: 1 up, 2 in
2013-02-14 20:26:20.837123 7fa48d4f8700  0 log [INF] : osdmap e769: 2 osds: 1 
up, 2 in
2013-02-14 20:26:20.947523 7fa48d4f8700  0 log [INF] : pgmap v2822098: 576 pgs: 
304 active+clean, 272 stale+active+clean; 407 GB data, 835 GB used, 2889 GB / 
3724 GB avail
2013-02-14 20:26:25.709341 7fa48dcf9700 -1 mon.b2@0(leader).osd e769 no osd or 
pg stats from osd.1 since 2013-02-14 20:11:21.523741, 904.185596 seconds ago.  
marking down
2013-02-14 20:26:25.822773 7fa48d4f8700  1 mon.b2@0(leader).osd e770 e770: 2 
osds: 0 up, 2 in
2013-02-14 20:26:25.863493 7fa48d4f8700  0 log [INF] : osdmap e770: 2 osds: 0 
up, 2 in
2013-02-14 20:26:25.954799 7fa48d4f8700  0 log [INF] : pgmap v2822099: 576 pgs: 
576 stale+active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:31:30.772360 7fa48dcf9700  0 log [INF] : osd.1 out (down for 
304.933403)
2013-02-14 20:31:30.893521 7fa48d4f8700  1 mon.b2@0(leader).osd e771 e771: 2 
osds: 0 up, 1 in
2013-02-14 20:31:30.933439 7fa48d4f8700  0 log [INF] : osdmap e771: 2 osds: 0 
up, 1 in
2013-02-14 20:31:31.055408 7fa48d4f8700  0 log [INF] : pgmap v2822100: 576 pgs: 
576 stale+active+clean; 407 GB data, 417 GB used, 1444 GB / 1862 GB avail
2013-02-14 20:35:05.831221 7fa48dcf9700  0 log [INF] : osd.0 out (down for 
525.033581)
2013-02-14 20:35:05.989724 7fa48d4f8700  1 mon.b2@0(leader).osd e772 e772: 2 
osds: 0 up, 0 in
2013-02-14 20:35:06.031409 7fa48d4f8700  0 log [INF] : osdmap e772: 2 osds: 0 
up, 0 in
2013-02-14 20:35:06.129046 7fa48d4f8700  0 log [INF] : pgmap v2822101: 576 pgs: 
576 stale+active+clean; 407 GB data, 0 KB used, 0 KB / 0 KB avail

The other 2 mons both have messages like this in their logs, starting at around 
20:12:

2013-02-14 20:12:26.534977 7f2092b86700  0 -- 10.200.63.133:6789/0 >> 
10.200.63.133:6800/6466 pipe(0xade76500 sd=22 :6789 s=0 pgs=0 cs=0 l=1).accept 
replacing existing (lossy) channel (new one lossy=1)
2013-02-14 20:13:24.741092 7f2092d88700  0 -- 10.200.63.133:6789/0 >> 
10.200.63.132:6800/2456 pipe(0x9f8b7180 sd=28 :6789 s=0 pgs=0 cs=0 l=1).accept 
replacing existing (lossy) channel (new one lossy=1)
2013-02-14 20:13:56.551908 7f2090560700  0 -- 10.200.63.133:6789/0 >> 
10.200.63.133:6800/6466 pipe(0x9f8b6000 sd=41 :6789 s=0 pgs=0 cs=0 l=1).accept 
replacing existing (lossy) channel (new one lossy=1)
2013-02-14 20:14:24.752356 7f209035e700  0 -- 10.200.63.133:6789/0 >> 
10.200.63.132:6800/2456 pipe(0x9f8b6500 sd=42 :6789 s=0 pgs=0 cs=0 l=1).accept 
replacing existing (lossy) channel (new one lossy=1)

(10.200.63.132 is mon.b4/osd.0, 10.200.63.133 is mon.b5/osd.1)

...although Greg Farnum indicates these messages are "normal":

http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/5989/focus=5993

Osd.0 doesn't show any signs of distress at all:

ceph-osd.0.log:
2013-02-14 20:00:10.280601 7ffceb012700  0 log [INF] : 2.7e scrub ok
2013-02-14 20:14:19.923490 7ffceb012700  0 log [INF] : 2.5b scrub ok
2013-02-14 20:14:50.571980 7ffceb012700  0 log [INF] : 2.7b scrub ok
2013-02-14 20:17:48.475129 7ffceb012700  0 log [INF] : 2.7d scrub ok
2013-02-14 20:28:22.601594 7ffceb012700  0 log [INF] : 2.91 scrub ok
2013-02-14 20:28:32.839278 7ffceb012700  0 log [INF] : 2.92 scrub ok
2013-02-14 20:28:46.992226 7ffceb012700  0 log [INF] : 2.93 scrub ok
2013-02-14 20:29:12.330668 7ffceb012700  0 log [INF] : 2.95 scrub ok

...although osd.1 started seeing problems around this time:

ceph-osd.1.log:
2013-02-14 20:03:11.413352 7fd1d8f0a700  0 log [INF] : 2.23 scrub ok
2013-02-14 20:26:51.601425 7fd1e6f26700  0 log [WRN] : 6 slow requests, 6 
included below; oldest blocked for > 30.750063 secs
2013-02-14 20:26:51.601432 7fd1e6f26700  0 log [WRN] : slow request 30.750063 
seconds old, received at 2013-02-14 20:26:20.851304: osd_op(client.9983.0:28173 
xxx.rbd [watch 1~0] 2.10089424) v4 currently wait for new map
2013-02-14 20:26:51.601437 7fd1e6f26700  0 log [WRN] : slow request 30.749947 
secon

Re: Mon losing touch with OSDs

2013-02-15 Thread Chris Dunlop
G'day Sage,

On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
> On Fri, 15 Feb 2013, Chris Dunlop wrote:
>> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
>> mons to lose touch with the osds?
> 
> Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
> hopes that this happens again?  It will give us more information to go on.

Debug turned on.
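
(For the archives: the persistent way to do this is the standard debug option
in the [mon] section of ceph.conf, so it survives restarts, e.g.:)

  [mon]
      debug ms = 1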

>> Perhaps the mon lost osd.1 because it was too slow, but that hadn't happened
>> in any of the many previous "slow requests" instances, and the timing doesn't
>> look quite right: the mon complains it hasn't heard from osd.0 since 20:11:19,
>> but the osd.0 log shows no problems at all, then the mon complains about not
>> having heard from osd.1 since 20:11:21, whereas the first indication of
>> trouble on osd.1 was the request from 20:26:20 not being processed in a
>> timely fashion.
> 
> My guess is the above was a side-effect of osd.0 being marked out.   On 
> 0.56.2 there is some strange peering workqueue laggyness that could 
> potentially contribute as well.  I recommend moving to 0.56.3.

Upgraded to 0.56.3.

>> Trying to manually set the osds in (e.g. ceph osd in 0) didn't help, nor did
>> restarting the osds ('service ceph restart osd' on each osd host).
>> 
>> The immediate issue was resolved by restarting ceph completely on one of the
>> mon/osd hosts (service ceph restart). Possibly a restart of just the mon
>> would have been sufficient.
> 
> Did you notice that the osds you restarted didn't immediately mark 
> themselves in?  Again, it could be explained by the peering wq issue, 
> especially if there are pools in your cluster that are not getting any IO.

Sorry, no. I was kicking myself later for losing the 'ceph -s' output 
when I killed that terminal session but in the heat of the moment...

I can't see anything in the logs from the time (with no debugging) about the
osds marking themselves in, but I'm on my iPad at the moment so I could easily
have missed it. Should that info be in the logs somewhere?

There are certainly unused pools: we're only using the rbd pool, so the
default data and metadata pools are unused.

Thanks for your attention!

Cheers,

Chris


Re: Mon losing touch with OSDs

2013-02-18 Thread Chris Dunlop
On Sun, Feb 17, 2013 at 05:44:29PM -0800, Sage Weil wrote:
> On Mon, 18 Feb 2013, Chris Dunlop wrote:
>> On Sat, Feb 16, 2013 at 09:05:21AM +1100, Chris Dunlop wrote:
>>> On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
>>>> On Fri, 15 Feb 2013, Chris Dunlop wrote:
>>>>> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause 
>>>>> the
>>>>> mons to lose touch with the osds?
>>>> 
>>>> Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
>>>> hopes that this happens again?  It will give us more information to go on.
>>> 
>>> Debug turned on.
>> 
>> We haven't experienced the cluster losing touch with the osds completely
>> since upgrading from 0.56.2 to 0.56.3, but we did lose touch with osd.1
>> for a few seconds before it recovered. See below for logs (reminder: 3
>> boxes, b2 is mon-only, b4 is mon+osd.0, b5 is mon+osd.1).
> 
> Hrm, I don't see any obvious clues.  You could enable 'debug ms = 1' on 
> the osds as well.  That will give us more to go on if/when it happens 
> again, and should not affect performance significantly.

Done: ceph osd tell '*' injectargs '--debug-ms 1'

Now to wait for it to happen again.

Chris


Re: Mon losing touch with OSDs

2013-02-19 Thread Chris Dunlop
On Tue, Feb 19, 2013 at 02:02:03PM +1100, Chris Dunlop wrote:
> On Sun, Feb 17, 2013 at 05:44:29PM -0800, Sage Weil wrote:
>> On Mon, 18 Feb 2013, Chris Dunlop wrote:
>>> On Sat, Feb 16, 2013 at 09:05:21AM +1100, Chris Dunlop wrote:
>>>> On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
>>>>> On Fri, 15 Feb 2013, Chris Dunlop wrote:
>>>>>> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might 
>>>>>> cause the
>>>>>> mons to lose touch with the osds?
>>>>> 
>>>>> Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
>>>>> hopes that this happens again?  It will give us more information to go on.
>>>> 
>>>> Debug turned on.
>>> 
>>> We haven't experienced the cluster losing touch with the osds completely
>>> since upgrading from 0.56.2 to 0.56.3, but we did lose touch with osd.1
>>> for a few seconds before it recovered. See below for logs (reminder: 3
>>> boxes, b2 is mon-only, b4 is mon+osd.0, b5 is mon+osd.1).
>> 
>> Hrm, I don't see any obvious clues.  You could enable 'debug ms = 1' on 
>> the osds as well.  That will give us more to go on if/when it happens 
>> again, and should not affect performance significantly.
> 
> Done: ceph osd tell '*' injectargs '--debug-ms 1'
> 
> Now to wait for it to happen again.

OK, we got it again. Full logs covering the incident available at:

https://www.dropbox.com/s/kguzwyjfglv3ypl/ceph-logs.zip

Archive:  /tmp/ceph-logs.zip
  Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
 --------  ------  ------- ---- ---------- ----- --------  ----
    11492  Defl:X     1186  90% 2013-02-20 12:04 c0cba4ae  ceph-mon.b2.log
  1270789  Defl:X    89278  93% 2013-02-20 12:00 2208d035  ceph-mon.b4.log
  1375858  Defl:X   104025  92% 2013-02-20 12:05 c64c1dad  ceph-mon.b5.log
  2020042  Defl:X   215000  89% 2013-02-20 10:40 f74ae4ca  ceph-osd.0.log
  2075512  Defl:X   224098  89% 2013-02-20 12:05 b454d2ec  ceph-osd.1.log
   154938  Defl:X    12989  92% 2013-02-20 12:04 d2729b05  ceph.log
 --------          -------  ---                            -------
  6908631           646576  91%                            6 files

My naive analysis, based on the log extracts below (best viewed on a wide
screen!)...

Osd.0 starts hearing much-delayed ping_replies from osd.1 and tells the mon,
which marks osd.1 down.

However the whole time, the osd.1 log indicates that it's receiving and
responding to each ping from osd.0 in a timely fashion. In contrast, the osd.0
log indicates it isn't seeing the osd.1 replies for a while, then sees them all
arrive in a flurry, until they're "delayed" enough to cause osd.0 to tell the
mon.

During the time osd.0 is not seeing the osd.1 ping_replies, there's other
traffic (osd_op, osd_sub_op, osd_sub_op_reply etc.) between osd.0 and osd.1,
indicating that it's not a network problem.

The load on both osds during this period was >90% idle and <1% iow.

Is this pointing to osd.0 experiencing some kind of scheduling or priority
starvation on the ping thread (assuming the ping is in its own thread)?
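
(For anyone wanting to do the same comparison: with debug ms = 1 the heartbeat
traffic is all in the osd logs, so something like the following, run against
each log, is enough to line the two sides up by timestamp -- paths assume the
default /var/log/ceph layout:)

  grep -E 'osd_ping|heartbeat_check' /var/log/ceph/ceph-osd.0.log
  grep -E 'osd_ping|heartbeat_check' /var/log/ceph/ceph-osd.1.log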

The next odd thing is that, although the osds are both back by 04:38:50 ("2
osds: 2 up, 2 in"), the system still wasn't working (see the disk stats for
both osd.0 and osd.1) and didn't recover until ceph (mon + osd) was restarted
on one of the boxes at around 05:50 (not shown in the logs, but full logs
available if needed).

Prior to the restart:

# ceph health
HEALTH_WARN 281 pgs peering; 281 pgs stuck inactive; 576 pgs stuck unclean

(Sorry, once again didn't get a 'ceph -s' prior to the restart.)

Chris.

--
ceph.log
--
2013-02-20 04:37:51.074128 mon.0 10.200.63.130:6789/0 120771 : [INF] pgmap 
v3000932: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 
GB used, 2881 GB / 3724 GB avail
2013-02-20 04:37:53.541471 mon.0 10.200.63.130:6789/0 120772 : [INF] pgmap 
v3000933: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 
GB used, 2881 GB / 3724 GB avail
2013-02-20 04:37:56.063059 mon.0 10.200.63.130:6789/0 120773 : [INF] pgmap 
v3000934: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 
GB used, 2881 GB / 3724 GB avail
2013-02-20 04:37:58.532763 mon.0 10.200.63.130:6789/0 120774 : [INF] pgmap 
v3000935: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 
GB used, 2881 GB / 3724 GB avail
<<<< osd.0 sees delayed ping_replies from here >>>>
2013-02-20 04:38:01.057939 mon.0 10.200.63.130:6789/0 120775 : [INF] 

Re: Mon losing touch with OSDs

2013-02-21 Thread Chris Dunlop
G'day,

It seems there might be two issues here: the first being the delayed
receipt of echo replies causing a seemingly otherwise healthy osd to be
marked down, the second being the lack of recovery once the downed osd is
recognised as up again.

Is it worth my opening tracker reports for this, just so it doesn't get
lost?

Cheers,

Chris

On Wed, Feb 20, 2013 at 01:07:03PM +1100, Chris Dunlop wrote:
> On Tue, Feb 19, 2013 at 02:02:03PM +1100, Chris Dunlop wrote:
>> On Sun, Feb 17, 2013 at 05:44:29PM -0800, Sage Weil wrote:
>>> On Mon, 18 Feb 2013, Chris Dunlop wrote:
>>>> On Sat, Feb 16, 2013 at 09:05:21AM +1100, Chris Dunlop wrote:
>>>>> On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
>>>>>> On Fri, 15 Feb 2013, Chris Dunlop wrote:
>>>>>>> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might 
>>>>>>> cause the
>>>>>>> mons to lose touch with the osds?
>>>>>> 
>>>>>> Can you enable 'debug ms = 1' on the mons and leave them that way, in 
>>>>>> the 
>>>>>> hopes that this happens again?  It will give us more information to go 
>>>>>> on.
>>>>> 
>>>>> Debug turned on.
>>>> 
>>>> We haven't experienced the cluster losing touch with the osds completely
>>>> since upgrading from 0.56.2 to 0.56.3, but we did lose touch with osd.1
>>>> for a few seconds before it recovered. See below for logs (reminder: 3
>>>> boxes, b2 is mon-only, b4 is mon+osd.0, b5 is mon+osd.1).
>>> 
>>> Hrm, I don't see any obvious clues.  You could enable 'debug ms = 1' on 
>>> the osds as well.  That will give us more to go on if/when it happens 
>>> again, and should not affect performance significantly.
>> 
>> Done: ceph osd tell '*' injectargs '--debug-ms 1'
>> 
>> Now to wait for it to happen again.
> 
> OK, we got it again. Full logs covering the incident available at:
> 
> https://www.dropbox.com/s/kguzwyjfglv3ypl/ceph-logs.zip
> 
> Archive:  /tmp/ceph-logs.zip
>   Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
>  --------  ------  ------- ---- ---------- ----- --------  ----
>     11492  Defl:X     1186  90% 2013-02-20 12:04 c0cba4ae  ceph-mon.b2.log
>   1270789  Defl:X    89278  93% 2013-02-20 12:00 2208d035  ceph-mon.b4.log
>   1375858  Defl:X   104025  92% 2013-02-20 12:05 c64c1dad  ceph-mon.b5.log
>   2020042  Defl:X   215000  89% 2013-02-20 10:40 f74ae4ca  ceph-osd.0.log
>   2075512  Defl:X   224098  89% 2013-02-20 12:05 b454d2ec  ceph-osd.1.log
>    154938  Defl:X    12989  92% 2013-02-20 12:04 d2729b05  ceph.log
>  --------          -------  ---                            -------
>   6908631           646576  91%                            6 files
> 
> My naive analysis, based on the log extracts below (best viewed on a wide
> screen!)...
> 
> Osd.0 starts hearing much-delayed ping_replies from osd.1 and tells the mon,
> which marks osd.1 down.
> 
> However the whole time, the osd.1 log indicates that it's receiving and
> responding to each ping from osd.0 in a timely fashion. In contrast, the osd.0
> log indicates it isn't seeing the osd.1 replies for a while, then sees them 
> all
> arrive in a flurry, until they're "delayed" enough to cause osd.0 to tell the
> mon.
> 
> During the time osd.0 is not seeing the osd.1 ping_replies, there's other 
> traffic
> (osd_op, osd_sub_op, osd_sub_op_reply etc.) between osd.0 and osd.1, 
> indicating
> that it's not a network problem.
> 
> The load on both osds during this period was >90% idle and <1% iow.
> 
> Is this pointing to osd.0 experiencing some kind of scheduling or priority
> starvation on the ping thread (assuming the ping is in it's own thread)?
> 
> The next odd thing is that, although the osds are both back by 04:38:50 ("2
> osds: 2 up, 2 in"), the system still wasn't working (see the disk stats for
> both osd.0 and osd.1) and didn't recover until ceph (mon + osd) was restarted
> on one of the boxes at around 05:50 (not shown in the logs, but full logs
> available if needed).
> 
> Prior to the restart:
> 
> # ceph health
> HEALTH_WARN 281 pgs peering; 281 pgs stuck inactive; 576 pgs stuck unclean
> 
> (Sorry, once again didn't get a 'ceph -s' prior to the restart.)
> 
> Chris.
> 
> --
> ceph.log
> --
> 2013-02-20 04:37:51.0741

Re: Mon losing touch with OSDs

2013-02-22 Thread Chris Dunlop
On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> On Fri, 22 Feb 2013, Chris Dunlop wrote:
>> G'day,
>> 
>> It seems there might be two issues here: the first being the delayed
>> receipt of echo replies causing an seemingly otherwise healthy osd to be
>> marked down, the second being the lack of recovery once the downed osd is
>> recognised as up again.
>> 
>> Is it worth my opening tracker reports for this, just so it doesn't get
>> lost?
> 
> I just looked at the logs.  I can't tell what happend to cause that 10 
> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
> 
> The strange bit is that after this, you get those indefinite hangs.  From 
> the logs it looks like the OSD rebound to an old port that was previously 
> open from osd.0.. probably from way back.  Do you have logs going further 
> back than what you posted?  Also, do you have osdmaps, say, 750 and 
> onward?  It looks like there is a bug in the connection handling code 
> (that is unrelated to the delay above).

Currently uploading logs starting from midnight to Dropbox; will send
links when they're up.

How would I retrieve the interesting osdmaps?

Chris.


Re: Mon losing touch with OSDs

2013-02-22 Thread Chris Dunlop
On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>> On Fri, 22 Feb 2013, Chris Dunlop wrote:
>>>> G'day,
>>>> 
>>>> It seems there might be two issues here: the first being the delayed
>>>> receipt of echo replies causing an seemingly otherwise healthy osd to be
>>>> marked down, the second being the lack of recovery once the downed osd is
>>>> recognised as up again.
>>>> 
>>>> Is it worth my opening tracker reports for this, just so it doesn't get
>>>> lost?
>>> 
>>> I just looked at the logs.  I can't tell what happend to cause that 10 
>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).

Is there any way of telling where they were delayed, i.e. in osd.1's output
queue or osd.0's input queue?

>>> The strange bit is that after this, you get those indefinite hangs.  From 
>>> the logs it looks like the OSD rebound to an old port that was previously 
>>> open from osd.0.. probably from way back.  Do you have logs going further 
>>> back than what you posted?  Also, do you have osdmaps, say, 750 and 
>>> onward?  It looks like there is a bug in the connection handling code 
>>> (that is unrelated to the delay above).
>> 
>> Currently uploading logs starting midnight to dropbox, will send
>> links when they're up.
>> 
>> How would I retrieve the interesting osdmaps?
> 
> They are in the monitor data directory, in the osdmap_full dir.

Logs from midnight onwards and osdmaps are in this folder:

https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2

  ceph-mon.b2.log.bz2
  ceph-mon.b4.log.bz2
  ceph-mon.b5.log.bz2
  ceph-osd.0.log.bz2
  ceph-osd.1.log.bz2 (still uploading as I type)
  osdmaps.zip
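
(In case it helps anyone else collecting the same thing: the osdmap_full
directory lives under the mon's data dir, so grabbing them is roughly the
following -- the path assumes the default 'mon data' location, adjust to
suit:)

  cd /var/lib/ceph/mon/ceph-b2
  zip -r /tmp/osdmaps.zip osdmap_full/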

Cheers,

Chris


Re: Mon losing touch with OSDs

2013-02-22 Thread Chris Dunlop
On Fri, Feb 22, 2013 at 04:25:39PM -0800, Sage Weil wrote:
> Hi Chris-
> 
> Can you confirm that both ceph-osd daemons are running v0.56.3 (i.e., 
> they were restarted after the upgrade)?

Not absolutely, but the indications are good: the osd.1 process
was started Feb 16 08:38:43 2013, 20 minutes before I sent my
email saying it had been upgraded. The osd.0 process was
restarted more recently to kick things along after the most
recent problem, however my command-line history shows an "apt-get
upgrade" followed by a "service ceph restart".
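
(Next time I'll pin it down properly; comparing the installed binary against
the running processes' start times is a reasonable check, e.g.:)

  ceph-osd --version               # version of the binary on disk
  ps -o lstart=,args= -C ceph-osd  # when the running ceph-osd processes started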

Chris


Re: Mon losing touch with OSDs

2013-02-22 Thread Chris Dunlop
On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>> On Fri, 22 Feb 2013, Chris Dunlop wrote:
>>>>>> G'day,
>>>>>> 
>>>>>> It seems there might be two issues here: the first being the delayed
>>>>>> receipt of echo replies causing an seemingly otherwise healthy osd to be
>>>>>> marked down, the second being the lack of recovery once the downed osd is
>>>>>> recognised as up again.
>>>>>> 
>>>>>> Is it worth my opening tracker reports for this, just so it doesn't get
>>>>>> lost?
>>>>> 
>>>>> I just looked at the logs.  I can't tell what happend to cause that 10 
>>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>> 
>> Is there any way of telling where they were delayed, i.e. in the 1's output
>> queue or 0's input queue?
> 
> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
> generate a lot of logging, though.

I really don't want to load the system with too much logging, but I'm happy
modifying code...  Are there specific interesting debug outputs which I can
modify so they're output under "ms = 1"?

Chris


Re: Mon losing touch with OSDs

2013-02-22 Thread Chris Dunlop
On Sat, Feb 23, 2013 at 11:50:26AM +1100, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 04:25:39PM -0800, Sage Weil wrote:
>> Hi Chris-
>> 
>> Can you confirm that both ceph-osd daemons are running v0.56.3 (i.e., 
>> they were restarted after the upgrade)?
> 
> Not absolutely, but the indications are good: the osd.1 process
> was started Feb 16 08:38:43 2013, 20 minutes before I sent my
> email saying it had been upgraded. The osd.0 process was
> restarted more recently to kick things along after the most
> recent problem, however my command line history shows a "apt-get
> upgrade" followed by a "service ceph restart".


I can't see anything in the logs indicating the various daemon version
numbers.

Unless I'm blind and it's already there, perhaps a good idea for every
daemon to log its version number when it starts?

Chris


Re: Mon losing touch with OSDs

2013-02-22 Thread Chris Dunlop
On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>> I just looked at the logs.  I can't tell what happend to cause that 10 
>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but 
>>>>>>> nothing 
>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>>>> 
>>>> Is there any way of telling where they were delayed, i.e. in the 1's output
>>>> queue or 0's input queue?
>>> 
>>> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
>>> generate a lot of logging, though.
>> 
>> I really don't want to load the system with too much logging, but I'm happy
>> modifying code...  Are there specific interesting debug outputs which I can
>> modify so they're output under "ms = 1"?
> 
> I'm basically interested in everything in writer() and write_message(), 
> and reader() and read_message()...

Like this?

--
diff --git a/src/msg/Pipe.cc b/src/msg/Pipe.cc
index 37b1eeb..db4774f 100644
--- a/src/msg/Pipe.cc
+++ b/src/msg/Pipe.cc
@@ -1263,7 +1263,7 @@ void Pipe::reader()
 
 // sleep if (re)connecting
 if (state == STATE_STANDBY) {
-  ldout(msgr->cct,20) << "reader sleeping during reconnect|standby" << 
dendl;
+  ldout(msgr->cct, 1) << "reader sleeping during reconnect|standby" << 
dendl;
   cond.Wait(pipe_lock);
   continue;
 }
@@ -1272,28 +1272,28 @@ void Pipe::reader()
 
 char buf[80];
 char tag = -1;
-ldout(msgr->cct,20) << "reader reading tag..." << dendl;
+ldout(msgr->cct, 1) << "reader reading tag..." << dendl;
 if (tcp_read((char*)&tag, 1) < 0) {
   pipe_lock.Lock();
-  ldout(msgr->cct,2) << "reader couldn't read tag, " << strerror_r(errno, 
buf, sizeof(buf)) << dendl;
+  ldout(msgr->cct, 1) << "reader couldn't read tag, " << strerror_r(errno, 
buf, sizeof(buf)) << dendl;
   fault(true);
   continue;
 }
 
 if (tag == CEPH_MSGR_TAG_KEEPALIVE) {
-  ldout(msgr->cct,20) << "reader got KEEPALIVE" << dendl;
+  ldout(msgr->cct, 1) << "reader got KEEPALIVE" << dendl;
   pipe_lock.Lock();
   continue;
 }
 
 // open ...
 if (tag == CEPH_MSGR_TAG_ACK) {
-  ldout(msgr->cct,20) << "reader got ACK" << dendl;
+  ldout(msgr->cct, 1) << "reader got ACK" << dendl;
   ceph_le64 seq;
   int rc = tcp_read((char*)&seq, sizeof(seq));
   pipe_lock.Lock();
   if (rc < 0) {
-   ldout(msgr->cct,2) << "reader couldn't read ack seq, " << 
strerror_r(errno, buf, sizeof(buf)) << dendl;
+   ldout(msgr->cct, 1) << "reader couldn't read ack seq, " << 
strerror_r(errno, buf, sizeof(buf)) << dendl;
fault(true);
   } else if (state != STATE_CLOSED) {
 handle_ack(seq);
@@ -1302,7 +1302,7 @@ void Pipe::reader()
 }
 
 else if (tag == CEPH_MSGR_TAG_MSG) {
-  ldout(msgr->cct,20) << "reader got MSG" << dendl;
+  ldout(msgr->cct, 1) << "reader got MSG" << dendl;
   Message *m = 0;
   int r = read_message(&m);
 
@@ -1342,7 +1342,7 @@ void Pipe::reader()
 
   cond.Signal();  // wake up writer, to ack this
   
-  ldout(msgr->cct,10) << "reader got message "
+  ldout(msgr->cct, 1) << "reader got message "
   << m->get_seq() << " " << m << " " << *m
   << dendl;
 
@@ -1360,7 +1360,7 @@ void Pipe::reader()
 } 
 
 else if (tag == CEPH_MSGR_TAG_CLOSE) {
-  ldout(msgr->cct,20) << "reader got CLOSE" << dendl;
+  ldout(msgr->cct, 1) << "reader got CLOSE" << dendl;
   pipe_lock.Lock();
   if (state == STATE_CLOSING) {
state = STATE_CLOSED;
@@ -1383,7 +1383,7 @@ void Pipe::reader()
   reader_running = false;
   reader_needs_join = true;
   unlock_maybe_reap();
-  ldout(msgr->cct,10) <<

Re: Mon losing touch with OSDs

2013-02-22 Thread Chris Dunlop
On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>>>> I just looked at the logs.  I can't tell what happend to cause that 
>>>>>>>>> 10 
>>>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but 
>>>>>>>>> nothing 
>>>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>>>>>> 
>>>>>> Is there any way of telling where they were delayed, i.e. in the 1's 
>>>>>> output
>>>>>> queue or 0's input queue?
>>>>> 
>>>>> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
>>>>> generate a lot of logging, though.
>>>> 
>>>> I really don't want to load the system with too much logging, but I'm happy
>>>> modifying code...  Are there specific interesting debug outputs which I can
>>>> modify so they're output under "ms = 1"?
>>> 
>>> I'm basically interested in everything in writer() and write_message(), 
>>> and reader() and read_message()...
>> 
>> Like this?
> 
> Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect 
> that this is the lions share of what debug 20 will spam to the log, but 
> hopefully the load is manageable!

Good idea on the '2'. I'll get that installed and wait for it to happen again.
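
(And for the record, with the interesting messages moved to level 2 they can
be switched on and off at runtime with the same injectargs mechanism as
before, without restarting anything:)

  ceph osd tell '*' injectargs '--debug-ms 2'   # enable the extra output
  ceph osd tell '*' injectargs '--debug-ms 1'   # quieten it down again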

Chris


Re: Mon losing touch with OSDs

2013-02-28 Thread Chris Dunlop
On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>>>>> I just looked at the logs.  I can't tell what happend to cause that 
>>>>>>>>>> 10 
>>>>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but 
>>>>>>>>>> nothing 
>>>>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>>>>>>> 
>>>>>>> Is there any way of telling where they were delayed, i.e. in the 1's 
>>>>>>> output
>>>>>>> queue or 0's input queue?
>>>>>> 
>>>>>> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
>>>>>> generate a lot of logging, though.
>>>>> 
>>>>> I really don't want to load the system with too much logging, but I'm 
>>>>> happy
>>>>> modifying code...  Are there specific interesting debug outputs which I 
>>>>> can
>>>>> modify so they're output under "ms = 1"?
>>>> 
>>>> I'm basically interested in everything in writer() and write_message(), 
>>>> and reader() and read_message()...
>>> 
>>> Like this?
>> 
>> Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect 
>> that this is the lions share of what debug 20 will spam to the log, but 
>> hopefully the load is manageable!
> 
> Good idea on the '2'. I'll get that installed and wait for it to happen again.

FYI...

To avoid running out of disk space for the massive logs, I
started using logrotate on the ceph logs every two hours, which
does a 'service ceph reload' to re-open the log files.
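
(Roughly what that looks like -- the rotation itself is driven from cron every
two hours with 'logrotate -f <this file>', and the counts/globs here are a
sketch rather than the exact config:)

  /var/log/ceph/*.log {
      rotate 24
      compress
      missingok
      notifempty
      sharedscripts
      postrotate
          service ceph reload >/dev/null 2>&1 || true
      endscript
  }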

In the week since doing that I haven't seen any 'slow requests'
at all (the load has stayed the same as before the change),
which means the issue with the osds dropping out, then the
system not recovering properly, also hasn't happened.

That's a bit suspicious, no?

I've now put the log dirs on each machine on their own 2TB
partition and reverted back to the default daily rotates.

And once more we're waiting... Godot, is that you?


Chris


Re: Mon losing touch with OSDs

2013-03-07 Thread Chris Dunlop
On Thu, Feb 28, 2013 at 09:00:24PM -0800, Sage Weil wrote:
> On Fri, 1 Mar 2013, Chris Dunlop wrote:
>> On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
>>> On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>>>>>>> I just looked at the logs.  I can't tell what happend to cause 
>>>>>>>>>>>> that 10 
>>>>>>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but 
>>>>>>>>>>>> nothing 
>>>>>>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, 
>>>>>>>>>>>> them).
>>>>>>>>> 
>>>>>>>>> Is there any way of telling where they were delayed, i.e. in the 1's 
>>>>>>>>> output
>>>>>>>>> queue or 0's input queue?
>>>>>>>> 
>>>>>>>> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
>>>>>>>> generate a lot of logging, though.
>>>>>>> 
>>>>>>> I really don't want to load the system with too much logging, but I'm 
>>>>>>> happy
>>>>>>> modifying code...  Are there specific interesting debug outputs which I 
>>>>>>> can
>>>>>>> modify so they're output under "ms = 1"?
>>>>>> 
>>>>>> I'm basically interested in everything in writer() and write_message(), 
>>>>>> and reader() and read_message()...
>>>>> 
>>>>> Like this?
>>>> 
>>>> Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect 
>>>> that this is the lions share of what debug 20 will spam to the log, but 
>>>> hopefully the load is manageable!
>>> 
>>> Good idea on the '2'. I'll get that installed and wait for it to happen 
>>> again.
>> 
>> FYI...
>> 
>> To avoid running out of disk space for the massive logs, I
>> started using logrotate on the ceph logs every two hours, which
>> does a 'service ceph reload' to re-open the log files.
>> 
>> In the week since doing that I haven't seen any 'slow requests'
>> at all (the load has stayed the same as before the change),
>> which means the issue with the osds dropping out, then the
>> system not recovering properly, also hasn't happened.
>> 
>> That's a bit suspicious, no?
> 
> I suspect the logging itself is changing the timing.  Let's wait and see 
> if we get lucky... 

We got "lucky"...

ceph-osd.0.log:
2013-03-08 03:46:44.786682 7fcc62172700  1 -- 192.168.254.132:0/20298 --> 
192.168.254.133:6801/23939 -- osd_ping(ping e815 stamp 2013-03-08 
03:46:44.786679) v2 -- ?+0 0x765b180 con 0x6ab6160
  [no ping_reply logged, then later...]
2013-03-08 03:46:56.211993 7fcc71190700 -1 osd.0 815 heartbeat_check: no reply 
from osd.1 since 2013-03-08 03:46:35.986327 (cutoff 2013-03-08 03:46:36.211992)

ceph-osd.1.log:
2013-03-08 03:46:44.786848 7fe6f47a4700  1 -- 192.168.254.133:6801/23939 <== 
osd.0 192.168.254.132:0/20298 178549  osd_ping(ping e815 stamp 2013-03-08 
03:46:44.786679) v2  47+0+0 (1298645350 0 0) 0x98256c0 con 0x7bd2160
2013-03-08 03:46:44.786880 7fe6f47a4700  1 -- 192.168.254.133:6801/23939 --> 
192.168.254.132:0/20298 -- osd_ping(ping_reply e815 stamp 2013-03-08 
03:46:44.786679) v2 -- ?+0 0x29876c0 con 0x7bd2160

Interestingly, the matching ping_reply from osd.1 never appears in the
osd.0 log, in contrast to the previous incident upthread where the
"missing" ping replies were all seen in a rush (but after osd.1 had been
marked down).

The missing ping_reply caused osd.1 to get marked down, then it marked
itself up again a bit later ("map e818 wrongly marked me down"). However
the system still hadn't recovered before 07:46:29 when a 'servi

Re: Mon losing touch with OSDs

2013-03-08 Thread Chris Dunlop
On Fri, Mar 08, 2013 at 02:12:40PM +1100, Chris Dunlop wrote:
> On Thu, Feb 28, 2013 at 09:00:24PM -0800, Sage Weil wrote:
>> On Fri, 1 Mar 2013, Chris Dunlop wrote:
>>> On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
>>>> On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>>>>>>>> I just looked at the logs.  I can't tell what happend to cause 
>>>>>>>>>>>>> that 10 
>>>>>>>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but 
>>>>>>>>>>>>> nothing 
>>>>>>>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, 
>>>>>>>>>>>>> them).
>>>>>>>>>> 
>>>>>>>>>> Is there any way of telling where they were delayed, i.e. in the 1's 
>>>>>>>>>> output
>>>>>>>>>> queue or 0's input queue?
>>>>>>>>> 
>>>>>>>>> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
>>>>>>>>> generate a lot of logging, though.
>>>>>>>> 
>>>>>>>> I really don't want to load the system with too much logging, but I'm 
>>>>>>>> happy
>>>>>>>> modifying code...  Are there specific interesting debug outputs which 
>>>>>>>> I can
>>>>>>>> modify so they're output under "ms = 1"?
>>>>>>> 
>>>>>>> I'm basically interested in everything in writer() and write_message(), 
>>>>>>> and reader() and read_message()...
>>>>>> 
>>>>>> Like this?
>>>>> 
>>>>> Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect 
>>>>> that this is the lions share of what debug 20 will spam to the log, but 
>>>>> hopefully the load is manageable!
>>>> 
>>>> Good idea on the '2'. I'll get that installed and wait for it to happen 
>>>> again.
>>> 
>>> FYI...
>>> 
>>> To avoid running out of disk space for the massive logs, I
>>> started using logrotate on the ceph logs every two hours, which
>>> does a 'service ceph reload' to re-open the log files.
>>> 
>>> In the week since doing that I haven't seen any 'slow requests'
>>> at all (the load has stayed the same as before the change),
>>> which means the issue with the osds dropping out, then the
>>> system not recovering properly, also hasn't happened.
>>> 
>>> That's a bit suspicious, no?
>> 
>> I suspect the logging itself is changing the timing.  Let's wait and see 
>> if we get lucky... 
> 
> We got "lucky"...
> 
> ceph-osd.0.log:
> 2013-03-08 03:46:44.786682 7fcc62172700  1 -- 192.168.254.132:0/20298 --> 
> 192.168.254.133:6801/23939 -- osd_ping(ping e815 stamp 2013-03-08 
> 03:46:44.786679) v2 -- ?+0 0x765b180 con 0x6ab6160
>   [no ping_reply logged, then later...]
> 2013-03-08 03:46:56.211993 7fcc71190700 -1 osd.0 815 heartbeat_check: no 
> reply from osd.1 since 2013-03-08 03:46:35.986327 (cutoff 2013-03-08 
> 03:46:36.211992)

Bugger. I just realised that the cluster had come up without
the "ms = 2" logging enabled, making these logs no more useful
than the previous ones for working out who dropped the missing
ping_reply.

Injected "ms = 2" into the osds and mons, and added it to the
config files.

Sigh. And we're waiting again...

Chris.


Speed up 'rbd rm'

2013-05-28 Thread Chris Dunlop
Hi,

I know 'rbd rm' is notoriously slow, even on never-written devices:

http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/6740
http://tracker.ceph.com/issues/2256

I fat-fingered an 'rbd create' and accidentally created a 1.5 PB device,
and it's going to take some considerable time to remove this!

I see there's a new commit to speed up an 'rbd rm':

http://tracker.ceph.com/projects/ceph/repository/revisions/40956410169709c32a282d9b872cb5f618a48926

Is it safe to cherry-pick this commit on top of 0.56.6 (or, if not, v0.61.2) to 
speed up the remove?

Cheers,

Chris


Re: Speed up 'rbd rm'

2013-05-29 Thread Chris Dunlop
On Wed, May 29, 2013 at 12:21:07PM -0700, Josh Durgin wrote:
> On 05/28/2013 10:59 PM, Chris Dunlop wrote:
>> I see there's a new commit to speed up an 'rbd rm':
>>
>> http://tracker.ceph.com/projects/ceph/repository/revisions/40956410169709c32a282d9b872cb5f618a48926
>>
>> Is it safe to cherry-pick this commit on top of 0.56.6 (or, if not, v0.61.2) 
>> to speed up the remove?
> 
> You'll need 537386d906b8c0e395433461dcb03a82eb33f34f as well. It should
> apply cleanly to 0.61.2, and probably 0.56.6 too.

Thanks. I'll see how I go, I may just leave the 'rm' running all
weekend rather than futzing around recompiling ceph and getting
off the mainline track.
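
(If I do end up building it, I assume it's just the two commits on top of the
release tag, something like the below -- untested, and the branch name is
arbitrary:)

  git checkout -b rbd-rm-speedup v0.61.2
  git cherry-pick 537386d906b8c0e395433461dcb03a82eb33f34f
  git cherry-pick 40956410169709c32a282d9b872cb5f618a48926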

Perhaps, as a low-churn, high-impact (for those that need it) change,
this is a candidate to be back-patched into 0.61.3 and perhaps 0.56.7?

Cheers,

Chris.


Re: Speed up 'rbd rm'

2013-05-30 Thread Chris Dunlop
On Thu, May 30, 2013 at 01:50:14PM -0700, Josh Durgin wrote:
> On 05/29/2013 07:23 PM, Chris Dunlop wrote:
>> On Wed, May 29, 2013 at 12:21:07PM -0700, Josh Durgin wrote:
>>> On 05/28/2013 10:59 PM, Chris Dunlop wrote:
>>>> I see there's a new commit to speed up an 'rbd rm':
>>>>
>>>> http://tracker.ceph.com/projects/ceph/repository/revisions/40956410169709c32a282d9b872cb5f618a48926
>>>>
>>>> Is it safe to cherry-pick this commit on top of 0.56.6 (or, if not, 
>>>> v0.61.2) to speed up the remove?
>>>
>>> You'll need 537386d906b8c0e395433461dcb03a82eb33f34f as well. It should
>>> apply cleanly to 0.61.2, and probably 0.56.6 too.
>>
>> Thanks. I'll see how I go, I may just leave the 'rm' running all
>> weekend rather than futzing around recompiling ceph and getting
>> off the mainline track.
> 
> If you're mainly interested in getting rid of the accidentally 1.5PB
> image, you can just delete the header (and id object if it's format 2)
> and then 'rbd rm' will just remove it from the rbd_directory index, and
> not try to delete all the non-existent data objects.

Yes, that's my main interest. Sorry, I haven't yet delved far
into the details of how the rbd stuff hangs together: can you
give me a hint or point me towards any docs regarding what "delete the
header (and id object)" would look like?


Cheers,

Chris.


Re: Speed up 'rbd rm'

2013-06-03 Thread Chris Dunlop
On Thu, May 30, 2013 at 07:04:28PM -0700, Josh Durgin wrote:
> On 05/30/2013 06:40 PM, Chris Dunlop wrote:
>> On Thu, May 30, 2013 at 01:50:14PM -0700, Josh Durgin wrote:
>>> On 05/29/2013 07:23 PM, Chris Dunlop wrote:
>>>> On Wed, May 29, 2013 at 12:21:07PM -0700, Josh Durgin wrote:
>>>>> On 05/28/2013 10:59 PM, Chris Dunlop wrote:
>>>>>> I see there's a new commit to speed up an 'rbd rm':
>>>>>>
>>>>>> http://tracker.ceph.com/projects/ceph/repository/revisions/40956410169709c32a282d9b872cb5f618a48926
>>>>>>
>>>>>> Is it safe to cherry-pick this commit on top of 0.56.6 (or, if not, 
>>>>>> v0.61.2) to speed up the remove?
>>>>>
>>>>> You'll need 537386d906b8c0e395433461dcb03a82eb33f34f as well. It should
>>>>> apply cleanly to 0.61.2, and probably 0.56.6 too.
>>>>
>>>> Thanks. I'll see how I go, I may just leave the 'rm' running all
>>>> weekend rather than futzing around recompiling ceph and getting
>>>> off the mainline track.

# time rbd rm rbd/large-image
Removing image: 36% complete...Terminated
real2819m37.117s

I.e. 47 hours and only 36% complete before I gave up (I wanted
to restart that server). At that rate it would take 5.5 days to
remove!

>>> If you're mainly interested in getting rid of the accidentally 1.5PB
>>> image, you can just delete the header (and id object if it's format 2)
>>> and then 'rbd rm' will just remove it from the rbd_directory index, and
>>> not try to delete all the non-existent data objects.
>>
>> Yes, that's my main interest. Sorry, I haven't yet delved far
>> into the details of how the rbd stuff hangs together: can you
>> give me a hint or point me towards any docs regarding what
>> "delete the header (and id object" would look like?
> 
> For a format 2 image, 'rbd info imagename' will show a block_prefix
> like 'rbd_data.101574b0dc51'.
> 
> The random suffix after the '.' is the id of the image.
> For format 2, the header is named after this id, so you'd do:
> 
> rados -p poolname rm rbd_header.101574b0dc51
> 
> For format 1 images, the header object is named after the image name,
> like 'imagename.rbd'.
>
> After removing the header object manually, rbd rm will clean up the
> rest.

The problematical image is format 2.

If it's tricky to manually remove, it's not doing any harm just
sitting there (is it??), so I guess I can just wait until the
parallelized delete is available in a stable release, i.e.
dumpling, or backported to bobtail or cuttlefish.
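
For the archives, the manual removal Josh describes would presumably look
something like this (an untested sketch, using "large-image" as a stand-in
for the real image name and <id> for the random suffix shown by 'rbd info'):

  rbd info rbd/large-image            # note the block prefix, e.g. rbd_data.<id>
  rados -p rbd rm rbd_header.<id>     # the format 2 header object
  rados -p rbd rm rbd_id.large-image  # the format 2 id object
  rbd rm rbd/large-image              # now only has to clean up rbd_directory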

Cheers,

Chris.


krbd + format=2 ?

2013-06-03 Thread Chris Dunlop
G'day,

Sage's recent pull message to Linus said:


Please pull the following Ceph patches from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This is a big pull.  Most of it is culmination of Alex's work to implement
RBD image layering, which is now complete (yay!).


Am I correct in thinking "RBD image layering... is now complete"
implies there should be full(?) support for format=2?

I pulled the for-linus branch (@ 3abef3b) on top of 3.10.0-rc4, and it's
letting me map a format=2 image (created under bobtail), however reading
from the block device returns zeros rather than the data. The same image
correctly shows data (NTFS filesystem) when mounted into kvm using librbd.


# uname -r
3.10.0-rc4-00010-g0326739
# rbd ls -l
NAME   SIZE PARENT   FMT PROT LOCK 
xxx   1536G2
# rbd map rbd/xxx
# rbd showmapped
id pool image   snap device
1  rbd  xxx -/dev/rbd1 
# dd if=/dev/rbd1 of=/tmp/xxx count=20480
20480+0 records in
20480+0 records out
10485760 bytes (10 MB) copied, 0.757754 s, 13.8 MB/s
# od -c /tmp/xxx | less
000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
5000



Cheers,

Chris


Re: krbd + format=2 ?

2013-06-07 Thread Chris Dunlop
On Fri, Jun 07, 2013 at 11:54:20AM -0500, Alex Elder wrote:
> On 06/03/2013 04:24 AM, Chris Dunlop wrote:
>> I pulled the for-linus branch (@ 3abef3b) on top of 3.10.0-rc4, and it's
>> letting me map a format=2 image (created under bobtail), however reading
>> from the block device returns zeros rather than the data. The same image
>> correctly shows data (NTFS filesystem) when mounted into kvm using librbd.
> 
> Have you tried using a format 2 image that you created using
> the Linux rbd environment?  It would be good to know whether
> that works for you.

Sorry, how do you mean "created using the Linux rbd environment"?
The one I was trying was created using:

rbd create --format 2 xxx --size n

...then populated using qemu/librbd.

> Can you also send me the result of running each of these commands:
> 
> foo=$(rados get --pool=rbd rbd_id.xxx - | strings)
> rados ls --pool=rbd | grep rbd_data | grep "${foo}"
> unset foo

That results in a 3.2MB, 83973-line file, starting like:

rbd_data.e4d72ae8944a.00027066
rbd_data.e4d72ae8944a.0001bc77
rbd_data.e4d72ae8944a.0002112e
rbd_data.e4d72ae8944a.0002550c
rbd_data.e4d72ae8944a.0001b6b3
rbd_data.e4d72ae8944a.0002875b

Complete file is at: https://www.dropbox.com/s/5bw6ba7ifatf1mu/rbd_data.out

(Note: it's a long w'end here in Oz; I won't be back at this till Tue.)

Cheers,

Chris


Re: krbd + format=2 ?

2013-06-11 Thread Chris Dunlop
On Sat, Jun 08, 2013 at 12:48:52PM +1000, Chris Dunlop wrote:
> On Fri, Jun 07, 2013 at 11:54:20AM -0500, Alex Elder wrote:
>> On 06/03/2013 04:24 AM, Chris Dunlop wrote:
>>> I pulled the for-linus branch (@ 3abef3b) on top of 3.10.0-rc4, and it's
>>> letting me map a format=2 image (created under bobtail), however reading
>>> from the block device returns zeros rather than the data. The same image
>>> correctly shows data (NTFS filesystem) when mounted into kvm using librbd.
>> 
>> Have you tried using a format 2 image that you created using
>> the Linux rbd environment?  It would be good to know whether
>> that works for you.
> 
> Sorry, how do you mean "created using the Linux rbd environment"?
> The one I was trying was created using:
> 
> rbd create --format 2 xxx --size n
> 
> ...then populated using qemu/librbd.

Looks like the kernel rbd and librbd aren't compatible, as at
3.10.0-rc4+ceph-client/for-linus@3abef3b vs librbd1 0.56.6-1~bpo70+1.

I can create a format=2 rbd image, map it, format and populate it using the
mapped device. I can then see that data after unmounting and unmapping, then
re-mapping and re-mounting. However if I then attach that image via librbd to a
qemu-1.5.0~rc3 vm using the invocation below, the vm sees the whole device as
zeros.

Likewise, if I create a new format=2 rbd image, attach it via librbd to a
qemu-1.5.0~rc3 vm, format it and populate it, that data is thereafter visible
in the vm (e.g. after remounting etc.). But if I map that image using the
kernel client, again I get a device full of zeros.
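
In case it helps to reproduce, the krbd side of the round trip is roughly
the following (a sketch; image name, size and filesystem are illustrative):

  rbd create --format 2 xxx --size 10240
  rbd map rbd/xxx                        # appears as e.g. /dev/rbd1
  mkfs.ext4 /dev/rbd1
  mount /dev/rbd1 /mnt; cp -a /some/data /mnt; umount /mnt
  rbd unmap /dev/rbd1

Re-map and re-mount and the data is there via krbd, but attach the same
image to a vm via librbd and it reads back as all zeros (and vice versa
for an image populated via librbd).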

# dpkg -l \*ceph\* \*rbd\* \*rados\* | grep ^i
ii  ceph   0.56.6-1~bpo70+1  amd64  distributed storage and file system
ii  ceph-common0.56.6-1~bpo70+1  amd64  common utilities to mount and 
interact with a ceph storage cluster
ii  librados2  0.56.6-1~bpo70+1  amd64  RADOS distributed object store 
client library
ii  librbd10.56.6-1~bpo70+1  amd64  RADOS block device client library

##
## Attach disk to vm
##
# virsh attach-device test <(cat <<END
[libvirt <disk> XML for the librbd-backed disk -- stripped by the list archive]
END
)


Cheers,

Chris


Re: krbd + format=2 ?

2013-06-12 Thread Chris Dunlop
On Wed, Jun 12, 2013 at 08:56:50PM -0700, Josh Durgin wrote:
> On 06/11/2013 09:59 PM, Chris Dunlop wrote:
>> Looks like the kernel rbd and librbd aren't compatible, as at
>> 3.10.0-rc4+ceph-client/for-linus@3abef3b vs librbd1 0.56.6-1~bpo70+1.
> 
> Thanks for the detailed report Chris. The kernel client was using the
> wrong object names for format 2 (zero-padding them with a different
> length than librbd). I just posted a patch fixing this.

Works for me!

(I had to modify the first of your posted patches to apply it to
current linux master (@26e04462c) which pulled in Sage's
for-linus branch: see below.)

Cheers,

Chris.

--
commit 932c369af21c9449649238f1b91d1062ab45384e
Author: Josh Durgin 
Date:   Wed Jun 12 20:10:45 2013 -0700

rbd: fetch object order before using it

rbd_dev_v2_header_onetime() fetches striping information, and
checks whether the image can be read by comparing the stripe unit
to the object size. It determines the object size by shifting
the object order, which is 0 at this point since it has not been
read yet. Move the call to get the image size and object order
before rbd_dev_v2_header_onetime() so it is set before use.

Signed-off-by: Josh Durgin 

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 3063452..2fecab2 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4239,6 +4239,10 @@ static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev)
 
down_write(&rbd_dev->header_rwsem);
 
+   ret = rbd_dev_v2_image_size(rbd_dev);
+   if (ret)
+   goto out;
+
if (first_time) {
ret = rbd_dev_v2_header_onetime(rbd_dev);
if (ret)
@@ -4272,10 +4276,6 @@ static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev)
"is EXPERIMENTAL!");
}
 
-   ret = rbd_dev_v2_image_size(rbd_dev);
-   if (ret)
-   goto out;
-
if (rbd_dev->spec->snap_id == CEPH_NOSNAP)
if (rbd_dev->mapping.size != rbd_dev->header.image_size)
rbd_dev->mapping.size = rbd_dev->header.image_size;


libceph: error -2 building auth method x request

2012-10-14 Thread Chris Dunlop
G'day,

In case anyone else might be getting the $subject error, and to
make the issue visible to a google search...

When trying to map an rdb block device whilst using cephx
authentication, if you get kernel messages like:

[79683.055935] libceph: client0 fsid c6618970-0ce0-4cb2-bc9a-dd5f29b62e24
[79683.059286] libceph: error -2 building auth method x request

...ensure you have the appropriate encryption available in the
kernel (built in or modules). I'm a bit unclear as to precisely
what is required, but Josh Durgin's comment was:

  The only path I'm seeing that could lead to that error is if
  the kernel wasn't compiled with support for "cbc(aes)" with
  the CRYPTO_ALG_ASYNC flag. I'd guess any of the CRYPTO_AES*
  build options (with CRYPTO_BLKCIPHER2?) would be sufficient.

These modules were required to get my mapping working whilst
using cephx:

  aes_x86_64
  aes_generic
  cbc
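
A quick way to check before mapping is something like this (a sketch; the
config symbols and module names vary by kernel version and arch, and it
assumes the kernel config is available as /boot/config-$(uname -r)):

  grep -E 'CONFIG_CRYPTO_(AES|CBC)' /boot/config-$(uname -r)
  modprobe aes_x86_64
  modprobe aes_generic
  modprobe cbc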

See also: http://tracker.newdream.net/issues/3285

Cheers,

Chris.


kernel rbd format=2

2012-12-17 Thread Chris Dunlop
Hi,

Format 2 images (and attendant layering support) are not yet
supported by the kernel rbd client, according to:

http://ceph.com/docs/master/rbd/rbd-snapshot/#layering

When might this support be available?

Cheers,

Chris


Simple test system not working

2011-01-31 Thread Chris Dunlop
G'day,

I'm just starting with ceph and am having trouble getting a test ceph
installation going on debian lenny with a self-compiled 2.6.37. Ceph
itself is latest unstable (73e76723) compiled using dpkg-buildpackage.
The mon, mds and osd are all on the same machine. The ceph base /data
directory is ext2 mounted with user_xattr.  The ceph filesystem is
created using the attached ceph.conf and:

  /sbin/mkcephfs -v -c /etc/ceph/ceph.conf --allhosts --mkbtrfs -k /etc/ceph/keyring.bin

Once ceph has been started, trying a mount on a (separate) client
machine (2.6.37 kernel) hangs before eventually giving an error:

  # mount -t ceph b3:/ /mnt
  mount error 5 = Input/output error

On the server side, it seems that the cmon isn't accepting connections
from the mds or osd. E.g. repeated instances of this in the osd log,
and similar "mark_down"s in the mds log:

  2011-02-01 17:18:47.463897 4a254950 -- 0.0.0.0:6801/7854 mark_down 10.200.63.129:6789/0 -- 0xca8a00
  2011-02-01 17:18:47.463924 4a254950 -- 0.0.0.0:6801/7854 --> mon0 10.200.63.129:6789/0 -- auth(proto 0 26 bytes) v1 -- ?+0 0xcd9800

The mon log (amongst a whole bunch of other stuff) repeats:

  2011-02-01 17:18:51.445785 42c73950 mon.0@0(leader).mds e1 e1: 0/0/1 up
  2011-02-01 17:18:51.445790 42c73950 mon.0@0(leader).osd e1 e1: 0 osds: 0 up, 0 in

The attached logs.zip has the full logs from one attempted start.

Is there something obvious I'm doing wrong?

Thanks for any assistance!

Chris




Re: Simple test system not working

2011-01-31 Thread Chris Dunlop
Chris Dunlop  onthe.net.au> writes:
> The ceph filesystem is created using the attached ceph.conf and:

Sorry, no attachments.

ceph.conf:
--
[global]
pid file = /var/run/ceph/$name.pid
debug ms = 20
debug mon = 20
[mon]
mon data = /data/mon$id
debug ms = 90
debug mon = 90
debug paxos = 90
debug auth = 90
[mon0]
host = b3
mon addr = 10.200.63.129:6789
[mds]
keyring = /data/keyring.$name
[mds.b3]
host = b3
[osd]
osd data = /data/osd$id
osd journal = /data/osd$id/journal
osd journal size = 1000 ; journal size, in megabytes
debug ms = 1
debug osd = 20
debug filestore = 20
debug journal = 20
[osd0]
host = b3
btrfs devs = /dev/sdb1
--

> The attached logs.zip has the full logs from one attempted start.

Available at: http://www.onthe.net.au/private/logs.zip




Re: Simple test system not working

2011-02-01 Thread Chris Dunlop
Samuel Just  hq.newdream.net> writes:
> From the logs, it looks like the osd and mds are failing to bind to
> an address.  Perhaps b3 is not a valid host name?

Thanks!

That wasn't it (b3 is valid), but you started me looking at the
networking side of things again (for the nth time) and this time
around I noticed there was no loopback interface on the box. That in
itself was odd, although it didn't immediately spring to mind that the
lack of a loopback would stop ceph working. But I added a loopback
anyway on the basis of "avoid oddness" and that got it all going. I
still don't know why a loopback is required for ceph, but apparently
it is!
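
For anyone else missing it, restoring the loopback on Debian is just the
usual stanza (a sketch, assuming the stock ifupdown setup):

  echo 'auto lo'                >> /etc/network/interfaces
  echo 'iface lo inet loopback' >> /etc/network/interfaces
  ifup lo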

FYI, with it working, the osd and mds are still bound to the wildcard
address rather than to a specific address (as they were before the
loopback was installed).  Without loopback:

# lsof -n -i | egrep 'COMMAND|cmon|cmds|cosd'
COMMANDPIDUSER   FD   TYPE DEVICE SIZE NODE NAME
cmon  3321root5u  IPv4   8451   TCP 10.200.63.129:6789 (LISTEN)
cmds  3357root4u  IPv4   8465   TCP *:6800 (LISTEN)
cmds  3357root5u  IPv4  10044   TCP
10.200.63.129:59480->10.200.63.129:6789 (SYN_SENT)
cosd  3466root3u  IPv4   8548   TCP *:6801 (LISTEN)
cosd  3466root5u  IPv4   8549   TCP *:6802 (LISTEN)
cosd  3466root6u  IPv4   8550   TCP *:6803 (LISTEN)
cosd  3466root   13u  IPv4  10046   TCP
10.200.63.129:59481->10.200.63.129:6789 (SYN_SENT)

Note that the DEVICE number for the SYN_SENT connections keeps
changing in this situation.

With the loopback:

# lsof -n -i | egrep 'COMMAND|cmon|cmds|cosd'
COMMANDPIDUSER   FD   TYPE DEVICE SIZE NODE NAME
cmon  6633root5u  IPv4  13277   TCP 10.200.63.129:6789 (LISTEN)
cmon  6633root6u  IPv4  14353   TCP
10.200.63.129:6789->10.200.63.129:32884 (ESTABLISHED)
cmon  6633root7u  IPv4  13367   TCP
10.200.63.129:6789->10.200.63.129:32883 (ESTABLISHED)
cmds  6669root4u  IPv4  13294   TCP *:6800 (LISTEN)
cmds  6669root5u  IPv4  13366   TCP
10.200.63.129:32883->10.200.63.129:6789 (ESTABLISHED)
cmds  6669root   10u  IPv4  13456   TCP
10.200.63.129:41828->10.200.63.129:6801 (ESTABLISHED)
cosd  6741root3u  IPv4  14350   TCP *:6801 (LISTEN)
cosd  6741root5u  IPv4  14351   TCP *:6802 (LISTEN)
cosd  6741root6u  IPv4  14352   TCP *:6803 (LISTEN)
cosd  6741root   13u  IPv4  13401   TCP
10.200.63.129:32884->10.200.63.129:6789 (ESTABLISHED)
cosd  6741root   14u  IPv4  14467   TCP
10.200.63.129:6801->10.200.63.129:41828 (ESTABLISHED)




Cheers,

Chris.




Re: Fwd: Simple test system not working

2011-02-01 Thread Chris Dunlop
G'day Nev!

On 2011-02-02, Neville Burnell  wrote:
> Just a guess, but maybe ceph is performing a dns lookup for b3 which
> is returning 127.0.0.1, and talking to 127.0.0.1 requires the loopback
> adapter?

That doesn't appear to be the case: the dns is definitely
configured to return the non-localhost address, and there's no
/etc/hosts entry that would do it. A brief skim and judicious
grepping of the source code doesn't reveal anything interesting
either. The symptom seems to be that the mds and osd daemons
*can* connect to the mon daemon, but the mon daemon tells 'em
to nick off.

Cheers,

Chris.



Speling fixes

2011-02-01 Thread Chris Dunlop
Trivial spelling fixes...

diff --git a/man/mount.ceph.8 b/man/mount.ceph.8
index d5fe62a..cfcf8f8 100644
--- a/man/mount.ceph.8
+++ b/man/mount.ceph.8
@@ -134,7 +134,7 @@ Report rbytes for st_size on directories.  Default: on
 
 norbytes
 .IP
-Do not report rbytse for st_size.
+Do not report rbytes for st_size.
 .PP
 
 nocrc
diff --git a/src/mds/MDSMap.h b/src/mds/MDSMap.h
index 7bcdd27..6e8a9fa 100644
--- a/src/mds/MDSMap.h
+++ b/src/mds/MDSMap.h
@@ -59,7 +59,7 @@ extern CompatSet mdsmap_compat_base; // pre v0.20
 #define MDS_FEATURE_INCOMPAT_BASE CompatSet::Feature(1, "base v0.20")
 #define MDS_FEATURE_INCOMPAT_CLIENTRANGES CompatSet::Feature(2, "client writeable ranges")
 #define MDS_FEATURE_INCOMPAT_FILELAYOUT CompatSet::Feature(3, "default file layouts on dirs")
-#define MDS_FEATURE_INCOMPAT_DIRINODE CompatSet::Feature(4, "dir inode in separate objec")
+#define MDS_FEATURE_INCOMPAT_DIRINODE CompatSet::Feature(4, "dir inode in separate object")
 
 class MDSMap {
 public:



Improve mount.ceph.8

2011-02-01 Thread Chris Dunlop
Seeing as I was wondering what "rbytes" might be, I guess others might
too...

diff --git a/man/mount.ceph.8 b/man/mount.ceph.8
index d5fe62a..f2cdd86 100644
--- a/man/mount.ceph.8
+++ b/man/mount.ceph.8
@@ -129,12 +129,12 @@ no funky `cat dirname` for stats
 
 rbytes
 .IP
-Report rbytes for st_size on directories.  Default: on
+Report the recursive size of the directory contents for st_size on directories.  Default: on
 .PP
 
 norbytes
 .IP
-Do not report rbytse for st_size.
+Do not report the recursive size of the directory contents for st_size on directories.
 .PP
 
 nocrc



cmon: PGMonitor::encode_pending() assert failure

2011-02-02 Thread Chris Dunlop
G'day,

I received this assert failure after copying about 110 GB of data into
a previously-empty ceph 0.24.2:

ceph version 0.25~rc (commit:73e76723e35562c9391872e07cf314b4465f30af)
2011-02-03 08:05:26.779951 409b9950 mon.0@0(leader).pg v19635 
PGMonitor::update_from_paxos: error parsing incremental update: 
buffer::end_of_buffer
2011-02-03 08:05:28.651238 42b99950 mon.0@0(leader).pg v19635 
PGMonitor::update_from_paxos: error parsing incremental update: 
buffer::end_of_buffer
mon/PGMonitor.cc: In function 'virtual void 
PGMonitor::encode_pending(ceph::bufferlist&)', In thread 409b9950
mon/PGMonitor.cc:178: FAILED assert(paxos->get_version() + 1 == 
pending_inc.version)
 ceph version 0.25~rc (commit:73e76723e35562c9391872e07cf314b4465f30af)
 1: (PGMonitor::encode_pending(ceph::buffer::list&)+0x442) [0x4d4332]
 2: (PaxosService::propose_pending()+0x26d) [0x4995ad]
 3: (SafeTimer::timer_thread()+0x65f) [0x5602bf]
 4: (SafeTimerThread::entry()+0xd) [0x563a3d]
 5: (Thread::_entry_func(void*)+0xa) [0x46fe0a]
 6: /lib/libpthread.so.0 [0x7f282fd87fc7]
 7: (clone()+0x6d) [0x7f282ec6764d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (Aborted) ***
in thread 409b9950
 ceph version 0.25~rc (commit:73e76723e35562c9391872e07cf314b4465f30af)
 1: /usr/bin/cmon [0x58054e]
 2: /lib/libpthread.so.0 [0x7f282fd8fa80]
 3: (gsignal()+0x35) [0x7f282ebc9ed5]
 4: (abort()+0x183) [0x7f282ebcb3f3]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x114) [0x7f282f44d294]
 6: /usr/lib/libstdc++.so.6 [0x7f282f44b696]
 7: /usr/lib/libstdc++.so.6 [0x7f282f44b6c3]
 8: /usr/lib/libstdc++.so.6 [0x7f282f44b7aa]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x3f4) [0x563f84]
 a: (PGMonitor::encode_pending(ceph::buffer::list&)+0x442) [0x4d4332]
 b: (PaxosService::propose_pending()+0x26d) [0x4995ad]
 c: (SafeTimer::timer_thread()+0x65f) [0x5602bf]
 d: (SafeTimerThread::entry()+0xd) [0x563a3d]
 e: (Thread::_entry_func(void*)+0xa) [0x46fe0a]
 f: /lib/libpthread.so.0 [0x7f282fd87fc7]
 10: (clone()+0x6d) [0x7f282ec6764d]

If needed, the cmon executable is available here:

http://www.onthe.net.au/private/cmon.gz

If you need any other info, just holler!

Cheers,

Chris



Re: cmon: PGMonitor::encode_pending() assert failure

2011-02-03 Thread Chris Dunlop
On Thu, Feb 03, 2011 at 01:03:17PM -0800, Sage Weil wrote:
> Hi Chris,
> 
> This is an interesting one.  Would it be possible for you to
> tar up your mondata directory on the failed node and post it
> somewhere I can get at it?  From the looks of things the pgmap
> incremental state file is truncated, but I'd like to confirm.
> 
> http://tracker.newdream.net/issues/762

Aw crap, sorry, I blew that fs away installing the latest master
to see what happened there ...whereupon overnight I've promptly
hit the "WARNING: at fs/btrfs/inode.c:2143" problem*.

I can revert back to my previous install and run the same
workload to see if it crops up again if that's useful (it took
about 12 hours of rsync'ing files into the fs to get there), or
I can try the workload using latest ceph with Josef Bacik's
btrfs-work** to see if either problem crops up again. Any
preference?

* http://article.gmane.org/gmane.comp.file-systems.ceph.devel/1726
** http://article.gmane.org/gmane.comp.file-systems.ceph.devel/1719

Cheers,

Chris


Hard links (was: Fixing NFS)

2011-02-03 Thread Chris Dunlop
On 2011-02-03, Sage Weil  wrote:
> There are a couple of levels of difficulty.  The main problem is that the 
> only truly stable information in the NFS fh is the inode number, and 
> Ceph's architecture simply doesn't support lookup-by-ino.  (It uses an 
> extra table to support it for hard-linked files, under the assumption that 
> these are relatively rare in the real world.)

Sorry for the thread hijack, but just so this issue doesn't completely
fall through the cracks...

There are different "real worlds" where hard links are very, very
common. Although, admittedly, ceph may well not be targeted at those
parallel universes.

Backup servers are a classic example. It's very common to have
hard-links between the files for each snapshot. In this situation *most*
files have multiple hard links, and you can easily have almost all files
with 60 or more hard links (for 60 or more snapshots). Rsnapshot,
BackupPC and apparently OSX's Time Machine for instance work this way.

Of course, apps like this would probably be far better off if they
started using proper snapshots (and dedup, if/when that becomes
available, praise the day) provided by file systems such as ceph and
btrfs.

But there are other apps which use significant numbers of hard links,
some examples buried in this thread:

http://thread.gmane.org/gmane.comp.file-systems.btrfs/3427

Cheers,

Chris



Re: cmon: PGMonitor::encode_pending() assert failure

2011-02-03 Thread Chris Dunlop
On Thu, Feb 03, 2011 at 02:24:43PM -0800, Sage Weil wrote:
> On Fri, 4 Feb 2011, Chris Dunlop wrote:
>> On Thu, Feb 03, 2011 at 01:03:17PM -0800, Sage Weil wrote:
>>> Hi Chris,
>>> 
>>> This is an interesting one.  Would it be possible for you to
>>> tar up your mondata directory on the failed node and post it
>>> somewhere I can get at it?  From the looks of things the pgmap
>>> incremental state file is truncated, but I'd like to confirm.
>>> 
>>> http://tracker.newdream.net/issues/762
>> 
>> Aw crap, sorry, I blew that fs away installing the latest master
>> to see what happened there ...whereupon overnight I've promptly
>> hit the "WARNING: at fs/btrfs/inode.c:2143" problem*.
>> 
>> I can revert back to my previous install and run the same
>> workload to see if it crops up again if that's useful (it took
>> about 12 hours of rsync'ing files into the fs to get there), or
>> I can try the workload using latest ceph with Josef Bacik's
>> btrfs-work** to see if either problem crops up again. Any
>> preference?
> 
> Let's go with latest master and latest bits from Josef.  It's the future!

No worries, I'll get started on that as soon as I increase my
git-foo to the point where I know how to work with multiple
repositories...  hmmm, looks like git-remote is what I need...
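
(Noting it here mostly for my own benefit, the multiple-repository dance
looks to be roughly the following -- a sketch: the remote and branch names
are illustrative, and the git:// URL is my guess at the kernel.org path for
Josef's btrfs-work tree:

  git remote add josef git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work.git
  git fetch josef
  git checkout -b ceph-plus-josef origin/master
  git merge josef/master
)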


Bug #563 - osd: btrfs, warning at inode.c ( btrfs_orphan_commit_root )

2011-02-04 Thread Chris Dunlop
G'day,

http://tracker.newdream.net/issues/563

I see this bug was closed 5 days ago. Which commit[s] closed it? I'm
getting it on ceph-client 9aae8fa + btrfs-work bacae12 when I do a
mkcephfs on 3 x 2 GB OSDs, e.g.:

Feb  4 19:09:04 b3 kernel: [16780.260695] [ cut here ]
Feb  4 19:09:04 b3 kernel: [16780.260716] WARNING: at fs/btrfs/inode.c:2143 
btrfs_orphan_commit_root+0x7f/0x9b [btrfs]()
Feb  4 19:09:04 b3 kernel: [16780.260724] Hardware name: System Product Name
Feb  4 19:09:04 b3 kernel: [16780.260729] Modules linked in: kvm_amd kvm loop 
btrfs zlib_deflate crc32c libcrc32c dm_mod tpm_tis tpm psmouse tpm_bios 
i2c_piix4 i2c_core button shpchp pci_hotplug processor nfs lockd nfs_acl 
auth_rpcgss sunrpc sd_mod ata_generic ahci libahci libata ehci_hcd r8169 mii 
ohci_hcd scsi_mod thermal fan thermal_sys [last unloaded: scsi_wait_scan]
Feb  4 19:09:04 b3 kernel: [16780.260783] Pid: 5233, comm: kworker/0:2 Tainted: 
GW   2.6.37-otn1-33361-ga3f5274 #3
Feb  4 19:09:04 b3 kernel: [16780.260792] Call Trace:
Feb  4 19:09:04 b3 kernel: [16780.260805]  [] ? 
btrfs_orphan_commit_root+0x7f/0x9b [btrfs]
Feb  4 19:09:04 b3 kernel: [16780.260815]  [] ? 
warn_slowpath_common+0x78/0x8d
Feb  4 19:09:04 b3 kernel: [16780.260828]  [] ? 
btrfs_orphan_commit_root+0x7f/0x9b [btrfs]
Feb  4 19:09:04 b3 kernel: [16780.260842]  [] ? 
commit_fs_roots+0x87/0xef [btrfs]
Feb  4 19:09:04 b3 kernel: [16780.260855]  [] ? 
btrfs_commit_transaction+0x342/0x60f [btrfs]
Feb  4 19:09:04 b3 kernel: [16780.260869]  [] ? 
xprt_release_xprt+0x41/0x73 [sunrpc]
Feb  4 19:09:04 b3 kernel: [16780.260877]  [] ? 
autoremove_wake_function+0x0/0x2e
Feb  4 19:09:04 b3 kernel: [16780.260886]  [] ? 
call_transmit+0x214/0x24f [sunrpc]
Feb  4 19:09:04 b3 kernel: [16780.260899]  [] ? 
do_async_commit+0x12/0x1b [btrfs]
Feb  4 19:09:04 b3 kernel: [16780.260906]  [] ? 
process_one_work+0x22c/0x3a5
Feb  4 19:09:04 b3 kernel: [16780.260918]  [] ? 
do_async_commit+0x0/0x1b [btrfs]
Feb  4 19:09:04 b3 kernel: [16780.260924]  [] ? 
worker_thread+0x1d5/0x353
Feb  4 19:09:04 b3 kernel: [16780.260930]  [] ? 
worker_thread+0x0/0x353
Feb  4 19:09:04 b3 kernel: [16780.260936]  [] ? 
kthread+0x7e/0x86
Feb  4 19:09:04 b3 kernel: [16780.260942]  [] ? 
kernel_thread_helper+0x4/0x10
Feb  4 19:09:04 b3 kernel: [16780.260949]  [] ? 
kthread+0x0/0x86
Feb  4 19:09:04 b3 kernel: [16780.260954]  [] ? 
kernel_thread_helper+0x0/0x10
Feb  4 19:09:04 b3 kernel: [16780.260961] ---[ end trace e1f5276f9dfbd540 ]---
Feb  4 19:09:05 b3 kernel: [16782.127219] [ cut here ]
Feb  4 19:09:05 b3 kernel: [16782.127241] WARNING: at fs/btrfs/inode.c:2143 
btrfs_orphan_commit_root+0x7f/0x9b [btrfs]()
Feb  4 19:09:05 b3 kernel: [16782.127248] Hardware name: System Product Name
Feb  4 19:09:05 b3 kernel: [16782.127253] Modules linked in: kvm_amd kvm loop 
btrfs zlib_deflate crc32c libcrc32c dm_mod tpm_tis tpm psmouse tpm_bios 
i2c_piix4 i2c_core button shpchp pci_hotplug processor nfs lockd nfs_acl 
auth_rpcgss sunrpc sd_mod ata_generic ahci libahci libata ehci_hcd r8169 mii 
ohci_hcd scsi_mod thermal fan thermal_sys [last unloaded: scsi_wait_scan]
Feb  4 19:09:05 b3 kernel: [16782.127306] Pid: 5848, comm: kworker/1:2 Tainted: 
GW   2.6.37-otn1-33361-ga3f5274 #3
Feb  4 19:09:05 b3 kernel: [16782.127314] Call Trace:
Feb  4 19:09:05 b3 kernel: [16782.127326]  [] ? 
btrfs_orphan_commit_root+0x7f/0x9b [btrfs]
Feb  4 19:09:05 b3 kernel: [16782.128628]  [] ? 
warn_slowpath_common+0x78/0x8d
Feb  4 19:09:05 b3 kernel: [16782.128642]  [] ? 
btrfs_orphan_commit_root+0x7f/0x9b [btrfs]
Feb  4 19:09:05 b3 kernel: [16782.128658]  [] ? 
commit_fs_roots+0x87/0xef [btrfs]
Feb  4 19:09:05 b3 kernel: [16782.128672]  [] ? 
btrfs_commit_transaction+0x342/0x60f [btrfs]
Feb  4 19:09:05 b3 kernel: [16782.128682]  [] ? 
autoremove_wake_function+0x0/0x2e
Feb  4 19:09:05 b3 kernel: [16782.128695]  [] ? 
do_async_commit+0x12/0x1b [btrfs]
Feb  4 19:09:05 b3 kernel: [16782.128701]  [] ? 
process_one_work+0x22c/0x3a5
Feb  4 19:09:05 b3 kernel: [16782.128713]  [] ? 
do_async_commit+0x0/0x1b [btrfs]
Feb  4 19:09:05 b3 kernel: [16782.128719]  [] ? 
worker_thread+0x1d5/0x353
Feb  4 19:09:05 b3 kernel: [16782.128725]  [] ? 
worker_thread+0x0/0x353
Feb  4 19:09:05 b3 kernel: [16782.128732]  [] ? 
kthread+0x7e/0x86
Feb  4 19:09:05 b3 kernel: [16782.128738]  [] ? 
kernel_thread_helper+0x4/0x10
Feb  4 19:09:05 b3 kernel: [16782.128745]  [] ? 
kthread+0x0/0x86
Feb  4 19:09:05 b3 kernel: [16782.128750]  [] ? 
kernel_thread_helper+0x0/0x10
Feb  4 19:09:05 b3 kernel: [16782.128756] ---[ end trace e1f5276f9dfbd541 ]---


Cheers,

Chris.


Re: cmon: PGMonitor::encode_pending() assert failure

2011-02-06 Thread Chris Dunlop
On Thu, Feb 03, 2011 at 02:24:43PM -0800, Sage Weil wrote:
> On Fri, 4 Feb 2011, Chris Dunlop wrote:
>> On Thu, Feb 03, 2011 at 01:03:17PM -0800, Sage Weil wrote:
>>> 
>>> http://tracker.newdream.net/issues/762
>> 
>> I can revert back to my previous install and run the same
>> workload to see if it crops up again if that's useful (it
>> took about 12 hours of rsync'ing files into the fs to get
>> there), or I can try the workload using latest ceph with
>> Josef Bacik's btrfs-work** to see if either problem crops up
>> again. Any preference?
> 
> Let's go with latest master and latest bits from Josef.  It's
> the future!

The PGMonitor::encode_pending() assert failure didn't show up in
copying 500 GB into a fresh ceph fs using:

ceph-client 9aae8faf
+ Josef's btrfs-work bacae123

So either it's fixed or it's now craftily hidden, waiting to
pounce again.


Re: Bug #563 - osd: btrfs, warning at inode.c ( btrfs_orphan_commit_root )

2011-02-06 Thread Chris Dunlop
On Fri, Feb 04, 2011 at 03:38:12PM +0100, Wido den Hollander wrote:
> On Fri, 2011-02-04 at 19:48 +1100, Chris Dunlop wrote:
>> 
>> http://tracker.newdream.net/issues/563
>> 
>> I see this bug was closed 5 days ago. Which commit[s] closed it? I'm
>> getting it on ceph-client 9aae8fa + btrfs-work bacae12 when I do a
>> mkcephfs on 3 x 2 GB OSDs, e.g.:
> 
> From what I've heard they are working on it upstream (btrfs). Not sure
> what the state is though.

Aren't they still waiting on Sage?

From your initial report at the top of the ticket:

#btrfs
20:00 < widodh> I've just upgraded to 2.6.37-rc1 to test Ceph, but I'm seeing this right now in my dmesg: http://pastebin.com/3HiGiNgy ( inode.c / btrfs_orphan_commit_root )
20:01 < josef> hrm looks like sage screwed something up with async commit
20:02 < widodh> aha, file it over there then?
20:03 < josef> well tell sage about it, he's the one who wrote the code so he'll probably be able to fix it quicker
20:03 < widodh> yes, sure. I thought it was a btrfs bug, that's why I came along here
20:03 < widodh> thanks!
20:05 < josef> well it is a bug in btrfs, but he will be able to fix it quicker than i will


I'll see what the linux-btrfs mailing list has to say about it...

Cheers,

Chris.


Re: Bug #563 - osd: btrfs, warning at inode.c ( btrfs_orphan_commit_root )

2011-02-06 Thread Chris Dunlop
On Fri, Feb 04, 2011 at 09:22:53AM -0800, Sage Weil wrote:
> On Fri, 4 Feb 2011, Chris Dunlop wrote:
>> 
>> http://tracker.newdream.net/issues/563
>> 
>> I see this bug was closed 5 days ago. Which commit[s] closed it? I'm
>> getting it on ceph-client 9aae8fa + btrfs-work bacae12 when I do a
>> mkcephfs on 3 x 2 GB OSDs, e.g.:
> 
> I closed it because it seems to be a btrfs issue that's not specific to 
> Ceph.  Have you tried Josef's latest tree?  See
> 
> http://git.kernel.org/?p=linux/kernel/git/josef/btrfs-work.git;a=summary

Yup, this warning came up using:

ceph-client 9aae8faf
+ Josef's btrfs-work bacae123

Cheers,

Chris


WARNING: at fs/btrfs/inode.c:2143 btrfs_orphan_commit_root+0x7f/0x9b

2011-02-06 Thread Chris Dunlop
G'day,

Using Josef's btrfs-work bacae123 (+ ceph-client 9aae8faf), I
can consistently reproduce the following btrfs warning by simply
creating and starting a new btrfs-based ceph file system:

[  549.767234] [ cut here ]
[  549.767276] WARNING: at fs/btrfs/inode.c:2143 
btrfs_orphan_commit_root+0x7f/0x9b [btrfs]()
[  549.767305] Hardware name: System Product Name
[  549.767331] Modules linked in: loop btrfs zlib_deflate crc32c libcrc32c 
dm_mod tpm_tis psmouse tpm tpm_bios i2c_piix4 i2c_core shpchp pci_hotplug 
button processor nfs lockd nfs_acl auth_rpcgss sunrpc sd_mod ata_generic ahci 
libahci libata r8169 mii ohci_hcd ehci_hcd scsi_mod thermal fan thermal_sys 
[last unloaded: scsi_wait_scan]
[  549.768192] Pid: 10, comm: kworker/0:1 Tainted: GW   
2.6.37-otn1-33361-ga3f5274 #3
[  549.768221] Call Trace:
[  549.768255]  [] ? btrfs_orphan_commit_root+0x7f/0x9b 
[btrfs]
[  549.768286]  [] ? warn_slowpath_common+0x78/0x8d
[  549.768319]  [] ? btrfs_orphan_commit_root+0x7f/0x9b 
[btrfs]
[  549.768354]  [] ? commit_fs_roots+0x87/0xef [btrfs]
[  549.768388]  [] ? btrfs_commit_transaction+0x342/0x60f 
[btrfs]
[  549.768422]  [] ? xprt_release_xprt+0x41/0x73 [sunrpc]
[  549.768450]  [] ? autoremove_wake_function+0x0/0x2e
[  549.768480]  [] ? call_transmit+0x214/0x24f [sunrpc]
[  549.768513]  [] ? do_async_commit+0x12/0x1b [btrfs]
[  549.768541]  [] ? process_one_work+0x22c/0x3a5
[  549.768574]  [] ? do_async_commit+0x0/0x1b [btrfs]
[  549.768601]  [] ? worker_thread+0x1d5/0x353
[  549.768628]  [] ? worker_thread+0x0/0x353
[  549.768656]  [] ? kthread+0x7e/0x86
[  549.768683]  [] ? kernel_thread_helper+0x4/0x10
[  549.768710]  [] ? kthread+0x0/0x86
[  549.768737]  [] ? kernel_thread_helper+0x0/0x10
[  549.768765] ---[ end trace 38bb7d3c072c3ca5 ]---

This or similar warnings look to have previously been reported
by others, e.g.:

2010-11-19 http://article.gmane.org/gmane.comp.file-systems.btrfs/7707/match=btrfs_orphan_commit_root
2011-01-06 http://article.gmane.org/gmane.comp.file-systems.btrfs/8512/match=btrfs_orphan_commit_root
2011-01-26 http://article.gmane.org/gmane.comp.file-systems.ceph.devel/1726/match=btrfs_orphan_commit_root

The process I use to reproduce is:

mkcephfs -v -c /etc/ceph/ceph.conf --allhosts --mkbtrfs -k /etc/ceph/keyring.bin
umount /data/osd0
/etc/init.d/ceph start 

This is using a minimal ceph.conf file:

[global]
[mon]
mon data = /data/mon$id
[mon0]
host = b3
mon addr = ip.ip.ip.ip:6789
[mds]
keyring = /data/keyring.$name
[mds.b3]
host = b3
[osd]
osd data = /data/osd$id
[osd0]
host = b3
btrfs devs = /dev/sda



Cheers,

Chris.


Re: Bug #563 - osd: btrfs, warning at inode.c ( btrfs_orphan_commit_root )

2011-02-06 Thread Chris Dunlop
On Mon, Feb 07, 2011 at 05:24:55PM +1100, Chris Dunlop wrote:
> On Fri, Feb 04, 2011 at 09:22:53AM -0800, Sage Weil wrote:
>> On Fri, 4 Feb 2011, Chris Dunlop wrote:
>>> 
>>> http://tracker.newdream.net/issues/563
>>> 
>>> I see this bug was closed 5 days ago. Which commit[s] closed it? I'm
>>> getting it on ceph-client 9aae8fa + btrfs-work bacae12 when I do a
>>> mkcephfs on 3 x 2 GB OSDs, e.g.:
>> 
>> I closed it because it seems to be a btrfs issue that's not specific to 
>> Ceph.  Have you tried Josef's latest tree?  See
>> 
>> http://git.kernel.org/?p=linux/kernel/git/josef/btrfs-work.git;a=summary
> 
> Yup, this warning came up using:
> 
> ceph-client 9aae8faf
> + Josef's btrfs-work bacae123

Sorry, just found I've been booting the wrong kernel. I've been
seeing the warning *without* Josef's btrfs-work.  Will re-test
tomorrow.


Re: WARNING: at fs/btrfs/inode.c:2143 btrfs_orphan_commit_root+0x7f/0x9b

2011-02-06 Thread Chris Dunlop
On Mon, Feb 07, 2011 at 05:31:02PM +1100, Chris Dunlop wrote:
> G'day,
> 
> Using Josef's btrfs-work bacae123 (+ ceph-client 9aae8faf), I
> can consistently reproduce the following btrfs warning by simply
> creating and starting a new btrfs-based ceph file system:
> 
> [  549.767234] [ cut here ]
> [  549.767276] WARNING: at fs/btrfs/inode.c:2143 
> btrfs_orphan_commit_root+0x7f/0x9b [btrfs]()

Sorry, just found I've been booting the wrong kernel. I've been
seeing the warning *without* Josef's btrfs-work.  Will re-test
tomorrow.


Re: WARNING: at fs/btrfs/inode.c:2143 btrfs_orphan_commit_root+0x7f/0x9b

2011-02-07 Thread Chris Dunlop
On Mon, Feb 07, 2011 at 06:39:11PM +1100, Chris Dunlop wrote:
> On Mon, Feb 07, 2011 at 05:31:02PM +1100, Chris Dunlop wrote:
>> G'day,
>> 
>> Using Josef's btrfs-work bacae123 (+ ceph-client 9aae8faf), I
>> can consistently reproduce the following btrfs warning by simply
>> creating and starting a new btrfs-based ceph file system:
>> 
>> [  549.767234] [ cut here ]
>> [  549.767276] WARNING: at fs/btrfs/inode.c:2143 
>> btrfs_orphan_commit_root+0x7f/0x9b [btrfs]()
> 
> Sorry, just found I've been booting the wrong kernel. I've been
> seeing the warning *without* Josef's btrfs-work.  Will re-test
> tomorrow.

OK, the line number has changed slightly, but still getting the
same thing with:

git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

commit cb5520f02c010e3cb974b9ac06f30aafa2eebc38
Merge: eee4da2... 3a90983...
Author: Linus Torvalds 
Date:   Mon Feb 7 14:06:18 2011 -0800

Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable


[ 129.524029] [ cut here ]
[ 129.524080] WARNING: at fs/btrfs/inode.c:2161 
btrfs_orphan_commit_root+0x7f/0x9b [btrfs]()
[ 129.524111] Hardware name: System Product Name
[ 129.524140] Modules linked in: loop netconsole configfs dm_mod sd_mod 
i2c_piix4 k10temp i2c_core asus_atk0110 ata_generic ahci libahci libata 
scsi_mod shpchp button pci_hotplug processor nfs lockd nfs_acl auth_rpcgss 
sunrpc r8169 mii ohci_hcd ehci_hcd thermal fan thermal_sys btrfs zlib_deflate 
crc32c libcrc32c
[ 129.525057] Pid: 10, comm: kworker/0:1 Tainted: GW   
2.6.38-rc3-otn1-00323-gc88f808 #5
[ 129.525088] Call Trace:
[ 129.525125]  [] ? btrfs_orphan_commit_root+0x7f/0x9b [btrfs]
[ 129.525158]  [] ? warn_slowpath_common+0x78/0x8d
[ 129.525195]  [] ? btrfs_orphan_commit_root+0x7f/0x9b [btrfs]
[ 129.525232]  [] ? commit_fs_roots+0x87/0xef [btrfs]
[ 129.525268]  [] ? btrfs_commit_transaction+0x348/0x615 
[btrfs]
[ 129.525300]  [] ? autoremove_wake_function+0x0/0x2e
[ 129.525336]  [] ? call_transmit+0x214/0x24f [sunrpc]
[ 129.525372]  [] ? do_async_commit+0x12/0x1b [btrfs]
[ 129.525400]  [] ? process_one_work+0x22c/0x3a5
[ 129.525437]  [] ? do_async_commit+0x0/0x1b [btrfs]
[ 129.525465]  [] ? worker_thread+0x1d5/0x353
[ 129.525495]  [] ? worker_thread+0x0/0x353
[ 129.525526]  [] ? kthread+0x7e/0x86
[ 129.525557]  [] ? kernel_thread_helper+0x4/0x10
[ 129.525588]  [] ? kthread+0x0/0x86
[ 129.525617]  [] ? kernel_thread_helper+0x0/0x10
[ 129.525649] ---[ end trace deccd3c0dec6c2ba ]---


Re: WARNING: at fs/btrfs/inode.c:2143 btrfs_orphan_commit_root+0x7f/0x9b

2011-02-08 Thread Chris Dunlop
On Tue, Feb 08, 2011 at 07:19:36AM -0800, Sage Weil wrote:
> On Tue, 8 Feb 2011, Chris Dunlop wrote:
>> On Mon, Feb 07, 2011 at 06:39:11PM +1100, Chris Dunlop wrote:
>>> On Mon, Feb 07, 2011 at 05:31:02PM +1100, Chris Dunlop wrote:
>>>> 
>>>> [  549.767234] [ cut here ]
>>>> [  549.767276] WARNING: at fs/btrfs/inode.c:2143 
>>>> btrfs_orphan_commit_root+0x7f/0x9b [btrfs]()
>> 
>> OK, the line number has changed slightly, but still getting the
>> same thing with:
>> 
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> 
> Can you try to reproduce this with the following patch applied to the 
> server side?  This will disable the async snapshot ioctl.  It doesn't look 
> like it's related to me, but let's be sure.
> 
> diff --git a/src/os/FileStore.cc b/src/os/FileStore.cc
> index dc78451..d5530c5 100644
> --- a/src/os/FileStore.cc
> +++ b/src/os/FileStore.cc
> @@ -865,7 +865,7 @@ int FileStore::_detect_fs()
>dout(0) << "mount btrfs SNAP_CREATE_V2 got " << r << " " << strerror_r(-r, buf, sizeof(buf)) << dendl;
>if (r == 0 || errno == EEXIST) {
> dout(0) << "mount btrfs SNAP_CREATE_V2 is supported" << dendl;
> -   btrfs_snap_create_v2 = true;
> +   //btrfs_snap_create_v2 = true;
>
> // clean up
> r = ::ioctl(fd, BTRFS_IOC_SNAP_DESTROY, &vol_args);

The patch (applied to ceph.git 778902b, with changed line
numbers) didn't help, same warning message:

[73382.202774] [ cut here ]
[73382.202811] WARNING: at fs/btrfs/inode.c:2161 
btrfs_orphan_commit_root+0x7f/0x9b [btrfs]()
[73382.202820] Hardware name: System Product Name
[73382.202827] Modules linked in: loop netconsole configfs dm_mod sd_mod 
i2c_piix4 k10temp i2c_core asus_atk0110 ata_generic ahci libahci libata 
scsi_mod shpchp button pci_hotplug processor nfs lockd nfs_acl auth_rpcgss 
sunrpc r8169 mii ohci_hcd ehci_hcd thermal fan thermal_sys btrfs zlib_deflate 
crc32c libcrc32c
[73382.202936] Pid: 7092, comm: cosd Tainted: GW   
2.6.38-rc3-otn1-00323-gc88f808 #5
[73382.202945] Call Trace:
[73382.202960]  [] ? btrfs_orphan_commit_root+0x7f/0x9b 
[btrfs]
[73382.202971]  [] ? warn_slowpath_common+0x78/0x8d
[73382.202985]  [] ? btrfs_orphan_commit_root+0x7f/0x9b 
[btrfs]
[73382.203001]  [] ? commit_fs_roots+0x87/0xef [btrfs]
[73382.203015]  [] ? btrfs_commit_transaction+0x348/0x615 
[btrfs]
[73382.203026]  [] ? autoremove_wake_function+0x0/0x2e
[73382.203038]  [] ? block_rsv_add_bytes+0x15/0x3c [btrfs]
[73382.203052]  [] ? btrfs_mksubvol+0x1f4/0x308 [btrfs]
[73382.203065]  [] ? 
btrfs_ioctl_snap_create_transid+0xfe/0x11d [btrfs]
[73382.203080]  [] ? btrfs_ioctl_snap_create+0x49/0x5e [btrfs]
[73382.203094]  [] ? btrfs_ioctl+0x471/0x90a [btrfs]
[73382.203102]  [] ? fsnotify+0x255/0x279
[73382.203110]  [] ? invalidate_interrupt1+0xe/0x20
[73382.203119]  [] ? do_vfs_ioctl+0x40c/0x453
[73382.203126]  [] ? sys_ioctl+0x3d/0x5c
[73382.203134]  [] ? system_call_fastpath+0x16/0x1b
[73382.203142] ---[ end trace deccd3c0dec6c2bd ]---


OSD crash during repair

2013-09-05 Thread Chris Dunlop
G'day,

I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair an OSD:

http://tracker.ceph.com/issues/6233


ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)
 1: /usr/bin/ceph-osd() [0x8530a2]
 2: (()+0xf030) [0x7f541ca39030]
 3: (gsignal()+0x35) [0x7f541b132475]
 4: (abort()+0x180) [0x7f541b1356f0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f541b98789d]
 6: (()+0x63996) [0x7f541b985996]
 7: (()+0x639c3) [0x7f541b9859c3]
 8: (()+0x63bee) [0x7f541b985bee]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x8fa9a7]
 10: (object_info_t::decode(ceph::buffer::list::iterator&)+0x29) [0x95b579]
 11: (object_info_t::object_info_t(ceph::buffer::list&)+0x180) [0x695ec0]
 12: (PG::repair_object(hobject_t const&, ScrubMap::object*, int, int)+0xc7) 
[0x7646b7]
 13: (PG::scrub_process_inconsistent()+0x9bd) [0x76534d]
 14: (PG::scrub_finish()+0x4f) [0x76587f]
 15: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x10d6) [0x76cb96]
 16: (PG::scrub(ThreadPool::TPHandle&)+0x138) [0x76d7e8]
 17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0xf) [0x70515f]
 18: (ThreadPool::worker(ThreadPool::WorkThread*)+0x992) [0x8f0542]
 19: (ThreadPool::WorkThread::entry()+0x10) [0x8f14d0]
 20: (()+0x6b50) [0x7f541ca30b50]
 21: (clone()+0x6d) [0x7f541b1daa7d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


This occurs as a result of:

# ceph pg dump | grep inconsistent
2.1227230   0   0   11311299072 159189  159189  
active+clean+inconsistent   2013-09-06 09:35:47.512119  20117'690441
20120'7914185   [6,7]   [6,7]   20021'6759672013-09-03 15:58:12.459188  
19384'6654042013-08-28 12:42:07.490877
# ceph pg repair 2.12

Looking at PG::repair_object per line 12 of the backtrace, I can see a
dout(10) which should tell me the problem object:


src/osd/PG.cc:
void PG::repair_object(const hobject_t& soid, ScrubMap::object *po, int 
bad_peer, int ok_peer)
{
  dout(10) << "repair_object " << soid << " bad_peer osd." << bad_peer << " 
ok_peer osd." << ok_peer << dendl;
  ...
}


The 'ceph pg dump' output above tells me the primary osd is '6', so I
can increase the logging level to 10 on osd.6 to get the debug output,
and repair again:

# ceph osd tell 6 injectargs '--debug_osd 0/10'
# ceph pg repair 2.12

I get the same OSD crash, but this time it logs the dout from above,
which shows the problem object:

-1> 2013-09-06 09:34:45.142224 7f0ae94bd700 10 osd.6 pg_epoch: 20117 
pg[2.12( v 20117'690441 (20117'689440,20117'690441] local-les=20115 n=2722 ec=1 
les/c 20115/20115 20108/20112/20112) [6,7] r=0 lpr=20112 mlcod 20117'690440 
active+scrubbing+deep+repair] repair_object 
56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 bad_peer osd.7 ok_peer osd.6
 0> 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) 
**

So...

Firstly, is anyone interested in further investigating the problem to
fix the crash behaviour?

And, what's the best way to fix the pool?

Cheers,

Chris


Re: OSD crash during repair

2013-09-05 Thread Chris Dunlop
Hi Sage,

Does this answer your question?

2013-09-06 09:30:19.813811 7f0ae8cbc700  0 log [INF] : applying configuration 
change: internal_safe_to_start_threads = 'true'
2013-09-06 09:33:28.303658 7f0ae94bd700  0 log [ERR] : 2.12 osd.7: soid 
56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 extra attr _, extra attr 
snapset
2013-09-06 09:33:28.303685 7f0ae94bd700  0 log [ERR] : repair 2.12 
56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 no 'snapset' attr
2013-09-06 09:34:45.138468 7f0ae94bd700  0 log [ERR] : 2.12 repair stat 
mismatch, got 2722/2723 objects, 339/339 clones, 11307104768/11311299072 bytes.
2013-09-06 09:34:45.142215 7f0ae94bd700  0 log [ERR] : 2.12 repair 0 missing, 1 
inconsistent objects
2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **

I've just attached the full 'debug_osd 0/10' log to the bug report.

Thanks,

Chris

On Thu, Sep 05, 2013 at 07:38:47PM -0700, Sage Weil wrote:
> Hi Chris,
> 
> What is the inconsistency that scrub reports in the log?  My guess is that 
> the simplest way to resolve this is to remove whichever copy you decide is 
> invalid, but it depends on what the inconsistency it is trying/failing to 
> repair is.
> 
> Thanks!
> sage
> 
> 
> On Fri, 6 Sep 2013, Chris Dunlop wrote:
> 
> > G'day,
> > 
> > I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair an OSD:
> > 
> > http://tracker.ceph.com/issues/6233
> > 
> > 
> > ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)
> >  1: /usr/bin/ceph-osd() [0x8530a2]
> >  2: (()+0xf030) [0x7f541ca39030]
> >  3: (gsignal()+0x35) [0x7f541b132475]
> >  4: (abort()+0x180) [0x7f541b1356f0]
> >  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f541b98789d]
> >  6: (()+0x63996) [0x7f541b985996]
> >  7: (()+0x639c3) [0x7f541b9859c3]
> >  8: (()+0x63bee) [0x7f541b985bee]
> >  9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) 
> > [0x8fa9a7]
> >  10: (object_info_t::decode(ceph::buffer::list::iterator&)+0x29) [0x95b579]
> >  11: (object_info_t::object_info_t(ceph::buffer::list&)+0x180) [0x695ec0]
> >  12: (PG::repair_object(hobject_t const&, ScrubMap::object*, int, 
> > int)+0xc7) [0x7646b7]
> >  13: (PG::scrub_process_inconsistent()+0x9bd) [0x76534d]
> >  14: (PG::scrub_finish()+0x4f) [0x76587f]
> >  15: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x10d6) [0x76cb96]
> >  16: (PG::scrub(ThreadPool::TPHandle&)+0x138) [0x76d7e8]
> >  17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0xf) [0x70515f]
> >  18: (ThreadPool::worker(ThreadPool::WorkThread*)+0x992) [0x8f0542]
> >  19: (ThreadPool::WorkThread::entry()+0x10) [0x8f14d0]
> >  20: (()+0x6b50) [0x7f541ca30b50]
> >  21: (clone()+0x6d) [0x7f541b1daa7d]
> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
> > needed to interpret this.
> > 
> > 
> > This occurs as a result of:
> > 
> > # ceph pg dump | grep inconsistent
> > 2.1227230   0   0   11311299072 159189  159189  
> > active+clean+inconsistent   2013-09-06 09:35:47.512119  
> > 20117'69044120120'7914185   [6,7]   [6,7]   20021'6759672013-09-03 
> > 15:58:12.459188  19384'6654042013-08-28 12:42:07.490877
> > # ceph pg repair 2.12
> > 
> > Looking at PG::repair_object per line 12 of the backtrace, I can see a
> > dout(10) which should tell me the problem object:
> > 
> > 
> > src/osd/PG.cc:
> > void PG::repair_object(const hobject_t& soid, ScrubMap::object *po, int 
> > bad_peer, int ok_peer)
> > {
> >   dout(10) << "repair_object " << soid << " bad_peer osd." << bad_peer << " 
> > ok_peer osd." << ok_peer << dendl;
> >   ...
> > }
> > 
> > 
> > The 'ceph pg dump' output above tells me the primary osd is '6', so I
> > can increase the logging level to 10 on osd.6 to get the debug output,
> > and repair again:
> > 
> > # ceph osd tell 6 injectargs '--debug_osd 0/10'
> > # ceph pg repair 2.12
> > 
> > I get the same OSD crash, but this time it logs the dout from above,
> > which shows the problem object:
> > 
> > -1> 2013-09-06 09:34:45.142224 7f0ae94bd700 10 osd.6 pg_epoch: 20117 
> > pg[2.12( v 20117'690441 (20117'689440,20117'690441] local-les=20115 n=2722 
> > ec=1 les/c 20115/20115 20108/20112/20112) [6,7] r=0 lpr=20112 mlcod 
> > 20117'690440 active+scrubbing+deep+repair] repair_obj

Re: OSD crash during repair

2013-09-05 Thread Chris Dunlop
On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
> On Fri, 6 Sep 2013, Chris Dunlop wrote:
>> Hi Sage,
>> 
>> Does this answer your question?
>> 
>> 2013-09-06 09:30:19.813811 7f0ae8cbc700  0 log [INF] : applying 
>> configuration change: internal_safe_to_start_threads = 'true'
>> 2013-09-06 09:33:28.303658 7f0ae94bd700  0 log [ERR] : 2.12 osd.7: soid 
>> 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 extra attr _, extra attr 
>> snapset
>> 2013-09-06 09:33:28.303685 7f0ae94bd700  0 log [ERR] : repair 2.12 
>> 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 no 'snapset' attr
>> 2013-09-06 09:34:45.138468 7f0ae94bd700  0 log [ERR] : 2.12 repair stat 
>> mismatch, got 2722/2723 objects, 339/339 clones, 11307104768/11311299072 
>> bytes.
>> 2013-09-06 09:34:45.142215 7f0ae94bd700  0 log [ERR] : 2.12 repair 0 
>> missing, 1 inconsistent objects
>> 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **
>> 
>> I've just attached the full 'debug_osd 0/10' log to the bug report.
> 
> This suggests to me that the object on osd.6 is missing those xattrs; can 
> you confirm with getfattr -d on the object in osd.6's data directory?

I haven't yet wrapped my head around how to translate an oid
like those above into an underlying file system object. What 
directory should I be looking at?

> If that is indeed the case, you should be able to move the object out of 
> the way (don't delete it, just in case) and then do the repair.  The osd.6 
> should recover by copying the object from osd.7 (which has the needed 
> xattrs).  Bobtail is smart enough to recover missing objects but not to 
> recover just missing xattrs.

Do you want me to hold off on any repairs to allow tracking down
the crash, or is the current code sufficiently different that
there's little point?

> Also, you should upgrade to dumpling.  :)

I've been considering it. It was initially a little scary with
the various issues that were cropping up but that all seems to
have quietened down.

Of course I'd like my cluster to be clean before attempting an upgrade!

Chris


Re: OSD crash during repair

2013-09-05 Thread Chris Dunlop
On Fri, Sep 06, 2013 at 01:12:21PM +1000, Chris Dunlop wrote:
> On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
>> On Fri, 6 Sep 2013, Chris Dunlop wrote:
>>> Hi Sage,
>>> 
>>> Does this answer your question?
>>> 
>>> 2013-09-06 09:30:19.813811 7f0ae8cbc700  0 log [INF] : applying 
>>> configuration change: internal_safe_to_start_threads = 'true'
>>> 2013-09-06 09:33:28.303658 7f0ae94bd700  0 log [ERR] : 2.12 osd.7: soid 
>>> 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 extra attr _, extra attr 
>>> snapset
>>> 2013-09-06 09:33:28.303685 7f0ae94bd700  0 log [ERR] : repair 2.12 
>>> 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 no 'snapset' attr
>>> 2013-09-06 09:34:45.138468 7f0ae94bd700  0 log [ERR] : 2.12 repair stat 
>>> mismatch, got 2722/2723 objects, 339/339 clones, 11307104768/11311299072 
>>> bytes.
>>> 2013-09-06 09:34:45.142215 7f0ae94bd700  0 log [ERR] : 2.12 repair 0 
>>> missing, 1 inconsistent objects
>>> 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **
>>> 
>>> I've just attached the full 'debug_osd 0/10' log to the bug report.
>> 
>> This suggests to me that the object on osd.6 is missing those xattrs; can 
>> you confirm with getfattr -d on the object in osd.6's data directory?
> 
> I haven't yet wrapped my head around how to translate an oid
> like those above into an underlying file system object. What 
> directory should I be looking at?

Found it:

b5# cd /var/lib/ceph/osd/ceph-6/current
b5# find 2.12* | grep -i 17d9b.2ae8944a.1e11
2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
b5# getfattr -d 
2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
<<< ...crickets... >>>

vs.

b4# cd /var/lib/ceph/osd/ceph-7/current
b4# getfattr -d 
2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
# file: 
2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
user.ceph._=0sCgjhBANBACByYi4wLjE3ZDliLjJhZTg5NDRhLjAwMDAwMDAwMWUxMf7/EnqYVgAAAgAEAxACAP8AAEInCgAAuEsAAEEnCgAAuEsAAAICFQgTmwEAAHD1AgAAQAAAyY4dUpjCTSACAhUAAABCJwoAALhL
user.ceph.snapset=0sAgIZAAABAA==

>> If that is indeed the case, you should be able to move the object out of 
>> the way (don't delete it, just in case) and then do the repair.  The osd.6 
>> should recover by copying the object from osd.7 (which has the needed 
>> xattrs).  Bobtail is smart enough to recover missing objects but not to 
>> recover just missing xattrs.
> 
> Do you want me to hold off on any repairs to allow tracking down
> the crash, or is the current code sufficiently different that
> there's little point?

Repaired! ...but why does it take multiple rounds?

b5# mv 
2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2 
..

b5# ceph pg repair 2.12
b5# while ceph -s | grep -q scrubbing; do sleep 60; done
b5# tail /var/log/ceph/ceph-osd.6.log
2013-09-06 15:02:13.751160 7f6ccc5ae700  0 log [ERR] : 2.12 osd.6 missing 
56987a12/rb.0.17d9b.2ae8944a.1e11/head//2
2013-09-06 15:04:15.286711 7f6ccc5ae700  0 log [ERR] : 2.12 repair stat 
mismatch, got 2723/2724 objects, 339/339 clones, 11311299072/11315493376 bytes.
2013-09-06 15:04:15.286766 7f6ccc5ae700  0 log [ERR] : 2.12 repair 1 missing, 0 
inconsistent objects
2013-09-06 15:04:15.286823 7f6ccc5ae700  0 log [ERR] : 2.12 repair 2 errors, 2 
fixed
2013-09-06 15:04:20.778377 7f6ccc5ae700  0 log [ERR] : 2.12 scrub stat 
mismatch, got 2724/2723 objects, 339/339 clones, 11315493376/11311299072 bytes.
2013-09-06 15:04:20.778383 7f6ccc5ae700  0 log [ERR] : 2.12 scrub 1 errors

b5# ceph pg dump | grep inconsistent
2.1227230   0   0   11311299072 159103  159103  
active+clean+inconsistent   2013-09-06 15:04:20.778413  20121'690883
20128'7941893   [6,7]   [6,7]   20121'6908832013-09-06 15:04:20.778387  
20121'6908832013-09-06 15:04:15.286835

b5# ceph pg repair 2.12
b5# while ceph -s | grep -q scrubbing; do sleep 60; done
b5# tail /var/log/ceph/ceph-osd.6.log
2013-09-06 15:07:30.461959 7f6ccc5ae700  0 log [ERR] : 2.12 repair stat 
mismatch, got 2724/2723 objects, 339/339 clones, 11315493376/11311299072 bytes.
2013-09-06 15:07:30.461991 7f6ccc5ae700  0 log [ERR] : 2.12 repair 1 errors, 1 
fixed

b5# ceph pg dump | grep inconsistent
2.1227240   0   0   11315493376 159580  159580  
active+clean+inconsistent   2013-09-06 15:07:30.462039  20129'

OSD repair: on disk size does not match object info size

2013-09-09 Thread Chris Dunlop
G'day,

On 0.56.7-1~bpo70+1 I'm getting:

# ceph pg dump | grep inconsistent
2013-09-10-08:39:59 2.bc27760   0   0   11521799680 
162063  162063  active+clean+inconsistent   2013-09-10 08:38:38.482302  
20512'69987720360'13461026  [6,0]   [6,0]   20512'6998772013-09-10 
08:38:38.482264  20512'699877 2013-09-10 08:38:38.482264

# ceph pg repair 2.bc
instructing pg 2.bc on osd.6 to repair

# tail /var/log/ceph/ceph-osd.6.log
2013-09-10 08:17:25.557926 7fef09c14700  0 log [ERR] : repair 2.bc 
89ebebc/rbd_data.13a0c74b0dc51.000107ec/head//2 on disk size (4194304) 
does not match object info size (4104192)
2013-09-10 08:17:27.316112 7fef09c14700  0 log [ERR] : 2.bc repair 1 errors, 0 
fixed

# ls -l 
'ceph-6/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2'
-rw-r--r-- 1 root root 4194304 Sep  8 21:01 
ceph-6/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2
# ls -l 
'ceph-0/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2'
-rw-r--r-- 1 root root 4194304 Sep  8 21:01 
ceph-0/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2

One possible solution would be to simply truncate the objects down to the
object info size, as recommended in this case:

http://www.spinics.net/lists/ceph-users/msg00793.html
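
For concreteness, that would amount to something like the following
sketch (object path and size taken from the repair message above;
untested, and keeping an untouched copy first):

# back up the object before touching it (backup path is arbitrary)
cd /var/lib/ceph/osd/ceph-6/current/2.bc_head/DIR_C/DIR_B/DIR_E/
cp -a 'rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2' /root/089EBEBC.orig
# truncate down to the size recorded in the object info, then re-check
truncate --size 4104192 'rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2'
ceph pg repair 2.bc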

However I'm a little concerned about that solution as the on-disk size is
exactly 4MB, which I think is the expected size of these objects, and matches
the size of all the other objects in the same directory, and the "extra" data
looks a little interesting, with "FILE0" blocks in there (what are those?):

# cd /var/lib/ceph/osd/ceph-6/current/2.bc_head/DIR_C/DIR_B/DIR_E/
# dd if='rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2' bs=1024 
skip=4008 | od -c
000   F   I   L   E   0  \0 003  \0 312   j   o   o  \0  \0  \0  \0
020 001  \0 001  \0   8  \0 001  \0   X 001  \0  \0  \0 004  \0  \0
040  \0  \0  \0  \0  \0  \0  \0  \0 006  \0  \0  \0 310   p 017  \0
060 002  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0   `  \0  \0  \0
...
0002000   F   I   L   E   0  \0 003  \0 002   k   o   o  \0  \0  \0  \0
0002020 001  \0 001  \0   8  \0 001  \0   X 001  \0  \0  \0 004  \0  \0
0002040  \0  \0  \0  \0  \0  \0  \0  \0 006  \0  \0  \0 311   p 017  \0
0002060 002  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0   `  \0  \0  \0
...
0004000   F   I   L   E   0  \0 003  \0 023   r   o   o  \0  \0  \0  \0
0004020 001  \0 001  \0   8  \0 001  \0   X 001  \0  \0  \0 004  \0  \0
0004040  \0  \0  \0  \0  \0  \0  \0  \0 006  \0  \0  \0 312   p 017  \0
0004060 002  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0   `  \0  \0  \0

Is it safe to simply truncate this object, or what other solutions might
be applicable?

Cheers,

Chris


Re: OSD repair: on disk size does not match object info size

2013-09-09 Thread Chris Dunlop
On Mon, Sep 09, 2013 at 04:30:33PM -0700, Sage Weil wrote:
> On Tue, 10 Sep 2013, Chris Dunlop wrote:
>> G'day,
>> 
>> On 0.56.7-1~bpo70+1 I'm getting:
>> 
>> # ceph pg dump | grep inconsistent
>> 2013-09-10-08:39:59 2.bc27760   0   0   11521799680   
>>   162063  162063  active+clean+inconsistent   2013-09-10 08:38:38.482302 
>>  20512'69987720360'13461026  [6,0]   [6,0]   20512'699877
>> 2013-09-10 08:38:38.482264  20512'699877 2013-09-10 08:38:38.482264
>> 
>> # ceph pg repair 2.bc
>> instructing pg 2.bc on osd.6 to repair
>> 
>> # tail /var/log/ceph/ceph-osd.6.log
>> 2013-09-10 08:17:25.557926 7fef09c14700  0 log [ERR] : repair 2.bc 
>> 89ebebc/rbd_data.13a0c74b0dc51.000107ec/head//2 on disk size 
>> (4194304) does not match object info size (4104192)
>> 2013-09-10 08:17:27.316112 7fef09c14700  0 log [ERR] : 2.bc repair 1 errors, 
>> 0 fixed
>> 
>> # ls -l 
>> 'ceph-6/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2'
>> -rw-r--r-- 1 root root 4194304 Sep  8 21:01 
>> ceph-6/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2
>> # ls -l 
>> 'ceph-0/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2'
>> -rw-r--r-- 1 root root 4194304 Sep  8 21:01 
>> ceph-0/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2
>> 
>> One possible solution would be to simply truncate the objects down to the
>> object info size, as recommended in this case:
>> 
>> http://www.spinics.net/lists/ceph-users/msg00793.html
>> 
>> However I'm a little concerned about that solution as the on-disk size is
>> exactly 4MB, which I think is the expected size of these objects, and matches
>> the size of all the other objects in the same directory, and the "extra" data
>> looks a little interesting, with "FILE0" blocks in there (what are those?):
>> 
>> # cd /var/lib/ceph/osd/ceph-6/current/2.bc_head/DIR_C/DIR_B/DIR_E/
>> # dd if='rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2' bs=1024 
>> skip=4008 | od -c
>> 000   F   I   L   E   0  \0 003  \0 312   j   o   o  \0  \0  \0  \0
>> 020 001  \0 001  \0   8  \0 001  \0   X 001  \0  \0  \0 004  \0  \0
>> 040  \0  \0  \0  \0  \0  \0  \0  \0 006  \0  \0  \0 310   p 017  \0
>> 060 002  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0   `  \0  \0  \0
>> ...
>> 0002000   F   I   L   E   0  \0 003  \0 002   k   o   o  \0  \0  \0  \0
>> 0002020 001  \0 001  \0   8  \0 001  \0   X 001  \0  \0  \0 004  \0  \0
>> 0002040  \0  \0  \0  \0  \0  \0  \0  \0 006  \0  \0  \0 311   p 017  \0
>> 0002060 002  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0   `  \0  \0  \0
>> ...
>> 0004000   F   I   L   E   0  \0 003  \0 023   r   o   o  \0  \0  \0  \0
>> 0004020 001  \0 001  \0   8  \0 001  \0   X 001  \0  \0  \0 004  \0  \0
>> 0004040  \0  \0  \0  \0  \0  \0  \0  \0 006  \0  \0  \0 312   p 017  \0
>> 0004060 002  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0   `  \0  \0  \0
>> 
>> Is it safe to simply truncate this object, or what other solutions might
>> be applicable?
> 
> The alternative is to edit the xattr.  That's harder, but better.  You'll 
> want to grab the user.ceph._ xattr, change the one instance of 4104192 to 
> 4194304, and then reset it.  You can use
> 
>  ceph-dencoder type object_info_t import /tmp/xattrfile decode dump_json
>
> to verify that it decodes properly before and after you make the edit.  I 
> like the 'attr' tool for getting/setting xattrs.

Can ceph-dencoder import the (modified) json and write out the
encoded binary suitable for setting in the xattr?

If not, what encoding is the xattr, so I can work out what I
need to do to make the change?

# getfattr -n user.ceph._ 
'ceph-6/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2'
getfattr: Removing leading '/' from absolute path names
# file: 
ceph-6/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\134udata.13a0c74b0dc51.000107ec__head_089EBEBC__2
user.ceph._=0sCgjoBANIACcAAAByYmRfZGF0YS4xM2EwYzc0YjBkYzUxLjAwMDAwMDAwMDAwMTA3ZWP+/7y+nggAAAIABAMQAgD/AACrqgoAADJPAAB6dwoAADBOAAACAhUIS2YBAADm+g0foD4AABlZLFIAh2IgAgIVq6oKAAAyTw==
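
(The "0s" prefix in that getfattr output just means the value is shown
base64-encoded.) Assuming the edit itself is done on the raw bytes with
a hex editor, a sketch of the round trip might look like this
(untested, and presumably best done with osd.6 stopped):

cd /var/lib/ceph/osd
# dump the raw xattr bytes to a file
getfattr --only-values -n user.ceph._ \
  'ceph-6/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2' \
  > /tmp/xattrfile
# verify it decodes, hex-edit the size field (4104192 -> 4194304), verify again
ceph-dencoder type object_info_t import /tmp/xattrfile decode dump_json
# write the modified value back, base64-encoded on the command line
setfattr -n user.ceph._ -v "0s$(base64 -w0 /tmp/xattrfile)" \
  'ceph-6/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2'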

> Is this still bobtail?  We haven't seen this sort of corruption since 
> then.

Yup. I'll upgrade once the cluster settles down cleanly!

Cheers,

Chris.


Re: OSD repair: on disk size does not match object info size

2013-09-09 Thread Chris Dunlop
On Mon, Sep 09, 2013 at 05:14:14PM -0700, Sage Weil wrote:
> On Tue, 10 Sep 2013, Chris Dunlop wrote:
>> On Mon, Sep 09, 2013 at 04:30:33PM -0700, Sage Weil wrote:
>>> On Tue, 10 Sep 2013, Chris Dunlop wrote:
>>>> G'day,
>>>> 
>>>> On 0.56.7-1~bpo70+1 I'm getting:
>>>> 
>>>> # ceph pg dump | grep inconsistent
>>>> 2013-09-10-08:39:59 2.bc27760   0   0   11521799680 
>>>> 162063  162063  active+clean+inconsistent   2013-09-10 
>>>> 08:38:38.482302  20512'69987720360'13461026  [6,0]   [6,0]   
>>>> 20512'6998772013-09-10 08:38:38.482264  20512'699877 
>>>> 2013-09-10 08:38:38.482264
>>>> 
>>>> # ceph pg repair 2.bc
>>>> instructing pg 2.bc on osd.6 to repair
>>>> 
>>>> # tail /var/log/ceph/ceph-osd.6.log
>>>> 2013-09-10 08:17:25.557926 7fef09c14700  0 log [ERR] : repair 2.bc 
>>>> 89ebebc/rbd_data.13a0c74b0dc51.000107ec/head//2 on disk size 
>>>> (4194304) does not match object info size (4104192)
>>>> 2013-09-10 08:17:27.316112 7fef09c14700  0 log [ERR] : 2.bc repair 1 
>>>> errors, 0 fixed
>>>> 
>>>> # ls -l 
>>>> 'ceph-6/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2'
>>>> -rw-r--r-- 1 root root 4194304 Sep  8 21:01 
>>>> ceph-6/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2
>>>> # ls -l 
>>>> 'ceph-0/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2'
>>>> -rw-r--r-- 1 root root 4194304 Sep  8 21:01 
>>>> ceph-0/current/2.bc_head/DIR_C/DIR_B/DIR_E/rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2
>>>> 
>>>> One possible solution would be to simply truncate the objects down to the
>>>> object info size, as recommended in this case:
>>>> 
>>>> http://www.spinics.net/lists/ceph-users/msg00793.html
>>>> 
>>>> However I'm a little concerned about that solution as the on-disk size is
>>>> exactly 4MB, which I think is the expected size of these objects, and 
>>>> matches
>>>> the size of all the other objects in the same directory, and the "extra" 
>>>> data
>>>> looks a little interesting, with "FILE0" blocks in there (what are those?):
>>>> 
>>>> # cd /var/lib/ceph/osd/ceph-6/current/2.bc_head/DIR_C/DIR_B/DIR_E/
>>>> # dd if='rbd\udata.13a0c74b0dc51.000107ec__head_089EBEBC__2' 
>>>> bs=1024 skip=4008 | od -c
>>>> 000   F   I   L   E   0  \0 003  \0 312   j   o   o  \0  \0  \0  \0
>>>> 020 001  \0 001  \0   8  \0 001  \0   X 001  \0  \0  \0 004  \0  \0
>>>> 040  \0  \0  \0  \0  \0  \0  \0  \0 006  \0  \0  \0 310   p 017  \0
>>>> 060 002  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0   `  \0  \0  \0
>>>> ...
>>>> 0002000   F   I   L   E   0  \0 003  \0 002   k   o   o  \0  \0  \0  \0
>>>> 0002020 001  \0 001  \0   8  \0 001  \0   X 001  \0  \0  \0 004  \0  \0
>>>> 0002040  \0  \0  \0  \0  \0  \0  \0  \0 006  \0  \0  \0 311   p 017  \0
>>>> 0002060 002  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0   `  \0  \0  \0
>>>> ...
>>>> 0004000   F   I   L   E   0  \0 003  \0 023   r   o   o  \0  \0  \0  \0
>>>> 0004020 001  \0 001  \0   8  \0 001  \0   X 001  \0  \0  \0 004  \0  \0
>>>> 0004040  \0  \0  \0  \0  \0  \0  \0  \0 006  \0  \0  \0 312   p 017  \0
>>>> 0004060 002  \0  \0  \0  \0  \0  \0  \0 020  \0  \0  \0   `  \0  \0  \0
>>>> 
>>>> Is it safe to simply truncate this object, or what other solutions might
>>>> be applicable?
>>> 
>>> The alternative is to edit the xattr.  That's harder, but better.  You'll 
>>> want to grab the user.ceph._ xattr, change the one instance of 4104192 to 
>>> 4194304, and then reset it.  You can use
>>> 
>>>  ceph-dencoder type object_info_t import /tmp/xattrfile decode dump_json
>>>
>>> to verify that it decodes properly before and after you make the edit.  I 
>>> like the 'attr' tool for getting/setting xattrs.
>> 
>> Can ceph-dencoder import the (modified) json and write out the
>> encoded binary suitable for setting in the xattr?
> 
> It can't, sadly.
>  
>> If not, what encoding is

Bobtail to dumpling (was: OSD crash during repair)

2013-09-10 Thread Chris Dunlop
On Fri, Sep 06, 2013 at 08:21:07AM -0700, Sage Weil wrote:
> On Fri, 6 Sep 2013, Chris Dunlop wrote:
>> On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
>>> Also, you should upgrade to dumpling.  :)
>> 
>> I've been considering it. It was initially a little scary with
>> the various issues that were cropping up but that all seems to
>> have quietened down.
>> 
>> Of course I'd like my cluster to be clean before attempting an upgrade!
> 
> Definitely.  Let us know how it goes! :)

Upgraded, directly from bobtail to dumpling.

Well, that was a mite more traumatic than I expected. I had two
issues, both my fault...

Firstly, I didn't realise I should have restarted the osds one
at a time rather than doing 'service ceph restart' on each host
quickly in succession. Restarting them all at once meant
everything was offline whilst "PGs are upgrading".
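
Something along these lines, run for each osd in turn on the host that
owns it, would have avoided that (a sketch; the polling interval is
arbitrary):

# restart one osd, then wait for the cluster to settle before the next
service ceph restart osd.0
while ! ceph health | grep -q HEALTH_OK; do sleep 30; done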

Secondly, whilst I saw the 'osd crush update on start' issue in
the release notes and checked that my crush map hostnames matched
the actual hostnames, I have two separate pools (for fast SAS vs
bulk SATA disks) and I stupidly only checked the one whose names
matched, not the other whose names didn't. So on restart all the
osds moved into the one pool and started rebalancing.

The two issues at the same time produced quite the adrenaline
rush! :-)

My current crush configuration is below (host b2 is recently
added and I haven't added it into the pools yet). Is there a
better/recommended way of using the crush map to support
separate pools to avoid setting 'osd crush update on start =
false'? It doesn't seem that I can use the same 'host' names
under the separate 'sas' and 'default' roots?
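
For context, the separate pools are tied to those roots with crush
rules along these lines (a sketch; the rule number and pool name are
illustrative), and 'osd crush update on start = false' in the [osd]
section of ceph.conf is what stops the osds being moved back to the
default location on restart:

rule sas {
ruleset 3
type replicated
min_size 1
max_size 10
step take sas
step chooseleaf firstn 0 type host
step emit
}

# point a pool at that ruleset
ceph osd pool set rbd-sas crush_ruleset 3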

Cheers,

Chris

--
# ceph osd tree
# id    weight  type name       up/down reweight
-8  2   root sas
-7  2   rack sas-rack-1
-5  1   host b4-sas
4   0.5 osd.4   up  1   
5   0.5 osd.5   up  1   
-6  1   host b5-sas
2   0.5 osd.2   up  1   
3   0.5 osd.3   up  1   
-1  12.66   root default
-3  8   rack unknownrack
-2  4   host b4
0   2   osd.0   up  1   
7   2   osd.7   up  1   
-4  4   host b5
1   2   osd.1   up  1   
6   2   osd.6   up  1   
-9  4.66    host b2
10  1.82    osd.10  up  1   
11  1.82    osd.11  up  1   
8   0.51    osd.8   up  1   
9   0.51    osd.9   up  1   

--
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host b4 {
id -2   # do not change unnecessarily
# weight 4.000
alg straw
hash 0  # rjenkins1
item osd.0 weight 2.000
item osd.7 weight 2.000
}
host b5 {
id -4   # do not change unnecessarily
# weight 4.000
alg straw
hash 0  # rjenkins1
item osd.1 weight 2.000
item osd.6 weight 2.000
}
rack unknownrack {
id -3   # do not change unnecessarily
# weight 8.000
alg straw
hash 0  # rjenkins1
item b4 weight 4.000
item b5 weight 4.000
}
host b2 {
id -9   # do not change unnecessarily
# weight 4.660
alg straw
hash 0  # rjenkins1
item osd.10 weight 1.820
item osd.11 weight 1.820
item osd.8 weight 0.510
item osd.9 weight 0.510
}
root default {
id -1   # do not change unnecessarily
# weight 12.660
alg straw
hash 0  # rjenkins1
item unknownrack weight 8.000
item b2 weight 4.660
}
host b4-sas {
id -5   # do not change unnecessarily
# weight 1.000
alg straw
hash 0  # rjenkins1
item osd.4 weight 0.500
item osd.5 weight 0.500
}
host b5-sas {
id -6   # do not change unnecessarily
# weight 1.000
alg straw
hash 0  # rjenkins1
item osd.2 weight 0.500
item osd.3 weight 0.500
}
rack sas-rack-1 {
id -7   # do not change

Re: [ceph-users] I have PGs that I can't deep-scrub

2014-07-10 Thread Chris Dunlop
Hi Craig,

On Thu, Jul 10, 2014 at 03:09:51PM -0700, Craig Lewis wrote:
> I fixed this issue by reformatting all of the OSDs.  I changed the mkfs
> options from
> 
> [osd]
>   osd mkfs type = xfs
>   osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096
> 
> to
> [osd]
>   osd mkfs type = xfs
>   osd mkfs options xfs = -s size=4096
> 
> (I have a mix of 512 and 4k sector drives, and I want to treat them all
> like 4k sector).
> 
> 
> Now deep scrub runs to completion, and CPU usage of the daemon never goes
> over 30%.  I did have to restart a few OSDs when I scrubbed known problem
> PGs, but they scrubbed the 2nd time successfully.  The cluster is still
> scrubbing, but it's completed half with no more issues.

I suspect it was the "-n size=64k" causing this behaviour, potentially using
too much CPU and starving the OSD processes:

http://xfs.org/index.php/XFS_FAQ#Q:_Performance:_mkfs.xfs_-n_size.3D64k_option
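
For anyone wanting to check whether an existing OSD filesystem was
built that way, xfs_info on the mounted OSD shows the directory block
size (a sketch; exact output layout varies with xfsprogs version):

xfs_info /var/lib/ceph/osd/ceph-0 | grep naming
# e.g. "naming   =version 2   bsize=65536  ascii-ci=0" indicates -n size=64k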

Cheers,

Chris


Re: [ceph-users] All pgs stuck peering

2015-12-14 Thread Chris Dunlop
On Mon, Dec 14, 2015 at 09:29:20PM +0800, Jaze Lee wrote:
> Should we add a big packet test to the heartbeat? Right now the
> heartbeat only tests small packets. If the MTU is mismatched, the
> heartbeat cannot detect that.

It would certainly have saved me a great deal of stress!

I imagine you wouldn't want it doing a big packet test on every
heartbeat; perhaps every 10th heartbeat, or at some configurable
interval.

Something for the developers to consider? (cc'ed)
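
In the meantime, a manual check of the path MTU between OSD hosts can
be done with a don't-fragment ping (a sketch; substitute the peer's
cluster-network address and size the payload for your MTU):

# 8972 = 9000-byte MTU minus 28 bytes of IP + ICMP headers
ping -M do -s 8972 -c 3 <peer cluster address>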

Chris