Re: Striped images and cluster misbehavior

2013-01-12 Thread Andrey Korolyov
After digging a lot, I have found that the IB cards and the switch may go
into a ``bad'' state after a host load spike, so I limited all
potentially CPU-hungry processes via cgroups. That had no effect at all:
the spikes happen at almost the same time as the osds on the corresponding
host get marked down (``wrongly marked'') for a couple of seconds. By
observing manually, I have confirmed that the osds go crazy first, eating
all cores at 100% SY (meaning scheduler or fs issues); the card, lacking
time to service its interrupts, then starts dropping packets, and so on.
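For reference, the cap I applied looked roughly like the following
(cgroup v1 cpu controller; the group name, quota values and $PID are
only illustrative):

    # cap the group's CPU time at roughly 4 cores' worth (quota/period = 4)
    mkdir -p /sys/fs/cgroup/cpu/capped
    echo 100000 > /sys/fs/cgroup/cpu/capped/cpu.cfs_period_us
    echo 400000 > /sys/fs/cgroup/cpu/capped/cpu.cfs_quota_us
    # move a CPU-hungry process into the group
    echo "$PID" > /sys/fs/cgroup/cpu/capped/tasks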

This can be reproduced only under heavy workload on the fast cluster;
a slower one with similar software versions will crawl, but does not
produce such locks. The locks may go away or may hang around for a while,
tens of minutes; I am not sure what that depends on. Both nodes whose logs
are linked below contain one monitor and one osd, but the locks do happen
on two-osd nodes as well. Ceph instances do not share block devices in
my setup (except that the two-osd nodes use the same SSD for the journal,
but since the problem is reproducible on a mon-osd pair with completely
separate storage, that does not seem to be the actual cause). For the
meantime, I may try moving away from XFS and see if the locks remain. The
issue started with the late 3.6 series and 0.55+, and remains in 3.7.1
and 0.56.1. Should I move to ext4 immediately, or try a 3.8-rc with a
couple of XFS fixes first?
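In case I do switch, the ceph.conf bits I would expect to need for ext4
are roughly the following (a sketch with the option names as I understand
them for this release; please correct me if any are off):

    [osd]
        osd mkfs type = ext4
        osd mount options ext4 = rw,noatime,user_xattr
        ; ext4 xattrs are too small for some OSD metadata,
        ; so spill the extra xattrs into omap/leveldb
        filestore xattr use omap = true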

http://xdel.ru/downloads/ceph-log/osd-lockup-1-14-25-12.875107.log.gz
http://xdel.ru/downloads/ceph-log/osd-lockup-2-14-33-16.741603.log.gz

Timestamps were added to the filenames for easier lookup; the osdmap marked
the osds down a couple of heartbeats after those points.


On Mon, Dec 31, 2012 at 1:16 AM, Andrey Korolyov and...@xdel.ru wrote:
 On Sun, Dec 30, 2012 at 10:56 PM, Samuel Just sam.j...@inktank.com wrote:
 Sorry for the delay.  A quick look at the log doesn't show anything
 obvious... Can you elaborate on how you caused the hang?
 -Sam


 I am sorry for all this noise; the issue has almost certainly been
 triggered by some bug in the InfiniBand switch firmware, because a
 per-port reset was able to solve the ``wrong mark'' problem - at least,
 it hasn't shown up for a week now. The problem took almost two days
 to resolve - all possible connectivity tests displayed no
 timeouts or drops that could cause wrong marks. Finally, I
 started playing with TCP settings and found that enabling
 ipv4.tcp_low_latency raised the probability of a ``wrong mark'' event
 several times - so the space of possible causes quickly collapsed to a
 media-only problem, and I fixed it soon after.
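 For what it's worth, this is roughly how I checked and pinned the setting
 (the sysctl.d file name is just illustrative):

     sysctl net.ipv4.tcp_low_latency          # show the current value
     sysctl -w net.ipv4.tcp_low_latency=0     # keep it disabled on the affected nodes
     echo 'net.ipv4.tcp_low_latency = 0' > /etc/sysctl.d/90-ceph-nodes.conf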

 On Wed, Dec 19, 2012 at 3:53 AM, Andrey Korolyov and...@xdel.ru wrote:
 Please take a look at the log below; this is a slightly different bug -
 both osd processes on the node were stuck eating all available cpu
 until I killed them. This can be reproduced by doing parallel exports
 of different images from the same client IP, using either ``rbd export''
 or API calls - after a couple of wrong ``downs'', osd.19 and osd.27
 finally got stuck. What is more interesting, 10.5.0.33 holds the hungriest
 set of virtual machines, constantly eating four of the twenty-four HT
 cores, and this node fails almost every time. The underlying fs is XFS,
 ceph version gf9d090e. Quite possibly my previous reports are about side
 effects of this problem.
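 The triggering load is simply several exports running in parallel from
 one client, something like the following (pool/image names and the
 destination path are made up):

     for img in vm01 vm02 vm03 vm04; do
         rbd export rbd/$img /tmp/$img.img &
     done
     wait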

 http://xdel.ru/downloads/ceph-log/osd-19_and_27_stuck.log.gz

 and timings for the monmap, logs are from different hosts, so they may
 have a time shift of tens of milliseconds:

 http://xdel.ru/downloads/ceph-log/timings-crash-osd_19_and_27.txt

 Thanks!


Re: Separate metadata disk for OSD

2013-01-12 Thread Yan, Zheng
On Sat, Jan 12, 2013 at 2:57 PM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:

 Hi list,
  For an rbd write request, Ceph needs to do 3 writes:
 2013-01-10 13:10:15.539967 7f52f516c700 10 filestore(/data/osd.21) 
 _do_transaction on 0x327d790
 2013-01-10 13:10:15.539979 7f52f516c700 15 filestore(/data/osd.21) write 
 meta/516b801c/pglog_2.1a/0//-1 36015~147
 2013-01-10 13:10:15.540016 7f52f516c700 15 filestore(/data/osd.21) path: 
 /data/osd.21/current/meta/DIR_C/pglog\u2.1a__0_516B801C__none
 2013-01-10 13:10:15.540164 7f52f516c700 15 filestore(/data/osd.21) write 
 meta/28d2f4a8/pginfo_2.1a/0//-1 0~496
 2013-01-10 13:10:15.540189 7f52f516c700 15 filestore(/data/osd.21) path: 
 /data/osd.21/current/meta/DIR_8/pginfo\u2.1a__0_28D2F4A8__none
 2013-01-10 13:10:15.540217 7f52f516c700 10 filestore(/data/osd.21) 
 _do_transaction on 0x327d708
 2013-01-10 13:10:15.540222 7f52f516c700 15 filestore(/data/osd.21) write 
 2.1a_head/8abf341a/rb.0.106e.6b8b4567.02d3/head//2 3227648~524288
 2013-01-10 13:10:15.540245 7f52f516c700 15 filestore(/data/osd.21) path: 
 /data/osd.21/current/2.1a_head/rb.0.106e.6b8b4567.02d3__head_8ABF341A__2

  If using XFS as the backend file system, running on top of a
 traditional SATA disk, this introduces a lot of seeks and therefore reduces
 bandwidth; a blktrace is available here
 (http://ww3.sinaimg.cn/mw690/6e1aee47jw1e0qsbxbvddj.jpg) to demonstrate this
 issue (single client running dd on top of a new RBD volume).
  Then I tried to move /osd.X/current/meta to a separate disk, and the
 bandwidth increased (see the blktrace at
 http://ww4.sinaimg.cn/mw690/6e1aee47jw1e0qsadz1bij.jpg).
  I haven't tested other access patterns or anything else yet, but it looks
 to me that moving this meta directory to a separate disk (SSD, or SATA with
 btrfs) will benefit Ceph write performance - is that true? Will Ceph
 introduce this feature in the future? Is there any potential problem with
 such a hack?


Did you try putting the XFS metadata log on a separate, fast device
(mkfs.xfs -l logdev=/dev/sdbx,size=1b)? I think it will boost
performance too.
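A fuller sketch of that setup, in case it helps (device names and the log
size here are illustrative; the external log must also be passed at mount
time):

    # /dev/sdc1 is a fast (SSD) partition holding the XFS log
    mkfs.xfs -f -l logdev=/dev/sdc1,size=128m /dev/sdb1
    mount -o logdev=/dev/sdc1,noatime /dev/sdb1 /data/osd.21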

Regards
Yan, Zheng


Re: Separate metadata disk for OSD

2013-01-12 Thread Mark Nelson

Hi Xiaoxi and Zheng,

We've played with both of these some internally, but not for a 
production deployment. Mostly just for diagnosing performance problems. 
 It's been a while since I last played with this, but I hadn't seen a 
whole lot of performance improvements at the time.  That may have been 
due to the hardware in use, or perhaps other parts of Ceph have improved 
to the point where this matters now!


On a side note, Btrfs also had a Google Summer of Code project to let 
you put metadata on an external device.  Originally I think that was 
supposed to make it into 3.7, but I am not sure if that happened.


Mark

On 01/12/2013 06:21 AM, Yan, Zheng wrote:

On Sat, Jan 12, 2013 at 2:57 PM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:


Hi list,
 For an rbd write request, Ceph needs to do 3 writes:
2013-01-10 13:10:15.539967 7f52f516c700 10 filestore(/data/osd.21) 
_do_transaction on 0x327d790
2013-01-10 13:10:15.539979 7f52f516c700 15 filestore(/data/osd.21) write 
meta/516b801c/pglog_2.1a/0//-1 36015~147
2013-01-10 13:10:15.540016 7f52f516c700 15 filestore(/data/osd.21) path: 
/data/osd.21/current/meta/DIR_C/pglog\u2.1a__0_516B801C__none
2013-01-10 13:10:15.540164 7f52f516c700 15 filestore(/data/osd.21) write 
meta/28d2f4a8/pginfo_2.1a/0//-1 0~496
2013-01-10 13:10:15.540189 7f52f516c700 15 filestore(/data/osd.21) path: 
/data/osd.21/current/meta/DIR_8/pginfo\u2.1a__0_28D2F4A8__none
2013-01-10 13:10:15.540217 7f52f516c700 10 filestore(/data/osd.21) 
_do_transaction on 0x327d708
2013-01-10 13:10:15.540222 7f52f516c700 15 filestore(/data/osd.21) write 
2.1a_head/8abf341a/rb.0.106e.6b8b4567.02d3/head//2 3227648~524288
2013-01-10 13:10:15.540245 7f52f516c700 15 filestore(/data/osd.21) path: 
/data/osd.21/current/2.1a_head/rb.0.106e.6b8b4567.02d3__head_8ABF341A__2
 If using XFS as the backend file system, running on top of a
traditional SATA disk, this introduces a lot of seeks and therefore reduces
bandwidth; a blktrace is available here
(http://ww3.sinaimg.cn/mw690/6e1aee47jw1e0qsbxbvddj.jpg) to demonstrate this
issue (single client running dd on top of a new RBD volume).
 Then I tried to move /osd.X/current/meta to a separate disk, and the
bandwidth increased (see the blktrace at
http://ww4.sinaimg.cn/mw690/6e1aee47jw1e0qsadz1bij.jpg).
 I haven't tested other access patterns or anything else yet, but it looks
to me that moving this meta directory to a separate disk (SSD, or SATA with
btrfs) will benefit Ceph write performance - is that true? Will Ceph
introduce this feature in the future? Is there any potential problem with
such a hack?



Did you try putting the XFS metadata log on a separate, fast device
(mkfs.xfs -l logdev=/dev/sdbx,size=1b)? I think it will boost
performance too.

Regards
Yan, Zheng





RE: Separate metadata disk for OSD

2013-01-12 Thread Chen, Xiaoxi
Hi Zheng,
 I have put the XFS log on a separate disk; it does provide some
performance gain, but not a significant one.
 Ceph's metadata is somewhat separate (it is a set of files residing on the
OSD's disk), so it is helped by neither the XFS log nor the OSD's journal.
That's why I am trying to put Ceph's metadata (the /data/osd.x/meta folder)
on a separate SSD.
To Nelson,
 I did the experiment with just one client; with more clients the gain
will not be as large.
 It looks to me that a single write from the client side becoming 3 writes
to disk is a big overhead for an in-place-update filesystem such as XFS,
since it introduces more seeks. An out-of-place-update filesystem will not
suffer as much from this pattern; I didn't see the problem when using btrfs
as the backend filesystem. But for btrfs, fragmentation is another
performance killer: for a single RBD volume, if you do a lot of random
writes on it, the sequential read performance drops to 30% of that of a
fresh RBD volume. This makes btrfs unusable in production.
 Separating the Ceph meta directory seems quite easy to me (I just mount a
partition at /data/osd.X/meta) - is that right? Is there any potential
problem with it?
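 Concretely, the hack amounts to something like the following (the device
name and OSD id are illustrative, and I assume the OSD is stopped while the
meta directory is copied over):

    service ceph stop osd.21
    mkfs.xfs /dev/sdc1                     # spare SSD partition
    mkdir -p /mnt/meta-new
    mount /dev/sdc1 /mnt/meta-new
    cp -a /data/osd.21/current/meta/. /mnt/meta-new/
    umount /mnt/meta-new
    mount /dev/sdc1 /data/osd.21/current/meta
    service ceph start osd.21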




Xiaoxi
-Original Message-
From: Mark Nelson [mailto:mark.nel...@inktank.com] 
Sent: January 12, 2013 21:36
To: Yan, Zheng 
Cc: Chen, Xiaoxi; ceph-devel@vger.kernel.org
Subject: Re: Separate metadata disk for OSD

Hi Xiaoxi and Zheng,

We've played with both of these some internally, but not for a production 
deployment. Mostly just for diagnosing performance problems. 
  It's been a while since I last played with this, but I hadn't seen a whole 
lot of performance improvements at the time.  That may have been due to the 
hardware in use, or perhaps other parts of Ceph have improved to the point 
where this matters now!

On a side note, Btrfs also had a Google Summer of Code project to let you put 
metadata on an external device.  Originally I think that was supposed to make 
it into 3.7, but I am not sure if that happened.

Mark

On 01/12/2013 06:21 AM, Yan, Zheng wrote:
 On Sat, Jan 12, 2013 at 2:57 PM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:

 Hi list,
  For an rbd write request, Ceph needs to do 3 writes:
 2013-01-10 13:10:15.539967 7f52f516c700 10 filestore(/data/osd.21) 
_do_transaction on 0x327d790
 2013-01-10 13:10:15.539979 7f52f516c700 15 filestore(/data/osd.21) 
write meta/516b801c/pglog_2.1a/0//-1 36015~147
 2013-01-10 13:10:15.540016 7f52f516c700 15 filestore(/data/osd.21) 
path: /data/osd.21/current/meta/DIR_C/pglog\u2.1a__0_516B801C__none
 2013-01-10 13:10:15.540164 7f52f516c700 15 filestore(/data/osd.21) 
write meta/28d2f4a8/pginfo_2.1a/0//-1 0~496
 2013-01-10 13:10:15.540189 7f52f516c700 15 filestore(/data/osd.21) 
path: /data/osd.21/current/meta/DIR_8/pginfo\u2.1a__0_28D2F4A8__none
 2013-01-10 13:10:15.540217 7f52f516c700 10 filestore(/data/osd.21) 
_do_transaction on 0x327d708
 2013-01-10 13:10:15.540222 7f52f516c700 15 filestore(/data/osd.21) 
write 2.1a_head/8abf341a/rb.0.106e.6b8b4567.02d3/head//2 
3227648~524288
 2013-01-10 13:10:15.540245 7f52f516c700 15 filestore(/data/osd.21) 
path: 
/data/osd.21/current/2.1a_head/rb.0.106e.6b8b4567.02d3__head_8
ABF341A__2
  If using XFS as the backend file system, running on top of a
 traditional SATA disk, this introduces a lot of seeks and therefore reduces
 bandwidth; a blktrace is available here
 (http://ww3.sinaimg.cn/mw690/6e1aee47jw1e0qsbxbvddj.jpg) to demonstrate this
 issue (single client running dd on top of a new RBD volume).
  Then I tried to move /osd.X/current/meta to a separate disk, and the
 bandwidth increased (see the blktrace at
 http://ww4.sinaimg.cn/mw690/6e1aee47jw1e0qsadz1bij.jpg).
  I haven't tested other access patterns or anything else yet, but it looks
 to me that moving this meta directory to a separate disk (SSD, or SATA with
 btrfs) will benefit Ceph write performance - is that true? Will Ceph
 introduce this feature in the future? Is there any potential problem with
 such a hack?


 Did you try putting the XFS metadata log on a separate, fast device 
 (mkfs.xfs -l logdev=/dev/sdbx,size=1b)? I think it will boost 
 performance too.

 Regards
 Yan, Zheng




Re: [PATCH] configure.ac: check for org.junit.rules.ExternalResource

2013-01-12 Thread Danny Al-Gaaf
Am 11.01.2013 06:13, schrieb Gary Lowell:
[...]
 Thanks Danny.  Installing sharutils solved that minor issue.  We now
 get though the build just fine on opensuse 12, but sles 11sp2 gives
 more warnings (pasted below).  Should we be using a newer version of
 autoconf  on sles?  I've tried moving AC_CANONICAL_TARGET earlier in
 the file, but that causes some other issues with the new java
 macros.
 
 Thanks, Gary

I'll take a look at it; I guess it's a problem in configure.ac.

I see the same warnings in our build system at SUSE for SLES (see: e.g.
logs at
https://build.opensuse.org/project/monitor?project=home%3Adalgaaf%3Abranches%3Afilesystems),
but the package builds just fine for openSUSE, SLES, Fedora, RHEL and
CentOS there.

Danny