Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-12-02 Thread Ilya Dryomov
On Mon, Dec 1, 2014 at 1:39 PM, Andrei Mikhailovsky and...@arhont.com wrote:
 Ilya,

 I will try doing that once again tonight, as this is a production cluster and
 when the dds trigger that dmesg error the cluster's I/O becomes very bad and I
 have to reboot the server to get things back on track. Most of my VMs start
 having 70-90% iowait until that server is rebooted.

 I've actually checked what you asked the last time I ran the test.

 When I do 4 dds concurrently nothing appears in the dmesg output. No
 messages at all.

 The kern.log file that I sent last time is what I got about a minute
 after I started the 8 dds. I've pasted the full output. The 8 dds did
 actually complete, but it took a rather long time. I was getting about 6MB/s
 per dd process compared to around 70MB/s per dd process when 4 dds were
 running. Do you still want me to run this or is the information I've
 provided enough?

How long did it take for all dds to complete?

Can you send the entire kern.log for that boot?  I want to look at how
things progressed during the entire time dds were chunking along.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-12-01 Thread Ilya Dryomov
On Mon, Dec 1, 2014 at 12:30 AM, Andrei Mikhailovsky and...@arhont.com wrote:

 Ilya, further to your email I have switched back to the 3.18 kernel that
 you've sent and I got similar-looking dmesg output to what I had on the 3.17
 kernel. Please find it attached for your reference. As before, this is the
 command I ran on the client:


 time dd if=/dev/zero of=4G00 bs=4M count=5K oflag=direct & time dd
 if=/dev/zero of=4G11 bs=4M count=5K oflag=direct & time dd if=/dev/zero
 of=4G22 bs=4M count=5K oflag=direct & time dd if=/dev/zero of=4G33 bs=4M
 count=5K oflag=direct & time dd if=/dev/zero of=4G44 bs=4M count=5K
 oflag=direct & time dd if=/dev/zero of=4G55 bs=4M count=5K oflag=direct &
 time dd if=/dev/zero of=4G66 bs=4M count=5K oflag=direct & time dd
 if=/dev/zero of=4G77 bs=4M count=5K oflag=direct &

Can you run that command again - on 3.18 kernel, to completion - and
paste

- the entire dmesg
- time results for each dd

?

Compare those to your results with four dds (or any other number which
doesn't trigger page allocation failures).
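
One way to capture both in a single run is a small wrapper like the following
(a minimal sketch, assuming bash and the same dd parameters as above; the file
and log names are illustrative):

#!/bin/bash
# eight concurrent 20G direct-I/O writes, each dd's time output captured
cd /tmp/cephfs || exit 1
for i in 0 1 2 3 4 5 6 7; do
    { time dd if=/dev/zero of=4G$i$i bs=4M count=5K oflag=direct; } \
        > dd-$i.log 2>&1 &
done
wait                          # block until all eight writers finish
dmesg > dmesg-after-8dd.txt   # kernel log snapshot for the same window
grep real dd-*.log            # wall-clock time per dd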

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-12-01 Thread Andrei Mikhailovsky
Ilya, 

I will try doing that once again tonight, as this is a production cluster and 
when the dds trigger that dmesg error the cluster's I/O becomes very bad and I have 
to reboot the server to get things back on track. Most of my VMs start having 70-90% 
iowait until that server is rebooted. 

I've actually checked what you asked the last time I ran the test. 

When I do 4 dds concurrently nothing appears in the dmesg output. No messages 
at all. 

The kern.log file that I sent last time is what I got about a minute after 
I started the 8 dds. I've pasted the full output. The 8 dds did actually 
complete, but it took a rather long time. I was getting about 6MB/s per dd 
process compared to around 70MB/s per dd process when 4 dds were running. Do 
you still want me to run this or is the information I've provided enough? 

Cheers 

Andrei 

- Original Message -

 From: Ilya Dryomov ilya.dryo...@inktank.com
 To: Andrei Mikhailovsky and...@arhont.com
 Cc: ceph-users ceph-users@lists.ceph.com, Gregory Farnum
 g...@gregs42.com
 Sent: Monday, 1 December, 2014 8:22:08 AM
 Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

 On Mon, Dec 1, 2014 at 12:30 AM, Andrei Mikhailovsky
 and...@arhont.com wrote:
 
  Ilya, further to your email I have switched back to the 3.18 kernel
  that
  you've sent and I got similar looking dmesg output as I had on the
  3.17
  kernel. Please find it attached for your reference. As before, this
  is the
  command I've ran on the client:
 
 
  time dd if=/dev/zero of=4G00 bs=4M count=5K oflag=direct & time dd
  if=/dev/zero of=4G11 bs=4M count=5K oflag=direct & time dd if=/dev/zero
  of=4G22 bs=4M count=5K oflag=direct & time dd if=/dev/zero of=4G33 bs=4M
  count=5K oflag=direct & time dd if=/dev/zero of=4G44 bs=4M count=5K
  oflag=direct & time dd if=/dev/zero of=4G55 bs=4M count=5K oflag=direct &
  time dd if=/dev/zero of=4G66 bs=4M count=5K oflag=direct & time dd
  if=/dev/zero of=4G77 bs=4M count=5K oflag=direct &

 Can you run that command again - on 3.18 kernel, to completion - and
 paste

 - the entire dmesg
 - time results for each dd

 ?

 Compare those to your results with four dds (or any other number
 which
 doesn't trigger page allocation failures).

 Thanks,

 Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-12-01 Thread Ilya Dryomov
On Mon, Dec 1, 2014 at 1:39 PM, Andrei Mikhailovsky and...@arhont.com wrote:
 Ilya,

 I will try doing that once again tonight, as this is a production cluster and
 when the dds trigger that dmesg error the cluster's I/O becomes very bad and I
 have to reboot the server to get things back on track. Most of my VMs start
 having 70-90% iowait until that server is rebooted.

That's easily explained - those splats in dmesg indicate severe memory
pressure.
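
If you want to watch it happen, logging memory state next to the dds makes the
squeeze visible (just a sketch; the one-second interval and the fields grepped
are arbitrary choices):

# log free/cached/dirty/writeback memory once a second during the test
while sleep 1; do
    date +%T
    grep -E 'MemFree|^Cached|Dirty|Writeback' /proc/meminfo
done > meminfo-during-dd.log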


 I've actually checked what you asked the last time I ran the test.

 When I do 4 dds concurrently nothing appears in the dmesg output. No
 messages at all.

 The kern.log file that I sent last time is what I got about a minute
 after I started the 8 dds. I've pasted the full output. The 8 dds did
 actually complete, but it took a rather long time. I was getting about 6MB/s
 per dd process compared to around 70MB/s per dd process when 4 dds were
 running. Do you still want me to run this or is the information I've
 provided enough?

No, no need if it's a production cluster.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-12-01 Thread Andrei Mikhailovsky
Ilya, 

I see. My server has 24GB of RAM + 3GB of swap. While running the tests, 
I noticed that the server had 14GB of RAM shown as cached and only 2MB of swap 
in use. Not sure if this is helpful for your debugging. 

Andrei 

-- 
Andrei Mikhailovsky 
Director 
Arhont Information Security 

Web: http://www.arhont.com 
http://www.wi-foo.com 
Tel: +44 (0)870 4431337 
Fax: +44 (0)208 429 3111 
PGP: Key ID - 0x2B3438DE 
PGP: Server - keyserver.pgp.com 

DISCLAIMER 

The information contained in this email is intended only for the use of the 
person(s) to whom it is addressed and may be confidential or contain legally 
privileged information. If you are not the intended recipient you are hereby 
notified that any perusal, use, distribution, copying or disclosure is strictly 
prohibited. If you have received this email in error please immediately advise 
us by return email at and...@arhont.com and delete and purge the email and any 
attachments without making a copy. 

- Original Message -

 From: Ilya Dryomov ilya.dryo...@inktank.com
 To: Andrei Mikhailovsky and...@arhont.com
 Cc: ceph-users ceph-users@lists.ceph.com, Gregory Farnum
 g...@gregs42.com
 Sent: Monday, 1 December, 2014 11:06:37 AM
 Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

 On Mon, Dec 1, 2014 at 1:39 PM, Andrei Mikhailovsky
 and...@arhont.com wrote:
  Ilya,
 
  I will try doing that once again tonight as this is a production
  cluster and
  when dds trigger that dmesg error the cluster's io becomes very bad
  and I
  have to reboot the server to get things on track. Most of my vms
  start
  having 70-90% iowait until that server is rebooted.

  That's easily explained - those splats in dmesg indicate severe memory
  pressure.

 
  I've actually checked what you've asked last time i've ran the
  test.
 
   When I do 4 dds concurrently nothing appears in the dmesg output.
  No
  messages at all.
 
  The kern.log file that i've sent last time is what I got about a
  minute
  after i've started 8 dds. I've pasted the full output. The 8 dds
  did
  actually complete, but it took a rather long time. I was getting
  about 6MB/s
  per dd process compared to around 70MB/s per dd process when 4 dds
  were
  running. Do you still want me to run this or is the information
  i've
  provided enough?

 No, no need if it's a production cluster.

 Thanks,

 Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-12-01 Thread Gregory Farnum
On Sun, Nov 30, 2014 at 1:15 PM, Andrei Mikhailovsky and...@arhont.com wrote:
 Greg, thanks for your comment. Could you please share what OS, kernel and
 any nfs/cephfs settings you've used to achieve that 'pretty well' stability?
 Also, what kind of tests have you run to check that?


We're just doing it on our testing cluster with the
teuthology/ceph-qa-suite stuff in
https://github.com/ceph/ceph-qa-suite/tree/master/suites/knfs/basic
So that'll be running our ceph-client kernel, which I believe is
usually a recent rc release with the new Ceph changes on top, with
knfs exporting a kcephfs mount, and then running each of the tasks
named in the tasks folder on a client of that knfs export.
-Greg
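
For anyone wanting to reproduce a similar setup by hand, the basic shape is a
kernel cephfs mount re-exported through knfs (a rough sketch only; the monitor
address, secret file, paths and export options below are placeholders, not the
exact teuthology configuration):

# on the NFS server: mount cephfs with the kernel client
mount -t ceph <mon-host>:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret

# export it over kernel NFS (fsid is required for non-block filesystems)
echo '/mnt/cephfs *(rw,no_subtree_check,fsid=1)' >> /etc/exports
exportfs -ra

# on the NFS client
mount -t nfs <nfs-server>:/mnt/cephfs /tmp/cephfs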
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-30 Thread Andrei Mikhailovsky
Greg, thanks for your comment. Could you please share what OS, kernel and any 
nfs/cephfs settings you've used to achieve that 'pretty well' stability? Also, 
what kind of tests have you run to check that? 

Thanks 

- Original Message -

 From: Gregory Farnum g...@gregs42.com
 To: Ilya Dryomov ilya.dryo...@inktank.com, Andrei Mikhailovsky
 and...@arhont.com
 Cc: ceph-users ceph-users@lists.ceph.com
 Sent: Saturday, 29 November, 2014 10:19:32 PM
 Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

 Ilya, do you have a ticket reference for the bug?
 Andrei, we run NFS tests on CephFS in our nightlies and it does
 pretty well so in the general case we expect it to work. Obviously
 not at the moment with whatever bug Ilya is looking at, though. ;)
 -Greg

 On Sat, Nov 29, 2014 at 4:51 AM Ilya Dryomov 
 ilya.dryo...@inktank.com  wrote:

  On Sat, Nov 29, 2014 at 3:49 PM, Ilya Dryomov 
  ilya.dryo...@inktank.com  wrote:
 
   On Sat, Nov 29, 2014 at 3:22 PM, Andrei Mikhailovsky 
   and...@arhont.com  wrote:
 
   Ilya,
 
  
 
   I think i spoke too soon in my last message. I've not given it
   more load
 
   (running 8 concurrent dds with bs=4M) and about a minute or so
   after
 
   starting i've seen problems in dmesg output. I am attaching
   kern.log file
 
   for you reference.
 
  
 
   Please check starting with the following line: Nov 29 12:07:38
 
   arh-ibstorage1-ib kernel: [ 3831.906510]. This is when I've
   started the
 
   concurrent 8 dds.
 
  
 
   The command that caused this is:
 
  
 
    time dd if=/dev/zero of=4G00 bs=4M count=5K oflag=direct & time dd
    if=/dev/zero of=4G11 bs=4M count=5K oflag=direct & time dd if=/dev/zero
    of=4G22 bs=4M count=5K oflag=direct & time dd if=/dev/zero of=4G33 bs=4M
    count=5K oflag=direct & time dd if=/dev/zero of=4G44 bs=4M count=5K
    oflag=direct & time dd if=/dev/zero of=4G55 bs=4M count=5K oflag=direct &
    time dd if=/dev/zero of=4G66 bs=4M count=5K oflag=direct & time dd
    if=/dev/zero of=4G77 bs=4M count=5K oflag=direct &
 
  
 
   I've ran the same test about 10 times but with only 4 concurrent
   dds and
 
   that didn't cause the issue.
 
  
 
   Should I try the 3.18 kernel again to see if 8dds produce
   similar
   output?
 
  
 
   Missing attachment.
 

  Definitely try the 3.18 testing kernel.
 

  Thanks,
 

  Ilya
 
  ___
 
  ceph-users mailing list
 
  ceph-users@lists.ceph.com
 
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-29 Thread Ilya Dryomov
On Sat, Nov 29, 2014 at 2:13 AM, Andrei Mikhailovsky and...@arhont.com wrote:
 Ilya, here is what I got shortly after starting the dd test:



 [  288.307993]
 [  288.308004] =
 [  288.308008] [ INFO: possible irq lock inversion dependency detected ]
 [  288.308014] 3.18.0-rc6-ceph-00024-g72ca172 #1 Tainted: GE
 [  288.308019] -
 [  288.308023] kswapd1/87 just changed the state of lock:
 [  288.308027]  (xfs_dir_ilock_class){-+}, at: [a0682d44]
 xfs_ilock+0x134/0x160 [xfs]
 [  288.308072] but this lock took another, RECLAIM_FS-unsafe lock in the
 past:
 [  288.308076]  (mm-mmap_sem){++}
 [  288.308076]
 [  288.308076] and interrupts could create inverse lock ordering between
 them.
 [  288.308076]
 [  288.308084]
 [  288.308084] other info that might help us debug this:
 [  288.308089]  Possible interrupt unsafe locking scenario:
 [  288.308089]
 [  288.308094]CPU0CPU1
 [  288.308097]
 [  288.308100]   lock(mm-mmap_sem);
 [  288.308104]local_irq_disable();
 [  288.308109]lock(xfs_dir_ilock_class);
 [  288.308114]lock(mm-mmap_sem);
 [  288.308120]   Interrupt
 [  288.308122] lock(xfs_dir_ilock_class);
 [  288.308127]
 [  288.308127]  *** DEADLOCK ***
 [  288.308127]
 [  288.308133] 3 locks held by kswapd1/87:
 [  288.308136]  #0:  (shrinker_rwsem){..}, at: [8117551f]
 shrink_slab+0x3f/0x140
 [  288.308151]  #1:  (type-s_umount_key#27){.+}, at:
 [811d8c14] grab_super_passive+0x44/0x90
 [  288.308165]  #2:  (pag-pag_ici_reclaim_lock){+.+...}, at:
 [a067acd4] xfs_reclaim_inodes_ag+0xb4/0x400 [xfs]
 [  288.308192]
 [  288.308192] the shortest dependencies between 2nd lock and 1st lock:
 [  288.308206]  - (mm-mmap_sem){++} ops: 27039227 {
 [  288.308214] HARDIRQ-ON-W at:
 [  288.308218]   [810a7209]
 __lock_acquire+0x629/0x1c90
 [  288.308229]   [810a8e9e]
 lock_acquire+0x9e/0x140
 [  288.308236]   [8173ae99]
 down_write+0x49/0x80
 [  288.308244]   [811dcd03]
 do_execve_common.isra.25+0x283/0x6e0
 [  288.308253]   [811dd178]
 do_execve+0x18/0x20
 [  288.308259]   [8106ff4e]
 call_usermodehelper+0x11e/0x170
 [  288.308269]   [8173d66c]
 ret_from_fork+0x7c/0xb0
 [  288.308276] HARDIRQ-ON-R at:
 [  288.308280]   [810a6f23]
 __lock_acquire+0x343/0x1c90
 [  288.308287]   [810a8e9e]
 lock_acquire+0x9e/0x140
 [  288.308294]   [8118d833]
 might_fault+0x93/0xc0
 [  288.308304]   [813b7a80]
 __clear_user+0x20/0x70
 [  288.308314]   [813b7afe]
 clear_user+0x2e/0x40
 [  288.308320]   [8122a4cd] padzero+0x2d/0x40
 [  288.308329]   [8122b0bf]
 load_elf_binary+0x9cf/0x1880
 [  288.308336]   [811db9f0]
 search_binary_handler+0xa0/0x1e0
 [  288.308343]   [811dcfa2]
 do_execve_common.isra.25+0x522/0x6e0
 [  288.308351]   [811dd178]
 do_execve+0x18/0x20
 [  288.308358]   [8106ff4e]
 call_usermodehelper+0x11e/0x170
 [  288.308366]   [8173d66c]
 ret_from_fork+0x7c/0xb0
 [  288.308373] SOFTIRQ-ON-W at:
 [  288.308376]   [810a6f54]
 __lock_acquire+0x374/0x1c90
 [  288.308384]   [810a8e9e]
 lock_acquire+0x9e/0x140
 [  288.308391]   [8173ae99]
 down_write+0x49/0x80
 [  288.308398]   [811dcd03]
 do_execve_common.isra.25+0x283/0x6e0
 [  288.308406]   [811dd178]
 do_execve+0x18/0x20
 [  288.308412]   [8106ff4e]
 call_usermodehelper+0x11e/0x170
 [  288.308420]   [8173d66c]
 ret_from_fork+0x7c/0xb0
 [  288.308427] SOFTIRQ-ON-R at:
 [  288.308431]   [810a6f54]
 __lock_acquire+0x374/0x1c90
 [  288.308438]   [810a8e9e]
 lock_acquire+0x9e/0x140
 [  288.308445]   [8118d833]
 might_fault+0x93/0xc0
 [  288.308452]   [813b7a80]
 __clear_user+0x20/0x70
 [  288.308458]   [813b7afe]
 clear_user+0x2e/0x40
 [  288.308464]   [8122a4cd] padzero+0x2d/0x40
 [  288.308470]   [8122b0bf]
 load_elf_binary+0x9cf/0x1880
 [  288.308477]   [811db9f0]
 

Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-29 Thread Andrei Mikhailovsky
Ilya, so what is the best action plan now? Should I continue using the kernel 
that you've sent me? I am running production infrastructure and I'm not sure if 
this is the right way forward. 

Do you have a patch by any chance against the LTS kernel that I can use to 
recompile the ceph module? 

Thanks 

- Original Message -

 From: Ilya Dryomov ilya.dryo...@inktank.com
 To: Andrei Mikhailovsky and...@arhont.com
 Cc: ceph-users ceph-users@lists.ceph.com
 Sent: Saturday, 29 November, 2014 8:45:54 AM
 Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

 On Sat, Nov 29, 2014 at 2:13 AM, Andrei Mikhailovsky
 and...@arhont.com wrote:
  Ilya, here is what I got shortly after starting the dd test:
 
 
 
  [ 288.307993]
  [ 288.308004]
  =
  [ 288.308008] [ INFO: possible irq lock inversion dependency
  detected ]
  [ 288.308014] 3.18.0-rc6-ceph-00024-g72ca172 #1 Tainted: G E
  [ 288.308019]
  -
  [ 288.308023] kswapd1/87 just changed the state of lock:
  [ 288.308027] (xfs_dir_ilock_class){-+}, at:
  [a0682d44]
  xfs_ilock+0x134/0x160 [xfs]
  [ 288.308072] but this lock took another, RECLAIM_FS-unsafe lock in
  the
  past:
  [ 288.308076] (mm-mmap_sem){++}
  [ 288.308076]
  [ 288.308076] and interrupts could create inverse lock ordering
  between
  them.
  [ 288.308076]
  [ 288.308084]
  [ 288.308084] other info that might help us debug this:
  [ 288.308089] Possible interrupt unsafe locking scenario:
  [ 288.308089]
  [ 288.308094] CPU0 CPU1
  [ 288.308097]  
  [ 288.308100] lock(mm-mmap_sem);
  [ 288.308104] local_irq_disable();
  [ 288.308109] lock(xfs_dir_ilock_class);
  [ 288.308114] lock(mm-mmap_sem);
  [ 288.308120] Interrupt
  [ 288.308122] lock(xfs_dir_ilock_class);
  [ 288.308127]
  [ 288.308127] *** DEADLOCK ***
  [ 288.308127]
  [ 288.308133] 3 locks held by kswapd1/87:
  [ 288.308136] #0: (shrinker_rwsem){..}, at:
  [8117551f]
  shrink_slab+0x3f/0x140
  [ 288.308151] #1: (type-s_umount_key#27){.+}, at:
  [811d8c14] grab_super_passive+0x44/0x90
  [ 288.308165] #2: (pag-pag_ici_reclaim_lock){+.+...}, at:
  [a067acd4] xfs_reclaim_inodes_ag+0xb4/0x400 [xfs]
  [ 288.308192]
  [ 288.308192] the shortest dependencies between 2nd lock and 1st
  lock:
  [ 288.308206] - (mm-mmap_sem){++} ops: 27039227 {
  [ 288.308214] HARDIRQ-ON-W at:
  [ 288.308218] [810a7209]
  __lock_acquire+0x629/0x1c90
  [ 288.308229] [810a8e9e]
  lock_acquire+0x9e/0x140
  [ 288.308236] [8173ae99]
  down_write+0x49/0x80
  [ 288.308244] [811dcd03]
  do_execve_common.isra.25+0x283/0x6e0
  [ 288.308253] [811dd178]
  do_execve+0x18/0x20
  [ 288.308259] [8106ff4e]
  call_usermodehelper+0x11e/0x170
  [ 288.308269] [8173d66c]
  ret_from_fork+0x7c/0xb0
  [ 288.308276] HARDIRQ-ON-R at:
  [ 288.308280] [810a6f23]
  __lock_acquire+0x343/0x1c90
  [ 288.308287] [810a8e9e]
  lock_acquire+0x9e/0x140
  [ 288.308294] [8118d833]
  might_fault+0x93/0xc0
  [ 288.308304] [813b7a80]
  __clear_user+0x20/0x70
  [ 288.308314] [813b7afe]
  clear_user+0x2e/0x40
  [ 288.308320] [8122a4cd] padzero+0x2d/0x40
  [ 288.308329] [8122b0bf]
  load_elf_binary+0x9cf/0x1880
  [ 288.308336] [811db9f0]
  search_binary_handler+0xa0/0x1e0
  [ 288.308343] [811dcfa2]
  do_execve_common.isra.25+0x522/0x6e0
  [ 288.308351] [811dd178]
  do_execve+0x18/0x20
  [ 288.308358] [8106ff4e]
  call_usermodehelper+0x11e/0x170
  [ 288.308366] [8173d66c]
  ret_from_fork+0x7c/0xb0
  [ 288.308373] SOFTIRQ-ON-W at:
  [ 288.308376] [810a6f54]
  __lock_acquire+0x374/0x1c90
  [ 288.308384] [810a8e9e]
  lock_acquire+0x9e/0x140
  [ 288.308391] [8173ae99]
  down_write+0x49/0x80
  [ 288.308398] [811dcd03]
  do_execve_common.isra.25+0x283/0x6e0
  [ 288.308406] [811dd178]
  do_execve+0x18/0x20
  [ 288.308412] [8106ff4e]
  call_usermodehelper+0x11e/0x170
  [ 288.308420] [8173d66c]
  ret_from_fork+0x7c/0xb0
  [ 288.308427] SOFTIRQ-ON-R at:
  [ 288.308431] [810a6f54]
  __lock_acquire+0x374/0x1c90
  [ 288.308438] [810a8e9e]
  lock_acquire+0x9e/0x140
  [ 288.308445] [8118d833]
  might_fault+0x93/0xc0
  [ 288.308452] [813b7a80]
  __clear_user+0x20/0x70
  [ 288.308458] [813b7afe]
  clear_user+0x2e/0x40
  [ 288.308464] [8122a4cd] padzero+0x2d/0x40
  [ 288.308470] [8122b0bf]
  load_elf_binary+0x9cf/0x1880
  [ 288.308477] [811db9f0]
  search_binary_handler+0xa0/0x1e0
  [ 288.308485] [811dcfa2]
  do_execve_common.isra.25+0x522/0x6e0
  [ 288.308493] [811dd178]
  do_execve+0x18/0x20
  [ 288.308499] [8106ff4e]
  call_usermodehelper+0x11e/0x170
  [ 288.308507] [8173d66c]
  ret_from_fork+0x7c/0xb0
  [ 288.308514] RECLAIM_FS

Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-29 Thread Ilya Dryomov
On Sat, Nov 29, 2014 at 2:33 AM, Andrei Mikhailovsky and...@arhont.com wrote:
 Ilya,

 not sure if the dmesg output in my previous email is related to cephfs, but from
 what I can see it looks good with your kernel. I would have seen hung tasks
 by now, but not anymore. I've run a bunch of concurrent dd tests and also
 the file touch tests and there are no more delays.

 So, it looks like you have nailed the bug!

Great, good to have another data point.


 Do you plan to backport the fix to the 3.16 or 3.17 branches?

That's the tricky part.  Can you try

http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/for-andrei-1/linux-image-3.17.4-ceph-00638-g0f25ebb_3.17.4-ceph-00638-g0f25ebb-1_amd64.deb

?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-29 Thread Andrei Mikhailovsky
Ilya, I will give it a try and get back to you shortly, 

Andrei 

- Original Message -

 From: Ilya Dryomov ilya.dryo...@inktank.com
 To: Andrei Mikhailovsky and...@arhont.com
 Cc: ceph-users ceph-users@lists.ceph.com
 Sent: Saturday, 29 November, 2014 10:40:48 AM
 Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

 On Sat, Nov 29, 2014 at 2:33 AM, Andrei Mikhailovsky
 and...@arhont.com wrote:
  Ilya,
 
  not sure if dmesg output in the previous is related to the cephfs,
  but from
  what I can see it looks good with your kernel. I would have seen
  hang tasks
  by now, but not anymore. I've ran a bunch of concurrent dd tests
  and also
  the file touch tests and there are no more delays.
 
  So, it looks like you have nailed the bug!

 Great, good to have another data point.

 
  Do you plan to backport the fix to the 3.16 or 3.17 branches?

 That's the tricky part. Can you try

 http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/for-andrei-1/linux-image-3.17.4-ceph-00638-g0f25ebb_3.17.4-ceph-00638-g0f25ebb-1_amd64.deb

 ?

 Thanks,

 Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-29 Thread Andrei Mikhailovsky
Ilya, 

I think I spoke too soon in my last message. I've now given it more load 
(running 8 concurrent dds with bs=4M) and about a minute or so after starting 
I've seen problems in the dmesg output. I am attaching the kern.log file for your 
reference. 

Please check starting with the following line: Nov 29 12:07:38 
arh-ibstorage1-ib kernel: [ 3831.906510]. This is when I started the 
8 concurrent dds. 

The command that caused this is: 

time dd if=/dev/zero of=4G00 bs=4M count=5K oflag=direct & time dd if=/dev/zero 
of=4G11 bs=4M count=5K oflag=direct & time dd if=/dev/zero of=4G22 bs=4M 
count=5K oflag=direct & time dd if=/dev/zero of=4G33 bs=4M count=5K oflag=direct 
& time dd if=/dev/zero of=4G44 bs=4M count=5K oflag=direct & time dd 
if=/dev/zero of=4G55 bs=4M count=5K oflag=direct & time dd if=/dev/zero of=4G66 
bs=4M count=5K oflag=direct & time dd if=/dev/zero of=4G77 bs=4M count=5K 
oflag=direct &

I've run the same test about 10 times but with only 4 concurrent dds, and that 
didn't cause the issue. 

Should I try the 3.18 kernel again to see if 8 dds produce similar output? 

Andrei 

- Original Message -

 From: Ilya Dryomov ilya.dryo...@inktank.com
 To: Andrei Mikhailovsky and...@arhont.com
 Cc: ceph-users ceph-users@lists.ceph.com
 Sent: Saturday, 29 November, 2014 10:40:48 AM
 Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

 On Sat, Nov 29, 2014 at 2:33 AM, Andrei Mikhailovsky
 and...@arhont.com wrote:
  Ilya,
 
  not sure if dmesg output in the previous is related to the cephfs,
  but from
  what I can see it looks good with your kernel. I would have seen
  hang tasks
  by now, but not anymore. I've ran a bunch of concurrent dd tests
  and also
  the file touch tests and there are no more delays.
 
  So, it looks like you have nailed the bug!

 Great, good to have another data point.

 
  Do you plan to backport the fix to the 3.16 or 3.17 branches?

 That's the tricky part. Can you try

 http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/for-andrei-1/linux-image-3.17.4-ceph-00638-g0f25ebb_3.17.4-ceph-00638-g0f25ebb-1_amd64.deb

 ?

 Thanks,

 Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-29 Thread Ilya Dryomov
On Sat, Nov 29, 2014 at 3:10 PM, Andrei Mikhailovsky and...@arhont.com wrote:
 Ilya,

 The 3.17.4 kernel that you've given me is also good so far. No hung tasks as
 seen before. However, I do have the same message in dmesg as with the 3.18
 kernel that you've sent. I've not seen this message in the past while using
 kernel versions 3.2 onwards.

 Not really sure if this message should be treated as alarming.

If you are referring to the xfs lockdep splat, the reason you haven't
seen it in the past may be that lockdep just wasn't enabled on your
kernels - most distro kernels don't enable it.
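
A quick way to check is the running kernel's config (a sketch; the config file
location varies by distro, and some kernels expose it as /proc/config.gz
instead):

# lockdep splats need these options, which most distro kernels leave off
grep -E 'CONFIG_LOCKDEP|CONFIG_PROVE_LOCKING' /boot/config-$(uname -r)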

I wouldn't treat it as alarming but I'd report it to xfs lists if it
hasn't been reported there yet.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-29 Thread Ilya Dryomov
On Sat, Nov 29, 2014 at 3:22 PM, Andrei Mikhailovsky and...@arhont.com wrote:
 Ilya,

 I think I spoke too soon in my last message. I've now given it more load
 (running 8 concurrent dds with bs=4M) and about a minute or so after
 starting I've seen problems in the dmesg output. I am attaching the kern.log file
 for your reference.

 Please check starting with the following line: Nov 29 12:07:38
 arh-ibstorage1-ib kernel: [ 3831.906510]. This is when I started the
 8 concurrent dds.

 The command that caused this is:

 time dd if=/dev/zero of=4G00 bs=4M count=5K oflag=direct & time dd
 if=/dev/zero of=4G11 bs=4M count=5K oflag=direct & time dd if=/dev/zero
 of=4G22 bs=4M count=5K oflag=direct & time dd if=/dev/zero of=4G33 bs=4M
 count=5K oflag=direct & time dd if=/dev/zero of=4G44 bs=4M count=5K
 oflag=direct & time dd if=/dev/zero of=4G55 bs=4M count=5K oflag=direct &
 time dd if=/dev/zero of=4G66 bs=4M count=5K oflag=direct & time dd
 if=/dev/zero of=4G77 bs=4M count=5K oflag=direct &

 I've run the same test about 10 times but with only 4 concurrent dds and
 that didn't cause the issue.

 Should I try the 3.18 kernel again to see if 8 dds produce similar output?

Missing attachment.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-29 Thread Ilya Dryomov
On Sat, Nov 29, 2014 at 3:49 PM, Ilya Dryomov ilya.dryo...@inktank.com wrote:
 On Sat, Nov 29, 2014 at 3:22 PM, Andrei Mikhailovsky and...@arhont.com 
 wrote:
 Ilya,

 I think i spoke too soon in my last message. I've not given it more load
 (running 8 concurrent dds with bs=4M) and about a minute or so after
 starting i've seen problems in dmesg output. I am attaching kern.log file
 for you reference.

 Please check starting with the following line: Nov 29 12:07:38
 arh-ibstorage1-ib kernel: [ 3831.906510]. This is when I've started the
 concurrent 8 dds.

 The command that caused this is:

 time dd if=/dev/zero of=4G00 bs=4M count=5K oflag=direct & time dd
 if=/dev/zero of=4G11 bs=4M count=5K oflag=direct & time dd if=/dev/zero
 of=4G22 bs=4M count=5K oflag=direct & time dd if=/dev/zero of=4G33 bs=4M
 count=5K oflag=direct & time dd if=/dev/zero of=4G44 bs=4M count=5K
 oflag=direct & time dd if=/dev/zero of=4G55 bs=4M count=5K oflag=direct &
 time dd if=/dev/zero of=4G66 bs=4M count=5K oflag=direct & time dd
 if=/dev/zero of=4G77 bs=4M count=5K oflag=direct &

 I've ran the same test about 10 times but with only 4 concurrent dds and
 that didn't cause the issue.

 Should I try the 3.18 kernel again to see if 8dds produce similar output?

 Missing attachment.

Definitely try the 3.18 testing kernel.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-29 Thread Gregory Farnum
Ilya, do you have a ticket reference for the bug?
Andrei, we run NFS tests on CephFS in our nightlies and it does pretty well
so in the general case we expect it to work. Obviously not at the moment
with whatever bug Ilya is looking at, though. ;)
-Greg
On Sat, Nov 29, 2014 at 4:51 AM Ilya Dryomov ilya.dryo...@inktank.com
wrote:

 On Sat, Nov 29, 2014 at 3:49 PM, Ilya Dryomov ilya.dryo...@inktank.com
 wrote:
  On Sat, Nov 29, 2014 at 3:22 PM, Andrei Mikhailovsky and...@arhont.com
 wrote:
  Ilya,
 
  I think i spoke too soon in my last message. I've not given it more load
  (running 8 concurrent dds with bs=4M) and about a minute or so after
  starting i've seen problems in dmesg output. I am attaching kern.log
 file
  for you reference.
 
  Please check starting with the following line: Nov 29 12:07:38
  arh-ibstorage1-ib kernel: [ 3831.906510]. This is when I've started the
  concurrent 8 dds.
 
  The command that caused this is:
 
  time dd if=/dev/zero of=4G00 bs=4M count=5K oflag=direct & time dd
  if=/dev/zero of=4G11 bs=4M count=5K oflag=direct & time dd if=/dev/zero
  of=4G22 bs=4M count=5K oflag=direct & time dd if=/dev/zero of=4G33 bs=4M
  count=5K oflag=direct & time dd if=/dev/zero of=4G44 bs=4M count=5K
  oflag=direct & time dd if=/dev/zero of=4G55 bs=4M count=5K oflag=direct &
  time dd if=/dev/zero of=4G66 bs=4M count=5K oflag=direct & time dd
  if=/dev/zero of=4G77 bs=4M count=5K oflag=direct &
 
  I've ran the same test about 10 times but with only 4 concurrent dds and
  that didn't cause the issue.
 
  Should I try the 3.18 kernel again to see if 8dds produce similar
 output?
 
  Missing attachment.

 Definitely try the 3.18 testing kernel.

 Thanks,

 Ilya
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-29 Thread Ilya Dryomov
On Sun, Nov 30, 2014 at 1:19 AM, Gregory Farnum g...@gregs42.com wrote:
 Ilya, do you have a ticket reference for the bug?

Opened a ticket, assigned to myself.

http://tracker.ceph.com/issues/10208

 Andrei, we run NFS tests on CephFS in our nightlies and it does pretty well
 so in the general case we expect it to work. Obviously not at the moment
 with whatever bug Ilya is looking at, though. ;)

This is most probably a libceph issue - both krbd and kcephfs are
affected.  I've been tracking it under the general I/O hang umbrella,
which spreads over a couple of existing tickets.  Definitely not a
nfs-on-cephfs problem, Greg.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-28 Thread Andrei Mikhailovsky
I am also noticing some delays working with nfs over cephfs, especially when 
making an initial connection. For instance, I run the following: 

# time for i in {0..10} ; do time touch /tmp/cephfs/test-$i ; done 

where /tmp/cephfs is the nfs mount point running over cephfs 

I am noticing that the first file is created only after about 20-30 seconds. 
All the following files are created with no delay. 

If I run the command once again, all files are created pretty quickly without 
any delay. However, if I wait 20-30 minutes and run the command again, the 
delay with the first file is back again. 
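
The delay is easiest to see by timing just two touches back to back once the 
mount has sat idle for a while (a sketch; /tmp/cephfs is the nfs mount point as 
above and the file names are throwaway): 

# after the mount has been idle for 20-30 minutes
time touch /tmp/cephfs/cold-test   # first metadata op takes the 20-30s hit
time touch /tmp/cephfs/warm-test   # run right after, completes immediately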

Has anyone experienced similar issues? 

Andrei 
- Original Message -

 From: Andrei Mikhailovsky and...@arhont.com
 To: ceph-users ceph-users@lists.ceph.com
 Sent: Friday, 28 November, 2014 9:08:17 AM
 Subject: [ceph-users] Giant + nfs over cephfs hang tasks

 Hello guys,

 I've got a bunch of hang tasks of the nfsd service running over the
 cephfs (kernel) mounted file system. Here is an example of one of
 them.

 [433079.991218] INFO: task nfsd:32625 blocked for more than 120
 seconds.
 [433080.029685] Not tainted 3.15.10-031510-generic #201408132333
 [433080.068036] echo 0 > /proc/sys/kernel/hung_task_timeout_secs
 disables this message.
 [433080.144235] nfsd D 000a 0 32625 2 0x
 [433080.144241] 8801a94dba78 0002 8801a94dba38
 8801a94dbfd8
 [433080.144244] 00014540 00014540 880673d63260
 880491d264c0
 [433080.144247] 8801a94dba78 88067fd14e40 880491d264c0
 8115dff0
 [433080.144250] Call Trace:
 [433080.144260] [8115dff0] ? __lock_page+0x70/0x70
 [433080.144274] [81778449] schedule+0x29/0x70
 [433080.144279] [8177851f] io_schedule+0x8f/0xd0
 [433080.144282] [8115dffe] sleep_on_page+0xe/0x20
 [433080.144286] [81778be2] __wait_on_bit+0x62/0x90
 [433080.144288] [8115eacb] ? find_get_pages_tag+0xcb/0x170
 [433080.144291] [8115e160] wait_on_page_bit+0x80/0x90
 [433080.144296] [810b54a0] ?
 wake_atomic_t_function+0x40/0x40
 [433080.144299] [8115e334]
 filemap_fdatawait_range+0xf4/0x180
 [433080.144302] [8116027d]
 filemap_write_and_wait_range+0x4d/0x80
 [433080.144315] [a06bf1b8] ceph_fsync+0x58/0x200 [ceph]
 [433080.144330] [813308f5] ? ima_file_check+0x35/0x40
 [433080.144337] [812028c8] vfs_fsync_range+0x18/0x30
 [433080.144352] [a03ee491] nfsd_commit+0xb1/0xd0 [nfsd]
 [433080.144363] [a03fb787] nfsd4_commit+0x57/0x60 [nfsd]
 [433080.144370] [a03fcf9e] nfsd4_proc_compound+0x54e/0x740
 [nfsd]
 [433080.144377] [a03e8e05] nfsd_dispatch+0xe5/0x230 [nfsd]
 [433080.144401] [a03205a5] svc_process_common+0x345/0x680
 [sunrpc]
 [433080.144413] [a0320c33] svc_process+0x103/0x160 [sunrpc]
 [433080.144418] [a03e895f] nfsd+0xbf/0x130 [nfsd]
 [433080.144424] [a03e88a0] ? nfsd_destroy+0x80/0x80 [nfsd]
 [433080.144428] [81091439] kthread+0xc9/0xe0
 [433080.144431] [81091370] ? flush_kthread_worker+0xb0/0xb0
 [433080.144434] [8178567c] ret_from_fork+0x7c/0xb0
 [433080.144437] [81091370] ? flush_kthread_worker+0xb0/0xb0

 I am using Ubuntu 12.04 servers with 3.15.10 kernel and ceph Giant.

 Thanks

 Andrei

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-28 Thread Andrei Mikhailovsky
I've just tried the latest ubuntu-vivid kernel and I'm also seeing hang tasks 
with dd tests: 

[ 3721.026421] INFO: task nfsd:16596 blocked for more than 120 seconds. 
[ 3721.065141] Not tainted 3.17.4-031704-generic #201411211317 
[ 3721.103721] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this 
message. 
[ 3721.180409] nfsd D 0009 0 16596 2 0x 
[ 3721.180412] 88006f0cbc18 0046 88006f0cbbc8 
8109677f 
[ 3721.180414] 88006f0cbfd8 000145c0 88067089f700 
000145c0 
[ 3721.180417] 880673dac600 88045adfbc00 8801bdf8be40 
88000b841500 
[ 3721.180420] Call Trace: 
[ 3721.180423] [8109677f] ? set_groups+0x2f/0x60 
[ 3721.180427] [817a20c9] schedule+0x29/0x70 
[ 3721.180440] [817a23ee] schedule_preempt_disabled+0xe/0x10 
[ 3721.180443] [817a429d] __mutex_lock_slowpath+0xcd/0x1d0 
[ 3721.180447] [817a43c3] mutex_lock+0x23/0x37 
[ 3721.180454] [c071cadd] nfsd_setattr+0x15d/0x2a0 [nfsd] 
[ 3721.180460] [c0727d2e] nfsd4_setattr+0x14e/0x180 [nfsd] 
[ 3721.180467] [c0729eac] nfsd4_proc_compound+0x4cc/0x730 [nfsd] 
[ 3721.180478] [c0715e55] nfsd_dispatch+0xe5/0x230 [nfsd] 
[ 3721.180491] [c05b9882] ? svc_tcp_adjust_wspace+0x12/0x30 [sunrpc] 
[ 3721.180500] [c05b8694] svc_process_common+0x324/0x680 [sunrpc] 
[ 3721.180510] [c05b8d43] svc_process+0x103/0x160 [sunrpc] 
[ 3721.180516] [c07159c7] nfsd+0x117/0x190 [nfsd] 
[ 3721.180526] [c07158b0] ? nfsd_destroy+0x80/0x80 [nfsd] 
[ 3721.180528] [81093359] kthread+0xc9/0xe0 
[ 3721.180533] [81093290] ? flush_kthread_worker+0x90/0x90 
[ 3721.180536] [817a64bc] ret_from_fork+0x7c/0xb0 
[ 3721.180540] [81093290] ? flush_kthread_worker+0x90/0x90 
[ 3721.180577] INFO: task kworker/2:3:28061 blocked for more than 120 seconds. 
[ 3721.221450] Not tainted 3.17.4-031704-generic #201411211317 
[ 3721.261440] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this 
message. 
[ 3721.341394] kworker/2:3 D 0002 0 28061 2 0x 
[ 3721.341408] Workqueue: ceph-trunc ceph_vmtruncate_work [ceph] 
[ 3721.341409] 8805a6507b08 0046 8805a6507b88 
ea00040e8c80 
[ 3721.341412] 8805a6507fd8 000145c0 88067089b480 
000145c0 
[ 3721.341414] 8801fffec600 880102535a00 88000b8415a8 
88046fc94ec0 
[ 3721.341417] Call Trace: 
[ 3721.341421] [817a2970] ? bit_wait+0x50/0x50 
[ 3721.341424] [817a20c9] schedule+0x29/0x70 
[ 3721.341427] [817a219f] io_schedule+0x8f/0xd0 
[ 3721.341430] [817a299b] bit_wait_io+0x2b/0x50 
[ 3721.341433] [817a2656] __wait_on_bit_lock+0x76/0xb0 
[ 3721.341438] [811756b5] ? find_get_entries+0xe5/0x160 
[ 3721.341440] [8117245e] __lock_page+0xae/0xb0 
[ 3721.341446] [810b3fd0] ? wake_atomic_t_function+0x40/0x40 
[ 3721.341451] [81183226] truncate_inode_pages_range+0x446/0x700 
[ 3721.341455] [81183565] truncate_inode_pages+0x15/0x20 
[ 3721.341457] [811835bc] truncate_pagecache+0x4c/0x70 
[ 3721.341464] [c09f815e] __ceph_do_pending_vmtruncate+0xde/0x230 
[ceph] 
[ 3721.341470] [c09f8c73] ceph_vmtruncate_work+0x23/0x50 [ceph] 
[ 3721.341476] [8108cece] process_one_work+0x14e/0x460 
[ 3721.341479] [8108d84b] worker_thread+0x11b/0x3f0 
[ 3721.341482] [8108d730] ? create_worker+0x1e0/0x1e0 
[ 3721.341485] [81093359] kthread+0xc9/0xe0 
[ 3721.341487] [81093290] ? flush_kthread_worker+0x90/0x90 
[ 3721.341490] [817a64bc] ret_from_fork+0x7c/0xb0 
[ 3721.341492] [81093290] ? flush_kthread_worker+0x90/0x90 

They do not happen with every dd test, but happen pretty often, especially when 
I am running a few dd tests concurrently. 

An example test that generated the hang tasks above after just 2 runs: 

# dd if=/dev/zero of=/tmp/cephfs/4G bs=1M count=4K oflag=direct & dd 
if=/dev/zero of=/tmp/cephfs/4G1 bs=1M count=4K oflag=direct & dd if=/dev/zero 
of=/tmp/cephfs/4G2 bs=1M count=4K oflag=direct & dd if=/dev/zero 
of=/tmp/cephfs/4G3 bs=1M count=4K oflag=direct &

Cheers 

- Original Message -

 From: Andrei Mikhailovsky and...@arhont.com
 To: ceph-users ceph-users@lists.ceph.com
 Sent: Friday, 28 November, 2014 11:22:07 AM
 Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

 I am also noticing some delays working with nfs over cephfs.
 Especially when making an initial connection. For instance, I run
 the following:

 # time for i in {0..10} ; do time touch /tmp/cephfs/test-$i ; done

 where /tmp/cephfs is the nfs mount point running over cephfs

 I am noticing that the first touch file is created after about 20-30
 seconds. All the following files files are created with no delay.

 If I run the command once again, all files are created pretty quickly
 without any delay. However, if I wait 20-30 minutes

Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-28 Thread Andrei Mikhailovsky
I've done some tests using ceph-fuse and it looks far more stable. I've not 
experienced any issues so far with a ceph-fuse mount point over nfs. Will do 
more stress testing and update. 

Is anyone else experiencing issues with hang tasks using the ceph kernel module 
mount method? 

Thanks 
- Original Message -

 From: Andrei Mikhailovsky and...@arhont.com
 To: ceph-users ceph-users@lists.ceph.com
 Sent: Friday, 28 November, 2014 12:02:57 PM
 Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

 I've just tried the latest ubuntu-vivid kernel and also seeing hang
 tasks with dd tests:

 [ 3721.026421] INFO: task nfsd:16596 blocked for more than 120
 seconds.
 [ 3721.065141] Not tainted 3.17.4-031704-generic #201411211317
 [ 3721.103721] echo 0 > /proc/sys/kernel/hung_task_timeout_secs
 disables this message.
 [ 3721.180409] nfsd D 0009 0 16596 2 0x
 [ 3721.180412] 88006f0cbc18 0046 88006f0cbbc8
 8109677f
 [ 3721.180414] 88006f0cbfd8 000145c0 88067089f700
 000145c0
 [ 3721.180417] 880673dac600 88045adfbc00 8801bdf8be40
 88000b841500
 [ 3721.180420] Call Trace:
 [ 3721.180423] [8109677f] ? set_groups+0x2f/0x60
 [ 3721.180427] [817a20c9] schedule+0x29/0x70
 [ 3721.180440] [817a23ee]
 schedule_preempt_disabled+0xe/0x10
 [ 3721.180443] [817a429d] __mutex_lock_slowpath+0xcd/0x1d0
 [ 3721.180447] [817a43c3] mutex_lock+0x23/0x37
 [ 3721.180454] [c071cadd] nfsd_setattr+0x15d/0x2a0 [nfsd]
 [ 3721.180460] [c0727d2e] nfsd4_setattr+0x14e/0x180 [nfsd]
 [ 3721.180467] [c0729eac] nfsd4_proc_compound+0x4cc/0x730
 [nfsd]
 [ 3721.180478] [c0715e55] nfsd_dispatch+0xe5/0x230 [nfsd]
 [ 3721.180491] [c05b9882] ? svc_tcp_adjust_wspace+0x12/0x30
 [sunrpc]
 [ 3721.180500] [c05b8694] svc_process_common+0x324/0x680
 [sunrpc]
 [ 3721.180510] [c05b8d43] svc_process+0x103/0x160 [sunrpc]
 [ 3721.180516] [c07159c7] nfsd+0x117/0x190 [nfsd]
 [ 3721.180526] [c07158b0] ? nfsd_destroy+0x80/0x80 [nfsd]
 [ 3721.180528] [81093359] kthread+0xc9/0xe0
 [ 3721.180533] [81093290] ? flush_kthread_worker+0x90/0x90
 [ 3721.180536] [817a64bc] ret_from_fork+0x7c/0xb0
 [ 3721.180540] [81093290] ? flush_kthread_worker+0x90/0x90
 [ 3721.180577] INFO: task kworker/2:3:28061 blocked for more than 120
 seconds.
 [ 3721.221450] Not tainted 3.17.4-031704-generic #201411211317
 [ 3721.261440] echo 0 > /proc/sys/kernel/hung_task_timeout_secs
 disables this message.
 [ 3721.341394] kworker/2:3 D 0002 0 28061 2 0x
 [ 3721.341408] Workqueue: ceph-trunc ceph_vmtruncate_work [ceph]
 [ 3721.341409] 8805a6507b08 0046 8805a6507b88
 ea00040e8c80
 [ 3721.341412] 8805a6507fd8 000145c0 88067089b480
 000145c0
 [ 3721.341414] 8801fffec600 880102535a00 88000b8415a8
 88046fc94ec0
 [ 3721.341417] Call Trace:
 [ 3721.341421] [817a2970] ? bit_wait+0x50/0x50
 [ 3721.341424] [817a20c9] schedule+0x29/0x70
 [ 3721.341427] [817a219f] io_schedule+0x8f/0xd0
 [ 3721.341430] [817a299b] bit_wait_io+0x2b/0x50
 [ 3721.341433] [817a2656] __wait_on_bit_lock+0x76/0xb0
 [ 3721.341438] [811756b5] ? find_get_entries+0xe5/0x160
 [ 3721.341440] [8117245e] __lock_page+0xae/0xb0
 [ 3721.341446] [810b3fd0] ?
 wake_atomic_t_function+0x40/0x40
 [ 3721.341451] [81183226]
 truncate_inode_pages_range+0x446/0x700
 [ 3721.341455] [81183565] truncate_inode_pages+0x15/0x20
 [ 3721.341457] [811835bc] truncate_pagecache+0x4c/0x70
 [ 3721.341464] [c09f815e]
 __ceph_do_pending_vmtruncate+0xde/0x230 [ceph]
 [ 3721.341470] [c09f8c73] ceph_vmtruncate_work+0x23/0x50
 [ceph]
 [ 3721.341476] [8108cece] process_one_work+0x14e/0x460
 [ 3721.341479] [8108d84b] worker_thread+0x11b/0x3f0
 [ 3721.341482] [8108d730] ? create_worker+0x1e0/0x1e0
 [ 3721.341485] [81093359] kthread+0xc9/0xe0
 [ 3721.341487] [81093290] ? flush_kthread_worker+0x90/0x90
 [ 3721.341490] [817a64bc] ret_from_fork+0x7c/0xb0
 [ 3721.341492] [81093290] ? flush_kthread_worker+0x90/0x90

 They do not happen with every dd test, but happen pretty often.
 Especially when I am running a few dd tests concurrently.

 An example test that generated hang tasks above after just 2 runs:

 # dd if=/dev/zero of=/tmp/cephfs/4G bs=1M count=4K oflag=direct & dd
 if=/dev/zero of=/tmp/cephfs/4G1 bs=1M count=4K oflag=direct & dd
 if=/dev/zero of=/tmp/cephfs/4G2 bs=1M count=4K oflag=direct & dd
 if=/dev/zero of=/tmp/cephfs/4G3 bs=1M count=4K oflag=direct &

 Cheers

 - Original Message -

  From: Andrei Mikhailovsky and...@arhont.com
 
  To: ceph-users ceph-users@lists.ceph.com
 
  Sent: Friday, 28 November, 2014 11:22:07 AM
 
  Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks
 

  I am also

Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-28 Thread Ilya Dryomov
On Fri, Nov 28, 2014 at 3:02 PM, Andrei Mikhailovsky and...@arhont.com wrote:
 I've just tried the latest ubuntu-vivid kernel and also seeing hang tasks
 with dd tests:


 [ 3721.026421] INFO: task nfsd:16596 blocked for more than 120 seconds.
 [ 3721.065141]   Not tainted 3.17.4-031704-generic #201411211317
 [ 3721.103721] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables
 this message.
 [ 3721.180409] nfsdD 0009 0 16596  2
 0x
 [ 3721.180412]  88006f0cbc18 0046 88006f0cbbc8
 8109677f
 [ 3721.180414]  88006f0cbfd8 000145c0 88067089f700
 000145c0
 [ 3721.180417]  880673dac600 88045adfbc00 8801bdf8be40
 88000b841500
 [ 3721.180420] Call Trace:
 [ 3721.180423]  [8109677f] ? set_groups+0x2f/0x60
 [ 3721.180427]  [817a20c9] schedule+0x29/0x70
 [ 3721.180440]  [817a23ee] schedule_preempt_disabled+0xe/0x10
 [ 3721.180443]  [817a429d] __mutex_lock_slowpath+0xcd/0x1d0
 [ 3721.180447]  [817a43c3] mutex_lock+0x23/0x37
 [ 3721.180454]  [c071cadd] nfsd_setattr+0x15d/0x2a0 [nfsd]
 [ 3721.180460]  [c0727d2e] nfsd4_setattr+0x14e/0x180 [nfsd]
 [ 3721.180467]  [c0729eac] nfsd4_proc_compound+0x4cc/0x730 [nfsd]
 [ 3721.180478]  [c0715e55] nfsd_dispatch+0xe5/0x230 [nfsd]
 [ 3721.180491]  [c05b9882] ? svc_tcp_adjust_wspace+0x12/0x30
 [sunrpc]
 [ 3721.180500]  [c05b8694] svc_process_common+0x324/0x680 [sunrpc]
 [ 3721.180510]  [c05b8d43] svc_process+0x103/0x160 [sunrpc]
 [ 3721.180516]  [c07159c7] nfsd+0x117/0x190 [nfsd]
 [ 3721.180526]  [c07158b0] ? nfsd_destroy+0x80/0x80 [nfsd]
 [ 3721.180528]  [81093359] kthread+0xc9/0xe0
 [ 3721.180533]  [81093290] ? flush_kthread_worker+0x90/0x90
 [ 3721.180536]  [817a64bc] ret_from_fork+0x7c/0xb0
 [ 3721.180540]  [81093290] ? flush_kthread_worker+0x90/0x90
 [ 3721.180577] INFO: task kworker/2:3:28061 blocked for more than 120
 seconds.
 [ 3721.221450]   Not tainted 3.17.4-031704-generic #201411211317
 [ 3721.261440] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables
 this message.
 [ 3721.341394] kworker/2:3 D 0002 0 28061  2
 0x
 [ 3721.341408] Workqueue: ceph-trunc ceph_vmtruncate_work [ceph]
 [ 3721.341409]  8805a6507b08 0046 8805a6507b88
 ea00040e8c80
 [ 3721.341412]  8805a6507fd8 000145c0 88067089b480
 000145c0
 [ 3721.341414]  8801fffec600 880102535a00 88000b8415a8
 88046fc94ec0
 [ 3721.341417] Call Trace:
 [ 3721.341421]  [817a2970] ? bit_wait+0x50/0x50
 [ 3721.341424]  [817a20c9] schedule+0x29/0x70
 [ 3721.341427]  [817a219f] io_schedule+0x8f/0xd0
 [ 3721.341430]  [817a299b] bit_wait_io+0x2b/0x50
 [ 3721.341433]  [817a2656] __wait_on_bit_lock+0x76/0xb0
 [ 3721.341438]  [811756b5] ? find_get_entries+0xe5/0x160
 [ 3721.341440]  [8117245e] __lock_page+0xae/0xb0
 [ 3721.341446]  [810b3fd0] ? wake_atomic_t_function+0x40/0x40
 [ 3721.341451]  [81183226] truncate_inode_pages_range+0x446/0x700
 [ 3721.341455]  [81183565] truncate_inode_pages+0x15/0x20
 [ 3721.341457]  [811835bc] truncate_pagecache+0x4c/0x70
 [ 3721.341464]  [c09f815e] __ceph_do_pending_vmtruncate+0xde/0x230
 [ceph]
 [ 3721.341470]  [c09f8c73] ceph_vmtruncate_work+0x23/0x50 [ceph]
 [ 3721.341476]  [8108cece] process_one_work+0x14e/0x460
 [ 3721.341479]  [8108d84b] worker_thread+0x11b/0x3f0
 [ 3721.341482]  [8108d730] ? create_worker+0x1e0/0x1e0
 [ 3721.341485]  [81093359] kthread+0xc9/0xe0
 [ 3721.341487]  [81093290] ? flush_kthread_worker+0x90/0x90
 [ 3721.341490]  [817a64bc] ret_from_fork+0x7c/0xb0
 [ 3721.341492]  [81093290] ? flush_kthread_worker+0x90/0x90


 They do not happen with every dd test, but happen pretty often. Especially
 when I am running a few dd tests concurrently.

 An example test that generated hang tasks above after just 2 runs:

 # dd if=/dev/zero of=/tmp/cephfs/4G bs=1M count=4K oflag=direct & dd
 if=/dev/zero of=/tmp/cephfs/4G1 bs=1M count=4K oflag=direct & dd
 if=/dev/zero of=/tmp/cephfs/4G2 bs=1M count=4K oflag=direct & dd
 if=/dev/zero of=/tmp/cephfs/4G3 bs=1M count=4K oflag=direct &

 Cheers

 

 From: Andrei Mikhailovsky and...@arhont.com
 To: ceph-users ceph-users@lists.ceph.com
 Sent: Friday, 28 November, 2014 11:22:07 AM
 Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks


 I am also noticing some delays working with nfs over cephfs. Especially when
 making an initial connection. For instance, I run the following:

 # time for i in {0..10} ; do time touch /tmp/cephfs/test-$i ; done

 where /tmp/cephfs is the nfs mount point running over cephfs

 I am noticing that the first touch file is created after about 20-30
 seconds. All

Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-28 Thread Andrei Mikhailovsky
Ilya, yes I do! Like these, from different osds: 

[ 4422.212204] libceph: osd13 192.168.168.201:6819 socket closed (con state 
OPEN) 
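
To pull out every occurrence since boot, something like this is enough (a 
sketch; the log path is the Ubuntu default, adjust if yours differs): 

# libceph messages end up in the kernel log
grep libceph /var/log/kern.log
dmesg | grep libceph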

Andrei 

- Original Message -

 From: Ilya Dryomov ilya.dryo...@inktank.com
 To: Andrei Mikhailovsky and...@arhont.com
 Cc: ceph-users ceph-users@lists.ceph.com
 Sent: Friday, 28 November, 2014 4:58:41 PM
 Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

 On Fri, Nov 28, 2014 at 3:02 PM, Andrei Mikhailovsky
 and...@arhont.com wrote:
  I've just tried the latest ubuntu-vivid kernel and also seeing hang
  tasks
  with dd tests:
 
 
  [ 3721.026421] INFO: task nfsd:16596 blocked for more than 120
  seconds.
  [ 3721.065141] Not tainted 3.17.4-031704-generic #201411211317
   [ 3721.103721] echo 0 > /proc/sys/kernel/hung_task_timeout_secs
  disables
  this message.
  [ 3721.180409] nfsd D 0009 0 16596 2
  0x
  [ 3721.180412] 88006f0cbc18 0046 88006f0cbbc8
  8109677f
  [ 3721.180414] 88006f0cbfd8 000145c0 88067089f700
  000145c0
  [ 3721.180417] 880673dac600 88045adfbc00 8801bdf8be40
  88000b841500
  [ 3721.180420] Call Trace:
  [ 3721.180423] [8109677f] ? set_groups+0x2f/0x60
  [ 3721.180427] [817a20c9] schedule+0x29/0x70
  [ 3721.180440] [817a23ee]
  schedule_preempt_disabled+0xe/0x10
  [ 3721.180443] [817a429d]
  __mutex_lock_slowpath+0xcd/0x1d0
  [ 3721.180447] [817a43c3] mutex_lock+0x23/0x37
  [ 3721.180454] [c071cadd] nfsd_setattr+0x15d/0x2a0 [nfsd]
  [ 3721.180460] [c0727d2e] nfsd4_setattr+0x14e/0x180
  [nfsd]
  [ 3721.180467] [c0729eac] nfsd4_proc_compound+0x4cc/0x730
  [nfsd]
  [ 3721.180478] [c0715e55] nfsd_dispatch+0xe5/0x230 [nfsd]
  [ 3721.180491] [c05b9882] ?
  svc_tcp_adjust_wspace+0x12/0x30
  [sunrpc]
  [ 3721.180500] [c05b8694] svc_process_common+0x324/0x680
  [sunrpc]
  [ 3721.180510] [c05b8d43] svc_process+0x103/0x160
  [sunrpc]
  [ 3721.180516] [c07159c7] nfsd+0x117/0x190 [nfsd]
  [ 3721.180526] [c07158b0] ? nfsd_destroy+0x80/0x80 [nfsd]
  [ 3721.180528] [81093359] kthread+0xc9/0xe0
  [ 3721.180533] [81093290] ?
  flush_kthread_worker+0x90/0x90
  [ 3721.180536] [817a64bc] ret_from_fork+0x7c/0xb0
  [ 3721.180540] [81093290] ?
  flush_kthread_worker+0x90/0x90
  [ 3721.180577] INFO: task kworker/2:3:28061 blocked for more than
  120
  seconds.
  [ 3721.221450] Not tainted 3.17.4-031704-generic #201411211317
   [ 3721.261440] echo 0 > /proc/sys/kernel/hung_task_timeout_secs
  disables
  this message.
  [ 3721.341394] kworker/2:3 D 0002 0 28061 2
  0x
  [ 3721.341408] Workqueue: ceph-trunc ceph_vmtruncate_work [ceph]
  [ 3721.341409] 8805a6507b08 0046 8805a6507b88
  ea00040e8c80
  [ 3721.341412] 8805a6507fd8 000145c0 88067089b480
  000145c0
  [ 3721.341414] 8801fffec600 880102535a00 88000b8415a8
  88046fc94ec0
  [ 3721.341417] Call Trace:
  [ 3721.341421] [817a2970] ? bit_wait+0x50/0x50
  [ 3721.341424] [817a20c9] schedule+0x29/0x70
  [ 3721.341427] [817a219f] io_schedule+0x8f/0xd0
  [ 3721.341430] [817a299b] bit_wait_io+0x2b/0x50
  [ 3721.341433] [817a2656] __wait_on_bit_lock+0x76/0xb0
  [ 3721.341438] [811756b5] ? find_get_entries+0xe5/0x160
  [ 3721.341440] [8117245e] __lock_page+0xae/0xb0
  [ 3721.341446] [810b3fd0] ?
  wake_atomic_t_function+0x40/0x40
  [ 3721.341451] [81183226]
  truncate_inode_pages_range+0x446/0x700
  [ 3721.341455] [81183565] truncate_inode_pages+0x15/0x20
  [ 3721.341457] [811835bc] truncate_pagecache+0x4c/0x70
  [ 3721.341464] [c09f815e]
  __ceph_do_pending_vmtruncate+0xde/0x230
  [ceph]
  [ 3721.341470] [c09f8c73] ceph_vmtruncate_work+0x23/0x50
  [ceph]
  [ 3721.341476] [8108cece] process_one_work+0x14e/0x460
  [ 3721.341479] [8108d84b] worker_thread+0x11b/0x3f0
  [ 3721.341482] [8108d730] ? create_worker+0x1e0/0x1e0
  [ 3721.341485] [81093359] kthread+0xc9/0xe0
  [ 3721.341487] [81093290] ?
  flush_kthread_worker+0x90/0x90
  [ 3721.341490] [817a64bc] ret_from_fork+0x7c/0xb0
  [ 3721.341492] [81093290] ?
  flush_kthread_worker+0x90/0x90
 
 
  They do not happen with every dd test, but happen pretty often.
  Especially
  when I am running a few dd tests concurrently.
 
  An example test that generated hang tasks above after just 2 runs:
 
   # dd if=/dev/zero of=/tmp/cephfs/4G bs=1M count=4K oflag=direct & dd
   if=/dev/zero of=/tmp/cephfs/4G1 bs=1M count=4K oflag=direct & dd
   if=/dev/zero of=/tmp/cephfs/4G2 bs=1M count=4K oflag=direct & dd
   if=/dev/zero of=/tmp/cephfs/4G3 bs=1M count=4K oflag=direct &
 
  Cheers
 
  
 
  From: Andrei Mikhailovsky and...@arhont.com
  To: ceph-users ceph-users@lists.ceph.com
  Sent: Friday

Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-28 Thread Ilya Dryomov
On Fri, Nov 28, 2014 at 8:13 PM, Andrei Mikhailovsky and...@arhont.com wrote:
 Ilya, yes I do! LIke these from different osds:

 [ 4422.212204] libceph: osd13 192.168.168.201:6819 socket closed (con state
 OPEN)

Can you by any chance try a kernel from [1]?  It's based on an Ubuntu
config and unless you are doing something fancy it should boot your box.
You have to install it only on the client box, of course.

This may be related to the bug I'm currently trying to nail down and
I'd like to know if the latest bits make any difference.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-28 Thread Ilya Dryomov
On Fri, Nov 28, 2014 at 8:19 PM, Ilya Dryomov ilya.dryo...@inktank.com wrote:
 On Fri, Nov 28, 2014 at 8:13 PM, Andrei Mikhailovsky and...@arhont.com 
 wrote:
 Ilya, yes I do! LIke these from different osds:

 [ 4422.212204] libceph: osd13 192.168.168.201:6819 socket closed (con state
 OPEN)

 Can you by any chance try a kernel from [1] ?  It's based on Ubuntu
 config and unless you are doing something fancy should boot your box.
 You have to install it only on the client box of course.

 This may be related to the bug I'm currently trying to nail down and
 I'd like to know if the latest bits make any difference.

[1] 
http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/testing/linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb
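
Installing it on the client is just a dpkg install plus a reboot into the new
kernel (a sketch, assuming a Debian/Ubuntu client; the filename comes straight
from the link above):

# on the cephfs/nfs client box only
wget http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/testing/linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb
sudo dpkg -i linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb
sudo reboot
# after the reboot, check what is running
uname -r    # should now show the 3.18.0-rc6-ceph test kernel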

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-28 Thread Ilya Dryomov
On Fri, Nov 28, 2014 at 8:20 PM, Ilya Dryomov ilya.dryo...@inktank.com wrote:
 On Fri, Nov 28, 2014 at 8:19 PM, Ilya Dryomov ilya.dryo...@inktank.com 
 wrote:
 On Fri, Nov 28, 2014 at 8:13 PM, Andrei Mikhailovsky and...@arhont.com 
 wrote:
 Ilya, yes I do! LIke these from different osds:

 [ 4422.212204] libceph: osd13 192.168.168.201:6819 socket closed (con state
 OPEN)

 Can you by any chance try a kernel from [1] ?  It's based on Ubuntu
 config and unless you are doing something fancy should boot your box.
 You have to install it only on the client box of course.

 This may be related to the bug I'm currently trying to nail down and
 I'd like to know if the latest bits make any difference.

 [1] 
 http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/testing/linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb

It's currently rebuilding because of an unrelated patch and will be
overwritten once gitbuilder is done.  If it's not there by the time you
try, use this link:

http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/sha1/72ca172a582d656930f413c3733401b8a5c120db/linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-28 Thread Andrei Mikhailovsky
I will give it a go and let you know. 

Cheers 

- Original Message -

 From: Ilya Dryomov ilya.dryo...@inktank.com
 To: Andrei Mikhailovsky and...@arhont.com
 Cc: ceph-users ceph-users@lists.ceph.com
 Sent: Friday, 28 November, 2014 5:28:28 PM
 Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

 On Fri, Nov 28, 2014 at 8:20 PM, Ilya Dryomov
 ilya.dryo...@inktank.com wrote:
  On Fri, Nov 28, 2014 at 8:19 PM, Ilya Dryomov
  ilya.dryo...@inktank.com wrote:
  On Fri, Nov 28, 2014 at 8:13 PM, Andrei Mikhailovsky
  and...@arhont.com wrote:
  Ilya, yes I do! Like these from different osds:
 
  [ 4422.212204] libceph: osd13 192.168.168.201:6819 socket closed
  (con state
  OPEN)
 
  Can you by any chance try a kernel from [1]? It's based on Ubuntu
  config and, unless you are doing something fancy, it should boot your
  box.
  You have to install it only on the client box of course.
 
  This may be related to the bug I'm currently trying to nail down
  and
  I'd like to know if the latest bits make any difference.
 
  [1]
  http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/testing/linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb

 It's currently rebuilding because of an unrelated patch and will be
 overwritten once gitbuilder is done. If it's not there by the time
 you
 try, use this link:

 http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/sha1/72ca172a582d656930f413c3733401b8a5c120db/linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb

 Thanks,

 Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-28 Thread Andrei Mikhailovsky
: 
[ 288.310156] CPU: 8 PID: 87 Comm: kswapd1 Tainted: G E 
3.18.0-rc6-ceph-00024-g72ca172 #1 
[ 288.310162] Hardware name: Supermicro 
X9DRH-7TF/7F/iTF/iF/X9DRH-7TF/7F/iTF/iF, BIOS 3.0b 04/28/2014 
[ 288.310169] 821208e0 8804676ab608 81733b38 0007
[ 288.310177] 8804676ab670 8804676ab658 810a5f68 821208e0
[ 288.310184] 81a7cbe0 8804676ab674  88046763cc50
[ 288.310192] Call Trace: 
[ 288.310200] [81733b38] dump_stack+0x4e/0x68 
[ 288.310206] [810a5f68] print_irq_inversion_bug.part.41+0x1e8/0x1f0 
[ 288.310213] [810a607b] check_usage_forwards+0x10b/0x150 
[ 288.310220] [810a6a8b] mark_lock+0x18b/0x2e0 
[ 288.310226] [810a5f70] ? print_irq_inversion_bug.part.41+0x1f0/0x1f0 
[ 288.310234] [811c9185] ? __mem_cgroup_threshold+0x5/0x1d0 
[ 288.310241] [810a6fb0] __lock_acquire+0x3d0/0x1c90 
[ 288.310247] [810a6ff1] ? __lock_acquire+0x411/0x1c90 
[ 288.310266] [a0682d44] ? xfs_ilock+0x134/0x160 [xfs] 
[ 288.310272] [810a8e9e] lock_acquire+0x9e/0x140 
[ 288.310289] [a0682d44] ? xfs_ilock+0x134/0x160 [xfs] 
[ 288.310295] [810a33ef] down_write_nested+0x4f/0x80 
[ 288.310312] [a0682d44] ? xfs_ilock+0x134/0x160 [xfs] 
[ 288.310329] [a0682d44] xfs_ilock+0x134/0x160 [xfs] 
[ 288.310347] [a067aa0c] ? xfs_reclaim_inode+0x12c/0x340 [xfs] 
[ 288.310364] [a067aa0c] xfs_reclaim_inode+0x12c/0x340 [xfs] 
[ 288.310382] [a067aea7] xfs_reclaim_inodes_ag+0x287/0x400 [xfs] 
[ 288.310400] [a067ad00] ? xfs_reclaim_inodes_ag+0xe0/0x400 [xfs] 
[ 288.310418] [a067bda3] xfs_reclaim_inodes_nr+0x33/0x40 [xfs] 
[ 288.310438] [a068b855] xfs_fs_free_cached_objects+0x15/0x20 [xfs] 
[ 288.310445] [811d8dd8] super_cache_scan+0x178/0x180 
[ 288.310451] [8117393e] shrink_slab_node+0x15e/0x310 
[ 288.310457] [811755e0] shrink_slab+0x100/0x140 
[ 288.310463] [81178306] kswapd_shrink_zone+0x116/0x1a0 
[ 288.310469] [8117925b] kswapd+0x4bb/0x9a0 
[ 288.310475] [81178da0] ? mem_cgroup_shrink_node_zone+0x1c0/0x1c0 
[ 288.310481] [8107a664] kthread+0xe4/0x100 
[ 288.310488] [8107a580] ? flush_kthread_worker+0xf0/0xf0 
[ 288.310494] [8173d66c] ret_from_fork+0x7c/0xb0 
[ 288.310500] [8107a580] ? flush_kthread_worker+0xf0/0xf0 

I've not seen any hang tasks just yet. The server seems to continue working. I 
will do more testing and get back to you with more info. 
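
In case it helps with reproducing this, here is a minimal sketch of one way to
keep an eye on hung tasks and libceph errors while the tests run (these are
standard kernel facilities, nothing ceph specific):

  # hung task detection needs CONFIG_DETECT_HUNG_TASK; 0 disables it,
  # any other value is the timeout in seconds before a warning is logged
  cat /proc/sys/kernel/hung_task_timeout_secs

  # follow the kernel log and pick out blocked-task and ceph client messages
  tail -f /var/log/kern.log | grep -E 'blocked for more than|libceph|ceph:'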

Andrei 

- Original Message -

 From: Ilya Dryomov ilya.dryo...@inktank.com
 To: Andrei Mikhailovsky and...@arhont.com
 Cc: ceph-users ceph-users@lists.ceph.com
 Sent: Friday, 28 November, 2014 5:28:28 PM
 Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

 On Fri, Nov 28, 2014 at 8:20 PM, Ilya Dryomov
 ilya.dryo...@inktank.com wrote:
  On Fri, Nov 28, 2014 at 8:19 PM, Ilya Dryomov
  ilya.dryo...@inktank.com wrote:
  On Fri, Nov 28, 2014 at 8:13 PM, Andrei Mikhailovsky
  and...@arhont.com wrote:
  Ilya, yes I do! Like these from different osds:
 
  [ 4422.212204] libceph: osd13 192.168.168.201:6819 socket closed
  (con state
  OPEN)
 
  Can you by any chance try a kernel from [1]? It's based on Ubuntu
  config and, unless you are doing something fancy, it should boot your
  box.
  You have to install it only on the client box of course.
 
  This may be related to the bug I'm currently trying to nail down
  and
  I'd like to know if the latest bits make any difference.
 
  [1]
  http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/testing/linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb

 It's currently rebuilding because of an unrelated patch and will be
 overwritten once gitbuilder is done. If it's not there by the time
 you
 try, use this link:

 http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/sha1/72ca172a582d656930f413c3733401b8a5c120db/linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb

 Thanks,

 Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant + nfs over cephfs hang tasks

2014-11-28 Thread Andrei Mikhailovsky
Ilya, 

I'm not sure if the dmesg output in my previous email is related to cephfs, but 
from what I can see things look good with your kernel. Previously I would have 
seen hang tasks by now, but not anymore. I've run a bunch of concurrent dd tests 
and also the file touch tests, and there are no more delays. 
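
For reference, the tests were along these lines; the mount point, file sizes
and loop counts below are illustrative rather than the exact values used:

  # concurrent direct-I/O writes, timed per stream
  N=4
  DIR=/tmp/cephfs
  for i in $(seq 1 "$N"); do
      ( time dd if=/dev/zero of="$DIR/4G$i" bs=1M count=4096 oflag=direct ) \
          2> "dd_$i.log" &
  done
  wait
  grep real dd_*.log

  # small metadata ops ("file touch test") on the same mount
  time bash -c 'for i in $(seq 1 100); do touch /tmp/cephfs/touch_$i; done'
  rm -f /tmp/cephfs/touch_*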

So, it looks like you have nailed the bug! 

Do you plan to backport the fix to the 3.16 or 3.17 branches? 
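
As an aside, once a fix lands upstream, one way to check later whether it has
been picked into a stable series is to look at the stable tree directly; the
branch name below just follows the usual linux-stable naming and this is only
a sketch:

  # shallow clone of a stable branch, then list recent ceph client changes
  git clone --depth 200 --branch linux-3.17.y \
      git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
  cd linux-stable
  git log --oneline -- net/ceph fs/ceph drivers/block/rbd.c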

Cheers 

Andrei 

- Original Message -

 From: Ilya Dryomov ilya.dryo...@inktank.com
 To: Andrei Mikhailovsky and...@arhont.com
 Cc: ceph-users ceph-users@lists.ceph.com
 Sent: Friday, 28 November, 2014 5:28:28 PM
 Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

 On Fri, Nov 28, 2014 at 8:20 PM, Ilya Dryomov
 ilya.dryo...@inktank.com wrote:
  On Fri, Nov 28, 2014 at 8:19 PM, Ilya Dryomov
  ilya.dryo...@inktank.com wrote:
  On Fri, Nov 28, 2014 at 8:13 PM, Andrei Mikhailovsky
  and...@arhont.com wrote:
  Ilya, yes I do! Like these from different osds:
 
  [ 4422.212204] libceph: osd13 192.168.168.201:6819 socket closed
  (con state
  OPEN)
 
  Can you by any chance try a kernel from [1]? It's based on Ubuntu
  config and, unless you are doing something fancy, it should boot your
  box.
  You have to install it only on the client box of course.
 
  This may be related to the bug I'm currently trying to nail down
  and
  I'd like to know if the latest bits make any difference.
 
  [1]
  http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/testing/linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb

 It's currently rebuilding because of an unrelated patch and will be
 overwritten once gitbuilder is done. If it's not there by the time
 you
 try, use this link:

 http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/sha1/72ca172a582d656930f413c3733401b8a5c120db/linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb

 Thanks,

 Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com