Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
- Original Message -
> From: "Raghavendra Gowdappa"
> To: "Joe Julian"
> Cc: "Raghavendra G", "Pranith Kumar Karampuri", "Soumya Koduri", rta...@redhat.com, gluster-us...@gluster.org, "Gluster Devel", "Oleksandr Natalenko"
> Sent: Friday, October 14, 2016 8:29:18 AM
> Subject: Re: [Gluster-users] [Gluster-devel] GlusterFS FUSE client hangs on rsyncing lots of file
>
> I have a patch [1] which I think fixes the problem in write-behind. If
> possible, can any of you please let me know whether it fixes the issue
> (with client.io-threads turned on)?
>
> [1] http://review.gluster.org/15579

I see some regression failures again in the latest patchset, which I am
working on to fix. However, you can use patchset 3 [2] for testing, as it
has passed all regressions.

[2] http://review.gluster.org/#/c/15579/3

[quoted thread trimmed; the messages appear in full below]
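A minimal sketch of fetching that patchset for a test build (an
assumption on my part: the repo is served over the review URL and the
ref follows Gerrit's usual refs/changes/<last-two-digits>/<change>/<patchset>
layout):

  # inside a glusterfs source checkout
  git fetch http://review.gluster.org/glusterfs refs/changes/79/15579/3
  git checkout FETCH_HEAD   # build and install the client from here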
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
I have a patch [1] which I think fixes the problem in write-behind. If
possible, can any of you please let me know whether it fixes the issue
(with client.io-threads turned on)?

[1] http://review.gluster.org/15579

- Original Message -
> From: "Joe Julian"
> To: "Raghavendra G", "Pranith Kumar Karampuri", "Soumya Koduri", rta...@redhat.com
> Cc: gluster-us...@gluster.org, "Gluster Devel"
> Sent: Friday, February 5, 2016 12:16:01 PM
> Subject: Re: [Gluster-users] [Gluster-devel] GlusterFS FUSE client hangs on rsyncing lots of file
>
> [quoted thread trimmed; the message appears in full below]
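The io-threads toggle being tested is a per-volume option; a sketch of
flipping it for a test run (VOLNAME is a placeholder; the full option
name, used later in this thread, is performance.client-io-threads):

  gluster volume set VOLNAME performance.client-io-threads on    # to exercise the patch
  gluster volume set VOLNAME performance.client-io-threads off   # reported to avoid the hang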
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
Interesting. I just encountered a hanging flush problem, too. It is
probably unrelated, but if you want to give this a try: a temporary
workaround I found was to drop caches, "echo 3 > /proc/sys/vm/drop_caches",
on all the servers prior to the flush operation.

On February 4, 2016 10:06:45 PM PST, Raghavendra G wrote:
> +soumyak, +rtalur.
>
> [quoted thread trimmed; the messages appear in full below]
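Joe's workaround applied to every server, as a minimal sketch (assuming
passwordless ssh and the hypothetical hostnames server1/server2; the
sync makes sure dirty pages are written back before the page cache,
dentries and inodes are dropped):

  for h in server1 server2; do
      ssh "$h" 'sync; echo 3 > /proc/sys/vm/drop_caches'
  done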
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
+soumyak, +rtalur.

On Fri, Jan 29, 2016 at 2:34 PM, Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

> [quoted thread trimmed; the message appears in full below]
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
On 01/28/2016 05:05 PM, Pranith Kumar Karampuri wrote:
> With baul jianguo's help I am able to see that FLUSH fops are hanging
> for some reason.
>
> [grep "unique=" output trimmed; it appears in full below]
>
> I will be debugging a bit more and post my findings.

+Raghavendra G

All the stubs are hung in write-behind. I checked that the statedumps
don't have any writes in progress. Maybe, because of some race, the
flush fop is not resumed after the write calls are complete? It seems
this issue happens only when io-threads is enabled on the client.

Pranith

On 01/28/2016 03:18 PM, baul jianguo wrote:
> [gdb backtrace trimmed; it appears in full below]
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
With baul jianguo's help I am able to see that FLUSH fops are hanging
for some reason.

pk1@localhost - ~/Downloads
17:02:13 :) ⚡ grep "unique=" client-dump1.txt
unique=3160758373
unique=2073075682
unique=1455047665
unique=0

pk1@localhost - ~/Downloads
17:02:21 :) ⚡ grep "unique=" client-dump-0.txt
unique=3160758373
unique=2073075682
unique=1455047665
unique=0

I will be debugging a bit more and post my findings.

Pranith

On 01/28/2016 03:18 PM, baul jianguo wrote:
> [gdb backtrace trimmed; it appears in full below]
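The same unique IDs showing up in both statedumps is what marks the
stuck frames; a sketch of automating the comparison (assuming two dump
files named as above):

  grep "unique=" client-dump-0.txt | sort -u > dump0.ids
  grep "unique=" client-dump1.txt | sort -u > dump1.ids
  comm -12 dump0.ids dump1.ids   # IDs present in both dumps: likely stuck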
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
The client glusterfs gdb info; the main thread id is 70800.
In the top output, thread 70800's time is 1263:30 and thread 70810's
time is 1321:10; the times of the other threads are too small.

(gdb) thread apply all bt

Thread 9 (Thread 0x7fc21acaf700 (LWP 70801)):
#0 0x7fc21cc0c535 in sigwait () from /lib64/libpthread.so.0
#1 0x0040539b in glusterfs_sigwaiter (arg=<value optimized out>) at glusterfsd.c:1653
#2 0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0
#3 0x7fc21c56e93d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fc21a2ae700 (LWP 70802)):
#0 0x7fc21cc08a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7fc21ded02bf in syncenv_task (proc=0x121ee60) at syncop.c:493
#2 0x7fc21ded6300 in syncenv_processor (thdata=0x121ee60) at syncop.c:571
#3 0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0
#4 0x7fc21c56e93d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fc2198ad700 (LWP 70803)):
#0 0x7fc21cc08a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7fc21ded02bf in syncenv_task (proc=0x121f220) at syncop.c:493
#2 0x7fc21ded6300 in syncenv_processor (thdata=0x121f220) at syncop.c:571
#3 0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0
#4 0x7fc21c56e93d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fc21767d700 (LWP 70805)):
#0 0x7fc21cc0bfbd in nanosleep () from /lib64/libpthread.so.0
#1 0x7fc21deb16bc in gf_timer_proc (ctx=0x11f2010) at timer.c:170
#2 0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0
#3 0x7fc21c56e93d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fc20fb1e700 (LWP 70810)):   [the most CPU time]
#0 0x7fc21c566987 in readv () from /lib64/libc.so.6
#1 0x7fc21accbc55 in fuse_thread_proc (data=0x120f450) at fuse-bridge.c:4752
#2 0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0
#3 0x7fc21c56e93d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fc20f11d700 (LWP 70811)):   [somewhat less CPU time]
#0 0x7fc21cc0b7dd in read () from /lib64/libpthread.so.0
#1 0x7fc21acc0e73 in read (data=<value optimized out>) at /usr/include/bits/unistd.h:45
#2 notify_kernel_loop (data=<value optimized out>) at fuse-bridge.c:3786
#3 0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0
#4 0x7fc21c56e93d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fc1b16fe700 (LWP 206224)):
#0 0x7fc21cc08a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7fc20e515e60 in iot_worker (data=0x19eeda0) at io-threads.c:157
#2 0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0
#3 0x7fc21c56e93d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fc1b0bfb700 (LWP 214361)):
#0 0x7fc21cc08a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7fc20e515e60 in iot_worker (data=0x19eeda0) at io-threads.c:157
#2 0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0
#3 0x7fc21c56e93d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fc21e31e700 (LWP 70800)):
#0 0x7fc21c56ef33 in epoll_wait () from /lib64/libc.so.6
#1 0x7fc21deea3e7 in event_dispatch_epoll (event_pool=0x120dec0) at event-epoll.c:428
#2 0x004075e4 in main (argc=4, argv=0x7fff3dc93698) at glusterfsd.c:1983

On Thu, Jan 28, 2016 at 5:29 PM, baul jianguo wrote:
> http://pastebin.centos.org/38941/
> client statedump; only the pids 27419, 168030, 208655 hang. You can
> search for these pids in the statedump file.
>
> [quoted thread trimmed; the messages appear in full below]
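A non-interactive way to capture the same backtrace (a sketch; it
assumes debuginfo packages are installed and that the oldest glusterfs
process is the FUSE mount, which may not hold on every setup):

  gdb -p "$(pgrep -o -x glusterfs)" -batch -ex 'thread apply all bt' > client-bt.txt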
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
On 01/28/2016 02:59 PM, baul jianguo wrote:
> http://pastebin.centos.org/38941/
> client statedump; only the pids 27419, 168030, 208655 hang. You can
> search for these pids in the statedump file.

Could you take one more statedump please?

Pranith

> On Wed, Jan 27, 2016 at 4:35 PM, Pranith Kumar Karampuri wrote:
> > [quoted thread trimmed; the message appears in full below]
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
http://pastebin.centos.org/38941/

Client statedump; only the pids 27419, 168030, 208655 hang. You can
search for these pids in the statedump file.

On Wed, Jan 27, 2016 at 4:35 PM, Pranith Kumar Karampuri wrote:
> [quoted thread trimmed; the message appears in full below]
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
Hi,

If the hang appears on enabling client-side io-threads, then it could be
because of some race that is seen when io-threads is enabled on the
client side. Two things will help us debug this issue:

1) "thread apply all bt" inside gdb (with debuginfo rpms/debs installed);
2) complete statedumps of the mount at two intervals, preferably 10
seconds apart. It becomes difficult to find out which frames are stuck
vs. the ones that are ongoing when we have just one statedump. If we
have two, we can find which frames are common to both statedumps and
then take a closer look there.

Feel free to ping me on #gluster-dev, nick: pranithk, if you have the
process hung in that state and you don't mind me doing a live debugging
session with you. This option is the best of the lot!

Thanks a lot, baul and Oleksandr, for the debugging so far!

Pranith

On 01/25/2016 01:03 PM, baul jianguo wrote:
> [quoted thread trimmed; the message appears in full below]
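A sketch of collecting the two statedumps requested above (SIGUSR1 is
the documented statedump trigger for a glusterfs process, and "all=yes"
in glusterdump.options, mentioned later in this thread, includes
inode/fd details; dumps land under /var/run/gluster/):

  echo "all=yes" > /var/run/gluster/glusterdump.options
  PID=$(pgrep -o -x glusterfs)   # assumed to be the FUSE mount process
  kill -USR1 "$PID"              # first dump
  sleep 10
  kill -USR1 "$PID"              # second dump, 10 seconds later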
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
The client statedump is at http://pastebin.centos.org/38671/

On Mon, Jan 25, 2016 at 3:33 PM, baul jianguo wrote:
> [quoted thread trimmed; the message appears in full below]
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
3.5.7 also hangs; only the flush op hung. Yes: with
performance.client-io-threads off, there is no hang.

The hang does not relate to the client kernel version.

One client statedump of a flush op; anything abnormal?

[global.callpool.stack.12]
uid=0
gid=0
pid=14432
unique=16336007098
lk-owner=77cb199aa36f3641
op=FLUSH
type=1
cnt=6

[global.callpool.stack.12.frame.1]
ref_count=1
translator=fuse
complete=0

[global.callpool.stack.12.frame.2]
ref_count=0
translator=datavolume-write-behind
complete=0
parent=datavolume-read-ahead
wind_from=ra_flush
wind_to=FIRST_CHILD(this)->fops->flush
unwind_to=ra_flush_cbk

[global.callpool.stack.12.frame.3]
ref_count=1
translator=datavolume-read-ahead
complete=0
parent=datavolume-open-behind
wind_from=default_flush_resume
wind_to=FIRST_CHILD(this)->fops->flush
unwind_to=default_flush_cbk

[global.callpool.stack.12.frame.4]
ref_count=1
translator=datavolume-open-behind
complete=0
parent=datavolume-io-threads
wind_from=iot_flush_wrapper
wind_to=FIRST_CHILD(this)->fops->flush
unwind_to=iot_flush_cbk

[global.callpool.stack.12.frame.5]
ref_count=1
translator=datavolume-io-threads
complete=0
parent=datavolume
wind_from=io_stats_flush
wind_to=FIRST_CHILD(this)->fops->flush
unwind_to=io_stats_flush_cbk

[global.callpool.stack.12.frame.6]
ref_count=1
translator=datavolume
complete=0
parent=fuse
wind_from=fuse_flush_resume
wind_to=xl->fops->flush
unwind_to=fuse_err_cbk

On Sun, Jan 24, 2016 at 5:35 AM, Oleksandr Natalenko wrote:
> [quoted thread trimmed; the message appears in full below]
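A quick way to pull just the FLUSH stacks out of a statedump for
inspection (a sketch; it assumes the [global.callpool.stack.N] layout
shown above, where op= sits six lines after the stack header):

  grep -B 6 -A 2 "^op=FLUSH" client-dump1.txt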
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
With "performance.client-io-threads" set to "off" no hangs occurred in 3 rsync/rm rounds. Could that be some fuse-bridge lock race? Will bring that option to "on" back again and try to get full statedump. On четвер, 21 січня 2016 р. 14:54:47 EET Raghavendra G wrote: > On Thu, Jan 21, 2016 at 10:49 AM, Pranith Kumar Karampuri < > > pkara...@redhat.com> wrote: > > On 01/18/2016 02:28 PM, Oleksandr Natalenko wrote: > >> XFS. Server side works OK, I'm able to mount volume again. Brick is 30% > >> full. > > > > Oleksandr, > > > > Will it be possible to get the statedump of the client, bricks > > > > output next time it happens? > > > > https://github.com/gluster/glusterfs/blob/master/doc/debugging/statedump.m > > d#how-to-generate-statedump > We also need to dump inode information. To do that you've to add "all=yes" > to /var/run/gluster/glusterdump.options before you issue commands to get > statedump. > > > Pranith > > > >> On понеділок, 18 січня 2016 р. 15:07:18 EET baul jianguo wrote: > >>> What is your brick file system? and the glusterfsd process and all > >>> thread status? > >>> I met same issue when client app such as rsync stay in D status,and > >>> the brick process and relate thread also be in the D status. > >>> And the brick dev disk util is 100% . > >>> > >>> On Sun, Jan 17, 2016 at 6:13 AM, Oleksandr Natalenko > >>> > >>> wrote: > Wrong assumption, rsync hung again. > > On субота, 16 січня 2016 р. 22:53:04 EET Oleksandr Natalenko wrote: > > One possible reason: > > > > cluster.lookup-optimize: on > > cluster.readdir-optimize: on > > > > I've disabled both optimizations, and at least as of now rsync still > > does > > its job with no issues. I would like to find out what option causes > > such > > a > > behavior and why. Will test more. > > > > On пʼятниця, 15 січня 2016 р. 16:09:51 EET Oleksandr Natalenko wrote: > >> Another observation: if rsyncing is resumed after hang, rsync itself > >> hangs a lot faster because it does stat of already copied files. So, > >> the > >> reason may be not writing itself, but massive stat on GlusterFS > >> volume > >> as well. > >> > >> 15.01.2016 09:40, Oleksandr Natalenko написав: > >>> While doing rsync over millions of files from ordinary partition to > >>> GlusterFS volume, just after approx. first 2 million rsync hang > >>> happens, and the following info appears in dmesg: > >>> > >>> === > >>> [17075038.924481] INFO: task rsync:10310 blocked for more than 120 > >>> seconds. > >>> [17075038.931948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > >>> disables this message. > >>> [17075038.940748] rsync D 88207fc13680 0 10310 > >>> 10309 0x0080 > >>> [17075038.940752] 8809c578be18 0086 > >>> 8809c578bfd8 > >>> 00013680 > >>> [17075038.940756] 8809c578bfd8 00013680 > >>> 880310cbe660 > >>> 881159d16a30 > >>> [17075038.940759] 881e3aa25800 8809c578be48 > >>> 881159d16b10 > >>> 88087d553980 > >>> [17075038.940762] Call Trace: > >>> [17075038.940770] [] schedule+0x29/0x70 > >>> [17075038.940797] [] > >>> __fuse_request_send+0x13d/0x2c0 > >>> [fuse] > >>> [17075038.940801] [] ? > >>> fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse] > >>> [17075038.940805] [] ? 
wake_up_bit+0x30/0x30 > >>> [17075038.940809] [] fuse_request_send+0x12/0x20 > >>> [fuse] > >>> [17075038.940813] [] fuse_flush+0xff/0x150 [fuse] > >>> [17075038.940817] [] filp_close+0x34/0x80 > >>> [17075038.940821] [] __close_fd+0x78/0xa0 > >>> [17075038.940824] [] SyS_close+0x23/0x50 > >>> [17075038.940828] [] > >>> system_call_fastpath+0x16/0x1b > >>> === > >>> > >>> rsync blocks in D state, and to kill it, I have to do umount --lazy > >>> on > >>> GlusterFS mountpoint, and then kill corresponding client glusterfs > >>> process. Then rsync exits. > >>> > >>> Here is GlusterFS volume info: > >>> > >>> === > >>> Volume Name: asterisk_records > >>> Type: Distributed-Replicate > >>> Volume ID: dc1fe561-fa3a-4f2e-8330-ec7e52c75ba4 > >>> Status: Started > >>> Number of Bricks: 3 x 2 = 6 > >>> Transport-type: tcp > >>> Bricks: > >>> Brick1: > >>> server1:/bricks/10_megaraid_0_3_9_x_0_4_3_hdd_r1_nolvm_hdd_storage_0 > >>> 1 > >>> /as > >>> te > >>> risk/records Brick2: > >>> server2:/bricks/10_megaraid_8_5_14_x_8_6_16_hdd_r1_nolvm_hdd_storage > >>> _ > >>> 01/ > >>> as > >>> terisk/records Brick3: > >>> server1:/bricks/11_megaraid_0_5_4_x_0_6_5_hdd_r1_nolvm_hdd_storage_0 > >>> 2 > >>> /as > >>> te > >>> risk/records Brick4: > >>> server2:/bricks/11_megaraid_8_7_15_x_8_8_20_hdd_r1_nolvm_hdd_storage > >>> _ > >>>
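Putting the instructions above together, capturing a full client statedump (including inode tables) looks roughly like this. A sketch only: the pgrep pattern is an assumption, match it to your own mount process.

===
# On the client: ask the FUSE client process for a statedump with all
# sections (inode tables included) enabled.
echo "all=yes" > /var/run/gluster/glusterdump.options
kill -USR1 $(pgrep -f 'glusterfs.*asterisk_records')
# The dump is written under /var/run/gluster/ (see the statedump doc above).
===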
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
On Thu, Jan 21, 2016 at 10:49 AM, Pranith Kumar Karampuri < pkara...@redhat.com> wrote:
> On 01/18/2016 02:28 PM, Oleksandr Natalenko wrote:
>> XFS. Server side works OK, I'm able to mount volume again. Brick is 30%
>> full.
>
> Oleksandr,
>
> Will it be possible to get the statedump of the client, bricks
> output next time it happens?
>
> https://github.com/gluster/glusterfs/blob/master/doc/debugging/statedump.md#how-to-generate-statedump

We also need to dump inode information. To do that you've to add "all=yes" to /var/run/gluster/glusterdump.options before you issue commands to get statedump.
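For the brick side of the same request, statedumps can be triggered for all bricks of the volume from the CLI; volume name taken from the report further down:

===
# On any server in the cluster: dump state of all brick processes.
# Dumps land on each brick host, by default under /var/run/gluster/.
gluster volume statedump asterisk_records
===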
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
On 01/18/2016 02:28 PM, Oleksandr Natalenko wrote:
> XFS. Server side works OK, I'm able to mount volume again. Brick is 30% full.

Oleksandr,

Will it be possible to get the statedump of the client, bricks output next time it happens?

https://github.com/gluster/glusterfs/blob/master/doc/debugging/statedump.md#how-to-generate-statedump

Pranith
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
XFS. Server side works OK, I'm able to mount volume again. Brick is 30% full.

On Monday, January 18, 2016, 15:07:18 EET baul jianguo wrote:
> What is your brick file system? And what is the status of the glusterfsd
> process and all its threads?
> I hit the same issue when a client app such as rsync stays in D state; the
> brick process and related threads are also in D state, and the brick
> device's disk utilization is 100%.
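A quick way to double-check the server side in this situation; the df glob is an assumption based on the brick paths in the report below:

===
# Confirm all bricks are online and their PIDs are present,
# then check how full the brick filesystems are.
gluster volume status asterisk_records
df -h /bricks/*/asterisk/records
===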
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
What is your brick file system? And what is the status of the glusterfsd process and all its threads?

I hit the same issue when a client app such as rsync stays in D state; the brick process and related threads are also in D state, and the brick device's disk utilization is 100%.

On Sun, Jan 17, 2016 at 6:13 AM, Oleksandr Natalenko wrote:
> Wrong assumption, rsync hung again.
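To capture the process/thread state and disk utilization described above, standard tools are enough. A rough sketch:

===
# List tasks stuck in uninterruptible sleep (D state), with the kernel
# function each one is waiting in.
ps -eLo pid,tid,stat,wchan:30,comm | awk '$3 ~ /D/'

# Watch per-device utilization; a brick device pinned near 100% in %util
# matches the behavior described above.
iostat -x 1
===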
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
Wrong assumption, rsync hung again.

On Saturday, January 16, 2016, 22:53:04 EET Oleksandr Natalenko wrote:
> One possible reason:
>
> cluster.lookup-optimize: on
> cluster.readdir-optimize: on
>
> I've disabled both optimizations, and at least as of now rsync still does
> its job with no issues. I would like to find out which option causes such
> behavior and why. Will test more.
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
One possible reason:

cluster.lookup-optimize: on
cluster.readdir-optimize: on

I've disabled both optimizations, and at least as of now rsync still does its job with no issues. I would like to find out which option causes such behavior and why. Will test more.

On Friday, January 15, 2016, 16:09:51 EET Oleksandr Natalenko wrote:
> Another observation: if rsyncing is resumed after a hang, rsync itself
> hangs a lot faster because it stats the already copied files. So the
> reason may be not the writing itself, but massive stat activity on the
> GlusterFS volume as well.
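For reference, toggling the suspected options on a live volume is a one-liner each; the values mirror the tests described in this thread:

===
# The two optimizations suspected first:
gluster volume set asterisk_records cluster.lookup-optimize off
gluster volume set asterisk_records cluster.readdir-optimize off

# The option whose "off" setting avoided the hangs in later tests:
gluster volume set asterisk_records performance.client-io-threads off
===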
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
Another observation: if rsyncing is resumed after a hang, rsync itself hangs a lot faster because it stats the already copied files. So the reason may be not the writing itself, but massive stat activity on the GlusterFS volume as well.

On 15.01.2016 09:40, Oleksandr Natalenko wrote:
> While doing rsync over millions of files from an ordinary partition to a
> GlusterFS volume, just after approx. the first 2 million files the rsync
> hang happens, and the following info appears in dmesg:
> [...]
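One way to confirm what a resumed rsync is actually doing against the mount; a sketch, the PID lookup is illustrative:

===
# Summarise syscalls of the running rsync. On resume the counts should be
# dominated by lstat()-style calls; a hang shows up as a close() that never
# returns. Ctrl-C prints the summary.
strace -c -f -p $(pgrep -n rsync)
===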
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file
Here is a similar issue described on serverfault.com:

https://serverfault.com/questions/716410/rsync-crashes-machine-while-performing-sync-on-glusterfs-mounted-share

I've checked GlusterFS logs with no luck, as if nothing had happened.

P.S. GlusterFS v3.7.6.

On 15.01.2016 09:40, Oleksandr Natalenko wrote:

While doing rsync over millions of files from an ordinary partition to a GlusterFS volume, just after approx. the first 2 million files the rsync hang happens, and the following info appears in dmesg:

===
[17075038.924481] INFO: task rsync:10310 blocked for more than 120 seconds.
[17075038.931948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[17075038.940748] rsync D 88207fc13680 0 10310 10309 0x0080
[17075038.940752] 8809c578be18 0086 8809c578bfd8 00013680
[17075038.940756] 8809c578bfd8 00013680 880310cbe660 881159d16a30
[17075038.940759] 881e3aa25800 8809c578be48 881159d16b10 88087d553980
[17075038.940762] Call Trace:
[17075038.940770] [] schedule+0x29/0x70
[17075038.940797] [] __fuse_request_send+0x13d/0x2c0 [fuse]
[17075038.940801] [] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
[17075038.940805] [] ? wake_up_bit+0x30/0x30
[17075038.940809] [] fuse_request_send+0x12/0x20 [fuse]
[17075038.940813] [] fuse_flush+0xff/0x150 [fuse]
[17075038.940817] [] filp_close+0x34/0x80
[17075038.940821] [] __close_fd+0x78/0xa0
[17075038.940824] [] SyS_close+0x23/0x50
[17075038.940828] [] system_call_fastpath+0x16/0x1b
===

rsync blocks in D state, and to kill it, I have to do umount --lazy on the GlusterFS mountpoint and then kill the corresponding client glusterfs process. Then rsync exits.

Here is GlusterFS volume info:

===
Volume Name: asterisk_records
Type: Distributed-Replicate
Volume ID: dc1fe561-fa3a-4f2e-8330-ec7e52c75ba4
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: server1:/bricks/10_megaraid_0_3_9_x_0_4_3_hdd_r1_nolvm_hdd_storage_01/asterisk/records
Brick2: server2:/bricks/10_megaraid_8_5_14_x_8_6_16_hdd_r1_nolvm_hdd_storage_01/asterisk/records
Brick3: server1:/bricks/11_megaraid_0_5_4_x_0_6_5_hdd_r1_nolvm_hdd_storage_02/asterisk/records
Brick4: server2:/bricks/11_megaraid_8_7_15_x_8_8_20_hdd_r1_nolvm_hdd_storage_02/asterisk/records
Brick5: server1:/bricks/12_megaraid_0_7_6_x_0_13_14_hdd_r1_nolvm_hdd_storage_03/asterisk/records
Brick6: server2:/bricks/12_megaraid_8_9_19_x_8_13_24_hdd_r1_nolvm_hdd_storage_03/asterisk/records
Options Reconfigured:
cluster.lookup-optimize: on
cluster.readdir-optimize: on
client.event-threads: 2
network.inode-lru-limit: 4096
server.event-threads: 4
performance.client-io-threads: on
storage.linux-aio: on
performance.write-behind-window-size: 4194304
performance.stat-prefetch: on
performance.quick-read: on
performance.read-ahead: on
performance.flush-behind: on
performance.write-behind: on
performance.io-thread-count: 2
performance.cache-max-file-size: 1048576
performance.cache-size: 33554432
features.cache-invalidation: on
performance.readdir-ahead: on
===

The issue reproduces each time I rsync such an amount of files.

How could I debug this issue better?
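The recovery procedure described above (lazy unmount, then kill the client glusterfs process) in command form; the mountpoint and pgrep pattern are examples, not paths from this thread:

===
# Lazily detach the hung mount so the blocked rsync can be released,
# then kill the matching client process.
umount -l /mnt/asterisk_records
pkill -f 'glusterfs.*asterisk_records'
===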