[Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump

2019-01-29 Thread Li, Deqian (NSB - CN/Hangzhou)
Hi,

Could you help to check this coredump?
We are using glusterfs 3.12-3 (a 3-brick replicated volume) for stability
testing under high CPU load (around 80%, generated with the stress tool) while
doing I/O. After several hours, a coredump happened on the glusterfs client side.

[Current thread is 1 (Thread 0x737d2700 (LWP 3696))]
Missing separate debuginfos, use: dnf debuginfo-install 
rcp-pack-glusterfs-1.8.1_11_g99e9ca6-RCP2.wf28.x86_64
(gdb) bt
#0  0x70d5c845 in wb_fulfill (wb_inode=0x7fffd406b3b0, 
liabilities=0x7fffdc234b50) at write-behind.c:1148
#1  0x70d5e4d5 in wb_process_queue (wb_inode=0x7fffd406b3b0) at 
write-behind.c:1718
#2  0x70d5eda7 in wb_writev (frame=0x7fffe0086290, this=0x7fffec014b00, 
fd=0x7fffe4034070, vector=0x7fffdc445720, count=1, offset=67108863, 
flags=32770, iobref=0x7fffdc00d550, xdata=0x0)
at write-behind.c:1825
#3  0x70b51fcb in du_writev_resume (ret=0, frame=0x7fffdc0305a0, 
opaque=0x7fffdc0305a0) at disk-usage.c:490
#4  0x77b3510d in synctask_wrap () at syncop.c:377
#5  0x760d0660 in ?? () from /lib64/libc.so.6
#6  0x in ?? ()
(gdb) p wb_inode
$1 = (wb_inode_t *) 0x7fffd406b3b0
(gdb) frame 2
#2  0x70d5eda7 in wb_writev (frame=0x7fffe0086290, this=0x7fffec014b00, 
fd=0x7fffe4034070, vector=0x7fffdc445720, count=1, offset=67108863, 
flags=32770, iobref=0x7fffdc00d550, xdata=0x0)
at write-behind.c:1825
1825 in write-behind.c
(gdb) p *fd
$2 = {pid = 18154, flags = 32962, refcount = 0, inode_list = {next = 
0x7fffe4034080, prev = 0x7fffe4034080}, inode = 0x0, lock = {spinlock = 0, 
mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
__nusers = 0, __kind = -1, __spins = 0, __elision = 0, __list = {__prev 
= 0x0, __next = 0x0}}, __size = '\000' , "\377\377\377\377", 
'\000' , __align = 0}},
  _ctx = 0x7fffe4022930, xl_count = 17, lk_ctx = 0x7fffe40350e0, anonymous = 
_gf_false}
(gdb) p fd
$3 = (fd_t *) 0x7fffe4034070

(gdb) p wb_inode->this
$1 = (xlator_t *) 0xff00

After adding test logs I found that the FOP sequence on the write-behind
xlator side was out of order, as shown below. On the FUSE side the FLUSH comes
after write2, but on the WB side the FLUSH lands between write2's
'wb_do_unwinds' and its 'wb_fulfill'.
So I think this is the problem: the FLUSH and the later RELEASE operation can
destroy the fd, which would explain the garbage pointer
'wb_inode->this (0xff00)'. Do you think so?
I also think our newly added disk-usage xlator's synctask_new delays the write
operation, while the FLUSH has no such delay (it does not go through the
disk-usage xlator).

Do you agree with this speculation? And how should we fix it? (We don't want
to move the disk-usage xlator. A sketch of the kind of change we are
considering follows our translator code below.)


Problematic FOP sequence:

    FUSE side:            WB side:

    Write 1               write1
    Write 2               write2 do unwind
    FLUSH                 FLUSH
    Release               Release (destroy fd)
                          write2 (wb_fulfill)  --> then coredump


int
wb_fulfill (wb_inode_t *wb_inode, list_head_t *liabilities)
{
        wb_request_t *req  = NULL;
        wb_request_t *head = NULL;
        wb_request_t *tmp  = NULL;
        wb_conf_t    *conf = NULL;
        off_t         expected_offset = 0;
        size_t        curr_aggregate  = 0;
        size_t        vector_count    = 0;
        int           ret             = 0;

        conf = wb_inode->this->private;   /* <-- this line coredumps */

        list_for_each_entry_safe (req, tmp, liabilities, winds) {
                list_del_init (&req->winds);




67: volume ccs-write-behind
68: type performance/write-behind
69: subvolumes ccs-dht
70: end-volume
71:
72: volume ccs-disk-usage      --> our new xlator, added here for the write
                                   path only; it just checks whether the disk
                                   is full, and uses synctask_new for writes
73: type performance/disk-usage
74: subvolumes ccs-write-behind
75: end-volume
76:
77: volume ccs-read-ahead
78: type performance/read-ahead
79: subvolumes ccs-disk-usage
80: end-volume



PS: part of our new translator code:

int
du_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
           struct iovec *vector, int count, off_t off, uint32_t flags,
           struct iobref *iobref, dict_t *xdata)
{
        int         op_errno = -1;
        int         ret      = -1;
        du_local_t *local    = NULL;
        loc_t       tmp_loc  = {0, };

        VALIDATE_OR_GOTO (frame, err);
        VALIDATE_OR_GOTO (this, err);
        VALIDATE_OR_GOTO (fd, err);

        tmp_loc.gfid[15] = 1;
        tmp_loc.inode    = fd->inode;
        tmp_loc.parent   = fd->inode;

        local = du_local_init (frame, &tmp_loc, fd, GF_FOP_WRITE);
        if (!local) {
                op_errno = ENOMEM;
                goto err;
        }

        local->vector = iov_dup (vector, count);
        local->offset = off;
        local->count  = count;
        local->flags  = flags;
        local->iobref = iobref_ref (iobref);

        ret = synctask_new (this->ctx->env, du_get_du_info,
                            du_writev_resume, frame, frame);
        if (ret) {
                op_errno = -1;
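For illustration, the kind of change we are considering (an untested sketch;
we are assuming here that du_local_init() does not already take its own
reference on the fd, and that a dropped fd reference across the synctask
hand-off is the root cause):

        /* in du_writev(), before handing off to the synctask: pin the fd
         * so a FLUSH/RELEASE that overtakes the delayed write cannot
         * destroy it */
        local->fd = fd_ref (fd);

        ret = synctask_new (this->ctx->env, du_get_du_info,
                            du_writev_resume, frame, frame);

        /* and in the final unwind path of the resumed write: */
        if (local->fd)
                fd_unref (local->fd);   /* drop the reference taken above */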

Re: [Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump

2019-01-29 Thread Li, Deqian (NSB - CN/Hangzhou)
Hi,

Yes, thanks very much for your quick response.


I have attached the whole file; it is not very big.


Br,
Li Deqian

From: Raghavendra Gowdappa 
Sent: Wednesday, January 30, 2019 11:00 AM
To: Li, Deqian (NSB - CN/Hangzhou) 
Cc: gluster-users 
Subject: Re: query about glusterfs 3.12-3 write-behind.c coredump



Re: [Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump

2019-01-29 Thread Raghavendra Gowdappa
On Wed, Jan 30, 2019 at 7:35 AM Li, Deqian (NSB - CN/Hangzhou) <
deqian...@nokia-sbell.com> wrote:

> After adding test logs I found that the FOP sequence on the write-behind
> xlator side was out of order, as shown below. On the FUSE side the FLUSH
> comes after write2, but on the WB side the FLUSH lands between write2's
> 'wb_do_unwinds' and its 'wb_fulfill'.
>
> So I think this is the problem.
>

wb_do_unwinds the write after caching it. Once the write response is
unwound, kernel issues a FLUSH. So, this is a valid sequence of operations
and nothing wrong.
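To illustrate this ordering from the application side (a toy sketch; the
mount path and file name are made up), write() can return as soon as
write-behind caches and unwinds the request, so the close() that follows it
generates the FLUSH (and RELEASE) that may reach write-behind while the
cached write is still being wound in the background:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
        int fd = open ("/mnt/glusterfs/testfile", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) {
                perror ("open");
                return 1;
        }

        /* returns once write-behind has cached and unwound the write */
        if (write (fd, "data", 4) < 0)
                perror ("write");

        /* triggers FUSE FLUSH (and RELEASE after the last close), which
         * can arrive at write-behind before the cached write is fulfilled */
        if (close (fd) < 0)
                perror ("close");

        return 0;
}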

> I think it's possible that the FLUSH and the later RELEASE operation will
> destroy the fd, causing 'wb_inode->this' to become 0xff00. Do
> you think so?
>
> And I think our newly added disk-usage xlator's synctask_new delays the
> write operation, while the FLUSH has no such delay (it does not go through
> the disk-usage xlator).
>

Flush and release wait for completion of any on-going writes. So, unless
write-behind has unwound the write, it won't see a flush. And by the time the
write is unwound, write-behind has made sure to take references on the objects
it uses (fds, iobrefs, etc.). So, I don't see a problem there.
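For reference, a simplified sketch of that mechanism (not the literal
write-behind code; wb_writev_helper here is a placeholder name): the call
stub that write-behind queues for a cached write holds its own references, so
an application-side close cannot free the fd while the request is pending:

        /* when write-behind queues a write, it stores a stub like this: */
        stub = fop_writev_stub (frame, wb_writev_helper, fd, vector, count,
                                offset, flags, iobref, xdata);

        /* fop_writev_stub() takes fd_ref(fd) and iobref_ref(iobref)
         * internally, so the queued request keeps both alive until the
         * stub is resumed and destroyed after the write is wound */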


>
> Do you agree with this speculation? And how should we fix it? (We don't
> want to move the disk-usage xlator.)
>

I've still not found the RCA. We can discuss the fix once the RCA is found.


>
>
>
> Problematic FOP sequence :
>
>
>
> FUSE side: WB side:
>
>
>
> Write 1write1
>
> Write2 do unwind
>
> Write 2FLUSH
>
>   Release(destroy fd)
>
> FLUSHwrite2  (wb_fulfill)  then coredump.
>
> Release
>
>
>
>
>
> int
>
> wb_fulfill (wb_inode_t *wb_inode, list_head_t *liabilities)
>
> {
>
>  wb_request_t  *req = NULL;
>
>  wb_request_t  *head= NULL;
>
>  wb_request_t  *tmp = NULL;
>
>  wb_conf_t *conf= NULL;
>
>  off_t  expected_offset = 0;
>
>  size_t curr_aggregate = 0;
>
>  size_t vector_count = 0;
>
> intret  = 0;
>
>
>
>  conf = wb_inode->this->private;   à this line coredump
>
>
>
>  list_for_each_entry_safe (req, tmp, liabilities, winds) {
>
>   list_del_init (&req->winds);
>
>
>
> ….
>
>
>
>
>
> volume ccs-write-behind
>
> 68: type performance/write-behind
>
> 69: subvolumes ccs-dht
>
> 70: end-volume
>
> 71:
>
> * 72: volume ccs-disk-usage  **à we add a new xlator
> here for write op ,just for checking if disk if full.  And synctask_new for
> write.*
>
> *73: type performance/disk-usage*
>
> *74: subvolumes ccs-write-behind*
>
> *75: end-volume*
>
> 76:
>
>  77: volume ccs-read-ahead
>
> 78: type performance/read-ahead
>
> 79: subvolumes ccs-disk-usage
>
> 80: end-volume
>
>

Re: [Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump

2019-01-29 Thread Raghavendra Gowdappa
On Wed, Jan 30, 2019 at 9:44 AM Raghavendra Gowdappa 
wrote:

>
> wb_do_unwinds the write after caching it.
>

* wb_do_unwinds unwinds the write after caching it.


Re: [Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump

2019-01-29 Thread Li, Deqian (NSB - CN/Hangzhou)
Hi,
>>Once the write response is unwound, kernel issues a FLUSH. So, this is a 
>>valid sequence of operations and nothing wrong.

I need to explain my meaning in more detail.
I added test logs into wb_process_queue() etc. and found the 4 log lines below.
Normally the order should be 1,4,2,3, because 1 and 4 belong to the same
'write'; but here it is 1,2,3,4, which I think is abnormal (maybe caused by
the high CPU load or by the disk-usage xlator).
And I think that after the FLUSH (logs 2 and 3) the fd is destroyed, which
then produces the bad pointer seen in log 4. Do you agree with this
speculation?
Here 'refcount' means fd->refcount.

1. [2019-01-28 04:23:26.751555] E [MSGID: 131007] [write-behind.c:1237:wb_do_unwinds] 0-ldq before do un wind : wb_in->this(0x7fc544014b10),wb_in(0x7fc5140a9b40),fop=WRITE, fd=0x7fc53c054280,refcount:8, threadid:3320

2. [2019-01-28 04:23:26.778756] E [fuse-bridge.c:2562:fuse_flush] 0-glusterfs-fuse: 1674240:ldq FLUSH fd 0x7fc53c054280,threadid:5649

3. [2019-01-28 04:23:26.786131] E [MSGID: 131007] [write-behind.c:1715:wb_do_winds] 0-ldq before do wind : wb_in->this(0x7fc544014b10),wb_in(0x7fc5140a9b40),fop=FLUSH, fd=0x7fc53c054280,refcount:14, threadid:5646

4. [2019-01-28 04:23:26.798515] E [MSGID: 131007] [write-behind.c:1755:wb_process_queue] 0-ldq before fulfill : wb_in->this(0xff00),wb_in(0x7fc5140a9b40), threadid:3320

>>Flush and release wait for completion of any on-going writes. So, unless 
>>write-behind has unwound the write, it won't see a flush. By the time write 
>>is unwound, write-behind makes sure to take references on objects it uses 
>>(like fds, iobref etc).

In wb_process_queue(), we can see that after wb_do_unwinds() it still has to
do wb_do_winds() and wb_fulfill().
Do you think that once a 'write' has executed wb_do_unwinds(), it can wake up
the next FOP in the FUSE queue, such as FLUSH, so that FLUSH finishes its job
and the subsequent RELEASE destroys the fd, and then the thread of the earlier
'write' continues into wb_fulfill() and coredumps?
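To make the suspected interleaving concrete, here is a toy program
(independent of gluster internals; all names are made up for illustration).
It follows the same use-after-free pattern: the "unwind" wakes the releaser,
the releaser frees the object, and the delayed writer thread then touches it.
AddressSanitizer reports it as a heap-use-after-free:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct obj {                      /* stand-in for the fd / wb_inode */
        void *this;
};

static struct obj      *shared;
static pthread_mutex_t  m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   c = PTHREAD_COND_INITIALIZER;
static int              unwound;

static void *
writer (void *arg)
{
        (void)arg;
        /* "wb_do_unwinds": tell the application the write is done */
        pthread_mutex_lock (&m);
        unwound = 1;
        pthread_cond_signal (&c);
        pthread_mutex_unlock (&m);

        usleep (100000);          /* delay, like the synctask hand-off */

        /* "wb_fulfill": the object may already be gone */
        printf ("this = %p\n", shared->this);   /* use-after-free */
        return NULL;
}

static void *
releaser (void *arg)
{
        (void)arg;
        /* "FLUSH/RELEASE": runs as soon as the write is unwound */
        pthread_mutex_lock (&m);
        while (!unwound)
                pthread_cond_wait (&c, &m);
        pthread_mutex_unlock (&m);

        free (shared);            /* destroys what the writer still needs */
        return NULL;
}

int
main (void)
{
        pthread_t w, r;

        shared = calloc (1, sizeof (*shared));
        pthread_create (&w, NULL, writer, NULL);
        pthread_create (&r, NULL, releaser, NULL);
        pthread_join (w, NULL);
        pthread_join (r, NULL);
        return 0;
}

If the writer held a reference across the delay (the fd_ref() idea sketched
earlier in this thread), the free would be deferred until the last unref and
the access would stay valid.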
And do you have more detailed documentation on glusterfs write-behind,
especially on how fd->refcount is increased and decreased?

Thanks.

PS: our debug code is shown below (the added lines are the 'ldq' gf_msg calls).

void
wb_process_queue (wb_inode_t *wb_inode)
{
        list_head_t tasks        = {0, };
        list_head_t lies         = {0, };
        list_head_t liabilities  = {0, };
        int         wind_failure = 0;

        INIT_LIST_HEAD (&tasks);
        INIT_LIST_HEAD (&lies);
        INIT_LIST_HEAD (&liabilities);

        do {
                gf_log_callingfn (wb_inode->this->name, GF_LOG_DEBUG,
                                  "processing queues");

                LOCK (&wb_inode->lock);
                {
                        __wb_preprocess_winds (wb_inode);
                        __wb_pick_winds (wb_inode, &tasks, &liabilities);
                        __wb_pick_unwinds (wb_inode, &lies);
                }
                UNLOCK (&wb_inode->lock);

                wb_do_unwinds (wb_inode, &lies);

                wb_do_winds (wb_inode, &tasks);

                gf_msg ("ldq before fulfill ", GF_LOG_ERROR, 0,
                        WRITE_BEHIND_MSG_RES_UNAVAILABLE,
                        "wb_in->this(%p),wb_in(%p), threadid:%ld",
                        wb_inode->this, wb_inode, syscall (SYS_gettid));

                /* If there is an error in wb_fulfill before winding write
                 * requests, we would miss invocation of wb_process_queue
                 * from wb_fulfill_cbk. So, retry processing again.
                 */
                wind_failure = wb_fulfill (wb_inode, &liabilities);

                gf_msg ("ldq after fulfill ", GF_LOG_ERROR, 0,
                        WRITE_BEHIND_MSG_RES_UNAVAILABLE,
                        "wb_in->this(%p),wb_in(%p), thid:%ld",
                        wb_inode->this, wb_inode, syscall (SYS_gettid));

        } while (wind_failure);

        return;
}

void
wb_do_unwinds (wb_inode_t *wb_inode, list_head_t *lies)
{
        wb_request_t *req   = NULL;
        wb_request_t *tmp   = NULL;
        call_frame_t *frame = NULL;
        struct iatt   buf   = {0, };

        list_for_each_entry_safe (req, tmp, lies, unwinds) {
                frame = req->stub->frame;

                STACK_UNWIND_STRICT (writev, frame, req->op_ret,
                                     req->op_errno, &buf, &buf, NULL); /* :O */
                req->stub->frame = NULL;

                list_del_init (&req->unwinds);

                if (req->fd)
                        gf_msg ("ldq before do un wind ", GF_LOG_ERROR, 0,
                                WRITE_BEHIND_MSG_RES_UNAVAILABLE,
                                "wb_in->this(%p),wb_in(%p),fop=%s, fd=%p,"
                                "refcount:%d, threadid:%ld",
                                wb_inode->this, wb_inode,
                                gf_fop_list[req->fop], req->fd,
                                req->fd->refcount, syscall (SYS_gettid));

                wb_request_unref (req);
        }
}