Re: [Gluster-devel] How does GD_SYNCOP work?

2014-09-15 Thread Krishnan Parthasarathi
Emmanuel,

Pranith works on glustershd, CC'ing him.

~KP

- Original Message -
> Emmanuel Dreyfus  wrote:
> 
> > Here is the problem: once readdir() has reached the end of the
> > directory, on Linux, telldir() will report the last entry's offset,
> > while on NetBSD, it will report an invalid offset (it is in fact the
> > offset of the next entry beyond the last one, which does not exist).
> 
> But that difference did not explain why NetBSD was looping. I discovered
> why.
> 
> Between each index_fill_readdir() invocation, we have a closedir()/opendir()
> invocation. Then index_fill_readdir() calls seekdir() with an offset
> obtained from telldir() on the previously opened/closed DIR *. Offsets
> returned by telldir() are only valid for the lifetime of the DIR * they
> came from [1]. Such a rule makes sense: if the directory content changed,
> we would be likely to return garbage.
> 
> Now if the directory content did not change and we have read everything,
> here is what happens:
>
> On Linux, seekdir() works with the offset obtained from the previous DIR *
> (it does not have to, according to the standards) and goes to the last
> entry. It exits gracefully, returning EOF.
>
> On NetBSD, seekdir() is given the offset from the previous DIR *, which
> points beyond the last entry. It fails and is a no-op. Subsequent
> readdir_r() calls therefore operate from the beginning of the directory,
> and we never get EOF. Here is our infinite loop.
> 
> The correct fix is one of:
>
> 1) keep the directory open between index_fill_readdir() invocations; but
> since that means preserving an open directory across different syncops,
> I am not sure it is a good idea.
>
> 2) do not reuse the offset from the last attempt. That means if the buffer
> gets filled, grow it and retry until all the data fits. This is bad
> performance-wise, but it seems the only safe way to me.
> 
> Opinions?
> 
> 
> [1] http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html
> 
> --
> Emmanuel Dreyfus
> http://hcpnet.free.fr/pubz
> m...@netbsd.org
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How does GD_SYNCOP work?

2014-09-13 Thread Emmanuel Dreyfus
Emmanuel Dreyfus  wrote:

> Here is the problem: once readdir() has reached the end of the
> directory, on Linux, telldir() will report the last entry's offset,
> while on NetBSD, it will report an invalid offset (it is in fact the
> offset of the next entry beyond the last one, which does not exist).

But that difference did not explain why NetBSD was looping. I discovered
why.

Between each index_fill_readdir() invocation, we have a closedir()/opendir()
invocation. Then index_fill_readdir() calls seekdir() with an offset
obtained from telldir() on the previously opened/closed DIR *. Offsets
returned by telldir() are only valid for the lifetime of the DIR * they
came from [1]. Such a rule makes sense: if the directory content changed,
we would be likely to return garbage.

Now if the directory content did not change and we have read everything,
here is what happens:

On Linux, seekdir() works with the offset obtained from the previous DIR *
(it does not have to, according to the standards) and goes to the last
entry. It exits gracefully, returning EOF.

On NetBSD, seekdir() is given the offset from the previous DIR *, which
points beyond the last entry. It fails and is a no-op. Subsequent
readdir_r() calls therefore operate from the beginning of the directory,
and we never get EOF. Here is our infinite loop.
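
To make the pattern concrete, here is a minimal standalone sketch (my own
illustration, not the glusterfs code) of what the closedir()/opendir() plus
stale-offset sequence boils down to:

/* telldir-reuse.c: reuse a telldir() offset across a closedir()/opendir()
 * pair.  POSIX only guarantees the offset for the DIR * it came from, so
 * the second scan below is undefined; Linux happens to land at EOF again,
 * while NetBSD restarts from the top, which is the infinite loop described
 * above. */
#include <dirent.h>
#include <stdio.h>

int
main(void)
{
        DIR *dir = opendir(".");
        struct dirent *de;
        long off;

        if (dir == NULL)
                return 1;

        /* Read to end-of-directory, then remember the final offset. */
        while (readdir(dir) != NULL)
                ;
        off = telldir(dir);     /* Linux: last entry; NetBSD: one past it */
        closedir(dir);

        /* Reopen and seek with the stale offset. */
        dir = opendir(".");
        if (dir == NULL)
                return 1;
        seekdir(dir, off);

        /* Linux typically prints nothing here; NetBSD prints every entry
         * again because the failed seekdir() left the stream at the start. */
        while ((de = readdir(dir)) != NULL)
                printf("%s\n", de->d_name);

        closedir(dir);
        return 0;
}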

The correct fix is one of:

1) keep the directory open between index_fill_readdir() invocations; but
since that means preserving an open directory across different syncops,
I am not sure it is a good idea.

2) do not reuse the offset from the last attempt. That means if the buffer
gets filled, grow it and retry until all the data fits. This is bad
performance-wise, but it seems the only safe way to me (rough sketch below).
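
The sketch below shows option 2 in plain POSIX C, not against the real
index_fill_readdir() signature (read_whole_dir() and its interface are made
up for illustration only). The point is just that no telldir() offset
survives between attempts:

#include <dirent.h>
#include <stdlib.h>
#include <string.h>

/* Free a partially filled name array. */
static void
free_names(char **names, size_t n)
{
        while (n > 0)
                free(names[--n]);
        free(names);
}

/* Read every entry name of 'path' into a freshly allocated array.  If the
 * buffer turns out to be too small, double it and rescan the directory
 * from the start instead of seeking back to a stale offset. */
static int
read_whole_dir(const char *path, char ***names_out, size_t *count_out)
{
        size_t cap = 64;

        for (;;) {
                DIR *dir = opendir(path);
                struct dirent *de;
                char **names;
                size_t n = 0;

                if (dir == NULL)
                        return -1;

                names = calloc(cap, sizeof(*names));
                if (names == NULL) {
                        closedir(dir);
                        return -1;
                }

                while ((de = readdir(dir)) != NULL && n < cap) {
                        names[n] = strdup(de->d_name);
                        if (names[n] == NULL) {
                                free_names(names, n);
                                closedir(dir);
                                return -1;
                        }
                        n++;
                }
                closedir(dir);

                if (de == NULL) {               /* reached EOF: done */
                        *names_out = names;
                        *count_out = n;
                        return 0;
                }
                free_names(names, n);           /* too small: grow and rescan */
                cap *= 2;
        }
}

It costs a full rescan every time the buffer is too small, which is the
performance penalty mentioned above, but it never depends on seekdir()
honouring an offset from a dead DIR *.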

Opinions?


[1] http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How does GD_SYNCOP work?

2014-09-12 Thread Emmanuel Dreyfus
Emmanuel Dreyfus  wrote:

> I tracked down most of the problem. The request to glustershd times out
> before the reply comes, because glustershd gets stuck in an infinite loop.
> 
> In afr_shd_gather_index_entries(), the obtained offset is corrupted
> (huge negative value), and the loop never ends.

Here is the problem: once readdir() has reached the end of the
directory, on Linux, telldir() will report the last entry's offset,
while on NetBSD, it will report an invalid offset (it is in fact the
offset of the next entry beyond the last one, which does not exist).

The patch below breaks the infinite loop and lets NetBSD pass
tests/basic/self-heald.t, but I am not sure it is correct in the general
case. I suspect it breaks if index_fill_readdir() is called multiple
times, which may happen for a large directory. I think the exit condition
should be handled better, but I have yet to find out how. Input appreciated.

diff --git a/xlators/features/index/src/index.c
b/xlators/features/index/src/index.c
index 2b80e71..1150380 100644
--- a/xlators/features/index/src/index.c
+++ b/xlators/features/index/src/index.c
@@ -284,6 +284,13 @@ index_fill_readdir (fd_t *fd, DIR *dir, off_t off,
 if (!off) {
 rewinddir (dir);
 } else {
+#ifdef __NetBSD__
+   if (off > telldir(dir)) {
+   errno = ENOENT;
+   count = 0;
+   goto out;
+   }
+#endif
 seekdir (dir, off);
 }

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How does GD_SYNCOP work?

2014-09-12 Thread Emmanuel Dreyfus
Emmanuel Dreyfus  wrote:

> I will not look for why this offset is corrupted.

s/not/now/ of course...

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How does GD_SYNCOP work?

2014-09-12 Thread Emmanuel Dreyfus
On Fri, Sep 12, 2014 at 06:31:55AM +0200, Emmanuel Dreyfus wrote:
> It is fine by me that glusterd_bricks_select_heal_volume() finds 3
> bricks; they are the 3 remaining alive bricks. However, I am surprised to
> see the first in the list having rpc->conn.name = "management". It
> should be a brick name here, right? Or is this glustershd?

Reading the code, it has to be glustershd.

I tracked down most of the problem. The request to glustershd times out
before the reply comes, because glustershd gets stuck in an infinite loop.

In afr_shd_gather_index_entries(), the obtained offset is corrupted
(huge negative value), and the loop never ends.

I will not look for why this offset is corrupted.

-- 
Emmanuel Dreyfus
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How does GD_SYNCOP work?

2014-09-11 Thread Emmanuel Dreyfus
Krishnan Parthasarathi  wrote:

> If you left the hung setup for over ten minutes from the time the bricks
> went down, you should see logs corresponding to one of the above two
> mechanisms in action. Let me know if you don't. Then we need to
> investigate further.

Yes, it does die miserably after 10 minutes :-)

I added some debug printfs to see what was hanging. Here is the path
glusterd takes when receiving gluster volume heal info:

gd_brick_op_phase
  glusterd_volinfo_find 
glusterd_bricks_select_heal_volume -> rxlator_count = 3
  glusterd_syncop_aggr_rsp_dict
  list_for_each_entry (pending_node, &selected, list) {
First in list is rpc->conn.name = "management"
gd_syncop_mgmt_brick_op
   glusterd_brick_op_build_payload
   GD_SYNCOP -> never resume
  } 

It is fine by me that glusterd_bricks_select_heal_volume() finds 3
bricks; they are the 3 remaining alive bricks. However, I am surprised to
see the first in the list having rpc->conn.name = "management". It
should be a brick name here, right? Or is this glustershd?

The logs give a hint about GD_SYNCOP not returning:

[2014-09-12 04:19:35.266126] I [socket.c:3277:socket_submit_reply]
0-socket.management: not connected (priv->connected = -1)
[2014-09-12 04:19:35.266139] E [rpcsvc.c:1249:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0x1, Program: GlusterD svc
cli, ProgVers: 2, Proc: 31) to rpc-transport (socket.management)

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How does GD_SYNCOP work?

2014-09-11 Thread Krishnan Parthasarathi

- Original Message -
> Krishnan Parthasarathi  wrote:
> 
> > The scheduling of a paused task happens when the epoll thread receives a
> > POLLIN event along with the response from the remote endpoint. This is
> > contingent on the fact that the callback must issue a synctask_wake,
> > which will trigger the resumption of the task (in one of the threads from
> > the syncenv). In summary, the callback code triggers the scheduling back
> > of the paused task.
> 
> Right, this seems to work. I found the __wake() call at the end of
> _gd_syncop_brick_op_cbk() and it is executed.  The problem is therefore
> not there.
> 
> I tried running the test steps one by one. The offending command is
> "gluster volume heal $V0 info", hence I ran it between each step.
>
> It works at the beginning, it works if I kill 3 out of 6 bricks, and it
> hangs after I created files in the volume (with 3 out of 6 bricks down).
> 
> And at that time, the bricks that are still up show this in the logs:
> 
> [2014-09-11 17:47:31.452067] I [server.c:518:server_rpc_notify]
> 0-patchy-server: disconnecting connection from
> netbsd0.cloud.gluster.org-24431-2014/09/11-17:40:47:719843-patchy-client
> -1-0-0
> [2014-09-11 17:47:31.452142] I [server-helpers.c:290:do_fd_cleanup]
> 0-patchy-server: fd cleanup on /a/a/a/a/a/a/a/a/a/a
> [2014-09-11 17:47:31.452689] I [client_t.c:417:gf_client_unref]
> 0-patchy-server: Shutting down connection
> netbsd0.cloud.gluster.org-24431-2014/09/11-17:40:47:719843-patchy-client
> -1-0-0
> [2014-09-11 17:47:31.455145] I [server.c:518:server_rpc_notify]
> 0-patchy-server: disconnecting connection from
> netbsd0.cloud.gluster.org-3612-2014/09/11-17:40:28:979958-patchy-client-
> 1-0-0
> [2014-09-11 17:47:31.455172] I [client_t.c:417:gf_client_unref]
> 0-patchy-server: Shutting down connection
> netbsd0.cloud.gluster.org-3612-2014/09/11-17:40:28:979958-patchy-client-
> 1-0-0
> [2014-09-11 17:47:31.455208] I [server.c:518:server_rpc_notify]
> 0-patchy-server: disconnecting connection from
> netbsd0.cloud.gluster.org-26218-2014/09/11-17:40:28:900316-patchy-client
> -1-0-0
> [2014-09-11 17:47:31.455230] I [client_t.c:417:gf_client_unref]
> 0-patchy-server: Shutting down connection
> netbsd0.cloud.gluster.org-26218-2014/09/11-17:40:28:900316-patchy-client
> -1-0-0
> 
> If I understood correctly, gluster volume heal info causes glusterd to
> send requests to the bricks that are alive. If they go offline at that
> time, it may explain why the command hangs. What is the correct behavior here?

How were the bricks brought down? 

If glusterd doesn't get a POLLERR (for every brick that went down) after the
bricks went down, then the ping timer mechanism in glusterd should kick in
and 'abort' the RPC. If that didn't fire for some reason (could be a bug),
the frame-timeout for glusterd RPCs should kick in and the frame
corresponding to the RPC should 'bail'. If you left the hung setup for over
ten minutes from the time the bricks went down, you should see logs
corresponding to one of the above two mechanisms in action. Let me know if
you don't. Then we need to investigate further.

~KP
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How does GD_SYNCOP work?

2014-09-11 Thread Emmanuel Dreyfus
Krishnan Parthasarathi  wrote:

> The scheduling of a paused task happens when the epoll thread receives a
> POLLIN event along with the response from the remote endpoint. This is
> contingent on the fact that the callback must issue a synctask_wake,
> which will trigger the resumption of the task (in one of the threads from
> the syncenv). In summary, the callback code triggers the scheduling back
> of the paused task.

Right, this seems to work. I found the __wake() call at the end of
_gd_syncop_brick_op_cbk() and it is executed.  The problem is therefore
not there.

I tried running the test steps one by one. The offending command is
"gluster volume heal $V0 info", hence I ran it between each step.

It works at the beginning, it works if I kill 3 out of 6 bricks, and it
hangs after I created files in the volume (with 3 out of 6 bricks down).

And at that time, the bricks that are still up show this in the logs:

[2014-09-11 17:47:31.452067] I [server.c:518:server_rpc_notify]
0-patchy-server: disconnecting connection from
netbsd0.cloud.gluster.org-24431-2014/09/11-17:40:47:719843-patchy-client
-1-0-0
[2014-09-11 17:47:31.452142] I [server-helpers.c:290:do_fd_cleanup]
0-patchy-server: fd cleanup on /a/a/a/a/a/a/a/a/a/a
[2014-09-11 17:47:31.452689] I [client_t.c:417:gf_client_unref]
0-patchy-server: Shutting down connection
netbsd0.cloud.gluster.org-24431-2014/09/11-17:40:47:719843-patchy-client
-1-0-0
[2014-09-11 17:47:31.455145] I [server.c:518:server_rpc_notify]
0-patchy-server: disconnecting connection from
netbsd0.cloud.gluster.org-3612-2014/09/11-17:40:28:979958-patchy-client-
1-0-0
[2014-09-11 17:47:31.455172] I [client_t.c:417:gf_client_unref]
0-patchy-server: Shutting down connection
netbsd0.cloud.gluster.org-3612-2014/09/11-17:40:28:979958-patchy-client-
1-0-0
[2014-09-11 17:47:31.455208] I [server.c:518:server_rpc_notify]
0-patchy-server: disconnecting connection from
netbsd0.cloud.gluster.org-26218-2014/09/11-17:40:28:900316-patchy-client
-1-0-0
[2014-09-11 17:47:31.455230] I [client_t.c:417:gf_client_unref]
0-patchy-server: Shutting down connection
netbsd0.cloud.gluster.org-26218-2014/09/11-17:40:28:900316-patchy-client
-1-0-0

If I understood correctly, gluster volume heal info causes glusterd to
send requests to the bricks that are alive. If they go offline at that
time, it may explain why the command hangs. What is the correct behavior here?

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How does GD_SYNCOP work?

2014-09-11 Thread Emmanuel Dreyfus
On Thu, Sep 11, 2014 at 02:00:08PM +0530, Santosh Pradhan wrote:
> Does this whole framework work with the poll() interface (not the epoll()
> interface)? If yes, it should work on BSD flavours. If it needs epoll(),
> then it may not work on BSD flavours, which do not have epoll(), as
> Emmanuel pointed out. They do have a robust kqueue(), though, which may
> need some work to use.

If you have a Linux build at hand, it should be easy to test: rebuild
with configure --disable-epoll and try to run self_heald.t

-- 
Emmanuel Dreyfus
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How does GD_SYNCOP work?

2014-09-11 Thread Santosh Pradhan


On 09/11/2014 10:11 AM, Krishnan Parthasarathi wrote:

Emmanuel,

The scheduling of a paused task happens when the epoll thread receives a
POLLIN event along with the response from the remote endpoint. This is
contingent on the fact that the callback must issue a synctask_wake, which
will trigger the resumption of the task (in one of the threads from the
syncenv). In summary, the callback code triggers the scheduling back of the
paused task.


Does this whole framework work with the poll() interface (not the epoll()
interface)? If yes, it should work on BSD flavours. If it needs epoll(),
then it may not work on BSD flavours, which do not have epoll(), as
Emmanuel pointed out. They do have a robust kqueue(), though, which may
need some work to use.


BR,
Santosh



HTH,
KP

- Original Message -

On Wed, Sep 10, 2014 at 05:32:41AM -0400, Krishnan Parthasarathi wrote:

Let me try to explain how GD_SYNCOP works. Internally, GD_SYNCOP yields the
thread that was executing the (sync)task once the RPC request is submitted
(asynchronously) to the remote endpoint. It's equivalent to pausing the task
until the response is received. The callback function, which generally
executes in the epoll thread, wakes the corresponding task into execution
(i.e. resumes task execution).

I suspect this is the problem: the task is not scheduled. NetBSD uses poll
and not epoll, which may explain the problem. Where does the task scheduling
happen in the epoll code?
--
Emmanuel Dreyfus
m...@netbsd.org


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How does GD_SYNCOP work?

2014-09-10 Thread Krishnan Parthasarathi
Emmanuel,

The scheduling of a paused task happens when the epoll thread receives a
POLLIN event along with the response from the remote endpoint. This is
contingent on the fact that the callback must issue a synctask_wake, which
will trigger the resumption of the task (in one of the threads from the
syncenv). In summary, the callback code triggers the scheduling back of the
paused task.

HTH,
KP

- Original Message -
> On Wed, Sep 10, 2014 at 05:32:41AM -0400, Krishnan Parthasarathi wrote:
> > Let me try to explain how GD_SYNCOP works. Internally, GD_SYNCOP yields
> > the thread that was executing the (sync)task once the RPC request is
> > submitted (asynchronously) to the remote endpoint. It's equivalent to
> > pausing the task until the response is received. The callback function,
> > which generally executes in the epoll thread, wakes the corresponding
> > task into execution (i.e. resumes task execution).
> 
> I suspect this is the problem: the task is not scheduled. NetBSD uses poll
> and not epoll, which may explain the problem. Where does the task
> scheduling happen in the epoll code?
> --
> Emmanuel Dreyfus
> m...@netbsd.org
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How does GD_SYNCOP work?

2014-09-10 Thread Emmanuel Dreyfus
On Wed, Sep 10, 2014 at 05:32:41AM -0400, Krishnan Parthasarathi wrote:
> Let me try to explain how GD_SYNCOP works. Internally, GD_SYNCOP yields
> the thread that was executing the (sync)task once the RPC request is
> submitted (asynchronously) to the remote endpoint. It's equivalent to
> pausing the task until the response is received. The callback function,
> which generally executes in the epoll thread, wakes the corresponding task
> into execution (i.e. resumes task execution).

I suspect this is the problem: the task is not scheduled. NetBSD uses poll
and not epoll, which may explain the problem. Where does the task scheduling
happen in the epoll code?
-- 
Emmanuel Dreyfus
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How does GD_SYNCOP work?

2014-09-10 Thread Krishnan Parthasarathi
Emmanuel,

I am not sure why the glustershd process is not replying to the 'brick op'
RPC sent from glusterd. That is something that we need to identify.

Let me try to explain how GD_SYNCOP works. Internally, GD_SYNCOP yields the
thread that was executing the (sync)task once the RPC request is submitted
(asynchronously) to the remote endpoint. It's equivalent to pausing the task
until the response is received. The callback function, which generally
executes in the epoll thread, wakes the corresponding task into execution
(i.e. resumes task execution). If the remote endpoint doesn't reply for
longer than the frame-timeout, which defaults to 10 minutes in glusterd, the
callback is invoked (in the timer thread), which would call the wake and
resume the task to completion, albeit with failure.
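
To restate the mechanism in code, here is a rough analogy using a pthread
condition variable instead of the real synctask machinery (nothing below is
glusterfs code; the names are made up and it only models the handshake
described above). It also shows why the args context has to stay alive until
the callback has fired:

#include <pthread.h>

struct syncargs_sketch {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             done;
        int             op_ret;        /* filled in by the callback */
};

/* What the callback conceptually does (normally from the epoll/poll
 * thread): store the result and wake the paused task. */
static void
brick_op_cbk_sketch(struct syncargs_sketch *args, int op_ret)
{
        pthread_mutex_lock(&args->lock);
        args->op_ret = op_ret;
        args->done = 1;
        pthread_cond_signal(&args->cond);       /* the "synctask_wake" */
        pthread_mutex_unlock(&args->lock);
}

/* What GD_SYNCOP conceptually does: submit the RPC asynchronously, then
 * pause until the callback wakes us.  'args' lives in this stack frame,
 * which is why waking the task early (or returning before the reply
 * arrives) leaves the callback writing into a dead stack. */
static int
gd_syncop_sketch(void (*submit)(struct syncargs_sketch *))
{
        struct syncargs_sketch args = {
                .lock = PTHREAD_MUTEX_INITIALIZER,
                .cond = PTHREAD_COND_INITIALIZER,
                .done = 0,
        };

        submit(&args);                          /* async RPC submission */

        pthread_mutex_lock(&args.lock);
        while (!args.done)                      /* the "synctask_yield" */
                pthread_cond_wait(&args.cond, &args.lock);
        pthread_mutex_unlock(&args.lock);

        return args.op_ret;
}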

Hope that helps.

~KP

- Original Message -
> Hi
> 
> I am tracking a bug that appears when running self_heald.t on NetBSD.
> The test will hang on:
> EXPECT "$HEAL_FILES" afr_get_pending_heal_count $V0
> 
> The problem inside afr_get_pending_heal_count is when calling
>    gluster volume heal $vol info
> 
> The command will never return. By adding a lot of printfs, I
> tracked down the problem to GD_SYNCOP() when called through
> gd_syncop_mgmt_brick_op().
> 
> In GD_SYNCOP(), once gd_syncop_submit_request() has been called
> successfully, we call synctask_yield() to wait for the reply. The reply
> never comes: _gd_syncop_brick_op_cbk() is not called.
> 
> I suspect this is a synctask_wake() problem somewhere. If I
> add synctask_wake() before synctask_yield() in GD_SYNCOP(),
> the current task is scheduled immediately, gd_syncop_mgmt_brick_op()
> exits, and only later is _gd_syncop_brick_op_cbk() invoked. Of course
> it will crash, because the context (args) was allocated on the
> stack in gd_syncop_mgmt_brick_op().
> 
> Anyone has an idea of what is going on?
> 
> 
> --
> Emmanuel Dreyfus
> m...@netbsd.org
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] How does GD_SYNCOP work?

2014-09-10 Thread Emmanuel Dreyfus
Hi

I am tracking a bug that appears when running self_heald.t on NetBSD.
The test will hang on:
EXPECT "$HEAL_FILES" afr_get_pending_heal_count $V0

The problem inside afr_get_pending_heal_count is when calling
    gluster volume heal $vol info

The command will never return. By adding a lot of printfs, I
tracked down the problem to GD_SYNCOP() when called through
gd_syncop_mgmt_brick_op().

In GD_SYNCOP(), once gd_syncop_submit_request() has been called
successfully, we call synctask_yield() to wait for the reply. The reply
never comes: _gd_syncop_brick_op_cbk() is not called.

I suspect this is a synctask_wake() problem somewhere. If I
add synctask_wake() before synctask_yield() in GD_SYNCOP(),
the current task is scheduled immediately, gd_syncop_mgmt_brick_op()
exits, and only later is _gd_syncop_brick_op_cbk() invoked. Of course
it will crash, because the context (args) was allocated on the
stack in gd_syncop_mgmt_brick_op().

Anyone has an idea of what is going on?


-- 
Emmanuel Dreyfus
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel