Re: [Gluster-devel] How to fix wrong telldir/seekdir usage

2014-09-13 Thread Emmanuel Dreyfus
Pranith Kumar Karampuri  wrote:

> Just to make sure I understand the problem, the issue is happening 
> because self-heal-daemon uses anonymous fds to perform readdirs? i.e.
> there is no explicit opendir on the directory. Every time there is a 
> readdir it may lead to opendir/seekdir/readdir/closedir. Did I get that
> right?

Yes, on the brick, it happens in xlator/features/index. 

> I believe posix xlator doesn't have this problem for non-anonymous fds
> where the DIR* stream is open till the final unref on the fd.

Then perhaps the solution is to change xlator/features/index behavior to
match xlator/storage/posix? There is also
xlator/features/snapview-server that may be affected.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How to fix wrong telldir/seekdir usage

2014-09-13 Thread Pranith Kumar Karampuri


On 09/14/2014 12:32 AM, Emmanuel Dreyfus wrote:

In <1lrx1si.n8tms1igmi5pm%m...@netbsd.org> I explained why NetBSD
currently fails self-heald.t, but since the subject is buried deep in a
thread, it might be worth starting a new one to talk about how to fix it.

In three places within the glusterfs code (features/index,
features/snapview-server and storage/posix), a server component answers
readdir requests on a directory, and the read may be split into multiple calls.

To answer one call, we have the following library calls:
- opendir()
- seekdir() to resume where the previous request stopped
- readdir()
- telldir() to record where we are for the next request
- closedir()

This relies on unspecified behavior, as POSIX says: "The value of loc
should have been returned from an earlier call to telldir() using the
same directory stream."
http://pubs.opengroup.org/onlinepubs/9699919799/functions/seekdir.html

Since we opendir() and closedir() each time, we do not use the
same directory stream. This causes an infinite loop on NetBSD because the
code fails to resume correctly from the previous request, and in the
general case it will break badly if an entry is added to the directory
between two requests.
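
To make the pattern concrete, here is a minimal sketch of the sequence
described above; the names (serve_one_chunk, saved_loc) are hypothetical
illustrations, not the actual xlator code:

    /* saved_loc survives across requests; the DIR * does not, which is
     * exactly the unspecified behavior POSIX warns about. */
    #include <dirent.h>

    static long saved_loc;

    static void
    serve_one_chunk (const char *path, int max_entries)
    {
            DIR *dp = opendir (path);
            struct dirent *de;
            int n = 0;

            if (dp == NULL)
                    return;
            if (saved_loc != 0)
                    seekdir (dp, saved_loc); /* loc came from another DIR * */

            while (n < max_entries && (de = readdir (dp)) != NULL)
                    n++;                     /* real code marshals entries */

            saved_loc = telldir (dp);        /* recorded for the next call */
            closedir (dp);                   /* the stream that produced
                                                loc dies here */
    }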

How can we fix that?

1) we can keep the directory stream open. The change is intrusive, since
we will need a linked list of open contexts, and we need to clean them
up if they time out.

2) in order to keep state between requests, we can use the entry index
(first encountered is 1, and so on) instead of values returned by
telldir(). That works around the unspecified behavior, but it still
breaks if the directory content changes between two requests; a sketch
follows after option 3.

3) make sure the readdir is done in a single request. That means trying
with bigger buffers until it works. For instance, in
xlator/cluster/afr/src/afr-self-heald.c we have:
while ((ret = syncop_readdir (subvol, fd, 131072, offset, &entries)))

We would use -1 instead of 131072 to tell that we want everything
without a size limit, and the server component (here features/index)
would either return everything or fail, without playing with
telldir/seekdir.
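
For comparison, here is a minimal sketch of option 2, resuming by entry
index; the names are illustrative, not the real features/index code, and
as noted it still breaks if the directory changes between requests:

    #include <dirent.h>
    #include <stdio.h>

    /* Emit up to 'count' names, skipping the first 'resume' entries.
     * Returns the number of entries emitted; the caller passes back
     * resume + that value on the next request. */
    static int
    readdir_from_index (const char *path, long resume, int count)
    {
            DIR *dp = opendir (path);
            struct dirent *de;
            long seen = 0;
            int emitted = 0;

            if (dp == NULL)
                    return -1;
            while (emitted < count && (de = readdir (dp)) != NULL) {
                    if (seen++ < resume)
                            continue;        /* already returned earlier */
                    printf ("%s\n", de->d_name);
                    emitted++;
            }
            closedir (dp);
            return emitted;
    }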

Opinions? The third solution seems the best to me since it is not very
intrusive and it makes things simpler. Indeed, we allow an unbounded amount
of data to come back from the brick to glustershd, but we trust the brick,
right?
I saw cases where the number of entries in the index directory was close 
to 10. If the brick process tries to read everything, it will be OOM 
killed by the kernel.


Just to make sure I understand the problem, the issue is happening 
because self-heal-daemon uses anonymous fds to perform readdirs? i.e. 
there is no explicit opendir on the directory. Every time there is a 
readdir it may lead to opendir/seekdir/readdir/closedir. Did I get that 
right? I believe posix xlator doesn't have this problem for 
non-anonymous fds where the DIR* stream is open till the final unref on 
the fd.


Pranith




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How to fix wrong telldir/seekdir usage

2014-09-13 Thread Anand Avati
How does the NetBSD NFS server provide stable directory offsets for the
NFS client to resume reading from at a later point in time? Very similar
problems are present in that scenario, and it might be helpful to see what
approaches are taken there (which are probably more tried and tested).

Thanks

On Sat, Sep 13, 2014 at 12:02 PM, Emmanuel Dreyfus  wrote:

> In <1lrx1si.n8tms1igmi5pm%m...@netbsd.org> I explained why NetBSD
> currently fails self-heald.t, but since the subject is buried deep in a
> thread, it might be worth starting a new one to talk about how to fix it.
>
> In three places within the glusterfs code (features/index,
> features/snapview-server and storage/posix), a server component answers
> readdir requests on a directory, and the read may be split into multiple calls.
>
> To answer one call, we have the following library calls:
> - opendir()
> - seekdir() to resume where the previous request stopped
> - readdir()
> - telldir() to record where we are for the next request
> - closedir()
>
> This relies on unspecified behavior, as POSIX says: "The value of loc
> should have been returned from an earlier call to telldir() using the
> same directory stream."
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/seekdir.html
>
> Since we opendir() and closedir() each time, we do not use the
> same directory stream. This causes an infinite loop on NetBSD because the
> code fails to resume correctly from the previous request, and in the
> general case it will break badly if an entry is added to the directory
> between two requests.
>
> How can we fix that?
>
> 1) we can keep the directory stream open. The change is intrusive, since
> we will need a linked list of open contexts, and we need to clean them
> up if they time out.
>
> 2) in order to keep state between requests, we can use the entry index
> (first encountered is 1, and so on) instead of values returned by
> telldir(). That works around the unspecified behavior, but it still
> breaks if the directory content changes between two requests.
>
> 3) make sure the readdir is done in a single request. That means trying
> with bigger buffers until it works. For instance, in
> xlator/cluster/afr/src/afr-self-heald.c we have:
>    while ((ret = syncop_readdir (subvol, fd, 131072, offset, &entries)))
>
> We would use -1 instead of 131072 to tell that we want everything
> without a size limit, and the server component (here features/index)
> would either return everything or fail, without playing with
> telldir/seekdir.
>
> Opinions? The third solution seems the best to me since it is not very
> intrusive and it makes things simpler. Indeed, we allow an unbounded amount
> of data to come back from the brick to glustershd, but we trust the brick,
> right?
>
> --
> Emmanuel Dreyfus
> http://hcpnet.free.fr/pubz
> m...@netbsd.org
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How to fix wrong telldir/seekdir usage

2014-09-13 Thread Joe Julian
Personally, I like the third option, provided it doesn't cause memory issues.

In fact, read the whole thing, transfer it to the client, and let the client 
handle the POSIX syntax.

Optionally, add a client-side path cache with a timeout that stores the 
directory listing for a period of time to mitigate the "php" dilemma for 
those types of use cases.
 

On September 13, 2014 12:02:55 PM PDT, m...@netbsd.org wrote:
>In <1lrx1si.n8tms1igmi5pm%m...@netbsd.org> I explained why NetBSD
>currently fails self-heald.t, but since the subject is buried deep in a
>thread, it might be worth starting a new one to talk about how to fix it.
>
>In three places within the glusterfs code (features/index,
>features/snapview-server and storage/posix), a server component answers
>readdir requests on a directory, and the read may be split into multiple calls.
>
>To answer one call, we have the following library calls:
>- opendir()
>- seekdir() to resume where the previous request stopped
>- readdir()
>- telldir() to record where we are for the next request
>- closedir()
>
>This relies on unspecified behavior, as POSIX says: "The value of loc
>should have been returned from an earlier call to telldir() using the
>same directory stream."
>http://pubs.opengroup.org/onlinepubs/9699919799/functions/seekdir.html
>
>Since we opendir() and closedir() each time, we do not use the
>same directory stream. This causes an infinite loop on NetBSD because the
>code fails to resume correctly from the previous request, and in the
>general case it will break badly if an entry is added to the directory
>between two requests.
>
>How can we fix that?
>
>1) we can keep the directory stream open. The change is intrusive, since
>we will need a linked list of open contexts, and we need to clean them
>up if they time out.
>
>2) in order to keep state between requests, we can use the entry index
>(first encountered is 1, and so on) instead of values returned by
>telldir(). That works around the unspecified behavior, but it still
>breaks if the directory content changes between two requests.
>
>3) make sure the readdir is done in a single request. That means trying
>with bigger buffers until it works. For instance, in
>xlator/cluster/afr/src/afr-self-heald.c we have:
>  while ((ret = syncop_readdir (subvol, fd, 131072, offset, &entries)))
>
>We would use -1 instead of 131072 to tell that we want everything
>without a size limit, and the server component (here features/index)
>would either return everything or fail, without playing with
>telldir/seekdir.
>
>Opinions? The third solution seems the best to me since it is not very
>intrusive and it makes things simpler. Indeed, we allow an unbounded amount
>of data to come back from the brick to glustershd, but we trust the brick,
>right?
>
>-- 
>Emmanuel Dreyfus
>http://hcpnet.free.fr/pubz
>m...@netbsd.org
>___
>Gluster-devel mailing list
>Gluster-devel@gluster.org
>http://supercolony.gluster.org/mailman/listinfo/gluster-devel

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] How to fix wrong telldir/seekdir usage

2014-09-13 Thread Emmanuel Dreyfus
In <1lrx1si.n8tms1igmi5pm%m...@netbsd.org> I explained why NetBSD
currently fails self-heald.t, but since the subject is buried deep in a
thread, it might be worth starting a new one to talk about how to fix it.

In three places within the glusterfs code (features/index,
features/snapview-server and storage/posix), a server component answers
readdir requests on a directory, and the read may be split into multiple calls.

To answer one call, we have the following library calls:
- opendir()
- seekdir() to resume where the previous request stopped
- readdir()
- telldir() to record where we are for the next request
- closedir()

This relies on unspecified behavior, as POSIX says: "The value of loc
should have been returned from an earlier call to telldir() using the
same directory stream."
http://pubs.opengroup.org/onlinepubs/9699919799/functions/seekdir.html

Since we opendir() and closedir() each time, we do not use the
same directory stream. This causes an infinite loop on NetBSD because the
code fails to resume correctly from the previous request, and in the
general case it will break badly if an entry is added to the directory
between two requests.

How can we fix that?

1) we can keep the directory stream open. The change is intrusive, since
we will need a linked list of open contexts, and we need to clean them
up if they time out.

2) in order to keep state between requests, we can use the entry index
(first encountered is 1, and so on) instead of values returned by
telldir(). That works around the unspecified behavior, but it still
breaks if the directory content changes between two requests.

3) make sure the readdir is done in a single request. That means trying
with bigger buffers until it works. For instance, in
xlator/cluster/afr/src/afr-self-heald.c we have:
   while ((ret = syncop_readdir (subvol, fd, 131072, offset, &entries)))

We would use -1 instead of 131072 to tell that we want everything
without a size limit, and the server component (here features/index)
would either return everything or fail, without playing with
telldir/seekdir.

Opinions? The third solution seems the best to me since it is not very
intrusive and it makes things simpler. Indeed, we allow an unbounded amount
of data to come back from the brick to glustershd, but we trust the brick,
right?

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] Proposal for GlusterD-2.0

2014-09-13 Thread Prasad, Nirmal
"it also has Zookeeper support etc." - just to correct this and remove the 
perception that LogCabin somehow requires Zookeeper or works with it.

LogCabin, as I understand it, is a C++ implementation of a small store based 
on the Raft consensus protocol, providing a consistent, small distributed 
store.

The Zookeeper part is a RAMCloud thing for storing coordinator information 
in an external cluster; it does not come with LogCabin.

-Original Message-
From: gluster-users-boun...@gluster.org 
[mailto:gluster-users-boun...@gluster.org] On Behalf Of Prasad, Nirmal
Sent: Friday, September 12, 2014 5:58 PM
To: James; Krishnan Parthasarathi
Cc: Balamurugan Arumugam; gluster-us...@gluster.org; Gluster Devel
Subject: Re: [Gluster-users] [Gluster-devel] Proposal for GlusterD-2.0

Has anyone looked into whether LogCabin can provide the consistent small 
storage based on Raft for Gluster?

https://github.com/logcabin/logcabin

I have no experience with using it so I cannot say if it is good or suitable.

I do know the following project uses it, and it's just not as easy to set up 
as Gluster is; it also has Zookeeper support etc. 

https://ramcloud.atlassian.net/wiki/display/RAM/RAMCloud

-Original Message-
From: gluster-users-boun...@gluster.org 
[mailto:gluster-users-boun...@gluster.org] On Behalf Of James
Sent: Friday, September 12, 2014 4:17 AM
To: Krishnan Parthasarathi
Cc: Gluster Devel; gluster-us...@gluster.org; Balamurugan Arumugam
Subject: Re: [Gluster-users] [Gluster-devel] Proposal for GlusterD-2.0

On Fri, Sep 12, 2014 at 12:02 AM, Krishnan Parthasarathi  
wrote:
> - Original Message -
>> On Thu, Sep 11, 2014 at 4:55 AM, Krishnan Parthasarathi 
>>  wrote:
>> >
>> > I think using Salt as the orchestration framework is a good idea.
>> > We would still need to have a consistent distributed store. I hope 
>> > Salt has the provision to use one of our choice. It could be consul 
>> > or something that satisfies the criteria for choosing alternate technology.
>> > I would wait for a couple of days for the community to chew on this 
>> > and share their thoughts. If we have a consensus on this, we could 'port'
>> > the 'basic'[1] volume management commands to a system built using 
>> > Salt and see for real how it fits our use case. Thoughts?
>>
>>
>> I disagree. I think puppet + puppet-gluster would be a good idea :) 
>> One advantage is that the technology is already proven, and there's a 
>> working POC.
>> Feel free to prove me wrong, or to request any features that it's 
>> missing. ;)
>>
>
> I am glad you joined this discussion. I was expecting you to join 
> earlier :)
Assuming non-sarcasm, then thank you :)
I didn't join earlier, because 1) I'm not a hardcore algorithmist like most of 
you are, and 2) I'm busy a lot :P

>
> IIUC, puppet-gluster uses glusterd to perform glusterfs deployments. I 
> think it's important to consider puppet given its acceptance. What are 
> your thoughts on building 'glusterd' using puppet?
I think I can describe my proposal simply, and then give the reason why...

Proposal:
glusterd shouldn't go away or aim to greatly automate / do much more than it 
does today already (with a few exceptions).
puppet-gluster should be used as a higher layer abstraction to do the complex 
management. More features would still need to be added to address every use 
case and corner case, but I think we're a good deal of the way there. My work 
on automatically growing gluster volumes was demo-ed as a POC but never 
finished and pushed to git master.

I have no comment on language choices or rewrites of glusterd itself, since 
functionality wise it mostly "works for me".

Why?
The reasons this makes a lot of sense:
1) Higher level declarative languages can guarantee a lot of "safety"
in terms of avoiding incorrect operations. It's easy to get the config 
management graph to error out, which typically means there is a bug in the code 
to be fixed. In this scenario, no code actually runs! This means your data 
won't get accidentally hurt, or put into a partial state.
2) Lines of code to accomplish certain things in puppet might be an order of 
magnitude less than in a typical imperative language.
Statistically speaking, by keeping LOC down, the logic can be more concise, and 
have fewer bugs. This also lets us reason about things from a higher POV.
3) Understanding the logic in puppet can be easier than reading a pile of C or 
Go code. This is why you can look at a page of Python and understand it, but 
staring at three pages of assembly is useless.

In any case, I don't think it's likely that Gluster will end up using puppet, 
although I do hope people will think about this a bit more and at least 
consider it seriously. Since many people are not very familiar with 
configuration management, please don't be shy if you'd like to have a quick 
chat about it, and maybe a little demo to show you what's truly possible.

HTH,
James


>
> The proposal mail describes the functions glusterd performs today. 
> With that as a reference, could you elaborate on how we could use 
> puppet to perform some (or all) of the functions of glusterd?
>
> ~KP
___
Gluster-users mailing list
gluster-us...@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] Proposal for GlusterD-2.0

2014-09-13 Thread Prasad, Nirmal
Has anyone looked into whether LogCabin can provide the consistent small 
storage based on Raft for Gluster?

https://github.com/logcabin/logcabin

I have no experience with using it so I cannot say if it is good or suitable.

I do know the following project uses it, and it's just not as easy to set up 
as Gluster is; it also has Zookeeper support etc. 

https://ramcloud.atlassian.net/wiki/display/RAM/RAMCloud

-Original Message-
From: gluster-users-boun...@gluster.org 
[mailto:gluster-users-boun...@gluster.org] On Behalf Of James
Sent: Friday, September 12, 2014 4:17 AM
To: Krishnan Parthasarathi
Cc: Gluster Devel; gluster-us...@gluster.org; Balamurugan Arumugam
Subject: Re: [Gluster-users] [Gluster-devel] Proposal for GlusterD-2.0

On Fri, Sep 12, 2014 at 12:02 AM, Krishnan Parthasarathi  
wrote:
> - Original Message -
>> On Thu, Sep 11, 2014 at 4:55 AM, Krishnan Parthasarathi 
>>  wrote:
>> >
>> > I think using Salt as the orchestration framework is a good idea.
>> > We would still need to have a consistent distributed store. I hope 
>> > Salt has the provision to use one of our choice. It could be consul 
>> > or something that satisfies the criteria for choosing alternate technology.
>> > I would wait for a couple of days for the community to chew on this 
>> > and share their thoughts. If we have a consensus on this, we could 'port'
>> > the 'basic'[1] volume management commands to a system built using 
>> > Salt and see for real how it fits our use case. Thoughts?
>>
>>
>> I disagree. I think puppet + puppet-gluster would be a good idea :) 
>> One advantage is that the technology is already proven, and there's a 
>> working POC.
>> Feel free to prove me wrong, or to request any features that it's 
>> missing. ;)
>>
>
> I am glad you joined this discussion. I was expecting you to join 
> earlier :)
Assuming non-sarcasm, then thank you :)
I didn't join earlier, because 1) I'm not a hardcore algorithmist like most of 
you are, and 2) I'm busy a lot :P

>
> IIUC, puppet-gluster uses glusterd to perform glusterfs deployments. I 
> think it's important to consider puppet given its acceptance. What are 
> your thoughts on building 'glusterd' using puppet?
I think I can describe my proposal simply, and then give the reason why...

Proposal:
glusterd shouldn't go away or aim to greatly automate / do much more than it 
does today already (with a few exceptions).
puppet-gluster should be used as a higher layer abstraction to do the complex 
management. More features would still need to be added to address every use 
case and corner case, but I think we're a good deal of the way there. My work 
on automatically growing gluster volumes was demo-ed as a POC but never 
finished and pushed to git master.

I have no comment on language choices or rewrites of glusterd itself, since 
functionality wise it mostly "works for me".

Why?
The reasons this makes a lot of sense:
1) Higher level declarative languages can guarantee a lot of "safety"
in terms of avoiding incorrect operations. It's easy to get the config 
management graph to error out, which typically means there is a bug in the code 
to be fixed. In this scenario, no code actually runs! This means your data 
won't get accidentally hurt, or put into a partial state.
2) Lines of code to accomplish certain things in puppet might be an order of 
magnitude less than in a typical imperative language.
Statistically speaking, by keeping LOC down, the logic can be more concise, and 
have fewer bugs. This also lets us reason about things from a higher POV.
3) Understanding the logic in puppet can be easier than reading a pile of C or 
Go code. This is why you can look at a page of Python and understand it, but 
staring at three pages of assembly is useless.

In any case, I don't think it's likely that Gluster will end up using puppet, 
although I do hope people will think about this a bit more and at least 
consider it seriously. Since many people are not very familiar with 
configuration management, please don't be shy if you'd like to have a quick 
chat about it, and maybe a little demo to show you what's truly possible.

HTH,
James


>
> The proposal mail describes the functions glusterd performs today. 
> With that as a reference, could you elaborate on how we could use 
> puppet to perform some (or all) of the functions of glusterd?
>
> ~KP
___
Gluster-users mailing list
gluster-us...@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How does GD_SYNCOP work?

2014-09-13 Thread Emmanuel Dreyfus
Emmanuel Dreyfus  wrote:

> Here is the problem: once readdir() has reached the end of the
> directory, on Linux, telldir() will report the last entry's offset,
> while on NetBSD, it will report an invalid offset (it is in fact the
> offset of the next entry beyond the last one, which does not exist).

But that difference did not explain why NetBSD was looping. I discovered
why.

Between each index_fill_readdir() invocation, we have a closedir()/opendir()
invocation. Then index_fill_readdir() calls seekdir() with an offset
obtained from telldir() on the previously opened (and since closed) DIR *.
Offsets returned by telldir() are only valid for the lifetime of a single
DIR * [1]. This rule makes sense: if the directory content changed, we are
likely to return garbage.

Now, if the directory content did not change and we have read everything,
here is what happens:

On Linux, seekdir() works with the offset obtained from the previous DIR *
(it does not have to, according to the standards) and goes to the last entry.
The loop exits gracefully, returning EOF.

On NetBSD, seekdir() is given the offset from the previous DIR *, which
points beyond the last entry. It fails and has no effect. Subsequent
readdir_r() calls operate from the beginning of the directory, and we never
get EOF. Here is our infinite loop.
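
A tiny standalone probe (diagnostic only, not glusterfs code) demonstrates
the divergence described above, assuming the directory does not change
between the two opens:

    #include <dirent.h>
    #include <stdio.h>

    int
    main (int argc, char **argv)
    {
            const char *path = argc > 1 ? argv[1] : ".";
            DIR *dp = opendir (path);
            long loc;

            if (dp == NULL)
                    return 1;
            while (readdir (dp) != NULL)
                    ;                   /* read to EOF */
            loc = telldir (dp);         /* offset at (or beyond) last entry */
            closedir (dp);

            dp = opendir (path);        /* new stream: loc is now stale */
            if (dp == NULL)
                    return 1;
            seekdir (dp, loc);
            printf ("after stale seekdir: %s\n",
                    readdir (dp) == NULL ? "EOF (Linux-like)"
                                         : "restarted (NetBSD-like)");
            closedir (dp);
            return 0;
    }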

The correct fix is:

1) either to keep the directory open between index_fill_readdir()
invocations, but since that means preserving an open directory across
different syncops, I am not sure it is a good idea.

2) do not reuse the offset from the last attempt. That means if the buffer
gets filled, retry with a bigger buffer until the data fits. This is bad
performance-wise, but it seems the only safe way to me.
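
Here is a minimal sketch of that retry loop, assuming a hypothetical
read_all_names() helper that signals ENOSPC when the buffer is too small;
the real index xlator interface differs:

    #include <dirent.h>
    #include <errno.h>
    #include <stdlib.h>
    #include <string.h>

    /* One complete opendir()..closedir() pass per call, so no telldir()
     * offset ever crosses a DIR * lifetime.  Returns 0 on success,
     * ENOSPC if buf is too small, another errno on failure. */
    static int
    read_all_names (const char *path, char *buf, size_t size)
    {
            DIR *dp = opendir (path);
            struct dirent *de;
            size_t used = 0;

            if (dp == NULL)
                    return errno;
            while ((de = readdir (dp)) != NULL) {
                    size_t len = strlen (de->d_name) + 1;
                    if (used + len > size) {
                            closedir (dp);
                            return ENOSPC;  /* grow buffer and retry */
                    }
                    memcpy (buf + used, de->d_name, len);
                    used += len;
            }
            closedir (dp);
            return 0;
    }

    /* Caller: double the buffer until a single pass fits. */
    static char *
    read_dir_retrying (const char *path)
    {
            char *buf = NULL;
            size_t size;

            for (size = 131072; ; size *= 2) {
                    char *tmp = realloc (buf, size);
                    if (tmp == NULL) {
                            free (buf);
                            return NULL;
                    }
                    buf = tmp;
                    if (read_all_names (path, buf, size) != ENOSPC)
                            return buf;
            }
    }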

Opinions?


[1] http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel