[Gluster-devel] glusterfs replica volume self heal lots of small file very very slow!how to improve? why slow?
Hi all,

I ran the following test. I created a GlusterFS replica volume (replica count 2) with two server nodes (server A and server B), using XFS as the underlying filesystem, and mounted the volume on a client node. I then shut down the network on server A and, on the client, copied in a directory containing a large number of small files (2.9 GB in total). When the copy finished, I unmounted the volume from the client and brought server A's network back up. The self-heal daemon then started healing the directory from server B to server A, and in the end it took 40 minutes. That is far too slow. Why?

I found the following self-heal-related options:

cluster.self-heal-window-size
cluster.self-heal-readdir-size
cluster.background-self-heal-count

and configured:

cluster.self-heal-window-size = 1024 (max value)
cluster.self-heal-readdir-size = 131072 (max value)

Then I ran the same test case again; this time the heal took 35 minutes, so the effect is not significant. Is there a better way to improve replica-volume self-heal performance for large numbers of small files?

thanks!
justgluste...@gmail.com
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] how do you debug ref leaks?
On 09/18/2014 10:08 AM, Raghavendra Gowdappa wrote:
> For eg., if a dictionary is not freed because of non-zero refcount,
> information on who has held these references would help to narrow down
> the code path or component.

Yes, that is the aim. The implementation I suggested tries to get that information per xlator. Are you saying it is better to store it in the dict/inode/fd itself? I actually wrote a patch some time back to do something similar (not tested yet :-) ):

diff --git a/libglusterfs/src/dict.c b/libglusterfs/src/dict.c
index 37a6b2c..bd4c438 100644
--- a/libglusterfs/src/dict.c
+++ b/libglusterfs/src/dict.c
@@ -100,8 +100,10 @@ dict_new (void)
 
         dict = get_new_dict_full (1);
 
-        if (dict)
+        if (dict) {
+                dict->refs = get_new_dict_full (1);
                 dict_ref (dict);
+        }
 
         return dict;
 }
@@ -446,6 +448,7 @@ dict_destroy (dict_t *this)
         data_pair_t *pair = this->members_list;
         data_pair_t *prev = this->members_list;
 
+        dict_destroy (this->refs);
         LOCK_DESTROY (&this->lock);
 
         while (prev) {
@@ -495,15 +498,21 @@ dict_unref (dict_t *this)
 dict_t *
 dict_ref (dict_t *this)
 {
+        int32_t   ref = 0;
+        xlator_t *x = NULL;
         if (!this) {
                 gf_log_callingfn ("dict", GF_LOG_WARNING, "dict is NULL");
                 return NULL;
         }
 
+        x = THIS;
         LOCK (&this->lock);
 
         this->refcount++;
+        dict_get_int32 (this->refs, x->name, &ref);
+        dict_set_int32 (this->refs, x->name, ref + 1);
 
         UNLOCK (&this->lock);
 
         return this;
@@ -513,15 +522,20 @@ void
 data_unref (data_t *this)
 {
         int32_t   ref;
+        xlator_t *x = NULL;
 
         if (!this) {
                 gf_log_callingfn ("dict", GF_LOG_WARNING, "dict is NULL");
                 return;
         }
 
+        x = THIS;
         LOCK (&this->lock);
 
         this->refcount--;
+        dict_get_int32 (this->refs, x->name, &ref);
+        dict_set_int32 (this->refs, x->name, ref - 1);
 
         ref = this->refcount;
 
         UNLOCK (&this->lock);
diff --git a/libglusterfs/src/dict.h b/libglusterfs/src/dict.h
index 682c152..33ed7bd 100644
--- a/libglusterfs/src/dict.h
+++ b/libglusterfs/src/dict.h
@@ -93,6 +93,7 @@ struct _dict {
         data_pair_t  *members_internal;
         data_pair_t   free_pair;
         gf_boolean_t  free_pair_in_use;
+        struct _dict *refs;
 };

I was not happy with this implementation either. There is a similar implementation for inode here: http://review.gluster.com/8302

But I am not happy with any of these implementations, probably because they are still not granular enough, i.e. fop info is missing. It is better than no info, but still bad. We can't have it in statedump either. So when a user reports high memory usage, we still have to spend a lot of time looking over all the places where the dicts are allocated, which is bad :-(.

Pranith

- Original Message -
From: "Pranith Kumar Karampuri"
To: "Raghavendra Gowdappa"
Cc: "Gluster Devel"
Sent: Thursday, September 18, 2014 10:05:18 AM
Subject: Re: [Gluster-devel] how do you debug ref leaks?

On 09/18/2014 09:59 AM, Raghavendra Gowdappa wrote:
> One thing that would be helpful is "allocator" info for generic objects
> like dict, inode, fd etc. That way we wouldn't have to sift through a
> large amount of code.

Could you elaborate on the idea, please?

Pranith

- Original Message -
From: "Pranith Kumar Karampuri"
To: "Gluster Devel"
Sent: Thursday, September 18, 2014 7:43:00 AM
Subject: [Gluster-devel] how do you debug ref leaks?

hi,
Till now the only method I have used to find ref leaks effectively is to find what operation is causing the ref leaks and then read the code to see if there is a ref leak somewhere. Valgrind doesn't solve this problem because the memory is still reachable from the inode table etc. I am just wondering if there is an effective way anyone else knows of. Do you think we need a better mechanism for finding ref leaks? At least one which narrows the search space significantly, i.e. to xlator y, fop f, etc.? It would be better if we could come up with a way to integrate statedump and this infrastructure, just like we did for mem-accounting.

One way I thought of is to introduce new APIs called xl_fop_dict/inode/fd_ref/unref (). Each xlator keeps an array of num_fops counters per inode/dict/fd and increments/decrements them accordingly. Dump this info on statedump.

I myself am not completely sure about this idea. It requires all xlators to change.

Any ideas?

Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] how do you debug ref leaks?
- Original Message -
> From: "Raghavendra Gowdappa"
> To: "Pranith Kumar Karampuri"
> Cc: "Gluster Devel"
> Sent: Thursday, September 18, 2014 10:08:15 AM
> Subject: Re: [Gluster-devel] how do you debug ref leaks?
>
> For eg., if a dictionary is not freed because of non-zero refcount,
> information on who has held these references would help to narrow down
> the code path or component.

This solution might be rudimentary. However, someone who has worked on things like garbage collection can probably give better answers. This discussion also reminds me of Greenspun's tenth rule [1].

[1] http://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule

> - Original Message -
> > From: "Pranith Kumar Karampuri"
> > To: "Raghavendra Gowdappa"
> > Cc: "Gluster Devel"
> > Sent: Thursday, September 18, 2014 10:05:18 AM
> > Subject: Re: [Gluster-devel] how do you debug ref leaks?
> >
> > On 09/18/2014 09:59 AM, Raghavendra Gowdappa wrote:
> > > One thing that would be helpful is "allocator" info for generic objects
> > > like dict, inode, fd etc. That way we wouldn't have to sift through
> > > a large amount of code.
> > Could you elaborate on the idea, please?
> >
> > Pranith
> > > - Original Message -
> > >> From: "Pranith Kumar Karampuri"
> > >> To: "Gluster Devel"
> > >> Sent: Thursday, September 18, 2014 7:43:00 AM
> > >> Subject: [Gluster-devel] how do you debug ref leaks?
> > >>
> > >> hi,
> > >> Till now the only method I used to find ref leaks effectively is to
> > >> find what operation is causing ref leaks and read the code to find if
> > >> there is a ref-leak somewhere. Valgrind doesn't solve this problem
> > >> because it is reachable memory from inode-table etc. I am just wondering
> > >> if there is an effective way anyone else knows of. Do you guys think we
> > >> need a better mechanism of finding refleaks? At least which decreases
> > >> the search space significantly i.e. xlator y, fop f etc? It would be
> > >> better if we can come up with ways to integrate statedump and this infra
> > >> just like we did for mem-accounting.
> > >>
> > >> One way I thought was to introduce new apis called
> > >> xl_fop_dict/inode/fd_ref/unref (). Each xl keeps an array of num_fops
> > >> per inode/dict/fd and increments/decrements accordingly. Dump this info
> > >> on statedump.
> > >>
> > >> I myself am not completely sure about this idea. It requires all xlators
> > >> to change.
> > >>
> > >> Any ideas?
> > >>
> > >> Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] how do you debug ref leaks?
For eg., if a dictionary is not freed because of non-zero refcount, information on who has held these references would help to narrow down the code path or component.

- Original Message -
> From: "Pranith Kumar Karampuri"
> To: "Raghavendra Gowdappa"
> Cc: "Gluster Devel"
> Sent: Thursday, September 18, 2014 10:05:18 AM
> Subject: Re: [Gluster-devel] how do you debug ref leaks?
>
> On 09/18/2014 09:59 AM, Raghavendra Gowdappa wrote:
> > One thing that would be helpful is "allocator" info for generic objects
> > like dict, inode, fd etc. That way we wouldn't have to sift through
> > a large amount of code.
> Could you elaborate on the idea, please?
>
> Pranith
> > - Original Message -
> >> From: "Pranith Kumar Karampuri"
> >> To: "Gluster Devel"
> >> Sent: Thursday, September 18, 2014 7:43:00 AM
> >> Subject: [Gluster-devel] how do you debug ref leaks?
> >>
> >> hi,
> >> Till now the only method I used to find ref leaks effectively is to
> >> find what operation is causing ref leaks and read the code to find if
> >> there is a ref-leak somewhere. Valgrind doesn't solve this problem
> >> because it is reachable memory from inode-table etc. I am just wondering
> >> if there is an effective way anyone else knows of. Do you guys think we
> >> need a better mechanism of finding refleaks? At least which decreases
> >> the search space significantly i.e. xlator y, fop f etc? It would be
> >> better if we can come up with ways to integrate statedump and this infra
> >> just like we did for mem-accounting.
> >>
> >> One way I thought was to introduce new apis called
> >> xl_fop_dict/inode/fd_ref/unref (). Each xl keeps an array of num_fops
> >> per inode/dict/fd and increments/decrements accordingly. Dump this info
> >> on statedump.
> >>
> >> I myself am not completely sure about this idea. It requires all xlators
> >> to change.
> >>
> >> Any ideas?
> >>
> >> Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] how do you debug ref leaks?
On 09/18/2014 09:59 AM, Raghavendra Gowdappa wrote:
> One thing that would be helpful is "allocator" info for generic objects
> like dict, inode, fd etc. That way we wouldn't have to sift through
> a large amount of code.

Could you elaborate on the idea, please?

Pranith

- Original Message -
From: "Pranith Kumar Karampuri"
To: "Gluster Devel"
Sent: Thursday, September 18, 2014 7:43:00 AM
Subject: [Gluster-devel] how do you debug ref leaks?

hi,
Till now the only method I used to find ref leaks effectively is to find what operation is causing ref leaks and read the code to find if there is a ref-leak somewhere. Valgrind doesn't solve this problem because it is reachable memory from inode-table etc. I am just wondering if there is an effective way anyone else knows of. Do you guys think we need a better mechanism of finding refleaks? At least which decreases the search space significantly i.e. xlator y, fop f etc? It would be better if we can come up with ways to integrate statedump and this infra just like we did for mem-accounting.

One way I thought was to introduce new apis called xl_fop_dict/inode/fd_ref/unref (). Each xl keeps an array of num_fops per inode/dict/fd and increments/decrements accordingly. Dump this info on statedump.

I myself am not completely sure about this idea. It requires all xlators to change.

Any ideas?

Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] how do you debug ref leaks?
One thing that would be helpful is "allocator" info for generic objects like dict, inode, fd etc. That way we wouldn't have to sift through a large amount of code.

- Original Message -
> From: "Pranith Kumar Karampuri"
> To: "Gluster Devel"
> Sent: Thursday, September 18, 2014 7:43:00 AM
> Subject: [Gluster-devel] how do you debug ref leaks?
>
> hi,
> Till now the only method I used to find ref leaks effectively is to
> find what operation is causing ref leaks and read the code to find if
> there is a ref-leak somewhere. Valgrind doesn't solve this problem
> because it is reachable memory from inode-table etc. I am just wondering
> if there is an effective way anyone else knows of. Do you guys think we
> need a better mechanism of finding refleaks? At least which decreases
> the search space significantly i.e. xlator y, fop f etc? It would be
> better if we can come up with ways to integrate statedump and this infra
> just like we did for mem-accounting.
>
> One way I thought was to introduce new apis called
> xl_fop_dict/inode/fd_ref/unref (). Each xl keeps an array of num_fops
> per inode/dict/fd and increments/decrements accordingly. Dump this info
> on statedump.
>
> I myself am not completely sure about this idea. It requires all xlators
> to change.
>
> Any ideas?
>
> Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] how do you debug ref leaks?
hi,
Till now the only method I have used to find ref leaks effectively is to find what operation is causing the ref leaks and then read the code to see if there is a ref leak somewhere. Valgrind doesn't solve this problem because the memory is still reachable from the inode table etc. I am just wondering if there is an effective way anyone else knows of. Do you think we need a better mechanism for finding ref leaks? At least one which narrows the search space significantly, i.e. to xlator y, fop f, etc.? It would be better if we could come up with a way to integrate statedump and this infrastructure, just like we did for mem-accounting.

One way I thought of is to introduce new APIs called xl_fop_dict/inode/fd_ref/unref (). Each xlator keeps an array of num_fops counters per inode/dict/fd and increments/decrements them accordingly. Dump this info on statedump.

I myself am not completely sure about this idea. It requires all xlators to change.

Any ideas?

Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Transparent encryption in GlusterFS: Implications on manageability
Hi all,

Unfortunately it is impossible to validate non-trusted volfiles using existing glusterfs options. The semantics and format of the values passed via --xlator-option don't allow "delivering" trusted values without compromising security. So I have added a new option, --secure-xlator-option. Please review: review.gluster.org/8657

Thanks,
Edward.

On Wed, 13 Aug 2014 12:26:29 -0700 Anand Avati wrote:
> +1 for all the points.
>
> On Wed, Aug 13, 2014 at 11:22 AM, Jeff Darcy wrote:
> > > I.1 Generating the master volume key
> > >
> > > The master volume key should be generated by the user on a trusted
> > > machine. Recommendations on master key generation are provided in
> > > section 6.2 of the manpages [1]. Generating the master volume key
> > > is the user's responsibility.
> >
> > That was fine for an initial implementation, but it's still the
> > single largest obstacle to adoption of this feature. Looking
> > forward, we need to provide full CLI support for generating keys in
> > the necessary format, specifying their location, etc.
> >
> > > I.2 Location of the master volume key when mounting a volume
> > >
> > > At mount time the crypt translator searches for a master volume
> > > key on the client machine at the location specified by the
> > > respective translator option. If there is no key at the
> > > specified location, or the key at the specified location is in an
> > > improper format, the mount will fail. Otherwise, the crypt
> > > translator loads the key into its private memory data structures.
> > >
> > > The location of the master volume key can be specified at volume
> > > creation time (see option "master-key", section 6.7 of the man
> > > pages [1]). However, this option can be overridden by the user at
> > > mount time to specify another location; see section 7 of the
> > > manpages [1], steps 6, 7, 8.
> >
> > Again, we need to improve on this. We should support this as a
> > volume or mount option in its own right, not rely on the generic
> > --xlator-option mechanism. Adding options to mount.glusterfs isn't
> > hard. Alternatively, we could make this look like a volume option
> > settable once through the CLI, even though the path is stored
> > locally on the client. Or we could provide a separate
> > special-purpose command/script, which again only needs to be run
> > once. It would even be acceptable to treat the path to the key
> > file (not its contents!) as a true volume option, stored on the
> > servers. Any of these would be better than requiring the user to
> > understand our volfile format and construction so that they can add
> > the necessary option by hand.
> >
> > > II. Check the graph of translators on your client machine after
> > > mount!
> > >
> > > During mount, your client machine receives configuration info from
> > > the non-trusted server. In particular, this info contains the
> > > graph of translators, which can be subjected to tampering so
> > > that encryption won't be invoked for your volume at all. So it is
> > > highly important to verify this graph. After a successful mount,
> > > make sure that the graph of translators contains the crypt
> > > translator with the proper options (see FAQ#1, section 11 of the
> > > manpages [1]).
> >
> > It is important to verify the graph, but not by poking through log
> > files, and not without more information about what to look for. Say
> > we got a volfile that includes the crypt translator, with some
> > options. The *code* should ensure that the master-key option has
> > the value from the command line or local config, and not some
> > other. If we have to add special support for this in
> > otherwise-generic graph initialization code, that's fine.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Reminder: Weekly GlusterFS Community Meeting in 45 minutes
On 17/09/2014, at 12:14 PM, Justin Clift wrote:
> Reminder!!!
>
> The weekly Gluster Community meeting is in 45 minutes, in
> #gluster-meeting on IRC.
>
> This is a completely public meeting, everyone is encouraged
> to attend and be a part of it. :)

Short meeting today. ;)

Meeting Minutes:
http://meetbot.fedoraproject.org/gluster-meeting/2014-09-17/gluster-meeting.2014-09-17-12.02.html

Full logs:
http://meetbot.fedoraproject.org/gluster-meeting/2014-09-17/gluster-meeting.2014-09-17-12.02.log.html

Regards and best wishes,

Justin Clift

--
GlusterFS - http://www.gluster.org
An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.
My personal twitter: twitter.com/realjustinclift
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Reminder: Weekly GlusterFS Community Meeting in 45 minutes
Reminder!!!

The weekly Gluster Community meeting is in 45 minutes, in #gluster-meeting on IRC.

This is a completely public meeting, everyone is encouraged to attend and be a part of it. :)

To add Agenda items, add new items under the "Other items to discuss" point on the Etherpad:

https://public.pad.fsfe.org/p/gluster-community-meetings

And be at the meeting to explain what they're about. :)

Regards and best wishes,

Justin Clift

--
GlusterFS - http://www.gluster.org
An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.
My personal twitter: twitter.com/realjustinclift
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Any review is appreciated. Reason about gluster server_connection_cleanup uncleanly, file flocks leaks in frequently network disconnection
Hi all,

After several days of tracking, we finally pinpointed why glusterfs fails to cleanly detach file locks under frequent network disconnection. We are now working on a patch to submit. Here are the details of the issue; any suggestions will be appreciated!

First of all, as I mentioned in
http://supercolony.gluster.org/pipermail/gluster-devel/2014-September/042233.html
this issue happens under frequent network disconnection.

According to the sources, the server cleanup work is done in server_connection_cleanup(). When RPCSVC_EVENT_DISCONNECT happens, we come here:

int
server_rpc_notify ()
{
        ...
        case RPCSVC_EVENT_DISCONNECT:
                ...
                if (!conf->lk_heal) {
                        server_conn_ref (conn);
                        server_connection_put (this, conn, &detached);
                        if (detached)
                                server_connection_cleanup (this, conn,
                                                           INTERNAL_LOCKS |
                                                           POSIX_LOCKS);
                        server_conn_unref (conn);
                ...
}

server_connection_cleanup() is only called when the variable 'detached' is true, and 'detached' is set by server_connection_put():

server_connection_t *
server_connection_put (xlator_t *this, server_connection_t *conn,
                       gf_boolean_t *detached)
{
        server_conf_t *conf = NULL;
        gf_boolean_t   unref = _gf_false;

        if (detached)
                *detached = _gf_false;
        conf = this->private;
        pthread_mutex_lock (&conf->mutex);
        {
                conn->bind_ref--;
                if (!conn->bind_ref) {
                        list_del_init (&conn->list);
                        unref = _gf_true;
                }
        }
        pthread_mutex_unlock (&conf->mutex);
        if (unref) {
                gf_log (this->name, GF_LOG_INFO,
                        "Shutting down connection %s", conn->id);
                if (detached)
                        *detached = _gf_true;
                server_conn_unref (conn);
                conn = NULL;
        }
        return conn;
}

'detached' is only set to _gf_true when 'conn->bind_ref' drops to 0. 'conn->bind_ref' is set to 1, or incremented, in server_connection_get():

server_connection_t *
server_connection_get (xlator_t *this, const char *id)
{
        ...
        list_for_each_entry (trav, &conf->conns, list) {
                if (!strcmp (trav->id, id)) {
                        conn = trav;
                        conn->bind_ref++;
                        goto unlock;
                }
        }
        ...
}

When the connection id is the same, 'conn->bind_ref' is incremented. Therefore, the problem should be a mismatched reference increment/decrement. We added some logs to verify our guess.

// 1st connection comes in; there is no id 'host-000c29e93d20-8661-2014/09/13-11:02:26:995090-vs_vol_rep2-client-2-0' in the connection table, so 'conn->bind_ref' is set to 1.

[2014-09-17 04:42:28.950693] D [server-helpers.c:712:server_connection_get] 0-vs_vol_rep2-server: server connection id: host-000c29e93d20-8661-2014/09/13-11:02:26:995090-vs_vol_rep2-client-2-0, conn->bind_ref:1, found:0
[2014-09-17 04:42:28.950717] D [server-handshake.c:430:server_setvolume] 0-vs_vol_rep2-server: Connected to host-000c29e93d20-8661-2014/09/13-11:02:26:995090-vs_vol_rep2-client-2-0
[2014-09-17 04:42:28.950758] I [server-handshake.c:567:server_setvolume] 0-vs_vol_rep2-server: accepted client from host-000c29e93d20-8661-2014/09/13-11:02:26:995090-vs_vol_rep2-client-2-0 (version: 3.4.5) (peer: host-000c29e93d20:1015)

...
// Keep running for several minutes...
...

// Network disconnected here. The client-side TCP socket is disconnected by timeout, but the server-side socket stays connected. AT THIS MOMENT the network is restored, and the client reconnects with a new TCP connection JUST BEFORE the last server-side socket is reset. Note that at this point there are 2 valid sockets on the server side. The new connection looks up the same conn id 'host-000c29e93d20-8661-2014/09/13-11:02:26:995090-vs_vol_rep2-client-2-0' in the connection table and increments 'conn->bind_ref' to 2.

[2014-09-17 04:46:16.135066] D [server-helpers.c:712:server_connection_get] 0-vs_vol_rep2-server: server connection id: host-000c29e93d20-8661-2014/09/13-11:02:26:995090-vs_vol_rep2-client-2-0, conn->bind_ref:2, found:1  // HERE IT IS, ref increased to 2!!!
[2014-09-17 04:46:16.135113] D [server-handshake.c:430:server_setvolume] 0-vs_vol_rep2-server: Connected to host-000c29e93d20-8661-2014/09/13-11:02:26:995090-vs_vol_rep2-client-2-0
[2014-09-17 04:46:16.135157] I [server-handshake.c:567:server_setvolume] 0-vs_vol_rep2-server: accepted client from host-000c29e93d20-8661-2014/09/13-11:02:26:995090-vs_vol_rep2-client-2-0 (version: 3.4.5) (peer: host-00