Re: [Gluster-devel] Feature: FOP Statistics JSON Dumps

2015-09-22 Thread Ben England
Richard, what's great about your patch (besides lockless counters) is:

- JSON is easier to parse (particularly in Python) than the "gluster volume 
profile" output, which is much more difficult to work with.  This will enable 
tools to display profiling data in a user-friendly way.  It would be nice if you 
attached a sample output to bz 1261700.  

- client side capture - io-stats translator is at the top of the translator 
stack so we would see latencies just like the application sees them.  "gluster 
volume profile" provides server-side latencies but this can be deceptive and 
fails to report "user experience" latencies.

I'm not that clear on the UI for it.  It would be nice if the "gluster volume " 
command could be set up to automatically poll this data at a fixed rate like many 
other perf utilities (example: iostat), so that a user could capture a Gluster 
profile over time with a single command; at present the support team has to give 
them a script to do it.  This would make it trivial for a user to share what their 
application is doing from a Gluster perspective, as well as how Gluster is 
performing from the client's perspective.  The /usr/sbin/gluster utility can run 
on the client now since it is in the gluster-cli RPM, right?  
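
As a rough illustration only (the dump path and trigger below are my assumptions, 
not necessarily what the patch actually does), a polling tool could be as simple as:

for i in `seq 1 $sample_count` ; do
  sleep $sample_interval
  # pretty-print whatever JSON the io-stats translator dumped; path is hypothetical
  python -m json.tool < /var/run/gluster/fop-stats.json
done > gvp-json.log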

So in other words it would be great to replace this:

gluster volume profile $volume_name start
# discard the cumulative counters so the samples below cover only the sampling window
gluster volume profile $volume_name info > /tmp/past
for min in `seq 1 $sample_count` ; do
  sleep $sample_interval
  gluster volume profile $volume_name info
done > gvp.log
gluster volume profile $volume_name stop

With this:

gluster volume profile $volume_name $sample_interval $sample_count > gvp.log

And be able to run this command on the client to use your patch there.

thx

-ben

- Original Message -
> From: "Richard Wareing" 
> To: gluster-devel@gluster.org
> Sent: Wednesday, September 9, 2015 10:24:54 PM
> Subject: [Gluster-devel] Feature: FOP Statistics JSON Dumps
> 
> Hey all,
> 
> I just uploaded a clean patch for our FOP statistics dump feature @
> https://bugzilla.redhat.com/show_bug.cgi?id=1261700 .
> 
> Patches cleanly to v3.6.x/v3.7.x release branches, also includes io-stats
> support for intel arch atomic operations (ifdef'd for portability) such that
> you can collect data 24x7 with a negligible latency hit in the IO path.
> We've been using this for quite some time, and there appeared to be some
> interest at the dev summit to have this in mainline; so here it is.
> 
> Take a look, and I hope you find it useful.
> 
> Richard

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] High CPU Usage - Glusterfsd

2015-02-22 Thread Ben England
Renchu, 

I didn't see anything about average file size and read/write mix.  One example 
of how to observe both of these, as well as latency and throughput - on server 
run these commands:

# gluster volume profile your-volume start
# gluster volume profile your-volume info > /tmp/dontcare
# sleep 60
# gluster volume profile your-volume info > profile-for-last-minute.log

There is also a gluster volume top command that may be of use to you in 
understanding what your users are doing with Gluster.
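
For example, something like this (check "gluster volume help" for the exact option 
names on your version; list-cnt just limits how many entries are printed):

gluster volume top your-volume read list-cnt 10    # files with the most read calls
gluster volume top your-volume write list-cnt 10   # files with the most write calls
gluster volume top your-volume open list-cnt 10    # open call counts per brick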

Also you may want to run top -H and see whether any threads in either 
glusterfsd or smbd are at or near 100% CPU - if so, you really are hitting a 
CPU bottleneck.  Looking at process CPU utilization can be deceptive, since a 
process may include multiple threads.  sar -n DEV 2 will show you network 
utilization, and iostat -mdx /dev/sd? 2 on your server will show block device 
queue depth (latter two tools require sysstat rpm).  Together these can help 
you to understand what kind of bottleneck you are seeing.
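
If it helps, here is a rough sketch that captures all three views at once for about 
a minute (it assumes the sysstat package is installed; file names are arbitrary):

DUR=30   # number of 2-second samples
top -bH -n $DUR -d 2 > /tmp/top-threads.log &   # per-thread CPU usage
sar -n DEV 2 $DUR > /tmp/sar-net.log &          # network utilization
iostat -mdx 2 $DUR > /tmp/iostat.log &          # block device utilization/queue depth
wait
# hottest glusterfsd/smbd threads seen during the interval (%CPU is column 9)
grep -E 'glusterfsd|smbd' /tmp/top-threads.log | sort -k9 -rn | head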

I don't see how many bricks are in your Gluster volume but it sounds like you 
have only one glusterfsd/server.   If you have idle cores on your servers, you 
can harness more CPU power by using multiple bricks/server, which results in 
multiple glusterfsd processes on each server, allowing greater parallelism.
For example, you can do this by presenting individual disk drives as bricks 
rather than RAID volumes.
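
For example (hostnames and brick paths below are hypothetical), a replica-2 volume 
laid out with two bricks per server runs two glusterfsd processes on each server:

gluster volume create myvol replica 2 \
    server1:/bricks/disk1/brick server2:/bricks/disk1/brick \
    server1:/bricks/disk2/brick server2:/bricks/disk2/brick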

Let us know if these suggestions helped

-ben england

- Original Message -
 From: Renchu Mathew ren...@cracknell.com
 To: gluster-us...@gluster.org
 Cc: gluster-devel@gluster.org
 Sent: Sunday, February 22, 2015 7:09:09 AM
 Subject: [Gluster-devel] High CPU Usage - Glusterfsd
 
 
 
 Dear all,
 
 
 
 I have implemented glusterfs storage on my company – 2 servers with
 replicate. But glusterfsd shows more than 100% CPU utilization most of the
 time. So it is so slow to access the gluster volume. My setup is two
 glusterfs servers with replication. The gluster volume (almost 10TB of data)
 is mounted on another server (glusterfs native client) and using samba share
 for the network users to access those files. Is there any way to reduce the
 processor usage on these servers? Please give a solution ASAP since the
 users are complaining about the poor performance. I am using glusterfs
 version 3.6.
 
 
 
 Regards
 
 
 
 Renchu Mathew | Sr. IT Administrator
 
 CRACKNELL DUBAI | P.O. Box 66231 | United Arab Emirates | T +971 4 3445417 |
 F +971 4 3493675 | M +971 50 7386484
 
 ABU DHABI | DUBAI | LONDON | MUSCAT | DOHA | JEDDAH
 
 EMAIL ren...@cracknell.com | WEB www.cracknell.com
 
 
 
 
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-devel
 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Multi-network support proposal

2015-02-15 Thread Ben England
Hope this message makes as much sense to me on Tuesday as it did at 3 AM in the 
airport ;-) Inline...

- Original Message -
 From: Jeff Darcy jda...@redhat.com
 To: Ben England bengl...@redhat.com
 Cc: Gluster Devel gluster-devel@gluster.org, Manoj Pillai 
 mpil...@redhat.com
 Sent: Sunday, February 15, 2015 1:49:17 AM
 Subject: Re: [Gluster-devel] Multi-network support proposal
 
  It's really important for glusterfs not to require that the clients mount
  volumes using same subnet that is used by servers, and clearly your very
  general-purpose proposal could address that.  For example, in a site where
  non-glusterfs protocols are used, there are already good reasons for using
  multiple subnets, and we want glusterfs to be able to coexist with
  non-glusterfs protocols at a site.
  
  However, is there a simpler way to allow glusterfs clients to connect to
  servers through more than one subnet?  For example, suppose your Gluster
  volume subnet is 172.17.50.0/24 and your public network used by glusterfs
  clients is 1.2.3.0/22, but one of the servers also has an interface on
  subnet 4.5.6.0/24 .  So at the time that the volume is either created or
  bricks are added/removed:
  
  - determine what servers are actually in the volume
  - ask each server to return the subnet for each of its active network
  interfaces
  - determine set of subnets that are directly accessible to ALL the volume's
  servers
  - write a glusterfs volfile for each of these subnets and save it
  
  This process is O(N) where N is number of servers, but it only happens for
  volume creation or addition/removal of bricks, these events do not happen
  very often (do they?).  In the example, 1.2.3.0/22 and 172.17.50.0/24 would
  have glusterfs volfiles, but 4.5.6.0/24 would not.
  
  So now when a client connects, the server knows which subnet the request
  came
  through (getsockaddr), so it can just return the volfile for that subnet.
  If there is no volfile for that subnet, the client mount request is
  rejected.  But what about existing Gluster volumes?  When software is
  upgraded, we should provide a mechanism for triggering this volfile
  generation process to open up additional subnets for glusterfs clients.
  
  This proposal requires additional work to be done where volfiles are
  generated and where glusterfs mount processing is done, but does not
  require
  any additional configuration commands or extra user knowledge of Gluster.
  glusterfs clients can then use *any* subnet that is accessible to all the
  servers.
 
 That does have the advantage of not requiring any special configuration,
 and might work well enough for front-end traffic, but it has the
 drawback of not giving any control over back-end traffic.  How do
 *servers* choose which interfaces to use for NSR normal traffic,
 reconciliation/self-heal, DHT rebalance, and so on?  Which network
 should Ganesha/Samba servers use to communicate with bricks?  Even on
 the front end, what happens when we do get around to adding per-subnet
 access control or options?  For those kinds of use cases we need
 networks to be explicit parts of our model, not implicit or inferred.
 So maybe we need to reconcile the two approaches, and hope that the
 combined result isn't too complicated.  I'm open to suggestions.
 

In defense of your proposal, you are right that it is difficult to manage each 
node's network configuration independently or by volfile, and it would be 
useful to a system manager to be able to configure Gluster network behavior 
across the entire volume.  For example, you can use pdsh to issue commands to 
any subset of Gluster servers, but what if some of them are down at the time 
the command is issued?  How do you make these configuration changes persistent? 
 What happens when you add or remove servers from the volume?  That to me is 
the real selling point of your proposal - if we have a 60-node or even a 
1000-node Gluster volume, we could provide a way to control network behavior in 
a persistent, highly-available, scalable way with as few sysadmin operations as 
possible. 

I have two concerns:


1) Do we have to specify each host's address rewriting in your example - why 
not something like this?

# gluster network add client-net 1.2.3.0/24 

glusterd could then use a discovery process as I described earlier to determine 
for each server what its IP address is on that subnet and rewrite volfiles 
accordingly.

The advantage of this subnet-based specification IMHO is that it scales - as 
you add and remove nodes, you do not have to change the client-net entity; you 
just make sure that Gluster servers provide the appropriate network interface 
with appropriate IP address and subnet mask.
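
Here is a sketch of how that discovery step could be prototyped from the shell 
(illustration only - glusterd would do the equivalent internally; hostnames come 
from the trusted pool, and "ip addr show to <subnet>" prints only the addresses 
that fall inside that subnet):

SUBNET=1.2.3.0/24
for h in $(gluster pool list | awk 'NR>1 {print $2}'); do
    echo -n "$h: "
    ssh "$h" ip -o -4 addr show to "$SUBNET" | awk '{print $4}'
done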


2) Could we keep the number of roles and the sysadmin interface in general from 
getting too complicated?  Here's an oversimplified model of Gluster networking 
- there are at most 2 kinds of subnets on each server in use by Gluster or apps:

- replication

Re: [Gluster-devel] RDMA: Patch to make use of pre registered memory

2015-02-08 Thread Ben England
Avati, I'm all for your zero-copy RDMA API proposal, but I have a concern about 
your proposed zero-copy fop below...

- Original Message -
 From: Anand Avati av...@gluster.org
 To: Mohammed Rafi K C rkavu...@redhat.com, Gluster Devel 
 gluster-devel@gluster.org
 Cc: Raghavendra Gowdappa rgowd...@redhat.com, Ben Turner 
 btur...@redhat.com, Ben England
 bengl...@redhat.com, Suman Debnath sdebn...@redhat.com
 Sent: Saturday, January 24, 2015 1:15:52 AM
 Subject: Re: RDMA: Patch to make use of pre registered memory
 
 Couple of comments -
 
 ...
 4. Next step for zero-copy would be introduction of a new fop readto()
 where the destination pointer is passed from the caller (gfapi being the
 primary use case). In this situation RDMA ought to register that memory if
 necessary and request server to RDMA_WRITE into the pointer provided by
 gfapi caller.

The readto() API is emulating the Linux/Unix read() system call, where the 
caller passes in the address of the read buffer.  This API was created half a 
century ago in a non-distributed world.  IMHO The caller should not specify 
where the read data should arrive, instead it should let the read API specify 
where the data arrived.  There should be a pre-registered pool of buffers, that 
both the sender and receiver *already* knew about, that can be used for RDMA 
reads, and one of these will be passed to the caller as part of the read 
event or completion.  This seems related to performance results that Rafi KC 
had posted earlier this month.

Why does it matter?  With RDMA, the read transfer cannot begin until the OTHER 
END of the RDMA connection knows where the data will land, and it cannot know 
this soon enough if we wait until the read API call to specify what address to 
target.  An API where the caller specifies the buffer address *blocks* the 
sender, introduces latency (transmitting RDMA-able address to sender) and 
prevents pipelined, overlapping activity by sender and receiver. 

So a read FOP for RDMA should be more like read_completion_event(buffer ** 
read_data_delivered).   It is possible to change libgfapi to support this since 
it does not have to conform rigidly to POSIX.  Could this work in Gluster 
translator API?   RPC interface?

So then how would the remote sender find out when it was ok to re-use this 
buffer to service another RDMA read request?   Is there an interface, something 
like read_buffer_consumed(buffer * available_buf), on read API side that 
indicates to RDMA that the caller has consumed the buffer and it is ready for 
re-use, without the added expense of unregistering and re-registering?

If so, then you then have a pipeline of buffers in one of 4 states:

- in transmission by sender to reader
- being consumed by reader
- being returned to sender for re-use
- available to sender 
- go back to state 1

By increasing the number of buffers sufficiently, we can avoid a situation 
where round-trip latency prevents you from filling the gigantic 40-Gbps 
(56-Gbps for FDR IB) RDMA pipeline.

I'm also interested in how writes work - how do we avoid copies on the write 
path and also avoid having to re-register buffers with each write?

BTW None of these concerns, or the concerns discussed by Rafi KC, are addressed 
in the Gluster 3.6 RDMA feature page.

-ben (e)

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Reg. multi thread epoll NetBSD failures

2015-01-23 Thread Ben England
Gluster-ians,

would it be ok to temporarily disable multi-thread-epoll on NetBSD, unless 
there is some huge demand for it?  NetBSD may be useful for exposing race 
conditions, but it's not clear to me that all of these race conditions would 
happen in a non-NetBSD environment, so are we chasing problems that non-NetBSD 
users can never see?  what do people think?  If yes, why bust our heads 
figuring them out for NetBSD right now?  

Attached is a tiny, crude, and possibly out-of-date patch for making 
multi-thread-epoll tunable.  If we make the number of epoll threads settable, we 
could add conditional compilation to make GLUSTERFS_EPOLL_MAXTHREADS 1 for 
NetBSD without much trouble, while still allowing people to experiment with it 
on NetBSD.
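
With the patch applied, usage would look something like this (the environment 
variable name comes from the patch; in practice it would have to be set in the 
environment that glusterd/glusterfsd starts under, e.g. via the init script, and 
the volfile path below is just an example):

GLUSTERFS_EPOLL_THREADS=1 glusterfsd -f /path/to/brick.vol   # single poller, NetBSD-safe
GLUSTERFS_EPOLL_THREADS=8 glusterfsd -f /path/to/brick.vol   # more pollers where cores are idle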

From a performance perspective, let's review why we should go to the trouble 
of using multi-thread-epoll patch.  The original goal was to allow far greater 
CPU utilization by Gluster than we typically were seeing.  To do this, we want 
multiple Gluster RPC sockets to be read and processed in parallel by a single 
process.  This is important to clients (glusterfs, libgfapi) that have to talk 
to many bricks (example: JBOD, erasure coding), and to brick processes 
(glusterfsd) that have to talk to many clients.  It is also important for SSD 
support (cache tiering) because we need to be able to have the glusterfsd 
process keep up with SSD hardware and caches, which can have orders of 
magnitude more IOPS available than a single disk drive or even a RAID LUN, and 
glusterfsd epoll thread is currently the bottleneck in such configurations.  
This multi-thread-epoll enhancement seems similar to multi-queue ethernet 
driver, etc. that spreads load across CPU cores.  RDMA 40-Gbps networking may 
also encounter this bottleneck.  We don't want a small fraction of CPU cores 
(often just 1) to be a bottleneck - we want either network or storage hardware 
to be the bottleneck instead.

Finally, is it possible with multi-thread-epoll that we do not need to use the 
io-threads translator (Anand Avati's suggestion) that offloads incoming 
requests to worker threads?  In this case, the epoll threads ARE the 
server-side thread pool.  If so, this could reduce context switching and 
latency further.  I for one look forward to finding out but I do not want to 
invest in more performance testing than we have already done unless it is going 
to be accepted upstream.

thanks for your help,

-Ben England, Red Hat Perf. Engr.


- Original Message -
 From: Shyam srang...@redhat.com
 To: Emmanuel Dreyfus m...@netbsd.org
 Cc: Gluster Devel gluster-devel@gluster.org
 Sent: Friday, January 23, 2015 2:48:14 PM
 Subject: [Gluster-devel] Reg. multi thread epoll NetBSD failures
 
 Patch: http://review.gluster.org/#/c/3842/
 
 Manu,
 
 I was not able to find the NetBSD job mentioned in the last review
 comment provided by you, pointers to that would help.
 
 Additionally,
 
 What is the support status of epoll on NetBSD? I though NetBSD favored
 the kqueue means of event processing over epoll and that epoll was not
 supported on NetBSD (or *BSD).
 
 I ask this, as this patch specifically changes the number of epoll
 threads, as a result, it is possibly having a different affect on
 NetBSD, which should either be on poll or kqueue (to my understanding).
 
 Could you shed some light on this and on the current status of epoll on
 NetBSD.
 
 Thanks,
 Shyam
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-devel
 
--- event-epoll.c.nontunable	2014-09-05 16:27:10.261223176 -0400
+++ event-epoll.c	2014-09-05 16:33:19.818407183 -0400
@@ -612,23 +612,35 @@
 }
 
 
-#define GLUSTERFS_EPOLL_MAXTHREADS 2
+#define GLUSTERFS_EPOLL_MAX_THREADS 8
+#define GLUSTERFS_EPOLL_DEFAULT_THREADS 4
 
+int glusterfs_epoll_threads = -1;
 
 static int
 event_dispatch_epoll (struct event_pool *event_pool)
 {
 	int   i = 0;
-	pthread_t pollers[GLUSTERFS_EPOLL_MAXTHREADS];
+	pthread_t pollers[GLUSTERFS_EPOLL_MAX_THREADS];
 	int   ret = -1;
+	char *epoll_thrd_str = getenv("GLUSTERFS_EPOLL_THREADS");
 
-	for (i = 0; i < GLUSTERFS_EPOLL_MAXTHREADS; i++) {
+	glusterfs_epoll_threads =
+		epoll_thrd_str ? atoi(epoll_thrd_str) : GLUSTERFS_EPOLL_DEFAULT_THREADS;
+
+	if (glusterfs_epoll_threads > GLUSTERFS_EPOLL_MAX_THREADS) {
+		gf_log ("epoll", GF_LOG_ERROR,
+			"user requested %d threads but limit is %d",
+			glusterfs_epoll_threads, GLUSTERFS_EPOLL_MAX_THREADS);
+		return EINVAL;
+	}
+	for (i = 0; i < glusterfs_epoll_threads; i++) {
 		ret = pthread_create (&pollers[i], NULL,
 				      event_dispatch_epoll_worker,
 				      event_pool);
 	}
 
-	for (i = 0; i < GLUSTERFS_EPOLL_MAXTHREADS; i++)
+	for (i = 0; i < glusterfs_epoll_threads; i++)
 		pthread_join (pollers[i], NULL);
 
 	return ret;

Re: [Gluster-devel] RDMA: Patch to make use of pre registered memory

2015-01-23 Thread Ben England
Rafi, great results, thanks.   Your io-cache off columns are read tests with 
the io-cache translator disabled, correct?  What jumps out at me from your 
numbers are two things:

- io-cache translator destroys RDMA read performance. 
- approach 2i) register iobuf pool is the best approach.
-- on reads with io-cache off, 32% better than baseline and 21% better than 1) 
separate buffer 
-- on writes, 22% better than baseline and 14% better than 1)

Can someone explain to me why the typical Gluster site wants to use the 
io-cache translator, given that FUSE now caches file data?  Should we just have 
it turned off by default at this point?  This would buy us time to change the 
io-cache implementation to be compatible with RDMA (see option 2ii below).
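
For anyone who wants to try that on a test volume today, io-cache can already be 
disabled per volume (the volume name below is an example):

gluster volume set your-volume performance.io-cache off
gluster volume info your-volume | grep io-cache    # confirm the reconfigured option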

remaining comments inline

-ben

- Original Message -
 From: Mohammed Rafi K C rkavu...@redhat.com
 To: gluster-devel@gluster.org
 Cc: Raghavendra Gowdappa rgowd...@redhat.com, Anand Avati 
 av...@gluster.org, Ben Turner
 btur...@redhat.com, Ben England bengl...@redhat.com, Suman Debnath 
 sdebn...@redhat.com
 Sent: Friday, January 23, 2015 7:43:45 AM
 Subject: RDMA: Patch to make use of pre registered memory
 
 Hi All,
 
 As I pointed out earlier, for rdma protocol, we need to register memory
 which is used during rdma read and write with rdma device. In fact it is a
 costly operation. To avoid the registration of memory in i/o path, we
 came up with two solutions.
 
 1) To use a separate per-registered iobuf_pool for rdma. The approach
 needs an extra level copying in rdma for each read/write request. ie, we
 need to copy the content of memory given by application to buffers of
 rdma in the rdma code.
 

copying data defeats the whole point of RDMA, which is to *avoid* copying data. 
  

 2) Register default iobuf_pool in glusterfs_ctx with rdma device during
 the rdma
 initialize. Since we are registering buffers from the default pool for
 read/write, we don't require either registration or copying. 

This makes far more sense to me.

 But the
 problem comes when io-cache translator is turned-on; then for each page
 fault, io-cache will take a ref on the io-buf of the response buffer to
 cache it, due to this all the pre-allocated buffer will get locked with
 io-cache very soon.
 Eventually all new requests would get iobufs from new iobuf_pools which
 are not
 registered with rdma and we will have to do registration for every iobuf.
 To address this issue, we can:
 
  i)  Turn-off io-cache
 (we chose this for testing)
 ii)  Use separate buffer for io-cache, and offload from
 default pool to io-cache buffer.
 (New thread to offload)


I think this makes sense, because if you get an io-cache translator cache hit, 
then you don't need to go out to the network, so io-cache memory doesn't have 
to be registered with RDMA.

 iii) Dynamically register each newly created arena with rdma,
  for this need to bring libglusterfs code and transport
 layer code together.
  (Will need changes in packaging and may bring hard
 dependencies of rdma libs)
iv) Increase the default pool size.
 (Will increase the footprint of glusterfs process)
 

registration with RDMA only makes sense to me when data is going to be 
sent/received over the RDMA network.  Is it hard to tell in advance which 
buffers will need to be transmitted?

 We implemented two approaches,  (1) and (2i) to get some
 performance numbers. The setup was 4*2 distributed-replicated volume
 using ram disks as bricks to avoid hard disk bottleneck. And the numbers
 are attached with the mail.
 
 
 Please provide the your thoughts on these approaches.
 
 Regards
 Rafi KC
 
 
 
            (1) Separate buffer for rdma      No change                       (2i) Register default iobuf pool
            write   read    io-cache off      write   read    io-cache off    write   read    io-cache off
1           373     527     656               343     483     532             446     512     696
2           380     528     668               347     485     540             426     525     715
3           376     527     594               346     482     540             422     526     720
4           381     533     597               348     484     540             413     526     710
5           372     527     479               347     482     538             422     519     719
            Note: (varying result)
Average     376.4   528.4   598.8             346.2   483.2   538             425.8   521.6   712

command read:   echo 3 > /proc/sys/vm/drop_caches; dd if=/home/ram0/mount0/foo.txt of=/dev/null bs=1024K count=1000;
command write:  echo 3 > /proc/sys/vm/drop_caches; dd of=/home/ram0/mount0/foo.txt if=/dev/zero bs=1024K count=1000 conv=sync;


vol info:
Volume Name: xcube
Type: Distributed-Replicate
Volume ID: 84cbc80f

Re: [Gluster-devel] Order of server-side xlators

2015-01-13 Thread Ben England
Since we're on the subject of minimizing STAT calls, there was talk in the 
small-file perf meeting about moving the md-cache translator into the server 
just before the POSIX translator so that all the stat and getxattr calls, etc. 
could be intercepted.  This would be consistent with not needing to put 
access-control down near POSIX translator.  Also we had discussed introducing 
negative caching (we could call llistxattr() first) to the md-cache translator 
so that we would not constantly ask brick fs for non-existent xattrs.


- Original Message -
 From: Xavier Hernandez xhernan...@datalab.es
 To: Anand Avati av...@gluster.org
 Cc: Gluster Devel gluster-devel@gluster.org
 Sent: Tuesday, January 13, 2015 4:18:15 AM
 Subject: Re: [Gluster-devel] Order of server-side xlators
 
 On 01/13/2015 05:45 AM, Anand Avati wrote:
  Valid questions. access-control had to be as close to posix as possible
  in its first implementation (to minimize the cost of the STAT calls
  originated by it), but since the introduction of posix-acl there are no
  extra STAT calls, and given the later introduction of quota, it
  certainly makes sense to have access-control/posix-acl closer to
  protocol/server. Some general constraints to consider while deciding the
  order:
 
  - keep io-stats as close to protocol/server as possible
  - keep io-threads as close to storage/posix as possible
  - any xlator which performs direct filesystem operations (with system
  calls, not STACK_WIND) are better placed between io-threads and posix to
  keep epoll thread nonblocking  (e.g changelog)
 
 
 Based on these constraints and the requirements of each xlator, what do
 you think about this order:
 
  posix
  changelog  (needs FS access)
  index  (needs FS access)
  marker (needs FS access)
  io-threads
  barrier(just above io-threads as per documentation (*))
  quota
  access-control
  locks
  io-stats
  server
 
 (*) I'm not sure of the requirements/dependencies of barrier xlator.
 
 Do you think this order makes sense and it would be better ?
 
 Xavi
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-devel
 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Suggestion needed to make use of iobuf_pool as rdma buffer.

2015-01-13 Thread Ben England
Rafi,

it totally makes sense to me that you need to pre-allocate i/o buffers that 
will be used by RDMA, and you don't want to constantly change (i.e. allocate 
and deallocate) these buffers.  Since a remote RDMA controller can be reading 
and writing to them, we have to be very careful about deallocating in 
particular.  So an arena of pre-registered RDMA buffers makes perfect sense.

Am I understanding you correctly that io-cache translator is soaking up all the 
RDMA-related buffers?   How important is io-cache translator to Gluster 
performance at this point?  Given that FUSE caching is now enabled, it seems to 
me that io-cache translator would accomplish very little.  Should we have it 
disabled by default?  If so, would that solve your problem?

So how do read-ahead translator and write-behind translator interact with RDMA 
buffering?

-ben

- Original Message -
 From: Mohammed Rafi K C rkavu...@redhat.com
 To: gluster-devel@gluster.org
 Sent: Tuesday, January 13, 2015 9:29:56 AM
 Subject: [Gluster-devel] Suggestion needed to make use of iobuf_pool as rdma  
 buffer.
 
 Hi All,
 
 When using RDMA protocol, we need to register the buffer which is going
 to send through rdma with rdma device. In fact, it is a costly
 operation, and a performance killer if it happened in I/O path. So our
 current plan is to register pre-allocated iobuf_arenas from  iobuf_pool
 with rdma when rdma is getting initialized. The problem comes when all
 the iobufs are exhausted, then we need to dynamically allocate new
 arenas from libglusterfs module. Since it is created in libglusterfs, we
 can't make a call to rdma from libglusterfs. So we will force to
 register each of the iobufs from the newly created arenas with rdma in
 I/O path. If io-cache is turned on in client stack, then all the
 pre-registred arenas will use by io-cache as cache buffer. so we have to
 do the registration in rdma for each i/o call for every iobufs,
 eventually we cannot make use of pre registered arenas.
 
 To address the issue, we have two approaches in mind,
 
  1) Register each dynamically created buffers in iobuf by bringing
 transport layer together with libglusterfs.
 
  2) create a separate buffer for caching and offload the data from the
 read response to the cache buffer in background.
 
 If we could make use of preregister memory for every rdma call, then we
 will have approximately 20% increment for write and 25% of increment for
 read.
 
 Please give your thoughts to address the issue.
 
 Thanks  Regards
 Rafi KC
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-devel
 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] small-file performance feature sub-pages

2014-12-03 Thread Ben England
I've expanded specification of two proposals for improving small-file 
performance as new feature pages, referenced at the bottom of this list (not in 
priority order I hope?).  Could we possibly review these proposals at a 
gluster.org meeting this month?

http://www.gluster.org/community/documentation/index.php/Features#Proposed_Features.2FIdeas

new feature pages under this page are:

http://www.gluster.org/community/documentation/index.php/Features/stat-xattr-cache
 - proposed enhancement to POSIX translator for small-file performance
http://www.gluster.org/community/documentation/index.php/Features/composite-operations
 - changes to reduce round trips for small-file performance

Specifically, the stat-xattr-cache proposal does not require Gluster 4.0 - it 
could be implemented today.  These pages are referenced by Features/Planning40 
page and also by the Features/Feature_Smallfile_Perf page.

comments and feedback are appreciated.  There have been other related proposals 
from Rudra Siva concerning round-trip reduction in 
http://supercolony.gluster.org/pipermail/gluster-devel/2014-November/042741.html
 .

-ben
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How to resolve gfid (and .glusterfs symlink) for a deleted file

2014-11-21 Thread Ben England
Nux,

Those thousands of entries would all match "-links 2" but not "-links 1".  The 
only entry in .glusterfs that would match is the one for the file you deleted 
from the brick.  That's how hardlinks work - when you create a regular file, 
the link count becomes 1 (since the directory entry now references the inode), 
and when you create an additional hard link to the same file, the link count 
becomes 2.  Try this with the "stat your-file" command, look at the link count, 
and watch how it changes.  The find command that I gave you just tracks down 
the one hardlink that you want and nothing else.
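
Here is the kind of demonstration I mean (the brick path is just an example; any 
local filesystem will do):

touch /bricks/b1/demo
stat -c '%h' /bricks/b1/demo                 # prints 1: one directory entry references the inode
ln /bricks/b1/demo /bricks/b1/demo.hardlink
stat -c '%h' /bricks/b1/demo                 # prints 2: the .glusterfs entry works the same way
rm /bricks/b1/demo.hardlink                  # back to 1; the data survives until the last link is gone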

-ben

- Original Message -
 From: Nux! n...@li.nux.ro
 To: Ben England bengl...@redhat.com
 Cc: Gluster Devel gluster-devel@gluster.org
 Sent: Friday, November 21, 2014 11:03:46 AM
 Subject: Re: [Gluster-devel] How to resolve gfid (and .glusterfs symlink) for 
 a   deleted file
 
 Hi Ben,
 
 I have thousands of entries under /your/brick/directory/.glusterfs .. find
 would return too many results.
 How do I find the one I'm looking for? :-)
 
 --
 Sent from the Delta quadrant using Borg technology!
 
 Nux!
 www.nux.ro
 
 - Original Message -
  From: Ben England bengl...@redhat.com
  To: Nux! n...@li.nux.ro
  Cc: Gluster Devel gluster-devel@gluster.org
  Sent: Friday, 21 November, 2014 16:00:40
  Subject: Re: [Gluster-devel] How to resolve gfid (and .glusterfs symlink)
  for a   deleted file
 
  first of all, links in .glusterfs are HARD links not symlinks.   So the
  file is
  not actually deleted, since the local filesystem keeps a count of
  references to
  the inode and won't release the inode until the ref count reaches zero.   I
  tried this, it turns out you can find it with
  
  # find /your/brick/directory/.glusterfs -links 1 -type f
  
  You use type f because it's a hard link to a file, and you don't want to
  look
  at directories or . or .. .  Once you find the link, you can copy the
  file
  off somewhere, and then delete the link.  At that point, regular self-heal
  could repair it (i.e. just do ls on the file from a Gluster mountpoint).
  
  - Original Message -
  From: Nux! n...@li.nux.ro
  To: Gluster Devel gluster-devel@gluster.org
  Sent: Friday, November 21, 2014 10:34:09 AM
  Subject: [Gluster-devel] How to resolve gfid (and .glusterfs symlink) for
  a
 deleted file
  
  Hi,
  
  I deleted a file by mistake in a brick. I never managed to find out its
  gfid
  so now I have a rogue symlink in .glusterfs pointing to it (if I got how
  it
  works).
  Any way I can discover which is this file and get rid of it?
  
  --
  Sent from the Delta quadrant using Borg technology!
  
  Nux!
  www.nux.ro
  ___
  Gluster-devel mailing list
  Gluster-devel@gluster.org
  http://supercolony.gluster.org/mailman/listinfo/gluster-devel
 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Feature help

2014-11-04 Thread Ben England
inline...

- Original Message -
 From: Rudra Siva rudrasiv...@gmail.com
 To: gluster-devel@gluster.org
 Sent: Saturday, November 1, 2014 10:20:41 AM
 Subject: [Gluster-devel] Feature help
 
 Hi,
 
 I'm very interested in helping with this feature by way of development
 help, testing and or benchmarking.
 

I have a parallel-libgfapi benchmark that could be modified to fit the new API, 
and could test performance of it.

https://github.com/bengland2/parallel-libgfapi

 Features/Feature Smallfile Perf
 
 One of the things I was looking into was possibility of adding a few
 API calls to libgfapi to help allow reading and writing multiple small
 files as objects - just as librados does for ceph - cutting out FUSE
 and other semantics that tend to be overheads for really small files.
 I don't know what else I will have to add for libgfapi to support
 this.
 

libgfapi is a good place to prototype - it's easy to extend libgfapi by adding 
to the existing calls - but this won't help performance as much as you might 
want unless the Gluster protocol can somehow be changed to combine several 
separate FOPs into one round trip (such as LOOKUP, OPEN, READ and RELEASE for 
reads, or LOOKUP, CREATE, WRITE and RELEASE for creates).  That's the hard part 
IMHO.  I suggest using wireshark to watch Gluster small-file creates, and then 
trying to understand what each FOP is doing and why it is there.  

suggestions for protocol enhancement:

Can we allow CREATE to piggyback write data if it's under 128 KB or whatever 
the RPC size limit is, and optionally do a RELEASE after the WRITE?  Or just 
create a new FOP that does that?  Can we also specify xattrs that the 
application might want to set at create time?  For example, SMB 
security-related xattrs, or Swift metadata.

Can we do something like we did for sequential writes with eager-lock, and 
allow the Gluster client to hang on to the directory lock for a little while so 
that we don't have to continually reacquire it if we are going to keep creating 
files in that directory?

Second, if we already have a write lock on the directory, we shouldn't have to 
do LOOKUP then CREATE, just do CREATE directly.

Finally, Swift and other apps use the hack of a rename() call after close() so 
that they can create a file atomically; if we had an API for creating files 
atomically, these apps would not be forced into using the expensive rename 
operation.

Can we do these things in an incremental way so that we can steadily improve 
performance over time without massive disruption to code base?

Perhaps the Glusterfs FUSE mount could learn to do something like that as well, 
with a special mount option that would allow the actual create at the server to 
be deferred until any one of these 3 conditions occurred:

- 100 msec had passed, or 
- the file was closed, or
- at least N KB of data was written (i.e. an RPC's worth)

This is a bit like Nagle's algorithm in TCP, which allows TCP to aggregate more 
data into segments before it actually transmits them.  It technically violates 
POSIX and creates some semantic issues (how do you tell the user that the file 
already exists, for example?), but frankly the fs interface in POSIX is an 
anachronism; we need to bend it a little to get what we need, as NFS already 
does.  This might not be appropriate for all apps, but there might be quite a 
few cases, like initial data ingest, where this would be a very reasonable 
thing to do.


 The following is what I was thinking - please feel free to correct me
 or guide me if someone has already done some ground work on this.
 
 For read, multiple objects can be provided and they should be
 separated for read from appropriate brick based on the DHT flag - this
 will help avoid multiple lookups from all servers. In the absence of
 DHT they would be sent to all but only the ones that contain the
 object respond (it's more like a multiple file lookup request).
 

I think it is very ambitious to batch creates for multiple files, and this 
greatly complicates the API.   Let's just get to a point where we can create a 
Gluster file and write the data for it in the same libgfapi call and have that 
work efficiently in the Gluster RPC interface -- this would be a huge win.  

 For write, same as the case of read, complete object writes (no
 partial updates, file offsets etc.)
 
 For delete, most of the lookup and batching logic remains the same.


Delete is not the highest priority thing here.  Creates are the worst 
performers, so we probably should focus on creates.  Someday it would be nice 
to be able to express the thought "delete this directory tree" or "delete all 
files within this directory" to the file system, since Gluster could then make 
that a parallel operation, hence scalable.

 I can help with testing, documentation or benchmarks if someone has
 already done some work.
 
 -Siva
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://supercolony.gluster.org/mailman/listinfo/gluster-devel
 

Re: [Gluster-devel] Fwd: [ovirt-devel] Using libgfapi on Gluster Storage Domains

2014-10-27 Thread Ben England
The year-old BZ 1016886 describes three problems that any libgfapi application 
would have, not just KVM.  

As for the rpc-auth-allow-insecure=on setting, has any progress been made in 
this area?   I think this setting should be unnecessary, and it's really 
important in general that Gluster have some way of optionally authenticating 
clients other than by client port number.  Do SSL sockets solve this problem of 
authenticating in the control plane?

Two of these problems can be fixed by just altering the virt group -- the 
settings that you get with the gluster command

- gluster volume set your-volume group virt

And then just document that when you configure a volume for KVM virtualization, 
use the above command, right?  This is why we have the volume group feature, 
right?

In /var/lib/glusterd/groups/virt file, it has this:

[root@g60ds-1 groups]# more virt
quick-read=off
read-ahead=off
io-cache=off
stat-prefetch=off
eager-lock=enable
remote-dio=enable
quorum-type=auto
server-quorum-type=server

Just add 1 line to it:

allow-insecure=on

And change stat-prefetch to on, and you'll have eliminated 2 of the 3 things 
that every KVM user has to do to Gluster.  eager-lock is the default now so 
that can be removed.
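
For example, re-applying the edited group file to an existing volume is just 
(the volume name is an example):

gluster volume set your-volume group virt
gluster volume info your-volume    # the expanded options should show up under "Options Reconfigured"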

-ben


- Original Message -
 From: Vijay Bellur vbel...@redhat.com
 To: Gluster Devel gluster-devel@gluster.org
 Sent: Monday, October 27, 2014 5:38:24 AM
 Subject: [Gluster-devel] Fwd: [ovirt-devel] Using libgfapi on Gluster Storage 
 Domains
 
 FYI - if you are interested in trying out libgfapi support with oVirt.
 
 -Vijay
 
 -- Forwarded message --
 From: Federico Simoncelli  fsimo...@redhat.com 
 Date: Fri, Oct 24, 2014 at 12:06 AM
 Subject: [ovirt-devel] Using libgfapi on Gluster Storage Domains
 To: oVirt Development  de...@ovirt.org 
 
 
 Hi everyone, if you want to try and use the libgfapi support included
 in qemu when accessing volumes on gluster storage domains you can try
 to apply this patch:
 
 http://gerrit.ovirt.org/33768
 
 As far as I know Jason Brooks already tried it and he reported a
 positive feedback.
 
 What has been tested so far is:
 
 - qemu uses libgfapi to access the disks on gluster storage domains
 
 - hotplug of disks on gluster storage domains works as expected (libgfapi)
 
 - hotunplug works as expected (no failure when removing a disk that is
 using libgfapi)
 
 - live snpashots work as expected
 
 - disks of vms started before this patch are not affected (they won't
 use libgfapi since there's no way to do an hot swap)
 
 One major flow that is yet untested is live storage migration.
 
 Remember that you may need to do some special configuration on your
 gluster volumes (most notably the allow-insecure ports option) as
 described here:
 
 http://www.ovirt.org/Features/GlusterFS_Storage_Domain
 
 Please try and test the patch if you're interested and report your
 feedback.
 
 Thanks,
 --
 Federico
 ___
 Devel mailing list
 de...@ovirt.org
 http://lists.ovirt.org/mailman/listinfo/devel
 
 
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://supercolony.gluster.org/mailman/listinfo/gluster-devel
 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] gluster write million of lines: WRITE => -1 (Transport endpoint is not connected)

2014-10-27 Thread Ben England
Sergio, 

I agree, excessive logging is a performance issue and can potentially fill a 
system disk partition or LVM volume over a long enough period of time, 
resulting in other errors.  See bz 1156624 for another example that I 
encountered.  Does it happen in glusterfs-3.6?  Is there a logging 
option/interface that would rate-limit a particular logging call per unit of 
time to N messages/sec, and when that limit is exceeded a message is logged 
saying that M more events of that type were seen in the last second?
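
Until something like that exists, one stopgap is to cap the client logs with 
logrotate - a sketch (the file name and size threshold are arbitrary):

cat > /etc/logrotate.d/glusterfs-client-cap <<'EOF'
/var/log/glusterfs/*.log {
    size 100M
    rotate 4
    compress
    copytruncate
    missingok
}
EOF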

-Ben England

- Original Message -
 From: Sergio Traldi sergio.tra...@pd.infn.it
 To: gluster-us...@gluster.org, gluster-devel@gluster.org
 Sent: Monday, October 27, 2014 9:51:37 AM
 Subject: [Gluster-devel] gluster write million of lines: WRITE => -1 
 (Transport endpoint is not connected)
 
 Hi all,
 One server Redhat 6 with this rpms set:
 
 [ ~]# rpm -qa | grep gluster | sort
 glusterfs-3.5.2-1.el6.x86_64
 glusterfs-api-3.5.2-1.el6.x86_64
 glusterfs-cli-3.5.2-1.el6.x86_64
 glusterfs-fuse-3.5.2-1.el6.x86_64
 glusterfs-geo-replication-3.5.2-1.el6.x86_64
 glusterfs-libs-3.5.2-1.el6.x86_64
 glusterfs-server-3.5.2-1.el6.x86_64
 
 I have a gluster volume with 1 server and 1 brick:
 
 [ ~]# gluster volume info volume-nova-pp
 Volume Name: volume-nova-pp
 Type: Distribute
 Volume ID: b5ec289b-9a54-4df1-9c21-52ca556aeead
 Status: Started
 Number of Bricks: 1
 Transport-type: tcp
 Bricks:
 Brick1: 192.168.61.100:/brick-nova-pp/mpathc
 Options Reconfigured:
 storage.owner-gid: 162
 storage.owner-uid: 162
 
 There are four clients attached to this volume with same O.S. and same
 fuse gluster rpms set:
 [ ~]# rpm -qa | grep gluster | sort
 glusterfs-3.5.0-2.el6.x86_64
 glusterfs-api-3.5.0-2.el6.x86_64
 glusterfs-fuse-3.5.0-2.el6.x86_64
 glusterfs-libs-3.5.0-2.el6.x86_6
 
 Last week, but it happens also two weeks ago, I found the disk almost
 full and I found the gluster logs
 /var/log/glusterfs/var-lib-nova-instances.log of 68GB:
 In the log there was the starting problem:
 
 [2014-10-10 07:29:43.730792] W [socket.c:522:__socket_rwv] 0-glusterfs:
 readv on 192.168.61.100:24007 failed (No data available)
 [2014-10-10 07:29:54.022608] E [socket.c:2161:socket_connect_finish]
 0-glusterfs: connection to 192.168.61.100:24007 failed (Connection refused)
 [2014-10-10 07:30:05.271825] W
 [client-rpc-fops.c:866:client3_3_writev_cbk] 0-volume-nova-pp-client-0:
 remote operation failed: Input/output error
 [2014-10-10 07:30:08.783145] W [fuse-bridge.c:2201:fuse_writev_cbk]
 0-glusterfs-fuse: 3661260: WRITE => -1 (Input/output error)
 [2014-10-10 07:30:08.783368] W [fuse-bridge.c:2201:fuse_writev_cbk]
 0-glusterfs-fuse: 3661262: WRITE => -1 (Input/output error)
 [2014-10-10 07:30:08.806553] W [fuse-bridge.c:2201:fuse_writev_cbk]
 0-glusterfs-fuse: 3661649: WRITE => -1 (Input/output error)
 [2014-10-10 07:30:08.844415] W [fuse-bridge.c:2201:fuse_writev_cbk]
 0-glusterfs-fuse: 3662235: WRITE => -1 (Input/output error)
 
 and a lot of these lines:
 
 [2014-10-15 14:41:15.895105] W [fuse-bridge.c:2201:fuse_writev_cbk]
 0-glusterfs-fuse: 951700230: WRITE => -1 (Transport endpoint is not
 connected)
 [2014-10-15 14:41:15.896205] W [fuse-bridge.c:2201:fuse_writev_cbk]
 0-glusterfs-fuse: 951700232: WRITE => -1 (Transport endpoint is not
 connected)
 
 This second line log with different sector number has been written
 every millisecond so in about 1 minute we have 1GB write in O.S. disk.
 
 I search for a solution but I didn't find nobody having the same problem.
 
 I think there was a network problem  but why does gluster write in logs
 million of:
 [2014-10-15 14:41:15.895105] W [fuse-bridge.c:2201:fuse_writev_cbk]
 0-glusterfs-fuse: 951700230: WRITE => -1 (Transport endpoint is not
 connected) ?
 
 Thanks in advance.
 Cheers
 Sergio
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://supercolony.gluster.org/mailman/listinfo/gluster-devel
 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Jeff Darcy's objections to multi-thread-epoll and proposal to use own-thread alternative

2014-10-14 Thread Ben England
This e-mail is specifically about use of multi-thread-epoll optimization 
(originally prototyped by Anand Avati) to solve a Gluster performance problem: 
single-threaded reception of protocol messages (for non-SSL sockets), and 
consequent inability to fully utilize available CPU on server.  A discussion of 
its pros and cons follows, along with the alternative to it suggested by Jeff 
Darcy, referred to as own-thread below.  Thanks to Shyam Ranganathan for 
helping me to clarify my thoughts on this.  Attached is some performance data 
about multi-thread-epoll.

To see why this threading discussion matters, consider that storage hardware 
encountered in the enterprise server world is rapidly speeding up with new 
hardware such as 40-Gbps networks and SSDs, but CPUs are not speeding up nearly 
as much.  Instead, we have more cores per socket.  So adequate performance for 
Gluster will require use of sufficient threads to match CPU throughput to 
network and storage.  

One way to get the server's idle CPU horsepower engaged is JBOD (just a bunch 
of disks, no RAID) - since there is one glusterfsd, hence 1 epoll thread per 
brick (disk).   This causes scalability problems for small-file creates 
(cluster.lookup-unhashed=on is default), and it limits throughput of an 
individual file to the speed of the disk drive, so until these problems are 
addressed, the utility of JBOD approach is limited.

- Original Message -
 From: Jeff Darcy jda...@redhat.com
 To: Gluster Devel gluster-devel@gluster.org
 Sent: Wednesday, October 8, 2014 4:20:34 PM
 Subject: [Gluster-devel] jdarcy status (October 2014)
 
 Multi-threading is even more controversial.  It has also been in the
 tree for two years (it was developed to address the problem of SSL code
 slowing down our entire transport stack).  This feature, controlled by
 the own-thread transport option, uses a thread per connection - not my
 favorite concurrency model, but kind of necessary to deal with the
 OpenSSL API.  More recently, a *completely separate* approach to
 multi-threading - multi-threaded epoll - has been getting some
 attention.  Here's what I see as the pros and cons of this new approach.
 
  * PRO: greater parallelism of requests on a single connection.  I think
the actual performance benefits vs. own-thread are unproven and
likely to be small, but they're real.


We should try comparing performance of multi-thread-epoll to own-thread; it 
shouldn't be hard to hack own-thread into the non-SSL-socket case.  

HOWEVER, if own-thread implies a thread per network connection, as you scale 
out a Gluster volume with N bricks, you have O(N) clients, and therefore you 
have O(N) threads on each glusterfsd (libgfapi adoption would make it far 
worse)!  Suppose we are implementing a 64-brick configuration with 200 clients, 
not an unreasonably sized Gluster volume for a scalable filesystem.   We then 
have 200 threads per Glusterfsd just listening for RPC messages on each brick.  
On a 60-drive server there can be a lot more than 1 brick per server, so 
multiply threads/glusterfsd by brick count!  It doesn't make sense to have 
total threads >= CPUs, and modern processors make context switching between 
threads more and more expensive.  
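
As a quick sanity check of how many threads a brick process already runs, something 
like this works (sketch; it just picks the first glusterfsd it finds):

ps -o nlwp= -p $(pgrep glusterfsd | head -1)    # total thread count of one glusterfsd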

Shyam mentioned a refinement to own-thread where we equally partition the set 
of TCP connections among a pool of threads (own-thread is a special case of 
this).  This cannot supply an individual client with more than 1 thread to 
receive RPCs, even when most of CPU cores on the server are idle.  Why impose 
this constraint (see below)?  To see why this is important, consider a common 
use case: KVM virtualization.  

SSDs require orders of magnitude more IOPS from glusterfsd and glusterfs than a 
traditional rotating disk.  So even if you dedicate a thread to a single 
network connection, this thread may still have trouble keeping up with the 
high-speed network and the SSD.  Multi-thread-epoll is the only proposal so far 
that offers a way to apply enough CPU to this problem.  Consider that some SSDs 
have throughput on the order of a million IOPS (I/O operations per second).  In 
the past, we have worked around this problem by placing multiple bricks on a 
single SSD, but this causes other problems (scalability, free space 
measurement).


  * CON: with greater concurrency comes greater potential to uncover race
conditions in other modules used to being single-threaded.  We've
already seen this somewhat with own-thread, and we'd see it more with
multi-epoll.
 

On the Gluster server side, because of the io-threads translator, an RPC 
listener thread is effectively just starting a worker thread and then going 
back to read another RPC.  With own-thread, although RPC requests are received 
in order, there is no guarantee that the requests will be processed in the 
order that they were received from the network.   On the client side, we have 
operations such as readdir that will fan out parallel FOPS.  If 

[Gluster-devel] update-link-count-parent POSIX xlator parameter

2014-08-20 Thread Ben England
What is the update-link-count-parent POSIX xlator parameter for?  Is it ever 
set by anyone, and why?  It appears to be off by default.  I didn't see this 
documented, but the log message in posix.c says: "update-link-count-parent is 
enabled. Thus for each file an extended attribute representing the number of 
hardlinks for that file within the same parent directory is set."  Why would 
this be necessary?  

background: I'm trying to see where various xattr calls per file read are 
coming from.
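
For anyone doing the same thing, here is a sketch of one way to count them 
(attaches to one brick daemon for 30 seconds and prints per-syscall totals):

BRICK_PID=$(pgrep glusterfsd | head -1)
timeout 30 strace -c -f -e trace=lstat,lgetxattr,llistxattr,lsetxattr,lremovexattr -p "$BRICK_PID"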


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel