Re: [Gluster-users] Slow performance - 4 hosts, 10 gigabit ethernet, Gluster 3.2.3
On Friday 09 September 2011 10:30 AM, Thomas Jackson wrote:

Hello Thomas,

Try the following:

1. In the fuse volume file, try:
Under write-behind: option cache-size 16MB
Under read-ahead: option page-count 16
Under io-cache: option cache-size 64MB
2. Did you get 9 Gbit/s with iperf with a single thread or with multiple threads?
3. Can you give me the output of: sysctl -a | egrep 'rmem|wmem'
4. If it is not a problem for you, can you please create a pure distribute setup (instead of distributed-replicate) and then report the numbers?
5. What is the inode size with which you formatted your XFS filesystem? This last point might not be related to your throughput problem, but if you are planning to use this setup for a large number of files, you might be better off using an inode size of 512 bytes instead of the default 256. To do that, your mkfs command should be: mkfs -t xfs -i size=512 /dev/<disk device>

Pavan

Hi everyone,

I am seeing slower-than-expected performance in Gluster 3.2.3 between 4 hosts with 10 gigabit Ethernet between them all. Each host has 4x 300GB SAS 15K drives in RAID10 and a 6-core Xeon E5645 @ 2.40GHz with 24GB RAM, running Ubuntu 10.04 64-bit (I have also tested with Scientific Linux 6.1 and Debian Squeeze - same results on those as well). All of the hosts mount the volume using the FUSE module. The base filesystem on all of the nodes is XFS, however tests with ext4 have yielded similar results.
Command used to create the volume:
gluster volume create cluster-volume replica 2 transport tcp node01:/mnt/local-store/ node02:/mnt/local-store/ node03:/mnt/local-store/ node04:/mnt/local-store/

Command used to mount the Gluster volume on each node:
mount -t glusterfs localhost:/cluster-volume /mnt/cluster-volume

Creating a 40GB file on a node's local storage (i.e. no Gluster involvement):
dd if=/dev/zero of=/mnt/local-store/test.file bs=1M count=40000
41943040000 bytes (42 GB) copied, 92.9264 s, 451 MB/s

Reading the same file back from the node's local storage:
dd if=/mnt/local-store/test.file of=/dev/null
41943040000 bytes (42 GB) copied, 81.858 s, 512 MB/s

Writing the 40GB file onto the Gluster storage:
dd if=/dev/zero of=/mnt/cluster-volume/test.file bs=1M count=40000
41943040000 bytes (42 GB) copied, 226.934 s, 185 MB/s

Reading the same file back from the Gluster storage:
dd if=/mnt/cluster-volume/test.file of=/dev/null
41943040000 bytes (42 GB) copied, 661.561 s, 63.4 MB/s

I have also tried using Gluster 3.1, with similar results. According to the Gluster docs, I should be seeing roughly the lesser of the drive speed and the network speed. The network can push 0.9 GB/s according to iperf, so that definitely isn't a limiting factor here, and each array can do 400-500 MB/s as per the above benchmarks. I've tried with and without jumbo frames as well, which doesn't make any major difference. The glusterfs process is using 120% CPU according to top, and glusterfsd is sitting at about 90%. Any ideas / tips on where to start for speeding this config up?

Thanks, Thomas

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
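As an aside on question 3 above: the rmem/wmem query is usually aimed at TCP socket buffer sizing, where kernel defaults are often too small for 10GbE. A hypothetical starting point, to be run as root - these exact values are illustrative, not taken from this thread:

```shell
# Illustrative socket-buffer limits for a 10GbE link (values are a common
# starting point, not a recommendation from this thread). Check the current
# settings first with: sysctl -a | egrep 'rmem|wmem'
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
```

To persist across reboots, the same keys would go in /etc/sysctl.conf.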
Re: [Gluster-users] write-behind / write-back caching (asked again - nobody can help?)
On Tuesday 30 August 2011 08:36 PM, Christian wrote:

Hello to all,

I'm currently testing glusterfs (versions 3.1.4, 3.1.6, 3.2.2, 3.2.3 and 3.3beta) for the following situation / behavior: I want to create replicated storage via the Internet / WAN with two storage nodes. The first node is located in office A and the other one is in office B. If I try to write a file to the mounted glusterfs (mounted via glusterfs or nfs), the write performance is as poor as the upload speed (~1 Mbit/s - adjusted manually using tc). I tested several cache options (see below) with the following effect: the copy of a file starts very fast (~40 MByte/s), but the application (rsync, mc copy, cp) then waits at 100% for the final sync of the storage. The process is not finished before glusterfs has written the file to the 2nd node.

With a replicate config, this is what you can expect. The increased write-behind cache is holding your file, giving you the boosted throughput, but on close, it will have to sync the data to both nodes.

The behavior I am looking for is to store files locally first and then sync the content to the second node in the background. Is there a way to do this?

I think you are better off using geo-replication rather than the traditional replicate configuration for this requirement of yours. The following link should help you configure geo-rep - http://www.gluster.com/community/documentation/index.php/Gluster_3.2_Filesystem_Administration_Guide Look for the geo-replication section there. It also gives you a comparison of replicated volumes vs geo-replication.
HTH, Pavan

** volume info:
Volume Name: gl5
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 192.168.42.130:/gl5
Brick2: 192.168.42.7:/gl5
Options Reconfigured:
nfs.disable: off
nfs.trusted-sync: on
nfs.trusted-write: on
performance.flush-behind: off
performance.write-behind-window-size: 200MB
performance.cache-max-file-size: 200MB

** tested mount options:
mount.nfs 127.0.0.1:gl5 /mnt/gluster/ -v -o mountproto=tcp -o async
mount -t glusterfs 127.0.0.1:gl5 /mnt/gluster -o async

Thanks a lot, Christian

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
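For completeness, a geo-replication session in Gluster 3.2 is driven from the CLI roughly as below. This is only a sketch: the slave host and path are hypothetical, and the administration guide linked above is the authoritative reference for prerequisites (passwordless SSH, matching versions, etc.):

```shell
# Hypothetical: asynchronously replicate master volume gl5 to a slave
# directory on a host in office B (slave URL is illustrative)
gluster volume geo-replication gl5 ssh://root@office-b:/data/gl5-slave start
gluster volume geo-replication gl5 ssh://root@office-b:/data/gl5-slave status
```

Unlike replicate, the application's close() returns as soon as the local write completes; the slave catches up in the background.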
Re: [Gluster-users] One Question about DHT
On Wednesday 17 August 2011 09:18 AM, Daniel wrote:

Hello Pavan,

I came across one question about DHT lookup. When dht_lookup processes a fresh lookup, if the looked-up target cannot be found by hash, why does it assume it is a directory and look it up on all the child nodes?

Not sure why you thought I should be the one to address this - there are more knowledgeable engineers on this user group :) I looked at the code a bit to answer your question, and here is what I understand: if it is a fresh lookup and the file hash computed for this entity did not fall into any of the pre-computed hashed ranges, a lookup_everywhere is triggered to go to the backend and see if it exists there. If it is not there either, this brings in a mechanism called directory self-heal. The debug message does say "see if it is a directory", but if you look at dht_lookup_dir_cbk, a check is also made to see if it was *not* a directory. It may only be the debug messages that led you into thinking that a directory in particular is being looked up. That is not really the case.

Pavan

Thanks, Dan

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] Gluster on an ARM system
On Friday 12 August 2011 08:48 AM, Emmanuel Dreyfus wrote:

John Mark Walker jwal...@gluster.com wrote: I've CC'd the gluster-devel list in the hopes that someone there can help you out. However, my understanding is that it will take some significant porting to get GlusterFS to run in any production capacity on ARM.

What ARM-specific problems have been identified?

The biggest issue, IMO, will be that of endianness. GlusterFS has been run only on the Intel/AMD architecture, AFAIK; I have not heard of any SPARC installations. That means the code has been tested only on a little-endian architecture. The worst problems come in when there is interaction between entities of different endianness. However, there is another side to this. From what I know, ARM is actually a bi-endian processor. If the ARM cores have the system control co-processor, the endianness of the ARM processor can be controlled by software. So, if we make ARM work as a little-endian processor, it should work well even in a mixed environment. But then, ARM is a 32-bit processor, and I am unsure/ignorant of the stability of 32-bit GlusterFS. If we can solve the two major issues mentioned above, viz. endianness and stability of GlusterFS on 32-bit, we should theoretically be able to get GlusterFS working on ARM without any other major work. Again, I cannot vouch for the above statement - just my thoughts from what I know.

Pavan

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
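To illustrate the endianness point above: a host's byte order can be checked from the shell in one line. This is a generic sketch, not tied to GlusterFS:

```shell
# Interpret the two bytes 0x01 0x00 as a single 16-bit integer:
# a little-endian host (x86) prints 1, a big-endian host prints 256
printf '\001\000' | od -An -td2 | tr -d ' '
```

On-the-wire data exchanged between hosts of different byte order is exactly where such differences bite, which is why untested mixed-endian deployments are risky.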
Re: [Gluster-users] gluster client performance
On Wednesday 10 August 2011 12:11 AM, Jesse Stroik wrote:

Pavan,

Thank you for your help. We wanted to get back to you with our results and observations. I'm cc'ing gluster-users for posterity. We did experiment with enable-trickling-writes. That was one of the translator tunables we wanted to know the precise syntax for, so that we could be certain we were disabling it. As hoped, disabling trickling writes improved performance somewhat. We are definitely interested in any other undocumented write-buffer related tunables. We've tested the documented tuning parameters. Performance improved significantly when we switched clients to a mainline kernel (2.6.35-13). We also updated to OFED 1.5.3, but it wasn't responsible for the performance improvement. Our findings with 32KB block size (cp) write performance:

250-300 MB/s single-stream performance
400 MB/s multiple-stream per-client performance

OK. Let's see if we can improve this further. Please use the following tunables as suggested below:

For write-behind - option cache-size 16MB
For read-ahead - option page-count 16
For io-cache - option cache-size 64MB

You will need to place these lines in the client volume file, restart the server and remount the volume on the clients. Your client (fuse) volume file sections will look like below (of course, with the change in the volume name) -

volume testvol-write-behind
    type performance/write-behind
    option cache-size 16MB
    subvolumes testvol-client-0
end-volume

volume testvol-read-ahead
    type performance/read-ahead
    option page-count 16
    subvolumes testvol-write-behind
end-volume

volume testvol-io-cache
    type performance/io-cache
    option cache-size 64MB
    subvolumes testvol-read-ahead
end-volume

Run your copy command with these tunables. For now, let's have the default setting for trickling writes, which is ENABLED. You can simply remove this tunable from the volume file to get the default behaviour.

Pavan

This is much higher than we observed with the kernel 2.6.18 series.
Using the 2.6.18 line, we also observed virtually no difference between running single-stream tests and multi-stream tests, suggesting a bottleneck with the fabric. Both 2.6.18 and 2.6.35-13 performed very well (about 600 MB/s) when writing 128KB blocks. When I disabled write-behind on the 2.6.18 series of kernels as a test, performance plummeted to a few MB/s when writing block sizes smaller than 128KB. We did not test this extensively. Disabling enable-trickling-writes gave us approximately a 20% boost, reflected in the numbers above, for single-stream writes. We observed no significant difference with several streams per client from disabling that tunable.

For reference, we are running another cluster file system on the same underlying hardware/software. With both the old kernel (2.6.18.x) and the new kernel (2.6.35-13), we get approximately:

450-550 MB/s single-stream performance
1200+ MB/s multiple-stream per-client performance

We set the test directory to write entire files to a single LUN, which is how we configured gluster, in an effort to mitigate differences. It is treacherous to speculate why we might be more limited with gluster over RDMA than with the other cluster file system without spending a significant amount of analysis. That said, I wonder if there may be an issue with the way fuse handles write buffers, causing a bottleneck for RDMA. The bottom line is that our observed performance was poor using the 2.6.18 RHEL 5 kernel line relative to the mainline (2.6.35) kernels. Updating to the newer kernels was well worth the testing and downtime. Hopefully this information can help others.

Best, Jesse Stroik

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
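The block-size sensitivity described above (a few MB/s below 128KB, ~600 MB/s at 128KB) can be probed with a small loop. A rough sketch - the TARGET path is hypothetical and defaults to /tmp so the snippet runs anywhere; point it at the mounted volume under test:

```shell
# Write a small test file at several block sizes and print dd's throughput line.
# TARGET is hypothetical; set TARGET=/mnt/cluster-volume (or similar) for a real test,
# and raise count so each run lasts several seconds.
TARGET=${TARGET:-/tmp}
for bs in 32K 128K 1M; do
  echo "block size $bs:"
  dd if=/dev/zero of="$TARGET/bs_test.img" bs=$bs count=32 conv=fsync 2>&1 | tail -1
done
rm -f "$TARGET/bs_test.img"
```

conv=fsync makes dd flush before reporting, so the numbers reflect data actually handed to the filesystem rather than the page cache.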
Re: [Gluster-users] 3.2.2 Performance Issue
On Wednesday 10 August 2011 02:56 AM, Joey McDonald wrote:

Hello all,

I've configured 4 bricks over a GigE network; however, I'm getting very slow performance writing to my gluster share. I just set this up this week, and here's what I'm seeing:

A few questions -
1. Are these baremetal systems or are they virtual machines?
2. What is the amount of RAM on each of these systems?
3. How many CPUs do they have?
4. Can you also perform the dd on /gluster as opposed to /root to check the backend performance?
5. What is your disk backend? Is it direct attached or is it an array?
6. What is the backend filesystem?
7. Can you run a simple scp of about 10M between any two of these systems and report the speed?

Pavan

[root@vm-container-0-0 ~]# gluster --version | head -1
glusterfs 3.2.2 built on Jul 14 2011 13:34:25

[root@vm-container-0-0 pifs]# gluster volume info
Volume Name: pifs
Type: Distributed-Replicate
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: vm-container-0-0:/gluster
Brick2: vm-container-0-1:/gluster
Brick3: vm-container-0-2:/gluster
Brick4: vm-container-0-3:/gluster

The 4 systems are each storage bricks and storage clients, mounting gluster like so:

[root@vm-container-0-1 ~]# df -h /pifs/
Filesystem Size Used Avail Use% Mounted on
glusterfs#127.0.0.1:pifs 1.8T 848M 1.7T 1% /pifs

iperf shows network throughput looking good:

[root@vm-container-0-0 pifs]# iperf -c vm-container-0-1
Client connecting to vm-container-0-1, TCP port 5001
TCP window size: 16.0 KByte (default)
[ 3] local 10.19.127.254 port 53441 connected with 10.19.127.253 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.10 GBytes 941 Mbits/sec

Then, writing to the local disk is pretty fast:

[root@vm-container-0-0 pifs]# dd if=/dev/zero of=/root/dd_test.img bs=1M count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 4.8066 seconds, 436 MB/s

However, writes to the gluster share are abysmally slow:
[root@vm-container-0-0 pifs]# dd if=/dev/zero of=/pifs/dd_test.img bs=1M count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 241.866 seconds, 8.7 MB/s

Other than the fact that it's quite slow, it seems to be very stable. iozone testing shows about the same results. Any help troubleshooting would be much appreciated. Thanks!

--joey

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
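One sanity check worth making on numbers like these: with replica 2, the client sends every byte over the same GigE link twice (once per replica), so the theoretical write ceiling is roughly half the measured iperf bandwidth. A back-of-envelope calculation (plain arithmetic, not from the thread):

```shell
# 941 Mbit/s measured by iperf -> MByte/s, halved for replica-2 client-side writes
awk 'BEGIN { printf "replica-2 write ceiling ~ %.1f MB/s\n", 941 / 8 / 2 }'
# prints: replica-2 write ceiling ~ 58.8 MB/s
```

The observed 8.7 MB/s is still far below even that ceiling, which suggests the bottleneck is something other than raw network bandwidth.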
Re: [Gluster-users] scrub as in zfs
On Monday 08 August 2011 01:30 PM, Uwe Kastens wrote:

Hi again,

If one thinks about a large amount of data, maybe as a replacement for tapes: will Gluster's auto-heal help with data corruption problems? I would expect that, but only if the files are accessed on a regular basis. As far as I have seen, there is no regular scrub mechanism like in zfs?

Right, not for now. With proactive/background self-heal, you will get something similar to that. Stay tuned.

Pavan

Kind Regards
Uwe Kastens
kiste...@googlemail.com

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
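Since self-heal in this era of Gluster is triggered on access, a common workaround for the missing scrub is to force access to every file so replicate inspects each one. A minimal sketch - the mount path is hypothetical, and the demo line only exists to make the snippet self-contained:

```shell
# Stat every file under the mount so replicate self-heal examines each one.
# MOUNT is hypothetical; point it at your replicated Gluster mount.
MOUNT=${MOUNT:-/tmp/demo-mount}
mkdir -p "$MOUNT" && touch "$MOUNT/example.file"   # demo only; skip on a real mount
find "$MOUNT" -noleaf -print0 | xargs -0 stat > /dev/null && echo "heal walk finished"
```

Run from cron, this gives a poor man's periodic scrub until a proactive self-heal lands.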
Re: [Gluster-users] gluster client performance
[..] I don't know why my writes are so slow compared to reads. Let me know if you're able to get better write speeds with the newer version of gluster and any of the configurations (if they apply) that I've posted. It might compel me to upgrade.

From your documentation of nfsspeedtest, I see that the reads can happen either via dd or via perl's sysread. I'm not sure if one is better than the other. Secondly - are you doing direct IO on the backend XFS? If not, try it with direct IO so that you are not misled by the memory situation in the system at the time of your test. It will give a clearer picture of what your backend is capable of. Your test is such that you write a file and immediately read the same file back, so it is possible that a good chunk of it is cached on the backend. After the write, flush the filesystem caches by using: echo 3 > /proc/sys/vm/drop_caches. Sleep for a while. Then do the read. Or, as suggested earlier, resort to direct IO while testing the backend FS.

Pavan

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
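Spelled out as a sequence, the suggested cache-flush methodology looks like this (must be run as root; the file path is hypothetical):

```shell
# After the write phase: flush dirty pages, drop the page cache, pause,
# then time the read so it actually hits the disks
sync
echo 3 > /proc/sys/vm/drop_caches
sleep 10
dd if=/export/10g_file of=/dev/null bs=1M   # or add iflag=direct to bypass the cache entirely
```

Either the drop_caches route or direct IO works; the point is that the read must not be served from memory.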
Re: [Gluster-users] gluster's cpuload is too high on specific birck daemon
On Wednesday 27 July 2011 01:45 PM, 공용준 (Yongjoon Kong) / Cloud Computing Technology Lead / SKCC wrote:

Hello,

I'm running Gluster in distributed-replicated mode (4 brick servers), and 10 client servers mount the Gluster volume from the brick1 server (mount -t glusterfs brick1:/volume /mnt). And there's a very strange thing: brick1's CPU load is too high - from the 'top' command, it's over 400% - but the other bricks' load is very low.

It is possible that an AFR self-heal is getting triggered. On the brick, run the following command:

strace -f -c -p <glusterfs pid>

and provide the output.

Pavan

Is there any reason for this? Or is there any way of tracking down this issue? Thanks.

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] gluster client performance
But that still does not explain why you should get as low as 50 MB/s for a single-stream, single-client write when the backend can support direct IO throughput of more than 700 MB/s. On the server, can you collect:

# iostat -xcdh 2 > iostat.log.brickXX

for the duration of the dd command? and

# strace -f -o stracelog.server -tt -T -e trace=write,writev -p <glusterfsd pid>

(again for the duration of the dd command)

Hi John,

A small change in the request. I hope you have not already spent time on this. The strace command should be:

strace -f -o stracelog.server -tt -T -e trace=pwrite -p <glusterfsd pid>

Thanks, Pavan

With the above, I want to measure the delay between the writes coming in from the client. iostat will describe the IO scenario on the server. Once the exercise is done, please attach the iostat.log.brickXX and stracelog.server.

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] gluster client performance
On Tuesday 26 July 2011 03:42 AM, John Lalande wrote:

Hi - I'm new to Gluster, but am trying to get it set up on a new compute cluster we're building. We picked Gluster for one of our cluster file systems (we're also using Lustre for fast scratch space), but the Gluster performance has been so bad that I think maybe we have a configuration problem - perhaps we're missing a tuning parameter that would help, but I can't find anything in the Gluster documentation; all the tuning info I've found seems geared toward Gluster 2.x.

For some background, our compute cluster has 64 compute nodes. The gluster storage pool has 10 Dell PowerEdge R515 servers, each with 12 x 2 TB disks. We have another 16 Dell PowerEdge R515s used as Lustre storage servers. The compute and storage nodes are all connected via QDR Infiniband. Both Gluster and Lustre are set to use RDMA over Infiniband. We are using OFED version 1.5.2-20101219, Gluster 3.2.2 and CentOS 5.5 on both the compute and storage nodes.

Hi John,

I would need some more information about your setup to estimate the performance you should get with your gluster setup.

1. Can you provide the details of how the disks are connected to the storage boxes? Is it via FC? What RAID configuration is it using (if any)?

2. What is the disk bandwidth you are getting on the local filesystem on a given storage node? I mean, pick any of the 10 storage servers dedicated for Gluster storage and perform a dd as below:

Write bandwidth measurement:
dd if=/dev/zero of=/export_directory/10g_file bs=128K count=81920 oflag=direct

Read bandwidth measurement:
dd if=/export_directory/10g_file of=/dev/null bs=128K count=81920 iflag=direct

[The above command is doing a direct IO of 10GB via your backend FS - ext4/xfs.]

3. What is the IB bandwidth that you are getting between the compute node and the glusterfs storage node?
You can run the tool rdma_bw to get the details:

On the server, run:
# rdma_bw -b
[-b measures bi-directional bandwidth]

On the compute node, run:
# rdma_bw -b server

[If you have not already installed it, rdma_bw is available via - http://mirror.centos.org/centos/5/os/x86_64/CentOS/perftest-1.2.3-1.el5.x86_64.rpm]

Let's start with this, and I will ask for more if necessary.

Pavan

Oddly, it seems like there's some sort of bottleneck on the client side - for example, we're only seeing about 50 MB/s write throughput from a single compute node when writing a 10GB file. But if we run multiple simultaneous writes from multiple compute nodes to the same Gluster volume, we get 50 MB/s from each compute node. However, running multiple writes from the same compute node does not increase throughput. The compute nodes have 48 cores and 128 GB RAM, so I don't think the issue is with the compute node hardware.

With Lustre, on the same hardware, with the same version of OFED, we're seeing write throughput on that same 10 GB file as follows: 476 MB/s single-stream write from a single compute node, and aggregate performance of more like 2.4 GB/s if we run simultaneous writes. That leads me to believe that we don't have a problem with RDMA; otherwise Lustre, which is also using RDMA, should be similarly affected.

We have tried both xfs and ext4 for the backend file system on the Gluster storage nodes (we're currently using ext4). We went with distributed (not distributed striped) for the Gluster volume - the thought was that if there was a catastrophic failure of one of the storage nodes, we'd only lose the data on that node; presumably with distributed striped you'd lose any data striped across that volume, unless I have misinterpreted the documentation.

So ... what's the expected/normal throughput for Gluster over QDR IB to a relatively large storage pool (10 servers / 120 disks)? Does anyone have suggested tuning tips for improving performance? Thanks!
John

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] gluster client performance
On Tuesday 26 July 2011 09:24 PM, John Lalande wrote:

Thanks for your help, Pavan!

Hi John,

1. Can you provide the details of how the disks are connected to the storage boxes? Is it via FC? What RAID configuration is it using (if at all any)?

The disks are 2TB near-line SAS, direct attached via a PERC H700 controller (the Dell PowerEdge R515 has 12 3.5" drive bays). They are in a RAID6 config, exported as a single volume, that's split into 3 equal-size partitions (due to ext4's (well, e2fsprogs') 16 TB limit).

2. What is the disk bandwidth you are getting on the local filesystem on a given storage node?

Seeing an average of 740 MB/s write, 971 MB/s read.

I presume you did this in one of the /data-brick*/export directories? Command output along with the command line would have been clearer, but that's fine.

3. What is the IB bandwidth that you are getting between the compute node and the glusterfs storage node? You can run the tool rdma_bw to get the details:

30407: Bandwidth peak (#0 to #976): 2594.58 MB/sec
30407: Bandwidth average: 2593.62 MB/sec
30407: Service Demand peak (#0 to #976): 978 cycles/KB
30407: Service Demand Avg: 978 cycles/KB

This looks like a DDR connection. ibv_devinfo -v will tell a better story about the link width and speed of your Infiniband connection. QDR should have a much higher bandwidth. But that still does not explain why you should get as low as 50 MB/s for a single-stream, single-client write when the backend can support direct IO throughput of more than 700 MB/s. On the server, can you collect:

# iostat -xcdh 2 > iostat.log.brickXX

for the duration of the dd command?
and

# strace -f -o stracelog.server -tt -T -e trace=write,writev -p <glusterfsd pid>

(again for the duration of the dd command)

With the above, I want to measure the delay between the writes coming in from the client. iostat will describe the IO scenario on the server. Once the exercise is done, please attach the iostat.log.brickXX and stracelog.server.

Pavan

Here's our gluster config:

# gluster volume info data
Volume Name: data
Type: Distribute
Status: Started
Number of Bricks: 30
Transport-type: rdma
Bricks:
Brick1: data-3-1-infiniband.infiniband:/data-brick1/export
Brick2: data-3-3-infiniband.infiniband:/data-brick1/export
Brick3: data-3-5-infiniband.infiniband:/data-brick1/export
Brick4: data-3-7-infiniband.infiniband:/data-brick1/export
Brick5: data-3-9-infiniband.infiniband:/data-brick1/export
Brick6: data-3-11-infiniband.infiniband:/data-brick1/export
Brick7: data-3-13-infiniband.infiniband:/data-brick1/export
Brick8: data-3-15-infiniband.infiniband:/data-brick1/export
Brick9: data-3-17-infiniband.infiniband:/data-brick1/export
Brick10: data-3-19-infiniband.infiniband:/data-brick1/export
Brick11: data-3-1-infiniband.infiniband:/data-brick2/export
Brick12: data-3-3-infiniband.infiniband:/data-brick2/export
Brick13: data-3-5-infiniband.infiniband:/data-brick2/export
Brick14: data-3-7-infiniband.infiniband:/data-brick2/export
Brick15: data-3-9-infiniband.infiniband:/data-brick2/export
Brick16: data-3-11-infiniband.infiniband:/data-brick2/export
Brick17: data-3-13-infiniband.infiniband:/data-brick2/export
Brick18: data-3-15-infiniband.infiniband:/data-brick2/export
Brick19: data-3-17-infiniband.infiniband:/data-brick2/export
Brick20: data-3-19-infiniband.infiniband:/data-brick2/export
Brick21: data-3-1-infiniband.infiniband:/data-brick3/export
Brick22: data-3-3-infiniband.infiniband:/data-brick3/export
Brick23: data-3-5-infiniband.infiniband:/data-brick3/export
Brick24: data-3-7-infiniband.infiniband:/data-brick3/export
Brick25: data-3-9-infiniband.infiniband:/data-brick3/export
Brick26: data-3-11-infiniband.infiniband:/data-brick3/export
Brick27: data-3-13-infiniband.infiniband:/data-brick3/export
Brick28: data-3-15-infiniband.infiniband:/data-brick3/export
Brick29: data-3-17-infiniband.infiniband:/data-brick3/export
Brick30: data-3-19-infiniband.infiniband:/data-brick3/export
Options Reconfigured:
nfs.disable: on

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Performance on GlusterFS
On Saturday 25 June 2011 03:56 PM, anish.b.ku...@ril.com wrote:

Yes sure, it's 74 MB. I am using Gluster version 3.2.1.1. In my 4-node cluster setup, the node on which I am performing the test run is a physical server, an HP ProLiant DL380 G5 running RHEL 5.5, with a 1000Mbps network. The other three nodes are hosted as virtual machines (VMware) on Windows 2008 R2; the host machine network is 1Gbps.

What are your virtual machines? Linux, I suppose? A few aspects of your setup make the comparison unfair -

1. Since you run untar on the local file system on a physical server, there is a possibility of seeing the effect of write caching.
2. Since glusterfs is working on VMs, comparing its performance with that on a physical server is not fair.
3. The VMs are hosted on a system with low network bandwidth.
4. The IO throughput inside a VM is limited by the throughput of the host file system, in this case a Windows filesystem (NTFS)?

What is the amount of RAM on the Windows system hosting the VMs?

Pavan

PS: Adding gluster-users. The discussion might help others.

Regards, Anish Kumar

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Performance on GlusterFS
On Friday 24 June 2011 10:29 AM, anish.b.ku...@ril.com wrote:

Hi,

I have set up a 4-node cluster on virtual servers on the RHEL platform.

It would help if you can post the output of gluster volume info, to start with. Are you using some benchmark to compare GlusterFS performance with local filesystem performance?

Pavan

I am not able to get better performance statistics on GlusterFS as compared to the local file system. Kindly suggest a test run that can be used to differentiate between them.

Regards, Anish Kumar

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster3.2@Grid5000] 128 nodes failure and rr scheduler question
On Sunday 12 June 2011 07:00 PM, François Thiebolt wrote:

Hello,

To make things clear, what I've done is:
- deploying GlusterFS on 2, 4, 8, 16, 32, 64, 128 nodes
- running a variant of the MAB benchmark (it's all about compilation of openssl-1.0.0) on 2, 4, 8, 16, 32, 64, 128 nodes
- I used 'pdsh -f 512' to start MAB on all nodes at the same time
- in each experiment, on each node, I ran MAB in a dedicated directory within the glusterfs global namespace (e.g. nodeA used <gluster global namespace>/nodeA/mab files) to avoid a metadata storm on the parent directory inode
- between experiments, I destroy and redeploy a complete new GlusterFS setup (and I also destroy everything within each brick, i.e. the exported storage dir)

I then compare the average compilation time vs the number of nodes ... and it increases due to the round-robin scheduler that dispatches files on all the bricks:

2 nodes: Phase_V avg (s) 249.9332121175
4 nodes: Phase_V avg (s) 262.808117374
8 nodes: Phase_V avg (s) 293.572061537875
16 nodes: Phase_V avg (s) 351.436554833375
32 nodes: Phase_V avg (s) 546.503069517844
64 nodes: Phase_V avg (s) 1010.61019479478

(Phase V is related to the compilation itself; the previous phases are about metadata ops.) You can also try to compile a Linux kernel on your own; this is pretty much the same thing.

Thanks much for your detailed description. Is Phase_V the only phase where you are seeing reduced performance?

With regards to your problem, since you are using the bricks also as clients, you have a NUMA kind of scenario. In the case of two bricks (and hence two clients), during compilation ~50% of the files will be available locally to the client, for which the latencies will be minimal, and the other 50% will suffer additional latencies. As you increase the number of nodes, this asymmetry is seen for more of the files. So, the problem is not really the introduction of more servers, but the degree of asymmetry your application is seeing.
Your numbers for 2 nodes might not be a good indicator of the average performance. Try the same experiment with the clients separated from the servers. If you still see reverse-linear performance with increased bricks/clients, we can investigate further.

Pavan

Now regarding the GlusterFS setup: yes, you're right, there is no replication, so this is a simple striping (on a per-file basis) setup. Each time, I create a GlusterFS volume featuring one brick, then add bricks one by one until I reach the number of nodes, and after that I start the volume. Now regarding the 128-brick case: it is when I start the volume that I get a random error telling me that brickX does not respond, and the failing brick changes every time I retry starting the volume. So far, I haven't tested with a number of nodes between 64 and 128.

François

On Friday, June 10, 2011 16:38 CEST, Pavan T C t...@gluster.com wrote:

On Wednesday 08 June 2011 06:10 PM, Francois THIEBOLT wrote:

Hello,

I'm driving some experiments on Grid'5000 with GlusterFS 3.2 and, as a first point, I've been unable to start a volume featuring 128 bricks (64 works). Then, due to the round-robin scheduler, as the number of nodes increases (every node is also a brick), the performance of an application on an individual node decreases!

I would like to understand what you mean by an increase of nodes. You have 64 bricks and each brick also acts as a client. So where is the increase in the number of nodes? Are you referring to the mounts that you are doing?

What is your gluster configuration? I mean, is it distribute only, or is it a distributed-replicate setup? [From your command sequence, it should be a pure distribute, but I just want to be sure.]

What is your application like? Is it mostly I/O intensive? It will help if you provide a brief description of the typical operations done by your application. How are you measuring the performance?
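Pavan's asymmetry argument can be sketched numerically: with N bricks that are also clients and files distributed uniformly, roughly 1/N of a client's accesses hit local storage. The per-access latencies below (0.1 ms local, 1.0 ms remote) are purely illustrative assumptions, not measured values from this thread:

```python
# Expected per-access latency when a fraction 1/N of a client's files is local.
# local_ms and remote_ms are hypothetical figures chosen for illustration only.
def expected_latency_ms(n_bricks, local_ms=0.1, remote_ms=1.0):
    local_fraction = 1.0 / n_bricks
    return local_fraction * local_ms + (1.0 - local_fraction) * remote_ms

for n in (2, 4, 8, 16, 32, 64):
    print(f"{n:3d} bricks: {expected_latency_ms(n):.3f} ms expected per access")
```

The expected latency climbs toward the pure-remote figure as N grows, which matches the observation that the degradation comes from increasing asymmetry rather than from the extra servers themselves.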
What parameter determines that you are experiencing a decrease in performance with an increase in the number of nodes?

Pavan

So my question is: how do I STOP the round-robin distribution of files over the bricks within a volume?

*** Setup ***
- I'm using glusterfs 3.2 built from source
- every node is both a client node and a brick (storage)

Commands:
- gluster peer probe (for each of the 128 nodes)
- gluster volume create myVolume transport tcp <128 bricks>:/storage
- gluster volume start myVolume (fails with 128 bricks!)
- mount -t glusterfs .. on all nodes

Feel free to tell me how to improve things

François
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
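For a volume this wide, the 128-entry brick list in the create command is worth generating rather than typing by hand. A minimal sketch, assuming hypothetical hostnames node001 through node128 and the /storage export directory mentioned in the command sequence:

```python
# Build the space-separated brick list "node001:/storage ... node128:/storage"
# for use in: gluster volume create myVolume transport tcp <brick list>
# The node001..node128 naming scheme is an assumption for illustration.
bricks = [f"node{i:03d}:/storage" for i in range(1, 129)]
brick_list = " ".join(bricks)

print(f"{len(bricks)} bricks, first: {bricks[0]}, last: {bricks[-1]}")
# The generated list can then be spliced into the CLI invocation, e.g.:
# gluster volume create myVolume transport tcp node001:/storage node002:/storage ...
```

Generating the list also makes it easy to bisect the "brickX does not respond" start failure between 64 and 128 bricks by slicing `bricks[:96]` and so on.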