Re: [Gluster-users] Extremely slow cluster performance

2019-04-23 Thread Patrick Rennie
Hi Darrel,

Thanks again for your advice. I tried to take yesterday off and just not
think about it, and I'm back at it again today. Still no real progress;
however, my colleague upgraded our version to 3.13 yesterday, which has
broken NFS and caused some other issues for us. It did add the 'gluster
volume heal  info summary' command, so I can use that to keep an eye on how
many files do seem to need healing; if it's accurate, it's possibly fewer
than I thought.
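
For reference, the check I'm now running periodically looks roughly like
this (gvAA01 is our volume; the watch interval is just what I picked):

# gluster volume heal gvAA01 info summary
# watch -n 300 'gluster volume heal gvAA01 info summary'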

We are in the process of moving this data to new storage, but it takes a
long time to move so much data around, and more keeps coming in each day.

We do have 3 cache SSDs for each brick, so performance on the bricks
themselves is generally quite good; I can dd a 10GB file at ~1.7-2GB/s
directly on a brick, so I think the performance of each brick is actually
OK.

It's a distribute/replicate volume, not dispersed, so I can't change
disperse.shd-max-threads.

I have checked the basics, like all peers being connected and no scrubs
being in progress, etc.
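
For what it's worth, those checks were roughly the following (the "scan"
line in zpool status shows whether a ZFS scrub is running on a brick's pool):

# gluster peer status
# gluster volume status gvAA01
# zpool status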

Will keep working away at this, and will start to read through some of your
performance tuning suggestions. Really appreciate your advice.

Cheers,

-Patrick



On Mon, Apr 22, 2019 at 12:43 AM Darrell Budic 
wrote:

> Patrick-
>
> Specifically re:
>
>>> Thanks again for your advice. I've left it for a while, but unfortunately
>>> it's still just as slow and is causing more problems for our operations now. I
>>> will need to take some steps to at least bring performance back to normal
>>> while continuing to investigate the issue longer term. I can definitely see
>>> one node with heavier CPU than the other, almost double, which I am OK with,
>>> but I think the heal process is going to take forever. Checking "gluster
>>> volume heal info" shows thousands and thousands of files which may need
>>> healing; I have no idea how many in total, as the command is still running
>>> after hours, so I am not sure what has gone so wrong to cause this.
>>> ...
>>> I have no idea how long the healing is going to take on this cluster; we
>>> have around 560TB of data on here, but I don't think I can wait that long
>>> to try and restore performance to normal.
>>>
>>
> You’re in a bind, I know, but it’s just going to take some time to recover.
> You have a lot of data, and even at the best speeds your disks and networks
> can muster, it’s going to take a while. Until your cluster is fully healed,
> anything else you try may not have the full effect it would on a fully
> operational cluster. Your predecessor may have made things worse by not
> having proper posix attributes on the ZFS file system. You may have made
> things worse by killing brick processes in your distributed-replicated
> setup, creating an additional need for healing and possibly compounding the
> overall performance issues. I’m not trying to blame you or make you feel
> bad, but I do want to point out that there’s a problem here, and there is
> unlikely to be a silver bullet that will resolve the issue instantly.
> You’re going to have to give it time to get back into a “normal" condition,
> which seems to be what your setup was configured and tested for in the
> first place.
>
> Those things said, rather than trying to move things from this cluster to
> different storage, what about having your VMs mount different storage in
> the first place and move the write load off of this cluster while it
> recovers?
>
> Looking at the profile you posted for Strahil, your bricks are spending a
> lot of time doing LOOKUPs, and some are slower than others by a significant
> margin. If you haven’t already, check the zfs pools on those and make sure
> they don’t have any failed disks that might be slowing them down. Consider
> whether you can speed them up with a ZIL or SLOG if they are spinning disks
> (although your previous server descriptions sound like you don’t need a
> SLOG, a ZIL may help if they are HDDs). Just saw your additional comments
> that one server is faster than the other; it’s possible that it’s got
> the actual data and the other one is doing heals every time it gets
> accessed, or it’s just got fuller and slower volumes. It may make sense to
> try forcing all your VM mounts to the faster server for a while, even if
> it’s the one with higher load (serving will get preference over healing, but
> don’t push the shd-max-threads too high, as they can squash performance). Given
> it’s a dispersed volume, make sure you’ve got disperse.shd-max-threads at 4
> or 8, and raise disperse.shd-wait-qlength to 4096 or so.
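>
> As a rough sketch of those settings (assuming your volume is gvAA01 and
> that the disperse options apply to your layout), that would be:
>
> # gluster volume set gvAA01 disperse.shd-max-threads 4
> # gluster volume set gvAA01 disperse.shd-wait-qlength 4096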
>
> You’re getting into things best tested with everything working, but
> desperate times call for accelerated testing, right?
>
> You could experiment with different values of performance.io-thread-count;
> try 48. But if your CPU load is already near max, you’re getting everything
> you can out of your CPU already, so don’t spend too much time on it.
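>
> Again just a sketch, not a guaranteed win:
>
> # gluster volume set gvAA01 performance.io-thread-count 48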
>
> Check out
> https://github.com/gluster/glusterfs/blob/release-3.11/extras/group-nl-cache 
> and

Re: [Gluster-users] Upgrade 5.5 -> 5.6: network traffic bug fixed?

2019-04-23 Thread Poornima Gurusiddaiah
Hi,

Thank you for the update, sorry for the delay.

I did some more tests, but couldn't reproduce the spike in network
bandwidth usage when quick-read is on. After upgrading, did you remount the
clients? The fix will not be effective until the client process is
restarted.
If you have already restarted the client processes, then there must be
something related to the workload on the live system that is triggering a
bug in quick-read. We would need a wireshark capture, if possible, to debug
further.
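
A capture along these lines from one affected client would be enough (the
interface and output file name here are only examples):

# tcpdump -i any -s 0 -w /var/tmp/quick-read.pcap tcp and not port 22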

Regards,
Poornima

On Tue, Apr 16, 2019 at 6:25 PM Hu Bert  wrote:

> Hi Poornima,
>
> thanks for your efforts. I ran a couple of tests and the results are the
> same, so the options are not related. Anyway, I'm not able to
> reproduce the problem on my testing system, although the volume
> options are the same.
>
> About 1.5 hours ago I set performance.quick-read to on again and
> watched: load/iowait went up (not bad at the moment, little traffic),
> but network traffic went up from <20 MBit/s to 160 MBit/s. After
> deactivating quick-read, traffic dropped to <20 MBit/s again.
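>
> For the record, the toggling itself was just the following (volume name
> left out here):
>
> # gluster volume set <volname> performance.quick-read on
> # gluster volume set <volname> performance.quick-read off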
>
> munin graph: https://abload.de/img/network-client4s0kle.png
>
> The 2nd peak is from the last test.
>
>
> Thx,
> Hubert
>
> On Tue, Apr 16, 2019 at 09:43, Hu Bert wrote:
> >
> > In my first test on my testing setup the traffic was at a normal
> > level, so I thought I was "safe". But on my live system the network
> > traffic was a multiple of what one would expect.
> > performance.quick-read was enabled in both; the only difference in the
> > volume options between live and testing is:
> >
> > performance.read-ahead: testing on, live off
> > performance.io-cache: testing on, live off
> >
> > I ran another test on my testing setup, deactivated both and copied 9
> > GB of data. Now the traffic went up as well, from ~9-10 MBit/s before
> > to 100 MBit/s with both options off. Does performance.quick-read
> > require one of those options set to 'on'?
> >
> > I'll start another test shortly and activate one of those 2 options;
> > maybe there's a connection between those 3 options?
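> >
> > To compare the two systems I'm checking the options with something like
> > this (volume name left out here):
> >
> > # gluster volume get <volname> performance.quick-read
> > # gluster volume get <volname> performance.read-ahead
> > # gluster volume get <volname> performance.io-cache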
> >
> >
> > Best Regards,
> > Hubert
> >
> > On Tue, Apr 16, 2019 at 08:57, Poornima Gurusiddaiah wrote:
> > >
> > > Thank you for reporting this. I had done testing on my local setup and
> the issue was resolved even with quick-read enabled. Let me test it again.
> > >
> > > Regards,
> > > Poornima
> > >
> > > On Mon, Apr 15, 2019 at 12:25 PM Hu Bert 
> wrote:
> > >>
> > >> FYI: after setting performance.quick-read to off, network traffic
> > >> dropped to normal levels; client load/iowait is back to normal as well.
> > >>
> > >> client: https://abload.de/img/network-client-afterihjqi.png
> > >> server: https://abload.de/img/network-server-afterwdkrl.png
> > >>
> > >> On Mon, Apr 15, 2019 at 08:33, Hu Bert <revi...@googlemail.com> wrote:
> > >> >
> > >> > Good Morning,
> > >> >
> > >> > Today I updated my replica 3 setup (debian stretch) from version 5.5
> > >> > to 5.6, as I thought the network traffic bug (#1673058) was fixed and
> > >> > I could re-activate 'performance.quick-read' again. See release notes:
> > >> >
> > >> > https://review.gluster.org/#/c/glusterfs/+/22538/
> > >> > http://git.gluster.org/cgit/glusterfs.git/commit/?id=34a2347780c2429284f57232f3aabb78547a9795
> > >> >
> > >> > The upgrade went fine, and then I was watching iowait and network
> > >> > traffic. It seems that the network traffic went up after the upgrade
> > >> > and reactivation of performance.quick-read. Here are some graphs:
> > >> >
> > >> > network client1: https://abload.de/img/network-clientfwj1m.png
> > >> > network client2: https://abload.de/img/network-client2trkow.png
> > >> > network server: https://abload.de/img/network-serverv3jjr.png
> > >> >
> > >> > gluster volume info: https://pastebin.com/ZMuJYXRZ
> > >> >
> > >> > Just wondering if the network traffic bug really got fixed or if this
> > >> > is a new problem. I'll wait a couple of minutes and then deactivate
> > >> > performance.quick-read again, just to see if network traffic goes down
> > >> > to normal levels.
> > >> >
> > >> >
> > >> > Best regards,
> > >> > Hubert
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Proposal: Changes in Gluster Community meetings

2019-04-23 Thread Darrell Budic
I was one of the folks who wanted a NA/EMEA scheduled meeting, and I’m going to
have to miss it due to some real-life issues (a clogged sewer I’m going to have
to be dealing with at the time). Apologies, I’ll work on making the next one.

  -Darrell

> On Apr 22, 2019, at 4:20 PM, FNU Raghavendra Manjunath  
> wrote:
> 
> 
> Hi,
> 
> This is the agenda for tomorrow's community meeting for NA/EMEA timezone.
> 
> https://hackmd.io/OqZbh7gfQe6uvVUXUVKJ5g?both
>
> On Thu, Apr 11, 2019 at 4:56 AM Amar Tumballi Suryanarayan <atumb...@redhat.com> wrote:
> Hi All,
> 
> Below are the final details of our community meeting, and I will be sending
> invites to the mailing list following this email. You can add the Gluster
> Community Calendar so you can get notifications for the meetings.
>
> We are starting the meetings next week. For the first meeting, we need 1
> volunteer from the users to discuss their use case: what went well, what went
> badly, etc., preferably in the APAC region; the NA/EMEA region follows the
> next week.
>  
> Draft Content: https://hackmd.io/OqZbh7gfQe6uvVUXUVKJ5g 
> 
> 
> Gluster Community Meeting
>
> Previous meeting minutes: http://github.com/gluster/community
>
> Date/Time: check the community calendar
>
> Bridge:
> - APAC friendly hours: https://bluejeans.com/836554017
> - NA/EMEA: https://bluejeans.com/486278655
>
> Attendance:
> - Name, Company
>
> Host:
> - Who will host the next meeting?
> - Host will need to send out the agenda 24hr - 12hrs in advance to the
>   mailing list, and also make sure to send the meeting minutes.
> - Host will need to reach out to at least one user who can talk about their
>   usecase, their experience, and their needs.
> - Host needs to send meeting minutes as a PR to
>   http://github.com/gluster/community
>
> User stories:
> - Discuss 1 usecase from a user.
>   - How was the architecture derived, what volume type was used, options, etc.?
>   - What were the major issues faced? How can we improve on them?
>   - What worked well?
>   - How can we all collaborate well, so it is a win-win for the community
>     and the user?
>
> Community:
> - Any release updates?
> - Blocker issues across the project?
> - Metrics:
>   - Number of new bugs since the previous meeting. How many are not triaged?
>   - Number of emails, anything unanswered?
>
> Conferences / Meetups:
> - Any conference in the next 1 month where gluster-developers are going?
>   gluster-users are going? So we can meet and discuss.
>
> Developer focus:
> - Any design specs to discuss?
> - Metrics of the week?
>   - Coverity
>   - Clang-Scan
>   - Number of patches from new developers.
>   - Did we increase test coverage?
> - [Atin] Also talk about the most frequent test failures in CI and carve out
>   an AI to get them fixed.
>
> RoundTable
>
> Regards,
> Amar
> 
> On Mon, Mar 25, 2019 at 8:53 PM Amar Tumballi Suryanarayan <atumb...@redhat.com> wrote:
> Thanks for the feedback Darrell,
> 
> The new proposal is to have one meeting at a North America 'morning' time
> (10AM PST), and another at an ASIA daytime slot, which is 7pm/6pm in the
> evening in Australia, 9pm in New Zealand, 5pm in Tokyo, and 4pm in Beijing.
>
> For example, if we choose every other Tuesday for the meeting, and the 1st of
> the month is a Tuesday, we would use the North America time on the 1st, and
> on the 15th it would be the ASIA/Pacific time.
> 
> Hopefully this way we can cover all the timezones, and the meeting minutes
> will be committed to the github repo, so it will be easier for everyone to
> be aware of what is happening.
> 
> Regards,
> Amar
> 
> On Mon, Mar 25, 2019 at 8:40 PM Darrell Budic wrote:
> As a user, I’d like to attend more of these, but the time slot is 3AM for me.
> Any possibility of a rolling schedule (move the meeting +6 hours each week,
> with rolling attendance from maintainers?) or an occasional regional meeting
> offset 12 hours from the one you’re proposing?
> 
>   -Darrell
> 
>> On Mar 25, 2019, at 4:25 AM, Amar Tumballi Suryanarayan > 

Re: [Gluster-users] GlusterFS on ZFS

2019-04-23 Thread Cody Hill

Thanks for the info Karli,

I wasn’t aware ZFS Dedup was such a dog. I guess I’ll leave that off. My data
gets 3.5:1 savings on compression alone. I was aware of striped sets; I will
be doing 6x striped sets across 12x disks.

On top of this design I’m going to try and test an Intel Optane DIMM (512GB) as a
“Tier” for GlusterFS to try and get further write acceleration. Are there any
issues with the GlusterFS “Tier” functionality that anyone is aware of?

Thank you,
Cody Hill 

> On Apr 18, 2019, at 2:32 AM, Karli Sjöberg  wrote:
> 
> 
> 
> On Apr 17, 2019 at 16:30, Cody Hill wrote:
> Hey folks.
> 
> I’m looking to deploy GlusterFS to host some VMs. I’ve done a lot of reading 
> and would like to implement Deduplication and Compression in this setup. My 
> thought would be to run ZFS to handle the Compression and Deduplication.
> 
> You _really_ don't want ZFS doing dedup for any reason.
> 
> 
> ZFS would give me the following benefits:
> 1. If a single disk fails rebuilds happen locally instead of over the network
> 2. Zil & L2Arc should add a slight performance increase
> 
> Adding two really good NVMe SSDs as a mirrored SLOG vdev helps a great deal
> with synchronous write performance, turning every random write into large
> streams that the spinning drives handle better.
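>
> As a sketch, adding such a mirrored SLOG looks like this (pool and device
> names here are just examples):
>
> # zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1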
> 
> Don't know how picky Gluster is about synchronicity though. Most
> "performance" tweaking suggests setting stuff to async, which I wouldn't
> recommend; it's obviously a huge boost for throughput, since you're not
> waiting for stuff to actually get written, but it's dangerous.
> 
> With mirrored NVME SLOG's, you could probably get that throughput without 
> going asynchronous, which saves you from potential data corruption in a 
> sudden power loss.
> 
> L2ARC, on the other hand, does a bit for read latency, but for a
> general-purpose file server it makes, in practice, not a huge difference;
> the working set is just too large. Also keep in mind that L2ARC isn't
> "free": you need more RAM to track where you've cached stuff...
> 
> 3. Deduplication and Compression are inline and have pretty good performance 
> with modern hardware (Intel Skylake)
> 
> ZFS deduplication has terrible performance. Watch your throughput 
> automatically drop from hundreds or thousands of MB/s down to, like 5. It's a 
> feature;)
> 
> 4. Automated Snapshotting
> 
> I can then layer GlusterFS on top to handle distribution to allow 3x Replicas 
> of my storage.
> My question is… Why aren’t more people doing this? Is this a horrible idea 
> for some reason that I’m missing?
> 
> While it could save a lot of space in some hypothetical instance, the 
> drawbacks can never motivate it. E.g. if you want one node to suddenly die 
> and never recover because of RAM exhaustion, go with ZFS dedup ;)
> 
> I’d be very interested to hear your thoughts.
> 
> Avoid ZFS dedup at all costs. LZ4 compression, on the other hand, is awesome;
> definitely use that! It's basically a free performance enhancer that also
> saves space :)
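>
> E.g. (the dataset name is just an example):
>
> # zfs set compression=lz4 tank/gluster
> # zfs get compressratio tank/gluster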
> 
> As another person has said, the best-performing layout is RAID10: striped
> mirrors. I understand you'd want to get as much capacity as possible with
> RAID-Z/RAID(5|6) since gluster also replicates/distributes, but it has a huge 
> impact on IOPS. If performance is the main concern, do striped mirrors with 
> replica 3 in Gluster. My advice is to test thoroughly with different pool 
> layouts to see what gives acceptable performance against your volume 
> requirements.
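>
> For example, a striped-mirror (RAID10-style) pool would be created roughly
> like this (disk names are placeholders):
>
> # zpool create tank mirror sda sdb mirror sdc sdd mirror sde sdf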
> 
> /K
> 
> 
> Additional thoughts:
> I’d like to use Ganesha pNFS to connect to this storage. (Any issues here?)
> I think I’d need KeepAliveD across these 3x nodes to provide the address to
> put in the FSTAB (is this correct?)
> I’m also thinking about creating a “Gluster Tier” of 512GB of Intel Optane 
> DIMM to really smooth out write latencies… Any issues here?
> 
> Thank you,
> Cody Hill
> 
> 

___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] Extremely slow Gluster performance

2019-04-23 Thread Patrick Rennie
Hello Gluster Users,

I am hoping someone can help me with resolving an ongoing issue I've been
having; I'm new to mailing lists, so forgive me if I have gotten anything
wrong. We have noticed our performance deteriorating over the last few
weeks, easily measured by timing an ls on one of our top-level folders: it
usually takes 2-5 seconds and now takes up to 20 minutes, which obviously
renders our cluster basically unusable. This has been intermittent in the
past but is now almost constant, and I am not sure how to work out the
exact cause.

We have noticed some errors in the brick logs, and have noticed that if we
kill the right brick process, performance instantly returns to normal. It
is not always the same brick, but it indicates to me that something in the
brick processes or background tasks may be causing extreme latency. Because
killing the right brick process fixes it, I think it's a specific file,
folder, or operation which may be hanging and causing the increased
latency, but I am not sure how to work out which.

One last thing to add is that our bricks are getting quite full (~95%
full); we are trying to migrate data off to new storage, but that is going
slowly and is not helped by this issue. I am currently trying to run a full
heal as there appear to be many files needing healing, and I have all brick
processes running so they have an opportunity to heal, but this means
performance is very poor. It currently takes over 15-20 minutes to do an ls
of one of our top-level folders, which just contains 60-80 other folders;
this should take 2-5 seconds. This is all being checked via a FUSE mount
locally on the storage node itself, but it is the same for other clients
and VMs accessing the cluster. Initially it seemed our NFS mounts were not
affected and operated at normal speed, but testing over the last day has
shown that our NFS clients are also extremely slow, so it doesn't seem
specific to FUSE as I first thought it might be.

I am not sure how to proceed from here; I am fairly new to gluster, having
inherited this setup from my predecessor, and am trying to keep it going. I
have included some info below to try and help with diagnosis; please let me
know if any further info would be helpful. I would really appreciate any
advice on what I could try to work out the cause. Thank you in advance for
reading this, and for any suggestions you might be able to offer.

- Patrick

This is an example of the main error I see in our brick logs; there have
been others, and I can post them when I see them again too:
[2019-04-20 04:54:43.055680] E [MSGID: 113001]
[posix.c:4940:posix_getxattr] 0-gvAA01-posix: getxattr failed on
/brick1/ library: system.posix_acl_default  [Operation not
supported]
[2019-04-20 05:01:29.476313] W [posix.c:4929:posix_getxattr]
0-gvAA01-posix: Extended attributes not supported (try remounting brick
with 'user_xattr' flag)
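
Given that error, I'm also going to double-check the xattr/ACL properties
on the ZFS datasets behind the bricks, roughly like this (the dataset name
here is just a guess based on the brick path):

# zfs get xattr,acltype brick1/gvAA01
# zfs set xattr=sa brick1/gvAA01
# zfs set acltype=posixacl brick1/gvAA01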

Our setup consists of 2 storage nodes and an arbiter node. I have noticed
our nodes are on slightly different versions; I'm not sure if this could be
an issue. We have 9 bricks on each node, made up of ZFS RAIDZ2 pools -
total capacity is around 560TB.
We have bonded 10gbps NICs on each node, and I have tested bandwidth with
iperf and found that it's what would be expected from this config.
Individual brick performance seems OK; I've tested several bricks using dd
and can write a 10GB file at 1.7GB/s.

# dd if=/dev/zero of=/brick1/test/test.file bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 6.20303 s, 1.7 GB/s

Node 1:
# glusterfs --version
glusterfs 3.12.15

Node 2:
# glusterfs --version
glusterfs 3.12.14

Arbiter:
# glusterfs --version
glusterfs 3.12.14

Here is our gluster volume status:

# gluster volume status
Status of volume: gvAA01
Gluster process                            TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 01-B:/brick1/gvAA01/brick            49152     0          Y       7219
Brick 02-B:/brick1/gvAA01/brick            49152     0          Y       21845
Brick 00-A:/arbiterAA01/gvAA01/brick1      49152     0          Y       6931
Brick 01-B:/brick2/gvAA01/brick            49153     0          Y       7239
Brick 02-B:/brick2/gvAA01/brick            49153     0          Y       9916
Brick 00-A:/arbiterAA01/gvAA01/brick2      49153     0          Y       6939
Brick 01-B:/brick3/gvAA01/brick            49154     0          Y       7235
Brick 02-B:/brick3/gvAA01/brick            49154     0          Y       21858
Brick 00-A:/arbiterAA01/gvAA01/brick3      49154     0          Y       6947
Brick 01-B:/brick4/gvAA01/brick            49155     0          Y       31840
Brick 02-B:/brick4/gvAA01/brick            49155     0          Y       9933
Brick 00-A:/arbiterAA01/gvAA01/brick4      49155     0          Y       6956
Brick 01-B:/brick5/gvAA01/

Re: [Gluster-users] Extremely slow Gluster performance

2019-04-23 Thread Nithya Balachandran
Hi Patrick,

Did this start only after the upgrade?
How do you determine which brick process to kill?
Are there a lot of files to be healed on the volume?

Can you provide a tcpdump of the slow listing from a separate test client
mount?

   1. Mount the gluster volume on a different mount point than the one
      being used by your users.
   2. Start capturing packets on the system where you have mounted the
      volume in (1):
      tcpdump -i any -s 0 -w /var/tmp/dirls.pcap tcp and not port 22
   3. List the directory that is slow from the fuse client.
   4. Stop the capture (after a couple of minutes or after the listing
      returns, whichever is earlier).
   5. Send us the pcap and the listing of the same directory from one of
      the bricks in order to compare the entries.


We may need more information post looking at the tcpdump.

Regards,
Nithya

On Tue, 23 Apr 2019 at 23:39, Patrick Rennie 
wrote:

> Hello Gluster Users,
>
> I am hoping someone can help me with resolving an ongoing issue I've been
> having, I'm new to mailing lists so forgive me if I have gotten anything
> wrong. We have noticed our performance deteriorating over the last few
> weeks, easily measured by trying to do an ls on one of our top-level
> folders, and timing it, which usually would take 2-5 seconds, and now takes
> up to 20 minutes, which obviously renders our cluster basically unusable.
> This has been intermittent in the past but is now almost constant and I am
> not sure how to work out the exact cause. We have noticed some errors in
> the brick logs, and have noticed that if we kill the right brick process,
> performance instantly returns back to normal, this is not always the same
> brick, but it indicates to me something in the brick processes or
> background tasks may be causing extreme latency. Due to this ability to fix
> it by killing the right brick process off, I think it's a specific file, or
> folder, or operation which may be hanging and causing the increased
> latency, but I am not sure how to work it out. One last thing to add is
> that our bricks are getting quite full (~95% full), we are trying to
> migrate data off to new storage but that is going slowly, not helped by
> this issue. I am currently trying to run a full heal as there appear to be
> many files needing healing, and I have all brick processes running so they
> have an opportunity to heal, but this means performance is very poor. It
> currently takes over 15-20 minutes to do an ls of one of our top-level
> folders, which just contains 60-80 other folders, this should take 2-5
> seconds. This is all being checked by FUSE mount locally on the storage
> node itself, but it is the same for other clients and VMs accessing the
> cluster. Initially it seemed our NFS mounts were not affected and operated
> at normal speed, but testing over the last day has shown that our NFS
> clients are also extremely slow, so it doesn't seem specific to FUSE as I
> first thought it might be.
>
> I am not sure how to proceed from here, I am fairly new to gluster having
> inherited this setup from my predecessor and trying to keep it going. I
> have included some info below to try and help with diagnosis, please let me
> know if any further info would be helpful. I would really appreciate any
> advice on what I could try to work out the cause. Thank you in advance for
> reading this, and any suggestions you might be able to offer.
>
> - Patrick
>
> This is an example of the main error I see in our brick logs, there have
> been others, I can post them when I see them again too:
> [2019-04-20 04:54:43.055680] E [MSGID: 113001]
> [posix.c:4940:posix_getxattr] 0-gvAA01-posix: getxattr failed on
> /brick1/ library: system.posix_acl_default  [Operation not
> supported]
> [2019-04-20 05:01:29.476313] W [posix.c:4929:posix_getxattr]
> 0-gvAA01-posix: Extended attributes not supported (try remounting brick
> with 'user_xattr' flag)
>
> Our setup consists of 2 storage nodes and an arbiter node. I have noticed
> our nodes are on slightly different versions, I'm not sure if this could be
> an issue. We have 9 bricks on each node, made up of ZFS RAIDZ2 pools -
> total capacity is around 560TB.
> We have bonded 10gbps NICS on each node, and I have tested bandwidth with
> iperf and found that it's what would be expected from this config.
> Individual brick performance seems ok, I've tested several bricks using dd
> and can write a 10GB files at 1.7GB/s.
>
> # dd if=/dev/zero of=/brick1/test/test.file bs=1M count=10000
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB, 9.8 GiB) copied, 6.20303 s, 1.7 GB/s
>
> Node 1:
> # glusterfs --version
> glusterfs 3.12.15
>
> Node 2:
> # glusterfs --version
> glusterfs 3.12.14
>
> Arbiter:
> # glusterfs --version
> glusterfs 3.12.14
>
> Here is our gluster volume status:
>
> # gluster volume status
> Status of volume: gvAA01
> Gluster process                            TCP Port  RDMA Port  Online  Pid
>
> -