Re: [Gluster-users] Gluster replicate 3 arbiter 1 in split brain. gluster cli seems unaware

2017-12-22 Thread Henrik Juul Pedersen
Hi Karthik,

Thanks for the info. Maybe the documentation should be updated to
explain the different AFR versions; I know I was confused.

Also, when looking at the changelogs from my three bricks before fixing:

Brick 1:
trusted.afr.virt_images-client-1=0x0228
trusted.afr.virt_images-client-3=0x

Brick 2:
trusted.afr.virt_images-client-2=0x03ef
trusted.afr.virt_images-client-3=0x

Brick 3 (arbiter):
trusted.afr.virt_images-client-1=0x0228

I would think that the changelog for client 1 should win by majority
vote? Or how does the self-healing process work?
I assumed this as the correct version, and reset client 2 on brick 2:
# setfattr -n trusted.afr.virt_images-client-2 -v
0x fedora27.qcow2
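
If I wanted to double-check a reset like that before letting heal run, I
guess reading the xattr back and then launching an index heal would be
enough (same brick path and volume name assumed):

# getfattr -n trusted.afr.virt_images-client-2 -e hex fedora27.qcow2
# gluster volume heal virt_images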

I then did a directory listing, which might have started a heal, but
heal statistics show (I also did a full heal):
Starting time of crawl: Fri Dec 22 11:34:47 2017

Ending time of crawl: Fri Dec 22 11:34:47 2017

Type of crawl: INDEX
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 1

Starting time of crawl: Fri Dec 22 11:39:29 2017

Ending time of crawl: Fri Dec 22 11:39:29 2017

Type of crawl: FULL
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 1

I was immediately able to touch the file, so gluster was okay with
it; however, heal info still showed the file for a while:
# gluster volume heal virt_images info
Brick virt3:/data/virt_images/brick
/fedora27.qcow2
Status: Connected
Number of entries: 1

Brick virt2:/data/virt_images/brick
/fedora27.qcow2
Status: Connected
Number of entries: 1

Brick printserver:/data/virt_images/brick
/fedora27.qcow2
Status: Connected
Number of entries: 1



Now heal info shows 0 entries, and the two data bricks have the same
md5sum, so it's back in sync.



I have a few questions after all of this:

1) How can a split brain happen in a replica 3 arbiter 1 setup with
both server- and client quorum enabled?
2) Why was it not able to self heal, when two bricks seemed in sync
with their changelogs?
3) Why could I not see the file in heal info split-brain?
4) Why could I not fix this through the cli split-brain resolution tool?
5) Is it possible to force a sync in a volume? Or maybe test sync
status? It might be smart to be able to "flush" changes when taking a
brick down for maintenance.
6) How am I supposed to monitor events like this? I have a gluster
volume with ~500,000 files, and I need to be able to guarantee data
integrity and availability to the users.
7) Is glusterfs "production ready"? Because I find it hard to monitor
and thus trust in these setups. Also performance with small / many
files seems horrible at best - but that's for another discussion.

Thanks for all of your help. I'll continue to try and tweak some
performance out of this. :)

Best regards,
Henrik Juul Pedersen
LIAB ApS

On 22 December 2017 at 07:26, Karthik Subrahmanya  wrote:
> Hi Henrik,
>
> Thanks for providing the required outputs. See my replies inline.
>
> On Thu, Dec 21, 2017 at 10:42 PM, Henrik Juul Pedersen  wrote:
>>
>> Hi Karthik and Ben,
>>
>> I'll try and reply to you inline.
>>
>> On 21 December 2017 at 07:18, Karthik Subrahmanya 
>> wrote:
>> > Hey,
>> >
>> > Can you give us the volume info output for this volume?
>>
>> # gluster volume info virt_images
>>
>> Volume Name: virt_images
>> Type: Replicate
>> Volume ID: 9f3c8273-4d9d-4af2-a4e7-4cb4a51e3594
>> Status: Started
>> Snapshot Count: 2
>> Number of Bricks: 1 x (2 + 1) = 3
>> Transport-type: tcp
>> Bricks:
>> Brick1: virt3:/data/virt_images/brick
>> Brick2: virt2:/data/virt_images/brick
>> Brick3: printserver:/data/virt_images/brick (arbiter)
>> Options Reconfigured:
>> features.quota-deem-statfs: on
>> features.inode-quota: on
>> features.quota: on
>> features.barrier: disable
>> features.scrub: Active
>> features.bitrot: on
>> nfs.rpc-auth-allow: on
>> server.allow-insecure: on
>> user.cifs: off
>> features.shard: off
>> cluster.shd-wait-qlength: 1
>> cluster.locking-scheme: granular
>> cluster.data-self-heal-algorithm: full
>> cluster.server-quorum-type: server
>> cluster.quorum-type: auto
>> cluster.eager-lock: enable
>> network.remote-dio: enable
>> performance.low-prio-threads: 32
>> performance.io-cache: off
>> performance.read-ahead: off
>> performance.quick-read: off
>> nfs.disable: on
>> transport.address-family: inet
>> server.outstanding-rpc-limit: 512
>>
>> > Why are you not able to get the xattrs from arbiter brick? It is the
>> > same
>> > way as you do it on data bricks.
>>
>> Yes I must have confused myself yesterday somehow, here it is in full
>> from all three bricks:
>>
>> Brick 1 (virt2): # getfattr -d -m . -e hex fedora27.qcow2
>> # file: fedora27.qcow2
>> trusted.afr.dirty=0x
>> trusted.afr.virt_images-client-1=0x0228

Re: [Gluster-users] Gluster replicate 3 arbiter 1 in split brain. gluster cli seems unaware

2017-12-22 Thread Karthik Subrahmanya
Hey Henrik,

Good to know that the issue got resolved. I will try to answer some of the
questions you have.
- The time taken to heal the file depends on its size. That's why you were
seeing some delay in getting everything back to normal in the heal info
output.
- You did not hit the split-brain situation. In split-brain all the bricks
will be blaming the other bricks. But in your case the third brick was not
blamed by any other brick.
- It was not able to heal the file because the arbiter cannot be the source
for data heal. The other two data bricks were blaming each other, so heal
was not able to decide on a source.
  This is the "arbiter becoming source for data heal" issue. We are working
on the fix for this, and it will be shipped with the next release.
- Since it was not in split-brain, you were not able to see this in heal info
split-brain, and not able to resolve this using the cli for split-brain
resolution.
- You can use the heal command to sync data after brick
maintenance. Once the brick comes back up, the heal will be triggered
automatically anyway.
- You can use the heal info command to monitor the status of heal.
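
For example, after a brick comes back from maintenance, something along
these lines should trigger the heal and let you watch it drain (volume
name from your setup assumed):

# gluster volume heal virt_images
# gluster volume heal virt_images info
# gluster volume heal virt_images statistics heal-count

The first command launches an index heal of everything marked as pending;
the other two show what is still left to heal.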

Regards,
Karthik

Re: [Gluster-users] Gluster replicate 3 arbiter 1 in split brain. gluster cli seems unaware

2017-12-21 Thread Karthik Subrahmanya
Hi Henrik,

Thanks for providing the required outputs. See my replies inline.

On Thu, Dec 21, 2017 at 10:42 PM, Henrik Juul Pedersen  wrote:

> Hi Karthik and Ben,
>
> I'll try and reply to you inline.
>
> On 21 December 2017 at 07:18, Karthik Subrahmanya 
> wrote:
> > Hey,
> >
> > Can you give us the volume info output for this volume?
>
> # gluster volume info virt_images
>
> Volume Name: virt_images
> Type: Replicate
> Volume ID: 9f3c8273-4d9d-4af2-a4e7-4cb4a51e3594
> Status: Started
> Snapshot Count: 2
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: virt3:/data/virt_images/brick
> Brick2: virt2:/data/virt_images/brick
> Brick3: printserver:/data/virt_images/brick (arbiter)
> Options Reconfigured:
> features.quota-deem-statfs: on
> features.inode-quota: on
> features.quota: on
> features.barrier: disable
> features.scrub: Active
> features.bitrot: on
> nfs.rpc-auth-allow: on
> server.allow-insecure: on
> user.cifs: off
> features.shard: off
> cluster.shd-wait-qlength: 1
> cluster.locking-scheme: granular
> cluster.data-self-heal-algorithm: full
> cluster.server-quorum-type: server
> cluster.quorum-type: auto
> cluster.eager-lock: enable
> network.remote-dio: enable
> performance.low-prio-threads: 32
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> nfs.disable: on
> transport.address-family: inet
> server.outstanding-rpc-limit: 512
>
> > Why are you not able to get the xattrs from arbiter brick? It is the same
> > way as you do it on data bricks.
>
> Yes I must have confused myself yesterday somehow, here it is in full
> from all three bricks:
>
> Brick 1 (virt2): # getfattr -d -m . -e hex fedora27.qcow2
> # file: fedora27.qcow2
> trusted.afr.dirty=0x
> trusted.afr.virt_images-client-1=0x0228
> trusted.afr.virt_images-client-3=0x
> trusted.bit-rot.version=0x1d005a3aa0db000c6563
> trusted.gfid=0x7a36937d52fc4b55a93299e2328f02ba
> trusted.gfid2path.c076c6ac27a43012=0x30303030303030302d303030302d
> 303030302d303030302d3030303030303030303030312f6665646f726132372e71636f7732
> trusted.glusterfs.quota.----0001.contri.1=
> 0xa49eb001
> trusted.pgfid.----0001=0x0001
>
> Brick 2 (virt3): # getfattr -d -m . -e hex fedora27.qcow2
> # file: fedora27.qcow2
> trusted.afr.dirty=0x
> trusted.afr.virt_images-client-2=0x03ef
> trusted.afr.virt_images-client-3=0x
> trusted.bit-rot.version=0x19005a3a9f82000c382a
> trusted.gfid=0x7a36937d52fc4b55a93299e2328f02ba
> trusted.gfid2path.c076c6ac27a43012=0x30303030303030302d303030302d
> 303030302d303030302d3030303030303030303030312f6665646f726132372e71636f7732
> trusted.glusterfs.quota.----0001.contri.1=
> 0xa2fbe001
> trusted.pgfid.----0001=0x0001
>
> Brick 3 - arbiter (printserver): # getfattr -d -m . -e hex fedora27.qcow2
> # file: fedora27.qcow2
> trusted.afr.dirty=0x
> trusted.afr.virt_images-client-1=0x0228
> trusted.bit-rot.version=0x31005a39237200073206
> trusted.gfid=0x7a36937d52fc4b55a93299e2328f02ba
> trusted.gfid2path.c076c6ac27a43012=0x30303030303030302d303030302d
> 303030302d303030302d3030303030303030303030312f6665646f726132372e71636f7732
> trusted.glusterfs.quota.----0001.contri.1=
> 0x0001
> trusted.pgfid.----0001=0x0001
>
> I was expecting trusted.afr.virt_images-client-{1,2,3} on all bricks?
>
From AFR-V2 onwards we do not have self-blaming attrs, so you will only see
a brick blaming other bricks.
For example, brick1 can blame brick2 and brick3, but not itself.
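
For reference, each of those changelog values is made up of three 32-bit
counters: the first 8 hex digits count pending data operations, the next 8
pending metadata operations, and the last 8 pending entry operations. So a
hypothetical value such as

trusted.afr.virt_images-client-2=0x000003ef0000000000000000

would mean that brick holds 0x3ef pending data operations against the brick
behind client-2, with no metadata or entry changes pending.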

>
> > The changelog xattrs are named trusted.afr.virt_images-client-{1,2,3}
> in the
> > getxattr outputs you have provided.
> > Did you do a remove-brick and add-brick any time? Otherwise it will be
> > trusted.afr.virt_images-client-{0,1,2} usually.
>
> Yes, the bricks were moved around initially; brick 0 was re-created as
> brick 2, and the arbiter was added later on as well.
>
> >
> > To overcome this scenario you can do what Ben Turner had suggested.
> Select
> > the source copy and change the xattrs manually.
>
> I won't mind doing that, but again, the guides assume that I have
> trusted.afr.virt_images-client-{1,2,3} on all bricks, so I'm not sure
> what to change to what, where.


> > I am suspecting that it has hit the arbiter becoming source for data heal
> > bug. But to confirm that we need the xattrs on the arbiter brick also.
> >
> > Regards,
> > Karthik
> >
> >
> > On Thu, Dec 21, 2017 at 9:55 AM, Ben Turner  wrote:
> >>
> >> Here is the process for resolving split brain on replica 2:
> >>
> >>
> >> 

Re: [Gluster-users] Gluster replicate 3 arbiter 1 in split brain. gluster cli seems unaware

2017-12-21 Thread Henrik Juul Pedersen
Hi Karthik and Ben,

I'll try and reply to you inline.

On 21 December 2017 at 07:18, Karthik Subrahmanya  wrote:
> Hey,
>
> Can you give us the volume info output for this volume?

# gluster volume info virt_images

Volume Name: virt_images
Type: Replicate
Volume ID: 9f3c8273-4d9d-4af2-a4e7-4cb4a51e3594
Status: Started
Snapshot Count: 2
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: virt3:/data/virt_images/brick
Brick2: virt2:/data/virt_images/brick
Brick3: printserver:/data/virt_images/brick (arbiter)
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.barrier: disable
features.scrub: Active
features.bitrot: on
nfs.rpc-auth-allow: on
server.allow-insecure: on
user.cifs: off
features.shard: off
cluster.shd-wait-qlength: 1
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
nfs.disable: on
transport.address-family: inet
server.outstanding-rpc-limit: 512

> Why are you not able to get the xattrs from arbiter brick? It is the same
> way as you do it on data bricks.

Yes, I must have confused myself yesterday somehow; here it is in full
from all three bricks:

Brick 1 (virt2): # getfattr -d -m . -e hex fedora27.qcow2
# file: fedora27.qcow2
trusted.afr.dirty=0x
trusted.afr.virt_images-client-1=0x0228
trusted.afr.virt_images-client-3=0x
trusted.bit-rot.version=0x1d005a3aa0db000c6563
trusted.gfid=0x7a36937d52fc4b55a93299e2328f02ba
trusted.gfid2path.c076c6ac27a43012=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6665646f726132372e71636f7732
trusted.glusterfs.quota.----0001.contri.1=0xa49eb001
trusted.pgfid.----0001=0x0001

Brick 2 (virt3): # getfattr -d -m . -e hex fedora27.qcow2
# file: fedora27.qcow2
trusted.afr.dirty=0x
trusted.afr.virt_images-client-2=0x03ef
trusted.afr.virt_images-client-3=0x
trusted.bit-rot.version=0x19005a3a9f82000c382a
trusted.gfid=0x7a36937d52fc4b55a93299e2328f02ba
trusted.gfid2path.c076c6ac27a43012=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6665646f726132372e71636f7732
trusted.glusterfs.quota.----0001.contri.1=0xa2fbe001
trusted.pgfid.----0001=0x0001

Brick 3 - arbiter (printserver): # getfattr -d -m . -e hex fedora27.qcow2
# file: fedora27.qcow2
trusted.afr.dirty=0x
trusted.afr.virt_images-client-1=0x0228
trusted.bit-rot.version=0x31005a39237200073206
trusted.gfid=0x7a36937d52fc4b55a93299e2328f02ba
trusted.gfid2path.c076c6ac27a43012=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6665646f726132372e71636f7732
trusted.glusterfs.quota.----0001.contri.1=0x0001
trusted.pgfid.----0001=0x0001

I was expecting trusted.afr.virt_images-client-{1,2,3} on all bricks?

> The changelog xattrs are named trusted.afr.virt_images-client-{1,2,3} in the
> getxattr outputs you have provided.
> Did you do a remove-brick and add-brick any time? Otherwise it will be
> trusted.afr.virt_images-client-{0,1,2} usually.

Yes, the bricks were moved around initially; brick 0 was re-created as
brick 2, and the arbiter was added later on as well.

>
> To overcome this scenario you can do what Ben Turner had suggested. Select
> the source copy and change the xattrs manually.

I won't mind doing that, but again, the guides assume that I have
trusted.afr.virt_images-client-{1,2,3} on all bricks, so I'm not sure
what to change to what, where.

> I am suspecting that it has hit the arbiter becoming source for data heal
> bug. But to confirm that we need the xattrs on the arbiter brick also.
>
> Regards,
> Karthik
>
>
> On Thu, Dec 21, 2017 at 9:55 AM, Ben Turner  wrote:
>>
>> Here is the process for resolving split brain on replica 2:
>>
>>
>> https://access.redhat.com/documentation/en-US/Red_Hat_Storage/2.1/html/Administration_Guide/Recovering_from_File_Split-brain.html
>>
>> It should be pretty much the same for replica 3, you change the xattrs
>> with something like:
>>
>> # setfattr -n trusted.afr.vol-client-0 -v 0x0001
>> /gfs/brick-b/a
>>
>> When I try to decide which copy to use I normally run things like:
>>
>> # stat /path/to/file
>>
>> Check out the access and change times of the file on the back end bricks.
>> I normally pick the copy with the latest access / change times.  

Re: [Gluster-users] Gluster replicate 3 arbiter 1 in split brain. gluster cli seems unaware

2017-12-20 Thread Karthik Subrahmanya
Hey,

Can you give us the volume info output for this volume?
Why are you not able to get the xattrs from arbiter brick? It is the same
way as you do it on data bricks.
The changelog xattrs are named trusted.afr.virt_images-client-{1,2,3} in
the getxattr outputs you have provided.
Did you do a remove-brick and add-brick any time? Otherwise it will be
trusted.afr.virt_images-client-{0,1,2} usually.
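
If you are not sure which client-N index corresponds to which brick, the
client volfile that glusterd generates spells out the mapping; something
like this should show it (volfile path assumed, adjust to your install):

# grep -E 'volume .*-client-|remote-host|remote-subvolume' /var/lib/glusterd/vols/virt_images/trusted-virt_images.tcp-fuse.vol

Each <volname>-client-N stanza lists the remote-host and remote-subvolume
(brick path) it talks to.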

To overcome this scenario you can do what Ben Turner had suggested. Select
the source copy and change the xattrs manually.
I am suspecting that it has hit the arbiter becoming source for data heal
bug. But to confirm that we need the xattrs on the arbiter brick also.
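
Roughly, the shape of that manual fix (just a sketch; verify the
client-index-to-brick mapping for your volume before touching anything) is:
pick the brick whose copy you trust, then on the brick whose copy you want
to discard, zero the changelog xattr that accuses the trusted brick, and let
self-heal overwrite the bad copy:

# setfattr -n trusted.afr.virt_images-client-GOOD_INDEX -v 0x000000000000000000000000 /data/virt_images/brick/fedora27.qcow2
# gluster volume heal virt_images

Here GOOD_INDEX is a placeholder for the client index of the chosen source
brick, not a literal value.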

Regards,
Karthik


On Thu, Dec 21, 2017 at 9:55 AM, Ben Turner <btur...@redhat.com> wrote:

> Here is the process for resolving split brain on replica 2:
>
> https://access.redhat.com/documentation/en-US/Red_Hat_Storage/2.1/html/
> Administration_Guide/Recovering_from_File_Split-brain.html
>
> It should be pretty much the same for replica 3, you change the xattrs
> with something like:
>
> # setfattr -n trusted.afr.vol-client-0 -v 0x0001
> /gfs/brick-b/a
>
> When I try to decide which copy to use I normally run things like:
>
> # stat /path/to/file
>
> Check out the access and change times of the file on the back end bricks.
> I normally pick the copy with the latest access / change times.  I'll also
> check:
>
> # md5sum /path/to/file
>
> Compare the hashes of the file on both bricks to see if the data actually
> differs.  If the data is the same it makes choosing the proper replica
> easier.
>
> Any idea how you got in this situation?  Did you have a loss of NW
> connectivity?  I see you are using server side quorum, maybe check the logs
> for any loss of quorum?  I wonder if there was a loss of quorum and there
> was some sort of race condition hit:
>
> http://docs.gluster.org/en/latest/Administrator%20Guide/
> arbiter-volumes-and-quorum/#server-quorum-and-some-pitfalls
>
> "Unlike in client-quorum where the volume becomes read-only when quorum is
> lost, loss of server-quorum in a particular node makes glusterd kill the
> brick processes on that node (for the participating volumes) making even
> reads impossible."
>
> I wonder if the killing of brick processes could have led to some sort of
> race condition where writes were serviced on one brick / the arbiter and
> not the other?
>
> If you can find a reproducer for this please open a BZ with it; I have
> been seeing something similar (I think) but I haven't been able to run the
> issue down yet.
>
> -b
>
> - Original Message -
> > From: "Henrik Juul Pedersen" <h...@liab.dk>
> > To: gluster-users@gluster.org
> > Cc: "Henrik Juul Pedersen" <hen...@corepower.dk>
> > Sent: Wednesday, December 20, 2017 1:26:37 PM
> > Subject: [Gluster-users] Gluster replicate 3 arbiter 1 in split brain.
>   gluster cli seems unaware
> >
> > Hi,
> >
> > I have the following volume:
> >
> > Volume Name: virt_images
> > Type: Replicate
> > Volume ID: 9f3c8273-4d9d-4af2-a4e7-4cb4a51e3594
> > Status: Started
> > Snapshot Count: 2
> > Number of Bricks: 1 x (2 + 1) = 3
> > Transport-type: tcp
> > Bricks:
> > Brick1: virt3:/data/virt_images/brick
> > Brick2: virt2:/data/virt_images/brick
> > Brick3: printserver:/data/virt_images/brick (arbiter)
> > Options Reconfigured:
> > features.quota-deem-statfs: on
> > features.inode-quota: on
> > features.quota: on
> > features.barrier: disable
> > features.scrub: Active
> > features.bitrot: on
> > nfs.rpc-auth-allow: on
> > server.allow-insecure: on
> > user.cifs: off
> > features.shard: off
> > cluster.shd-wait-qlength: 1
> > cluster.locking-scheme: granular
> > cluster.data-self-heal-algorithm: full
> > cluster.server-quorum-type: server
> > cluster.quorum-type: auto
> > cluster.eager-lock: enable
> > network.remote-dio: enable
> > performance.low-prio-threads: 32
> > performance.io-cache: off
> > performance.read-ahead: off
> > performance.quick-read: off
> > nfs.disable: on
> > transport.address-family: inet
> > server.outstanding-rpc-limit: 512
> >
> > After a server reboot (brick 1) a single file has become unavailable:
> > # touch fedora27.qcow2
> > touch: setting times of 'fedora27.qcow2': Input/output error
> >
> > Looking at the split brain status from the client side cli:
> > # getfattr -n replica.split-brain-status fedora27.qcow2
> > # file: fedora27.qcow2
> > replica.split-brain-status="The file is not under

Re: [Gluster-users] Gluster replicate 3 arbiter 1 in split brain. gluster cli seems unaware

2017-12-20 Thread Ben Turner
Here is the process for resolving split brain on replica 2:

https://access.redhat.com/documentation/en-US/Red_Hat_Storage/2.1/html/Administration_Guide/Recovering_from_File_Split-brain.html

It should be pretty much the same for replica 3, you change the xattrs with 
something like:

# setfattr -n trusted.afr.vol-client-0 -v 0x0001 
/gfs/brick-b/a

When I try to decide which copy to use I normally run things like:

# stat /path/to/file

Check out the access and change times of the file on the back end bricks.  I 
normally pick the copy with the latest access / change times.  I'll also check:

# md5sum /path/to/file

Compare the hashes of the file on both bricks to see if the data actually 
differs.  If the data is the same it makes choosing the proper replica easier.
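
Putting that together, a minimal comparison pass across the two data bricks
could look something like this (hostnames and brick paths from this thread
assumed):

# ssh virt2 'stat /data/virt_images/brick/fedora27.qcow2; md5sum /data/virt_images/brick/fedora27.qcow2'
# ssh virt3 'stat /data/virt_images/brick/fedora27.qcow2; md5sum /data/virt_images/brick/fedora27.qcow2'

If the md5sums match, the data itself is identical and either copy is safe
to keep as the source.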

Any idea how you got in this situation?  Did you have a loss of NW 
connectivity?  I see you are using server side quorum, maybe check the logs for 
any loss of quorum?  I wonder if there was a loss of quorum and there was some 
sort of race condition hit:

http://docs.gluster.org/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/#server-quorum-and-some-pitfalls

"Unlike in client-quorum where the volume becomes read-only when quorum is 
lost, loss of server-quorum in a particular node makes glusterd kill the brick 
processes on that node (for the participating volumes) making even reads 
impossible."

I wonder if the killing of brick processes could have led to some sort of race 
condition where writes were serviced on one brick / the arbiter and not the 
other?
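
One way to check for that after the fact might be to grep the glusterd log
for quorum messages and confirm the brick processes are all up now (default
log location assumed):

# grep -i quorum /var/log/glusterfs/glusterd.log
# gluster volume status virt_images

The first shows whether glusterd logged losing / regaining server quorum
around the reboot; the second confirms which brick processes are currently
online.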

If you can find a reproducer for this please open a BZ with it; I have been 
seeing something similar (I think) but I haven't been able to run the issue down 
yet.

-b

- Original Message -
> From: "Henrik Juul Pedersen" <h...@liab.dk>
> To: gluster-users@gluster.org
> Cc: "Henrik Juul Pedersen" <hen...@corepower.dk>
> Sent: Wednesday, December 20, 2017 1:26:37 PM
> Subject: [Gluster-users] Gluster replicate 3 arbiter 1 in split brain.
> gluster cli seems unaware
> 
> Hi,
> 
> I have the following volume:
> 
> Volume Name: virt_images
> Type: Replicate
> Volume ID: 9f3c8273-4d9d-4af2-a4e7-4cb4a51e3594
> Status: Started
> Snapshot Count: 2
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: virt3:/data/virt_images/brick
> Brick2: virt2:/data/virt_images/brick
> Brick3: printserver:/data/virt_images/brick (arbiter)
> Options Reconfigured:
> features.quota-deem-statfs: on
> features.inode-quota: on
> features.quota: on
> features.barrier: disable
> features.scrub: Active
> features.bitrot: on
> nfs.rpc-auth-allow: on
> server.allow-insecure: on
> user.cifs: off
> features.shard: off
> cluster.shd-wait-qlength: 1
> cluster.locking-scheme: granular
> cluster.data-self-heal-algorithm: full
> cluster.server-quorum-type: server
> cluster.quorum-type: auto
> cluster.eager-lock: enable
> network.remote-dio: enable
> performance.low-prio-threads: 32
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> nfs.disable: on
> transport.address-family: inet
> server.outstanding-rpc-limit: 512
> 
> After a server reboot (brick 1) a single file has become unavailable:
> # touch fedora27.qcow2
> touch: setting times of 'fedora27.qcow2': Input/output error
> 
> Looking at the split brain status from the client side cli:
> # getfattr -n replica.split-brain-status fedora27.qcow2
> # file: fedora27.qcow2
> replica.split-brain-status="The file is not under data or metadata
> split-brain"
> 
> However, in the client side log, a split brain is mentioned:
> [2017-12-20 18:05:23.570762] E [MSGID: 108008]
> [afr-transaction.c:2629:afr_write_txn_refresh_done]
> 0-virt_images-replicate-0: Failing SETATTR on gfid
> 7a36937d-52fc-4b55-a932-99e2328f02ba: split-brain observed.
> [Input/output error]
> [2017-12-20 18:05:23.576046] W [MSGID: 108027]
> [afr-common.c:2733:afr_discover_done] 0-virt_images-replicate-0: no
> read subvols for /fedora27.qcow2
> [2017-12-20 18:05:23.578149] W [fuse-bridge.c:1153:fuse_setattr_cbk]
> 0-glusterfs-fuse: 182: SETATTR() /fedora27.qcow2 => -1 (Input/output
> error)
> 
> = Server side
> 
> No mention of a possible split brain:
> # gluster volume heal virt_images info split-brain
> Brick virt3:/data/virt_images/brick
> Status: Connected
> Number of entries in split-brain: 0
> 
> Brick virt2:/data/virt_images/brick
> Status: Connected
> Number of entries in split-brain: 0
> 
> Brick printserver:/data/virt_images/brick
> Status: Connected
> Number of entries in split-brain: 0
> 
> The info command shows the file:
> ]# gluster volume heal virt_images info
>

[Gluster-users] Gluster replicate 3 arbiter 1 in split brain. gluster cli seems unaware

2017-12-20 Thread Henrik Juul Pedersen
Hi,

I have the following volume:

Volume Name: virt_images
Type: Replicate
Volume ID: 9f3c8273-4d9d-4af2-a4e7-4cb4a51e3594
Status: Started
Snapshot Count: 2
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: virt3:/data/virt_images/brick
Brick2: virt2:/data/virt_images/brick
Brick3: printserver:/data/virt_images/brick (arbiter)
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.barrier: disable
features.scrub: Active
features.bitrot: on
nfs.rpc-auth-allow: on
server.allow-insecure: on
user.cifs: off
features.shard: off
cluster.shd-wait-qlength: 1
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
nfs.disable: on
transport.address-family: inet
server.outstanding-rpc-limit: 512

After a server reboot (brick 1) a single file has become unavailable:
# touch fedora27.qcow2
touch: setting times of 'fedora27.qcow2': Input/output error

Looking at the split brain status from the client side cli:
# getfattr -n replica.split-brain-status fedora27.qcow2
# file: fedora27.qcow2
replica.split-brain-status="The file is not under data or metadata split-brain"
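
As I understand it, the same family of virtual xattrs can also be used from
the mount to inspect and resolve files that really are in split-brain,
roughly like this (the client name here is only an example):

# getfattr -n replica.split-brain-status fedora27.qcow2
# setfattr -n replica.split-brain-choice -v virt_images-client-1 fedora27.qcow2
# setfattr -n replica.split-brain-heal-finalize -v virt_images-client-1 fedora27.qcow2

The choice attribute makes the file readable from the chosen brick so it can
be inspected; the finalize attribute commits that copy as the source.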

However, in the client side log, a split brain is mentioned:
[2017-12-20 18:05:23.570762] E [MSGID: 108008]
[afr-transaction.c:2629:afr_write_txn_refresh_done]
0-virt_images-replicate-0: Failing SETATTR on gfid
7a36937d-52fc-4b55-a932-99e2328f02ba: split-brain observed.
[Input/output error]
[2017-12-20 18:05:23.576046] W [MSGID: 108027]
[afr-common.c:2733:afr_discover_done] 0-virt_images-replicate-0: no
read subvols for /fedora27.qcow2
[2017-12-20 18:05:23.578149] W [fuse-bridge.c:1153:fuse_setattr_cbk]
0-glusterfs-fuse: 182: SETATTR() /fedora27.qcow2 => -1 (Input/output
error)

= Server side

No mention of a possible split brain:
# gluster volume heal virt_images info split-brain
Brick virt3:/data/virt_images/brick
Status: Connected
Number of entries in split-brain: 0

Brick virt2:/data/virt_images/brick
Status: Connected
Number of entries in split-brain: 0

Brick printserver:/data/virt_images/brick
Status: Connected
Number of entries in split-brain: 0

The info command shows the file:
# gluster volume heal virt_images info
Brick virt3:/data/virt_images/brick
/fedora27.qcow2
Status: Connected
Number of entries: 1

Brick virt2:/data/virt_images/brick
/fedora27.qcow2
Status: Connected
Number of entries: 1

Brick printserver:/data/virt_images/brick
/fedora27.qcow2
Status: Connected
Number of entries: 1


The heal and heal full commands do nothing, and I can't find
anything in the logs about them trying and failing to fix the file.

Trying to manually resolve the split brain from cli gives the following:
# gluster volume heal virt_images split-brain source-brick
virt3:/data/virt_images/brick /fedora27.qcow2
Healing /fedora27.qcow2 failed: File not in split-brain.
Volume heal failed.
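
For reference, the cli also has other split-brain policies besides
source-brick, which (depending on the gluster version) look roughly like
this, with the path given relative to the volume root:

# gluster volume heal virt_images split-brain bigger-file /fedora27.qcow2
# gluster volume heal virt_images split-brain latest-mtime /fedora27.qcow2

Presumably they would fail the same way here, since the file is not flagged
as being in split-brain.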

The attrs from virt2 and virt3 are as follows:
[root@virt2 brick]# getfattr -d -m . -e hex fedora27.qcow2
# file: fedora27.qcow2
trusted.afr.dirty=0x
trusted.afr.virt_images-client-1=0x0228
trusted.afr.virt_images-client-3=0x
trusted.bit-rot.version=0x1d005a3aa0db000c6563
trusted.gfid=0x7a36937d52fc4b55a93299e2328f02ba
trusted.gfid2path.c076c6ac27a43012=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6665646f726132372e71636f7732
trusted.glusterfs.quota.----0001.contri.1=0xa49eb001
trusted.pgfid.----0001=0x0001

[root@virt3 brick]# getfattr -d -m . -e hex fedora27.qcow2
# file: fedora27.qcow2
trusted.afr.dirty=0x
trusted.afr.virt_images-client-2=0x03ef
trusted.afr.virt_images-client-3=0x
trusted.bit-rot.version=0x19005a3a9f82000c382a
trusted.gfid=0x7a36937d52fc4b55a93299e2328f02ba
trusted.gfid2path.c076c6ac27a43012=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6665646f726132372e71636f7732
trusted.glusterfs.quota.----0001.contri.1=0xa2fbe001
trusted.pgfid.----0001=0x0001

I don't know how to find similar information from the arbiter...

Versions are the same on all three systems:
# glusterd --version
glusterfs 3.12.2

# gluster volume get all cluster.op-version
Option  Value
--  -
cluster.op-version  31202

I might try upgrading to version 3.13.0 tomorrow, but I want to hear
you out first.

How do I fix this? Do I have to manually change the file attributes?

Also, in the guides for manual resolution through setfattr, all the
bricks are listed with a