Re: [Gluster-users] gluster tiering errors

2017-10-26 Thread Milind Changire
Herb,
I'm trying to weed out issues here.

So, I can see that quota is turned *on* and would like you to check the
quota settings and test the system behavior *with quota turned off*.
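
If it helps, the quota state can be checked and toggled for the test
roughly as follows (a sketch only; the volume name below is a placeholder):

# gluster volume quota <volname> list
# gluster volume quota <volname> disable

and re-enabled once the test is done:

# gluster volume quota <volname> enable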

Although the file that failed migration was only 29K, I'm being a bit
paranoid while weeding out issues.

Are you still facing tiering errors?
I saw your response to Alex about the disk space consumption and found
it a bit ambiguous w.r.t. the current state of affairs.

--
Milind



On Tue, Oct 24, 2017 at 11:34 PM, Herb Burnswell <
herbert.burnsw...@gmail.com> wrote:

> Milind - Thank you for the response.
>
> >> What are the high and low watermarks for the tier set at ?
>
> # gluster volume get  cluster.watermark-hi
> Option                                  Value
> ------                                  -----
> cluster.watermark-hi                    90
>
>
> # gluster volume get  cluster.watermark-low
> Option                                  Value
> ------                                  -----
> cluster.watermark-low                   75
>
>
>
> >> What is the size of the file that failed to migrate as per the
> following tierd log:
>
> >> [2017-10-19 17:52:07.519614] I [MSGID: 109038]
> [tier.c:1169:tier_migrate_using_query_file] 0--tier-dht: Promotion
> failed for (gfid:edaf97e1-02e0-4838-9d26-71ea3aab22fb)
>
> The file was a word doc @ 29K in size.
>
> >>If possible, a *gluster volume info* would also help, instead of going
> to and fro with questions.
>
> # gluster vol info
>
> Volume Name: ctdb
> Type: Replicate
> Volume ID: f679c476-e0dd-4f3a-9813-1b26016b5384
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: :/mnt/ctdb_local/brick
> Brick2: :/mnt/ctdb_local/brick
> Options Reconfigured:
> nfs.disable: on
> transport.address-family: inet
>
> Volume Name: 
> Type: Tier
> Volume ID: 7710ed2f-775e-4dd9-92ad-66407c72b0ad
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 8
> Transport-type: tcp
> Hot Tier :
> Hot Tier Type : Distributed-Replicate
> Number of Bricks: 2 x 2 = 4
> Brick1: :/mnt/brick_nvme1/brick
> Brick2: :/mnt/brick_nvme2/brick
> Brick3: :/mnt/brick_nvme2/brick
> Brick4: :/mnt/brick_nvme1/brick
> Cold Tier:
> Cold Tier Type : Distributed-Replicate
> Number of Bricks: 2 x 2 = 4
> Brick5: :/mnt/brick1/brick
> Brick6: :/mnt/brick2/brick
> Brick7: :/mnt/brick2/brick
> Brick8: :/mnt/brick1/brick
> Options Reconfigured:
> cluster.lookup-optimize: on
> client.event-threads: 4
> server.event-threads: 4
> performance.write-behind-window-size: 4MB
> performance.cache-size: 16GB
> features.quota-deem-statfs: on
> features.inode-quota: on
> features.quota: on
> nfs.disable: on
> transport.address-family: inet
> features.ctr-enabled: on
> cluster.tier-mode: cache
> performance.io-cache: off
> performance.quick-read: off
> cluster.tier-max-files: 100
>
>
> HB
>
>
>
>
> On Sun, Oct 22, 2017 at 8:41 AM, Milind Changire 
> wrote:
>
>> Herb,
>> What are the high and low watermarks for the tier set at ?
>>
>> # gluster volume get  cluster.watermark-hi
>>
>> # gluster volume get  cluster.watermark-low
>>
>> What is the size of the file that failed to migrate as per the following
>> tierd log:
>>
>> [2017-10-19 17:52:07.519614] I [MSGID: 109038]
>> [tier.c:1169:tier_migrate_using_query_file] 0--tier-dht: Promotion
>> failed for (gfid:edaf97e1-02e0-4838-9d26-71ea3aab22fb)
>>
>> If possible, a *gluster volume info* would also help, instead of going
>> to and fro with questions.
>>
>> --
>> Milind
>>
>>
>>
>> On Fri, Oct 20, 2017 at 12:42 AM, Herb Burnswell <
>> herbert.burnsw...@gmail.com> wrote:
>>
>>> All,
>>>
>>> I am new to gluster and have some questions/concerns about some tiering
>>> errors that I see in the log files.
>>>
>>> OS: CentOS 7.3.1611
>>> Gluster version: 3.10.5
>>> Samba version: 4.6.2
>>>
>>> I see the following (scrubbed):
>>>
>>> Node 1 /var/log/glusterfs/tier//tierd.log:
>>>
>>> [2017-10-19 17:52:07.519614] I [MSGID: 109038]
>>> [tier.c:1169:tier_migrate_using_query_file] 0--tier-dht: Promotion
>>> failed for (gfid:edaf97e1-02e0-4838-9d26-71ea3aab22fb)
>>> [2017-10-19 17:52:07.525110] E [MSGID: 109011]
>>> [dht-common.c:7188:dht_create] 0--hot-dht: no subvolume in layout for
>>> path=/path/to/
>>> [2017-10-19 17:52:07.526088] E [MSGID: 109023]
>>> [dht-rebalance.c:757:__dht_rebalance_create_dst_file] 0--tier-dht:
>>> failed to create  on -hot-dht [Input/output error]
>>> [2017-10-19 17:52:07.526111] E [MSGID: 0] 
>>> [dht-rebalance.c:1696:dht_migrate_file]
>>> 0--tier-dht: Create dst failed on - -hot-dht for file - 
>>> [2017-10-19 17:52:07.527214] E [MSGID: 109037]
>>> [tier.c:969:tier_migrate_link] 0--tier-dht: Failed to migrate 
>>> [No space left on device]
>>> [2017-10-19 17:52:07.527244] I [MSGID: 109038]
>>> [tier.c:1169:tier_migrate_using_query_file] 0--tier-dht: Promotion
>>> failed for (gfid:fb4411c4-a387-4e5f-a2b7-897633ef4aa8)
>>> [2017-10-19 17:52:07.533510] E [MSGID: 109011]
>>> 

Re: [Gluster-users] gluster tiering errors

2017-10-26 Thread Herb Burnswell
Alex - Thank you for the response...


> >>> There are several messages "no space left on device". I would check
> first that free disk space is available for the volume.
>

 The volumes appear to be fine with available space:

/dev/mapper/vg_bricks-brick_nvme1  1.4T  782G  652G  55%  /mnt/brick_nvme1
/dev/mapper/vg_bricks-brick_nvme2  1.4T  742G  691G  52%  /mnt/brick_nvme2

As mentioned, I'm new to Gluster. Is this not the space that the "no space
left on device" errors would be referring to?

Thanks again,

HB


> Herb,
>> What are the high and low watermarks for the tier set at ?
>>
>> # gluster volume get  cluster.watermark-hi
>>
>> # gluster volume get  cluster.watermark-low
>>
>> What is the size of the file that failed to migrate as per the following
>> tierd log:
>>
>> [2017-10-19 17:52:07.519614] I [MSGID: 109038]
>> [tier.c:1169:tier_migrate_using_query_file] 0--tier-dht: Promotion
>> failed for (gfid:edaf97e1-02e0-4838-9d26-71ea3aab22fb)
>>
>> If possible, a *gluster volume info* would also help, instead of going
>> to and fro with questions.
>>
>> --
>> Milind
>>
>>
>>
>> On Fri, Oct 20, 2017 at 12:42 AM, Herb Burnswell <
>> herbert.burnsw...@gmail.com> wrote:
>>
>>> All,
>>>
>>> I am new to gluster and have some questions/concerns about some tiering
>>> errors that I see in the log files.
>>>
>>> OS: CentOS 7.3.1611
>>> Gluster version: 3.10.5
>>> Samba version: 4.6.2
>>>
>>> I see the following (scrubbed):
>>>
>>> Node 1 /var/log/glusterfs/tier//tierd.log:
>>>
>>> [2017-10-19 17:52:07.519614] I [MSGID: 109038]
>>> [tier.c:1169:tier_migrate_using_query_file] 0--tier-dht: Promotion
>>> failed for (gfid:edaf97e1-02e0-4838-9d26-71ea3aab22fb)
>>> [2017-10-19 17:52:07.525110] E [MSGID: 109011]
>>> [dht-common.c:7188:dht_create] 0--hot-dht: no subvolume in layout for
>>> path=/path/to/
>>> [2017-10-19 17:52:07.526088] E [MSGID: 109023]
>>> [dht-rebalance.c:757:__dht_rebalance_create_dst_file] 0--tier-dht:
>>> failed to create  on -hot-dht [Input/output error]
>>> [2017-10-19 17:52:07.526111] E [MSGID: 0] 
>>> [dht-rebalance.c:1696:dht_migrate_file]
>>> 0--tier-dht: Create dst failed on - -hot-dht for file - 
>>> [2017-10-19 17:52:07.527214] E [MSGID: 109037]
>>> [tier.c:969:tier_migrate_link] 0--tier-dht: Failed to migrate 
>>> [No space left on device]
>>> [2017-10-19 17:52:07.527244] I [MSGID: 109038]
>>> [tier.c:1169:tier_migrate_using_query_file] 0--tier-dht: Promotion
>>> failed for (gfid:fb4411c4-a387-4e5f-a2b7-897633ef4aa8)
>>> [2017-10-19 17:52:07.533510] E [MSGID: 109011]
>>> [dht-common.c:7188:dht_create] 0--hot-dht: no subvolume in layout for
>>> path=/path/to/
>>> [2017-10-19 17:52:07.534434] E [MSGID: 109023]
>>> [dht-rebalance.c:757:__dht_rebalance_create_dst_file] 0--tier-dht:
>>> failed to create  on -hot-dht [Input/output error]
>>> [2017-10-19 17:52:07.534453] E [MSGID: 0] 
>>> [dht-rebalance.c:1696:dht_migrate_file]
>>> 0--tier-dht: Create dst failed on - -hot-dht for file - 
>>> [2017-10-19 17:52:07.535570] E [MSGID: 109037]
>>> [tier.c:969:tier_migrate_link] 0--tier-dht: Failed to migrate 
>>> [No space left on device]
>>> [2017-10-19 17:52:07.535594] I [MSGID: 109038]
>>> [tier.c:1169:tier_migrate_using_query_file] 0--tier-dht: Promotion
>>> failed for (gfid:fba421e7-0500-47c4-bf67-10a40690e13d)
>>> [2017-10-19 17:52:07.541363] E [MSGID: 109011]
>>> [dht-common.c:7188:dht_create] 0--hot-dht: no subvolume in layout for
>>> path=/path/to/
>>> [2017-10-19 17:52:07.542296] E [MSGID: 109023]
>>> [dht-rebalance.c:757:__dht_rebalance_create_dst_file] 0--tier-dht:
>>> failed to create  on -hot-dht [Input/output error]
>>> [2017-10-19 17:52:07.542357] E [MSGID: 0] 
>>> [dht-rebalance.c:1696:dht_migrate_file]
>>> 0--tier-dht: Create dst failed on - -hot-dht for file - 
>>> [2017-10-19 17:52:07.543480] E [MSGID: 109037]
>>> [tier.c:969:tier_migrate_link] 0--tier-dht: Failed to migrate 
>>> [No space left on device]
>>> [2017-10-19 17:52:07.543521] I [MSGID: 109038]
>>> [tier.c:1169:tier_migrate_using_query_file] 0--tier-dht: Promotion
>>> failed for (gfid:fe6799e1-42e6-43e5-a7eb-ac8facfcbc9f)
>>> [2017-10-19 17:52:07.549959] E [MSGID: 109011]
>>> [dht-common.c:7188:dht_create] 0--hot-dht: no subvolume in layout for
>>> path=/path/to/
>>> [2017-10-19 17:52:07.550901] E [MSGID: 109023]
>>> [dht-rebalance.c:757:__dht_rebalance_create_dst_file] 0--tier-dht:
>>> failed to create  on -hot-dht [Input/output error]
>>> [2017-10-19 17:52:07.550922] E [MSGID: 0] 
>>> [dht-rebalance.c:1696:dht_migrate_file]
>>> 0--tier-dht: Create dst failed on - -hot-dht for file - 
>>> [2017-10-19 17:52:07.551896] E [MSGID: 109037]
>>> [tier.c:969:tier_migrate_link] 0--tier-dht: Failed to migrate 
>>> [No space left on device]
>>> [2017-10-19 17:52:07.551917] I [MSGID: 109038]
>>> [tier.c:1169:tier_migrate_using_query_file] 0--tier-dht: Promotion
>>> failed for (gfid:ffe3a3f2-b170-43f0-a9fb-97c78e3173eb)
>>> [2017-10-19 17:52:07.551945] E [MSGID: 109037] [tier.c:2565:tier_run]
>>> 0--tier-dht: 

Re: [Gluster-users] gluster tiering errors

2017-10-26 Thread Herb Burnswell
Milind - Thank you for the response.

>> What are the high and low watermarks for the tier set at ?

# gluster volume get  cluster.watermark-hi
Option                                  Value
------                                  -----
cluster.watermark-hi                    90


# gluster volume get  cluster.watermark-low
Option                                  Value
------                                  -----
cluster.watermark-low                   75
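
(For reference, I assume these would be tuned with gluster volume set,
along these lines; the volume name and values here are just placeholders:)

# gluster volume set <volname> cluster.watermark-hi 90
# gluster volume set <volname> cluster.watermark-low 75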



>> What is the size of the file that failed to migrate as per the following
tierd log:

>> [2017-10-19 17:52:07.519614] I [MSGID: 109038]
[tier.c:1169:tier_migrate_using_query_file] 0--tier-dht: Promotion
failed for (gfid:edaf97e1-02e0-4838-9d26-71ea3aab22fb)

The file was a word doc @ 29K in size.

>>If possible, a *gluster volume info* would also help, instead of going to
and fro with questions.

# gluster vol info

Volume Name: ctdb
Type: Replicate
Volume ID: f679c476-e0dd-4f3a-9813-1b26016b5384
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: :/mnt/ctdb_local/brick
Brick2: :/mnt/ctdb_local/brick
Options Reconfigured:
nfs.disable: on
transport.address-family: inet

Volume Name: 
Type: Tier
Volume ID: 7710ed2f-775e-4dd9-92ad-66407c72b0ad
Status: Started
Snapshot Count: 0
Number of Bricks: 8
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: :/mnt/brick_nvme1/brick
Brick2: :/mnt/brick_nvme2/brick
Brick3: :/mnt/brick_nvme2/brick
Brick4: :/mnt/brick_nvme1/brick
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick5: :/mnt/brick1/brick
Brick6: :/mnt/brick2/brick
Brick7: :/mnt/brick2/brick
Brick8: :/mnt/brick1/brick
Options Reconfigured:
cluster.lookup-optimize: on
client.event-threads: 4
server.event-threads: 4
performance.write-behind-window-size: 4MB
performance.cache-size: 16GB
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
nfs.disable: on
transport.address-family: inet
features.ctr-enabled: on
cluster.tier-mode: cache
performance.io-cache: off
performance.quick-read: off
cluster.tier-max-files: 100


HB




On Sun, Oct 22, 2017 at 8:41 AM, Milind Changire 
wrote:

> Herb,
> What are the high and low watermarks for the tier set at ?
>
> # gluster volume get  cluster.watermark-hi
>
> # gluster volume get  cluster.watermark-low
>
> What is the size of the file that failed to migrate as per the following
> tierd log:
>
> [2017-10-19 17:52:07.519614] I [MSGID: 109038]
> [tier.c:1169:tier_migrate_using_query_file] 0--tier-dht: Promotion
> failed for (gfid:edaf97e1-02e0-4838-9d26-71ea3aab22fb)
>
> If possible, a *gluster volume info* would also help, instead of going to
> and fro with questions.
>
> --
> Milind
>
>
>
> On Fri, Oct 20, 2017 at 12:42 AM, Herb Burnswell <
> herbert.burnsw...@gmail.com> wrote:
>
>> All,
>>
>> I am new to gluster and have some questions/concerns about some tiering
>> errors that I see in the log files.
>>
>> OS: CentOS 7.3.1611
>> Gluster version: 3.10.5
>> Samba version: 4.6.2
>>
>> I see the following (scrubbed):
>>
>> Node 1 /var/log/glusterfs/tier//tierd.log:
>>
>> [2017-10-19 17:52:07.519614] I [MSGID: 109038]
>> [tier.c:1169:tier_migrate_using_query_file] 0--tier-dht: Promotion
>> failed for (gfid:edaf97e1-02e0-4838-9d26-71ea3aab22fb)
>> [2017-10-19 17:52:07.525110] E [MSGID: 109011]
>> [dht-common.c:7188:dht_create] 0--hot-dht: no subvolume in layout for
>> path=/path/to/
>> [2017-10-19 17:52:07.526088] E [MSGID: 109023]
>> [dht-rebalance.c:757:__dht_rebalance_create_dst_file] 0--tier-dht:
>> failed to create  on -hot-dht [Input/output error]
>> [2017-10-19 17:52:07.526111] E [MSGID: 0] 
>> [dht-rebalance.c:1696:dht_migrate_file]
>> 0--tier-dht: Create dst failed on - -hot-dht for file - 
>> [2017-10-19 17:52:07.527214] E [MSGID: 109037]
>> [tier.c:969:tier_migrate_link] 0--tier-dht: Failed to migrate 
>> [No space left on device]
>> [2017-10-19 17:52:07.527244] I [MSGID: 109038]
>> [tier.c:1169:tier_migrate_using_query_file] 0--tier-dht: Promotion
>> failed for (gfid:fb4411c4-a387-4e5f-a2b7-897633ef4aa8)
>> [2017-10-19 17:52:07.533510] E [MSGID: 109011]
>> [dht-common.c:7188:dht_create] 0--hot-dht: no subvolume in layout for
>> path=/path/to/
>> [2017-10-19 17:52:07.534434] E [MSGID: 109023]
>> [dht-rebalance.c:757:__dht_rebalance_create_dst_file] 0--tier-dht:
>> failed to create  on -hot-dht [Input/output error]
>> [2017-10-19 17:52:07.534453] E [MSGID: 0] 
>> [dht-rebalance.c:1696:dht_migrate_file]
>> 0--tier-dht: Create dst failed on - -hot-dht for file - 
>> [2017-10-19 17:52:07.535570] E [MSGID: 109037]
>> [tier.c:969:tier_migrate_link] 0--tier-dht: Failed to migrate 
>> [No space left on device]
>> [2017-10-19 17:52:07.535594] I [MSGID: 109038]
>> [tier.c:1169:tier_migrate_using_query_file] 0--tier-dht: Promotion
>> failed for (gfid:fba421e7-0500-47c4-bf67-10a40690e13d)
>> [2017-10-19 17:52:07.541363] E [MSGID: 109011]
>> 

Re: [Gluster-users] not healing one file

2017-10-26 Thread Karthik Subrahmanya
Hi Richard,

Thanks for the information. As you said, there is a gfid mismatch for the
file: on brick-1 & brick-2 the gfids are the same, and on brick-3 the gfid
is different. This is not considered split-brain because we have two good
copies here. Gluster 3.10 does not have a method to resolve this situation
other than manual intervention [1]. Basically, what you need to do is
remove the file and the gfid hardlink from brick-3 (considering the
brick-3 entry as bad). Then, when you do a lookup for the file from the
mount, the entry will be recreated on that brick from the good copies.
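
Roughly, the manual steps would look like the sketch below; the brick path,
file path and gfid here are placeholders, so please double-check against
your getfattr output which brick holds the bad copy before removing
anything. On the node hosting the bad brick (brick-3):

# rm <brick-path>/<path-to-file>
# rm <brick-path>/.glusterfs/<first-two-hex-of-gfid>/<next-two-hex-of-gfid>/<full-gfid>

Then trigger a lookup from a client mount so self-heal recreates the entry:

# stat <mount-point>/<path-to-file>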

From 3.12 onwards we have methods to resolve this situation with the CLI
option [2] and with favorite-child-policy [3]. For the time being you can
use [1] to resolve this, and if you can consider upgrading to 3.12, that
would give you options to handle these scenarios.

[1]
http://docs.gluster.org/en/latest/Troubleshooting/split-brain/#fixing-directory-entry-split-brain
[2] https://review.gluster.org/#/c/17485/
[3] https://review.gluster.org/#/c/16878/
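
For reference, the 3.12-style resolution would look roughly like one of the
following (a sketch only; the volume, brick and file names are
placeholders):

# gluster volume heal <volname> split-brain latest-mtime <path-on-volume>
# gluster volume heal <volname> split-brain source-brick <hostname>:<brick-path> <path-on-volume>
# gluster volume set <volname> cluster.favorite-child-policy mtime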

HTH,
Karthik

On Thu, Oct 26, 2017 at 12:40 PM, Richard Neuboeck 
wrote:

> Hi Karthik,
>
> thanks for taking a look at this. I haven't been working with gluster long
> enough to make heads or tails of the logs. The logs are attached to
> this mail and here is the other information:
>
> # gluster volume info home
>
> Volume Name: home
> Type: Replicate
> Volume ID: fe6218ae-f46b-42b3-a467-5fc6a36ad48a
> Status: Started
> Snapshot Count: 1
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: sphere-six:/srv/gluster_home/brick
> Brick2: sphere-five:/srv/gluster_home/brick
> Brick3: sphere-four:/srv/gluster_home/brick
> Options Reconfigured:
> features.barrier: disable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> nfs.disable: on
> performance.readdir-ahead: on
> transport.address-family: inet
> features.cache-invalidation: on
> features.cache-invalidation-timeout: 600
> performance.stat-prefetch: on
> performance.cache-samba-metadata: on
> performance.cache-invalidation: on
> performance.md-cache-timeout: 600
> network.inode-lru-limit: 9
> performance.cache-size: 1GB
> performance.client-io-threads: on
> cluster.lookup-optimize: on
> cluster.readdir-optimize: on
> features.quota: on
> features.inode-quota: on
> features.quota-deem-statfs: on
> cluster.server-quorum-ratio: 51%
>
>
> [root@sphere-four ~]# getfattr -d -e hex -m .
> /srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-
> 1396429081309/sessionstore-backups/recovery.baklz4
> getfattr: Removing leading '/' from absolute path names
> # file:
> srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-
> 1396429081309/sessionstore-backups/recovery.baklz4
> security.selinux=0x73797374656d5f753a6f626a6563
> 745f723a756e6c6162656c65645f743a733000
> trusted.afr.dirty=0x
> trusted.bit-rot.version=0x020059df20a40006f989
> trusted.gfid=0xda1c94b1643544b18d5b6f4654f60bf5
> trusted.glusterfs.quota.48e9eea6-cda6-4e53-bb4a-72059debf4c2.contri.1=
> 0x9a01
> trusted.pgfid.48e9eea6-cda6-4e53-bb4a-72059debf4c2=0x0001
>
> [root@sphere-five ~]# getfattr -d -e hex -m .
> /srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-
> 1396429081309/sessionstore-backups/recovery.baklz4
> getfattr: Removing leading '/' from absolute path names
> # file:
> srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-
> 1396429081309/sessionstore-backups/recovery.baklz4
> security.selinux=0x73797374656d5f753a6f626a6563
> 745f723a756e6c6162656c65645f743a733000
> trusted.afr.dirty=0x
> trusted.afr.home-client-4=0x00010001
> trusted.bit-rot.version=0x020059df1f310006ce63
> trusted.gfid=0xea8ecfd195fd4e48b994fd0a2da226f9
> trusted.glusterfs.quota.48e9eea6-cda6-4e53-bb4a-72059debf4c2.contri.1=
> 0x9a01
> trusted.pgfid.48e9eea6-cda6-4e53-bb4a-72059debf4c2=0x0001
>
> [root@sphere-six ~]# getfattr -d -e hex -m .
> /srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-
> 1396429081309/sessionstore-backups/recovery.baklz4
> getfattr: Removing leading '/' from absolute path names
> # file:
> srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-
> 1396429081309/sessionstore-backups/recovery.baklz4
> security.selinux=0x73797374656d5f753a6f626a6563
> 745f723a756e6c6162656c65645f743a733000
> trusted.afr.dirty=0x
> trusted.afr.home-client-4=0x00010001
> trusted.bit-rot.version=0x020059df11cd000548ec
> trusted.gfid=0xea8ecfd195fd4e48b994fd0a2da226f9
> trusted.glusterfs.quota.48e9eea6-cda6-4e53-bb4a-72059debf4c2.contri.1=
> 0x9a01
> trusted.pgfid.48e9eea6-cda6-4e53-bb4a-72059debf4c2=0x0001
>
> Cheers
> Richard
>
> On 26.10.17 07:41, Karthik Subrahmanya wrote:
> > Hey Richard,
> >
> > Could you share the following information please?
> > 1. 

Re: [Gluster-users] not healing one file

2017-10-26 Thread Richard Neuboeck
Hi Amar,

thanks for the information! I tried this tool on all machines.

# gluster-health-report

Loaded reports: glusterd-op-version, georep, gfid-mismatch-dht-report,
glusterd-peer-disconnect, disk_usage, errors_in_logs, coredump,
glusterd, glusterd_volume_version_cksum_errors, kernel_issues,
errors_in_logs, ifconfig, nic-health, process_status

[ OK] Disk used percentage  path=/  percentage=4
[ OK] Disk used percentage  path=/var  percentage=4
[ OK] Disk used percentage  path=/tmp  percentage=4
[ OK] All peers are in connected state  connected_count=2
total_peer_count=2
[ OK] no gfid mismatch
[  ERROR] Report failure  report=report_check_glusterd_op_version
[ NOT OK] The maximum size of core files created is NOT set to unlimited.
[  ERROR] Report failure  report=report_check_worker_restarts
[  ERROR] Report failure  report=report_non_participating_bricks
[WARNING] Glusterd uptime is less than 24 hours  uptime_sec=72798
[WARNING] Errors in Glusterd log file  num_errors=35
[WARNING] Warnings in Glusterd log file  num_warning=37
[ NOT OK] Recieve errors in "ifconfig bond0" output
[ NOT OK] Errors seen in "cat /proc/net/dev -- bond0" output
High CPU usage by Self-heal
[WARNING] Errors in Glusterd log file num_errors=77
[WARNING] Warnings in Glusterd log file num_warnings=61

Basically it's the same message on all of them with varying error and
warning counts.
Glusterd has not been up for long since I updated and then rebooted the
machines yesterday. That's also the reason for some of the errors and
warnings, and for the network errors as well, since it always takes some
time until the bonded device (4x1Gbit, balanced alb) is fully functional.

From what I've seen in the getfattr output Karthik asked me to gather, the
GFIDs are different on the file in question, even though the report says
there is no mismatch.

So is this a split-brain situation gluster is not aware of?
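
(For reference, the split-brain query mentioned in my original mail was
along the lines of:)

# gluster volume heal home info split-brain

and it did not flag this file.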

Cheers
Richard

On 26.10.17 06:51, Amar Tumballi wrote:
> On a side note, try the recently released health report tool and see if it
> diagnoses any issues in the setup. Currently you may have to run it on
> all three machines.
> 
> 
> 
> On 26-Oct-2017 6:50 AM, "Amar Tumballi"  > wrote:
> 
> Thanks for this report. This week many of the developers are at
> Gluster Summit in Prague; we will look into this and respond next
> week. Hope that's fine.
> 
> Thanks,
> Amar
> 
> 
> On 25-Oct-2017 3:07 PM, "Richard Neuboeck"  > wrote:
> 
> Hi Gluster Gurus,
> 
> I'm using a gluster volume as home for our users. The volume is
> replica 3, running on CentOS 7, gluster version 3.10
> (3.10.6-1.el7.x86_64). Clients are running Fedora 26 and also
> gluster 3.10 (3.10.6-3.fc26.x86_64).
> 
> During the data backup I got an I/O error on one file. Manually
> checking for this file on a client confirms this:
> 
> ls -l
> 
> romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/
> ls: cannot access
> 
> 'romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4':
> Input/output error
> total 2015
> -rw---. 1 romanoch tbi 998211 Sep 15 18:44 previous.js
> -rw---. 1 romanoch tbi  65222 Oct 17 17:57 previous.jsonlz4
> -rw---. 1 romanoch tbi 149161 Oct  1 13:46 recovery.bak
> -?? ? ???? recovery.baklz4
> 
> Out of curiosity I checked all the bricks for this file. It's
> present there. Making a checksum shows that the file is different on
> one of the three replica servers.
> 
> Querying healing information shows that the file should be healed:
> # gluster volume heal home info
> Brick sphere-six:/srv/gluster_home/brick
> 
> /romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
> 
> Status: Connected
> Number of entries: 1
> 
> Brick sphere-five:/srv/gluster_home/brick
> 
> /romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
> 
> Status: Connected
> Number of entries: 1
> 
> Brick sphere-four:/srv/gluster_home/brick
> Status: Connected
> Number of entries: 0
> 
> Manually triggering heal doesn't report an error but also does not
> heal the file.
> # gluster volume heal home
> Launching heal operation to perform index self heal on volume home
> has been successful
> 
> Same with a full heal
> # gluster volume heal home full
> Launching heal operation to perform full self heal on volume home
> has been successful
> 
> According to the split brain query that's not the problem:
> # gluster volume heal 

Re: [Gluster-users] [Gluster-devel] Gluster Health Report tool

2017-10-26 Thread Marcin Dulak
On Thu, Oct 26, 2017 at 3:53 AM, Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> On Thu, Oct 26, 2017 at 2:24 AM, Marcin Dulak 
> wrote:
> > Hi,
> >
> > since people are suggesting Nagios, I can't resist suggesting exporting
> > the metrics in the Prometheus format, or at least making the project
> > into a library so https://github.com/prometheus/client_python could be
> > used to export the Prometheus metrics.
> > There has been an attempt at
> > https://github.com/ofesseler/gluster_exporter but it is not maintained
> > anymore.
> >
>
> There is an on-going effort which provides a monitoring dashboard for
> a Gluster cluster. Some detail at
>  At present the
> stack is not consuming Prometheus, however, the team is looking at
> switching over so as to make a more malleable dashboard.


That's a good idea. Prometheus is a time-series collector which provides
only very basic dashboards; the fancy, colorful dashboards (if anyone needs
them) are usually created in Grafana, using Prometheus as one of the
time-series sources. Working on a project that includes both monitoring and
dashboarding does not make sense unless the goal is to sell the project to
one of the giants dealing with large corporate environments.

Cheers

Marcin


> There is of
> course a Gitter channel at 
> Install+configure instructions for the latest release are at
>  release-v1.5.3-(install-guide)>
>
>
> --
> sankarshan mukhopadhyay
> 
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users