If more time is needed to analyze this, is rolling back an option? That is,
shut down 7.5, downgrade it back to 5.13, and restart, or would that screw
something up badly? I haven't bumped the op-version yet.

Thanks.

Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror
<http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | @ArtemR <http://twitter.com/ArtemR>
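
One way to double-check that the op-version is still at its pre-upgrade
value is to read it back with `gluster volume get all cluster.op-version` and
parse the result. The snippet below is only a sketch: SAMPLE_OUTPUT is
illustrative text written for this example (not captured from this cluster),
and the helper names parse_option and op_version_bumped are hypothetical. It
says nothing about whether a downgrade is otherwise safe.

```python
from typing import Optional

# Illustrative output only; the value shown is an assumption, not taken
# from the cluster discussed in this thread.
SAMPLE_OUTPUT = """\
Option                                  Value
------                                  -----
cluster.op-version                      50400
"""


def parse_option(text: str, option: str) -> Optional[int]:
    """Return the integer value printed for `option`, or None if absent."""
    for line in text.splitlines():
        parts = line.split()
        # A data row has the option name first and its value last.
        if len(parts) >= 2 and parts[0] == option and parts[-1].isdigit():
            return int(parts[-1])
    return None


def op_version_bumped(current: int, pre_upgrade: int) -> bool:
    """True if the cluster op-version is higher than it was before the upgrade."""
    return current > pre_upgrade


if __name__ == "__main__":
    current = parse_option(SAMPLE_OUTPUT, "cluster.op-version")
    print("cluster.op-version:", current)
```

In practice the text would come from running the command on one of the
servers and comparing the parsed number with the value recorded before the
upgrade.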


On Thu, Apr 30, 2020 at 3:13 PM Artem Russakovskii wrote:

> The number of heal-pending entries on citadel, the server that was upgraded
> to 7.5, has now grown to tens of thousands and continues to go up.
>
> Sincerely,
> Artem
>
>
>
> On Thu, Apr 30, 2020 at 2:57 PM Artem Russakovskii wrote:
>
>> Hi all,
>>
>> Today, I decided to upgrade one of our four servers (citadel) from 5.13 to
>> 7.5. There are 2 volumes, both 1x4 replicate, accessed via fuse mounts (I
>> sent the full details earlier in another message). If everything had looked
>> OK, I would have proceeded with the rolling upgrade of the remaining
>> servers after the full heal.
>>
>> However, as soon as I upgraded and restarted, the logs filled with
>> messages like these:
>>
>> [2020-04-30 21:39:21.316149] E
>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor
>> (1298437:400:17) failed to complete successfully
>> [2020-04-30 21:39:21.382891] E
>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor
>> (1298437:400:17) failed to complete successfully
>> [2020-04-30 21:39:21.442440] E
>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor
>> (1298437:400:17) failed to complete successfully
>> [2020-04-30 21:39:21.445587] E
>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor
>> (1298437:400:17) failed to complete successfully
>> [2020-04-30 21:39:21.571398] E
>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor
>> (1298437:400:17) failed to complete successfully
>> [2020-04-30 21:39:21.668192] E
>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor
>> (1298437:400:17) failed to complete successfully
>>
>>
>> The message "I [MSGID: 108031]
>> [afr-common.c:2581:afr_local_discovery_cbk]
>> 0-androidpolice_data3-replicate-0: selecting local read_child
>> androidpolice_data3-client-3" repeated 10 times between [2020-04-30
>> 21:46:41.854675] and [2020-04-30 21:48:20.206323]
>> The message "W [MSGID: 114031]
>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk]
>> 0-androidpolice_data3-client-1: remote operation failed [Transport endpoint
>> is not connected]" repeated 264 times between [2020-04-30 21:46:32.129567]
>> and [2020-04-30 21:48:29.905008]
>> The message "W [MSGID: 114031]
>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk]
>> 0-androidpolice_data3-client-0: remote operation failed [Transport endpoint
>> is not connected]" repeated 264 times between [2020-04-30 21:46:32.129602]
>> and [2020-04-30 21:48:29.905040]
>> The message "W [MSGID: 114031]
>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk]
>> 0-androidpolice_data3-client-2: remote operation failed [Transport endpoint
>> is not connected]" repeated 264 times between [2020-04-30 21:46:32.129512]
>> and [2020-04-30 21:48:29.905047]
>>
>>
>>
>> Once in a while, I'm seeing this:
>> ==> bricks/mnt-hive_block4-androidpolice_data3.log <==
>> [2020-04-30 21:45:54.251637] I [MSGID: 115072]
>> [server-rpc-fops_v2.c:1681:server4_setattr_cbk]
>> 0-androidpolice_data3-server: 5725811: SETATTR /
>> androidpolice.com/public/wp-content/uploads/2019/03/cielo-breez-plus-hero.png
>> (d4556eb4-f15b-412c-a42a-32b4438af557), client:
>> CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-2-RECON_NO:-1,
>> error-xlator: androidpolice_data3-access-control [Operation not permitted]
>> [2020-04-30 21:49:10.439701] I [MSGID: 115072]
>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk]
>> 0-androidpolice_data3-server: 201833: SETATTR /
>> androidpolice.com/public/wp-content/uploads
>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client:
>> CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2,
>> error-xlator: androidpolice_data3-access-control [Operation not permitted]
>> [2020-04-30 21:49:10.453724] I [MSGID: 115072]
>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk]
>> 0-androidpolice_data3-server: 201842: SETATTR /
>> androidpolice.com/public/wp-content/uploads
>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client:
>> CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2,
>> error-xlator: androidpolice_data3-access-control [Operation not permitted]
>> [2020-04-30 21:49:16.224662] I [MSGID: 115072]
>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk]
>> 0-androidpolice_data3-server: 202865: SETATTR /
>> androidpolice.com/public/wp-content/uploads
>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client:
>> CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2,
>> error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>
>> There's also a lot of self-healing happening that I didn't expect at all,
>> since the upgrade only took ~10-15s.
>> [2020-04-30 21:47:38.714448] I [MSGID: 108026]
>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
>> 0-apkmirror_data1-replicate-0: performing metadata selfheal on
>> 4a6ba2d7-7ad8-4113-862b-02e4934a3461
>> [2020-04-30 21:47:38.765033] I [MSGID: 108026]
>> [afr-self-heal-common.c:1723:afr_log_selfheal]
>> 0-apkmirror_data1-replicate-0: Completed metadata selfheal on
>> 4a6ba2d7-7ad8-4113-862b-02e4934a3461. sources=[3]  sinks=0 1 2
>> [2020-04-30 21:47:38.765289] I [MSGID: 108026]
>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
>> 0-apkmirror_data1-replicate-0: performing metadata selfheal on
>> f3c62a41-1864-4e75-9883-4357a7091296
>> [2020-04-30 21:47:38.800987] I [MSGID: 108026]
>> [afr-self-heal-common.c:1723:afr_log_selfheal]
>> 0-apkmirror_data1-replicate-0: Completed metadata selfheal on
>> f3c62a41-1864-4e75-9883-4357a7091296. sources=[3]  sinks=0 1 2
>>
>>
>> I'm also seeing "remote operation failed" and "writing to fuse device
>> failed: No such file or directory" messages
>> [2020-04-30 21:46:34.891957] I [MSGID: 108026]
>> [afr-self-heal-common.c:1723:afr_log_selfheal]
>> 0-androidpolice_data3-replicate-0: Completed metadata selfheal on
>> 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2]  sinks=3
>> [2020-04-30 21:45:36.127412] W [MSGID: 114031]
>> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk]
>> 0-androidpolice_data3-client-0: remote operation failed [Operation not
>> permitted]
>> [2020-04-30 21:45:36.345924] W [MSGID: 114031]
>> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk]
>> 0-androidpolice_data3-client-1: remote operation failed [Operation not
>> permitted]
>> [2020-04-30 21:46:35.291853] I [MSGID: 108031]
>> [afr-common.c:2543:afr_local_discovery_cbk]
>> 0-androidpolice_data3-replicate-0: selecting local read_child
>> androidpolice_data3-client-2
>> [2020-04-30 21:46:35.977342] I [MSGID: 108026]
>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
>> 0-androidpolice_data3-replicate-0: performing metadata selfheal on
>> 2692eeba-1ebe-49b6-927f-1dfbcd227591
>> [2020-04-30 21:46:36.006607] I [MSGID: 108026]
>> [afr-self-heal-common.c:1723:afr_log_selfheal]
>> 0-androidpolice_data3-replicate-0: Completed metadata selfheal on
>> 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2]  sinks=3
>> [2020-04-30 21:46:37.245599] E [fuse-bridge.c:219:check_and_dump_fuse_W]
>> (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d]
>> (-->
>> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a]
>> (-->
>> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb]
>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (-->
>> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse:
>> writing to fuse device failed: No such file or directory
>> [2020-04-30 21:46:50.864797] E [fuse-bridge.c:219:check_and_dump_fuse_W]
>> (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d]
>> (-->
>> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a]
>> (-->
>> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb]
>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (-->
>> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse:
>> writing to fuse device failed: No such file or directory
>>
>> The number of entries reported for healing is going up and down wildly,
>> from 0 to 8000+, and the heal info command sometimes takes a really long
>> time to return a value. I'm really worried as this is a production system,
>> and I didn't observe this in our test system.
>>
>>
>>
>> gluster v heal apkmirror_data1 info summary
>> Brick nexus2:/mnt/nexus2_block1/apkmirror_data1
>> Status: Connected
>> Total Number of entries: 27
>> Number of entries in heal pending: 27
>> Number of entries in split-brain: 0
>> Number of entries possibly healing: 0
>>
>> Brick forge:/mnt/forge_block1/apkmirror_data1
>> Status: Connected
>> Total Number of entries: 27
>> Number of entries in heal pending: 27
>> Number of entries in split-brain: 0
>> Number of entries possibly healing: 0
>>
>> Brick hive:/mnt/hive_block1/apkmirror_data1
>> Status: Connected
>> Total Number of entries: 27
>> Number of entries in heal pending: 27
>> Number of entries in split-brain: 0
>> Number of entries possibly healing: 0
>>
>> Brick citadel:/mnt/citadel_block1/apkmirror_data1
>> Status: Connected
>> Total Number of entries: 8540
>> Number of entries in heal pending: 8540
>> Number of entries in split-brain: 0
>> Number of entries possibly healing: 0
>>
>>
>>
>> gluster v heal androidpolice_data3 info summary
>> Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
>> Status: Connected
>> Total Number of entries: 1
>> Number of entries in heal pending: 1
>> Number of entries in split-brain: 0
>> Number of entries possibly healing: 0
>>
>> Brick forge:/mnt/forge_block4/androidpolice_data3
>> Status: Connected
>> Total Number of entries: 1
>> Number of entries in heal pending: 1
>> Number of entries in split-brain: 0
>> Number of entries possibly healing: 0
>>
>> Brick hive:/mnt/hive_block4/androidpolice_data3
>> Status: Connected
>> Total Number of entries: 1
>> Number of entries in heal pending: 1
>> Number of entries in split-brain: 0
>> Number of entries possibly healing: 0
>>
>> Brick citadel:/mnt/citadel_block4/androidpolice_data3
>> Status: Connected
>> Total Number of entries: 1149
>> Number of entries in heal pending: 1149
>> Number of entries in split-brain: 0
>> Number of entries possibly healing: 0
>>
>>
>> What should I do at this point? The files I tested seem to be replicating
>> correctly, but I don't know if that's the case for all of them, and the
>> heal counts going up and down, together with all these log messages, are
>> making me very nervous.
>>
>> Thank you.
>>
>> Sincerely,
>> Artem
>>
>>
>
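
The `heal info summary` output quoted above can be tallied per brick with a
short script, which makes it easier to watch whether the pending count on
citadel is trending up or down between runs. This is a minimal sketch, not an
official tool: the function names are made up for this example, and SAMPLE is
an excerpt of two bricks copied from the output in this thread. It assumes
the plain-text layout shown there and treats any non-numeric count as
unknown.

```python
from typing import Dict, Optional

# Two brick sections copied from the apkmirror_data1 summary in this thread.
SAMPLE = """\
Brick nexus2:/mnt/nexus2_block1/apkmirror_data1
Status: Connected
Total Number of entries: 27
Number of entries in heal pending: 27
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick citadel:/mnt/citadel_block1/apkmirror_data1
Status: Connected
Total Number of entries: 8540
Number of entries in heal pending: 8540
Number of entries in split-brain: 0
Number of entries possibly healing: 0
"""

PENDING_KEY = "Number of entries in heal pending"


def parse_heal_summary(text: str) -> Dict[str, Dict[str, str]]:
    """Map each brick name to its 'key: value' fields from the summary text."""
    bricks: Dict[str, Dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            # Brick names contain ':' themselves, so handle them first.
            current = line[len("Brick "):]
            bricks[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            bricks[current][key.strip()] = value.strip()
    return bricks


def pending_by_brick(text: str) -> Dict[str, Optional[int]]:
    """Return the heal-pending count per brick (None if it is not a number)."""
    result: Dict[str, Optional[int]] = {}
    for brick, fields in parse_heal_summary(text).items():
        value = fields.get(PENDING_KEY, "")
        result[brick] = int(value) if value.isdigit() else None
    return result


if __name__ == "__main__":
    for brick, pending in pending_by_brick(SAMPLE).items():
        print(f"{brick}: {pending}")
```

Feeding it the saved output of `gluster v heal apkmirror_data1 info summary`
every few minutes and logging the numbers with a timestamp would show the
trend without eyeballing each run.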