Dear Users,
the geo-replication is still broken, which is not really a comfortable
situation.
Has anyone had the same experience and found a possible workaround
they could share?
We are currently running Gluster v6.0.
Regards,

Felix


On 25/06/2020 10:04, Shwetha Acharya wrote:
Hi Rob and Felix,

Please share the *-changes.log files and brick logs, which will help
in analysis of the issue.
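
(If it helps, on a default install these usually sit under
/var/log/glusterfs on the master nodes; the session directory name
below is only an example taken from the paths in the logs further down
in this thread, so adjust it to your own session:)

# geo-rep changes logs for the session
ls /var/log/glusterfs/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/changes-*.log
# brick logs
ls /var/log/glusterfs/bricks/*.log
# bundle everything for sharing
tar czf georep-logs.tar.gz /var/log/glusterfs/geo-replication /var/log/glusterfs/bricks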

Regards,
Shwetha

On Thu, Jun 25, 2020 at 1:26 PM Felix Kölzow <felix.koel...@gmx.de> wrote:

    Hey Rob,


    same issue for our third volume. Have a look at the logs just from
    right now (below).

    Question: you removed the htime files and the old changelogs. Did
    you simply rm the files, or is there anything else to take care of
    before removing the changelog files and the htime file?
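
    For reference, this is roughly the sequence I would have expected;
    just a sketch, with volume/session names and the brick path as
    placeholders, so please correct me if more care is needed:

    # stop the geo-rep session first (names are placeholders)
    gluster volume geo-replication <mastervol> <slavehost>::<slavevol> stop

    # stop writing new changelogs while cleaning up
    gluster volume set <mastervol> changelog.changelog off

    # on every brick: remove the old changelogs and the htime file
    rm -rf /path/to/brick/.glusterfs/changelogs/*

    # re-enable changelog and restart the session afterwards
    gluster volume set <mastervol> changelog.changelog on
    gluster volume geo-replication <mastervol> <slavehost>::<slavevol> start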

    Regards,

    Felix

    [2020-06-25 07:51:53.795430] I [resource(worker
    /gluster/vg00/dispersed_fuse1024/brick):1435:connect_remote] SSH:
    SSH connection between master and slave established.   
    duration=1.2341
    [2020-06-25 07:51:53.795639] I [resource(worker
    /gluster/vg00/dispersed_fuse1024/brick):1105:connect] GLUSTER:
    Mounting gluster volume locally...
    [2020-06-25 07:51:54.520601] I [monitor(monitor):280:monitor]
    Monitor: worker died in startup phase
    brick=/gluster/vg01/dispersed_fuse1024/brick
    [2020-06-25 07:51:54.535809] I
    [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker
    Status Change    status=Faulty
    [2020-06-25 07:51:54.882143] I [resource(worker
    /gluster/vg00/dispersed_fuse1024/brick):1128:connect] GLUSTER:
    Mounted gluster volume    duration=1.0864
    [2020-06-25 07:51:54.882388] I [subcmds(worker
    /gluster/vg00/dispersed_fuse1024/brick):84:subcmd_worker] <top>:
    Worker spawn successful. Acknowledging back to monitor
    [2020-06-25 07:51:56.911412] E [repce(agent
    /gluster/vg00/dispersed_fuse1024/brick):121:worker] <top>: call
    failed:
    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line
    117, in worker
        res = getattr(self.obj, rmeth)(*in_data[2:])
      File
    "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line
    40, in register
        return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level,
    retries)
      File
    "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line
    46, in cl_register
        cls.raise_changelog_err()
      File
    "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line
    30, in raise_changelog_err
        raise ChangelogException(errn, os.strerror(errn))
    ChangelogException: [Errno 2] No such file or directory
    [2020-06-25 07:51:56.912056] E [repce(worker
    /gluster/vg00/dispersed_fuse1024/brick):213:__call__] RepceClient:
    call failed call=75086:140098349655872:1593071514.91
    method=register    error=ChangelogException
    [2020-06-25 07:51:56.912396] E [resource(worker
    /gluster/vg00/dispersed_fuse1024/brick):1286:service_loop]
    GLUSTER: Changelog register failed    error=[Errno 2] No such file
    or directory
    [2020-06-25 07:51:56.928031] I [repce(agent
    /gluster/vg00/dispersed_fuse1024/brick):96:service_loop]
    RepceServer: terminating on reaching EOF.
    [2020-06-25 07:51:57.886126] I [monitor(monitor):280:monitor]
    Monitor: worker died in startup phase
    brick=/gluster/vg00/dispersed_fuse1024/brick
    [2020-06-25 07:51:57.895920] I
    [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker
    Status Change    status=Faulty
    [2020-06-25 07:51:58.607405] I [gsyncdstatus(worker
    /gluster/vg00/dispersed_fuse1024/brick):287:set_passive]
    GeorepStatus: Worker Status Change    status=Passive
    [2020-06-25 07:51:58.607768] I [gsyncdstatus(worker
    /gluster/vg01/dispersed_fuse1024/brick):287:set_passive]
    GeorepStatus: Worker Status Change    status=Passive
    [2020-06-25 07:51:58.608004] I [gsyncdstatus(worker
    /gluster/vg00/dispersed_fuse1024/brick):281:set_active]
    GeorepStatus: Worker Status Change    status=Active


    On 25/06/2020 09:15, rob.quaglio...@rabobank.com wrote:

    Hi All,

    We’ve got two six-node RHEL 7.8 clusters and geo-replication
    appears to be completely broken between them. I’ve deleted the
    session, removed and recreated the pem files and the old
    changelogs/htime files (after removing the relevant options from
    the volume), and set up geo-rep again from scratch, but the new
    session comes up as Initializing, then goes Faulty and starts
    looping. The volume (on both sides) is a 4 x 2 disperse, running
    Gluster v6 (RH latest). Gsyncd reports:

    [2020-06-25 07:07:14.701423] I
    [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus:
    Worker Status Change status=Initializing...

    [2020-06-25 07:07:14.701744] I [monitor(monitor):159:monitor]
    Monitor: starting gsyncd worker   brick=/rhgs/brick20/brick
    slave_node=bxts470194.eu.rabonet.com

    [2020-06-25 07:07:14.707997] D [monitor(monitor):230:monitor]
    Monitor: Worker would mount volume privately

    [2020-06-25 07:07:14.757181] I [gsyncd(agent
    /rhgs/brick20/brick):318:main] <top>: Using session config file
    
path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf

    [2020-06-25 07:07:14.758126] D [subcmds(agent
    /rhgs/brick20/brick):107:subcmd_agent] <top>: RPC FD     
    rpc_fd='5,12,11,10'

    [2020-06-25 07:07:14.758627] I [changelogagent(agent
    /rhgs/brick20/brick):72:__init__] ChangelogAgent: Agent listining...

    [2020-06-25 07:07:14.764234] I [gsyncd(worker
    /rhgs/brick20/brick):318:main] <top>: Using session config file
    
path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf

    [2020-06-25 07:07:14.779409] I [resource(worker
    /rhgs/brick20/brick):1386:connect_remote] SSH: Initializing SSH
    connection between master and slave...

    [2020-06-25 07:07:14.841793] D [repce(worker
    /rhgs/brick20/brick):195:push] RepceClient: call
    6799:140380783982400:1593068834.84 __repce_version__() ...

    [2020-06-25 07:07:16.148725] D [repce(worker
    /rhgs/brick20/brick):215:__call__] RepceClient: call
    6799:140380783982400:1593068834.84 __repce_version__ -> 1.0

    [2020-06-25 07:07:16.148911] D [repce(worker
    /rhgs/brick20/brick):195:push] RepceClient: call
    6799:140380783982400:1593068836.15 version() ...

    [2020-06-25 07:07:16.149574] D [repce(worker
    /rhgs/brick20/brick):215:__call__] RepceClient: call
    6799:140380783982400:1593068836.15 version -> 1.0

    [2020-06-25 07:07:16.149735] D [repce(worker
    /rhgs/brick20/brick):195:push] RepceClient: call
    6799:140380783982400:1593068836.15 pid() ...

    [2020-06-25 07:07:16.150588] D [repce(worker
    /rhgs/brick20/brick):215:__call__] RepceClient: call
    6799:140380783982400:1593068836.15 pid -> 30703

    [2020-06-25 07:07:16.150747] I [resource(worker
    /rhgs/brick20/brick):1435:connect_remote] SSH: SSH connection
    between master and slave established. duration=1.3712

    [2020-06-25 07:07:16.150819] I [resource(worker
    /rhgs/brick20/brick):1105:connect] GLUSTER: Mounting gluster
    volume locally...

    [2020-06-25 07:07:16.265860] D [resource(worker
    /rhgs/brick20/brick):879:inhibit] DirectMounter: auxiliary
    glusterfs mount in place

    [2020-06-25 07:07:17.272511] D [resource(worker
    /rhgs/brick20/brick):953:inhibit] DirectMounter: auxiliary
    glusterfs mount prepared

    [2020-06-25 07:07:17.272708] I [resource(worker
    /rhgs/brick20/brick):1128:connect] GLUSTER: Mounted gluster
    volume      duration=1.1218

    [2020-06-25 07:07:17.272794] I [subcmds(worker
    /rhgs/brick20/brick):84:subcmd_worker] <top>: Worker spawn
    successful. Acknowledging back to monitor

    [2020-06-25 07:07:17.272973] D [master(worker
    /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up
    change detection mode mode=xsync

    [2020-06-25 07:07:17.273063] D [monitor(monitor):273:monitor]
    Monitor: worker(/rhgs/brick20/brick) connected

    [2020-06-25 07:07:17.273678] D [master(worker
    /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up
    change detection mode mode=changelog

    [2020-06-25 07:07:17.274224] D [master(worker
    /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up
    change detection mode mode=changeloghistory

    [2020-06-25 07:07:17.276484] D [repce(worker
    /rhgs/brick20/brick):195:push] RepceClient: call
    6799:140380783982400:1593068837.28 version() ...

    [2020-06-25 07:07:17.276916] D [repce(worker
    /rhgs/brick20/brick):215:__call__] RepceClient: call
    6799:140380783982400:1593068837.28 version -> 1.0

    [2020-06-25 07:07:17.277009] D [master(worker
    /rhgs/brick20/brick):777:setup_working_dir] _GMaster: changelog
    working dir
    
/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick

    [2020-06-25 07:07:17.277098] D [repce(worker
    /rhgs/brick20/brick):195:push] RepceClient: call
    6799:140380783982400:1593068837.28 init() ...

    [2020-06-25 07:07:17.292944] D [repce(worker
    /rhgs/brick20/brick):215:__call__] RepceClient: call
    6799:140380783982400:1593068837.28 init -> None

    [2020-06-25 07:07:17.293097] D [repce(worker
    /rhgs/brick20/brick):195:push] RepceClient: call
    6799:140380783982400:1593068837.29
    register('/rhgs/brick20/brick',
    
'/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick',
    
'/var/log/glusterfs/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/changes-rhgs-brick20-brick.log',
    8, 5) ...

    [2020-06-25 07:07:19.296294] E [repce(agent
    /rhgs/brick20/brick):121:worker] <top>: call failed:

    Traceback (most recent call last):

      File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line
    117, in worker

        res = getattr(self.obj, rmeth)(*in_data[2:])

      File
    "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py",
    line 40, in register

        return Changes.cl_register(cl_brick, cl_dir, cl_log,
    cl_level, retries)

      File
    "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py",
    line 46, in cl_register

        cls.raise_changelog_err()

      File
    "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py",
    line 30, in raise_changelog_err

        raise ChangelogException(errn, os.strerror(errn))

    ChangelogException: [Errno 2] No such file or directory

    [2020-06-25 07:07:19.297161] E [repce(worker
    /rhgs/brick20/brick):213:__call__] RepceClient: call failed
    call=6799:140380783982400:1593068837.29 method=register
    error=ChangelogException

    [2020-06-25 07:07:19.297338] E [resource(worker
    /rhgs/brick20/brick):1286:service_loop] GLUSTER: Changelog
    register failed      error=[Errno 2] No such file or directory

    [2020-06-25 07:07:19.315074] I [repce(agent
    /rhgs/brick20/brick):96:service_loop] RepceServer: terminating on
    reaching EOF.

    [2020-06-25 07:07:20.275701] I [monitor(monitor):280:monitor]
    Monitor: worker died in startup phase     brick=/rhgs/brick20/brick

    [2020-06-25 07:07:20.277383] I
    [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus:
    Worker Status Change status=Faulty

    We’ve done everything we can think of, including an “strace -f”
    on the pid, and we can’t really find anything. I’m about to lose
    the last of my hair over this, so does anyone have any ideas at
    all? We’ve even removed the entire slave vol and rebuilt it.
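
    For completeness, the rebuild followed what I believe is the
    standard sequence, roughly the following (the hostname is the short
    form used when the session was created; treat the exact names as
    placeholders for our setup):

    # regenerate and distribute the pem keys (run on a master node)
    gluster system:: execute gsec_create

    # tear down and recreate the session, then start it again
    gluster volume geo-replication prd_mx_intvol bxts470190::prd_mx_intvol stop
    gluster volume geo-replication prd_mx_intvol bxts470190::prd_mx_intvol delete
    gluster volume geo-replication prd_mx_intvol bxts470190::prd_mx_intvol create push-pem force
    gluster volume geo-replication prd_mx_intvol bxts470190::prd_mx_intvol start
    gluster volume geo-replication prd_mx_intvol bxts470190::prd_mx_intvol status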

    Thanks

    Rob

    *Rob Quagliozzi*

    *Specialised Application Support*








Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
