On May 7, 2014, at 11:07 PM, CJ Beck <chris.b...@workday.com> wrote:
> Ok, I have duplicated the issue I was seeing, and it looks like it's
> happening because of the master.py process crashing. After that happens,
> it goes into a "Hybrid Crawl" mode for a while, then the files that are
> synced (or, more importantly, deleted) during that time are not synced to
> the slave cluster.

The worker process (master.py) crashed due to an I/O error while creating
entries on the slave. Could you also share the Geo-replication client logs
from the slave, and the brick logs too?

After a crash, the monitor process (monitor.py) restarts Geo-replication.
Because of the restart, a one-shot FS crawl (hybrid crawl) is done before
switching to using Changelogs (currently, I/O operations that happened while
geo-rep was not running cannot be consumed via Changelogs, hence the need for
a file system crawl).

> Once it completes the Hybrid Crawl, it goes back to ChangeLog and the
> deletes work as expected.

Currently, hybrid crawl has a limitation: it does not replicate
deletes/renames to the slave. Hence, any deletes or renames that took place
in the window around a restart (stop/start or after a crash) would not be
replicated. Changelog mode can handle that efficiently.

> I've attached the reason for the crash at the bottom of this email.

I'll check if there's already a BZ for this, else I'll raise one.

> I think the "feature request" that's required here is some kind of "resync"
> of the entire volume, which will include deleting files, if needed. Is
> there another way around this issue?

There are patches under review upstream that introduce historical consumption
of changelogs. This would enable geo-replication to use Changelogs directly
(instead of a file system crawl) after a restart, and replicate
deletes/renames too.

Patch: http://review.gluster.org/#/c/6930/
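To make the hybrid-crawl limitation concrete, here is a minimal illustrative
sketch (not the actual gsyncd code; the function names and entry tuples are
invented for the example) of why a one-shot FS crawl is blind to deletes
while a changelog replay is not:

    import os

    def xsync_crawl(brick_root):
        # A one-shot crawl can only enumerate what still exists on the
        # master; a file deleted while geo-rep was down leaves no trace,
        # so no delete can ever be emitted for the slave.
        for dirpath, _dirs, files in os.walk(brick_root):
            for name in files:
                yield ("CREATE", os.path.join(dirpath, name))

    def changelog_replay(entries):
        # Changelog mode replays a journal of operations, so UNLINKs and
        # RENAMEs recorded while geo-rep was running can be applied on
        # the slave as real deletes/renames.
        for op, path in entries:
            if op in ("UNLINK", "RENAME"):
                yield (op, path)
            else:
                yield ("CREATE", path)

With the historical-changelog patch above, the second path could also cover
the restart window, removing the need for the delete-blind crawl.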
> [2014-05-05 19:11:31.969067] I [master(/data/gluster-poc):445:crawlwrap] _GMaster: 1 crawls, 3 turns
> [2014-05-05 19:11:31.980235] I [master(/data/gluster-poc):1059:crawl] _GMaster: slave's time: (1399310325, 0)
> [2014-05-05 19:11:33.595560] E [repce(/data/gluster-poc):188:__call__] RepceClient: call 7338:140349237835520:1399317093.54 (entry_ops) failed on peer with OSError
> [2014-05-05 19:11:33.596086] E [syncdutils(/data/gluster-poc):240:log_raise_exception] <top>: FAIL:
> Traceback (most recent call last):
>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
>     main_i()
>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 542, in main_i
>     local.service_loop(*[r for r in [remote] if r])
>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1177, in service_loop
>     g2.crawlwrap()
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 467, in crawlwrap
>     self.crawl()
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1067, in crawl
>     self.process(changes)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 825, in process
>     self.process_change(change, done, retry)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 793, in process_change
>     self.slave.server.entry_ops(entries)
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 204, in __call__
>     return self.ins(self.meth, *a)
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 189, in __call__
>     raise res
> OSError: [Errno 5] Input/output error: '.gfid/d9492ddc-2e3e-4a9a-bc3c-1d70301fd5c9/.package-tools-22.0.05.008-1.noarch.rpm.cY9HYk'
> [2014-05-05 19:11:33.604927] I [syncdutils(/data/gluster-poc):192:finalize] <top>: exiting.
> [2014-05-05 19:11:33.608426] I [monitor(monitor):81:set_state] Monitor: new state: faulty
> [2014-05-05 19:11:43.620071] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
> [2014-05-05 19:11:43.620531] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
> [2014-05-05 19:11:43.879199] I [gsyncd(/data/gluster-poc):532:main_i] <top>: syncing: gluster://localhost:gluster-poc -> ssh://root@10.10.10.120:gluster://localhost:gluster-poc
> [2014-05-05 19:11:55.286906] I [master(/data/gluster-poc):58:gmaster_builder] <top>: setting up xsync change detection mode
> [2014-05-05 19:11:55.287549] I [master(/data/gluster-poc):357:__init__] _GMaster: using 'rsync' as the sync engine
> [2014-05-05 19:11:55.288985] I [master(/data/gluster-poc):58:gmaster_builder] <top>: setting up changelog change detection mode
> [2014-05-05 19:11:55.289535] I [master(/data/gluster-poc):357:__init__] _GMaster: using 'rsync' as the sync engine
> [2014-05-05 19:11:55.291219] I [master(/data/gluster-poc):1103:register] _GMaster: xsync temp directory: /var/run/gluster/gluster-poc/ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc/5afdd54e66545ea49962eb4d8e257a59/xsync
> [2014-05-05 19:11:55.339888] I [master(/data/gluster-poc):421:crawlwrap] _GMaster: primary master with volume id 1b5b9836-659e-4203-9165-4afc68de83c5 ...
> [2014-05-05 19:11:55.347475] I [master(/data/gluster-poc):432:crawlwrap] _GMaster: crawl interval: 60 seconds
> [2014-05-05 19:11:55.353055] I [master(/data/gluster-poc):1124:crawl] _GMaster: starting hybrid crawl...
> [2014-05-05 19:11:56.357074] I [master(/data/gluster-poc):1133:crawl] _GMaster: processing xsync changelog /var/run/gluster/gluster-poc/ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc/5afdd54e66545ea49962eb4d8e257a59/xsync/XSYNC-CHANGELOG.1399317115
>
> -CJ
>
> From: Venky Shankar <vshan...@redhat.com>
> Date: Friday, May 2, 2014 at 2:41 AM
> To: "gluster-users@gluster.org" <gluster-users@gluster.org>
> Subject: Re: [Gluster-users] Question about geo-replication and deletes in 3.5 beta train
>
> On 05/01/2014 11:22 PM, CJ Beck wrote:
>> Ok, I have found a way to get back to "ChangeLog"... This might be related
>> to the similar thread that we have going regarding the method for setting
>> up the initial geo-replication session. It seems as though when
>> geo-replication was set up on my cluster, it tried to open the changelog
>> fifo, but it wasn't there.
>>
>> In order to fix this, I had to do the following (see the command sketch
>> just below):
>>
>> * Stop geo-replication
>> * Stop the volume
>> * Start the volume
>> * Change the geo-replication "change_detector" to changelog
>> * Start geo-replication
>>
>> Once I did that, it went to Hybrid mode first, then changed to ChangeLog
>> mode.
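For reference, that recovery sequence corresponds to roughly the following
commands (volume and slave names taken from this thread; the change_detector
config invocation is my assumption, mirroring the config commands quoted
further down):

    [root@dev604 ~]# gluster volume geo-replication gluster-poc 10.10.10.120::gluster-poc stop
    [root@dev604 ~]# gluster volume stop gluster-poc
    [root@dev604 ~]# gluster volume start gluster-poc
    [root@dev604 ~]# gluster volume geo-replication gluster-poc 10.10.10.120::gluster-poc config change_detector changelog
    [root@dev604 ~]# gluster volume geo-replication gluster-poc 10.10.10.120::gluster-poc start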
> That's correct. But the question is why change-logging was unavailable. The
> socket is created when changelog is turned on (done by geo-replication on
> start).
>
> There would be an initial one-shot hybrid crawl (not a full FS crawl) and
> then a switch-over to using change-logging. This happens on geo-rep restart.
>
>> -CJ
>>
>> From: CJ Beck <chris.b...@workday.com>
>> Date: Thursday, May 1, 2014 at 10:28 AM
>> To: Venky Shankar <yknev.shan...@gmail.com>
>> Cc: "gluster-users@gluster.org" <gluster-users@gluster.org>
>> Subject: Re: [Gluster-users] Question about geo-replication and deletes in 3.5 beta train
>>
>> I just noticed this, which might be related to the change to xsync?
>>
>> [root@dev604 eafea2c974a3c29ecfbf48cea274dc23]# more changes.log
>> [2014-04-30 15:45:27.807181] I [gf-changelog.c:179:gf_changelog_notification_init] 0-glusterfs: connecting to changelog socket: /var/run/gluster/changelog-eafea2c974a3c29ecfbf48cea274dc23.sock (brick: /data/sac-poc)
>> [2014-04-30 15:45:27.807257] W [gf-changelog.c:189:gf_changelog_notification_init] 0-glusterfs: connection attempt 1/5...
>> [2014-04-30 15:45:29.807404] W [gf-changelog.c:189:gf_changelog_notification_init] 0-glusterfs: connection attempt 2/5...
>> [2014-04-30 15:45:31.807607] W [gf-changelog.c:189:gf_changelog_notification_init] 0-glusterfs: connection attempt 3/5...
>> [2014-04-30 15:45:33.807818] W [gf-changelog.c:189:gf_changelog_notification_init] 0-glusterfs: connection attempt 4/5...
>> [2014-04-30 15:45:35.808038] W [gf-changelog.c:189:gf_changelog_notification_init] 0-glusterfs: connection attempt 5/5...
>> [2014-04-30 15:45:37.808239] E [gf-changelog.c:204:gf_changelog_notification_init] 0-glusterfs: could not connect to changelog socket! bailing out...
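That changes.log excerpt shows the changelog consumer giving up after five
connection attempts, which is exactly what pushes gsyncd into the xsync
fallback seen in the next log. A minimal Python sketch of that
connect-with-retry pattern (names and the socket path are invented; the
5-attempt/2-second values are taken from the log above):

    import socket
    import time

    def connect_changelog_socket(path, attempts=5, delay=2.0):
        # Try to reach the changelog notification socket, retrying a
        # fixed number of times before giving up.
        for attempt in range(1, attempts + 1):
            sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            try:
                sock.connect(path)
                return sock
            except socket.error:
                sock.close()
                print("connection attempt %d/%d..." % (attempt, attempts))
                time.sleep(delay)
        return None

    # If the socket never comes up, the worker falls back to the
    # delete-blind xsync crawl instead of changelog mode.
    sock = connect_changelog_socket("/var/run/gluster/changelog-example.sock")
    change_detector = "changelog" if sock else "xsync"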
>> From: CJ Beck <chris.b...@workday.com>
>> Date: Wednesday, April 30, 2014 at 2:50 PM
>> To: Venky Shankar <yknev.shan...@gmail.com>
>> Cc: "gluster-users@gluster.org" <gluster-users@gluster.org>
>> Subject: Re: [Gluster-users] Question about geo-replication and deletes in 3.5 beta train
>>
>> I just got back to testing this, and for some reason on my "freshly"
>> created cluster and geo-replication session, it's defaulting to "Hybrid
>> Mode". It also keeps bouncing back to xsync as the change method (it
>> seems).
>>
>> Geo-replication log:
>>
>> [root@dev604 gluster-poc]# egrep -i 'changelog|xsync' *
>> ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc.log:[2014-04-30 15:45:27.763072] I [master(/data/gluster-poc):58:gmaster_builder] <top>: setting up xsync change detection mode
>> ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc.log:[2014-04-30 15:45:27.765294] I [master(/data/gluster-poc):58:gmaster_builder] <top>: setting up changelog change detection mode
>> ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc.log:[2014-04-30 15:45:27.768302] I [master(/data/gluster-poc):1103:register] _GMaster: xsync temp directory: /var/run/gluster/gluster-poc/ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc/eafea2c974a3c29ecfbf48cea274dc23/xsync
>> ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc.log:[2014-04-30 15:45:37.808617] I [master(/data/gluster-poc):682:fallback_xsync] _GMaster: falling back to xsync mode
>> ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc.log:[2014-04-30 15:45:52.113879] I [master(/data/gluster-poc):58:gmaster_builder] <top>: setting up xsync change detection mode
>> ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc.log:[2014-04-30 15:45:52.116525] I [master(/data/gluster-poc):58:gmaster_builder] <top>: setting up xsync change detection mode
>> ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc.log:[2014-04-30 15:45:52.120129] I [master(/data/gluster-poc):1103:register] _GMaster: xsync temp directory: /var/run/gluster/gluster-poc/ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc/eafea2c974a3c29ecfbf48cea274dc23/xsync
>> ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc.log:[2014-04-30 15:45:52.120604] I [master(/data/gluster-poc):1103:register] _GMaster: xsync temp directory: /var/run/gluster/gluster-poc/ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc/eafea2c974a3c29ecfbf48cea274dc23/xsync
>> ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc.log:[2014-04-30 15:45:54.146847] I [master(/data/gluster-poc):1133:crawl] _GMaster: processing xsync changelog /var/run/gluster/gluster-poc/ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc/eafea2c974a3c29ecfbf48cea274dc23/xsync/XSYNC-CHANGELOG.1398872752
>> ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc.log:[2014-04-30 15:47:08.204514] I [master(/data/gluster-poc):58:gmaster_builder] <top>: setting up xsync change detection mode
>> ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc.log:[2014-04-30 15:47:08.206767] I [master(/data/gluster-poc):58:gmaster_builder] <top>: setting up xsync change detection mode
>> ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc.log:[2014-04-30 15:47:08.210570] I [master(/data/gluster-poc):1103:register] _GMaster: xsync temp directory: /var/run/gluster/gluster-poc/ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc/eafea2c974a3c29ecfbf48cea274dc23/xsync
>> ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc.log:[2014-04-30 15:47:08.211069] I [master(/data/gluster-poc):1103:register] _GMaster: xsync temp directory: /var/run/gluster/gluster-poc/ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc/eafea2c974a3c29ecfbf48cea274dc23/xsync
>> ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc.log:[2014-04-30 15:47:09.247109] I [master(/data/gluster-poc):1133:crawl] _GMaster: processing xsync changelog /var/run/gluster/gluster-poc/ssh%3A%2F%2Froot%4010.10.10.120%3Agluster%3A%2F%2F127.0.0.1%3Agluster-poc/eafea2c974a3c29ecfbf48cea274dc23/xsync/XSYNC-CHANGELOG.1398872828
>>
>> [root@dev604 gluster-poc]# gluster volume geo-replication gluster-poc 10.10.10.120::gluster-poc status detail
>>
>> MASTER NODE          MASTER VOL     MASTER BRICK         SLAVE                        STATUS     CHECKPOINT STATUS    CRAWL STATUS    FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING    FILES SKIPPED
>> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> dev604.domain.com    gluster-poc    /data/gluster-poc    10.10.10.120::gluster-poc    Active     N/A                  Hybrid Crawl    0              323              0                0                  0
>> dev606.domain.com    gluster-poc    /data/gluster-poc    10.10.10.122::gluster-poc    Passive    N/A                  N/A             0              0                0                0                  0
>> dev605.domain.com    gluster-poc    /data/gluster-poc    10.10.10.121::gluster-poc    Passive    N/A                  N/A             0              0                0                0                  0
>>
>> From: Venky Shankar <yknev.shan...@gmail.com>
>> Date: Wednesday, April 23, 2014 at 12:09 PM
>> To: CJ Beck <chris.b...@workday.com>
>> Cc: "gluster-users@gluster.org" <gluster-users@gluster.org>
>> Subject: Re: [Gluster-users] Question about geo-replication and deletes in 3.5 beta train
>>
>> That should not happen. After a replica failover, the "now" active node
>> should continue where the "old" active node left off.
>>
>> Could you provide geo-replication logs from master and slave after
>> reproducing this (with changelog mode)?
>>
>> Thanks,
>> -venky
>>
>> On Thu, Apr 17, 2014 at 9:34 PM, CJ Beck <chris.b...@workday.com> wrote:
>> I did set it intentionally, because I found a case where files would be
>> missed during geo-replication. Xsync seemed to handle the case better. The
>> issue was when you bring down the "Active" node that is handling the
>> geo-replication session, and it's set to ChangeLog as the change method.
>> Any files that are written into the cluster while geo-replication is down
>> (e.g., while the geo-replication session is being failed over to another
>> node) are missed/skipped, and won't ever be transferred to the other
>> cluster.
>>
>> Is this the expected behavior? If not, then I can open a bug on it.
>>
>> -CJ
>>
>> From: Venky Shankar <yknev.shan...@gmail.com>
>> Date: Wednesday, April 16, 2014 at 4:43 PM
>> To: CJ Beck <chris.b...@workday.com>
>> Cc: "gluster-users@gluster.org" <gluster-users@gluster.org>
>> Subject: Re: [Gluster-users] Question about geo-replication and deletes in 3.5 beta train
>>
>> On Thu, Apr 17, 2014 at 3:01 AM, CJ Beck <chris.b...@workday.com> wrote:
>> I did have the "change_detector" set to xsync, which seems to be the issue
>> (bypassing the changelog method). So I can fix that and see if the deletes
>> are propagated.
>>
>> Was that set intentionally?
>> Setting this as the main change detection mechanism would crawl the
>> filesystem every 60 seconds to replicate the changes. Changelog mode
>> handles live changes, so any deletes that were performed before this
>> option was set would not be propagated.
>>
>> Also, is there a way to tell the geo-replication to go ahead and walk the
>> filesystems to do a "sync", so that files on the remote side are deleted
>> if they are not on the source?
>>
>> As of now, no. With distributed geo-replication, the geo-rep daemon crawls
>> the bricks (instead of the mount). Since a brick holds only a subset of
>> the file system entities (e.g., in a distributed volume), it's hard to
>> find purged entries without crawling the mount and comparing the entries
>> between master and slave, which is slow. This is where changelog mode
>> helps.
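To spell out why that comparison is expensive: without a journal, finding
deletes means enumerating both sides in full and diffing them. A toy sketch
(illustrative only, not gsyncd code; it assumes both volumes are mounted
locally):

    import os

    def list_entries(mount_root):
        # Walk a mount (a single brick sees only a subset of a
        # distributed volume) and collect every path under it.
        entries = set()
        for dirpath, dirs, files in os.walk(mount_root):
            for name in dirs + files:
                full = os.path.join(dirpath, name)
                entries.add(os.path.relpath(full, mount_root))
        return entries

    def purged_entries(master_mount, slave_mount):
        # Entries still present on the slave but gone from the master
        # are the delete candidates. This needs a full crawl of BOTH
        # sides, which is why it is slow on large volumes; a changelog
        # records each UNLINK directly and avoids the diff entirely.
        return list_entries(slave_mount) - list_entries(master_mount)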
>> Thanks for the quick reply!
>>
>> [root@host ~]# gluster volume geo-replication test-poc 10.10.1.120::test-poc status detail
>>
>> MASTER NODE    MASTER VOL    MASTER BRICK      SLAVE                    STATUS     CHECKPOINT STATUS    CRAWL STATUS    FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING    FILES SKIPPED
>> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> host1.com      test-poc      /data/test-poc    10.10.1.120::test-poc    Passive    N/A                  N/A             382            0                0                0                  0
>> host2.com      test-poc      /data/test-poc    10.10.1.122::test-poc    Passive    N/A                  N/A             0              0                0                0                  0
>> host3.com      test-poc      /data/test-poc    10.10.1.121::test-poc    Active     N/A                  Hybrid Crawl    10765          70               0                0                  0
>>
>> From: Venky Shankar <yknev.shan...@gmail.com>
>> Date: Wednesday, April 16, 2014 at 1:54 PM
>> To: CJ Beck <chris.b...@workday.com>
>> Cc: "gluster-users@gluster.org" <gluster-users@gluster.org>
>> Subject: Re: [Gluster-users] Question about geo-replication and deletes in 3.5 beta train
>>
>> "ignore-deletes" is only valid in the initial crawl mode[1], where it does
>> not propagate deletes to the slave (changelog mode does). Was the session
>> restarted by any chance?
>>
>> [1] Geo-replication now has two internal operation modes: a one-shot
>> filesystem crawl mode (used to replicate data already present in a volume)
>> and the changelog mode (for replicating live changes).
>>
>> Thanks,
>> -venky
>>
>> On Thu, Apr 17, 2014 at 1:25 AM, CJ Beck <chris.b...@workday.com> wrote:
>> I have an issue where deletes are not being propagated to the slave
>> cluster in a geo-replicated environment. I've looked through the code, and
>> it appears as though this is something that might have been changed to be
>> hard-coded?
>>
>> When I try to change it via a config option on the command line, it
>> replies with a "Reserved option" error:
>>
>> [root@host ~]# gluster volume geo-replication test-poc 10.10.1.120::test-poc config ignore_deletes 1
>> Reserved option
>> geo-replication command failed
>> [root@host ~]# gluster volume geo-replication test-poc 10.10.1.120::test-poc config ignore-deletes 1
>> Reserved option
>> geo-replication command failed
>> [root@host ~]#
>>
>> Looking at the source code (although I'm not a C expert by any means), it
>> seems as though it's hard-coded to "true" all the time?
>>
>> (from glusterd-geo-rep.c, around line 4285):
>>
>>         /* ignore-deletes */
>>         runinit_gsyncd_setrx (&runner, conf_path);
>>         runner_add_args (&runner, "ignore-deletes", "true", ".", ".", NULL);
>>         RUN_GSYNCD_CMD;
>>
>> Any ideas how to get deletes propagated to the slave cluster?
>>
>> Thanks!
>>
>> -CJ
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users