Re: [Gluster-users] Does replace-brick migrate data?

2019-06-04 Thread Alan Orth
Hi Ravi,

You're right that I had mentioned using rsync to copy the brick content to
a new host, but in the end I actually decided not to bring it up on a new
brick. Instead I added the original brick back into the volume. So the
xattrs and symlinks to .glusterfs on the original brick are fine. I think
the problem probably lies with a remove-brick that got interrupted. A few
weeks ago during the maintenance I had tried to remove a brick and then
after twenty minutes and no obvious progress I stopped it—after that the
bricks were still part of the volume.

In the last few days I have run a fix-layout that took 26 hours and
finished successfully. Then I started a full index heal and it has healed
about 3.3 million files in a few days and I see a clear increase of network
traffic from old brick host to new brick host over that time. Once the full
index heal completes I will try to do a rebalance.
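For reference, the commands involved were roughly the following (MYVOL stands
in for the real volume name; the heal-count query is just for monitoring):

  gluster volume rebalance MYVOL fix-layout start
  gluster volume heal MYVOL full
  gluster volume heal MYVOL statistics heal-count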

Thank you,


On Mon, Jun 3, 2019 at 7:40 PM Ravishankar N  wrote:

>
> On 01/06/19 9:37 PM, Alan Orth wrote:
>
> Dear Ravi,
>
> The .glusterfs hardlinks/symlinks should be fine. I'm not sure how I could
> verify them for six bricks and millions of files, though... :\
>
> Hi Alan,
>
> The reason I asked this is because you had mentioned in one of your
> earlier emails that when you moved content from the old brick to the new
> one, you had skipped the .glusterfs directory. So I was assuming that when
> you added back this new brick to the cluster, it might have been missing
> the .glusterfs entries. If that is the case, one way to verify could be to
> check using a script if all files on the brick have a link-count of at
> least 2 and all dirs have valid symlinks inside .glusterfs pointing to
> themselves.
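>
> A rough, untested sketch of such a check, assuming the brick root is
> /data/brick (adjust the path for your bricks):
>
>   # regular files missing their .glusterfs hard link (link count < 2)
>   find /data/brick -path /data/brick/.glusterfs -prune -o -type f -links 1 -print
>
>   # for a given directory, read its gfid, then confirm that the corresponding
>   # .glusterfs/<aa>/<bb>/<full-gfid> entry exists and is a valid symlink
>   getfattr -n trusted.gfid -e hex /data/brick/path/to/dir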
>
>
> I had a small success in fixing some issues with duplicated files on the
> FUSE mount point yesterday. I read quite a bit about the elastic hashing
> algorithm that determines which files get placed on which bricks based on
> the hash of their filename and the trusted.glusterfs.dht xattr on brick
> directories (thanks to Joe Julian's blog post and Python script for showing
> how it works¹). With that knowledge I looked closer at one of the files
> that was appearing as duplicated on the FUSE mount and found that it was
> also duplicated on more than `replica 2` bricks. For this particular file I
> found two "real" files and several zero-size files with
> trusted.glusterfs.dht.linkto xattrs. Neither of the "real" files were on
> the correct brick as far as the DHT layout is concerned, so I copied one of
> them to the correct brick, deleted the others and their hard links, and did
> a `stat` on the file from the FUSE mount point and it fixed itself. Yay!
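>
> For the record, these xattrs can be inspected directly on the bricks with
> getfattr -- the paths below are hypothetical examples:
>
>   # layout range assigned to a directory on a given brick
>   getfattr -n trusted.glusterfs.dht -e hex /data/brick1/path/to/dir
>
>   # DHT link-to files are typically the zero-byte, mode ---------T entries
>   find /data/brick1/path/to/dir -maxdepth 1 -type f -size 0 -perm 1000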
>
> Could this have been caused by a replace-brick that got interrupted and
> didn't finish re-labeling the xattrs?
>
> No, replace-brick only initiates AFR self-heal, which just copies the
> contents from the other brick(s) of the *same* replica pair into the
> replaced brick.  The link-to files are created by DHT when you rename a
> file from the client. If the new name hashes to a different  brick, DHT
> does not move the entire file there. It instead creates the link-to file
> (the one with the dht.linkto xattrs) on the hashed subvol. The value of
> this xattr points to the brick where the actual data resides (`getfattr -e
> text` to see it for yourself). Perhaps you had attempted a rebalance or
> remove-brick earlier and interrupted that?
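>
> For example, the check mentioned above would look like this (file path and
> value are hypothetical):
>
>   getfattr -n trusted.glusterfs.dht.linkto -e text /data/brick1/path/to/file
>   # trusted.glusterfs.dht.linkto="myvol-replicate-2"   <- where the data lives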
>
> Should I be thinking of some heuristics to identify and fix these issues
> with a script (incorrect brick placement), or is this something a fix
> layout or repeated volume heals can fix? I've already completed a whole
> heal on this particular volume this week and it did heal about 1,000,000
> files (mostly data and metadata, but about 20,000 entry heals as well).
>
> Maybe you should let the AFR self-heals complete first and then attempt a
> full rebalance to take care of the dht link-to files. But if the files number
> in the millions, it could take quite some time to complete.
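>
> For reference, a full rebalance is started and monitored with (volume name
> is a placeholder):
>
>   gluster volume rebalance MYVOL start
>   gluster volume rebalance MYVOL status
>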
> Regards,
> Ravi
>
> Thanks for your support,
>
> ¹ https://joejulian.name/post/dht-misses-are-expensive/
>
> On Fri, May 31, 2019 at 7:57 AM Ravishankar N 
> wrote:
>
>>
>> On 31/05/19 3:20 AM, Alan Orth wrote:
>>
>> Dear Ravi,
>>
>> I spent a bit of time inspecting the xattrs on some files and directories
>> on a few bricks for this volume and it looks a bit messy. Even if I could
>> make sense of it for a few and potentially heal them manually, there are
>> millions of files and directories in total so that's definitely not a
>> scalable solution. After a few missteps with `replace-brick ... commit
>> force` in the last week—one of which on a brick that was dead/offline—as
>> well as some premature `remove-brick` commands, I'm unsure how to
>> proceed and I'm getting demotivated. It's scary how quickly things get out
>> of hand in distributed systems...
>>
>> Hi Alan,
>> The one good thing about gluster is that the data is always 

Re: [Gluster-users] Geo Replication stops replicating

2019-06-04 Thread Kotresh Hiremath Ravishankar
CCing Sunny, who was investigating a similar issue.

On Tue, Jun 4, 2019 at 5:46 PM deepu srinivasan  wrote:

> I have already added the path in .bashrc. Still in the Faulty state.
>
> On Tue, Jun 4, 2019, 5:27 PM Kotresh Hiremath Ravishankar <
> khire...@redhat.com> wrote:
>
>> could you please try adding /usr/sbin to $PATH for user 'sas'? If it's
>> bash, add 'export PATH=/usr/sbin:$PATH' in
>> /home/sas/.bashrc
>>
>> On Tue, Jun 4, 2019 at 5:24 PM deepu srinivasan 
>> wrote:
>>
>>> Hi Kortesh
>>> Please find the logs of the above error
>>> *Master log snippet*
>>>
 [2019-06-04 11:52:09.254731] I [resource(worker
 /home/sas/gluster/data/code-misc):1379:connect_remote] SSH: Initializing
 SSH connection between master and slave...
  [2019-06-04 11:52:09.308923] D [repce(worker
 /home/sas/gluster/data/code-misc):196:push] RepceClient: call
 89724:139652759443264:1559649129.31 __repce_version__() ...
  [2019-06-04 11:52:09.602792] E [syncdutils(worker
 /home/sas/gluster/data/code-misc):311:log_raise_exception] :
 connection to peer is broken
  [2019-06-04 11:52:09.603312] E [syncdutils(worker
 /home/sas/gluster/data/code-misc):805:errlog] Popen: command returned error
   cmd=ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i
 /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S
 /tmp/gsyncd-aux-ssh-4aL2tc/d893f66e0addc32f7d0080bb503f5185.sock
 sas@192.168.185.107 /usr/libexec/glusterfs/gsyncd slave code-misc sas@
   192.168.185.107::code-misc --master-node 192.168.185.106
 --master-node-id 851b64d0-d885-4ae9-9b38-ab5b15db0fec --master-brick
 /home/sas/gluster/data/code-misc --local-node 192.168.185.122 --local-node-id bcaa7af6-c3a1-4411-8e99-4ebecb32eb6a --slave-timeout 120
 --slave-log-level DEBUG --slave-gluster-log-level INFO
 --slave-gluster-command-dir /usr/sbin   error=1
  [2019-06-04 11:52:09.614996] I [repce(agent
 /home/sas/gluster/data/code-misc):97:service_loop] RepceServer: terminating
 on reaching EOF.
  [2019-06-04 11:52:09.615545] D [monitor(monitor):271:monitor] Monitor:
 worker(/home/sas/gluster/data/code-misc) connected
  [2019-06-04 11:52:09.616528] I [monitor(monitor):278:monitor] Monitor:
 worker died in startup phase brick=/home/sas/gluster/data/code-misc
  [2019-06-04 11:52:09.619391] I
 [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status
 Change status=Faulty

>>>
>>> *Slave log snippet*
>>>
 [2019-06-04 11:50:09.782668] E [syncdutils(slave
 192.168.185.106/home/sas/gluster/data/code-misc):809:logerr] Popen:
 /usr/sbin/gluster> 2 : failed with this errno (No such file or directory)
 [2019-06-04 11:50:11.188167] W [gsyncd(slave
 192.168.185.125/home/sas/gluster/data/code-misc):305:main] :
 Session config file not exists, using the default config
 path=/var/lib/glusterd/geo-replication/code-misc_192.168.185.107_code-misc/gsyncd.conf
 [2019-06-04 11:50:11.201070] I [resource(slave
 192.168.185.125/home/sas/gluster/data/code-misc):1098:connect]
 GLUSTER: Mounting gluster volume locally...
 [2019-06-04 11:50:11.271231] E [resource(slave
 192.168.185.125/home/sas/gluster/data/code-misc):1006:handle_mounter]
 MountbrokerMounter: glusterd answered mnt=
 [2019-06-04 11:50:11.271998] E [syncdutils(slave
 192.168.185.125/home/sas/gluster/data/code-misc):805:errlog] Popen:
 command returned error cmd=/usr/sbin/gluster --remote-host=localhost
 system:: mount sas user-map-root=sas aux-gfid-mount acl log-level=INFO
 log-file=/var/log/glusterfs/geo-replication-slaves/code-misc_192.168.185.107_code-misc/mnt-192.168.185.125-home-sas-gluster-data-code-misc.log
 volfile-server=localhost volfile-id=code-misc client-pid=-1 error=1
 [2019-06-04 11:50:11.272113] E [syncdutils(slave
 192.168.185.125/home/sas/gluster/data/code-misc):809:logerr] Popen:
 /usr/sbin/gluster> 2 : failed with this errno (No such file or directory)
>>>
>>>
>>> On Tue, Jun 4, 2019 at 5:10 PM deepu srinivasan 
>>> wrote:
>>>
 Hi
 As discussed, I have upgraded Gluster from version 4.1 to 6.2, but
 geo-replication fails to start and stays in the Faulty state.

 On Fri, May 31, 2019, 5:32 PM deepu srinivasan 
 wrote:

> Checked the data. It remains in 2708. No progress.
>
> On Fri, May 31, 2019 at 4:36 PM Kotresh Hiremath Ravishankar <
> khire...@redhat.com> wrote:
>
>> That means it could be working, and the defunct process might be some
>> old zombie. Could you check whether the data is progressing?
>>
>> On Fri, May 31, 2019 at 4:29 PM deepu srinivasan 
>> wrote:
>>
>>> Hi
>>> When I change the rsync option, the rsync process doesn't seem to
>>> start. Only a defunct process is listed in ps aux. Only when I set the
>>> rsync option to " " and restart all the processes is the rsync process listed 
>>> 

Re: [Gluster-users] Geo Replication stops replicating

2019-06-04 Thread Kotresh Hiremath Ravishankar
could you please try adding /usr/sbin to $PATH for user 'sas'? If it's
bash, add 'export PATH=/usr/sbin:$PATH' in
/home/sas/.bashrc
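
A quick way to verify the fix is to check what PATH the worker's ssh session
will see on the slave (hostname is a placeholder; whether ~/.bashrc is read
for non-interactive ssh commands is distro-dependent):

  ssh sas@SLAVE_HOST 'echo $PATH; which gluster'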

On Tue, Jun 4, 2019 at 5:24 PM deepu srinivasan  wrote:

> Hi Kortesh
> Please find the logs of the above error
> *Master log snippet*
>
>> [2019-06-04 11:52:09.254731] I [resource(worker
>> /home/sas/gluster/data/code-misc):1379:connect_remote] SSH: Initializing
>> SSH connection between master and slave...
>>  [2019-06-04 11:52:09.308923] D [repce(worker
>> /home/sas/gluster/data/code-misc):196:push] RepceClient: call
>> 89724:139652759443264:1559649129.31 __repce_version__() ...
>>  [2019-06-04 11:52:09.602792] E [syncdutils(worker
>> /home/sas/gluster/data/code-misc):311:log_raise_exception] :
>> connection to peer is broken
>>  [2019-06-04 11:52:09.603312] E [syncdutils(worker
>> /home/sas/gluster/data/code-misc):805:errlog] Popen: command returned error
>>   cmd=ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i
>> /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S
>> /tmp/gsyncd-aux-ssh-4aL2tc/d893f66e0addc32f7d0080bb503f5185.sock
>> sas@192.168.185.107 /usr/libexec/glusterfs/gsyncd slave code-misc sas@
>> 192.168.185.107::code-misc --master-node 192.168.185.106
>> --master-node-id 851b64d0-d885-4ae9-9b38-ab5b15db0fec --master-brick
>> /home/sas/gluster/data/code-misc --local-node 192.168.185.122 --local-node-id bcaa7af6-c3a1-4411-8e99-4ebecb32eb6a --slave-timeout 120
>> --slave-log-level DEBUG --slave-gluster-log-level INFO
>> --slave-gluster-command-dir /usr/sbin   error=1
>>  [2019-06-04 11:52:09.614996] I [repce(agent
>> /home/sas/gluster/data/code-misc):97:service_loop] RepceServer: terminating
>> on reaching EOF.
>>  [2019-06-04 11:52:09.615545] D [monitor(monitor):271:monitor] Monitor:
>> worker(/home/sas/gluster/data/code-misc) connected
>>  [2019-06-04 11:52:09.616528] I [monitor(monitor):278:monitor] Monitor:
>> worker died in startup phase brick=/home/sas/gluster/data/code-misc
>>  [2019-06-04 11:52:09.619391] I
>> [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status
>> Change status=Faulty
>>
>
> *Slave log snippet*
>
>> [2019-06-04 11:50:09.782668] E [syncdutils(slave
>> 192.168.185.106/home/sas/gluster/data/code-misc):809:logerr] Popen:
>> /usr/sbin/gluster> 2 : failed with this errno (No such file or directory)
>> [2019-06-04 11:50:11.188167] W [gsyncd(slave
>> 192.168.185.125/home/sas/gluster/data/code-misc):305:main] :
>> Session config file not exists, using the default config
>> path=/var/lib/glusterd/geo-replication/code-misc_192.168.185.107_code-misc/gsyncd.conf
>> [2019-06-04 11:50:11.201070] I [resource(slave
>> 192.168.185.125/home/sas/gluster/data/code-misc):1098:connect] GLUSTER:
>> Mounting gluster volume locally...
>> [2019-06-04 11:50:11.271231] E [resource(slave
>> 192.168.185.125/home/sas/gluster/data/code-misc):1006:handle_mounter]
>> MountbrokerMounter: glusterd answered mnt=
>> [2019-06-04 11:50:11.271998] E [syncdutils(slave
>> 192.168.185.125/home/sas/gluster/data/code-misc):805:errlog] Popen:
>> command returned error cmd=/usr/sbin/gluster --remote-host=localhost
>> system:: mount sas user-map-root=sas aux-gfid-mount acl log-level=INFO
>> log-file=/var/log/glusterfs/geo-replication-slaves/code-misc_192.168.185.107_code-misc/mnt-192.168.185.125-home-sas-gluster-data-code-misc.log
>> volfile-server=localhost volfile-id=code-misc client-pid=-1 error=1
>> [2019-06-04 11:50:11.272113] E [syncdutils(slave
>> 192.168.185.125/home/sas/gluster/data/code-misc):809:logerr] Popen:
>> /usr/sbin/gluster> 2 : failed with this errno (No such file or directory)
>
>
> On Tue, Jun 4, 2019 at 5:10 PM deepu srinivasan 
> wrote:
>
>> Hi
>> As discussed, I have upgraded Gluster from version 4.1 to 6.2, but
>> geo-replication fails to start and stays in the Faulty state.
>>
>> On Fri, May 31, 2019, 5:32 PM deepu srinivasan 
>> wrote:
>>
>>> Checked the data. It remains in 2708. No progress.
>>>
>>> On Fri, May 31, 2019 at 4:36 PM Kotresh Hiremath Ravishankar <
>>> khire...@redhat.com> wrote:
>>>
 That means it could be working, and the defunct process might be some
 old zombie. Could you check whether the data is progressing?

 On Fri, May 31, 2019 at 4:29 PM deepu srinivasan 
 wrote:

> Hi
> When I change the rsync option, the rsync process doesn't seem to start.
> Only a defunct process is listed in ps aux. Only when I set the rsync option
> to " " and restart all the processes is the rsync process listed in ps aux.
>
>
> On Fri, May 31, 2019 at 4:23 PM Kotresh Hiremath Ravishankar <
> khire...@redhat.com> wrote:
>
>> Yes, rsync config option should have fixed this issue.
>>
>> Could you share the output of the following?
>>
>> 1. gluster volume geo-replication  ::
>> config rsync-options
>> 2. ps -ef | grep rsync
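>>
>> For reference, the option is queried and set per geo-rep session; a sketch
>> with placeholder names and a placeholder value:
>>
>>   gluster volume geo-replication MASTERVOL SLAVEHOST::SLAVEVOL config rsync-options
>>   gluster volume geo-replication MASTERVOL SLAVEHOST::SLAVEVOL config rsync-options "EXTRA-RSYNC-FLAGS"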
>>
>> On Fri, May 31, 2019 at 4:11 PM deepu srinivasan 
>> wrote:
>>
>>> Done.
>>> We 

[Gluster-users] GETXATTR op pending on index xlator for more than 10 hours

2019-06-04 Thread Xie Changlong
Hi all,

Today I found gnfs GETXATTR bailing out on Gluster release 3.12.0. I have a
simple 4*2 Distributed-Replicate volume.

[2019-06-03 19:58:33.085880] E [rpc-clnt.c:185:Call_bail] 0-cl25vol01-client-4:
bailing out frame type(GlusterFS 3.3) op(GETXATTR(18)) xid=0x21de4275 sent =
2019-06-03 19:28:30.552356. timeout = 1800 for 10.3.133.57:49153

xid = 0x21de4275 = 568214133

Then I dumped brick 10.3.133.57:49153 and found the GETXATTR op pending on the
index xlator for more than 10 hours!
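(For reference: brick statedumps like the ones below can be generated with the
gluster CLI, and the xid in the bail-out message is the same number as the
`unique` field in the dump, just in hex:)

  gluster volume statedump cl25vol01    # writes *.dump.<timestamp> files on the brick nodes
  printf '%d\n' 0x21de4275              # -> 568214133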

[root@node0001 gluster]# grep -rn 568214133 gluster-brick-1-cl25vol01.6078.dump.15596*
gluster-brick-1-cl25vol01.6078.dump.1559617125:5093:unique=568214133
gluster-brick-1-cl25vol01.6078.dump.1559618121:5230:unique=568214133
gluster-brick-1-cl25vol01.6078.dump.1559618912:5434:unique=568214133
gluster-brick-1-cl25vol01.6078.dump.1559628467:6921:unique=568214133

[root@node0001 gluster]# date -d @1559617125
Tue Jun  4 10:58:45 CST 2019
[root@node0001 gluster]# date -d @1559628467
Tue Jun  4 14:07:47 CST 2019

[global.callpool.stack.115]
stack=0x7f8b342623c0
uid=500
gid=500
pid=-6
unique=568214133
lk-owner=faff
op=stack
type=0
cnt=4

[global.callpool.stack.115.frame.1]
frame=0x7f8b1d6fb540
ref_count=0
translator=cl25vol01-index
complete=0
parent=cl25vol01-quota
wind_from=quota_getxattr
wind_to=(this->children->xlator)->fops->getxattr
unwind_to=default_getxattr_cbk

[global.callpool.stack.115.frame.2]
frame=0x7f8b30a14da0
ref_count=1
translator=cl25vol01-quota
complete=0
parent=cl25vol01-io-stats
wind_from=io_stats_getxattr
wind_to=(this->children->xlator)->fops->getxattr
unwind_to=io_stats_getxattr_cbk

[global.callpool.stack.115.frame.3]
frame=0x7f8b6debada0
ref_count=1
translator=cl25vol01-io-stats
complete=0
parent=cl25vol01-server
wind_from=server_getxattr_resume
wind_to=FIRST_CHILD(this)->fops->getxattr
unwind_to=server_getxattr_cbk

[global.callpool.stack.115.frame.4]
frame=0x7f8b21962a60
ref_count=1
translator=cl25vol01-server
complete=0

I've checked the code logic but found nothing so far; any advice? The problem
is still reproducible on my side, so we can dig deeper.

Thanks
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Memory leak in glusterfs

2019-06-04 Thread ABHISHEK PALIWAL
Hi Team,

Please respond on the issue which I raised.

Regards,
Abhishek

On Fri, May 17, 2019 at 2:46 PM ABHISHEK PALIWAL 
wrote:

> Anyone please reply
>
> On Thu, May 16, 2019, 10:49 ABHISHEK PALIWAL 
> wrote:
>
>> Hi Team,
>>
>> I uploaded some valgrind logs from my Gluster 5.4 setup. It writes to
>> the volume every 15 minutes. I stopped glusterd and then copied away the
>> logs. The test ran for several simulated days. The logs are zipped in
>> valgrind-54.zip.
>>
>> Lots of info in valgrind-2730.log. Lots of possibly lost bytes in
>> glusterfs and even some definitely lost bytes.
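>>
>> (A typical way to capture such logs -- a sketch, not necessarily the exact
>> invocation used here -- is to run the daemon in the foreground under
>> valgrind:)
>>
>>   valgrind --leak-check=full --show-leak-kinds=all \
>>            --log-file=/tmp/valgrind-glusterd-%p.log glusterd -N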
>>
>> ==2737== 1,572,880 bytes in 1 blocks are possibly lost in loss record 391
>> of 391
>> ==2737== at 0x4C29C25: calloc (in
>> /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>> ==2737== by 0xA22485E: ??? (in
>> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
>> ==2737== by 0xA217C94: ??? (in
>> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
>> ==2737== by 0xA21D9F8: ??? (in
>> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
>> ==2737== by 0xA21DED9: ??? (in
>> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
>> ==2737== by 0xA21E685: ??? (in
>> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
>> ==2737== by 0xA1B9D8C: init (in
>> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
>> ==2737== by 0x4E511CE: xlator_init (in /usr/lib64/libglusterfs.so.0.0.1)
>> ==2737== by 0x4E8A2B8: ??? (in /usr/lib64/libglusterfs.so.0.0.1)
>> ==2737== by 0x4E8AAB3: glusterfs_graph_activate (in
>> /usr/lib64/libglusterfs.so.0.0.1)
>> ==2737== by 0x409C35: glusterfs_process_volfp (in /usr/sbin/glusterfsd)
>> ==2737== by 0x409D99: glusterfs_volumes_init (in /usr/sbin/glusterfsd)
>> ==2737==
>> ==2737== LEAK SUMMARY:
>> ==2737== definitely lost: 1,053 bytes in 10 blocks
>> ==2737== indirectly lost: 317 bytes in 3 blocks
>> ==2737== possibly lost: 2,374,971 bytes in 524 blocks
>> ==2737== still reachable: 53,277 bytes in 201 blocks
>> ==2737== suppressed: 0 bytes in 0 blocks
>>
>> --
>>
>>
>>
>>
>> Regards
>> Abhishek Paliwal
>>
>

-- 




Regards
Abhishek Paliwal
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users