Re: [Gluster-users] Transport endpoint is not connected failures in

2019-03-27 Thread Nithya Balachandran
On Wed, 27 Mar 2019 at 21:47,  wrote:

> Hello Amar and list,
>
>
>
> I wanted to follow up to confirm that upgrading to 5.5 seems to fix the
> “Transport endpoint is not connected failures” for us.
>
>
>
> We did not have any of these failures in this past weekend backups cycle.
>
>
>
> Thank you very much for fixing whatever was the problem.
>
>
>
> I also removed some volume config options.  One or more of the settings
> was contributing to the slow directory listing.
>

Hi Brandon,

Which options were removed?

Thanks,
Nithya

>
>
> Here is our current volume info.
>
>
>
> [root@lonbaknode3 ~]# gluster volume info
>
>
>
> Volume Name: volbackups
>
> Type: Distribute
>
> Volume ID: 32bf4fe9-5450-49f8-b6aa-05471d3bdffa
>
> Status: Started
>
> Snapshot Count: 0
>
> Number of Bricks: 8
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: lonbaknode3.domain.net:/lvbackups/brick
>
> Brick2: lonbaknode4.domain.net:/lvbackups/brick
>
> Brick3: lonbaknode5.domain.net:/lvbackups/brick
>
> Brick4: lonbaknode6.domain.net:/lvbackups/brick
>
> Brick5: lonbaknode7.domain.net:/lvbackups/brick
>
> Brick6: lonbaknode8.domain.net:/lvbackups/brick
>
> Brick7: lonbaknode9.domain.net:/lvbackups/brick
>
> Brick8: lonbaknode10.domain.net:/lvbackups/brick
>
> Options Reconfigured:
>
> performance.io-thread-count: 32
>
> performance.client-io-threads: on
>
> client.event-threads: 8
>
> diagnostics.brick-sys-log-level: WARNING
>
> diagnostics.brick-log-level: WARNING
>
> performance.cache-max-file-size: 2MB
>
> performance.cache-size: 256MB
>
> cluster.min-free-disk: 1%
>
> nfs.disable: on
>
> transport.address-family: inet
>
> server.event-threads: 8
>
> [root@lonbaknode3 ~]#
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
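On the question of which options were removed: the thread doesn't say, but for anyone wanting to do the same, a tuned option can be put back to its default with "gluster volume reset". The option names below are only examples taken from the list above, not necessarily the ones Brandon removed:

# gluster volume reset volbackups performance.cache-size
# gluster volume reset volbackups performance.io-thread-count
# gluster volume info volbackups    (a reset option no longer appears under "Options Reconfigured")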

Re: [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Raghavendra Gowdappa
On Wed, Mar 27, 2019 at 8:38 PM Xavi Hernandez  wrote:

> On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>>
>>
>> On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez 
>> wrote:
>>
>>> On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <
>>> pkara...@redhat.com> wrote:
>>>


 On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez 
 wrote:

> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <
> rgowd...@redhat.com> wrote:
>
>>
>>
>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
>> wrote:
>>
>>> Hi Raghavendra,
>>>
>>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
>>> rgowd...@redhat.com> wrote:
>>>
 All,

 Glusterfs cleans up POSIX locks held on an fd when the client/mount
 through which those locks are held disconnects from bricks/server. This
 helps Glusterfs to not run into a stale lock problem later (For eg., if
 application unlocks while the connection was still down). However, this
 means the lock is no longer exclusive as other applications/clients can
 acquire the same lock. To communicate that locks are no longer valid, 
 we
 are planning to mark the fd (which has POSIX locks) bad on a 
 disconnect so
 that any future operations on that fd will fail, forcing the 
 application to
 re-open the fd and re-acquire locks it needs [1].

>>>
>>> Wouldn't it be better to retake the locks when the brick is
>>> reconnected if the lock is still in use ?
>>>
>>
>> There is also  a possibility that clients may never reconnect. That's
>> the primary reason why bricks assume the worst (client will not 
>> reconnect)
>> and cleanup the locks.
>>
>
> True, so it's fine to cleanup the locks. I'm not saying that locks
> shouldn't be released on disconnect. The assumption is that if the client
> has really died, it will also disconnect from other bricks, who will
> release the locks. So, eventually, another client will have enough quorum
> to attempt a lock that will succeed. In other words, if a client gets
> disconnected from too many bricks simultaneously (loses Quorum), then that
> client can be considered as bad and can return errors to the application.
> This should also cause to release the locks on the remaining connected
> bricks.
>
> On the other hand, if the disconnection is very short and the client
> has not died, it will keep enough locked files (it has quorum) to avoid
> other clients to successfully acquire a lock. In this case, if the brick 
> is
> reconnected, all existing locks should be reacquired to recover the
> original state before the disconnection.
>
>
>>
>>> BTW, the referenced bug is not public. Should we open another bug to
>>> track this ?
>>>
>>
>> I've just opened up the comment to give enough context. I'll open a
>> bug upstream too.
>>
>>
>>>
>>>

 Note that with AFR/replicate in picture we can prevent errors to
 application as long as Quorum number of children "never ever" lost
 connection with bricks after locks have been acquired. I am using the 
 term
 "never ever" as locks are not healed back after re-connection and hence
 first disconnect would've marked the fd bad and the fd remains so even
 after re-connection happens. So, its not just Quorum number of children
 "currently online", but Quorum number of children "never having
 disconnected with bricks after locks are acquired".

>>>
>>> I think this requisite is not feasible. In a distributed file
>>> system, sooner or later all bricks will be disconnected. It could be
>>> because of failures or because an upgrade is done, but it will happen.
>>>
>>> The difference here is how long are fd's kept open. If applications
>>> open and close files frequently enough (i.e. the fd is not kept open 
>>> more
>>> time than it takes to have more than Quorum bricks disconnected) then
>>> there's no problem. The problem can only appear on applications that 
>>> open
>>> files for a long time and also use posix locks. In this case, the only 
>>> good
>>> solution I see is to retake the locks on brick reconnection.
>>>
>>
>> Agree. But lock-healing should be done only by HA layers like AFR/EC
>> as only they know whether there are enough online bricks to have 
>> prevented
>> any conflicting lock. Protocol/client itself doesn't have enough
>> information to do that. If it's a plain distribute, I don't see a way to
>> heal locks without losing the property of exclusivity of locks.
>>
>
> Lock-healing of locks acquired while a brick was disconnected need to
> be 
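For reference, a minimal client-application sketch of the pattern described earlier in this thread: once the fd has been marked bad after a disconnect, operations on it start failing, and the application re-opens the file and re-acquires its POSIX lock before retrying. This is only an illustration; the path, the single retry and the error handling are hypothetical, and the exact errno returned for a bad fd is not specified here.

/* sketch: re-open and re-lock after the fd goes bad (hypothetical example) */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int open_and_lock(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };   /* whole-file write lock */
    if (fcntl(fd, F_SETLKW, &fl) < 0) {               /* block until the lock is granted */
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    const char *path = "/mnt/glustervol/app.lock";    /* hypothetical mount point */
    int fd = open_and_lock(path);
    if (fd < 0)
        return 1;
    if (write(fd, "x", 1) < 0) {                      /* fd was marked bad after a disconnect */
        fprintf(stderr, "write failed (%s), re-opening\n", strerror(errno));
        close(fd);
        fd = open_and_lock(path);                     /* re-open and re-acquire the lock */
        if (fd < 0 || write(fd, "x", 1) < 0)
            return 1;
    }
    close(fd);
    return 0;
}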

Re: [Gluster-users] Transport endpoint is not connected failures in

2019-03-27 Thread Raghavendra Gowdappa
On Wed, Mar 27, 2019 at 9:46 PM  wrote:

> Hello Amar and list,
>
>
>
> I wanted to follow up to confirm that upgrading to 5.5 seems to fix the
> “Transport endpoint is not connected failures” for us.
>

What was the version you saw failures in? Were there any logs matching with
the pattern "ping_timer_expired" earlier?


>
> We did not have any of these failures in this past weekend backups cycle.
>
>
>
> Thank you very much for fixing whatever was the problem.
>
>
>
> I also removed some volume config options.  One or more of the settings
> was contributing to the slow directory listing.
>
>
>
> Here is our current volume info.
>
>
>
> [root@lonbaknode3 ~]# gluster volume info
>
>
>
> Volume Name: volbackups
>
> Type: Distribute
>
> Volume ID: 32bf4fe9-5450-49f8-b6aa-05471d3bdffa
>
> Status: Started
>
> Snapshot Count: 0
>
> Number of Bricks: 8
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: lonbaknode3.domain.net:/lvbackups/brick
>
> Brick2: lonbaknode4.domain.net:/lvbackups/brick
>
> Brick3: lonbaknode5.domain.net:/lvbackups/brick
>
> Brick4: lonbaknode6.domain.net:/lvbackups/brick
>
> Brick5: lonbaknode7.domain.net:/lvbackups/brick
>
> Brick6: lonbaknode8.domain.net:/lvbackups/brick
>
> Brick7: lonbaknode9.domain.net:/lvbackups/brick
>
> Brick8: lonbaknode10.domain.net:/lvbackups/brick
>
> Options Reconfigured:
>
> performance.io-thread-count: 32
>
> performance.client-io-threads: on
>
> client.event-threads: 8
>
> diagnostics.brick-sys-log-level: WARNING
>
> diagnostics.brick-log-level: WARNING
>
> performance.cache-max-file-size: 2MB
>
> performance.cache-size: 256MB
>
> cluster.min-free-disk: 1%
>
> nfs.disable: on
>
> transport.address-family: inet
>
> server.event-threads: 8
>
> [root@lonbaknode3 ~]#
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
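For anyone wanting to check the pattern Raghavendra mentions, the logs can be searched in their default locations (paths assume a standard install; client mount logs live on the clients, brick logs on the servers):

grep -l "ping_timer_expired" /var/log/glusterfs/*.log
grep -l "ping_timer_expired" /var/log/glusterfs/bricks/*.log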

Re: [Gluster-users] Transport endpoint is not connected failures in

2019-03-27 Thread Thing
I have had this issue for a few days with my new setup.   I will have to get
back to you on versions, but it was CentOS 7.6 patched yesterday (27/3/2019).



On Thu, 28 Mar 2019 at 12:58, Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> On Wed, Mar 27, 2019 at 9:46 PM  wrote:
> >
> > Hello Amar and list,
> >
> >
> >
> > I wanted to follow up to confirm that upgrading to 5.5 seems to fix the
> “Transport endpoint is not connected failures” for us.
> >
> >
> >
> > We did not have any of these failures in this past weekend backups cycle.
> >
> >
> >
> > Thank you very much for fixing whatever was the problem.
>
> As always, thank you for circling back to the list and sharing that
> the issues have been addressed.
> >
> > I also removed some volume config options.  One or more of the settings
> was contributing to the slow directory listing.
> >
> >
> >
> > Here is our current volume info.
> >
>
> This is very useful!
>
> >
> > [root@lonbaknode3 ~]# gluster volume info
> >
> >
> >
> > Volume Name: volbackups
> >
> > Type: Distribute
> >
> > Volume ID: 32bf4fe9-5450-49f8-b6aa-05471d3bdffa
> >
> > Status: Started
> >
> > Snapshot Count: 0
> >
> > Number of Bricks: 8
> >
> > Transport-type: tcp
> >
> > Bricks:
> >
> > Brick1: lonbaknode3.domain.net:/lvbackups/brick
> >
> > Brick2: lonbaknode4.domain.net:/lvbackups/brick
> >
> > Brick3: lonbaknode5.domain.net:/lvbackups/brick
> >
> > Brick4: lonbaknode6.domain.net:/lvbackups/brick
> >
> > Brick5: lonbaknode7.domain.net:/lvbackups/brick
> >
> > Brick6: lonbaknode8.domain.net:/lvbackups/brick
> >
> > Brick7: lonbaknode9.domain.net:/lvbackups/brick
> >
> > Brick8: lonbaknode10.domain.net:/lvbackups/brick
> >
> > Options Reconfigured:
> >
> > performance.io-thread-count: 32
> >
> > performance.client-io-threads: on
> >
> > client.event-threads: 8
> >
> > diagnostics.brick-sys-log-level: WARNING
> >
> > diagnostics.brick-log-level: WARNING
> >
> > performance.cache-max-file-size: 2MB
> >
> > performance.cache-size: 256MB
> >
> > cluster.min-free-disk: 1%
> >
> > nfs.disable: on
> >
> > transport.address-family: inet
> >
> > server.event-threads: 8
> >
> > [root@lonbaknode3 ~]#
> >
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] [Gluster-infra] Gluster HA

2019-03-27 Thread Sankarshan Mukhopadhyay
[This email was originally posted to the gluster-infra list. Since
that list is used for coordination between members who work on the
infrastructure of the project, I am redirecting it to gluster-users
for better visibility and responses.]

On Thu, Mar 28, 2019 at 1:13 AM Guy Boisvert
 wrote:
>
> Hi,
>
>  New to this mailing list.  I'm seeking people's advice for GlusterFS
> HA in the context of KVM Virtual Machine (VM) storage. We have 3 x KVM
> servers that use 3 x GlusterFS nodes.  The volumes are 3-way replicated.
>
>  My question is: you guys, what is your network architecture / setup
> for GlusterFS HA?  I read many articles on the internet. Many people are
> talking about bonding to a switch, but I don't consider this a good
> solution.  I'd like to have the Gluster and KVM servers linked to at least 2
> switches to have switch / wire and network card redundancy.
>
>  I saw people using 2 x dumb switches with bonding mode 6 on their
> servers with MII monitoring.  It seems fairly good, but it could
> happen that MII is up while frames / packets won't flow. In that case,
> I can't imagine how the servers would handle it.
>
>  Another setup is dual dumb switches and running Quagga on the
> servers (OSPF / ECMP).  This seems to be the best setup, what do you
> think?  Do you have experience with one of those setups?  What are your
> thoughts on this?  Ah, and lastly, how can I search the list?
>
>
> Thanks!
>
>
> Guy
>
> --
> Guy Boisvert, ing.
> IngTegration inc.
> http://www.ingtegration.com
> https://www.linkedin.com/pub/guy-boisvert/7/48/899/fr
>
> AVIS DE CONFIDENTIALITE : ce message peut contenir des
> renseignements confidentiels appartenant exclusivement a
> IngTegration Inc. ou a ses filiales. Si vous n'etes pas
> le destinataire indique ou prevu dans ce  message (ou
> responsable de livrer ce message a la personne indiquee ou
> prevue) ou si vous pensez que ce message vous a ete adresse
> par erreur, vous ne pouvez pas utiliser ou reproduire ce
> message, ni le livrer a quelqu'un d'autre. Dans ce cas, vous
> devez le detruire et vous etes prie d'avertir l'expediteur
> en repondant au courriel.
>
> CONFIDENTIALITY NOTICE : Proprietary/Confidential Information
> belonging to IngTegration Inc. and its affiliates may be
> contained in this message. If you are not a recipient
> indicated or intended in this message (or responsible for
> delivery of this message to such person), or you think for
> any reason that this message may have been addressed to you
> in error, you may not use or copy or deliver this message to
> anyone else. In such case, you should destroy this message
> and are asked to notify the sender by reply email.
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
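For what it's worth, a minimal sketch of the bonding mode 6 (balance-alb) with MII monitoring setup mentioned above, in RHEL/CentOS network-scripts form; the interface names and address are hypothetical and only meant to show the shape of the config:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=balance-alb miimon=100"
BOOTPROTO=none
IPADDR=10.0.0.11
PREFIX=24
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-em1  (repeat for the second NIC, em2)
DEVICE=em1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes

As noted in the original mail, MII only reports link state, so it will not catch a switch whose port is up but not forwarding; that limitation applies regardless of how the bond itself is configured.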


Re: [Gluster-users] Transport endpoint is not connected failures in

2019-03-27 Thread Sankarshan Mukhopadhyay
On Wed, Mar 27, 2019 at 9:46 PM  wrote:
>
> Hello Amar and list,
>
>
>
> I wanted to follow up to confirm that upgrading to 5.5 seems to fix the
> “Transport endpoint is not connected failures” for us.
>
>
>
> We did not have any of these failures in this past weekend backups cycle.
>
>
>
> Thank you very much for fixing whatever was the problem.

As always, thank you for circling back to the list and sharing that
the issues have been addressed.
>
> I also removed some volume config options.  One or more of the settings was 
> contributing to the slow directory listing.
>
>
>
> Here is our current volume info.
>

This is very useful!

>
> [root@lonbaknode3 ~]# gluster volume info
>
>
>
> Volume Name: volbackups
>
> Type: Distribute
>
> Volume ID: 32bf4fe9-5450-49f8-b6aa-05471d3bdffa
>
> Status: Started
>
> Snapshot Count: 0
>
> Number of Bricks: 8
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: lonbaknode3.domain.net:/lvbackups/brick
>
> Brick2: lonbaknode4.domain.net:/lvbackups/brick
>
> Brick3: lonbaknode5.domain.net:/lvbackups/brick
>
> Brick4: lonbaknode6.domain.net:/lvbackups/brick
>
> Brick5: lonbaknode7.domain.net:/lvbackups/brick
>
> Brick6: lonbaknode8.domain.net:/lvbackups/brick
>
> Brick7: lonbaknode9.domain.net:/lvbackups/brick
>
> Brick8: lonbaknode10.domain.net:/lvbackups/brick
>
> Options Reconfigured:
>
> performance.io-thread-count: 32
>
> performance.client-io-threads: on
>
> client.event-threads: 8
>
> diagnostics.brick-sys-log-level: WARNING
>
> diagnostics.brick-log-level: WARNING
>
> performance.cache-max-file-size: 2MB
>
> performance.cache-size: 256MB
>
> cluster.min-free-disk: 1%
>
> nfs.disable: on
>
> transport.address-family: inet
>
> server.event-threads: 8
>
> [root@lonbaknode3 ~]#
>
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Xavi Hernandez
On Wed, 27 Mar 2019, 18:26 Pranith Kumar Karampuri, 
wrote:

>
>
> On Wed, Mar 27, 2019 at 8:38 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri <
>> pkara...@redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez 
>>> wrote:
>>>
 On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <
 pkara...@redhat.com> wrote:

>
>
> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <
>> rgowd...@redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
>>> wrote:
>>>
 Hi Raghavendra,

 On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
 rgowd...@redhat.com> wrote:

> All,
>
> Glusterfs cleans up POSIX locks held on an fd when the
> client/mount through which those locks are held disconnects from
> bricks/server. This helps Glusterfs to not run into a stale lock 
> problem
> later (For eg., if application unlocks while the connection was still
> down). However, this means the lock is no longer exclusive as other
> applications/clients can acquire the same lock. To communicate that 
> locks
> are no longer valid, we are planning to mark the fd (which has POSIX 
> locks)
> bad on a disconnect so that any future operations on that fd will 
> fail,
> forcing the application to re-open the fd and re-acquire locks it 
> needs [1].
>

 Wouldn't it be better to retake the locks when the brick is
 reconnected if the lock is still in use ?

>>>
>>> There is also  a possibility that clients may never reconnect.
>>> That's the primary reason why bricks assume the worst (client will not
>>> reconnect) and cleanup the locks.
>>>
>>
>> True, so it's fine to cleanup the locks. I'm not saying that locks
>> shouldn't be released on disconnect. The assumption is that if the client
>> has really died, it will also disconnect from other bricks, who will
>> release the locks. So, eventually, another client will have enough quorum
>> to attempt a lock that will succeed. In other words, if a client gets
>> disconnected from too many bricks simultaneously (loses Quorum), then 
>> that
>> client can be considered as bad and can return errors to the application.
>> This should also cause to release the locks on the remaining connected
>> bricks.
>>
>> On the other hand, if the disconnection is very short and the client
>> has not died, it will keep enough locked files (it has quorum) to avoid
>> other clients to successfully acquire a lock. In this case, if the brick 
>> is
>> reconnected, all existing locks should be reacquired to recover the
>> original state before the disconnection.
>>
>>
>>>
 BTW, the referenced bug is not public. Should we open another bug
 to track this ?

>>>
>>> I've just opened up the comment to give enough context. I'll open a
>>> bug upstream too.
>>>
>>>


>
> Note that with AFR/replicate in picture we can prevent errors to
> application as long as Quorum number of children "never ever" lost
> connection with bricks after locks have been acquired. I am using the 
> term
> "never ever" as locks are not healed back after re-connection and 
> hence
> first disconnect would've marked the fd bad and the fd remains so even
> after re-connection happens. So, its not just Quorum number of 
> children
> "currently online", but Quorum number of children "never having
> disconnected with bricks after locks are acquired".
>

 I think this requisite is not feasible. In a distributed file
 system, sooner or later all bricks will be disconnected. It could be
 because of failures or because an upgrade is done, but it will happen.

 The difference here is how long are fd's kept open. If applications
 open and close files frequently enough (i.e. the fd is not kept open 
 more
 time than it takes to have more than Quorum bricks disconnected) then
 there's no problem. The problem can only appear on applications that 
 open
 files for a long time and also use posix locks. In this case, the only 
 good
 solution I see is to retake the locks on brick reconnection.

>>>
>>> Agree. But lock-healing should be done only by HA layers like AFR/EC
>>> as only they know whether there are enough online bricks to have 
>>> prevented
>>> any conflicting lock. Protocol/client itself doesn't have enough
>>> information 

Re: [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Pranith Kumar Karampuri
On Wed, Mar 27, 2019 at 8:38 PM Xavi Hernandez  wrote:

> On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>>
>>
>> On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez 
>> wrote:
>>
>>> On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <
>>> pkara...@redhat.com> wrote:
>>>


 On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez 
 wrote:

> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <
> rgowd...@redhat.com> wrote:
>
>>
>>
>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
>> wrote:
>>
>>> Hi Raghavendra,
>>>
>>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
>>> rgowd...@redhat.com> wrote:
>>>
 All,

 Glusterfs cleans up POSIX locks held on an fd when the client/mount
 through which those locks are held disconnects from bricks/server. This
 helps Glusterfs to not run into a stale lock problem later (For eg., if
 application unlocks while the connection was still down). However, this
 means the lock is no longer exclusive as other applications/clients can
 acquire the same lock. To communicate that locks are no longer valid, 
 we
 are planning to mark the fd (which has POSIX locks) bad on a 
 disconnect so
 that any future operations on that fd will fail, forcing the 
 application to
 re-open the fd and re-acquire locks it needs [1].

>>>
>>> Wouldn't it be better to retake the locks when the brick is
>>> reconnected if the lock is still in use ?
>>>
>>
>> There is also  a possibility that clients may never reconnect. That's
>> the primary reason why bricks assume the worst (client will not 
>> reconnect)
>> and cleanup the locks.
>>
>
> True, so it's fine to cleanup the locks. I'm not saying that locks
> shouldn't be released on disconnect. The assumption is that if the client
> has really died, it will also disconnect from other bricks, who will
> release the locks. So, eventually, another client will have enough quorum
> to attempt a lock that will succeed. In other words, if a client gets
> disconnected from too many bricks simultaneously (loses Quorum), then that
> client can be considered as bad and can return errors to the application.
> This should also cause to release the locks on the remaining connected
> bricks.
>
> On the other hand, if the disconnection is very short and the client
> has not died, it will keep enough locked files (it has quorum) to avoid
> other clients to successfully acquire a lock. In this case, if the brick 
> is
> reconnected, all existing locks should be reacquired to recover the
> original state before the disconnection.
>
>
>>
>>> BTW, the referenced bug is not public. Should we open another bug to
>>> track this ?
>>>
>>
>> I've just opened up the comment to give enough context. I'll open a
>> bug upstream too.
>>
>>
>>>
>>>

 Note that with AFR/replicate in picture we can prevent errors to
 application as long as Quorum number of children "never ever" lost
 connection with bricks after locks have been acquired. I am using the 
 term
 "never ever" as locks are not healed back after re-connection and hence
 first disconnect would've marked the fd bad and the fd remains so even
 after re-connection happens. So, its not just Quorum number of children
 "currently online", but Quorum number of children "never having
 disconnected with bricks after locks are acquired".

>>>
>>> I think this requisite is not feasible. In a distributed file
>>> system, sooner or later all bricks will be disconnected. It could be
>>> because of failures or because an upgrade is done, but it will happen.
>>>
>>> The difference here is how long are fd's kept open. If applications
>>> open and close files frequently enough (i.e. the fd is not kept open 
>>> more
>>> time than it takes to have more than Quorum bricks disconnected) then
>>> there's no problem. The problem can only appear on applications that 
>>> open
>>> files for a long time and also use posix locks. In this case, the only 
>>> good
>>> solution I see is to retake the locks on brick reconnection.
>>>
>>
>> Agree. But lock-healing should be done only by HA layers like AFR/EC
>> as only they know whether there are enough online bricks to have 
>> prevented
>> any conflicting lock. Protocol/client itself doesn't have enough
>> information to do that. If it's a plain distribute, I don't see a way to
>> heal locks without losing the property of exclusivity of locks.
>>
>
> Lock-healing of locks acquired while a brick was disconnected need to
> be 

Re: [Gluster-users] cannot add server back to cluster after reinstallation

2019-03-27 Thread Riccardo Murri
Thanks all for the help!  The cluster has been up for a few hours now
with no reported errors, so I guess replacement of the server went
ultimately fine ;-)

Ciao,
R
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] glusterfs unusable?

2019-03-27 Thread richard lucassen
On Wed, 27 Mar 2019 07:53:55 -0700
Joe Julian  wrote:

Ok Joe, this is the situation: I have a glusterfs cluster using Dell R630
servers with 256GB of memory, a bunch of 3.4TB SSDs and Intel Xeon
E5-2667 beasts. With that kind of power, seeing glusterfs take 5
seconds for a simple "ls -alR" on a client directly connected over a
1Gbit cable to these servers is rather slow (this will be nominated
for The Understatement Of The Week). Rather slow, not unusable (and I
haven't even added an arbiter to these two servers yet)

OTOH, contrary to what you suggest, I'm not using a brick at home; it is
just a Linux client connecting to these two servers, ok, I admit, over a
slow line. I was just checking how long a simple "ls -alR" would take.
And when that takes almost an hour and consumes 2GB of upload, I think
I can say it's quite unusable.

So I'm sorry Joe, I don't want to spoil your day, but I have to say
that GlusterFS (sorry for the wrong abbreviation) did spoil my day
because of this issue. Such behaviour would certainly be a show
stopper.

I hope the patches will resolve these issues.

R.

> First, your statement and subject is hyperbolic and combative. In
> general it's best not to begin any approach for help with an
> uneducated attack on a community.
> 
> GFS (Global File System) is an entirely different project but I'm
> going to assume you're in the right place and actually asking about
> GlusterFS.
> 
> You haven't described your use case so I'll make an assumption that
> your intent is to sync files from your office to your home. I'll
> further guess that you're replicating one brick at home and the other
> at the office.
> 
> >Yes, this is generally an unusable use case due to latency and
> connectivity reasons. Your 2Gb transfer was very likely a self heal
> due to a connectivity problem from one of your clients. When your
> home client performed a lookup() of the files, it caught the
> discrepancy and fixed it. The latency is multiplied due to the very
> nature of clustering and your latent connection.
> 
> For a more useful answer, I'd suggest describing your needs and
> > asking for help. There are tons of experienced storage professionals
> > here who are happy to share their knowledge and advice.
> 
> On March 27, 2019 7:23:35 AM PDT, richard lucassen
>  wrote:
> >Hello list,
> >
> >glusterfs 5.4-1 on Debian Buster (both servers and clients)
> >
> >I'm quite new to GFS and it's an old problem I know. When running a
> >simple "ls -alR" on a local directory containing 50MB and 3468 files
> >it takes:
> >
> >real0m0.567s
> >user0m0.084s
> >sys 0m0.168s
> >
> >Same thing for a copy of that dir on GFS takes more than 5 seconds:
> >
> >real0m5.557s
> >user0m0.128s
> >sys 0m0.208s
> >
> >Ok. But from my workstation at home, an "ls -alR" of that directory
> >takes more than half an hour and the upload is more than 2GB (no
> >typo: TWO Gigabytes). To keep it simple, the ls of a few directories:
> >
> >$ time ls
> >all  xabc-db  xabc-dc1  xabc-gluster  xabc-mail  xabc-otp  xabc-smtp
> >
> >real0m5.766s
> >user0m0.001s
> >sys 0m0.003s
> >
> >it receives 56kB and sends 2.3 MB for a simple ls.
> >
> >This is weird isn't it? Why this huge upload?
> >
> >Changing these options mentioned here doesn't make any difference:
> >
> >https://lists.gluster.org/pipermail/gluster-users/2016-January/024865.html
> >
> >Anyone a hint? Or should I drop GFS? This is unusable IMHO.
> >
> >Richard.
> >
> >-- 
> >richard lucassen
> >http://contact.xaq.nl/
> >___
> >Gluster-users mailing list
> >Gluster-users@gluster.org
> >https://lists.gluster.org/mailman/listinfo/gluster-users
> 


-- 
richard lucassen
http://contact.xaq.nl/
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
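One way to see where the time goes on the slow listing is Gluster's built-in profiler, run against the volume on one of the servers; the volume and mount names below are placeholders:

gluster volume profile <volname> start
ls -alR /mnt/<volname> > /dev/null      (reproduce the slow listing from the client)
gluster volume profile <volname> info   (per-brick FOP counts and latencies)
gluster volume profile <volname> stop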


Re: [Gluster-users] Transport endpoint is not connected failures in

2019-03-27 Thread brandon
Hello Amar and list,

 

I wanted to follow up to confirm that upgrading to 5.5 seems to fix the
"Transport endpoint is not connected failures" for us.  

 

We did not have any of these failures in this past weekend backups cycle.

 

Thank you very much for fixing whatever was the problem.

 

I also removed some volume config options.  One or more of the settings was
contributing to the slow directory listing.

 

Here is our current volume info.

 

[root@lonbaknode3 ~]# gluster volume info

 

Volume Name: volbackups

Type: Distribute

Volume ID: 32bf4fe9-5450-49f8-b6aa-05471d3bdffa

Status: Started

Snapshot Count: 0

Number of Bricks: 8

Transport-type: tcp

Bricks:

Brick1: lonbaknode3.domain.net:/lvbackups/brick

Brick2: lonbaknode4.domain.net:/lvbackups/brick

Brick3: lonbaknode5.domain.net:/lvbackups/brick

Brick4: lonbaknode6.domain.net:/lvbackups/brick

Brick5: lonbaknode7.domain.net:/lvbackups/brick

Brick6: lonbaknode8.domain.net:/lvbackups/brick

Brick7: lonbaknode9.domain.net:/lvbackups/brick

Brick8: lonbaknode10.domain.net:/lvbackups/brick

Options Reconfigured:

performance.io-thread-count: 32

performance.client-io-threads: on

client.event-threads: 8

diagnostics.brick-sys-log-level: WARNING

diagnostics.brick-log-level: WARNING

performance.cache-max-file-size: 2MB

performance.cache-size: 256MB

cluster.min-free-disk: 1%

nfs.disable: on

transport.address-family: inet

server.event-threads: 8

[root@lonbaknode3 ~]#

___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Xavi Hernandez
On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri 
wrote:

>
>
> On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <
>> pkara...@redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez 
>>> wrote:
>>>
 On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <
 rgowd...@redhat.com> wrote:

>
>
> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
> wrote:
>
>> Hi Raghavendra,
>>
>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
>> rgowd...@redhat.com> wrote:
>>
>>> All,
>>>
>>> Glusterfs cleans up POSIX locks held on an fd when the client/mount
>>> through which those locks are held disconnects from bricks/server. This
>>> helps Glusterfs to not run into a stale lock problem later (For eg., if
>>> application unlocks while the connection was still down). However, this
>>> means the lock is no longer exclusive as other applications/clients can
>>> acquire the same lock. To communicate that locks are no longer valid, we
>>> are planning to mark the fd (which has POSIX locks) bad on a disconnect 
>>> so
>>> that any future operations on that fd will fail, forcing the 
>>> application to
>>> re-open the fd and re-acquire locks it needs [1].
>>>
>>
>> Wouldn't it be better to retake the locks when the brick is
>> reconnected if the lock is still in use ?
>>
>
> There is also  a possibility that clients may never reconnect. That's
> the primary reason why bricks assume the worst (client will not reconnect)
> and cleanup the locks.
>

 True, so it's fine to cleanup the locks. I'm not saying that locks
 shouldn't be released on disconnect. The assumption is that if the client
 has really died, it will also disconnect from other bricks, who will
 release the locks. So, eventually, another client will have enough quorum
 to attempt a lock that will succeed. In other words, if a client gets
 disconnected from too many bricks simultaneously (loses Quorum), then that
 client can be considered as bad and can return errors to the application.
 This should also cause to release the locks on the remaining connected
 bricks.

 On the other hand, if the disconnection is very short and the client
 has not died, it will keep enough locked files (it has quorum) to avoid
 other clients to successfully acquire a lock. In this case, if the brick is
 reconnected, all existing locks should be reacquired to recover the
 original state before the disconnection.


>
>> BTW, the referenced bug is not public. Should we open another bug to
>> track this ?
>>
>
> I've just opened up the comment to give enough context. I'll open a
> bug upstream too.
>
>
>>
>>
>>>
>>> Note that with AFR/replicate in picture we can prevent errors to
>>> application as long as Quorum number of children "never ever" lost
>>> connection with bricks after locks have been acquired. I am using the 
>>> term
>>> "never ever" as locks are not healed back after re-connection and hence
>>> first disconnect would've marked the fd bad and the fd remains so even
>>> after re-connection happens. So, its not just Quorum number of children
>>> "currently online", but Quorum number of children "never having
>>> disconnected with bricks after locks are acquired".
>>>
>>
>> I think this requisite is not feasible. In a distributed file system,
>> sooner or later all bricks will be disconnected. It could be because of
>> failures or because an upgrade is done, but it will happen.
>>
>> The difference here is how long are fd's kept open. If applications
>> open and close files frequently enough (i.e. the fd is not kept open more
>> time than it takes to have more than Quorum bricks disconnected) then
>> there's no problem. The problem can only appear on applications that open
>> files for a long time and also use posix locks. In this case, the only 
>> good
>> solution I see is to retake the locks on brick reconnection.
>>
>
> Agree. But lock-healing should be done only by HA layers like AFR/EC
> as only they know whether there are enough online bricks to have prevented
> any conflicting lock. Protocol/client itself doesn't have enough
> information to do that. If it's a plain distribute, I don't see a way to
> heal locks without losing the property of exclusivity of locks.
>

 Lock-healing of locks acquired while a brick was disconnected need to
 be handled by AFR/EC. However, locks already present at the moment of
 disconnection could be recovered by client xlator itself as long as the
 file has not been closed (which client xlator already knows).

>>>
>>> What if another 

Re: [Gluster-users] glusterfs unusable?

2019-03-27 Thread Joe Julian
First, your statement and subject is hyperbolic and combative. In general it's 
best not to begin any approach for help with an uneducated attack on a 
community.

GFS (Global File System) is an entirely different project but I'm going to 
assume you're in the right place and actually asking about GlusterFS.

You haven't described your use case so I'll make an assumption that your intent 
is to sync files from your office to your home. I'll further guess that you're 
replicating one brick at home and the other at the office.

Yes, this is generally an unusable use case due to latency and connectivity 
reasons. Your 2Gb transfer was very likely a self heal due to a connectivity 
problem from one of your clients. When your home client performed a lookup() of 
the files, it caught the discrepancy and fixed it. The latency is multiplied 
due to the very nature of clustering and your latent connection.

For a more useful answer, I'd suggest describing your needs and asking for 
help. There are tons of experienced storage professionals here who are happy to 
share their knowledge and advice.

On March 27, 2019 7:23:35 AM PDT, richard lucassen  
wrote:
>Hello list,
>
>glusterfs 5.4-1 on Debian Buster (both servers and clients)
>
>I'm quite new to GFS and it's an old problem I know. When running a
>simple "ls -alR" on a local directory containing 50MB and 3468 files it
>takes:
>
>real0m0.567s
>user0m0.084s
>sys 0m0.168s
>
>Same thing for a copy of that dir on GFS takes more than 5 seconds:
>
>real0m5.557s
>user0m0.128s
>sys 0m0.208s
>
>Ok. But from my workstation at home, an "ls -alR" of that directory
>takes more than half an hour and the upload is more than 2GB (no typo:
>TWO Gigabytes). To keep it simple, the ls of a few directories:
>
>$ time ls
>all  xabc-db  xabc-dc1  xabc-gluster  xabc-mail  xabc-otp  xabc-smtp
>
>real0m5.766s
>user0m0.001s
>sys 0m0.003s
>
>it receives 56kB and sends 2.3 MB for a simple ls.
>
>This is weird isn't it? Why this huge upload?
>
>Changing these options mentioned here doesn't make any difference:
>
>https://lists.gluster.org/pipermail/gluster-users/2016-January/024865.html
>
>Anyone a hint? Or should I drop GFS? This is unusable IMHO.
>
>Richard.
>
>-- 
>richard lucassen
>http://contact.xaq.nl/
>___
>Gluster-users mailing list
>Gluster-users@gluster.org
>https://lists.gluster.org/mailman/listinfo/gluster-users

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Prioritise local bricks for IO?

2019-03-27 Thread Poornima Gurusiddaiah
This feature is not under active development as it was not used widely.
AFAIK it's not a supported feature.
+Nithya +Raghavendra for further clarifications.

Regards,
Poornima

On Wed, Mar 27, 2019 at 12:33 PM Lucian  wrote:

> Oh, that's just what the doctor ordered!
> Hope it works, thanks
>
> On 27 March 2019 03:15:57 GMT, Vlad Kopylov  wrote:
>>
>> I don't remember if it still works
>> NUFA
>>
>> https://github.com/gluster/glusterfs-specs/blob/master/done/Features/nufa.md
>>
>> v
>>
>> On Tue, Mar 26, 2019 at 7:27 AM Nux!  wrote:
>>
>>> Hello,
>>>
>>> I'm trying to set up a distributed backup storage (no replicas), but I'd
>>> like to prioritise the local bricks for any IO done on the volume.
>>> This will be a backup stor, so in other words, I'd like the files to be
>>> written locally if there is space, so as to save the NICs for other traffic.
>>>
>>> Anyone knows how this might be achievable, if at all?
>>>
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>>
>>> Nux!
>>> www.nux.ro
>>> ___
>>> Gluster-users mailing list
>>> Gluster-users@gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
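If I remember correctly, the spec linked above describes enabling NUFA as a volume option along the lines of the command below; treat the exact option name as unverified and check the spec and the "gluster volume set help" output before relying on it:

gluster volume set <volname> cluster.nufa enable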

Re: [Gluster-users] glusterfs unusable?

2019-03-27 Thread richard lucassen
On Wed, 27 Mar 2019 14:37:14 +
Marcelo Terres  wrote:

> https://bugzilla.redhat.com/show_bug.cgi?id=1673058

Ok, thanks, I missed that one (I didn't use the proper search arguments, I
guess).

Hope this will resolve the problem. There is a 5.5-1 in Debian
experimental from the 25th of March; I don't think that version will
resolve the issue, as there's no changelog AFAICS. I'll try to compile and
apply the patches tonight or tomorrow.

-- 
richard lucassen
http://contact.xaq.nl/
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] glusterfs unusable?

2019-03-27 Thread Marcelo Terres
https://bugzilla.redhat.com/show_bug.cgi?id=1673058

Regards,

Marcelo H. Terres 
https://www.mundoopensource.com.br
https://twitter.com/mhterres
https://linkedin.com/in/marceloterres


On Wed, 27 Mar 2019 at 14:32, richard lucassen 
wrote:

> Hello list,
>
> glusterfs 5.4-1 on Debian Buster (both servers and clients)
>
> I'm quite new to GFS and it's an old problem I know. When running a
> simple "ls -alR" on a local directory containing 50MB and 3468 files it
> takes:
>
> real0m0.567s
> user0m0.084s
> sys 0m0.168s
>
> Same thing for a copy of that dir on GFS takes more than 5 seconds:
>
> real0m5.557s
> user0m0.128s
> sys 0m0.208s
>
> Ok. But from my workstation at home, an "ls -alR" of that directory
> takes more than half an hour and the upload is more than 2GB (no typo:
> TWO Gigabytes). To keep it simple, the ls of a few directories:
>
> $ time ls
> all  xabc-db  xabc-dc1  xabc-gluster  xabc-mail  xabc-otp  xabc-smtp
>
> real0m5.766s
> user0m0.001s
> sys 0m0.003s
>
> it receives 56kB and sends 2.3 MB for a simple ls.
>
> This is weird isn't it? Why this huge upload?
>
> Changing these options mentioned here doesn't make any difference:
>
> https://lists.gluster.org/pipermail/gluster-users/2016-January/024865.html
>
> Anyone a hint? Or should I drop GFS? This is unusable IMHO.
>
> Richard.
>
> --
> richard lucassen
> http://contact.xaq.nl/
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] glusterfs unusable?

2019-03-27 Thread richard lucassen
Hello list,

glusterfs 5.4-1 on Debian Buster (both servers and clients)

I'm quite new to GFS and it's an old problem I know. When running a
simple "ls -alR" on a local directory containing 50MB and 3468 files it
takes:

real0m0.567s
user0m0.084s
sys 0m0.168s

Same thing for a copy of that dir on GFS takes more than 5 seconds:

real0m5.557s
user0m0.128s
sys 0m0.208s

Ok. But from my workstation at home, an "ls -alR" of that directory
takes more than half an hour and the upload is more than 2GB (no typo:
TWO Gigabytes). To keep it simple, the ls of a few directories:

$ time ls
all  xabc-db  xabc-dc1  xabc-gluster  xabc-mail  xabc-otp  xabc-smtp

real0m5.766s
user0m0.001s
sys 0m0.003s

it receives 56kB and sends 2.3 MB for a simple ls.

This is weird isn't it? Why this huge upload?

Changing these options mentioned here doesn't make any difference:

https://lists.gluster.org/pipermail/gluster-users/2016-January/024865.html

Anyone a hint? Or should I drop GFS? This is unusable IMHO.

Richard.

-- 
richard lucassen
http://contact.xaq.nl/
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


[Gluster-users] Gluster GEO replication fault after write over nfs-ganesha

2019-03-27 Thread Alexey Talikov
I have two clusters with dispersed volumes (2+1) with geo-replication.
It works fine as long as I use glusterfs-fuse, but as soon as even one file is written over
nfs-ganesha, replication goes to Fault and recovers only after I remove that file
(sometimes after a stop/start).
I think nfs-ganesha writes the file in some way that causes a problem for
replication.

OSError: [Errno 61] No data available:
'.gfid/9c9514ce-a310-4a1c-a87b-a800a32a99f8'

but if I check over glusterfs mounted with aux-gfid-mount

getfattr -n trusted.glusterfs.pathinfo -e text
/mnt/TEST/.gfid/9c9514ce-a310-4a1c-a87b-a800a32a99f8
getfattr: Removing leading '/' from absolute path names
# file: mnt/TEST/.gfid/9c9514ce-a310-4a1c-a87b-a800a32a99f8
trusted.glusterfs.pathinfo="(
(
))"

File exists
Details available here https://github.com/nfs-ganesha/nfs-ganesha/issues/408
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
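For reference, the status check and the stop/start cycle mentioned above map to the geo-replication CLI roughly as follows; master volume, slave host and slave volume names are placeholders:

gluster volume geo-replication <mastervol> <slavehost>::<slavevol> status
gluster volume geo-replication <mastervol> <slavehost>::<slavevol> stop
gluster volume geo-replication <mastervol> <slavehost>::<slavevol> start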

[Gluster-users] Issues with submounted directories on a client

2019-03-27 Thread Greene, Tami McFarlin
The system is a 5 server, 20 brick distributed system with a hardware 
configured RAID 6 underneath with xfs as filesystem.  This client is a data 
collection node which transfers data to specific directories within one of the 
gluster volumes.

I have a client with submounted directories (glustervolume/project) rather than 
the entire volume.  Some files can be transferred no problem, but others send 
an error about transport endpoint not connected.  The transfer is handled by an 
rsync script triggered as a cron job.

When remotely connected to this client, user access to these files does not 
always behave as the permissions are set – 2770 for directories and 440 for files.  Owners are not 
always able to move the files, processes run as the owners are not always able 
to move files; root is not always allowed to move or delete these files.

This process seemed to work smoothly before adding another server and 4 
storage bricks to the volume, though logs indicate there were intermittent issues at 
least a month before the last server was added.  A new collection device 
has been streaming to this one machine; the issue started the day before.

Is there another level for permissions and ownership that I am not aware of 
that needs to be sync’d?


Tami

Tami McFarlin Greene
Lab Technician
RF, Communications, and Intelligent Systems Group
Electrical and Electronics System Research Division
Oak Ridge National Laboratory
Bldg. 3500, Rm. A15
gree...@ornl.gov  (865) 
643-0401

___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
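For context, the kind of rsync invocation described (the actual script isn't shown, so the paths here are hypothetical) would look something like the line below; -a preserves the permissions and ownership that are reported as misbehaving, and --numeric-ids avoids uid/gid remapping between the collector and the Gluster mount:

rsync -a --numeric-ids /data/collector/ /mnt/glustervolume/project/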

[Gluster-users] Inconsistent issues with a client

2019-03-27 Thread Tami Greene
The system is a 5 server, 20 brick distributed system with a hardware
configured RAID 6 underneath with xfs as filesystem.  This client is a data
collection node which transfers data to specific directories within one of
the gluster volumes.



I have a client with submounted directories (glustervolume/project) rather
than the entire volume.  Some files can be transferred no problem, but
others send an error about transport endpoint not connected.  The transfer
is handled by an rsync script triggered as a cron job.



When remotely connected to this client, user access to these files does not
always behave as the permissions are set – 2770 for directories and 440 for files.  Owners are
not always able to move the files, processes run as the owners are not
always able to move files; root is not always allowed to move or delete
these files.



This process seemed to work smoothly before adding another server and 4
storage bricks to the volume, though logs indicate there were intermittent issues
at least a month before the last server was added.  A new collection
device has been streaming to this one machine; the issue started the day
before.



Is there another level for permissions and ownership that I am not aware of
that needs to be sync’d?


-- 
Tami
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Pranith Kumar Karampuri
On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez  wrote:

> On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>>
>>
>> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez 
>> wrote:
>>
>>> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <
>>> rgowd...@redhat.com> wrote:
>>>


 On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
 wrote:

> Hi Raghavendra,
>
> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
> rgowd...@redhat.com> wrote:
>
>> All,
>>
>> Glusterfs cleans up POSIX locks held on an fd when the client/mount
>> through which those locks are held disconnects from bricks/server. This
>> helps Glusterfs to not run into a stale lock problem later (For eg., if
>> application unlocks while the connection was still down). However, this
>> means the lock is no longer exclusive as other applications/clients can
>> acquire the same lock. To communicate that locks are no longer valid, we
>> are planning to mark the fd (which has POSIX locks) bad on a disconnect 
>> so
>> that any future operations on that fd will fail, forcing the application 
>> to
>> re-open the fd and re-acquire locks it needs [1].
>>
>
> Wouldn't it be better to retake the locks when the brick is
> reconnected if the lock is still in use ?
>

 There is also  a possibility that clients may never reconnect. That's
 the primary reason why bricks assume the worst (client will not reconnect)
 and cleanup the locks.

>>>
>>> True, so it's fine to cleanup the locks. I'm not saying that locks
>>> shouldn't be released on disconnect. The assumption is that if the client
>>> has really died, it will also disconnect from other bricks, who will
>>> release the locks. So, eventually, another client will have enough quorum
>>> to attempt a lock that will succeed. In other words, if a client gets
>>> disconnected from too many bricks simultaneously (loses Quorum), then that
>>> client can be considered as bad and can return errors to the application.
>>> This should also cause to release the locks on the remaining connected
>>> bricks.
>>>
>>> On the other hand, if the disconnection is very short and the client has
>>> not died, it will keep enough locked files (it has quorum) to avoid other
>>> clients to successfully acquire a lock. In this case, if the brick is
>>> reconnected, all existing locks should be reacquired to recover the
>>> original state before the disconnection.
>>>
>>>

> BTW, the referenced bug is not public. Should we open another bug to
> track this ?
>

 I've just opened up the comment to give enough context. I'll open a bug
 upstream too.


>
>
>>
>> Note that with AFR/replicate in picture we can prevent errors to
>> application as long as Quorum number of children "never ever" lost
>> connection with bricks after locks have been acquired. I am using the 
>> term
>> "never ever" as locks are not healed back after re-connection and hence
>> first disconnect would've marked the fd bad and the fd remains so even
>> after re-connection happens. So, its not just Quorum number of children
>> "currently online", but Quorum number of children "never having
>> disconnected with bricks after locks are acquired".
>>
>
> I think this requisite is not feasible. In a distributed file system,
> sooner or later all bricks will be disconnected. It could be because of
> failures or because an upgrade is done, but it will happen.
>
> The difference here is how long are fd's kept open. If applications
> open and close files frequently enough (i.e. the fd is not kept open more
> time than it takes to have more than Quorum bricks disconnected) then
> there's no problem. The problem can only appear on applications that open
> files for a long time and also use posix locks. In this case, the only 
> good
> solution I see is to retake the locks on brick reconnection.
>

 Agree. But lock-healing should be done only by HA layers like AFR/EC as
 only they know whether there are enough online bricks to have prevented any
 conflicting lock. Protocol/client itself doesn't have enough information to
 do that. If it's a plain distribute, I don't see a way to heal locks without
 losing the property of exclusivity of locks.

>>>
>>> Lock-healing of locks acquired while a brick was disconnected need to be
>>> handled by AFR/EC. However, locks already present at the moment of
>>> disconnection could be recovered by client xlator itself as long as the
>>> file has not been closed (which client xlator already knows).
>>>
>>
>> What if another client (say mount-2) took locks at the time of disconnect
>> from mount-1 and modified the file and unlocked? client xlator doing the
>> heal may not be a good idea.
>>
>
> To avoid that we 

Re: [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Xavi Hernandez
On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri 
wrote:

>
>
> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <
>> rgowd...@redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
>>> wrote:
>>>
 Hi Raghavendra,

 On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
 rgowd...@redhat.com> wrote:

> All,
>
> Glusterfs cleans up POSIX locks held on an fd when the client/mount
> through which those locks are held disconnects from bricks/server. This
> helps Glusterfs to not run into a stale lock problem later (For eg., if
> application unlocks while the connection was still down). However, this
> means the lock is no longer exclusive as other applications/clients can
> acquire the same lock. To communicate that locks are no longer valid, we
> are planning to mark the fd (which has POSIX locks) bad on a disconnect so
> that any future operations on that fd will fail, forcing the application 
> to
> re-open the fd and re-acquire locks it needs [1].
>

 Wouldn't it be better to retake the locks when the brick is reconnected
 if the lock is still in use ?

>>>
>>> There is also  a possibility that clients may never reconnect. That's
>>> the primary reason why bricks assume the worst (client will not reconnect)
>>> and cleanup the locks.
>>>
>>
>> True, so it's fine to cleanup the locks. I'm not saying that locks
>> shouldn't be released on disconnect. The assumption is that if the client
>> has really died, it will also disconnect from other bricks, who will
>> release the locks. So, eventually, another client will have enough quorum
>> to attempt a lock that will succeed. In other words, if a client gets
>> disconnected from too many bricks simultaneously (loses Quorum), then that
>> client can be considered as bad and can return errors to the application.
>> This should also cause to release the locks on the remaining connected
>> bricks.
>>
>> On the other hand, if the disconnection is very short and the client has
>> not died, it will keep enough locked files (it has quorum) to avoid other
>> clients to successfully acquire a lock. In this case, if the brick is
>> reconnected, all existing locks should be reacquired to recover the
>> original state before the disconnection.
>>
>>
>>>
 BTW, the referenced bug is not public. Should we open another bug to
 track this ?

>>>
>>> I've just opened up the comment to give enough context. I'll open a bug
>>> upstream too.
>>>
>>>


>
> Note that with AFR/replicate in picture we can prevent errors to
> application as long as Quorum number of children "never ever" lost
> connection with bricks after locks have been acquired. I am using the term
> "never ever" as locks are not healed back after re-connection and hence
> first disconnect would've marked the fd bad and the fd remains so even
> after re-connection happens. So, its not just Quorum number of children
> "currently online", but Quorum number of children "never having
> disconnected with bricks after locks are acquired".
>

 I think this requisite is not feasible. In a distributed file system,
 sooner or later all bricks will be disconnected. It could be because of
 failures or because an upgrade is done, but it will happen.

 The difference here is how long are fd's kept open. If applications
 open and close files frequently enough (i.e. the fd is not kept open more
 time than it takes to have more than Quorum bricks disconnected) then
 there's no problem. The problem can only appear on applications that open
 files for a long time and also use posix locks. In this case, the only good
 solution I see is to retake the locks on brick reconnection.

>>>
>>> Agree. But lock-healing should be done only by HA layers like AFR/EC as
>>> only they know whether there are enough online bricks to have prevented any
>>> conflicting lock. Protocol/client itself doesn't have enough information to
>>> do that. If it's a plain distribute, I don't see a way to heal locks without
>>> losing the property of exclusivity of locks.
>>>
>>
>> Lock-healing of locks acquired while a brick was disconnected need to be
>> handled by AFR/EC. However, locks already present at the moment of
>> disconnection could be recovered by client xlator itself as long as the
>> file has not been closed (which client xlator already knows).
>>
>
> What if another client (say mount-2) took locks at the time of disconnect
> from mount-1 and modified the file and unlocked? client xlator doing the
> heal may not be a good idea.
>

To avoid that, we should ensure that any lock/unlock requests are sent to the
client, even if we know it's disconnected, so that the client xlator can track
them. The alternative is to duplicate and maintain code in both AFR and EC
(and not sure if 

Re: [Gluster-users] cannot add server back to cluster after reinstallation

2019-03-27 Thread Rafi Kavungal Chundattu Parambil


- Original Message -
From: "Atin Mukherjee" 
To: "Rafi Kavungal Chundattu Parambil" , "Riccardo Murri" 

Cc: gluster-users@gluster.org
Sent: Wednesday, March 27, 2019 4:07:42 PM
Subject: Re: [Gluster-users] cannot add server back to cluster after 
reinstallation

On Wed, 27 Mar 2019 at 16:02, Riccardo Murri 
wrote:

> Hello Atin,
>
> > Check cluster.op-version, peer status, volume status output. If they are
> all fine you’re good.
>
> Both `op-version` and `peer status` look fine:
> ```
> # gluster volume get all cluster.max-op-version
> Option  Value
> --  -
> cluster.max-op-version  31202
>
> # gluster peer status
> Number of Peers: 4
>
> Hostname: glusterfs-server-004
> Uuid: 9a5763d2-1941-4e5d-8d33-8d6756f7f318
> State: Peer in Cluster (Connected)
>
> Hostname: glusterfs-server-005
> Uuid: d53398f6-19d4-4633-8bc3-e493dac41789
> State: Peer in Cluster (Connected)
>
> Hostname: glusterfs-server-003
> Uuid: 3c74d2b4-a4f3-42d4-9511-f6174b0a641d
> State: Peer in Cluster (Connected)
>
> Hostname: glusterfs-server-001
> Uuid: 60bcc47e-ccbe-493e-b4ea-d45d63123977
> State: Peer in Cluster (Connected)
> ```
>
> However, `volume status` shows a missing snapshotd on the reinstalled
> server (the 002 one).


I believe you ran this command on 002? And in that case its showing as
localhost.


> We're not using snapshots so I guess this is fine too?


Is features.uss enabled for this volume? Otherwise we don’t show snapd
information in status output.

Rafi - am I correct?

Yes. We don't show snapd information unless USS is enabled, so please check
whether USS is enabled.

You can check with "gluster v get glusterfs features.uss". If you are not using
snapshots, then it doesn't make sense to keep USS on; you can disable it with
"gluster v set glusterfs features.uss disable".


Please note that if you are doing a rolling upgrade, it is not recommended to
make any configuration changes; in that case, disable it after completing
the upgrade.
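
For convenience, the two commands spelled out as they would be run (a sketch,
assuming the volume is named "glusterfs" as in the status output below):

  # check whether USS (User Serviceable Snapshots) is enabled on the volume
  gluster volume get glusterfs features.uss
  # if snapshots are not used, USS can be disabled once the rolling upgrade completes
  gluster volume set glusterfs features.uss disable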


Rafi KC

>
> ```
> # gluster volume status
> Status of volume: glusterfs
> Gluster process TCP Port  RDMA Port  Online
> Pid
>
> --
> Brick glusterfs-server-005:/s
> rv/glusterfs49152 0  Y
>  1410
> Brick glusterfs-server-004:/s
> rv/glusterfs49152 0  Y
>  1416
> Brick glusterfs-server-003:/s
> rv/glusterfs49152 0  Y
>  1520
> Brick glusterfs-server-001:/s
> rv/glusterfs49152 0  Y
>  1266
> Brick glusterfs-server-002:/s
> rv/glusterfs49152 0  Y
>  3011
> Snapshot Daemon on localhostN/A   N/AY
>  3029
> Snapshot Daemon on glusterfs-
> server-001  49153 0  Y
>  1361
> Snapshot Daemon on glusterfs-
> server-005  49153 0  Y
>  1478
> Snapshot Daemon on glusterfs-
> server-004  49153 0  Y
>  1490
> Snapshot Daemon on glusterfs-
> server-003  49153 0  Y
>  1563
>
> Task Status of Volume glusterfs
>
> --
> Task : Rebalance
> ID   : 0eaf6ad1-df95-48f4-b941-17488010ddcc
> Status   : failed
> ```
>
> Thanks,
> Riccardo
>
-- 
--Atin
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Pranith Kumar Karampuri
On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez  wrote:

> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa 
> wrote:
>
>>
>>
>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
>> wrote:
>>
>>> Hi Raghavendra,
>>>
>>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
>>> rgowd...@redhat.com> wrote:
>>>
 All,

 Glusterfs cleans up POSIX locks held on an fd when the client/mount
 through which those locks are held disconnects from bricks/server. This
 helps Glusterfs to not run into a stale lock problem later (For eg., if
 application unlocks while the connection was still down). However, this
 means the lock is no longer exclusive as other applications/clients can
 acquire the same lock. To communicate that locks are no longer valid, we
 are planning to mark the fd (which has POSIX locks) bad on a disconnect so
 that any future operations on that fd will fail, forcing the application to
 re-open the fd and re-acquire locks it needs [1].

>>>
>>> Wouldn't it be better to retake the locks when the brick is reconnected
>>> if the lock is still in use ?
>>>
>>
>> There is also  a possibility that clients may never reconnect. That's the
>> primary reason why bricks assume the worst (client will not reconnect) and
>> cleanup the locks.
>>
>
> True, so it's fine to cleanup the locks. I'm not saying that locks
> shouldn't be released on disconnect. The assumption is that if the client
> has really died, it will also disconnect from other bricks, who will
> release the locks. So, eventually, another client will have enough quorum
> to attempt a lock that will succeed. In other words, if a client gets
> disconnected from too many bricks simultaneously (loses Quorum), then that
> client can be considered as bad and can return errors to the application.
> This should also cause to release the locks on the remaining connected
> bricks.
>
> On the other hand, if the disconnection is very short and the client has
> not died, it will keep enough locked files (it has quorum) to avoid other
> clients to successfully acquire a lock. In this case, if the brick is
> reconnected, all existing locks should be reacquired to recover the
> original state before the disconnection.
>
>
>>
>>> BTW, the referenced bug is not public. Should we open another bug to
>>> track this ?
>>>
>>
>> I've just opened up the comment to give enough context. I'll open a bug
>> upstream too.
>>
>>
>>>
>>>

 Note that with AFR/replicate in picture we can prevent errors to
 application as long as Quorum number of children "never ever" lost
 connection with bricks after locks have been acquired. I am using the term
 "never ever" as locks are not healed back after re-connection and hence
 first disconnect would've marked the fd bad and the fd remains so even
 after re-connection happens. So, its not just Quorum number of children
 "currently online", but Quorum number of children "never having
 disconnected with bricks after locks are acquired".

>>>
>>> I think this requisite is not feasible. In a distributed file system,
>>> sooner or later all bricks will be disconnected. It could be because of
>>> failures or because an upgrade is done, but it will happen.
>>>
>>> The difference here is how long are fd's kept open. If applications open
>>> and close files frequently enough (i.e. the fd is not kept open more time
>>> than it takes to have more than Quorum bricks disconnected) then there's no
>>> problem. The problem can only appear on applications that open files for a
>>> long time and also use posix locks. In this case, the only good solution I
>>> see is to retake the locks on brick reconnection.
>>>
>>
>> Agree. But lock-healing should be done only by HA layers like AFR/EC as
>> only they know whether there are enough online bricks to have prevented any
>> conflicting lock. Protocol/client itself doesn't have enough information to
>> do that. If its a plain distribute, I don't see a way to heal locks without
>> loosing the property of exclusivity of locks.
>>
>
> Lock-healing of locks acquired while a brick was disconnected need to be
> handled by AFR/EC. However, locks already present at the moment of
> disconnection could be recovered by client xlator itself as long as the
> file has not been closed (which client xlator already knows).
>

What if another client (say mount-2) took locks while mount-1 was disconnected,
modified the file, and unlocked? The client xlator doing the heal may not be a
good idea.


>
> Xavi
>
>
>> What I proposed is a short term solution. mid to long term solution
>> should be lock healing feature implemented in AFR/EC. In fact I had this
>> conversation with +Karampuri, Pranith  before
>> posting this msg to ML.
>>
>>
>>>
 However, this use case is not affected if the application don't acquire
 any POSIX locks. So, I am interested in knowing
 * whether your use cases use POSIX locks?
 * Is it feasible 

Re: [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Xavi Hernandez
On Wed, Mar 27, 2019 at 11:54 AM Raghavendra Gowdappa 
wrote:

>
>
> On Wed, Mar 27, 2019 at 4:22 PM Raghavendra Gowdappa 
> wrote:
>
>>
>>
>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
>> wrote:
>>
>>> Hi Raghavendra,
>>>
>>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
>>> rgowd...@redhat.com> wrote:
>>>
 All,

 Glusterfs cleans up POSIX locks held on an fd when the client/mount
 through which those locks are held disconnects from bricks/server. This
 helps Glusterfs to not run into a stale lock problem later (For eg., if
 application unlocks while the connection was still down). However, this
 means the lock is no longer exclusive as other applications/clients can
 acquire the same lock. To communicate that locks are no longer valid, we
 are planning to mark the fd (which has POSIX locks) bad on a disconnect so
 that any future operations on that fd will fail, forcing the application to
 re-open the fd and re-acquire locks it needs [1].

>>>
>>> Wouldn't it be better to retake the locks when the brick is reconnected
>>> if the lock is still in use ?
>>>
>>
>> There is also  a possibility that clients may never reconnect. That's the
>> primary reason why bricks assume the worst (client will not reconnect) and
>> cleanup the locks.
>>
>>
>>> BTW, the referenced bug is not public. Should we open another bug to
>>> track this ?
>>>
>>
>> I've just opened up the comment to give enough context. I'll open a bug
>> upstream too.
>>
>>
>>>
>>>

 Note that with AFR/replicate in picture we can prevent errors to
 application as long as Quorum number of children "never ever" lost
 connection with bricks after locks have been acquired. I am using the term
 "never ever" as locks are not healed back after re-connection and hence
 first disconnect would've marked the fd bad and the fd remains so even
 after re-connection happens. So, its not just Quorum number of children
 "currently online", but Quorum number of children "never having
 disconnected with bricks after locks are acquired".

>>>
>>> I think this requisite is not feasible. In a distributed file system,
>>> sooner or later all bricks will be disconnected. It could be because of
>>> failures or because an upgrade is done, but it will happen.
>>>
>>> The difference here is how long are fd's kept open. If applications open
>>> and close files frequently enough (i.e. the fd is not kept open more time
>>> than it takes to have more than Quorum bricks disconnected) then there's no
>>> problem. The problem can only appear on applications that open files for a
>>> long time and also use posix locks. In this case, the only good solution I
>>> see is to retake the locks on brick reconnection.
>>>
>>
>> Agree. But lock-healing should be done only by HA layers like AFR/EC as
>> only they know whether there are enough online bricks to have prevented any
>> conflicting lock. Protocol/client itself doesn't have enough information to
>> do that. If its a plain distribute, I don't see a way to heal locks without
>> loosing the property of exclusivity of locks.
>>
>> What I proposed is a short term solution. mid to long term solution
>> should be lock healing feature implemented in AFR/EC. In fact I had this
>> conversation with +Karampuri, Pranith  before
>> posting this msg to ML.
>>
>>
>>>
 However, this use case is not affected if the application don't acquire
 any POSIX locks. So, I am interested in knowing
 * whether your use cases use POSIX locks?
 * Is it feasible for your application to re-open fds and re-acquire
 locks on seeing EBADFD errors?

>>>
>>> I think that many applications are not prepared to handle that.
>>>
>>
>> I too suspected that and in fact not too happy with the solution. But
>> went ahead with this mail as I heard implementing lock-heal  in AFR will
>> take time and hence there are no alternative short term solutions.
>>
>
> Also failing loudly is preferred to silently dropping locks.
>

Yes. Silently dropping locks can cause corruption, which is worse. However,
causing application failures doesn't improve the user experience either.

Unfortunately I'm not aware of any other short term solution right now.


>
>>
>>
>>> Xavi
>>>
>>>

 [1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7

 regards,
 Raghavendra

 ___
 Gluster-users mailing list
 Gluster-users@gluster.org
 https://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>>
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Xavi Hernandez
On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa 
wrote:

>
>
> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
> wrote:
>
>> Hi Raghavendra,
>>
>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa 
>> wrote:
>>
>>> All,
>>>
>>> Glusterfs cleans up POSIX locks held on an fd when the client/mount
>>> through which those locks are held disconnects from bricks/server. This
>>> helps Glusterfs to not run into a stale lock problem later (For eg., if
>>> application unlocks while the connection was still down). However, this
>>> means the lock is no longer exclusive as other applications/clients can
>>> acquire the same lock. To communicate that locks are no longer valid, we
>>> are planning to mark the fd (which has POSIX locks) bad on a disconnect so
>>> that any future operations on that fd will fail, forcing the application to
>>> re-open the fd and re-acquire locks it needs [1].
>>>
>>
>> Wouldn't it be better to retake the locks when the brick is reconnected
>> if the lock is still in use ?
>>
>
> There is also  a possibility that clients may never reconnect. That's the
> primary reason why bricks assume the worst (client will not reconnect) and
> cleanup the locks.
>

True, so it's fine to clean up the locks. I'm not saying that locks
shouldn't be released on disconnect. The assumption is that if the client
has really died, it will also disconnect from the other bricks, which will
release the locks. So, eventually, another client will have enough quorum
to attempt a lock that will succeed. In other words, if a client gets
disconnected from too many bricks simultaneously (loses quorum), then that
client can be considered bad and can return errors to the application.
This should also cause the locks on the remaining connected bricks to be
released.

On the other hand, if the disconnection is very short and the client has
not died, it will keep enough locked files (it has quorum) to prevent other
clients from successfully acquiring a lock. In this case, if the brick is
reconnected, all existing locks should be reacquired to recover the
original state before the disconnection.


>
>> BTW, the referenced bug is not public. Should we open another bug to
>> track this ?
>>
>
> I've just opened up the comment to give enough context. I'll open a bug
> upstream too.
>
>
>>
>>
>>>
>>> Note that with AFR/replicate in picture we can prevent errors to
>>> application as long as Quorum number of children "never ever" lost
>>> connection with bricks after locks have been acquired. I am using the term
>>> "never ever" as locks are not healed back after re-connection and hence
>>> first disconnect would've marked the fd bad and the fd remains so even
>>> after re-connection happens. So, its not just Quorum number of children
>>> "currently online", but Quorum number of children "never having
>>> disconnected with bricks after locks are acquired".
>>>
>>
>> I think this requisite is not feasible. In a distributed file system,
>> sooner or later all bricks will be disconnected. It could be because of
>> failures or because an upgrade is done, but it will happen.
>>
>> The difference here is how long are fd's kept open. If applications open
>> and close files frequently enough (i.e. the fd is not kept open more time
>> than it takes to have more than Quorum bricks disconnected) then there's no
>> problem. The problem can only appear on applications that open files for a
>> long time and also use posix locks. In this case, the only good solution I
>> see is to retake the locks on brick reconnection.
>>
>
> Agree. But lock-healing should be done only by HA layers like AFR/EC as
> only they know whether there are enough online bricks to have prevented any
> conflicting lock. Protocol/client itself doesn't have enough information to
> do that. If its a plain distribute, I don't see a way to heal locks without
> loosing the property of exclusivity of locks.
>

Lock-healing of locks acquired while a brick was disconnected needs to be
handled by AFR/EC. However, locks already present at the moment of
disconnection could be recovered by the client xlator itself as long as the
file has not been closed (which the client xlator already knows).

Xavi


> What I proposed is a short term solution. mid to long term solution should
> be lock healing feature implemented in AFR/EC. In fact I had this
> conversation with +Karampuri, Pranith  before
> posting this msg to ML.
>
>
>>
>>> However, this use case is not affected if the application don't acquire
>>> any POSIX locks. So, I am interested in knowing
>>> * whether your use cases use POSIX locks?
>>> * Is it feasible for your application to re-open fds and re-acquire
>>> locks on seeing EBADFD errors?
>>>
>>
>> I think that many applications are not prepared to handle that.
>>
>
> I too suspected that and in fact not too happy with the solution. But went
> ahead with this mail as I heard implementing lock-heal  in AFR will take
> time and hence there are no alternative short term solutions.
>

>
>> 

[Gluster-users] Weird issue with logrotate of bitd.log on GlusterFS 4.1

2019-03-27 Thread Thorgeir Marthinussen
All,

We're seeing some issues with the default logrotate configuration provided for
the bitd.log files.
The logrotate config has a postrotate script that runs "killall -HUP glusterfs"
to make the processes release their file handles and create a new log file, and
it uses "delaycompress".

Recently we noticed that the 'df' reported usage on our /var/log didn't match 
"actual" usage reported with 'du'.
Checking 'lsof', we found that basically all "bitd.log.1" files are listed as
open but "deleted" after logrotate did the compression.

This only applies to the bitrot-daemon logs, none of the other logs.

In addition to this we are also seeing that the bitd.log file is significantly 
larger on the "second" replica-node in the cluster (the "first" node is the one 
used in fstab on the clients).

Please note that we are currently running a two-node replica set. We plan to
introduce an arbiter node, but need to complete some internal testing first, as
one of the volumes currently contains over 20 million files and we are unsure
how the introduction of the arbiter will impact it.
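
For reference, the usual way to convert an existing two-brick replica volume to
replica 2 + arbiter is a single add-brick call; a sketch only, with placeholder
host and brick path, to be run after the internal testing mentioned above:

  # add a third, arbiter brick to an existing 2-way replica volume
  gluster volume add-brick <VOLNAME> replica 3 arbiter 1 arbiter-host:/bricks/arbiter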

We are running glusterfs-4.1.5-1.el7.x86_64


'lsof' output from "first" node
glusterfs 12698 root5w   REG 253,11 611193834 50333986 
/var/log/glusterfs/bitd.log.1 (deleted)
glusterfs 12698 root8w   REG 253,11 611193834 50333986 
/var/log/glusterfs/bitd.log.1 (deleted)
glusterfs 12698 root   12w   REG 253,11 611193834 50333986 
/var/log/glusterfs/bitd.log.1 (deleted)

'lsof' output from "second" node
glusterfs  12742 root5w   REG 253,11 12959954668 50351288 
/var/log/glusterfs/bitd.log.1 (deleted)
glusterfs  12742 root8w   REG 253,11 12959954668 50351288 
/var/log/glusterfs/bitd.log.1 (deleted)
glusterfs  12742 root   11w   REG 253,11 12959954668 50351288 
/var/log/glusterfs/bitd.log.1 (deleted)

Relevant part of logrotate-config
/var/log/glusterfs/*.log {
  sharedscripts
  weekly
  rotate 52
  missingok
  compress
  delaycompress
  notifempty
  postrotate
  /usr/bin/killall -HUP glusterfs > /dev/null 2>&1 || true
  /usr/bin/killall -HUP glusterd > /dev/null 2>&1 || true
  endscript
}
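
A copytruncate-based variant of the bitd stanza is one possible workaround to
avoid leaving deleted-but-open log files behind (a sketch, untested here;
copytruncate keeps the inode the daemon writes to, at the cost of possibly
losing a few lines written during the copy, and makes delaycompress unnecessary):

/var/log/glusterfs/bitd.log {
  weekly
  rotate 52
  missingok
  compress
  notifempty
  copytruncate
}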


Best regards
--
THORGEIR MARTHINUSSEN
Systems Consultant
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Raghavendra Gowdappa
On Wed, Mar 27, 2019 at 4:22 PM Raghavendra Gowdappa 
wrote:

>
>
> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
> wrote:
>
>> Hi Raghavendra,
>>
>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa 
>> wrote:
>>
>>> All,
>>>
>>> Glusterfs cleans up POSIX locks held on an fd when the client/mount
>>> through which those locks are held disconnects from bricks/server. This
>>> helps Glusterfs to not run into a stale lock problem later (For eg., if
>>> application unlocks while the connection was still down). However, this
>>> means the lock is no longer exclusive as other applications/clients can
>>> acquire the same lock. To communicate that locks are no longer valid, we
>>> are planning to mark the fd (which has POSIX locks) bad on a disconnect so
>>> that any future operations on that fd will fail, forcing the application to
>>> re-open the fd and re-acquire locks it needs [1].
>>>
>>
>> Wouldn't it be better to retake the locks when the brick is reconnected
>> if the lock is still in use ?
>>
>
> There is also  a possibility that clients may never reconnect. That's the
> primary reason why bricks assume the worst (client will not reconnect) and
> cleanup the locks.
>
>
>> BTW, the referenced bug is not public. Should we open another bug to
>> track this ?
>>
>
> I've just opened up the comment to give enough context. I'll open a bug
> upstream too.
>
>
>>
>>
>>>
>>> Note that with AFR/replicate in picture we can prevent errors to
>>> application as long as Quorum number of children "never ever" lost
>>> connection with bricks after locks have been acquired. I am using the term
>>> "never ever" as locks are not healed back after re-connection and hence
>>> first disconnect would've marked the fd bad and the fd remains so even
>>> after re-connection happens. So, its not just Quorum number of children
>>> "currently online", but Quorum number of children "never having
>>> disconnected with bricks after locks are acquired".
>>>
>>
>> I think this requisite is not feasible. In a distributed file system,
>> sooner or later all bricks will be disconnected. It could be because of
>> failures or because an upgrade is done, but it will happen.
>>
>> The difference here is how long are fd's kept open. If applications open
>> and close files frequently enough (i.e. the fd is not kept open more time
>> than it takes to have more than Quorum bricks disconnected) then there's no
>> problem. The problem can only appear on applications that open files for a
>> long time and also use posix locks. In this case, the only good solution I
>> see is to retake the locks on brick reconnection.
>>
>
> Agree. But lock-healing should be done only by HA layers like AFR/EC as
> only they know whether there are enough online bricks to have prevented any
> conflicting lock. Protocol/client itself doesn't have enough information to
> do that. If its a plain distribute, I don't see a way to heal locks without
> loosing the property of exclusivity of locks.
>
> What I proposed is a short term solution. mid to long term solution should
> be lock healing feature implemented in AFR/EC. In fact I had this
> conversation with +Karampuri, Pranith  before
> posting this msg to ML.
>
>
>>
>>> However, this use case is not affected if the application don't acquire
>>> any POSIX locks. So, I am interested in knowing
>>> * whether your use cases use POSIX locks?
>>> * Is it feasible for your application to re-open fds and re-acquire
>>> locks on seeing EBADFD errors?
>>>
>>
>> I think that many applications are not prepared to handle that.
>>
>
> I too suspected that and in fact not too happy with the solution. But went
> ahead with this mail as I heard implementing lock-heal  in AFR will take
> time and hence there are no alternative short term solutions.
>

Also failing loudly is preferred to silently dropping locks.


>
>
>> Xavi
>>
>>
>>>
>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7
>>>
>>> regards,
>>> Raghavendra
>>>
>>> ___
>>> Gluster-users mailing list
>>> Gluster-users@gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>
>>
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Raghavendra Gowdappa
On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez  wrote:

> Hi Raghavendra,
>
> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa 
> wrote:
>
>> All,
>>
>> Glusterfs cleans up POSIX locks held on an fd when the client/mount
>> through which those locks are held disconnects from bricks/server. This
>> helps Glusterfs to not run into a stale lock problem later (For eg., if
>> application unlocks while the connection was still down). However, this
>> means the lock is no longer exclusive as other applications/clients can
>> acquire the same lock. To communicate that locks are no longer valid, we
>> are planning to mark the fd (which has POSIX locks) bad on a disconnect so
>> that any future operations on that fd will fail, forcing the application to
>> re-open the fd and re-acquire locks it needs [1].
>>
>
> Wouldn't it be better to retake the locks when the brick is reconnected if
> the lock is still in use ?
>

There is also  a possibility that clients may never reconnect. That's the
primary reason why bricks assume the worst (client will not reconnect) and
cleanup the locks.


> BTW, the referenced bug is not public. Should we open another bug to track
> this ?
>

I've just opened up the comment to give enough context. I'll open a bug
upstream too.


>
>
>>
>> Note that with AFR/replicate in picture we can prevent errors to
>> application as long as Quorum number of children "never ever" lost
>> connection with bricks after locks have been acquired. I am using the term
>> "never ever" as locks are not healed back after re-connection and hence
>> first disconnect would've marked the fd bad and the fd remains so even
>> after re-connection happens. So, its not just Quorum number of children
>> "currently online", but Quorum number of children "never having
>> disconnected with bricks after locks are acquired".
>>
>
> I think this requisite is not feasible. In a distributed file system,
> sooner or later all bricks will be disconnected. It could be because of
> failures or because an upgrade is done, but it will happen.
>
> The difference here is how long are fd's kept open. If applications open
> and close files frequently enough (i.e. the fd is not kept open more time
> than it takes to have more than Quorum bricks disconnected) then there's no
> problem. The problem can only appear on applications that open files for a
> long time and also use posix locks. In this case, the only good solution I
> see is to retake the locks on brick reconnection.
>

Agree. But lock-healing should be done only by HA layers like AFR/EC, as
only they know whether there are enough online bricks to have prevented any
conflicting lock. Protocol/client itself doesn't have enough information to
do that. If it's a plain distribute volume, I don't see a way to heal locks
without losing the property of exclusivity of locks.

What I proposed is a short-term solution. The mid-to-long-term solution should
be a lock-healing feature implemented in AFR/EC. In fact I had this
conversation with +Karampuri, Pranith  before posting
this message to the ML.


>
>> However, this use case is not affected if the application don't acquire
>> any POSIX locks. So, I am interested in knowing
>> * whether your use cases use POSIX locks?
>> * Is it feasible for your application to re-open fds and re-acquire locks
>> on seeing EBADFD errors?
>>
>
> I think that many applications are not prepared to handle that.
>

I too suspected that, and in fact I am not too happy with the solution. But I
went ahead with this mail as I heard implementing lock-heal in AFR will take
time and hence there are no alternative short-term solutions.
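
For anyone trying to answer the first question above (whether a workload
actually takes POSIX locks on the volume), a brick statedump is one way to check
from the server side; a sketch, assuming the default statedump location:

  # trigger a statedump of all bricks of the volume
  gluster volume statedump <VOLNAME>
  # on the brick nodes, look for POSIX lock entries in the dump files
  grep -i posixlk /var/run/gluster/*.dump.*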


> Xavi
>
>
>>
>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7
>>
>> regards,
>> Raghavendra
>>
>> ___
>> Gluster-users mailing list
>> Gluster-users@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>
>
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] cannot add server back to cluster after reinstallation

2019-03-27 Thread Atin Mukherjee
On Wed, 27 Mar 2019 at 16:02, Riccardo Murri 
wrote:

> Hello Atin,
>
> > Check cluster.op-version, peer status, volume status output. If they are
> all fine you’re good.
>
> Both `op-version` and `peer status` look fine:
> ```
> # gluster volume get all cluster.max-op-version
> Option  Value
> --  -
> cluster.max-op-version  31202
>
> # gluster peer status
> Number of Peers: 4
>
> Hostname: glusterfs-server-004
> Uuid: 9a5763d2-1941-4e5d-8d33-8d6756f7f318
> State: Peer in Cluster (Connected)
>
> Hostname: glusterfs-server-005
> Uuid: d53398f6-19d4-4633-8bc3-e493dac41789
> State: Peer in Cluster (Connected)
>
> Hostname: glusterfs-server-003
> Uuid: 3c74d2b4-a4f3-42d4-9511-f6174b0a641d
> State: Peer in Cluster (Connected)
>
> Hostname: glusterfs-server-001
> Uuid: 60bcc47e-ccbe-493e-b4ea-d45d63123977
> State: Peer in Cluster (Connected)
> ```
>
> However, `volume status` shows a missing snapshotd on the reinstalled
> server (the 002 one).


I believe you ran this command on 002? And in that case it's showing as
localhost.


> We're not using snapshots so I guess this is fine too?


Is features.uss enabled for this volume? Otherwise we don’t show snapd
information in status output.

Rafi - am I correct?


>
> ```
> # gluster volume status
> Status of volume: glusterfs
> Gluster process TCP Port  RDMA Port  Online
> Pid
>
> --
> Brick glusterfs-server-005:/s
> rv/glusterfs49152 0  Y
>  1410
> Brick glusterfs-server-004:/s
> rv/glusterfs49152 0  Y
>  1416
> Brick glusterfs-server-003:/s
> rv/glusterfs49152 0  Y
>  1520
> Brick glusterfs-server-001:/s
> rv/glusterfs49152 0  Y
>  1266
> Brick glusterfs-server-002:/s
> rv/glusterfs49152 0  Y
>  3011
> Snapshot Daemon on localhostN/A   N/AY
>  3029
> Snapshot Daemon on glusterfs-
> server-001  49153 0  Y
>  1361
> Snapshot Daemon on glusterfs-
> server-005  49153 0  Y
>  1478
> Snapshot Daemon on glusterfs-
> server-004  49153 0  Y
>  1490
> Snapshot Daemon on glusterfs-
> server-003  49153 0  Y
>  1563
>
> Task Status of Volume glusterfs
>
> --
> Task : Rebalance
> ID   : 0eaf6ad1-df95-48f4-b941-17488010ddcc
> Status   : failed
> ```
>
> Thanks,
> Riccardo
>
-- 
--Atin
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] cannot add server back to cluster after reinstallation

2019-03-27 Thread Riccardo Murri
Hello Atin,

> Check cluster.op-version, peer status, volume status output. If they are all 
> fine you’re good.

Both `op-version` and `peer status` look fine:
```
# gluster volume get all cluster.max-op-version
Option  Value
--  -
cluster.max-op-version  31202

# gluster peer status
Number of Peers: 4

Hostname: glusterfs-server-004
Uuid: 9a5763d2-1941-4e5d-8d33-8d6756f7f318
State: Peer in Cluster (Connected)

Hostname: glusterfs-server-005
Uuid: d53398f6-19d4-4633-8bc3-e493dac41789
State: Peer in Cluster (Connected)

Hostname: glusterfs-server-003
Uuid: 3c74d2b4-a4f3-42d4-9511-f6174b0a641d
State: Peer in Cluster (Connected)

Hostname: glusterfs-server-001
Uuid: 60bcc47e-ccbe-493e-b4ea-d45d63123977
State: Peer in Cluster (Connected)
```

However, `volume status` shows a missing snapshotd on the reinstalled
server (the 002 one).
We're not using snapshots so I guess this is fine too?

```
# gluster volume status
Status of volume: glusterfs
Gluster process TCP Port  RDMA Port  Online  Pid
--
Brick glusterfs-server-005:/s
rv/glusterfs49152 0  Y   1410
Brick glusterfs-server-004:/s
rv/glusterfs49152 0  Y   1416
Brick glusterfs-server-003:/s
rv/glusterfs49152 0  Y   1520
Brick glusterfs-server-001:/s
rv/glusterfs49152 0  Y   1266
Brick glusterfs-server-002:/s
rv/glusterfs49152 0  Y   3011
Snapshot Daemon on localhostN/A   N/AY   3029
Snapshot Daemon on glusterfs-
server-001  49153 0  Y   1361
Snapshot Daemon on glusterfs-
server-005  49153 0  Y   1478
Snapshot Daemon on glusterfs-
server-004  49153 0  Y   1490
Snapshot Daemon on glusterfs-
server-003  49153 0  Y   1563

Task Status of Volume glusterfs
--
Task : Rebalance
ID   : 0eaf6ad1-df95-48f4-b941-17488010ddcc
Status   : failed
```

Thanks,
Riccardo
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] cannot add server back to cluster after reinstallation

2019-03-27 Thread Atin Mukherjee
On Wed, 27 Mar 2019 at 15:24, Riccardo Murri 
wrote:

> I managed to put the reinstalled server back into connected state with
> this procedure:
>
> 1. Run `for other_server in ...; do gluster peer probe $other_server;
> done` on the reinstalled server
> 2. Now all the peers on the reinstalled server show up as "Accepted
> Peer Request", which I fixed with the procedure outlined in the last
> paragraph of
> https://docs.gluster.org/en/v3/Troubleshooting/troubleshooting-glusterd/#debugging-glusterd
>
> Can anyone confirm that this is a good way to proceed and I won't be
> heading quickly towards corrupting volume data?


Check cluster.op-version, peer status, volume status output. If they are
all fine you’re good.
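
Concretely, something along these lines (a sketch):

  gluster volume get all cluster.op-version
  gluster peer status
  gluster volume status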


>
> Thanks,
> Riccardo
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
-- 
--Atin
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] cannot add server back to cluster after reinstallation

2019-03-27 Thread Karthik Subrahmanya
+Sanju Rakonde & +Atin Mukherjee, adding glusterd folks who can help here.

On Wed, Mar 27, 2019 at 3:24 PM Riccardo Murri 
wrote:

> I managed to put the reinstalled server back into connected state with
> this procedure:
>
> 1. Run `for other_server in ...; do gluster peer probe $other_server;
> done` on the reinstalled server
> 2. Now all the peers on the reinstalled server show up as "Accepted
> Peer Request", which I fixed with the procedure outlined in the last
> paragraph of
> https://docs.gluster.org/en/v3/Troubleshooting/troubleshooting-glusterd/#debugging-glusterd
>
> Can anyone confirm that this is a good way to proceed and I won't be
> heading quickly towards corrupting volume data?
>
> Thanks,
> Riccardo
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] cannot add server back to cluster after reinstallation

2019-03-27 Thread Riccardo Murri
I managed to put the reinstalled server back into connected state with
this procedure:

1. Run `for other_server in ...; do gluster peer probe $other_server;
done` on the reinstalled server
2. Now all the peers on the reinstalled server show up as "Accepted
Peer Request", which I fixed with the procedure outlined in the last
paragraph of 
https://docs.gluster.org/en/v3/Troubleshooting/troubleshooting-glusterd/#debugging-glusterd

Can anyone confirm that this is a good way to proceed and I won't be
heading quickly towards corrupting volume data?

Thanks,
Riccardo
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Soumya Koduri



On 3/27/19 12:55 PM, Xavi Hernandez wrote:

Hi Raghavendra,

On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa wrote:


All,

Glusterfs cleans up POSIX locks held on an fd when the client/mount
through which those locks are held disconnects from bricks/server.
This helps Glusterfs to not run into a stale lock problem later (For
eg., if application unlocks while the connection was still down).
However, this means the lock is no longer exclusive as other
applications/clients can acquire the same lock. To communicate that
locks are no longer valid, we are planning to mark the fd (which has
POSIX locks) bad on a disconnect so that any future operations on
that fd will fail, forcing the application to re-open the fd and
re-acquire locks it needs [1].


Wouldn't it be better to retake the locks when the brick is reconnected 
if the lock is still in use ?


BTW, the referenced bug is not public. Should we open another bug to 
track this ?



Note that with AFR/replicate in picture we can prevent errors to
application as long as Quorum number of children "never ever" lost
connection with bricks after locks have been acquired. I am using
the term "never ever" as locks are not healed back after
re-connection and hence first disconnect would've marked the fd bad
and the fd remains so even after re-connection happens. So, its not
just Quorum number of children "currently online", but Quorum number
of children "never having disconnected with bricks after locks are
acquired".


I think this requisite is not feasible. In a distributed file system, 
sooner or later all bricks will be disconnected. It could be because of 
failures or because an upgrade is done, but it will happen.


The difference here is how long are fd's kept open. If applications open 
and close files frequently enough (i.e. the fd is not kept open more 
time than it takes to have more than Quorum bricks disconnected) then 
there's no problem. The problem can only appear on applications that 
open files for a long time and also use posix locks. In this case, the 
only good solution I see is to retake the locks on brick reconnection.



However, this use case is not affected if the application don't
acquire any POSIX locks. So, I am interested in knowing
* whether your use cases use POSIX locks?
* Is it feasible for your application to re-open fds and re-acquire
locks on seeing EBADFD errors?


I think that many applications are not prepared to handle that.


+1 to all the points mentioned by Xavi. This has been a day-1 issue for
all the applications using locks (like NFS-Ganesha and Samba). Not many
applications re-open and re-acquire the locks. On receiving EBADFD, that
error is most likely propagated to the application clients.


Agree with Xavi that it's better to heal/re-acquire the locks on brick
reconnect, before the brick accepts any fresh requests. I also suggest making
this healing mechanism generic enough (if possible) to heal any
server-side state (like upcall, leases, etc.).


Thanks,
Soumya



Xavi


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7

regards,
Raghavendra

___
Gluster-users mailing list
Gluster-users@gluster.org 
https://lists.gluster.org/mailman/listinfo/gluster-users


___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] cannot add server back to cluster after reinstallation

2019-03-27 Thread riccardo . murri
Hello,

a couple days ago, the OS disk of one of the server of a local GlusterFS
cluster suffered a bad crash, and I had to reinstall everything from
scratch.

However, when I restart the GlusterFS service on the server that has
been reinstalled, I see that it sends back a "RJT" response to other
servers of the cluster, which then list it as "State: Peer Rejected
(Connected)"; the reinstalled server instead shows "Number of peers: 0".
The DEBUG level log on the reinstalled machine shows these lines after
the peer probe from another server in the cluster:

I [MSGID: 106490] 
[glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: 
Received probe from uuid: 9a5763d2-1941-4e5d-8d33-8d6756f7f318
D [MSGID: 0] [glusterd-peer-utils.c:208:glusterd_peerinfo_find_by_uuid] 
0-management: Friend with uuid: 9a5763d2-1941-4e5d-8d33-8d6756f7f318, not found
D [MSGID: 0] [glusterd-peer-utils.c:234:glusterd_peerinfo_find] 
0-management: Unable to find peer by uuid: 9a5763d2-1941-4e5d-8d33-8d6756f7f318
D [MSGID: 0] [glusterd-peer-utils.c:132:glusterd_peerinfo_find_by_hostname] 
0-management: Unable to find friend: glusterfs-server-004
D [MSGID: 0] [glusterd-peer-utils.c:246:glusterd_peerinfo_find] 
0-management: Unable to find hostname: glusterfs-server-004
I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 
0-glusterd: Responded to glusterfs-server-004 (24007), ret: 0, op_ret: -1

What can I do to re-add the reinstalled server into the cluster?  Is it
safe (= keeps data) to "peer detach" it and then "peer probe" again?

Additional info:

* The actual GlusterFS brick data was on a different disk and so is safe
  and mounted back in the original location.

* I copied back the `/etc/glusterfs/glusterd.vol` from the other servers
  in the cluster and restored the UUID into
  `/var/lib/glusterfs/glusterd.info`

* I have checked that `max.op-version` is the same on all servers of the
  cluster, including the reinstalled one.

* All servers run Ubuntu 16.04

Thanks for any suggestion!

Riccardo
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [ovirt-users] Re: VM disk corruption with LSM on Gluster

2019-03-27 Thread Sander Hoentjen
Hi Krutika, Leo,

Sounds promising. I will test this too, and report back tomorrow (or
maybe sooner, if corruption occurs again).

-- Sander


On 27-03-19 10:00, Krutika Dhananjay wrote:
> This is needed to prevent any inconsistencies stemming from buffered
> writes/caching file data during live VM migration.
> Besides, for Gluster to truly honor direct-io behavior in qemu's
> 'cache=none' mode (which is what oVirt uses),
> one needs to turn on performance.strict-o-direct and disable remote-dio.
>
> -Krutika
>
> On Wed, Mar 27, 2019 at 12:24 PM Leo David wrote:
>
> Hi,
> I can confirm that after setting these two options, I haven't
> encountered disk corruptions anymore.
> The downside, is that at least for me it had a pretty big impact
> on performance.
> The iops really went down - performing  inside vm fio tests.
>
> On Wed, Mar 27, 2019, 07:03 Krutika Dhananjay wrote:
>
> Could you enable strict-o-direct and disable remote-dio on the
> src volume as well, restart the vms on "old" and retry migration?
>
> # gluster volume set  performance.strict-o-direct on
> # gluster volume set  network.remote-dio off
>
> -Krutika
>
> On Tue, Mar 26, 2019 at 10:32 PM Sander Hoentjen wrote:
>
> On 26-03-19 14:23, Sahina Bose wrote:
> > +Krutika Dhananjay and gluster ml
> >
> > On Tue, Mar 26, 2019 at 6:16 PM Sander Hoentjen wrote:
> >> Hello,
> >>
> >> tl;dr We have disk corruption when doing live storage
> migration on oVirt
> >> 4.2 with gluster 3.12.15. Any idea why?
> >>
> >> We have a 3-node oVirt cluster that is both compute and
> gluster-storage.
> >> The manager runs on separate hardware. We are running
> out of space on
> >> this volume, so we added another Gluster volume that is
> bigger, put a
> >> storage domain on it and then we migrated VM's to it
> with LSM. After
> >> some time, we noticed that (some of) the migrated VM's
> had corrupted
> >> filesystems. After moving everything back with
> export-import to the old
> >> domain where possible, and recovering from backups
> where needed we set
> >> off to investigate this issue.
> >>
> >> We are now at the point where we can reproduce this
> issue within a day.
> >> What we have found so far:
> >> 1) The corruption occurs at the very end of the
> replication step, most
> >> probably between START and FINISH of
> diskReplicateFinish, before the
> >> START merge step
> >> 2) In the corrupted VM, at some place where data should
> be, this data is
> >> replaced by zero's. This can be file-contents or a
> directory-structure
> >> or whatever.
> >> 3) The source gluster volume has different settings
> then the destination
> >> (Mostly because the defaults were different at creation
> time):
> >>
> >> Setting                                 old(src)  new(dst)
> >> cluster.op-version                      30800     30800
> (the same)
> >> cluster.max-op-version                  31202     31202
> (the same)
> >> cluster.metadata-self-heal              off       on
> >> cluster.data-self-heal                  off       on
> >> cluster.entry-self-heal                 off       on
> >> performance.low-prio-threads            16        32
> >> performance.strict-o-direct             off       on
> >> network.ping-timeout                    42        30
> >> network.remote-dio                      enable    off
> >> transport.address-family                -         inet
> >> performance.stat-prefetch               off       on
> >> features.shard-block-size               512MB     64MB
> >> cluster.shd-max-threads                 1         8
> >> cluster.shd-wait-qlength                1024      1
> >> cluster.locking-scheme                  full      granular
> >> cluster.granular-entry-heal             no        enable
> >>
> >> 4) To test, we migrate some VM's back and forth. The
> corruption does not
> >> occur every time. To this point it only occurs from old
> to new, but we
> >> don't have enough data-points to be sure about that.

Re: [Gluster-users] [ovirt-users] Re: VM disk corruption with LSM on Gluster

2019-03-27 Thread Krutika Dhananjay
This is needed to prevent any inconsistencies stemming from buffered
writes/caching file data during live VM migration.
Besides, for Gluster to truly honor direct-io behavior in qemu's
'cache=none' mode (which is what oVirt uses),
one needs to turn on performance.strict-o-direct and disable remote-dio.
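
To verify the current values on a volume before and after the change (a sketch;
the corresponding set commands are quoted further down):

  gluster volume get <VOLNAME> performance.strict-o-direct
  gluster volume get <VOLNAME> network.remote-dio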

-Krutika

On Wed, Mar 27, 2019 at 12:24 PM Leo David  wrote:

> Hi,
> I can confirm that after setting these two options, I haven't encountered
> disk corruptions anymore.
> The downside, is that at least for me it had a pretty big impact on
> performance.
> The iops really went down - performing  inside vm fio tests.
>
> On Wed, Mar 27, 2019, 07:03 Krutika Dhananjay  wrote:
>
>> Could you enable strict-o-direct and disable remote-dio on the src volume
>> as well, restart the vms on "old" and retry migration?
>>
>> # gluster volume set  performance.strict-o-direct on
>> # gluster volume set  network.remote-dio off
>>
>> -Krutika
>>
>> On Tue, Mar 26, 2019 at 10:32 PM Sander Hoentjen 
>> wrote:
>>
>>> On 26-03-19 14:23, Sahina Bose wrote:
>>> > +Krutika Dhananjay and gluster ml
>>> >
>>> > On Tue, Mar 26, 2019 at 6:16 PM Sander Hoentjen 
>>> wrote:
>>> >> Hello,
>>> >>
>>> >> tl;dr We have disk corruption when doing live storage migration on
>>> oVirt
>>> >> 4.2 with gluster 3.12.15. Any idea why?
>>> >>
>>> >> We have a 3-node oVirt cluster that is both compute and
>>> gluster-storage.
>>> >> The manager runs on separate hardware. We are running out of space on
>>> >> this volume, so we added another Gluster volume that is bigger, put a
>>> >> storage domain on it and then we migrated VM's to it with LSM. After
>>> >> some time, we noticed that (some of) the migrated VM's had corrupted
>>> >> filesystems. After moving everything back with export-import to the
>>> old
>>> >> domain where possible, and recovering from backups where needed we set
>>> >> off to investigate this issue.
>>> >>
>>> >> We are now at the point where we can reproduce this issue within a
>>> day.
>>> >> What we have found so far:
>>> >> 1) The corruption occurs at the very end of the replication step, most
>>> >> probably between START and FINISH of diskReplicateFinish, before the
>>> >> START merge step
>>> >> 2) In the corrupted VM, at some place where data should be, this data
>>> is
>>> >> replaced by zero's. This can be file-contents or a directory-structure
>>> >> or whatever.
>>> >> 3) The source gluster volume has different settings then the
>>> destination
>>> >> (Mostly because the defaults were different at creation time):
>>> >>
>>> >> Setting old(src)  new(dst)
>>> >> cluster.op-version  30800 30800 (the same)
>>> >> cluster.max-op-version  31202 31202 (the same)
>>> >> cluster.metadata-self-heal  off   on
>>> >> cluster.data-self-heal  off   on
>>> >> cluster.entry-self-heal off   on
>>> >> performance.low-prio-threads1632
>>> >> performance.strict-o-direct off   on
>>> >> network.ping-timeout4230
>>> >> network.remote-dio  enableoff
>>> >> transport.address-family- inet
>>> >> performance.stat-prefetch   off   on
>>> >> features.shard-block-size   512MB 64MB
>>> >> cluster.shd-max-threads 1 8
>>> >> cluster.shd-wait-qlength1024  1
>>> >> cluster.locking-scheme  full  granular
>>> >> cluster.granular-entry-heal noenable
>>> >>
>>> >> 4) To test, we migrate some VM's back and forth. The corruption does
>>> not
>>> >> occur every time. To this point it only occurs from old to new, but we
>>> >> don't have enough data-points to be sure about that.
>>> >>
>>> >> Anybody an idea what is causing the corruption? Is this the best list
>>> to
>>> >> ask, or should I ask on a Gluster list? I am not sure if this is oVirt
>>> >> specific or Gluster specific though.
>>> > Do you have logs from old and new gluster volumes? Any errors in the
>>> > new volume's fuse mount logs?
>>>
>>> Around the time of corruption I see the message:
>>> The message "I [MSGID: 133017] [shard.c:4941:shard_seek]
>>> 0-ZoneA_Gluster1-shard: seek called on
>>> 7fabc273-3d8a-4a49-8906-b8ccbea4a49f. [Operation not supported]" repeated
>>> 231 times between [2019-03-26 13:14:22.297333] and [2019-03-26
>>> 13:15:42.912170]
>>>
>>> I also see this message at other times, when I don't see the corruption
>>> occur, though.
>>>
>>> --
>>> Sander
>>> ___
>>> Users mailing list -- us...@ovirt.org
>>> To unsubscribe send an email to users-le...@ovirt.org
>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>> oVirt Code of Conduct:
>>> https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives:
>>> 

[Gluster-users] what versions are packaged for what Linux distro?

2019-03-27 Thread Riccardo Murri
Hello,

following the announcement of GlusterFS 6, I tried to install the
package from the Ubuntu PPA on a 16.04 "xenial" machine, only to find
out that GlusterFS 6 is only packaged for Ubuntu "bionic" and up.

Is there an online page with a table or matrix detailing what versions
are packaged for what Linux distribution?

Thanks,
Riccardo
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Xavi Hernandez
Hi Raghavendra,

On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa 
wrote:

> All,
>
> Glusterfs cleans up POSIX locks held on an fd when the client/mount
> through which those locks are held disconnects from bricks/server. This
> helps Glusterfs to not run into a stale lock problem later (For eg., if
> application unlocks while the connection was still down). However, this
> means the lock is no longer exclusive as other applications/clients can
> acquire the same lock. To communicate that locks are no longer valid, we
> are planning to mark the fd (which has POSIX locks) bad on a disconnect so
> that any future operations on that fd will fail, forcing the application to
> re-open the fd and re-acquire locks it needs [1].
>

Wouldn't it be better to retake the locks when the brick is reconnected if
the lock is still in use ?

BTW, the referenced bug is not public. Should we open another bug to track
this ?


>
> Note that with AFR/replicate in picture we can prevent errors to
> application as long as Quorum number of children "never ever" lost
> connection with bricks after locks have been acquired. I am using the term
> "never ever" as locks are not healed back after re-connection and hence
> first disconnect would've marked the fd bad and the fd remains so even
> after re-connection happens. So, its not just Quorum number of children
> "currently online", but Quorum number of children "never having
> disconnected with bricks after locks are acquired".
>

I think this requisite is not feasible. In a distributed file system,
sooner or later all bricks will be disconnected. It could be because of
failures or because an upgrade is done, but it will happen.

The difference here is how long are fd's kept open. If applications open
and close files frequently enough (i.e. the fd is not kept open more time
than it takes to have more than Quorum bricks disconnected) then there's no
problem. The problem can only appear on applications that open files for a
long time and also use posix locks. In this case, the only good solution I
see is to retake the locks on brick reconnection.


> However, this use case is not affected if the application don't acquire
> any POSIX locks. So, I am interested in knowing
> * whether your use cases use POSIX locks?
> * Is it feasible for your application to re-open fds and re-acquire locks
> on seeing EBADFD errors?
>

I think that many applications are not prepared to handle that.

Xavi


>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7
>
> regards,
> Raghavendra
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Prioritise local bricks for IO?

2019-03-27 Thread Lucian
Oh, that's just what the doctor ordered! 
Hope it works, thanks
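
For reference, one way to check whether the NUFA option mentioned below is still
exposed in your release, and how it would then be enabled per volume (a sketch;
verify the exact option name from the help output first):

  # check whether the option exists in this release
  gluster volume set help | grep -i nufa
  # if present, enable it on the volume
  gluster volume set <VOLNAME> cluster.nufa enable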

On 27 March 2019 03:15:57 GMT, Vlad Kopylov  wrote:
>I don't remember if it still works
>NUFA
>https://github.com/gluster/glusterfs-specs/blob/master/done/Features/nufa.md
>
>v
>
>On Tue, Mar 26, 2019 at 7:27 AM Nux!  wrote:
>
>> Hello,
>>
>> I'm trying to set up a distributed backup storage (no replicas), but
>I'd
>> like to prioritise the local bricks for any IO done on the volume.
>> This will be a backup stor, so in other words, I'd like the files to
>be
>> written locally if there is space, so as to save the NICs for other
>traffic.
>>
>> Anyone knows how this might be achievable, if at all?
>>
>> --
>> Sent from the Delta quadrant using Borg technology!
>>
>> Nux!
>> www.nux.ro
>> ___
>> Gluster-users mailing list
>> Gluster-users@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users