Re: [Gluster-devel] How long should metrics collection on a cluster take?

2018-07-26 Thread Pranith Kumar Karampuri
On Thu, Jul 26, 2018 at 8:30 PM, John Strunk wrote:

> It is configurable. Use the default as a notion of scale... 5s may become
> 30s; it won't be 5m.
> Also remember, this is the maximum, not the minimum. A change to a watched
> kube resource will cause an immediate reconcile. The periodic, timer-based
> loop is just a fallback to catch state changes not represented in the kube
> API.
>

Cool, got it. Let us wait and see if anyone has objections to the proposed
solution.

I request everyone to comment if they see any issues with
https://github.com/gluster/glusterd2/issues/1069
I think the EC/AFR/Quota components will definitely be affected by this
approach. CCing them.
Please feel free to CC anyone who works on commands that require a mount to
report status.


>
> On Thu, Jul 26, 2018 at 12:57 AM Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>>
>>
>> On Thu, Jul 26, 2018 at 9:59 AM, Pranith Kumar Karampuri <
>> pkara...@redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Jul 25, 2018 at 10:48 PM, John Strunk wrote:
>>>
 I have not put together a list. Perhaps the following will help w/ the
 context though...

 The "reconcile loop" of the operator will take the cluster CRs and
 reconcile them against the actual cluster config. At the 20k foot level,
 this amounts to something like determining there should be 8 gluster pods
 running, and making the appropriate changes if that doesn't match reality.
 In practical terms, the construction of this reconciliation loop can be
 thought of as a set (array) of 3-tuples: [{should_act() -> bool, can_act() ->
 bool, action() -> (ok, error)}, {..., ..., ...}, ...]

 Each capability of the operator would be expressed as one of these
 tuples.
 should_act() : true if the action() should be taken
 can_act() : true if the prerequisites for taking the action are met
 action() : make the change. Only run if should && can.
 (note that I believe should_act() and can_act() should not be separate
 in the implementation, for reasons I'll not go into here)
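
 To make the shape concrete, here is a rough Go sketch (purely illustrative,
 not actual operator code; ClusterState and its fields are hypothetical names
 for the snapshot of queried state):

     // Hypothetical snapshot of everything queried from the kube API and the
     // gluster cluster at the start of a reconcile cycle.
     type ClusterState struct {
         RunningImage    string // image currently running in the gluster pods
         DesiredImage    string // image requested in the operator CRs
         PendingHeals    int    // unhealed entries across all bricks
         NodesUp         int    // gluster nodes currently up
         NodeCount       int    // total gluster nodes in the cluster
         ImageCompatible bool   // desired image compatible with CSI driver/operator
     }

     // One capability of the operator, expressed as the 3-tuple above.
     type Capability struct {
         ShouldAct func(s *ClusterState) bool  // the action should be taken
         CanAct    func(s *ClusterState) bool  // prerequisites for the action are met
         Action    func(s *ClusterState) error // make the change; run only if both are true
     }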

 An example action might be "upgrade the container image for pod X". The
 associated should_act would be triggered if the "image=" of the pod doesn't
 match the desired "image=" in the operator CRs. The can_act evaluation
 would be verifying that it's ok to do this... Thinking off the top of my
 head:
 - All volumes w/ a brick on this pod should be fully healed
 - Sufficient cluster nodes should be up such that quorum is not lost
 when this node goes down (does this matter?)
 - The proposed image is compatible with the current version of the CSI
 driver(s), the operator, and other gluster pods
 - Probably some other stuff
 The action() would update the "image=" in the Deployment to trigger the
 rollout.
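
 Continuing the sketch above, the image-upgrade example might be wired up like
 this (again illustrative; the checks are simplified stand-ins for the real
 precondition queries):

     // Illustrative "upgrade the container image for pod X" capability.
     var upgradeImage = Capability{
         ShouldAct: func(s *ClusterState) bool {
             return s.RunningImage != s.DesiredImage
         },
         CanAct: func(s *ClusterState) bool {
             return s.PendingHeals == 0 && // all volumes with bricks here are healed
                 s.NodesUp-1 > s.NodeCount/2 && // quorum survives taking a node down
                 s.ImageCompatible // compatible with CSI driver, operator, other pods
         },
         Action: func(s *ClusterState) error {
             // Update "image=" on the Deployment through the kube client to
             // trigger the rollout (the actual client call is elided here).
             return nil
         },
     }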

 The idea is that queries would be made, both to the kube API and the
 gluster cluster, to verify the necessary preconditions for an action prior
 to that action being invoked. There would obviously be commonality among
 the preconditions for various actions, so the results should be fetched
 exactly once per reconcile cycle. Also note, 1 cycle == at most 1 action()
 due to the action changing the state of the system.
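
 Putting it together, one pass of the reconcile loop could be as simple as
 this sketch:

     // One reconcile cycle: the ClusterState snapshot is fetched exactly once
     // before this is called, and at most one action runs per cycle.
     func reconcileOnce(s *ClusterState, caps []Capability) error {
         for _, c := range caps {
             if c.ShouldAct(s) && c.CanAct(s) {
                 return c.Action(s) // 1 cycle == at most 1 action()
             }
         }
         return nil // nothing to do; reality already matches the CRs
     }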

 Given that we haven't designed (or even listed) all the potential
 action()s, I can't give you a list of everything to query. I guarantee
 we'll need to know the up/down status, heal counts, and free capacity for
 each brick and node.

>>>
>>> Thanks for the detailed explanation. This helps. One question, though: is
>>> 5 seconds a hard limit, or can it be configured?
>>>
>>
>> I put together an idea for reducing the mgmt operation latency involving
>> mounts at https://github.com/gluster/glusterd2/issues/1069, comments
>> welcome.
>> @john I still want to know whether the hard limit can be configured...
>>
>>
>>>
>>>

 -John

 On Wed, Jul 25, 2018 at 11:56 AM Pranith Kumar Karampuri <
 pkara...@redhat.com> wrote:

>
>
> On Wed, Jul 25, 2018 at 8:17 PM, John Strunk wrote:
>
>> To add an additional data point... The operator will need to
>> regularly reconcile the true state of the gluster cluster with the desired
>> state stored in kubernetes. This task will be required frequently (i.e.,
>> operator-framework defaults to every 5s even if there are no config
>> changes).
>> The actual amount of data we will need to query from the cluster is
>> currently TBD and likely significantly affected by Heketi/GD1 vs. GD2
>> choice.
>>
>
> Do we have even a partial list of the data we will gather? Just trying to
> get an early sense of what this might entail...
>
>
>>
>> -John
>>
>>
>> On Wed, Jul 25, 2018 at 5:45 AM Pranith Kumar Karampuri <
>> pkara...@redhat.com> wrote:
>>
>>>

Re: [Gluster-devel] How long should metrics collection on a cluster take?

2018-07-26 Thread John Strunk
It is configurable. Use the default as a notion of scale... 5s may become
30s; it won't be 5m.
Also remember, this is the maximum, not the minimum. A change to a watched kube
resource will cause an immediate reconcile. The periodic, timer-based loop
is just a fallback to catch state changes not represented in the kube API.
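
For illustration only: with controller-runtime (which operator-framework
builds on), the fallback period typically shows up as a requeue interval
returned from Reconcile. A minimal sketch, where the 30s value and the
reconciler body are hypothetical:

    package controller

    import (
        "time"

        "sigs.k8s.io/controller-runtime/pkg/reconcile"
    )

    type glusterReconciler struct{}

    func (r *glusterReconciler) Reconcile(req reconcile.Request) (reconcile.Result, error) {
        // Compare the desired state from the CRs against the actual cluster
        // and act on any difference (omitted here). Changes to watched kube
        // resources trigger Reconcile immediately; RequeueAfter is only the
        // fallback timer described above.
        return reconcile.Result{RequeueAfter: 30 * time.Second}, nil
    }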

On Thu, Jul 26, 2018 at 12:57 AM Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

>
>
> On Thu, Jul 26, 2018 at 9:59 AM, Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>>
>>
>> On Wed, Jul 25, 2018 at 10:48 PM, John Strunk wrote:
>>
>>> I have not put together a list. Perhaps the following will help w/ the
>>> context though...
>>>
>>> The "reconcile loop" of the operator will take the cluster CRs and
>>> reconcile them against the actual cluster config. At the 20k foot level,
>>> this amounts to something like determining there should be 8 gluster pods
>>> running, and making the appropriate changes if that doesn't match reality.
>>> In practical terms, the construction of this reconciliation loop can be
>>> thought of as a set (array) of 3-tuples: [{should_act() -> bool, can_act() ->
>>> bool, action() -> (ok, error)}, {..., ..., ...}, ...]
>>>
>>> Each capability of the operator would be expressed as one of these
>>> tuples.
>>> should_act() : true if the action() should be taken
>>> can_act() : true if the prerequisites for taking the action are met
>>> action() : make the change. Only run if should && can.
>>> (note that I believe should_act() and can_act() should not be separate
>>> in the implementation, for reasons I'll not go into here)
>>>
>>> An example action might be "upgrade the container image for pod X". The
>>> associated should_act would be triggered if the "image=" of the pod doesn't
>>> match the desired "image=" in the operator CRs. The can_act evaluation
>>> would be verifying that it's ok to do this... Thinking off the top of my
>>> head:
>>> - All volumes w/ a brick on this pod should be fully healed
>>> - Sufficient cluster nodes should be up such that quorum is not lost
>>> when this node goes down (does this matter?)
>>> - The proposed image is compatible with the current version of the CSI
>>> driver(s), the operator, and other gluster pods
>>> - Probably some other stuff
>>> The action() would update the "image=" in the Deployment to trigger the
>>> rollout.
>>>
>>> The idea is that queries would be made, both to the kube API and the
>>> gluster cluster, to verify the necessary preconditions for an action prior
>>> to that action being invoked. There would obviously be commonality among
>>> the preconditions for various actions, so the results should be fetched
>>> exactly once per reconcile cycle. Also note, 1 cycle == at most 1 action()
>>> due to the action changing the state of the system.
>>>
>>> Given that we haven't designed (or even listed) all the potential
>>> action()s, I can't give you a list of everything to query. I guarantee
>>> we'll need to know the up/down status, heal counts, and free capacity for
>>> each brick and node.
>>>
>>
>> Thanks for the detailed explanation. This helps. One question, though: is
>> 5 seconds a hard limit, or can it be configured?
>>
>
> I put together an idea for reducing the mgmt operation latency involving
> mounts at https://github.com/gluster/glusterd2/issues/1069, comments
> welcome.
> @john I still want to know whether the hard limit can be configured...
>
>
>>
>>
>>>
>>> -John
>>>
>>> On Wed, Jul 25, 2018 at 11:56 AM Pranith Kumar Karampuri <
>>> pkara...@redhat.com> wrote:
>>>


 On Wed, Jul 25, 2018 at 8:17 PM, John Strunk wrote:

> To add an additional data point... The operator will need to regularly
> reconcile the true state of the gluster cluster with the desired state
> stored in kubernetes. This task will be required frequently (i.e.,
> operator-framework defaults to every 5s even if there are no config
> changes).
> The actual amount of data we will need to query from the cluster is
> currently TBD and likely significantly affected by Heketi/GD1 vs. GD2
> choice.
>

 Do we have even a partial list of the data we will gather? Just trying to
 get an early sense of what this might entail...


>
> -John
>
>
> On Wed, Jul 25, 2018 at 5:45 AM Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>>
>>
>> On Tue, Jul 24, 2018 at 10:10 PM, Sankarshan Mukhopadhyay <
>> sankarshan.mukhopadh...@gmail.com> wrote:
>>
>>> On Tue, Jul 24, 2018 at 9:48 PM, Pranith Kumar Karampuri wrote:
>>> > hi,
>>> >   Quite a few commands to monitor gluster at the moment take almost a
>>> > second to give output.
>>>
>>> Is this at the (most) minimum recommended cluster size?
>>>
>>
>> Yes, with a single volume with 3 bricks, i.e. 3 nodes in the cluster.
>>
>>
>>>
>>> > Some categories of 

[Gluster-devel] Coverity covscan for 2018-07-26-2836e158 (master branch)

2018-07-26 Thread staticanalysis


GlusterFS Coverity covscan results for the master branch are available from
http://download.gluster.org/pub/gluster/glusterfs/static-analysis/master/glusterfs-coverity/2018-07-26-2836e158/

Coverity covscan results for other active branches are also available at
http://download.gluster.org/pub/gluster/glusterfs/static-analysis/

___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Maintainer's Meeting on 23rd July, 2018: Meeting minutes

2018-07-26 Thread Amar Tumballi
BJ Link

   - Bridge: https://bluejeans.com/217609845
   - Download: https://bluejeans.com/s/qGMqd

Attendance

   - Amar, Kaleb, Nigel, Ravi, Vijay, Shyam, Rafi, Nithya, Kaushal, Pranith

Agenda

   -

   AI from previous meetings:
   - AI-1: Python3/Python2 discussion, take it to closure:
 - We now have agreed upon how it all looks.
 - Will be running tests on Fedora 28 for python3, and CentOS for
 Python2
 - AI: Shyam to update release notes - DONE
  - AI-2: Coding Standard - Clang format
 - All set to do it this week.
 - There is more work on coding standard agreement.
 - AI: Amar to send a proposal on going live
  -

   Documentation Hackathon:
   - Review help needed: http://bit.ly/gluster-doc-hack-report
  - Help more, fix your component.
  - Hold another hackathon with more advance notice
   -

   Coding Standard:
   - Need to improve the existing coding standard properly, and point to it
  clearly in the developer documentation.
  - For example, one point added here: https://review.gluster.org/20540
  - More suggestions welcome
   -

   Commit message:
   - While there is a standard we ‘try’ to follow, it is not enforced. Can
  we try to get it documented?
  - Sample here @ github. Review it in Gerrit.
  - Can we bring back the gerrit URL to commit messages? There is a
  wealth of information and review comments captured there and going back
  to see the discussions later on is becoming a pain point.
 - One way to get that is by using notes in a local repo:
 https://gerrit.googlesource.com/plugins/reviewnotes/+/master/src/main/resources/Documentation/refs-notes-review.md
 - Also, in the repo do git config notes.displayRef notes/ref after
  setting up the notes remote.
  -

   Infrastructure:
   - Now the regression failure output appears in Gerrit itself, so please
  check the reason for the failure before re-triggering regression.
   -

   Emails on Feature classification & Sunset of a few features:
   - Have a look @ the Proposal for deprecation and Classification of
   Features emails.
   - What happens to tests when we sunset components?
 - We will tag tests that map to components that we don’t support
 anymore.
  - We will no longer run tests that are not relevant to releases.
  - What is the difference between sunset and deprecated?
  - We seem to be using them with meanings opposite to how other
  projects use them.
 - Sunset - We will remove it in the future.
 - Deprecated - We are going to remove it in the next release
  - Add anything that is missing to the emails, or even just your thoughts.
   -

   Mountpoint.IO 
   - Who all are attending?
 - kshlm (visa dependent)
 - nigelb (depends on visa)
 - amarts (depends on visa)
  - Gobinda Das (depends on visa)
  -

   Release *v5.0*
   - Have you tagged the feature you are working on for the next release?
  - Feature tagging and a post on the devel list about proposed features
  would be awesome!
  -

   Status update from other projects?
   - GlusterD2
 - Focus on GCS
  - Automatic volume provisioning is in alpha state
  - Ongoing work on the transaction framework, snapshots, etc.
  - NFS-Ganesha
 - Upcoming Bakeathon in September
  - storhaug being integrated with gd2?
  -

   Round Table:
   - Kaleb: Coverity tool updated to 2018-06, 50 more defects observed
  - Possible move back to gerrit for gd2 reviews

-Amar


On Fri, Jul 20, 2018 at 5:22 PM, Amar Tumballi wrote:

> BJ Link
>
>- Bridge: https://bluejeans.com/217609845
>- Download:
>
> Attendance
>
>- 
>- 
>
> Agenda
>
>-
>
>AI from previous meetings:
>- AI-1: Python3/Python2 discussion, take it to closure:
>  - We now have agreed upon how it all looks.
>  - Will be running tests on Fedora 28 for python3, and CentOS for
>  Python2
>   - AI-2: Coding Standard - Clang format
>  - All set to do it this week.
>  - There is more work on coding standard agreement.
>   -
>
>Documentation Hackathon :
>- Review help needed: