Re: [Gluster-devel] How long should metrics collection on a cluster take?

2018-07-25 Thread Pranith Kumar Karampuri
On Thu, Jul 26, 2018 at 9:59 AM, Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

>
>
> On Wed, Jul 25, 2018 at 10:48 PM, John Strunk  wrote:
>
>> I have not put together a list. Perhaps the following will help w/ the
>> context though...
>>
>> The "reconcile loop" of the operator will take the cluster CRs and
>> reconcile them against the actual cluster config. At the 20k foot level,
>> this amounts to something like determining there should be 8 gluster pods
>> running, and making the appropriate changes if that doesn't match reality.
>> In practical terms, the construction of this reconciliation loop can be
>> thought of as a set (array) of 3-tuples: [{should_act() -> bool, can_act ->
>> bool, action() -> ok, error}, {..., ..., ...}, ...]
>>
>> Each capability of the operator would be expressed as one of these tuples.
>> should_act() : true if the action() should be taken
>> can_act() : true if the prerequisites for taking the action are met
>> action() : make the change. Only run if should && can.
>> (note that I believe should_act() and can_act() should not be separate in
>> the implementation, for reasons I'll not go into here)
>>
>> An example action might be "upgrade the container image for pod X". The
>> associated should_act would be triggered if the "image=" of the pod doesn't
>> match the desired "image=" in the operator CRs. The can_act evaluation
>> would be verifying that it's ok to do this... Thinking from the top of my
>> head:
>> - All volumes w/ a brick on this pod should be fully healed
>> - Sufficient cluster nodes should be up such that quorum is not lost when
>> this node goes down (does this matter?)
>> - The proposed image is compatible with the current version of the CSI
>> driver(s), the operator, and other gluster pods
>> - Probably some other stuff
>> The action() would update the "image=" in the Deployment to trigger the
>> rollout
>>
>> The idea is that queries would be made, both to the kube API and the
>> gluster cluster to verify the necessary preconditions for an action prior
>> to that action being invoked. There would obviously be commonality among
>> the preconditions for various actions, so the results should be fetched
>> exactly once per reconcile cycle. Also note, 1 cycle == at most 1 action()
>> due to the action changing the state of the system.
>>
>> Given that we haven't designed (or even listed) all the potential
>> action()s, I can't give you a list of everything to query. I guarantee
>> we'll need to know the up/down status, heal counts, and free capacity for
>> each brick and node.
>>
>
> Thanks for the detailed explanation. This helps. One question though, is 5
> seconds a hard limit or is there a possibility to configure it?
>

I put together an idea for reducing the mgmt operation latency involving
mounts at https://github.com/gluster/glusterd2/issues/1069, comments
welcome.
@john Still want to know if the hard limit can be configured...
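
(As an illustration of the idea, here is a hypothetical Go sketch, not taken
from the issue: the gist is to pay the glfs_init-style cost once per volume
and reuse the handle across metric queries. The trade-off, as noted elsewhere
in this thread, is the memory held by long-lived mounts.)

package metrics

import "sync"

// VolumeHandle is a hypothetical wrapper around a mounted volume
// (something in the spirit of a glfs_init handle); it is not an actual
// gluster or gd2 API.
type VolumeHandle interface {
	Statfs() (free uint64, total uint64, err error)
	Close() error
}

// HandleCache keeps one long-lived handle per volume so that repeated
// metric collections do not pay the init cost every time. The cost is
// the memory held by the cached handles.
type HandleCache struct {
	mu      sync.Mutex
	open    func(volname string) (VolumeHandle, error) // expensive init (~0.3s per the thread)
	handles map[string]VolumeHandle
}

func NewHandleCache(open func(string) (VolumeHandle, error)) *HandleCache {
	return &HandleCache{open: open, handles: make(map[string]VolumeHandle)}
}

// Get returns a cached handle, initializing it on first use.
func (c *HandleCache) Get(volname string) (VolumeHandle, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if h, ok := c.handles[volname]; ok {
		return h, nil
	}
	h, err := c.open(volname)
	if err != nil {
		return nil, err
	}
	c.handles[volname] = h
	return h, nil
}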


>
>
>>
>> -John
>>
>> On Wed, Jul 25, 2018 at 11:56 AM Pranith Kumar Karampuri <
>> pkara...@redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Jul 25, 2018 at 8:17 PM, John Strunk  wrote:
>>>
 To add an additional data point... The operator will need to regularly
 reconcile the true state of the gluster cluster with the desired state
 stored in kubernetes. This task will be required frequently (i.e.,
 operator-framework defaults to every 5s even if there are no config
 changes).
 The actual amount of data we will need to query from the cluster is
 currently TBD and likely significantly affected by Heketi/GD1 vs. GD2
 choice.

>>>
>>> Do we have any partial list of data we will gather? Just want to
>>> understand what this might entail already...
>>>
>>>

 -John


 On Wed, Jul 25, 2018 at 5:45 AM Pranith Kumar Karampuri <
 pkara...@redhat.com> wrote:

>
>
> On Tue, Jul 24, 2018 at 10:10 PM, Sankarshan Mukhopadhyay <
> sankarshan.mukhopadh...@gmail.com> wrote:
>
>> On Tue, Jul 24, 2018 at 9:48 PM, Pranith Kumar Karampuri
>>  wrote:
>> > hi,
>> >   Quite a few commands to monitor gluster at the moment take
>> almost a
>> > second to give output.
>>
>> Is this at the (most) minimum recommended cluster size?
>>
>
> Yes, with a single volume with 3 bricks i.e. 3 nodes in cluster.
>
>
>>
>> > Some categories of these commands:
>> > 1) Any command that needs to do some sort of mount/glfs_init.
>> >  Examples: 1) heal info family of commands 2) statfs to find
>> > space-availability etc (On my laptop replica 3 volume with all
>> local bricks,
>> > glfs_init takes 0.3 seconds on average)
>> > 2) glusterd commands that need to wait for the previous command to
>> unlock.
>> > If the previous command is something related to lvm snapshot which
>> takes
>> > quite a few seconds, it would be even more time consuming.

Re: [Gluster-devel] Release 5: Master branch health report (Week of 23rd July)

2018-07-25 Thread Nigel Babu
Replies inline

On Thu, Jul 26, 2018 at 1:48 AM Shyam Ranganathan 
wrote:

> On 07/24/2018 03:28 PM, Shyam Ranganathan wrote:
> > On 07/24/2018 03:12 PM, Shyam Ranganathan wrote:
> >> 1) master branch health checks (weekly, till branching)
> >>   - Expect every Monday a status update on various tests runs
> >
> > See https://build.gluster.org/job/nightly-master/ for a report on
> > various nightly and periodic jobs on master.
> >
> > RED:
> > 1. Nightly regression
> > 2. Regression with multiplex (cores and test failures)
> > 3. line-coverage (cores and test failures)
>
> The failures for line coverage issues are filed as the following BZs:
> 1) Parent BZ for nightly line coverage failure:
> https://bugzilla.redhat.com/show_bug.cgi?id=1608564
>
> 2) glusterd crash in test sdfs-sanity.t:
> https://bugzilla.redhat.com/show_bug.cgi?id=1608566
>
> glusterd folks, request you to take a look to correct this.
>
> 3) bug-1432542-mpx-restart-crash.t times out consistently:
> https://bugzilla.redhat.com/show_bug.cgi?id=1608568
>
> @nigel is there a way to request lcov tests on demand through gerrit? I
> am thinking of pushing a patch that increases the timeout and checking if
> it solves the problem for this test as detailed in the bug.
>

You should have access to trigger the job from Jenkins. Does that work for
now?


>
> >
> > Calling out to contributors to take a look at various failures, and post
> > the same as bugs AND to the lists (so that duplication is avoided) to
> > get this to a GREEN status.
> >
> > GREEN:
> > 1. cpp-check
> > 2. RPM builds
> >
> > IGNORE (for now):
> > 1. clang scan (@nigel, this job requires clang warnings to be fixed to
> > go green, right?)
>

So there are two ways. Back when I first ran it, I set a limit on how many
clang failures we have. If we went above the number, the job would turn
yellow. The current threshold is 955 and we're at 1001. What would be
useful is for us to fix a few bugs a week and keep bumping this limit
down.


> >
> > Shyam
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-devel
> >
>


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] How long should metrics collection on a cluster take?

2018-07-25 Thread Sankarshan Mukhopadhyay
On Wed, Jul 25, 2018 at 11:53 PM, Yaniv Kaul  wrote:
>
>
> On Tue, Jul 24, 2018, 7:20 PM Pranith Kumar Karampuri 
> wrote:
>>
>> hi,
>>   Quite a few commands to monitor gluster at the moment take almost a
>> second to give output.
>> Some categories of these commands:
>> 1) Any command that needs to do some sort of mount/glfs_init.
>>  Examples: 1) heal info family of commands 2) statfs to find
>> space-availability etc (On my laptop replica 3 volume with all local bricks,
>> glfs_init takes 0.3 seconds on average)
>> 2) glusterd commands that need to wait for the previous command to unlock.
>> If the previous command is something related to lvm snapshot which takes
>> quite a few seconds, it would be even more time consuming.
>>
>> Nowadays container workloads have hundreds of volumes if not thousands. If
>> we want to serve any monitoring solution at this scale (I have seen
>> customers use upto 600 volumes at a time, it will only get bigger) and lets
>> say collecting metrics per volume takes 2 seconds per volume(Let us take the
>> worst example which has all major features enabled like
>> snapshot/geo-rep/quota etc etc), that will mean that it will take 20 minutes
>> to collect metrics of the cluster with 600 volumes. What are the ways in
>> which we can make this number more manageable? I was initially thinking may
>> be it is possible to get gd2 to execute commands in parallel on different
>> volumes, so potentially we could get this done in ~2 seconds. But quite a
>> few of the metrics need a mount or equivalent of a mount(glfs_init) to
>> collect different information like statfs, number of pending heals, quota
>> usage etc. This may lead to high memory usage as the size of the mounts tend
>> to be high.
>>
>> I wanted to seek suggestions from others on how to come to a conclusion
>> about which path to take and what problems to solve.
>
>
> I would imagine that in gd2 world:
> 1. All stats would be in etcd.
> 2. There will be a single API call for GetALLVolumesStats or something and
> we won't be asking the client to loop, or there will be a similar efficient
> single API to query and deliver stats for some volumes in a batch ('all
> bricks in host X' for example).
>

Single end point for metrics/monitoring was a topic that was not
agreed upon at 

> Worth looking how it's implemented elsewhere in K8S.
>
> In any case, when asking for metrics I assume the latest already available
> would be returned and we are not going to fetch them when queried. This is
> both fragile (imagine an entity that doesn't respond well) and adds latency
> and will be inaccurate anyway a split second later.
>
> Y.



-- 
sankarshan mukhopadhyay

___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How long should metrics collection on a cluster take?

2018-07-25 Thread Pranith Kumar Karampuri
On Wed, Jul 25, 2018 at 10:48 PM, John Strunk  wrote:

> I have not put together a list. Perhaps the following will help w/ the
> context though...
>
> The "reconcile loop" of the operator will take the cluster CRs and
> reconcile them against the actual cluster config. At the 20k foot level,
> this amounts to something like determining there should be 8 gluster pods
> running, and making the appropriate changes if that doesn't match reality.
> In practical terms, the construction of this reconciliation loop can be
> thought of as a set (array) of 3-tuples: [{should_act() -> bool, can_act ->
> bool, action() -> ok, error}, {..., ..., ...}, ...]
>
> Each capability of the operator would be expressed as one of these tuples.
> should_act() : true if the action() should be taken
> can_act() : true if the prerequisites for taking the action are met
> action() : make the change. Only run if should && can.
> (note that I believe should_act() and can_act() should not be separate in
> the implementation, for reasons I'll not go into here)
>
> An example action might be "upgrade the container image for pod X". The
> associated should_act would be triggered if the "image=" of the pod doesn't
> match the desired "image=" in the operator CRs. The can_act evaluation
> would be verifying that it's ok to do this... Thinking from the top of my
> head:
> - All volumes w/ a brick on this pod should be fully healed
> - Sufficient cluster nodes should be up such that quorum is not lost when
> this node goes down (does this matter?)
> - The proposed image is compatible with the current version of the CSI
> driver(s), the operator, and other gluster pods
> - Probably some other stuff
> The action() would update the "image=" in the Deployment to trigger the
> rollout
>
> The idea is that queries would be made, both to the kube API and the
> gluster cluster to verify the necessary preconditions for an action prior
> to that action being invoked. There would obviously be commonality among
> the preconditions for various actions, so the results should be fetched
> exactly once per reconcile cycle. Also note, 1 cycle == at most 1 action()
> due to the action changing the state of the system.
>
> Given that we haven't designed (or even listed) all the potential
> action()s, I can't give you a list of everything to query. I guarantee
> we'll need to know the up/down status, heal counts, and free capacity for
> each brick and node.
>

Thanks for the detailed explanation. This helps. One question though, is 5
seconds a hard limit or is there a possibility to configure it?


>
> -John
>
> On Wed, Jul 25, 2018 at 11:56 AM Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>>
>>
>> On Wed, Jul 25, 2018 at 8:17 PM, John Strunk  wrote:
>>
>>> To add an additional data point... The operator will need to regularly
>>> reconcile the true state of the gluster cluster with the desired state
>>> stored in kubernetes. This task will be required frequently (i.e.,
>>> operator-framework defaults to every 5s even if there are no config
>>> changes).
>>> The actual amount of data we will need to query from the cluster is
>>> currently TBD and likely significantly affected by Heketi/GD1 vs. GD2
>>> choice.
>>>
>>
>> Do we have any partial list of data we will gather? Just want to
>> understand what this might entail already...
>>
>>
>>>
>>> -John
>>>
>>>
>>> On Wed, Jul 25, 2018 at 5:45 AM Pranith Kumar Karampuri <
>>> pkara...@redhat.com> wrote:
>>>


 On Tue, Jul 24, 2018 at 10:10 PM, Sankarshan Mukhopadhyay <
 sankarshan.mukhopadh...@gmail.com> wrote:

> On Tue, Jul 24, 2018 at 9:48 PM, Pranith Kumar Karampuri
>  wrote:
> > hi,
> >   Quite a few commands to monitor gluster at the moment take
> almost a
> > second to give output.
>
> Is this at the (most) minimum recommended cluster size?
>

 Yes, with a single volume with 3 bricks i.e. 3 nodes in cluster.


>
> > Some categories of these commands:
> > 1) Any command that needs to do some sort of mount/glfs_init.
> >  Examples: 1) heal info family of commands 2) statfs to find
> > space-availability etc (On my laptop replica 3 volume with all local
> bricks,
> > glfs_init takes 0.3 seconds on average)
> > 2) glusterd commands that need to wait for the previous command to
> unlock.
> > If the previous command is something related to lvm snapshot which
> takes
> > quite a few seconds, it would be even more time consuming.
> >
> > Nowadays container workloads have hundreds of volumes if not
> thousands. If
> > we want to serve any monitoring solution at this scale (I have seen
> > customers use upto 600 volumes at a time, it will only get bigger)
> and lets
> > say collecting metrics per volume takes 2 seconds per volume(Let us
> take the
> > worst example which has all major features enabled like
> > snapshot/geo-rep/quota etc etc),

Re: [Gluster-devel] How long should metrics collection on a cluster take?

2018-07-25 Thread Aravinda Vishwanathapura Krishna Murthy
On Tue, Jul 24, 2018 at 10:11 PM Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> On Tue, Jul 24, 2018 at 9:48 PM, Pranith Kumar Karampuri
>  wrote:
> > hi,
> >   Quite a few commands to monitor gluster at the moment take almost a
> > second to give output.
>
> Is this at the (most) minimum recommended cluster size?
>
> > Some categories of these commands:
> > 1) Any command that needs to do some sort of mount/glfs_init.
> >  Examples: 1) heal info family of commands 2) statfs to find
> > space-availability etc (On my laptop replica 3 volume with all local
> bricks,
> > glfs_init takes 0.3 seconds on average)
> > 2) glusterd commands that need to wait for the previous command to
> unlock.
> > If the previous command is something related to lvm snapshot which takes
> > quite a few seconds, it would be even more time consuming.
> >
> > Nowadays container workloads have hundreds of volumes if not thousands.
> If
> > we want to serve any monitoring solution at this scale (I have seen
> > customers use upto 600 volumes at a time, it will only get bigger) and
> lets
> > say collecting metrics per volume takes 2 seconds per volume(Let us take
> the
> > worst example which has all major features enabled like
> > snapshot/geo-rep/quota etc etc), that will mean that it will take 20
> minutes
> > to collect metrics of the cluster with 600 volumes. What are the ways in
> > which we can make this number more manageable? I was initially thinking
> may
> > be it is possible to get gd2 to execute commands in parallel on different
> > volumes, so potentially we could get this done in ~2 seconds. But quite a
> > few of the metrics need a mount or equivalent of a mount(glfs_init) to
> > collect different information like statfs, number of pending heals, quota
> > usage etc. This may lead to high memory usage as the size of the mounts
> tend
> > to be high.
> >
>
> I am not sure if starting from the "worst example" (it certainly is
> not) is a good place to start from. That said, for any environment
> with that number of disposable volumes, what kind of metrics do
> actually make any sense/impact?
>

This is a really interesting question. When we have a large number of
disposable volumes, I think we need cluster-level metrics, like the space
available for creating more volumes, rather than per-volume utilization. (If
we need to observe the usage patterns of applications, then we need
per-volume utilization as well.)


>
> > I wanted to seek suggestions from others on how to come to a conclusion
> > about which path to take and what problems to solve.
> >
> > I will be happy to raise github issues based on our conclusions on this
> mail
> > thread.
> >
> > --
> > Pranith
> >
>
>
>
>
>
> --
> sankarshan mukhopadhyay
> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel


--
regards
Aravinda VK
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] How long should metrics collection on a cluster take?

2018-07-25 Thread Aravinda Vishwanathapura Krishna Murthy
On Wed, Jul 25, 2018 at 11:54 PM Yaniv Kaul  wrote:

>
>
> On Tue, Jul 24, 2018, 7:20 PM Pranith Kumar Karampuri 
> wrote:
>
>> hi,
>>   Quite a few commands to monitor gluster at the moment take almost a
>> second to give output.
>> Some categories of these commands:
>> 1) Any command that needs to do some sort of mount/glfs_init.
>>  Examples: 1) heal info family of commands 2) statfs to find
>> space-availability etc (On my laptop replica 3 volume with all local
>> bricks, glfs_init takes 0.3 seconds on average)
>> 2) glusterd commands that need to wait for the previous command to
>> unlock. If the previous command is something related to lvm snapshot which
>> takes quite a few seconds, it would be even more time consuming.
>>
>> Nowadays container workloads have hundreds of volumes if not thousands.
>> If we want to serve any monitoring solution at this scale (I have seen
>> customers use upto 600 volumes at a time, it will only get bigger) and lets
>> say collecting metrics per volume takes 2 seconds per volume(Let us take
>> the worst example which has all major features enabled like
>> snapshot/geo-rep/quota etc etc), that will mean that it will take 20
>> minutes to collect metrics of the cluster with 600 volumes. What are the
>> ways in which we can make this number more manageable? I was initially
>> thinking may be it is possible to get gd2 to execute commands in parallel
>> on different volumes, so potentially we could get this done in ~2 seconds.
>> But quite a few of the metrics need a mount or equivalent of a
>> mount(glfs_init) to collect different information like statfs, number of
>> pending heals, quota usage etc. This may lead to high memory usage as the
>> size of the mounts tend to be high.
>>
>> I wanted to seek suggestions from others on how to come to a conclusion
>> about which path to take and what problems to solve.
>>
>
> I would imagine that in gd2 world:
> 1. All stats would be in etcd.
>

Only static state information is stored in etcd by gd2. For real-time status,
gd2 still has to reach the respective nodes to collect the details. For
example, volume utilization is changed by multiple mounts which are
external to gd2, so to keep track of real-time status gd2 has to poll brick
utilization on every node and update etcd.



> 2. There will be a single API call for GetALLVolumesStats or something and
> we won't be asking the client to loop, or there will be a similar efficient
> single API to query and deliver stats for some volumes in a batch ('all
> bricks in host X' for example).
>

A single API is available for volume stats, but this API is expensive because
the real-time stats are not stored in etcd.
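
(A rough illustration of this pattern, hypothetical Go code rather than the
actual gd2 implementation: a per-node background poller refreshes brick
utilization on an interval and keeps the latest snapshot, so the stats API
can answer from the snapshot instead of touching bricks on every request.)

package monitor

import (
	"sync"
	"time"
)

// BrickStats is a hypothetical per-brick utilization snapshot.
type BrickStats struct {
	Volume    string
	Brick     string
	UsedBytes uint64
	FreeBytes uint64
	Collected time.Time
}

// Poller periodically refreshes brick stats for the local node and
// keeps the latest snapshot in memory (a real implementation might
// push it to etcd or expose it through a stats API instead).
type Poller struct {
	mu       sync.RWMutex
	latest   []BrickStats
	collect  func() ([]BrickStats, error) // the expensive local collection
	interval time.Duration
}

func NewPoller(collect func() ([]BrickStats, error), interval time.Duration) *Poller {
	return &Poller{collect: collect, interval: interval}
}

// Run loops until stopped, refreshing the snapshot on every tick.
func (p *Poller) Run(stop <-chan struct{}) {
	ticker := time.NewTicker(p.interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if stats, err := p.collect(); err == nil {
				p.mu.Lock()
				p.latest = stats
				p.mu.Unlock()
			}
		}
	}
}

// Latest returns the most recent snapshot; callers get slightly stale
// data but never pay the collection cost on the query path.
func (p *Poller) Latest() []BrickStats {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.latest
}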


>
> Worth looking how it's implemented elsewhere in K8S.
>
> In any case, when asking for metrics I assume the latest already available
> would be returned and we are not going to fetch them when queried. This is
> both fragile (imagine an entity that doesn't respond well) and adds latency
> and will be inaccurate anyway a split second later.
>
> Y.
>
>
>
>> I will be happy to raise github issues based on our conclusions on this
>> mail thread.
>>
>> --
>> Pranith
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel


--
regards
Aravinda VK
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Release 5: Master branch health report (Week of 23rd July)

2018-07-25 Thread Shyam Ranganathan
On 07/25/2018 04:18 PM, Shyam Ranganathan wrote:
> 2) glusterd crash in test sdfs-sanity.t:
> https://bugzilla.redhat.com/show_bug.cgi?id=1608566
> 
> glusterd folks, request you to take a look to correct this.

Persisted with this a little longer and the fix is posted at
https://review.gluster.org/#/c/20565/ (reviews welcome)
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Release 5: Master branch health report (Week of 23rd July)

2018-07-25 Thread Shyam Ranganathan
On 07/24/2018 03:28 PM, Shyam Ranganathan wrote:
> On 07/24/2018 03:12 PM, Shyam Ranganathan wrote:
>> 1) master branch health checks (weekly, till branching)
>>   - Expect every Monday a status update on various tests runs
> 
> See https://build.gluster.org/job/nightly-master/ for a report on
> various nightly and periodic jobs on master.
> 
> RED:
> 1. Nightly regression
> 2. Regression with multiplex (cores and test failures)
> 3. line-coverage (cores and test failures)

The failures for line coverage issues are filed as the following BZs:
1) Parent BZ for nightly line coverage failure:
https://bugzilla.redhat.com/show_bug.cgi?id=1608564

2) glusterd crash in test sdfs-sanity.t:
https://bugzilla.redhat.com/show_bug.cgi?id=1608566

glusterd folks, request you to take a look to correct this.

3) bug-1432542-mpx-restart-crash.t times out consistently:
https://bugzilla.redhat.com/show_bug.cgi?id=1608568

@nigel is there a way to request lcov tests on demand through gerrit? I
am thinking of pushing a patch that increases the timeout and checking if
it solves the problem for this test as detailed in the bug.

> 
> Calling out to contributors to take a look at various failures, and post
> the same as bugs AND to the lists (so that duplication is avoided) to
> get this to a GREEN status.
> 
> GREEN:
> 1. cpp-check
> 2. RPM builds
> 
> IGNORE (for now):
> 1. clang scan (@nigel, this job requires clang warnings to be fixed to
> go green, right?)
> 
> Shyam
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] How long should metrics collection on a cluster take?

2018-07-25 Thread Yaniv Kaul
On Tue, Jul 24, 2018, 7:20 PM Pranith Kumar Karampuri 
wrote:

> hi,
>   Quite a few commands to monitor gluster at the moment take almost a
> second to give output.
> Some categories of these commands:
> 1) Any command that needs to do some sort of mount/glfs_init.
>  Examples: 1) heal info family of commands 2) statfs to find
> space-availability etc (On my laptop replica 3 volume with all local
> bricks, glfs_init takes 0.3 seconds on average)
> 2) glusterd commands that need to wait for the previous command to unlock.
> If the previous command is something related to lvm snapshot which takes
> quite a few seconds, it would be even more time consuming.
>
> Nowadays container workloads have hundreds of volumes if not thousands. If
> we want to serve any monitoring solution at this scale (I have seen
> customers use upto 600 volumes at a time, it will only get bigger) and lets
> say collecting metrics per volume takes 2 seconds per volume(Let us take
> the worst example which has all major features enabled like
> snapshot/geo-rep/quota etc etc), that will mean that it will take 20
> minutes to collect metrics of the cluster with 600 volumes. What are the
> ways in which we can make this number more manageable? I was initially
> thinking may be it is possible to get gd2 to execute commands in parallel
> on different volumes, so potentially we could get this done in ~2 seconds.
> But quite a few of the metrics need a mount or equivalent of a
> mount(glfs_init) to collect different information like statfs, number of
> pending heals, quota usage etc. This may lead to high memory usage as the
> size of the mounts tend to be high.
>
> I wanted to seek suggestions from others on how to come to a conclusion
> about which path to take and what problems to solve.
>

I would imagine that in gd2 world:
1. All stats would be in etcd.
2. There will be a single API call for GetALLVolumesStats or something and
we won't be asking the client to loop, or there will be a similar efficient
single API to query and deliver stats for some volumes in a batch ('all
bricks in host X' for example).

Worth looking how it's implemented elsewhere in K8S.

In any case, when asking for metrics I assume the latest already available
would be returned and we are not going to fetch them when queried. This is
both fragile (imagine an entity that doesn't respond well) and adds latency
and will be inaccurate anyway a split second later.

Y.
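
(To put numbers on the parallel-collection idea quoted above, here is a
hedged Go sketch, not gd2 code, of a bounded worker pool: at roughly 2
seconds per volume, 600 volumes spread over 20 workers finish in about a
minute instead of 20 minutes serially, and raising the worker count trades
memory, i.e. more concurrent mounts, for latency.)

package metrics

import "sync"

// VolumeMetrics is a hypothetical per-volume result.
type VolumeMetrics struct {
	Volume string
	Err    error
	// statfs, pending heal counts, quota usage, ... would go here
}

// CollectAll fans out per-volume collection across a bounded number of
// workers so that total time is roughly (volumes / workers) * per-volume
// cost, while limiting how many mounts are held concurrently.
func CollectAll(volumes []string, workers int,
	collect func(vol string) VolumeMetrics) []VolumeMetrics {

	jobs := make(chan string)
	results := make(chan VolumeMetrics)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for vol := range jobs {
				results <- collect(vol)
			}
		}()
	}

	// Feed the jobs, then close the results channel once all workers finish.
	go func() {
		for _, v := range volumes {
			jobs <- v
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	out := make([]VolumeMetrics, 0, len(volumes))
	for r := range results {
		out = append(out, r)
	}
	return out
}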



> I will be happy to raise github issues based on our conclusions on this
> mail thread.
>
> --
> Pranith
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] How long should metrics collection on a cluster take?

2018-07-25 Thread John Strunk
I have not put together a list. Perhaps the following will help w/ the
context though...

The "reconcile loop" of the operator will take the cluster CRs and
reconcile them against the actual cluster config. At the 20k foot level,
this amounts to something like determining there should be 8 gluster pods
running, and making the appropriate changes if that doesn't match reality.
In practical terms, the construction of this reconciliation loop can be
thought of as a set (array) of 3-tuples: [{should_act() -> bool, can_act ->
bool, action() -> ok, error}, {..., ..., ...}, ...]

Each capability of the operator would be expressed as one of these tuples.
should_act() : true if the action() should be taken
can_act() : true if the prerequisites for taking the action are met
action() : make the change. Only run if should && can.
(note that I believe should_act() and can_act() should not be separate in
the implementation, for reasons I'll not go into here)
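
(A small Go sketch of that shape, purely illustrative and not the operator's
actual code, with state gathered once per cycle and at most one action
executed:)

package operator

// ClusterState is a placeholder for whatever the operator learns from the
// kube API and the gluster cluster at the start of a cycle (up/down status,
// heal counts, free capacity, ...).
type ClusterState struct{}

// Capability mirrors the 3-tuple described above.
type Capability struct {
	Name      string
	ShouldAct func(s *ClusterState) bool // desired state differs from actual
	CanAct    func(s *ClusterState) bool // preconditions (heals done, quorum, ...)
	Action    func(s *ClusterState) error
}

// Reconcile runs one cycle: the state is gathered exactly once by the caller,
// and at most one action is executed because any action changes the system.
func Reconcile(state *ClusterState, caps []Capability) error {
	for _, c := range caps {
		if c.ShouldAct(state) && c.CanAct(state) {
			return c.Action(state)
		}
	}
	return nil // nothing to do this cycle
}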

An example action might be "upgrade the container image for pod X". The
associated should_act would be triggered if the "image=" of the pod doesn't
match the desired "image=" in the operator CRs. The can_act evaluation
would be verifying that it's ok to do this... Thinking from the top of my
head:
- All volumes w/ a brick on this pod should be fully healed
- Sufficient cluster nodes should be up such that quorum is not lost when
this node goes down (does this matter?)
- The proposed image is compatible with the current version of the CSI
driver(s), the operator, and other gluster pods
- Probably some other stuff
The action() would update the "image=" in the Deployment to trigger the
rollout

The idea is that queries would be made, both to the kube API and the
gluster cluster to verify the necessary preconditions for an action prior
to that action being invoked. There would obviously be commonality among
the preconditions for various actions, so the results should be fetched
exactly once per reconcile cycle. Also note, 1 cycle == at most 1 action()
due to the action changing the state of the system.

Given that we haven't designed (or even listed) all the potential
action()s, I can't give you a list of everything to query. I guarantee
we'll need to know the up/down status, heal counts, and free capacity for
each brick and node.

-John

On Wed, Jul 25, 2018 at 11:56 AM Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

>
>
> On Wed, Jul 25, 2018 at 8:17 PM, John Strunk  wrote:
>
>> To add an additional data point... The operator will need to regularly
>> reconcile the true state of the gluster cluster with the desired state
>> stored in kubernetes. This task will be required frequently (i.e.,
>> operator-framework defaults to every 5s even if there are no config
>> changes).
>> The actual amount of data we will need to query from the cluster is
>> currently TBD and likely significantly affected by Heketi/GD1 vs. GD2
>> choice.
>>
>
> Do we have any partial list of data we will gather? Just want to
> understand what this might entail already...
>
>
>>
>> -John
>>
>>
>> On Wed, Jul 25, 2018 at 5:45 AM Pranith Kumar Karampuri <
>> pkara...@redhat.com> wrote:
>>
>>>
>>>
>>> On Tue, Jul 24, 2018 at 10:10 PM, Sankarshan Mukhopadhyay <
>>> sankarshan.mukhopadh...@gmail.com> wrote:
>>>
 On Tue, Jul 24, 2018 at 9:48 PM, Pranith Kumar Karampuri
  wrote:
 > hi,
 >   Quite a few commands to monitor gluster at the moment take
 almost a
 > second to give output.

 Is this at the (most) minimum recommended cluster size?

>>>
>>> Yes, with a single volume with 3 bricks i.e. 3 nodes in cluster.
>>>
>>>

 > Some categories of these commands:
 > 1) Any command that needs to do some sort of mount/glfs_init.
 >  Examples: 1) heal info family of commands 2) statfs to find
 > space-availability etc (On my laptop replica 3 volume with all local
 bricks,
 > glfs_init takes 0.3 seconds on average)
 > 2) glusterd commands that need to wait for the previous command to
 unlock.
 > If the previous command is something related to lvm snapshot which
 takes
 > quite a few seconds, it would be even more time consuming.
 >
 > Nowadays container workloads have hundreds of volumes if not
 thousands. If
 > we want to serve any monitoring solution at this scale (I have seen
 > customers use upto 600 volumes at a time, it will only get bigger)
 and lets
 > say collecting metrics per volume takes 2 seconds per volume(Let us
 take the
 > worst example which has all major features enabled like
 > snapshot/geo-rep/quota etc etc), that will mean that it will take 20
 minutes
 > to collect metrics of the cluster with 600 volumes. What are the ways
 in
 > which we can make this number more manageable? I was initially
 thinking may
 > be it is possible to get gd2 to execute commands in parallel on
 different
 > volumes, so potentially we could get t

Re: [Gluster-devel] How long should metrics collection on a cluster take?

2018-07-25 Thread Pranith Kumar Karampuri
On Wed, Jul 25, 2018 at 8:17 PM, John Strunk  wrote:

> To add an additional data point... The operator will need to regularly
> reconcile the true state of the gluster cluster with the desired state
> stored in kubernetes. This task will be required frequently (i.e.,
> operator-framework defaults to every 5s even if there are no config
> changes).
> The actual amount of data we will need to query from the cluster is
> currently TBD and likely significantly affected by Heketi/GD1 vs. GD2
> choice.
>

Do we have any partial list of data we will gather? Just want to understand
what this might entail already...


>
> -John
>
>
> On Wed, Jul 25, 2018 at 5:45 AM Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>>
>>
>> On Tue, Jul 24, 2018 at 10:10 PM, Sankarshan Mukhopadhyay <
>> sankarshan.mukhopadh...@gmail.com> wrote:
>>
>>> On Tue, Jul 24, 2018 at 9:48 PM, Pranith Kumar Karampuri
>>>  wrote:
>>> > hi,
>>> >   Quite a few commands to monitor gluster at the moment take
>>> almost a
>>> > second to give output.
>>>
>>> Is this at the (most) minimum recommended cluster size?
>>>
>>
>> Yes, with a single volume with 3 bricks i.e. 3 nodes in cluster.
>>
>>
>>>
>>> > Some categories of these commands:
>>> > 1) Any command that needs to do some sort of mount/glfs_init.
>>> >  Examples: 1) heal info family of commands 2) statfs to find
>>> > space-availability etc (On my laptop replica 3 volume with all local
>>> bricks,
>>> > glfs_init takes 0.3 seconds on average)
>>> > 2) glusterd commands that need to wait for the previous command to
>>> unlock.
>>> > If the previous command is something related to lvm snapshot which
>>> takes
>>> > quite a few seconds, it would be even more time consuming.
>>> >
>>> > Nowadays container workloads have hundreds of volumes if not
>>> thousands. If
>>> > we want to serve any monitoring solution at this scale (I have seen
>>> > customers use upto 600 volumes at a time, it will only get bigger) and
>>> lets
>>> > say collecting metrics per volume takes 2 seconds per volume(Let us
>>> take the
>>> > worst example which has all major features enabled like
>>> > snapshot/geo-rep/quota etc etc), that will mean that it will take 20
>>> minutes
>>> > to collect metrics of the cluster with 600 volumes. What are the ways
>>> in
>>> > which we can make this number more manageable? I was initially
>>> thinking may
>>> > be it is possible to get gd2 to execute commands in parallel on
>>> different
>>> > volumes, so potentially we could get this done in ~2 seconds. But
>>> quite a
>>> > few of the metrics need a mount or equivalent of a mount(glfs_init) to
>>> > collect different information like statfs, number of pending heals,
>>> quota
>>> > usage etc. This may lead to high memory usage as the size of the
>>> mounts tend
>>> > to be high.
>>> >
>>>
>>> I am not sure if starting from the "worst example" (it certainly is
>>> not) is a good place to start from.
>>
>>
>> I didn't understand your statement. Are you saying 600 volumes is a worst
>> example?
>>
>>
>>> That said, for any environment
>>> with that number of disposable volumes, what kind of metrics do
>>> actually make any sense/impact?
>>>
>>
>> Same metrics you track for long running volumes. It is just that the way
>> the metrics
>> are interpreted will be different. On a long running volume, you would
>> look at the metrics
>> and try to find why is the volume not giving performance as expected in
>> the last 1 hour. Whereas
>> in this case, you would look at metrics and find the reason why volumes
>> that were
>> created and deleted in the last hour didn't give performance as expected.
>>
>>
>>>
>>> > I wanted to seek suggestions from others on how to come to a conclusion
>>> > about which path to take and what problems to solve.
>>> >
>>> > I will be happy to raise github issues based on our conclusions on
>>> this mail
>>> > thread.
>>> >
>>> > --
>>> > Pranith
>>> >
>>>
>>>
>>>
>>>
>>>
>>> --
>>> sankarshan mukhopadhyay
>>> 
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>
>>
>>
>> --
>> Pranith
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>


-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] How long should metrics collection on a cluster take?

2018-07-25 Thread John Strunk
To add an additional data point... The operator will need to regularly
reconcile the true state of the gluster cluster with the desired state
stored in kubernetes. This task will be required frequently (i.e.,
operator-framework defaults to every 5s even if there are no config
changes).
The actual amount of data we will need to query from the cluster is
currently TBD and likely significantly affected by Heketi/GD1 vs. GD2
choice.

-John


On Wed, Jul 25, 2018 at 5:45 AM Pranith Kumar Karampuri 
wrote:

>
>
> On Tue, Jul 24, 2018 at 10:10 PM, Sankarshan Mukhopadhyay <
> sankarshan.mukhopadh...@gmail.com> wrote:
>
>> On Tue, Jul 24, 2018 at 9:48 PM, Pranith Kumar Karampuri
>>  wrote:
>> > hi,
>> >   Quite a few commands to monitor gluster at the moment take almost
>> a
>> > second to give output.
>>
>> Is this at the (most) minimum recommended cluster size?
>>
>
> Yes, with a single volume with 3 bricks i.e. 3 nodes in cluster.
>
>
>>
>> > Some categories of these commands:
>> > 1) Any command that needs to do some sort of mount/glfs_init.
>> >  Examples: 1) heal info family of commands 2) statfs to find
>> > space-availability etc (On my laptop replica 3 volume with all local
>> bricks,
>> > glfs_init takes 0.3 seconds on average)
>> > 2) glusterd commands that need to wait for the previous command to
>> unlock.
>> > If the previous command is something related to lvm snapshot which takes
>> > quite a few seconds, it would be even more time consuming.
>> >
>> > Nowadays container workloads have hundreds of volumes if not thousands.
>> If
>> > we want to serve any monitoring solution at this scale (I have seen
>> > customers use upto 600 volumes at a time, it will only get bigger) and
>> lets
>> > say collecting metrics per volume takes 2 seconds per volume(Let us
>> take the
>> > worst example which has all major features enabled like
>> > snapshot/geo-rep/quota etc etc), that will mean that it will take 20
>> minutes
>> > to collect metrics of the cluster with 600 volumes. What are the ways in
>> > which we can make this number more manageable? I was initially thinking
>> may
>> > be it is possible to get gd2 to execute commands in parallel on
>> different
>> > volumes, so potentially we could get this done in ~2 seconds. But quite
>> a
>> > few of the metrics need a mount or equivalent of a mount(glfs_init) to
>> > collect different information like statfs, number of pending heals,
>> quota
>> > usage etc. This may lead to high memory usage as the size of the mounts
>> tend
>> > to be high.
>> >
>>
>> I am not sure if starting from the "worst example" (it certainly is
>> not) is a good place to start from.
>
>
> I didn't understand your statement. Are you saying 600 volumes is a worst
> example?
>
>
>> That said, for any environment
>> with that number of disposable volumes, what kind of metrics do
>> actually make any sense/impact?
>>
>
> Same metrics you track for long running volumes. It is just that the way
> the metrics
> are interpreted will be different. On a long running volume, you would
> look at the metrics
> and try to find why is the volume not giving performance as expected in
> the last 1 hour. Whereas
> in this case, you would look at metrics and find the reason why volumes
> that were
> created and deleted in the last hour didn't give performance as expected.
>
>
>>
>> > I wanted to seek suggestions from others on how to come to a conclusion
>> > about which path to take and what problems to solve.
>> >
>> > I will be happy to raise github issues based on our conclusions on this
>> mail
>> > thread.
>> >
>> > --
>> > Pranith
>> >
>>
>>
>>
>>
>>
>> --
>> sankarshan mukhopadhyay
>> 
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Pranith
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Coverity covscan for 2018-07-25-8ad159b2 (master branch)

2018-07-25 Thread staticanalysis


GlusterFS Coverity covscan results for the master branch are available from
http://download.gluster.org/pub/gluster/glusterfs/static-analysis/master/glusterfs-coverity/2018-07-25-8ad159b2/

Coverity covscan results for other active branches are also available at
http://download.gluster.org/pub/gluster/glusterfs/static-analysis/

___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Announcing Gluster for Container Storage (GCS)

2018-07-25 Thread Vijay Bellur
Hi all,

We would like to let you know that some of us have started focusing on an
initiative called ‘Gluster for Container Storage’ (in short GCS). As of
now, one can already use Gluster as storage for containers by making use of
different projects available in github repositories associated with gluster
& Heketi.
The goal of the GCS initiative is to provide an easier integration of these
projects so that they can be consumed together as designed. We are
primarily focused on integration with Kubernetes (k8s) through this
initiative.

Key projects for GCS include:
Glusterd2 (GD2)

Repo: https://github.com/gluster/glusterd2

The challenge we have with the current management layer of Gluster (glusterd)
is that it is not designed for a service oriented architecture. Heketi
overcame this limitation and made Gluster consumable in k8s by providing
all the hooks needed for supporting Persistent Volume Claims.

Glusterd2 provides a service oriented architecture for volume & cluster
management. Gd2 also intends to provide many of the Heketi functionalities
needed by Kubernetes natively. Hence we are working on merging Heketi with
gd2 and you can follow more of this action in the issues associated with
the gd2 github repository.
gluster-block

Repo: https://github.com/gluster/gluster-block

This project intends to expose files in a gluster volume as block devices.
Gluster-block enables supporting ReadWriteOnce (RWO) PVCs and the
corresponding workloads in Kubernetes using gluster as the underlying
storage technology.

Gluster-block is intended to be consumed by stateful RWO applications like
databases and k8s infrastructure services like logging, metrics etc.
gluster-block is preferred over file-based Persistent Volumes in K8s
for stateful/transactional workloads as it provides better performance &
consistency guarantees.
anthill / operator

Repo: https://github.com/gluster/anthill

This project aims to add an operator for Gluster in Kubernetes. Since it
is relatively new, there are areas where you can contribute to make the
operator experience better (please refer to the list of issues). This
project intends to make the whole Gluster experience in k8s much smoother
by automatic management of operator tasks like installation, rolling
upgrades etc.
gluster-csi-driver

Repo: http://github.com/gluster/gluster-csi-driver

This project will provide CSI (Container Storage Interface) compliant
drivers for GlusterFS & gluster-block in k8s.
gluster-kubernetes

Repo: https://github.com/gluster/gluster-kubernetes

This project is intended to provide all the required installation and
management steps for getting gluster up and running in k8s.
GlusterFS

Repo: https://github.com/gluster/glusterfs

GlusterFS is the main and the core repository of Gluster. To support
storage in the container world, we don’t need all the features of Gluster.
Hence, we would be focusing on a stack which would be absolutely required
in k8s. This would allow us to plan and execute tests well, and also
provide users with a setup which works without too many options to tweak.

Notice that glusterfs default volumes would continue to work as of now, but
the translator stack which is used in GCS will be much leaner and geared to
work optimally in k8s.
Monitoring
Repo: https://github.com/gluster/gluster-prometheus

As the k8s ecosystem provides its own native monitoring mechanisms, we intend
to have this project be the placeholder for required monitoring plugins.
The scope of this project is currently WIP and we welcome your inputs to
shape the project.

More details on this can be found at:
https://lists.gluster.org/pipermail/gluster-users/2018-July/034435.html

Gluster-Containers

Repo: https://github.com/gluster/gluster-containers

This repository provides container specs / Dockerfiles that can be used with
a container runtime like cri-o & docker.

Note that this is not an exhaustive or final list of projects involved with
GCS. We will continue to update the project list depending on the new
requirements and priorities that we discover in this journey.

We welcome you to join this journey by looking up the repositories and
contributing to them. As always, we are happy to hear your thoughts about
this initiative, and please stay tuned as we provide periodic updates about
GCS here!

Regards,

Vijay

(on behalf of Gluster maintainers @ Red Hat)
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Github teams/repo cleanup

2018-07-25 Thread Nigel Babu
On Wed, Jul 25, 2018 at 6:51 PM Niels de Vos  wrote:

> We had someone working on starting/stopping Jenkins slaves in Rackspace
> on-demand. He since has left Red Hat and I do not think the infra team
> had a great interest in this either (with the move out of Rackspace).
>
> It can be deleted from my point of view.
>

FYI, stopping a cloud server does not mean we don't get charged for it. So
I don't know if it was a useful exercise to begin with.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Github teams/repo cleanup

2018-07-25 Thread Niels de Vos
On Wed, Jul 25, 2018 at 02:38:57PM +0200, Michael Scherer wrote:
> Le mercredi 25 juillet 2018 à 14:08 +0200, Michael Scherer a écrit :
> > Le mercredi 25 juillet 2018 à 16:06 +0530, Nigel Babu a écrit :
> > > I think our team structure on Github has become unruly. I prefer
> > > that
> > > we
> > > use teams only when we can demonstrate that there is a strong need.
> > > At the
> > > moment, the gluster-maintainers and the glusterd2 projects have
> > > teams
> > > that
> > > have a strong need. If any other repo has a strong need for teams,
> > > please
> > > speak up. Otherwise, I suggest we delete the teams and add the
> > > relevant
> > > people as collaborators on the project.
> > > 
> > > It should be safe to delete the gerrit-hooks repo. These are now
> > > Github
> > > jobs. I'm not in favor of archiving the old projects if they're
> > > going
> > > to be
> > > hidden from someone looking for it. If they just move to the end of
> > > the
> > > listing, it's fine to archive.
> > 
> > So I did a test and just archived gluster/vagrant, and it can still
> > be
> > found.
> > 
> > So I am going to archive at least the salt stuff, and the gerrit-hooks
> > one. And remove the empty one.
> 
> So while cleaning thing up, I wonder if we can remove this one:
> https://github.com/gluster/jenkins-ssh-slaves-plugin
> 
> We have just a fork, lagging from upstream and I am sure we do not use
> it.

We had someone working on starting/stopping Jenkins slaves in Rackspace
on-demand. He since has left Red Hat and I do not think the infra team
had a great interest in this either (with the move out of Rackspace).

It can be deleted from my point of view.

Niels


signature.asc
Description: PGP signature
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Github teams/repo cleanup

2018-07-25 Thread Nigel Babu
> So while cleaning thing up, I wonder if we can remove this one:
> https://github.com/gluster/jenkins-ssh-slaves-plugin
>
> We have just a fork, lagging from upstream and I am sure we do not use
> it.
>

Safe to delete. We're not using it for sure.


>
> The same goes for:
> https://github.com/gluster/devstack-plugins
>
> since I think openstack has changed a lot and that seems like some internal
> configuration for dev, I guess we can remove it?
>

This one seems ahead of the original fork, but I'd say delete.


>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Github teams/repo cleanup

2018-07-25 Thread Michael Scherer
Le mercredi 25 juillet 2018 à 14:08 +0200, Michael Scherer a écrit :
> Le mercredi 25 juillet 2018 à 16:06 +0530, Nigel Babu a écrit :
> > I think our team structure on Github has become unruly. I prefer
> > that
> > we
> > use teams only when we can demonstrate that there is a strong need.
> > At the
> > moment, the gluster-maintainers and the glusterd2 projects have
> > teams
> > that
> > have a strong need. If any other repo has a strong need for teams,
> > please
> > speak up. Otherwise, I suggest we delete the teams and add the
> > relevant
> > people as collaborators on the project.
> > 
> > It should be safe to delete the gerrit-hooks repo. These are now
> > Github
> > jobs. I'm not in favor of archiving the old projects if they're
> > going
> > to be
> > hidden from someone looking for it. If they just move to the end of
> > the
> > listing, it's fine to archive.
> 
> So I did a test and just archived gluster/vagrant, and it can still
> be
> found.
> 
> So I am going to archive at least the salt stuff, and the gerrit-hooks
> one. And remove the empty one.

So while cleaning thing up, I wonder if we can remove this one:
https://github.com/gluster/jenkins-ssh-slaves-plugin

We have just a fork, lagging from upstream and I am sure we do not use
it.

The same goes for:
https://github.com/gluster/devstack-plugins

since I think openstack has changed a lot and that seems like some internal
configuration for dev, I guess we can remove it?

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS



signature.asc
Description: This is a digitally signed message part
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Github teams/repo cleanup

2018-07-25 Thread Michael Scherer
Le mercredi 25 juillet 2018 à 16:06 +0530, Nigel Babu a écrit :
> I think our team structure on Github has become unruly. I prefer that
> we
> use teams only when we can demonstrate that there is a strong need.
> At the
> moment, the gluster-maintainers and the glusterd2 projects have teams
> that
> have a strong need. If any other repo has a strong need for teams,
> please
> speak up. Otherwise, I suggest we delete the teams and add the
> relevant
> people as collaborators on the project.
> 
> It should be safe to delete the gerrit-hooks repo. These are now
> Github
> jobs. I'm not in favor of archiving the old projects if they're going
> to be
> hidden from someone looking for it. If they just move to the end of
> the
> listing, it's fine to archive.

So I did a test and just archived gluster/vagrant, and it can still be
found.

So I am going to archive at least the salt stuff, and the gerrit-hooks
one. And remove the empty one.
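
(Purely as an illustration, not the actual script that was used: something
like the following against the GitHub REST API lists repos with no push in
the last two years. It assumes unauthenticated access and fetches only the
first page of results.)

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type repo struct {
	Name     string    `json:"name"`
	Archived bool      `json:"archived"`
	PushedAt time.Time `json:"pushed_at"`
}

func main() {
	// Only the first page (up to 100 repos) is fetched here; a real audit
	// would follow the Link header for pagination.
	resp, err := http.Get("https://api.github.com/orgs/gluster/repos?per_page=100")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var repos []repo
	if err := json.NewDecoder(resp.Body).Decode(&repos); err != nil {
		panic(err)
	}

	cutoff := time.Now().AddDate(-2, 0, 0)
	for _, r := range repos {
		if !r.Archived && r.PushedAt.Before(cutoff) {
			fmt.Printf("%s last push: %s\n", r.Name, r.PushedAt.Format("2006-01-02"))
		}
	}
}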


> On Fri, Jun 29, 2018 at 10:26 PM Michael Scherer  >
> wrote:
> 
> > Le vendredi 29 juin 2018 à 14:40 +0200, Michael Scherer a écrit :
> > > Hi,
> > > 
> > > So, after Gentoo hack, I started to look at all our teams on
> > > github,
> > > and what access does everybody have, etc, etc
> > > 
> > > And I have a few issues:
> > > - we have old repositories that are no longer used
> > > - we have team without description
> > > - we have people without 2FA who are admins of some team
> > > - github make this kind of audit really difficult without
> > > scripting
> > > (and the API is not stable yet for teams)
> > > 
> > > So I would propose the following rules, and apply them in 1 or 2
> > > weeks
> > > time.
> > > 
> > > For projects:
> > > 
> > > - archives all old projects, aka, ones that got no commit since 2
> > > years, unless people give a reason for the project to stay
> > > unarchived.
> > > Being archived do not remove it, it just hide it by default and
> > > set
> > > it
> > > readonly. It can be reverted without trouble.
> > > 
> > > See https://help.github.com/articles/archiving-a-github-repositor
> > > y/
> > > 
> > > - remove project who never started ("vagrant" is one example,
> > > there
> > > is
> > > only one readme file).
> > > 
> > > For teams:
> > > - if you are admin of a team, you have to turn on 2FA on your
> > > account.
> > > - if you are admin of the github org, you have to turn 2FA.
> > > 
> > > - if a team no longer have a purpose (for example, all repos got
> > > archived or removed), it will be removed.
> > > 
> > > - add a description in every team, that tell what kind of access
> > > does
> > > it give.
> > > 
> > > 
> > > This would permit to get a bit more clarity and security.
> > 
> > So to get some perspective after writing a script to get the
> > information, the repos I propose to archive:
> > 
> > Older than 3 years, we have:
> > 
> > - gmc-target
> > - gmc
> > - swiftkrbauth
> > - devstack-plugins
> > - forge
> > - glupy
> > - glusterfs-rackspace-regression-tester
> > - jenkins-ssh-slaves-plugin
> > - glusterfsiostat
> > 
> > 
> > Older than 2 years, we have:
> > - nagios-server-addons
> > - gluster-nagios-common
> > - gluster-nagios-addons
> > - mod_proxy_gluster
> > - gluster-tutorial
> > - gerrit-hooks
> > - distaf
> > - libgfapi-java-io
> > 
> > And to remove, because empty:
> > - vagrant
> > - bigdata
> > - gluster-manila
> > 
> > 
> > Once they are archived, I will take care of the code for finding
> > teams
> > to remove.
> > 
> > --
> > Michael Scherer
> > Sysadmin, Community Infrastructure and Platform, OSAS
> > 
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://lists.gluster.org/mailman/listinfo/gluster-devel
> 
> 
> 
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS



signature.asc
Description: This is a digitally signed message part
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Github teams/repo cleanup

2018-07-25 Thread Nigel Babu
I think our team structure on Github has become unruly. I prefer that we
use teams only when we can demonstrate that there is a strong need. At the
moment, the gluster-maintainers and the glusterd2 projects have teams that
have a strong need. If any other repo has a strong need for teams, please
speak up. Otherwise, I suggest we delete the teams and add the relevant
people as collaborators on the project.

It should be safe to delete the gerrit-hooks repo. These are now Github
jobs. I'm not in favor of archiving the old projects if they're going to be
hidden from someone looking for it. If they just move to the end of the
listing, it's fine to archive.

On Fri, Jun 29, 2018 at 10:26 PM Michael Scherer 
wrote:

> Le vendredi 29 juin 2018 à 14:40 +0200, Michael Scherer a écrit :
> > Hi,
> >
> > So, after Gentoo hack, I started to look at all our teams on github,
> > and what access does everybody have, etc, etc
> >
> > And I have a few issues:
> > - we have old repositories that are no longer used
> > - we have team without description
> > - we have people without 2FA who are admins of some team
> > - github make this kind of audit really difficult without scripting
> > (and the API is not stable yet for teams)
> >
> > So I would propose the following rules, and apply them in 1 or 2
> > weeks
> > time.
> >
> > For projects:
> >
> > - archives all old projects, aka, ones that got no commit since 2
> > years, unless people give a reason for the project to stay
> > unarchived.
> > Being archived do not remove it, it just hide it by default and set
> > it
> > readonly. It can be reverted without trouble.
> >
> > See https://help.github.com/articles/archiving-a-github-repository/
> >
> > - remove project who never started ("vagrant" is one example, there
> > is
> > only one readme file).
> >
> > For teams:
> > - if you are admin of a team, you have to turn on 2FA on your
> > account.
> > - if you are admin of the github org, you have to turn 2FA.
> >
> > - if a team no longer have a purpose (for example, all repos got
> > archived or removed), it will be removed.
> >
> > - add a description in every team, that tell what kind of access does
> > it give.
> >
> >
> > This would permit to get a bit more clarity and security.
>
> So to get some perspective after writing a script to get the
> information, the repos I propose to archive:
>
> Older than 3 years, we have:
>
> - gmc-target
> - gmc
> - swiftkrbauth
> - devstack-plugins
> - forge
> - glupy
> - glusterfs-rackspace-regression-tester
> - jenkins-ssh-slaves-plugin
> - glusterfsiostat
>
>
> Older than 2 years, we have:
> - nagios-server-addons
> - gluster-nagios-common
> - gluster-nagios-addons
> - mod_proxy_gluster
> - gluster-tutorial
> - gerrit-hooks
> - distaf
> - libgfapi-java-io
>
> And to remove, because empty:
> - vagrant
> - bigdata
> - gluster-manila
>
>
> Once they are archived, I will take care of the code for finding teams
> to remove.
>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Failing multiplex regressions!

2018-07-25 Thread Sanju Rakonde
Hi Shyam,

I need to work on this, but couldn't spend much time till now. I will try
to spend as much time as I can and get these fixed.
Mohit is also working on this AFAIK.

Thanks,
Sanju

On Wed, Jul 25, 2018 at 12:27 AM, Shyam Ranganathan 
wrote:

> Hi,
>
> Multiplex regression jobs are failing everyday, see [1].
>
> May I know is anyone is looking into this?
>
> It was Mohit the last time around, are you still working on this Mohit?
> What patches are in progress to address this, if you are on it?
>
> Thanks,
> Shyam
>
> [1] regression-test-with-multiplex -
> https://build.gluster.org/job/regression-test-with-multiplex/changes
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Thanks,
Sanju
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] How long should metrics collection on a cluster take?

2018-07-25 Thread Pranith Kumar Karampuri
On Tue, Jul 24, 2018 at 10:10 PM, Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> On Tue, Jul 24, 2018 at 9:48 PM, Pranith Kumar Karampuri
>  wrote:
> > hi,
> >   Quite a few commands to monitor gluster at the moment take almost a
> > second to give output.
>
> Is this at the (most) minimum recommended cluster size?
>

Yes, with a single volume with 3 bricks i.e. 3 nodes in cluster.


>
> > Some categories of these commands:
> > 1) Any command that needs to do some sort of mount/glfs_init.
> >  Examples: 1) heal info family of commands 2) statfs to find
> > space-availability etc (On my laptop replica 3 volume with all local
> bricks,
> > glfs_init takes 0.3 seconds on average)
> > 2) glusterd commands that need to wait for the previous command to
> unlock.
> > If the previous command is something related to lvm snapshot which takes
> > quite a few seconds, it would be even more time consuming.
> >
> > Nowadays container workloads have hundreds of volumes if not thousands.
> If
> > we want to serve any monitoring solution at this scale (I have seen
> > customers use upto 600 volumes at a time, it will only get bigger) and
> lets
> > say collecting metrics per volume takes 2 seconds per volume(Let us take
> the
> > worst example which has all major features enabled like
> > snapshot/geo-rep/quota etc etc), that will mean that it will take 20
> minutes
> > to collect metrics of the cluster with 600 volumes. What are the ways in
> > which we can make this number more manageable? I was initially thinking
> may
> > be it is possible to get gd2 to execute commands in parallel on different
> > volumes, so potentially we could get this done in ~2 seconds. But quite a
> > few of the metrics need a mount or equivalent of a mount(glfs_init) to
> > collect different information like statfs, number of pending heals, quota
> > usage etc. This may lead to high memory usage as the size of the mounts
> tend
> > to be high.
> >
>
> I am not sure if starting from the "worst example" (it certainly is
> not) is a good place to start from.


I didn't understand your statement. Are you saying 600 volumes is a worst
example?


> That said, for any environment
> with that number of disposable volumes, what kind of metrics do
> actually make any sense/impact?
>

Same metrics you track for long running volumes. It is just that the way
the metrics
are interpreted will be different. On a long running volume, you would look
at the metrics
and try to find why is the volume not giving performance as expected in the
last 1 hour. Whereas
in this case, you would look at metrics and find the reason why volumes
that were
created and deleted in the last hour didn't give performance as expected.


>
> > I wanted to seek suggestions from others on how to come to a conclusion
> > about which path to take and what problems to solve.
> >
> > I will be happy to raise github issues based on our conclusions on this
> mail
> > thread.
> >
> > --
> > Pranith
> >
>
>
>
>
>
> --
> sankarshan mukhopadhyay
> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel