Re: [Gluster-devel] Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

2016-09-13 Thread Pranith Kumar Karampuri
On Wed, Sep 14, 2016 at 12:08 PM, Xavier Hernandez wrote:

> On 13/09/16 21:00, Pranith Kumar Karampuri wrote:
>
>>
>>
>> On Tue, Sep 13, 2016 at 1:39 PM, Xavier Hernandez wrote:
>>
>> Hi Sanoj,
>>
>> On 13/09/16 09:41, Sanoj Unnikrishnan wrote:
>>
>> Hi Xavi,
>>
>> That explains a lot,
>> I see a couple of other scenarios which can lead to a similar
>> inconsistency.
>> 1) simultaneous node/brick crash of 3 bricks.
>>
>>
>> Although this is a real problem, the 3 bricks would need to crash exactly
>> at the same moment, just after having successfully locked the inode
>> being modified and queried some information, but before sending the
>> write fop or any down notification. The probability of hitting this
>> problem is really small.
>>
>> 2) if the underlying filesystem on which the brick is hosted runs out
>> of disk space on 3 bricks.
>>
>>
>> Yes. This is the same cause that makes quota fail.
>>
>>
>> I don't think we can address all the scenarios unless we have a
>> log/journal mechanism like RAID-5.
>>
>>
>> I completely agree. I don't see any solution valid for all cases.
>> BTW RAID-5 *is not* a solution. It doesn't have any log/journal.
>> Maybe something based on fdl xlator would work.
>>
>> Should we look at a quota specific fix or let it get fixed
>> whenever we introduce a log?
>>
>>
>> Not sure how to fix this in a way that doesn't seem too hacky.
>>
>> One possibility is to request permission to write some data before
>> actually writing it (specifying offset and size). And then be sure
>> that the write will succeed if all (or at least the minimum number
>> of data bricks) have acknowledged the previous write permission
>> request.
>>
>> Another approach would be to queue writes in a server side xlator
>> until a commit message is received, but sending back an answer
>> saying if there's enough space to do the write (this is, in some
>> way, a very primitive log/journal approach).
>>
>> However both approaches will have a big performance impact if they
>> cannot be executed in background.
>>
>> Maybe it would be worth investing in fdl instead of trying to find a
>> custom solution to this.
>>
>>
>> There are some things we should do irrespective of this change:
>> 1) When the file is in a state where all 512 bytes of the fragment
>> represent data, then we shouldn't increase the file size at all,
>> which discards the write without any problems, i.e. this case is
>> recoverable.
>>
>
> I don't understand this case. You mean a write at an offset smaller than
> the total size of the file that doesn't increment it? If that's the case,
> a sparse file could need to allocate new blocks even if the file size
> doesn't change.
>

I think you got it in the second point. Anyway, I was making the same point:
if the failed write doesn't touch the previous contents we can keep the
size as-is and not increment the version/size. Example: if the file is
128KB to begin with and we append 4KB and the write fails, then we keep the
file size as 128KB.
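
To make that concrete, the check amounts to something like the sketch below (a
hypothetical helper, not actual EC xlator code; the 512-byte fragment size and
k=4 come from the examples in this thread): a failed append can simply be
forgotten when it starts at the old end of file and the old size falls exactly
on a stripe boundary, so no existing fragment needed a read-modify-write.

FRAGMENT_SIZE = 512          # bytes stored per brick for each stripe

def failed_append_is_discardable(old_size, write_offset, k=4,
                                 fragment_size=FRAGMENT_SIZE):
    """Hypothetical check: a failed write can be ignored when it starts at
    the old EOF and the old size ends exactly on a stripe boundary, so no
    existing fragment had to be read-modify-written."""
    stripe_size = fragment_size * k          # 512 * 4 = 2048 bytes in 4+2
    starts_past_eof = write_offset >= old_size
    last_stripe_full = old_size % stripe_size == 0
    return starts_past_eof and last_stripe_full

# The example above: a 128KB file plus a failed 4KB append can be discarded.
assert failed_append_is_discardable(old_size=128 * 1024, write_offset=128 * 1024)
# An append onto a partially filled stripe is the harder case of point 2.
assert not failed_append_is_discardable(old_size=9 * 1024, write_offset=9 * 1024)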


>
>
> 2) when we append data to a partially filled chunk and it fails on 3/6
>> bricks, the rest could be recovered by adjusting the file size to the
>> size represented by (previous block - 1)*k; we should probably provide
>> an option to do so?
>>
>
> We could do that, but this only represents a single case that later will
> also be covered by the journal.
>
> In any case, the solution here would be to restore previous file size
> instead of (previous block - 1) * k, since this can cause the file size to
> decrease. This works as long as we can assume that a failed write doesn't
> touch previous contents.
>
> 3) Provide some utility/setfattr to perform recovery based on data
>> rather than versions. i.e. it needs to detect and tell which part of
>> data is not recoverable and which can be. Based on that, the user should
>> be able to recover.
>>
>
> This is not possible with the current implementation, at least in an
> efficient way. The only way to detect inconsistencies right now would be to
> create all possible combinations of k bricks and compute the decoded data
> for each of them. Then check if everything matches. If so, the block is
> healthy, otherwise there is at least one damaged fragment. Then it would be
> necessary to find a relation between the fragments used for each
> reconstruction and the obtained data to determine a probable candidate to
> be damaged.
>
> Anyway, if there are 3 bad fragments in a 4+2 configuration, it won't be
> possible to recover the data for that block. Of course, in this particular
> case, this would mean that we would be able to recover all of the file except the
> last written block.
>
> With syndrome decoding (not currently implemented) and using a new
> encoding matrix, this could be done in an efficient way.
>
> Xav

Re: [Gluster-devel] Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

2016-09-13 Thread Xavier Hernandez

On 13/09/16 21:00, Pranith Kumar Karampuri wrote:



On Tue, Sep 13, 2016 at 1:39 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:

Hi Sanoj,

On 13/09/16 09:41, Sanoj Unnikrishnan wrote:

Hi Xavi,

That explains a lot,
I see a couple of other scenarios which can lead to a similar
inconsistency.
1) simultaneous node/brick crash of 3 bricks.


Although this is a real problem, the 3 bricks would need to crash exactly
at the same moment, just after having successfully locked the inode
being modified and queried some information, but before sending the
write fop or any down notification. The probability of hitting this
problem is really small.

2) if the underlying filesystem on which the brick is hosted runs out
of disk space on 3 bricks.


Yes. This is the same cause that makes quota fail.


I don't think we can address all the scenarios unless we have a
log/journal mechanism like RAID-5.


I completely agree. I don't see any solution valid for all cases.
BTW RAID-5 *is not* a solution. It doesn't have any log/journal.
Maybe something based on fdl xlator would work.

Should we look at a quota specific fix or let it get fixed
whenever we introduce a log?


Not sure how to fix this in a way that doesn't seem too hacky.

One possibility is to request permission to write some data before
actually writing it (specifying offset and size). And then be sure
that the write will succeed if all (or at least the minimum number
of data bricks) have acknowledged the previous write permission request.

Another approach would be to queue writes in a server side xlator
until a commit message is received, but sending back an answer
saying if there's enough space to do the write (this is, in some
way, a very primitive log/journal approach).

However both approaches will have a big performance impact if they
cannot be executed in background.

Maybe it would be worth investing in fdl instead of trying to find a
custom solution to this.


There are some things we should do irrespective of this change:
1) When the file is in a state where all 512 bytes of the fragment
represent data, then we shouldn't increase the file size at all,
which discards the write without any problems, i.e. this case is
recoverable.


I don't understand this case. You mean a write at an offset smaller than
the total size of the file that doesn't increment it? If that's the
case, a sparse file could need to allocate new blocks even if the file
size doesn't change.




2) when we append data to a partially filled chunk and it fails on 3/6
bricks, the rest could be recovered by adjusting the file size to the
size represented by (previous block - 1)*k; we should probably provide
an option to do so?


We could do that, but this only represents a single case that later will 
also be covered by the journal.


In any case, the solution here would be to restore previous file size 
instead of (previous block - 1) * k, since this can cause the file size 
to decrease. This works as long as we can assume that a failed write 
doesn't touch previous contents.



3) Provide some utility/setfattr to perform recovery based on data
rather than versions. i.e. it needs to detect and tell which part of
data is not recoverable and which can be. Based on that, the user should
be able to recover.


This is not possible with the current implementation, at least in an 
efficient way. The only way to detect inconsistencies right now would be to
create all possible combinations of k bricks and compute the decoded 
data for each of them. Then check if everything matches. If so, the 
block is healthy, otherwise there is at least one damaged fragment. Then 
it would be necessary to find a relation between the fragments used for 
each reconstruction and the obtained data to determine a probable 
candidate to be damaged.
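
The brute-force check described above would go roughly like the sketch below
(illustration only; decode() is a stand-in for the erasure-code decoding
routine, not an actual API). It needs one decode per k-sized subset of the
available fragments, i.e. C(6,4) = 15 decodes per block on a 4+2 volume, which
is why it is not efficient:

from itertools import combinations

def block_is_consistent(fragments, k, decode):
    """fragments maps brick index -> fragment bytes for one block; decode()
    is a placeholder that reconstructs the block from any k fragments."""
    results = set()
    for subset in combinations(sorted(fragments), k):
        chosen = {idx: fragments[idx] for idx in subset}
        results.add(decode(chosen, k))       # one decode per k-subset
    # If every reconstruction agrees the block is healthy; otherwise at least
    # one fragment is damaged and more cross-checking is needed to locate it.
    return len(results) == 1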


Anyway, if there are 3 bad fragments in a 4+2 configuration, it won't be 
possible to recover the data for that block. Of course, in this 
particular case, this would mean that we would be able to recover all of
the file except the last written block.


With syndrome decoding (not currently implemented) and using a new 
encoding matrix, this could be done in an efficient way.


Xavi



What do you guys think?


Xavi



Thanks and Regards,
Sanoj

- Original Message -
From: "Xavier Hernandez" mailto:xhernan...@datalab.es>>
To: "Raghavendra Gowdappa" mailto:rgowd...@redhat.com>>, "Sanoj Unnikrishnan"
mailto:sunni...@redhat.com>>
Cc: "Pranith Kumar Karampuri" mailto:pkara...@redhat.com>>, "Ashish Pandey"
mailto:aspan...@redhat.com>>, "Gluster
Devel" mailto:gluster-devel@gluster.org>>
Sent: Tuesday, September 13, 2016 11:50:27 AM
Subject: Re: Need help with
https://bugzilla.redhat.com/sho

[Gluster-devel] [Heketi] Kubernetes Dynamic Provisioner for Gluster/Heketi

2016-09-13 Thread Luis Pabón
Hi all,
  I was able to spend some time setting up the environment to test
the Gluster/Heketi dynamic provisioner in Kubernetes.  It took me
a little while, but once I figured it out, I was able to start the
tests.  It is actually really easy to set up, so I wrote two
blogs[1][2] about how I was able to accomplish this.

Everything worked very well, but I did find one issue[3].  I will
also determine how hard it would be to add this to the Heketi
CI functional tests.

[1] Setting up Minikube: http://bit.ly/2cFfaue
[2] Testing Gluster/Heketi Dynamic Provisioner: http://bit.ly/2cvl92V
[3] https://github.com/heketi/heketi/issues/494

- Luis
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Changing Submit Type for glusterfs

2016-09-13 Thread Pranith Kumar Karampuri
On Tue, Sep 13, 2016 at 7:29 PM, Nigel Babu  wrote:

> On Fri, Sep 02, 2016 at 10:25:01AM +0530, Nigel Babu wrote:
> > > > The reason cherry-pick was chosen was to keep the branch linear and
> > > > avoid merge-commits as (I'm guessing here) this makes the tree hard
> > > > to follow.
> > > > Merge-if-necessary will not keep the branch linear. I'm not sure how
> > > > rebase-if-necessary works though.
> > > >
> > > > Vijay, can you provide any more background for the choice of
> > > > cherry-pick and your opinion on the change?
> > > >
> > >
> > > Unfortunately I do not recollect the reason for cherry-pick being the
> > > current choice. FWIW, I think dependencies were being enforced a while
> > > back in the previous version(s) of gerrit. Not sure if something has
> > > changed in the more recent gerrit versions.
> > >
> >
> > According to the documentation, the behavior was intended to be like how
> > it is currently. If it worked in the past, it may have been a bug. Let me
> > set up a test with Rebase-If-Necessary. Then we can make an informed
> > decision on which way to go about it.
> >
> > --
> > nigelb
>
> I tested out Rebase-If-Necessary. This bit is very important:
>
> When cherry picking a change, Gerrit automatically appends onto the end of
> the commit message a short summary of the change's approvals, and a URL link
> back to the change on the web. The committer header is also set to the
> submitter, while the author header retains the original patch set author.
>
> When using Rebase-If-Necessary, Gerrit does none of this. I'm guessing
> this is a problem for us?
>

It is a problem, yes.


>
> --
> nigelb
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

2016-09-13 Thread Pranith Kumar Karampuri
On Tue, Sep 13, 2016 at 1:39 PM, Xavier Hernandez wrote:

> Hi Sanoj,
>
> On 13/09/16 09:41, Sanoj Unnikrishnan wrote:
>
>> Hi Xavi,
>>
>> That explains a lot,
>> I see a couple of other scenarios which can lead to a similar inconsistency.
>> 1) simultaneous node/brick crash of 3 bricks.
>>
>
> Although this is a real problem, the 3 bricks would need to crash at exactly
> the same moment, just after having successfully locked the inode being modified
> and queried some information, but before sending the write fop or any down
> notification. The probability of hitting this problem is really small.
>
> 2) if the underlying filesystem on which the brick is hosted runs out of
>> disk space on 3 bricks.
>>
>
> Yes. This is the same cause that makes quota fail.
>
>
>> I don't think we can address all the scenarios unless we have a
>> log/journal mechanism like RAID-5.
>>
>
> I completely agree. I don't see any solution valid for all cases. BTW
> RAID-5 *is not* a solution. It doesn't have any log/journal. Maybe
> something based on fdl xlator would work.
>
> Should we look at a quota specific fix or let it get fixed whenever we
>> introduce a log?
>>
>
> Not sure how to fix this in a way that doesn't seem too hacky.
>
> One possibility is to request permission to write some data before
> actually writing it (specifying offset and size). And then be sure that the
> write will succeed if all (or at least the minimum number of data bricks)
> have acknowledged the previous write permission request.
>
> Another approach would be to queue writes in a server side xlator until a
> commit message is received, but sending back an answer saying if there's
> enough space to do the write (this is, in some way, a very primitive
> log/journal approach).
>
> However both approaches will have a big performance impact if they cannot
> be executed in background.
>
> Maybe it would be worth investing in fdl instead of trying to find a
> custom solution to this.
>

There are some things we should do irrespective of this change:
1) When the file is in a state where all 512 bytes of the fragment represent
data, then we shouldn't increase the file size at all, which discards
the write without any problems, i.e. this case is recoverable.
2) when we append data to a partially filled chunk and it fails on 3/6
bricks, the rest could be recovered by adjusting the file size to the size
represented by (previous block - 1)*k; we should probably provide an option
to do so?
3) Provide some utility/setfattr to perform recovery based on data rather
than versions. i.e. it needs to detect and tell which part of data is not
recoverable and which can be. Based on that, the user should be able to
recover.

What do you guys think?


> Xavi
>
>
>
>> Thanks and Regards,
>> Sanoj
>>
>> - Original Message -
>> From: "Xavier Hernandez" 
>> To: "Raghavendra Gowdappa" , "Sanoj Unnikrishnan" <
>> sunni...@redhat.com>
>> Cc: "Pranith Kumar Karampuri" , "Ashish Pandey" <
>> aspan...@redhat.com>, "Gluster Devel" 
>> Sent: Tuesday, September 13, 2016 11:50:27 AM
>> Subject: Re: Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180
>>
>> Hi Sanoj,
>>
>> I'm unable to see bug 1224180. Access is restricted.
>>
>> Not sure what is the problem exactly, but I see that quota is involved.
>> Currently disperse doesn't play well with quota when the limit is near.
>>
>> The reason is that not all bricks fail at the same time with EDQUOT due
>> to small differences in computed space. This causes a valid write to
>> succeed on some bricks and fail on others. If it fails simultaneously on
>> more than redundancy bricks but less than the number of data bricks,
>> there's no way to rollback the changes on the bricks that have
>> succeeded, so the operation is inconsistent and an I/O error is returned.
>>
>> For example, on a 6:2 configuration (4 data bricks and 2 redundancy), if
>> 3 bricks succeed and 3 fail, there are not enough bricks with the
>> updated version, but there aren't enough bricks with the old version
>> either.
>>
>> If you force 2 bricks to be down, the problem can appear more frequently
>> as only a single failure causes this problem.
>>
>> Xavi
>>
>> On 13/09/16 06:09, Raghavendra Gowdappa wrote:
>>
>>> +gluster-devel
>>>
>>> - Original Message -
>>>
 From: "Sanoj Unnikrishnan" 
 To: "Pranith Kumar Karampuri" , "Ashish Pandey" <
 aspan...@redhat.com>, xhernan...@datalab.es,
 "Raghavendra Gowdappa" 
 Sent: Monday, September 12, 2016 7:06:59 PM
 Subject: Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

 Hello Xavi/Pranith,

 I have been able to reproduce the BZ with the following steps:

 gluster volume create v_disp disperse 6 redundancy 2
 $tm1:/export/sdb/br1
 $tm2:/export/sdb/b2 $tm3:/export/sdb/br3  $tm1:/export/sdb/b4
 $tm2:/export/sdb/b5 $tm3:/export/sdb/b6 force
 #(Used only 3 nodes, should not matter here)
 gluster volume start v_d

Re: [Gluster-devel] [Heketi] Block store related API design discussion

2016-09-13 Thread Luis Pabón
Hi Steve, 
  Good questions.  We still need to investigate the security
concerns around block storage.

- Luis

- Original Message -
From: "Stephen Watt" 
To: "Luis Pabón" 
Cc: "Humble Chirammal" , "Raghavendra Talur" 
, "Prasanna Kalever" , "gluster-devel" 
, "Michael Adam" , "Ramakrishna 
Yekulla" , "Mohamed Ashiq Liyazudeen" , 
"Engineering discussions on containers & RHS" 
Sent: Tuesday, September 13, 2016 12:10:54 PM
Subject: Re: [Gluster-devel] [Heketi] Block store related API design discussion

+ rhs-containers list

Also, some important requirements to figure out/think about are:

- How are you managing locking a block device against a container (or a
host?)
- Will your implementation work with OpenShift volume security for block
devices (FSGroups + Recursive chown, chmod and SELinux labeling)

If these aren't already figured out, would it be possible to create
separate cards in your trello board so we can track the progress on the
resolution of these two topics?

On Tue, Sep 13, 2016 at 11:06 AM, Luis Pabón  wrote:

> Very good points.  Thanks Prasanna for putting this together.  I agree with
> your comments in that Heketi is the high-level abstraction API and it
> should have an API similar to what is described by Prasanna.
>
> I definitely do not think any File API should be available in Heketi,
> because that is an implementation of the Block API.  The Heketi API should
> be similar to something like OpenStack Cinder.
>
> I think that the actual management of the Volumes used for Block storage
> and the files in them should be all managed by Heketi.  How they are
> actually created is still to be determined, but we could have Heketi
> create them, or have helper programs do that.
>
> We also need to document the exact workflow to enable a file in
> a Gluster volume to be exposed as a block device.  This will help
> determine where the creation of the file could take place.
>
> We can capture our decisions from these discussions in the
> following page:
>
> https://github.com/heketi/heketi/wiki/Proposed-Changes
>
> - Luis
>
>
> - Original Message -
> From: "Humble Chirammal" 
> To: "Raghavendra Talur" 
> Cc: "Prasanna Kalever" , "gluster-devel" <
> gluster-devel@gluster.org>, "Stephen Watt" , "Luis
> Pabon" , "Michael Adam" ,
> "Ramakrishna Yekulla" , "Mohamed Ashiq Liyazudeen" <
> mliya...@redhat.com>
> Sent: Tuesday, September 13, 2016 2:23:39 AM
> Subject: Re: [Gluster-devel] [Heketi] Block store related API design
> discussion
>
>
>
>
>
> - Original Message -
> | From: "Raghavendra Talur" 
> | To: "Prasanna Kalever" 
> | Cc: "gluster-devel" , "Stephen Watt" <
> sw...@redhat.com>, "Luis Pabon" ,
> | "Michael Adam" , "Humble Chirammal" <
> hchir...@redhat.com>, "Ramakrishna Yekulla"
> | , "Mohamed Ashiq Liyazudeen" 
> | Sent: Tuesday, September 13, 2016 11:08:44 AM
> | Subject: Re: [Gluster-devel] [Heketi] Block store related API design
> discussion
> |
> | On Mon, Sep 12, 2016 at 11:30 PM, Prasanna Kalever 
> | wrote:
> |
> | > Hi all,
> | >
> | > This mail is open for discussion on gluster block store integration
> with
> | > heketi and its REST API interface design constraints.
> | >
> | >
> | >                          ___ Volume Request ...
> | >                         |
> | > PVC claim -> Heketi --->|
> | >                         |                            __ BlockCreate
> | >                         |                           |
> | >                         |                           |__ BlockInfo
> | >                         |                           |
> | >                         |___ Block Request (APIS)-> |__ BlockResize
> | >                                                     |
> | >                                                     |__ BlockList
> | >                                                     |
> | >                                                     |__ BlockDelete
> | >
> | > Heketi will have a block API and a volume API. When a user submits a
> | > Persistent Volume Claim, the Kubernetes provisioner, based on the storage
> | > class (from the PVC), talks to Heketi for storage; Heketi in turn calls
> | > the block or volume APIs based on the request.
> | >
> |
> | This is probably wrong. It won't be Heketi calling block or volume APIs.
> | It would be Kubernetes calling block or volume API *of* Heketi.
> |
> |
> | > With my limited understanding, heketi currently creates clusters from
> | > provided nodes, creates volumes and hands them over to the user.
> | > For block-related APIs, it has to deal with files, right?
> | >
> | > Here is how the block APIs look, in short:
> | > Create: heketi has to create a file in the volume, export it as an iSCSI
> | > target device and hand it over to the user.
> | > Info: show block stores information across all the clusters, connection
> | > info, size etc

Re: [Gluster-devel] [Heketi] Block store related API design discussion

2016-09-13 Thread Luis Pabón
Very good points.  Thanks Prasanna for putting this together.  I agree with
your comments in that Heketi is the high-level abstraction API and it should
have an API similar to what is described by Prasanna.

I definitely do not think any File API should be available in Heketi,
because that is an implementation of the Block API.  The Heketi API should
be similar to something like OpenStack Cinder.

I think that the actual management of the Volumes used for Block storage
and the files in them should be all managed by Heketi.  How they are
actually created is still to be determined, but we could have Heketi
create them, or have helper programs do that.

We also need to document the exact workflow to enable a file in
a Gluster volume to be exposed as a block device.  This will help
determine where the creation of the file could take place.
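
For discussion purposes, a block-create call against such an API might look
roughly like the sketch below; the /blockvolumes path, the payload fields and
the response keys are assumptions made for illustration, not an existing
Heketi interface.

import json
from urllib.request import Request, urlopen

def create_block_volume(base_url, size_gb, name):
    """Hypothetical call: ask Heketi for a block device of size_gb GiB backed
    by a file on some Gluster volume it manages."""
    payload = json.dumps({"size": size_gb, "name": name}).encode()
    req = Request(base_url + "/blockvolumes", data=payload,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

# Expected (again hypothetical) response: an ID that Info/Resize/Delete would
# take, plus the iSCSI portal(s) and IQN the initiator logs in to, e.g.:
# block = create_block_volume("http://heketi.example.com:8080", 10, "db-block-1")
# print(block["id"], block["iqn"], block["portals"])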

We can capture our decisions from these discussions in the
following page:

https://github.com/heketi/heketi/wiki/Proposed-Changes

- Luis


- Original Message -
From: "Humble Chirammal" 
To: "Raghavendra Talur" 
Cc: "Prasanna Kalever" , "gluster-devel" 
, "Stephen Watt" , "Luis Pabon" 
, "Michael Adam" , "Ramakrishna Yekulla" 
, "Mohamed Ashiq Liyazudeen" 
Sent: Tuesday, September 13, 2016 2:23:39 AM
Subject: Re: [Gluster-devel] [Heketi] Block store related API design discussion





- Original Message -
| From: "Raghavendra Talur" 
| To: "Prasanna Kalever" 
| Cc: "gluster-devel" , "Stephen Watt" 
, "Luis Pabon" ,
| "Michael Adam" , "Humble Chirammal" , 
"Ramakrishna Yekulla"
| , "Mohamed Ashiq Liyazudeen" 
| Sent: Tuesday, September 13, 2016 11:08:44 AM
| Subject: Re: [Gluster-devel] [Heketi] Block store related API design 
discussion
| 
| On Mon, Sep 12, 2016 at 11:30 PM, Prasanna Kalever 
| wrote:
| 
| > Hi all,
| >
| > This mail is open for discussion on gluster block store integration with
| > heketi and its REST API interface design constraints.
| >
| >
| >                          ___ Volume Request ...
| >                         |
| > PVC claim -> Heketi --->|
| >                         |                            __ BlockCreate
| >                         |                           |
| >                         |                           |__ BlockInfo
| >                         |                           |
| >                         |___ Block Request (APIS)-> |__ BlockResize
| >                                                     |
| >                                                     |__ BlockList
| >                                                     |
| >                                                     |__ BlockDelete
| >
| > Heketi will have a block API and a volume API. When a user submits a
| > Persistent Volume Claim, the Kubernetes provisioner, based on the storage
| > class (from the PVC), talks to Heketi for storage; Heketi in turn calls
| > the block or volume APIs based on the request.
| >
| 
| This is probably wrong. It won't be Heketi calling block or volume APIs. It
| would be Kubernetes calling block or volume API *of* Heketi.
| 
| 
| > With my limited understanding, heketi currently creates clusters from
| > provided nodes, creates volumes and hands them over to the user.
| > For block-related APIs, it has to deal with files, right?
| >
| > Here is how the block APIs look, in short:
| > Create: heketi has to create a file in the volume, export it as an iSCSI
| > target device and hand it over to the user.
| > Info: show block stores information across all the clusters, connection
| > info, size etc.
| > resize: resize the file in the volume, refresh connections from initiator
| > side
| > List: List the connections
| > Delete: logout the connections and delete the file in the gluster volume
| >
| > Couple of questions:
| > 1. Should the Block API have sub-APIs such as FileCreate, FileList,
| > FileResize, FileDelete, etc., which the Block API then uses, as they
| > mostly deal with files?
| >
| 
| IMO, Heketi should not expose any File related API. It should only have
| APIs to service request for block devices; how the block devices are
| created and modified is an implementation detail.
| 
| 
| > 2. How do we create the actual file in the volume, meaning using a FUSE
| > mount (which may involve an extra process running) or gfapi? Again, if gfapi,
| > should we go with the C APIs, Python bindings or Go bindings?
| >
| 3. Should we get targetcli related (LUN exporting) setup done from heketi
| > or do we seek help from gdeploy for this ?
| >
| 
| I would prefer to either have it in Heketi or in Kubernetes. If the API in
| Heketi promises just the creation of block device, then the rest of the
| implementation should be in Kubernetes(the export part). If the API in
| Heketi promises create and export both, I would say Heketi should have the
| implementation within itself.
| 
| 

IMO, we should not think a

Re: [Gluster-devel] Changing Submit Type for glusterfs

2016-09-13 Thread Nigel Babu
On Fri, Sep 02, 2016 at 10:25:01AM +0530, Nigel Babu wrote:
> > > The reason cherry-pick was chosen was to keep the branch linear and
> > > avoid merge-commits as (I'm guessing here) this makes the tree hard to
> > > follow.
> > > Merge-if-necessary will not keep the branch linear. I'm not sure how
> > > rebase-if-necessary works though.
> > >
> > > Vijay, can you provide any more background for the choice of
> > > cherry-pick and your opinion on the change?
> > >
> >
> > Unfortunately I do not recollect the reason for cherry-pick being the
> > current choice. FWIW, I think dependencies were being enforced a while
> > back in the previous version(s) of gerrit. Not sure if something has
> > changed in the more recent gerrit versions.
> >
>
> According to the documentation, the behavior was intended to be like how it is
> currently. If it worked in the past, it may have been a bug. Let me set up
> a test with Rebase-If-Necessary. Then we can make an informed decision on
> which way to go about it.
>
> --
> nigelb

I tested out Rebase-If-Necessary. This bit is very important:

When cherry picking a change, Gerrit automatically appends onto the end of the
commit message a short summary of the change's approvals, and a URL link back
to the change on the web. The committer header is also set to the submitter,
while the author header retains the original patch set author.
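
For example, a change submitted with the cherry-pick strategy ends up with
trailers roughly like the following (illustrative names and a placeholder
change number, not copied from a real Gluster change):

Reviewed-on: http://review.gluster.org/<change-number>
Reviewed-by: Reviewer Name <reviewer@example.org>
Tested-by: Gluster Build System <ci@example.org>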

When using Rebase-If-Necessary, Gerrit does none of this. I'm guessing this is
a problem for us?

--
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Minutes from Gluster Bug Triage meeting today

2016-09-13 Thread Saravanakumar Arumugam

Hi,

Thanks all who joined!

Next week at the same time (Tuesday 12:00 UTC) we will have another bug
triage meeting to catch the bugs that were not handled by developers and
maintainers yet. We'll keep repeating this meeting as a safety net so that
bugs get the initial attention and developers can immediately start
working on the issues that were reported.

Bug triaging (in general, no need to only do it during the meeting) is
intended to help developers, in the hope that developers can focus on
writing bug fixes instead of spending much of their valuable time
troubleshooting incorrectly/incompletely reported bugs.

More details about bug triaging can be found here:
http://gluster.readthedocs.io/en/latest/Contributors-Guide/Bug-Triage/

Meeting minutes below.

Thanks,
Saravana

==
Meeting summary :

agenda: https://public.pad.fsfe.org/p/gluster-bug-triage 
(Saravanakmr, 12:00:50)


Roll call (Saravanakmr, 12:01:02)
Next weeks meeting host (Saravanakmr, 12:04:25)
ACTION: Next weeks meeting host hgowtham (Saravanakmr, 12:06:11)
skoduri hosts the meeting on 27 September (ndevos, 12:06:12)

ndevos need to decide on how to provide/use debug builds 
(Saravanakmr, 12:07:16)
ACTION: ndevos need to decide on how to provide/use debug 
builds (Saravanakmr, 12:07:57)


jiffin will try to add an error for bug ownership to check-bugs.py 
(Saravanakmr, 12:08:15)
ACTION: jiffin will try to add an error for bug ownership to 
check-bugs.py (Saravanakmr, 12:09:03)


Group Triage (Saravanakmr, 12:09:23)
you can find the bugs to triage here in
https://public.pad.fsfe.org/p/gluster-bugs-to-triage (Saravanakmr, 12:09:32)
http://www.gluster.org/community/documentation/index.php/Bug_triage Bug 
triage guidelines can be found here ^^ (Saravanakmr, 12:09:44)


Open Floor (Saravanakmr, 12:21:14)



Meeting ended at 12:22:13 UTC (full logs).

Action items

Next weeks meeting host hgowtham
ndevos need to decide on how to provide/use debug builds
jiffin will try to add an error for bug ownership to check-bugs.py



Action items, by person

hgowtham
Next weeks meeting host hgowtham
jiffin
jiffin will try to add an error for bug ownership to check-bugs.py
ndevos
ndevos need to decide on how to provide/use debug builds



People present (lines said)

Saravanakmr (34)
skoduri (8)
ndevos (7)
hgowtham (7)
zodbot (3)
jiffin (1)
kkeithley (1)
==
Minutes: 
https://meetbot.fedoraproject.org/gluster-meeting/2016-09-13/gluster_bug_triage.2016-09-13-12.00.html
Minutes (text): 
https://meetbot.fedoraproject.org/gluster-meeting/2016-09-13/gluster_bug_triage.2016-09-13-12.00.txt
Log: 
https://meetbot.fedoraproject.org/gluster-meeting/2016-09-13/gluster_bug_triage.2016-09-13-12.00.log.html




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] REMINDER: Gluster Community Bug Triage meeting at 12:00 UTC (today)

2016-09-13 Thread Saravanakumar Arumugam

Hi,

This meeting is scheduled for anyone who is interested in learning more
about, or assisting with, the Bug Triage.

Meeting details:
- location: #gluster-meeting on Freenode IRC
(https://webchat.freenode.net/?channels=gluster-meeting  )
- date: every Tuesday
- time: 12:00 UTC
  (in your terminal, run: date -d "12:00 UTC")
- agenda: https://public.pad.fsfe.org/p/gluster-bug-triage

Currently the following items are listed:
* Roll Call
* Status of last weeks action items
* Group Triage
* Open Floor

The last two topics have space for additions. If you have a suitable bug
or topic to discuss, please add it to the agenda.

Appreciate your participation.

Thanks,
Saravana


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

2016-09-13 Thread Xavier Hernandez

Hi Sanoj,

On 13/09/16 09:41, Sanoj Unnikrishnan wrote:

Hi Xavi,

That explains a lot,
I see a couple of other scenarios which can lead to a similar inconsistency.
1) simultaneous node/brick crash of 3 bricks.


Although this is a real problem, the 3 bricks would need to crash at exactly
the same moment, just after having successfully locked the inode being
modified and queried some information, but before sending the write fop
or any down notification. The probability of hitting this problem is
really small.



2) if the underlying filesystem on which the brick is hosted runs out of disk
space on 3 bricks.


Yes. This is the same cause that makes quota fail.



I don't think we can address all the scenarios unless we have a log/journal
mechanism like RAID-5.


I completely agree. I don't see any solution valid for all cases. BTW 
RAID-5 *is not* a solution. It doesn't have any log/journal. Maybe 
something based on fdl xlator would work.



Should we look at a quota specific fix or let it get fixed whenever we 
introduce a log?


Not sure how to fix this in a way that doesn't seem too hacky.

One possibility is to request permission to write some data before 
actually writing it (specifying offset and size). And then be sure that 
the write will succeed if all (or at least the minimum number of data 
bricks) have acknowledged the previous write permission request.


Another approach would be to queue writes in a server side xlator until 
a commit message is received, but sending back an answer saying if 
there's enough space to do the write (this is, in some way, a very 
primitive log/journal approach).


However both approaches will have a big performance impact if they 
cannot be executed in background.
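
A minimal sketch of the first idea (reserve space on the bricks, then write),
assuming a hypothetical brick-side helper and a plain statvfs free-space check;
this is not existing Gluster code, only an illustration of the extra round
trips that make the approach expensive when it cannot run in the background:

import os

class BrickReservations:
    """Hypothetical brick-side helper: grant a write only if the backing
    filesystem still has room for it plus everything already promised."""

    def __init__(self, export_path):
        self.export_path = export_path
        self.pending = 0                       # bytes already promised

    def free_bytes(self):
        st = os.statvfs(self.export_path)
        return st.f_bavail * st.f_frsize

    def reserve(self, size):
        # First round trip: the client asks permission before sending data.
        if self.free_bytes() - self.pending < size:
            return False                       # would later fail with ENOSPC/EDQUOT
        self.pending += size
        return True

    def release(self, size):
        # Second round trip: the write happened (or was aborted).
        self.pending -= size

def client_write(bricks, size, min_data_bricks):
    """Send the real write fop only if enough bricks granted the reservation."""
    granted = [b for b in bricks if b.reserve(size)]
    ok = len(granted) >= min_data_bricks
    # ... if ok, send the write to the granted bricks here ...
    for b in granted:
        b.release(size)
    return ok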


Maybe it would be worth investing in fdl instead of trying to find a 
custom solution to this.


Xavi



Thanks and Regards,
Sanoj

- Original Message -
From: "Xavier Hernandez" 
To: "Raghavendra Gowdappa" , "Sanoj Unnikrishnan" 

Cc: "Pranith Kumar Karampuri" , "Ashish Pandey" , 
"Gluster Devel" 
Sent: Tuesday, September 13, 2016 11:50:27 AM
Subject: Re: Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

Hi Sanoj,

I'm unable to see bug 1224180. Access is restricted.

Not sure what is the problem exactly, but I see that quota is involved.
Currently disperse doesn't play well with quota when the limit is near.

The reason is that not all bricks fail at the same time with EDQUOT due
to small differences in computed space. This causes a valid write to
succeed on some bricks and fail on others. If it fails simultaneously on
more than redundancy bricks but less than the number of data bricks,
there's no way to rollback the changes on the bricks that have
succeeded, so the operation is inconsistent and an I/O error is returned.

For example, on a 6:2 configuration (4 data bricks and 2 redundancy), if
3 bricks succeed and 3 fail, there are not enough bricks with the
updated version, but there aren't enough bricks with the old version either.
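
The arithmetic behind that example can be written down directly; a small
sketch (not Gluster code) of the recoverability condition:

def block_recoverable(n, redundancy, succeeded):
    """For an n = k + redundancy disperse volume, a block version is usable
    only if at least k bricks agree on it (either all new or all old)."""
    k = n - redundancy
    return succeeded >= k or (n - succeeded) >= k

# The "6:2" configuration (k = 4 data bricks, 2 redundancy), 3 succeed, 3 fail:
print(block_recoverable(6, 2, succeeded=3))    # False -> the write returns EIO
# Recoverable cases for comparison: 4 succeeded (new version has quorum) or
# only 2 succeeded (the old version is still on 4 bricks).
print(block_recoverable(6, 2, succeeded=4))    # True
print(block_recoverable(6, 2, succeeded=2))    # True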

If you force 2 bricks to be down, the problem can appear more frequently
as only a single failure causes this problem.

Xavi

On 13/09/16 06:09, Raghavendra Gowdappa wrote:

+gluster-devel

- Original Message -

From: "Sanoj Unnikrishnan" 
To: "Pranith Kumar Karampuri" , "Ashish Pandey" 
, xhernan...@datalab.es,
"Raghavendra Gowdappa" 
Sent: Monday, September 12, 2016 7:06:59 PM
Subject: Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

Hello Xavi/Pranith,

I have been able to reproduce the BZ with the following steps:

gluster volume create v_disp disperse 6 redundancy 2 $tm1:/export/sdb/br1
$tm2:/export/sdb/b2 $tm3:/export/sdb/br3  $tm1:/export/sdb/b4
$tm2:/export/sdb/b5 $tm3:/export/sdb/b6 force
#(Used only 3 nodes, should not matter here)
gluster volume start v_disp
mount -t glusterfs $tm1:v_disp /gluster_vols/v_disp
mkdir /gluster_vols/v_disp/dir1
dd if=/dev/zero of=/gluster_vols/v_disp/dir1/x bs=10k count=9 &
gluster v quota v_disp enable
gluster v quota v_disp limit-usage /dir1 200MB
gluster v quota v_disp soft-timeout 0
gluster v quota v_disp hard-timeout 0
#optional remove 2 bricks (reproduces more often with this)
#pgrep glusterfsd | xargs kill -9

IO error on stdout when Quota exceeds, followed by Disk Quota exceeded.

Also note the issue is seen when a flush happens simultaneously with the quota
limit being hit; hence it is not seen on every run, only on some.

The following are the error in logs.
[2016-09-12 10:40:02.431568] E [MSGID: 122034]
[ec-common.c:488:ec_child_select] 0-v_disp-disperse-0: Insufficient
available childs for this request (have 0, need 4)
[2016-09-12 10:40:02.431627] E [MSGID: 122037]
[ec-common.c:1830:ec_update_size_version_done] 0-Disperse: sku-debug:
pre-version=0/0, size=0post-version=1865/1865, size=209571840
[2016-09-12 10:40:02.431637] E [MSGID: 122037]
[ec-common.c:1835:ec_update_size_version_done] 0-v_disp-disperse-0: Failed
to update version and size [Input/output error]
[2016-09-12 10:40:02.43

Re: [Gluster-devel] Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

2016-09-13 Thread Sanoj Unnikrishnan
Hi Xavi,

That explains a lot,
I see a couple of other scenarios which can lead to a similar inconsistency.
1) simultaneous node/brick crash of 3 bricks.
2) if the underlying filesystem on which the brick is hosted runs out of disk
space on 3 bricks.

I don't think we can address all the scenarios unless we have a log/journal
mechanism like RAID-5.
Should we look at a quota specific fix or let it get fixed whenever we 
introduce a log?

Thanks and Regards,
Sanoj

- Original Message -
From: "Xavier Hernandez" 
To: "Raghavendra Gowdappa" , "Sanoj Unnikrishnan" 

Cc: "Pranith Kumar Karampuri" , "Ashish Pandey" 
, "Gluster Devel" 
Sent: Tuesday, September 13, 2016 11:50:27 AM
Subject: Re: Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

Hi Sanoj,

I'm unable to see bug 1224180. Access is restricted.

Not sure what is the problem exactly, but I see that quota is involved. 
Currently disperse doesn't play well with quota when the limit is near.

The reason is that not all bricks fail at the same time with EDQUOT due 
to small differences in computed space. This causes a valid write to
succeed on some bricks and fail on others. If it fails simultaneously on 
more than redundancy bricks but less than the number of data bricks,
there's no way to rollback the changes on the bricks that have 
succeeded, so the operation is inconsistent and an I/O error is returned.

For example, on a 6:2 configuration (4 data bricks and 2 redundancy), if 
3 bricks succeed and 3 fail, there are not enough bricks with the 
updated version, but there aren't enough bricks with the old version either.

If you force 2 bricks to be down, the problem can appear more frequently 
as only a single failure causes this problem.

Xavi

On 13/09/16 06:09, Raghavendra Gowdappa wrote:
> +gluster-devel
>
> - Original Message -
>> From: "Sanoj Unnikrishnan" 
>> To: "Pranith Kumar Karampuri" , "Ashish Pandey" 
>> , xhernan...@datalab.es,
>> "Raghavendra Gowdappa" 
>> Sent: Monday, September 12, 2016 7:06:59 PM
>> Subject: Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180
>>
>> Hello Xavi/Pranith,
>>
>> I have been able to reproduce the BZ with the following steps:
>>
>> gluster volume create v_disp disperse 6 redundancy 2 $tm1:/export/sdb/br1
>> $tm2:/export/sdb/b2 $tm3:/export/sdb/br3  $tm1:/export/sdb/b4
>> $tm2:/export/sdb/b5 $tm3:/export/sdb/b6 force
>> #(Used only 3 nodes, should not matter here)
>> gluster volume start v_disp
>> mount -t glusterfs $tm1:v_disp /gluster_vols/v_disp
>> mkdir /gluster_vols/v_disp/dir1
>> dd if=/dev/zero of=/gluster_vols/v_disp/dir1/x bs=10k count=9 &
>> gluster v quota v_disp enable
>> gluster v quota v_disp limit-usage /dir1 200MB
>> gluster v quota v_disp soft-timeout 0
>> gluster v quota v_disp hard-timeout 0
>> #optional remove 2 bricks (reproduces more often with this)
>> #pgrep glusterfsd | xargs kill -9
>>
>> IO error on stdout when Quota exceeds, followed by Disk Quota exceeded.
>>
>> Also note the issue is seen when a flush happens simultaneously with the quota
>> limit being hit; hence it is not seen on every run, only on some.
>>
>> The following are the error in logs.
>> [2016-09-12 10:40:02.431568] E [MSGID: 122034]
>> [ec-common.c:488:ec_child_select] 0-v_disp-disperse-0: Insufficient
>> available childs for this request (have 0, need 4)
>> [2016-09-12 10:40:02.431627] E [MSGID: 122037]
>> [ec-common.c:1830:ec_update_size_version_done] 0-Disperse: sku-debug:
>> pre-version=0/0, size=0post-version=1865/1865, size=209571840
>> [2016-09-12 10:40:02.431637] E [MSGID: 122037]
>> [ec-common.c:1835:ec_update_size_version_done] 0-v_disp-disperse-0: Failed
>> to update version and size [Input/output error]
>> [2016-09-12 10:40:02.431664] E [MSGID: 122034]
>> [ec-common.c:417:ec_child_select] 0-v_disp-disperse-0: sku-debug: mask: 36,
>> ec->xl_up 36, ec->node_mask 3f, parent->mask:36, fop->parent->healing:0,
>> id:29
>>
>> [2016-09-12 10:40:02.431673] E [MSGID: 122034]
>> [ec-common.c:480:ec_child_select] 0-v_disp-disperse-0: sku-debug: mask: 36,
>> remaining: 36, healing: 0, ec->xl_up 36, ec->node_mask 3f, parent->mask:36,
>> num:4, minimum: 1, id:29
>>
>> ...
>> [2016-09-12 10:40:02.487302] W [fuse-bridge.c:2311:fuse_writev_cbk]
>> 0-glusterfs-fuse: 41159: WRITE => -1
>> gfid=ee0b4aa1-1f44-486a-883c-acddc13ee318 fd=0x7f1d9c003edc (Input/output
>> error)
>> [2016-09-12 10:40:02.500151] W [MSGID: 122006]
>> [ec-combine.c:206:ec_iatt_combine] 0-v_disp-disperse-0: Failed to combine
>> iatt (inode: 9816911356190712600-9816911356190712600, links: 1-1, uid: 0-0,
>> gid: 0-0, rdev: 0-0, size: 52423680-52413440, mode: 100644-100644)
>> [2016-09-12 10:40:02.500188] N [MSGID: 122029]
>> [ec-combine.c:93:ec_combine_write] 0-v_disp-disperse-0: Mismatching iatt in
>> answers of 'WRITE'
>> [2016-09-12 10:40:02.504551] W [MSGID: 122006]
>> [ec-combine.c:206:ec_iatt_combine] 0-v_disp-disperse-0: Failed to combine
>> iatt (inode: 9816911356190712600-9816911356190712600, links: 1-1, uid: 0-0,
>> gid: 0