Re: [Gluster-devel] Metrics: and how to get them out from gluster

2017-09-01 Thread Xavier Hernandez

Hi Amar,

I don't have time to review the changes in experimental branch yet, but 
here are some comments about these ideas...


On 01/09/17 07:27, Amar Tumballi wrote:
Disclaimer: This email is long and took significant time to write. 
Do take time to read, review and give feedback, so we can have some 
metrics-related tasks done by Gluster 4.0.


---
*History:*

To understand what is happening inside a GlusterFS process, over the 
years we have opened many bugs and also coded a few things with regard 
to statedump, and put some effort into the io-stats translator to 
improve Gluster's monitoring capabilities.


But surely more is required! Some glimpse of it is captured in 
[1], [2], [3] & [4]. Also, I did send an email to this group [5] about 
possibilities of capturing this information.


*Current problem:*

When we talk about metrics or monitoring, we have to consider giving out 
these data to a tool which can preserve the readings periodically; 
without a time series, no metrics will make sense! So the first 
challenge is how to get them out. Should getting the metrics out of each 
process require 'glusterd' to interact, or should we use signals? 
Which leads us to *'challenge #1'.*


One problem I see here is that we will have multiple bricks and multiple 
clients (including FUSE and gfapi).


I assume we want to be able to monitor whole volume performance 
(aggregate values of all mount points), specific mount performance, and 
even specific brick performance.


In this case, the signal approach seems quite difficult to me, especially 
for gfapi-based clients. Even for fuse mounts and brick processes we 
would need to connect to each host where one of these processes runs and 
send the signal there. Some clients may not be prepared to be accessed 
remotely in an easy way.


Using glusterd this problem could be minimized, but I'm not sure that 
the interface would be easy to implement (basically because we would 
need some kind of filtering syntax to avoid huge outputs) and the output 
could be complex to parse for other tools, especially considering that 
the amount of data could be significant and can change with the addition 
or modification of translators.


I propose a third approach. It's based on a virtual directory similar to 
/sys and /proc on Linux. We already have /.meta in gluster. We could 
extend that in a way that we could have data there from each mount point 
(fuse or gfapi) and each brick. Then we could define an API to allow 
each xlator to publish information in that directory in a simple way.


Using this approach, monitoring tools can check only the interesting data 
by mounting the volume like any other client and reading the desired 
values.
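
As a rough illustration of how such a monitoring tool could work (the
directory layout and metric file names below are hypothetical, not an
existing Gluster interface), it would just walk the virtual tree under a
normal mount and read one value per file, /sys style:

    import os

    MOUNT = "/mnt/testvol"                                  # any client mount
    METRICS_ROOT = os.path.join(MOUNT, ".meta", "metrics")  # assumed path

    def scrape(root):
        """Return {xlator/metric: value} for every metric file found."""
        samples = {}
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                # the key mirrors the xlator hierarchy,
                # e.g. "brick-0/posix/read_latency_avg"
                key = os.path.relpath(path, root)
                with open(path) as f:
                    samples[key] = f.read().strip()  # one value per file
        return samples

    if __name__ == "__main__":
        for key, value in sorted(scrape(METRICS_ROOT).items()):
            print(key, value)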


To implement this we could centralize all statistics capturing in 
libglusterfs itself, and create a new translator (or reuse meta) to 
gather this information from libglusterfs and publish it into the 
virtual directory (probably we would need a server side and a client 
side xlator to be able to combine data from all mounts and bricks).




Next is, should we depend on io-stats to do the reporting? If yes, how 
to get information from between any two layers? Should we provide 
io-stats in between all the nodes of translator graph?


I wouldn't depend on io-stats for reporting all the data. Monitoring 
seems to me a deeper thing than what a single translator can do.


Using the virtual directory approach, io-stats can place its statistics 
there, but it doesn't need to be aware of all other possible statistics 
from other xlators because each one will report its own statistics 
independently.


or should we utilize the STACK_WIND/UNWIND framework to get the details? 
This is our *'challenge #2'*


I think that gluster core itself (basically libglusterfs) should keep 
its own details on global things like this. These details could also be 
published in the virtual directory. From my point of view, io-stats 
should be left to provide global timings for the fops, or be merged with 
the STACK_WIND/UNWIND framework and removed as an xlator.




Once the above decision is taken, the question is, "what about 
'metrics' from other translators? Who gives them out (i.e., dumps them)? 
Why do we need something similar to statedump, and can't we read the 
info from statedump itself?".


I think it would be better and easier to move the information from the 
statedump to the virtual directory instead of trying to use the 
statedump to report everything.


But when we say 'metrics', we should have a key and 
a number associated with it; statedump has a lot more, and no fixed 
format. If it's different from statedump, then what is our answer for 
translator code to give out metrics? This is our *'challenge #3'*


Using the virtual directory structure, our key would be a specific file 
name in some directory that represents the hierarchical structure of the 
volume (xlators), and the value would be its 

Re: [Gluster-devel] GlusterFS v3.12 - Nearing deadline for branch out

2017-07-19 Thread Xavier Hernandez

Hi,

On 17/07/17 17:30, Pranith Kumar Karampuri wrote:

hi,
   Status of the following features targeted for 3.12:
1) Need a way to resolve split-brain (#135) : Mostly will be merged in a
day.
2) Halo Hybrid mode (#217): Unfortunately didn't get time to follow up
on this, so will not make it to the release.
3) Implement heal throttling (#255): Won't be making it to 3.12
4) Delay generator xlator (#257): I can definitely get this in by end of
next week, otherwise we can do this for next release.
5) Parallel writes in EC (#251): This seems like a stretch for this
weekend but definitely completable by end of next week. I am hopeful
Xavi will have some cycles to complete the final reviews. Otherwise we
may have to push this out.
6) Discard support for EC (#254): Doable for this weekend IMO, also
depends on what Xavi thinks...
7) Last stripe caching (#256): We are targeting this instead of heal
throttling (#255). This is not added to 3.12 tracker. I can add this if
we can wait till next week. This also depends on Xavi's schedule.

Also added Xavi for his inputs.


Because of other higher priorities in my work, I have very little time 
to spend on this. All I can say is that I'll try to review the patches 
as soon as possible.


Xavi




On Wed, Jul 5, 2017 at 9:07 PM, Shyam wrote:

Further to this,

1) I cleared up the projects lane [1] and also issues marked for
3.12 [2]
  - I did this optimistically, moving everything to 3.12 (both from
a projects and a milestones perspective), so if something is not
making it, drop a note, and we can clear up the tags accordingly.

2) Reviews posted and open against the issues in [1] can be viewed
here [3]

  - Request maintainers and contributors to take a look at these and
accelerate the reviews, to meet the feature cut-off deadline

  - Request feature owners to ensure that their patches are listed
in the link [3]

3) Finally, we need a status of open issues to understand how we can
help. Requesting all feature owners to post the same (as Amar has
requested).

Thanks,
Shyam

[1] Project lane: https://github.com/gluster/glusterfs/projects/1

[2] Issues with 3.12 milestone:
https://github.com/gluster/glusterfs/milestone/4

[3] Reviews needing attention:
https://review.gluster.org/#/q/starredby:srangana%2540redhat.com


"Releases are made better together"


On 07/05/2017 03:18 AM, Amar Tumballi wrote:

All,

We are around 10 working days away from branching out for the 3.12
release, after which we will have just 15 more days open for 'critical'
features to get in, for which there should be more detailed proposals.

If you have a few things planned out but haven't taken them to completion
yet, OR you have sent some patches that are not yet reviewed, start
whining now and get these in.

Thanks,
Amar

--
Amar Tumballi (amarts)


___
Gluster-devel mailing list
Gluster-devel@gluster.org 
http://lists.gluster.org/mailman/listinfo/gluster-devel







--
Pranith


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-07-07 Thread Xavier Hernandez

On 07/07/17 11:25, Pranith Kumar Karampuri wrote:



On Fri, Jul 7, 2017 at 2:46 PM, Xavier Hernandez <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>> wrote:

On 07/07/17 10:12, Pranith Kumar Karampuri wrote:



On Fri, Jul 7, 2017 at 1:13 PM, Xavier Hernandez
<xhernan...@datalab.es <mailto:xhernan...@datalab.es>
<mailto:xhernan...@datalab.es <mailto:xhernan...@datalab.es>>>
wrote:

Hi Pranith,

On 05/07/17 12:28, Pranith Kumar Karampuri wrote:



On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez
<xhernan...@datalab.es <mailto:xhernan...@datalab.es>
<mailto:xhernan...@datalab.es <mailto:xhernan...@datalab.es>>
<mailto:xhernan...@datalab.es
<mailto:xhernan...@datalab.es> <mailto:xhernan...@datalab.es
<mailto:xhernan...@datalab.es>>>>
wrote:

Hi Pranith,

On 03/07/17 08:33, Pranith Kumar Karampuri wrote:

Xavi,
  Now that the change has been reverted, we can
resume this
discussion and decide on the exact format that
considers, tier, dht,
afr, ec. People working geo-rep/dht/afr/ec had
an internal
discussion
and we all agreed that this proposal would be a
good way
forward. I
think once we agree on the format and decide on
the initial
encoding/decoding functions of the xattr and
this change is
merged, we
can send patches on afr/ec/dht and geo-rep to
take it to
closure.

Could you propose the new format you have in
mind that
considers
all of
the xlators?


My idea was to create a new xattr not bound to any
particular
function but which could give enough information to
be used
in many
places.

Currently we have another attribute called
glusterfs.pathinfo that
returns hierarchical information about the location of a
file. Maybe
we can extend this to unify all these attributes
into a single
feature that could be used for multiple purposes.

Since we have time to discuss it, I would like to
design it with
more information than we already talked.

First of all, the amount of information that this
attribute can
contain is quite big if we expect to have volumes with
thousands of
bricks. Even in the most simple case of returning
only an
UUID, we
can easily go beyond the limit of 64KB.

Consider also, for example, what shard should return
when
pathinfo
is requested for a file. Probably it should return a
list of
shards,
each one with all its associated pathinfo. We are
talking
about big
amounts of data here.

I think this kind of information doesn't fit very
well in an
extended attribute. Another think to consider is
that most
probably
the requester of the data only needs a fragment of
it, so we are
generating big amounts of data only to be parsed and
reduced
later,
dismissing most of it.

What do you think about using a very special virtual
file to
manage
all this information ? it could be easily read using
normal read
fops, so it could manage big amounts of data easily.
Also,
accessing
only to some parts of the file we could go directly
where we
want,
avoiding the read of all remaining data.

A very basic idea could be this:

Each xlator would have a reserved area of the file.
We can
reserve
up to 4GB per xlator (32 bits). The remaining 32
bits of the
offset
would indicate the xlator we want to access.

At offset 0 we have generic information about the
volume.
One of the
  

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-07-07 Thread Xavier Hernandez

Hi Pranith,

On 05/07/17 12:28, Pranith Kumar Karampuri wrote:



On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>> wrote:

Hi Pranith,

On 03/07/17 08:33, Pranith Kumar Karampuri wrote:

Xavi,
  Now that the change has been reverted, we can resume this
discussion and decide on the exact format that considers, tier, dht,
afr, ec. People working geo-rep/dht/afr/ec had an internal
discussion
and we all agreed that this proposal would be a good way forward. I
think once we agree on the format and decide on the initial
encoding/decoding functions of the xattr and this change is
merged, we
can send patches on afr/ec/dht and geo-rep to take it to closure.

Could you propose the new format you have in mind that considers
all of
the xlators?


My idea was to create a new xattr not bound to any particular
function but which could give enough information to be used in many
places.

Currently we have another attribute called glusterfs.pathinfo that
returns hierarchical information about the location of a file. Maybe
we can extend this to unify all these attributes into a single
feature that could be used for multiple purposes.

Since we have time to discuss it, I would like to design it with
more information than we already talked.

First of all, the amount of information that this attribute can
contain is quite big if we expect to have volumes with thousands of
bricks. Even in the most simple case of returning only an UUID, we
can easily go beyond the limit of 64KB.

Consider also, for example, what shard should return when pathinfo
is requested for a file. Probably it should return a list of shards,
each one with all its associated pathinfo. We are talking about big
amounts of data here.

I think this kind of information doesn't fit very well in an
extended attribute. Another think to consider is that most probably
the requester of the data only needs a fragment of it, so we are
generating big amounts of data only to be parsed and reduced later,
dismissing most of it.

What do you think about using a very special virtual file to manage
all this information ? it could be easily read using normal read
fops, so it could manage big amounts of data easily. Also, accessing
only to some parts of the file we could go directly where we want,
avoiding the read of all remaining data.

A very basic idea could be this:

Each xlator would have a reserved area of the file. We can reserve
up to 4GB per xlator (32 bits). The remaining 32 bits of the offset
would indicate the xlator we want to access.

At offset 0 we have generic information about the volume. One of the
the things that this information should include is a basic hierarchy
of the whole volume and the offset for each xlator.

After reading this, the user will seek to the desired offset and
read the information related to the xlator it is interested in.

All the information should be stored in a format easily extensible
that will be kept compatible even if new information is added in the
future (for example doing special mappings of the 32 bits offsets
reserved for the xlator).

For example we can reserve the first megabyte of the xlator area to
have a mapping of attributes with its respective offset.

I think that using a binary format would simplify all this a lot.

Do you think this is a way to explore or should I stop wasting time
here ?


I think this just became a very big feature :-). Shall we just live with
it the way it is now?


I supposed so...

The only thing we need to check is whether shard needs to handle this 
xattr. If so, what should it return? Only the UUIDs corresponding to the 
first shard, or the UUIDs of all bricks containing at least one shard? I 
guess the first one is enough, but just to be sure...


My proposal was to implement a new xattr, for example glusterfs.layout, 
that contains enough information to be usable in all current use cases.


The idea would be that each xlator that makes a significant change in 
the way or the place where files are stored, should put information in 
this xattr. The information should include:


* Type (basically AFR, EC, DHT, ...)
* Basic configuration (replication and arbiter for AFR, data and 
redundancy for EC, # subvolumes for DHT, shard size for sharding, ...)

* Quorum imposed by the xlator
* UUID data coming from subvolumes (sorted by brick position)
* It should be easily extensible in the future

The last point is very important to avoid the issues we have seen now. 
We must be able to incorporate more information without breaking 
backward compatibility. To do so, we can add tags for each value.
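
As a minimal sketch of what such a tagged, extensible value could look like
(the tag names and separators below are invented for illustration, not an
agreed format), older parsers would simply keep and ignore unknown tags:

    # Hypothetical tagged glusterfs.layout value; one record per xlator,
    # ordered top-down. Nesting is flattened here only to keep it short.
    def encode_layout(records):
        return ";".join(",".join(f"{k}={v}" for k, v in rec.items())
                        for rec in records)

    def decode_layout(value):
        records = []
        for chunk in value.split(";"):
            rec = {}
            for field in chunk.split(","):
                key, _, val = field.partition("=")
                rec[key] = val          # unknown tags are kept, not rejected
            records.append(rec)
        return records

    value = encode_layout([
        {"type": "DHT", "subvols": "2"},
        {"type": "AFR", "replica": "2", "arbiter": "0", "quorum": "auto",
         "uuids": "U1 U2"},
    ])
    print(decode_layout(value))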


For example, a distribute 2, replica 2 v

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-07-04 Thread Xavier Hernandez

Hi Pranith,

On 03/07/17 08:33, Pranith Kumar Karampuri wrote:

Xavi,
  Now that the change has been reverted, we can resume this
discussion and decide on the exact format that considers, tier, dht,
afr, ec. People working geo-rep/dht/afr/ec had an internal discussion
and we all agreed that this proposal would be a good way forward. I
think once we agree on the format and decide on the initial
encoding/decoding functions of the xattr and this change is merged, we
can send patches on afr/ec/dht and geo-rep to take it to closure.

Could you propose the new format you have in mind that considers all of
the xlators?


My idea was to create a new xattr not bound to any particular function 
but which could give enough information to be used in many places.


Currently we have another attribute called glusterfs.pathinfo that 
returns hierarchical information about the location of a file. Maybe we 
can extend this to unify all these attributes into a single feature that 
could be used for multiple purposes.


Since we have time to discuss it, I would like to design it with more 
information than we already talked.


First of all, the amount of information that this attribute can contain 
is quite big if we expect to have volumes with thousands of bricks. Even 
in the simplest case of returning only a UUID, we can easily go beyond 
the limit of 64KB.


Consider also, for example, what shard should return when pathinfo is 
requested for a file. Probably it should return a list of shards, each 
one with all its associated pathinfo. We are talking about big amounts 
of data here.


I think this kind of information doesn't fit very well in an extended 
attribute. Another thing to consider is that most probably the requester 
of the data only needs a fragment of it, so we are generating big 
amounts of data only to be parsed and reduced later, dismissing most of it.


What do you think about using a very special virtual file to manage all 
this information? It could be easily read using normal read fops, so it 
could manage big amounts of data easily. Also, by accessing only some 
parts of the file we could go directly where we want, avoiding reading 
all the remaining data.


A very basic idea could be this:

Each xlator would have a reserved area of the file. We can reserve up to 
4GB per xlator (32 bits). The remaining 32 bits of the offset would 
indicate the xlator we want to access.


At offset 0 we have generic information about the volume. One of the 
things that this information should include is a basic hierarchy of the 
whole volume and the offset for each xlator.


After reading this, the user will seek to the desired offset and read 
the information related to the xlator it is interested in.


All the information should be stored in an easily extensible format that 
will be kept compatible even if new information is added in the future 
(for example doing special mappings of the 32-bit offsets reserved for 
the xlator).


For example we can reserve the first megabyte of the xlator area to have 
a mapping of attributes with its respective offset.


I think that using a binary format would simplify all this a lot.
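
A small sketch of the offset scheme and one possible binary record, under
the assumptions stated above (the upper 32 bits of the file offset select
the xlator, the lower 32 bits address its private 4GB area; the
attribute-map entry layout is made up for illustration):

    import struct

    def xlator_offset(xlator_index, local_offset=0):
        # upper 32 bits: xlator, lower 32 bits: position inside its area
        assert 0 <= local_offset < (1 << 32)
        return (xlator_index << 32) | local_offset

    def split_offset(offset):
        return offset >> 32, offset & 0xffffffff

    # Hypothetical attribute-map entry: 64-bit offset + 16-byte name,
    # packed with a fixed layout so old and new readers agree on sizes.
    entry = struct.pack("<Q16s", xlator_offset(3, 0x100000), b"read_latency")
    offset, name = struct.unpack("<Q16s", entry)
    print(split_offset(offset), name.rstrip(b"\0"))  # (3, 1048576) b'read_latency'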

Do you think this is a way to explore or should I stop wasting time here ?

Xavi





On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya
<ksubr...@redhat.com <mailto:ksubr...@redhat.com>> wrote:



On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez
<xhernan...@datalab.es <mailto:xhernan...@datalab.es>> wrote:

That's ok. I'm currently unable to write a patch for this on ec.

Sunil is working on this patch.

~Karthik

If no one can do it, I can try to do it in 6 - 7 hours...

Xavi


On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:




    On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez
<xhernan...@datalab.es <mailto:xhernan...@datalab.es>> wrote:

I'm ok with reverting node-uuid content to the previous
format and create a new xattr for the new format.
Currently, only rebalance will use it.

Only thing to consider is what can happen if we have a
half upgraded cluster where some clients have this change
and some not. Can rebalance work in this situation ? if
so, could there be any issue ?


I think there shouldn't be any problem, because this is
in-memory xattr so layers below afr/ec will only see node-uuid
xattr.
This also gives us a chance to do whatever we want to do in
future with this xattr without any problems about backward
compatibility.

You can check

https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507

<https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507>
   

Re: [Gluster-devel] Disperse volume : Sequential Writes

2017-07-04 Thread Xavier Hernandez

Hi Pranith,

On 03/07/17 05:35, Pranith Kumar Karampuri wrote:

Ashish, Xavi,
   I think it is better to implement this change as a separate
read-after-write caching xlator which we can load between EC and client
xlator. That way EC will not get a lot more functionality than necessary
and maybe this xlator can be used somewhere else in the stack if possible.


while this seems a good way to separate functionalities, it has a big 
problem. If we add a caching xlator between ec and *all* of its 
subvolumes, it will only be able to cache encoded data. So, when ec 
needs the "cached" data, it will need to issue a request to each of its 
subvolumes and compute the decoded data before being able to use it, so 
we don't avoid the decoding overhead.
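
A toy model of that point (plain striping standing in for real erasure
coding; nothing here reflects ec's actual code): a cache sitting below ec
only ever holds per-subvolume fragments, so ec still has to gather one
fragment per subvolume and reassemble them before the data is usable.

    DATA_BRICKS = 3

    def encode(data):
        # split into DATA_BRICKS interleaved fragments (stand-in for encode)
        return [data[i::DATA_BRICKS] for i in range(DATA_BRICKS)]

    def decode(fragments):
        # reassemble the original data (stand-in for decode)
        out = bytearray(sum(len(f) for f in fragments))
        for i, frag in enumerate(fragments):
            out[i::DATA_BRICKS] = frag
        return bytes(out)

    data = b"0123456789AB"
    per_subvol_cache = encode(data)           # what a cache below ec would hold
    print(per_subvol_cache)                   # [b'0369', b'147A', b'258B']
    print(decode(per_subvol_cache) == data)   # ec must still decode: True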


Also, if we want to make the xlator generic, it will probably cache a 
lot more data than ec really needs, increasing the memory footprint 
considerably for no real benefit.


Additionally, this new xlator will need to guarantee that the cached 
data is current, so it will need its own locking logic (that would be 
another copy of the existing logic in one of the current xlators) 
which is slow and difficult to maintain, or it will need to intercept 
and reuse locking calls from parent xlators, which can be quite complex 
since we have multiple xlator levels where locks can be taken, not only ec.


This is a relatively simple change to make inside ec, but a very complex 
change (IMO) if we want to do it as a stand-alone xlator and be generic 
enough to be reused and work safely in other places of the stack.


If we want to separate functionalities I think we should create a new 
concept of xlator which is transversal to the "traditional" xlator stack.


Current xlators are linear in the sense that each one operates only at 
one place (it can be moved by reconfiguration, but once instantiated, it 
always works at the same place) and passes data to the next one.


A transversal xlator (or maybe "service xlator" would be a better name) 
would be one not bound to any place of the stack, but that could be used 
by all other xlators to implement some service, like caching, 
multithreading, locking, ... These are features that many xlators need 
but cannot use easily (nor efficiently) if they are implicitly 
implemented in some specific place of the stack outside their control.


The transaction framework we already talked about could be thought of as 
one of these service xlators. Multithreading could also benefit from 
this approach because xlators would have more control over which things 
can be processed by a background thread and which cannot. Probably there 
are other features that could benefit from this approach.


In the case of brick multiplexing, if some xlators are removed from each 
stack and loaded as global services, most probably the memory footprint 
will be lower and the resource usage more optimized.


Just an idea...

Xavi



On Fri, Jun 16, 2017 at 4:19 PM, Ashish Pandey <aspan...@redhat.com
<mailto:aspan...@redhat.com>> wrote:


I think it should be done as we have agreement on basic design.


*From: *"Pranith Kumar Karampuri" <pkara...@redhat.com
<mailto:pkara...@redhat.com>>
*To: *"Xavier Hernandez" <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>>
*Cc: *"Ashish Pandey" <aspan...@redhat.com
<mailto:aspan...@redhat.com>>, "Gluster Devel"
<gluster-devel@gluster.org <mailto:gluster-devel@gluster.org>>
*Sent: *Friday, June 16, 2017 3:50:09 PM
*Subject: *Re: [Gluster-devel] Disperse volume : Sequential Writes




On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez
<xhernan...@datalab.es <mailto:xhernan...@datalab.es>> wrote:

On 16/06/17 10:51, Pranith Kumar Karampuri wrote:



On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez
<xhernan...@datalab.es <mailto:xhernan...@datalab.es>
<mailto:xhernan...@datalab.es
<mailto:xhernan...@datalab.es>>> wrote:

On 15/06/17 11:50, Pranith Kumar Karampuri wrote:



On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey
<aspan...@redhat.com <mailto:aspan...@redhat.com>
<mailto:aspan...@redhat.com <mailto:aspan...@redhat.com>>
<mailto:aspan...@redhat.com
<mailto:aspan...@redhat.com> <mailto:aspan...@redhat.com
<mailto:aspan...@redhat.com>>>> wrote:

Hi All,

We have been facing some issues in disperse (EC)
volume.
We know that currently EC is not good for random
IO as it
requires
READ-MODIFY-WRITE fop
   

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-21 Thread Xavier Hernandez

That's ok. I'm currently unable to write a patch for this on ec. If no one can 
do it, I can try to do it in 6 - 7 hours...

Xavi

On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri 
<pkara...@redhat.com> wrote:
On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:

I'm ok with reverting node-uuid content to the previous format and create
a new xattr for the new format. Currently, only rebalance will use it.

Only thing to consider is what can happen if we have a half upgraded cluster
where some clients have this change and some not. Can rebalance work in this
situation ? if so, could there be any issue ?

I think there shouldn't be any problem, because this is in-memory xattr so
layers below afr/ec will only see node-uuid xattr. This also gives us a
chance to do whatever we want to do in future with this xattr without any
problems about backward compatibility.

You can check
https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
for how karthik implemented this in AFR (this got merged accidentally
yesterday, but looks like this is what we are settling on)

Xavi

On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri 
<pkara...@redhat.com> wrote:
On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran <nbala...@redhat.com> wrote:

On 20 June 2017 at 20:38, Aravinda <avish...@redhat.com> wrote:

On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:

Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to go
with the format Aravinda suggested for now, and in future we wanted some
more changes for dht to detect which subvolume went down and came back up;
at that time we will revisit the solution suggested by Xavi.

Susanth is doing the dht changes
Aravinda is doing geo-rep changes

Done. Geo-rep patch sent for review https://review.gluster.org/17582

The proposed changes to the node-uuid behaviour (while good) are going to
break tiering. Tiering changes will take a little more time to be coded and
tested. As this is a regression for 3.11 and a blocker for 3.11.1, I suggest
we go back to the original node-uuid behaviour for now so as to unblock the
release and target the proposed changes for the next 3.11 releases.

Let me see if I understand the changes correctly. We are restoring the
behavior of node-uuid xattr and adding a new xattr for parallel rebalance
for both afr and ec, correct? Otherwise that is one more regression. If yes,
we will also wait for Xavi's inputs. Jeff accidentally merged the afr patch
yesterday which does these changes. If everyone is in agreement, we will
leave it as is and add similar changes in ec as well. If we are not in
agreement, then we will let the discussion progress :-)

Regards,
Nithya

--
Aravinda

Thanks to all of you guys for the discussions!

On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:

Hi Aravinda,

On 20/06/17 12:42, Aravinda wrote:

I think following format can be easily
adopted by all components

UUIDs of a subvolume are separated by space and subvolumes are separated
by comma

For example, node1 and node2 are replica with U1 and U2 UUIDs
respectively and
node3 and node4 are replica with U3 and U4 UUIDs respectively

node-uuid can return "U1 U2,U3 U4"
While this is ok for current implementation, I think this can be insufficient 
if there are more layers of xlators that require to indicate some sort of 
grouping. Some representation that can represent hierarchy would be better. For 
example: "(U1 U2) (U3 U4)" (we can use spaces or comma as a separator).
 
Geo-rep can split by "," and then split by space and take first UUID
DHT can split the value by space or comma and get unique UUIDs list
This doesn't solve the problem I described in the previous email. Some more 
logic will need to be added to avoid more than one node from each replica-set 
to be active. If we have some explicit hierarchy information in the node-uuid 
value, more decisions can be taken.

An initial proposal I made was this:

DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))

This is harder to parse, but gives a lot of information: DHT with 2 subvolumes, 
each subvolume is an AFR with replica 2 and no arbiters. It's also easily 
extensible with any new xlator that changes the layout.

However maybe this is not the moment to do this, and probably we could 
implement this in a new xattr with a better name.

Xavi 
Another question is about the behavior when a node is down, existing
node-uuid xattr will not return that UUID if a node is down. What is the
behavior with the proposed xattr?

Let me know your thoughts.

regards
Aravinda VK

On 06/20/2017 03:06 PM, Aravinda wrote:Hi Xavi,

On 06/20/2017 02:51 PM, Xavier Hernandez wrote:Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:Adding more people to get a 
consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM,

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-21 Thread Xavier Hernandez

I'm ok with reverting node-uuid content to the previous format and create a new 
xattr for the new format. Currently, only rebalance will use it.

Only thing to consider is what can happen if we have a half upgraded cluster 
where some clients have this change and some not. Can rebalance work in this 
situation ? if so, could there be any issue ?

Xavi

On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri 
<pkara...@redhat.com> wrote:
On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran <nbala...@redhat.com> wrote:

On 20 June 2017 at 20:38, Aravinda <avish...@redhat.com> wrote:

On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:

Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to go
with the format Aravinda suggested for now, and in future we wanted some
more changes for dht to detect which subvolume went down and came back up;
at that time we will revisit the solution suggested by Xavi.

Susanth is doing the dht changes
Aravinda is doing geo-rep changes

Done. Geo-rep patch sent for review https://review.gluster.org/17582

The proposed changes to the node-uuid behaviour (while good) are going to
break tiering. Tiering changes will take a little more time to be coded and
tested. As this is a regression for 3.11 and a blocker for 3.11.1, I suggest
we go back to the original node-uuid behaviour for now so as to unblock the
release and target the proposed changes for the next 3.11 releases.

Let me see if I understand the changes correctly. We are restoring the
behavior of node-uuid xattr and adding a new xattr for parallel rebalance
for both afr and ec, correct? Otherwise that is one more regression. If yes,
we will also wait for Xavi's inputs. Jeff accidentally merged the afr patch
yesterday which does these changes. If everyone is in agreement, we will
leave it as is and add similar changes in ec as well. If we are not in
agreement, then we will let the discussion progress :-)

Regards,
Nithya

--
Aravinda

Thanks to all of you guys for the discussions!

On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:

Hi Aravinda,

On 20/06/17 12:42, Aravinda wrote:

I think following format can be easily
adopted by all components

UUIDs of a subvolume are separated by space and subvolumes are separated
by comma

For example, node1 and node2 are replica with U1 and U2 UUIDs
respectively and
node3 and node4 are replica with U3 and U4 UUIDs respectively

node-uuid can return "U1 U2,U3 U4"
While this is ok for current implementation, I think this can be insufficient 
if there are more layers of xlators that require to indicate some sort of 
grouping. Some representation that can represent hierarchy would be better. For 
example: "(U1 U2) (U3 U4)" (we can use spaces or comma as a separator).
 
Geo-rep can split by "," and then split by space and take first UUID
DHT can split the value by space or comma and get unique UUIDs list
This doesn't solve the problem I described in the previous email. Some more 
logic will need to be added to avoid more than one node from each replica-set 
to be active. If we have some explicit hierarchy information in the node-uuid 
value, more decisions can be taken.

An initial proposal I made was this:

DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))

This is harder to parse, but gives a lot of information: DHT with 2 subvolumes, 
each subvolume is an AFR with replica 2 and no arbiters. It's also easily 
extensible with any new xlator that changes the layout.

However maybe this is not the moment to do this, and probably we could 
implement this in a new xattr with a better name.

Xavi 
Another question is about the behavior when a node is down, existing
node-uuid xattr will not return that UUID if a node is down. What is the
behavior with the proposed xattr?

Let me know your thoughts.

regards
Aravinda VK

On 06/20/2017 03:06 PM, Aravinda wrote:Hi Xavi,

On 06/20/2017 02:51 PM, Xavier Hernandez wrote:Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:Adding more people to get a 
consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avish...@redhat.com
<mailto:avish...@redhat.com>> wrote:


    regards
    Aravinda VK


    On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

        Hi Pranith,

        adding gluster-devel, Kotresh and Aravinda,

        On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



            On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
            <xhernan...@datalab.es <mailto:xhernan...@datalab.es>
            <mailto:xhernan...@datalab.es
            <mailto:xhernan...@datalab.es>>> wrote:

                On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

                    The way geo-replication works is:
                    On each machine, it does getxattr of node-uuid and
            check if its
                    own uuid
    

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Xavier Hernandez

Hi Aravinda,

On 20/06/17 12:42, Aravinda wrote:

I think following format can be easily adopted by all components

UUIDs of a subvolume are separated by space and subvolumes are separated
by comma

For example, node1 and node2 are replica with U1 and U2 UUIDs
respectively and
node3 and node4 are replica with U3 and U4 UUIDs respectively

node-uuid can return "U1 U2,U3 U4"


While this is ok for the current implementation, I think this can be 
insufficient if there are more layers of xlators that need to indicate 
some sort of grouping. Some representation that can express hierarchy 
would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or 
commas as separators).




Geo-rep can split by "," and then split by space and take first UUID
DHT can split the value by space or comma and get unique UUIDs list
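
As a side note, a small sketch of how the two consumers could read that
value (format taken from the example above; the helper names are made up):

    value = "U1 U2,U3 U4"

    def georep_active_uuids(value):
        # geo-rep: first UUID of each replica/disperse set is the active one
        return [group.split()[0] for group in value.split(",")]

    def dht_unique_uuids(value):
        # dht: flat list of unique UUIDs, order preserved
        seen = []
        for uuid in value.replace(",", " ").split():
            if uuid not in seen:
                seen.append(uuid)
        return seen

    print(georep_active_uuids(value))   # ['U1', 'U3']
    print(dht_unique_uuids(value))      # ['U1', 'U2', 'U3', 'U4']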


This doesn't solve the problem I described in the previous email. Some 
more logic will need to be added to avoid more than one node from each 
replica-set being active. If we have some explicit hierarchy information 
in the node-uuid value, more decisions can be taken.


An initial proposal I made was this:

DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))

This is harder to parse, but gives a lot of information: DHT with 2 
subvolumes, each subvolume is an AFR with replica 2 and no arbiters. 
It's also easily extensible with any new xlator that changes the layout.
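
To show it would still be machine-parseable, here is a rough
recursive-descent sketch for that syntax (the grammar is only what the
example above implies, not an agreed format):

    def parse(s):
        pos = 0

        def node():
            nonlocal pos
            start = pos
            while pos < len(s) and (s[pos].isalnum() or s[pos] in "_-"):
                pos += 1
            name = s[start:pos]
            params, children = [], []
            if pos < len(s) and s[pos] == "[":
                end = s.index("]", pos)
                params = s[pos + 1:end].split(",")
                pos = end + 1
            if pos < len(s) and s[pos] == "(":
                pos += 1
                while s[pos] != ")":
                    children.append(node())
                    if s[pos] == ",":
                        pos += 1
                    while s[pos] == " ":
                        pos += 1
                pos += 1
            return {"name": name, "params": params, "children": children}

        return node()

    tree = parse("DHT[2](AFR[2,0](NODE(U1), NODE(U2)), "
                 "AFR[2,0](NODE(U3), NODE(U4)))")
    # e.g. geo-rep could mark the first node of each replica set as active:
    print([afr["children"][0]["children"][0]["name"]
           for afr in tree["children"]])   # ['U1', 'U3']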


However maybe this is not the moment to do this, and probably we could 
implement this in a new xattr with a better name.


Xavi



Another question is about the behavior when a node is down, existing
node-uuid xattr will not return that UUID if a node is down. What is the
behavior with the proposed xattr?

Let me know your thoughts.

regards
Aravinda VK

On 06/20/2017 03:06 PM, Aravinda wrote:

Hi Xavi,

On 06/20/2017 02:51 PM, Xavier Hernandez wrote:

Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avish...@redhat.com
<mailto:avish...@redhat.com>> wrote:


regards
Aravinda VK


    On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



        On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
<xhernan...@datalab.es <mailto:xhernan...@datalab.es>
<mailto:xhernan...@datalab.es
<mailto:xhernan...@datalab.es>>> wrote:

On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:
On each machine, it does getxattr of node-uuid and
check if its
own uuid
is present in the list. If it is present then it
will consider
it active
otherwise it will be considered passive. With this
change we are
giving
all uuids instead of first-up subvolume. So all
machines think
they are
ACTIVE which is bad apparently. So that is the
reason. Even I
felt bad
that we are doing this change.


And what about changing the content of node-uuid to
include some
sort of hierarchy ?

for example:

a single brick:

NODE()

AFR/EC:

AFR[2](NODE(), NODE())
EC[3,1](NODE(), NODE(), NODE())

DHT:

DHT[2](AFR[2](NODE(), NODE()),
AFR[2](NODE(),
NODE()))

This gives a lot of information that can be used to
take the
appropriate decisions.


I guess that is not backward compatible. Shall I CC
gluster-devel and
Kotresh/Aravinda?


Is the change we did backward compatible ? if we only require
the first field to be a GUID to support backward compatibility,
we can use something like this:

No. But the necessary change can be made to Geo-rep code as well if
format is changed, Since all these are built/shipped together.

Geo-rep uses node-id as follows,

list = listxattr(node-uuid)
active_node_uuids = list.split(SPACE)
active_node_flag = True if self.node_id exists in active_node_uuids
else False


How was this case solved ?

suppose we have three servers and 2 bricks in each server. A
replicated volume is created using the following command:

gluster volume create test replica 2 server1:/brick1 server2:/brick1
server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2

In this case we have three replica-sets:

* server1:/brick1 server2:/brick1
* server2:

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Xavier Hernandez

Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avish...@redhat.com
<mailto:avish...@redhat.com>> wrote:


regards
Aravinda VK


On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
<xhernan...@datalab.es <mailto:xhernan...@datalab.es>
<mailto:xhernan...@datalab.es
<mailto:xhernan...@datalab.es>>> wrote:

On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:
On each machine, it does getxattr of node-uuid and
check if its
own uuid
is present in the list. If it is present then it
will consider
it active
otherwise it will be considered passive. With this
change we are
giving
all uuids instead of first-up subvolume. So all
machines think
they are
ACTIVE which is bad apparently. So that is the
reason. Even I
felt bad
that we are doing this change.


And what about changing the content of node-uuid to
include some
sort of hierarchy ?

for example:

a single brick:

NODE()

AFR/EC:

AFR[2](NODE(), NODE())
EC[3,1](NODE(), NODE(), NODE())

DHT:

DHT[2](AFR[2](NODE(), NODE()),
AFR[2](NODE(),
NODE()))

This gives a lot of information that can be used to take the
appropriate decisions.


I guess that is not backward compatible. Shall I CC
gluster-devel and
Kotresh/Aravinda?


Is the change we did backward compatible ? if we only require
the first field to be a GUID to support backward compatibility,
we can use something like this:

No. But the necessary change can be made to Geo-rep code as well if
format is changed, Since all these are built/shipped together.

Geo-rep uses node-id as follows,

list = listxattr(node-uuid)
active_node_uuids = list.split(SPACE)
active_node_flag = True if self.node_id exists in active_node_uuids
else False


How was this case solved ?

suppose we have three servers and 2 bricks in each server. A replicated 
volume is created using the following command:


gluster volume create test replica 2 server1:/brick1 server2:/brick1 
server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2


In this case we have three replica-sets:

* server1:/brick1 server2:/brick1
* server2:/brick2 server3:/brick1
* server3:/brick2 server1:/brick2

Old AFR implementation for node-uuid always returned the uuid of the 
node of the first brick, so in this case we will get the uuid of the 
three nodes because all of them are the first brick of a replica-set.
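
A quick sketch of that behaviour (brick order as in the example above,
placeholder uuids; "uuid of the node of the first brick of each replica
set" is the only rule modelled):

    node_uuid = {"server1": "U1", "server2": "U2", "server3": "U3"}

    bricks = ["server1:/brick1", "server2:/brick1",   # replica set 1
              "server2:/brick2", "server3:/brick1",   # replica set 2
              "server3:/brick2", "server1:/brick2"]   # replica set 3
    replica = 2

    replica_sets = [bricks[i:i + replica]
                    for i in range(0, len(bricks), replica)]
    first_bricks = {node_uuid[rs[0].split(":")[0]] for rs in replica_sets}
    print(first_bricks)   # {'U1', 'U2', 'U3'} -> every node looks "active"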


Does this mean that with this configuration all nodes are active ? Is 
this a problem ? Is there any other check to avoid this situation if 
it's not good ?


Xavi





Bricks:
<uuid>

AFR/EC:
<uuid>(<uuid>, <uuid>)

DHT:
<uuid>((<uuid>, ...), (<uuid>, ...))

In this case, AFR and EC would return the same <uuid> they
returned before the patch, but between '(' and ')' they put the
full list of uuids of all nodes. The first <uuid> can be used
by geo-replication. The list after the first <uuid> can be used
for rebalance.

Not sure if there's any user of node-uuid above DHT.

Xavi




Xavi


    On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez
<xhernan...@datalab.es
<mailto:xhernan...@datalab.es> <mailto:xhernan...@datalab.es
<mailto:xhernan...@datalab.es>>
<mailto:xhernan...@datalab.es
<mailto:xhernan...@datalab.es> <mailto:xhernan...@datalab.es
<mailto:xhernan...@datalab.es>>>>
wrote:

Hi Pranith,

On 20/06/17 07:53, Pranith Kumar Karampuri wrote:

hi Xavi,
   We all made the mistake of not
sending about changing
behavior of
node-uuid xattr so that rebalance can use
multiple nodes
for doing
rebalance. Because of this on geo-rep all
the worker

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Xavier Hernandez

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>> wrote:

On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:
On each machine, it does getxattr of node-uuid and check if its
own uuid
is present in the list. If it is present then it will consider
it active
otherwise it will be considered passive. With this change we are
giving
all uuids instead of first-up subvolume. So all machines think
they are
ACTIVE which is bad apparently. So that is the reason. Even I
felt bad
that we are doing this change.


And what about changing the content of node-uuid to include some
sort of hierarchy ?

for example:

a single brick:

NODE(<uuid>)

AFR/EC:

AFR[2](NODE(<uuid>), NODE(<uuid>))
EC[3,1](NODE(<uuid>), NODE(<uuid>), NODE(<uuid>))

DHT:

DHT[2](AFR[2](NODE(<uuid>), NODE(<uuid>)), AFR[2](NODE(<uuid>),
NODE(<uuid>)))

This gives a lot of information that can be used to take the
appropriate decisions.


I guess that is not backward compatible. Shall I CC gluster-devel and
Kotresh/Aravinda?


Is the change we did backward compatible ? If we only require the first 
field to be a GUID to support backward compatibility, we can use 
something like this:

Bricks:
<uuid>

AFR/EC:
<uuid>(<uuid>, <uuid>)

DHT:
<uuid>((<uuid>, ...), (<uuid>, ...))

In this case, AFR and EC would return the same <uuid> they returned 
before the patch, but between '(' and ')' they put the full list of 
uuids of all nodes. The first <uuid> can be used by geo-replication. 
The list after the first <uuid> can be used for rebalance.


Not sure if there's any user of node-uuid above DHT.

Xavi





Xavi


On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez
<xhernan...@datalab.es <mailto:xhernan...@datalab.es>
<mailto:xhernan...@datalab.es <mailto:xhernan...@datalab.es>>>
wrote:

Hi Pranith,

On 20/06/17 07:53, Pranith Kumar Karampuri wrote:

hi Xavi,
   We all made the mistake of not sending about changing
behavior of
node-uuid xattr so that rebalance can use multiple nodes
for doing
rebalance. Because of this on geo-rep all the workers
are becoming
active instead of one per EC/AFR subvolume. So we are
frantically trying
to restore the functionality of node-uuid and introduce
a new
xattr for
the new behavior. Sunil will be sending out a patch for
this.


Wouldn't it be better to change geo-rep behavior to use the
new data
? I think it's better as it's now, since it gives more
information
to upper layers so that they can take more accurate decisions.

Xavi


--
Pranith





--
Pranith





--
Pranith


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Self-heal on read-only volumes

2017-06-20 Thread Xavier Hernandez

Hi Karthik,

thanks for the information.

Xavi

On 16/06/17 13:25, Karthik Subrahmanya wrote:

Hi Xavi,

In my opinion it cannot be called a bug; it is kind of an
improvement to the read-only and WORM translators.
The solution is to identify the internal FOPs and allow them
to pass even when the read-only or WORM options are enabled.
The patch [1] from Kotresh resolves this issue, which is currently under
review.

[1] https://review.gluster.org/#/c/16855/

Regards,
Karthik

On Fri, Jun 16, 2017 at 4:26 PM, Pranith Kumar Karampuri
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:

I remember either Kotresh/Karthik recently sent patches to do
something similar. Adding them to check if the know something about this

On Fri, Jun 16, 2017 at 1:25 PM, Xavier Hernandez
<xhernan...@datalab.es <mailto:xhernan...@datalab.es>> wrote:

Hi,

currently it seems that a read-only replica 2 volume cannot be
healed because all attempts to make changes by the self-heal
daemon on the damaged brick will fail with EROFS.

It's true that no regular writes are allowed, so there's no
possibility to cause damage by partial writes or similar things.
However a read-only brick can still fail because of disk errors
and some files could get corrupted or the entire disk will need
to be replaced.

Is this a bug or the only way to solve this problem is to make
the volume read-write until self-heal finishes ?

Thanks,

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org <mailto:Gluster-devel@gluster.org>
http://lists.gluster.org/mailman/listinfo/gluster-devel
<http://lists.gluster.org/mailman/listinfo/gluster-devel>




--
Pranith




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-07 Thread Xavier Hernandez

Hi Krutika,

On 06/06/17 13:35, Krutika Dhananjay wrote:

Hi,

As part of identifying performance bottlenecks within gluster stack for
VM image store use-case, I loaded io-stats at multiple points on the
client and brick stack and ran randrd test using fio from within the
hosted vms in parallel.

Before I get to the results, a little bit about the configuration ...

3 node cluster; 1x3 plain replicate volume with group virt settings,
direct-io.
3 FUSE clients, one per node in the cluster (which implies reads are
served from the replica that is local to the client).

io-stats was loaded at the following places:
On the client stack: Above client-io-threads and above protocol/client-0
(the first child of AFR).
On the brick stack: Below protocol/server, above and below io-threads
and just above storage/posix.

Based on a 60-second run of randrd test and subsequent analysis of the
stats dumped by the individual io-stats instances, the following is what
I found:

Translator position                  Avg latency of READ fop as seen by
                                     this translator

1. parent of client-io-threads       1666us

∆(1,2) = 50us

2. parent of protocol/client-0       1616us

∆(2,3) = 1453us

--- end of client stack ---
--- beginning of brick stack ---

3. child of protocol/server          163us

∆(3,4) = 7us

4. parent of io-threads              156us

∆(4,5) = 20us

5. child of io-threads               136us

∆(5,6) = 11us

6. parent of storage/posix           125us
...
--- end of brick stack ---
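
As a quick cross-check, the deltas quoted above follow directly from the
per-position averages (numbers copied from this run; the snippet is just
arithmetic, not part of the test itself):

    avg_read_latency_us = [
        ("1. parent of client-io-threads", 1666),
        ("2. parent of protocol/client-0", 1616),
        ("3. child of protocol/server",     163),
        ("4. parent of io-threads",         156),
        ("5. child of io-threads",          136),
        ("6. parent of storage/posix",      125),
    ]

    for (upper, a), (lower, b) in zip(avg_read_latency_us,
                                      avg_read_latency_us[1:]):
        print(f"delta({upper} -> {lower}) = {a - b}us")
    # the 1616us -> 163us step is the client/brick boundary
    # (network + rpc + epoll)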

So it seems like the biggest bottleneck here is a combination of the
network + epoll, rpc layer?
I must admit I am no expert with networks, but I'm assuming if the
client is reading from the local brick, then
even latency contribution from the actual network won't be much, in
which case bulk of the latency is coming from epoll, rpc layer, etc at
both client and brick end? Please correct me if I'm wrong.

I will, of course, do some more runs and confirm if the pattern is
consistent.


very interesting. These results are similar to what I also observed when 
doing some ec tests.


My personal feeling is that there's high serialization and/or contention 
in the network layer caused by mutexes, but I don't have data to support 
that.


Xavi



-Krutika


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel




Re: [Gluster-devel] GFID2 - Proposal to add extra byte to existing GFID

2017-05-15 Thread Xavier Hernandez
Hi Amar, 
On May 15, 2017 2:15 PM, Amar Tumballi <atumb...@redhat.com> wrote:
>
>
>
> On Tue, Apr 11, 2017 at 2:59 PM, Amar Tumballi <ama...@gmail.com> wrote:
>>
>> Comments inline.
>>
>> On Mon, Dec 19, 2016 at 1:47 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:
>>>
>>> On 12/19/2016 07:57 AM, Aravinda wrote:
>>>>
>>>>
>>>> regards
>>>> Aravinda
>>>>
>>>> On 12/16/2016 05:47 PM, Xavier Hernandez wrote:
>>>>>
>>>>> On 12/16/2016 08:31 AM, Aravinda wrote:
>>>>>>
>>>>>> Proposal to add one more byte to GFID to store "Type" information.
>>>>>> Extra byte will represent type(directory: 00, file: 01, Symlink: 02
>>>>>> etc)
>>>>>>
>>>>>> For example, if a directory GFID is f4f18c02-0360-4cdc-8c00-0164e49a7afd
>>>>>> then, GFID2 will be 00f4f18c02-0360-4cdc-8c00-0164e49a7afd.
>>>>>>
>>>>>> Changes to Backend store
>>>>>> 
>>>>>> Existing: .glusterfs/gfid[0:2]/gfid/[2:4]/gfid
>>>>>> Proposed: .glusterfs/gfid2[0:2]/gfid2[2:4]/gfid2[4:6]/gfid2
>>>>>>
>>>>>> Advantages:
>>>>>> ---
>>>>>> - Automatic grouping in .glusterfs directory based on file Type.
>>>>>> - Easy identification of Type by looking at GFID in logs/status output
>>>>>>   etc.
>>
>>
>> Above two will be good enough points to bump up the priority for the feature.
>>  
>>>>>>
>>>>>> - Crawling(Quota/AFR): List of directories can be easily fetched by
>>>>>>   crawling `.glusterfs/gfid2[0:2]/` directory. This enables easy
>>>>>>   parallel Crawling.
>>
>>
>> With the current design, we still have to do a distributed readdir() to get all 
>> the entries in the directory. This layout change, along with proposed 
>> DHT2/EHT/DHT2+ (name for me doesn't matter here) layout, where directory 
>> entries would be created in just one place should enhance the performance overall.
>>  
>>>>>>
>>>>>> - Quota - Marker: Marker transator can mark xtime of current file and
>>>>>>   parent directory. No need to update xtime xattr of all directories
>>>>>>   till root.
>>>>>> - Geo-replication: - Crawl can be multithreaded during initial sync.
>>>>>>   With marker changes above it will be more effective in crawling.
>>>>>>
>>  
>>>>>>
>>>>>> Please add if any more advantageous.
>>>>>>
>>>>>> Disadvantageous:
>>>>>> 
>>>>>> Functionality is not changed with the above change except the length
>>>>>> of the ID. I can't think of any disadvantages except the code changes
>>>>>> to accommodate this change. Let me know if I missed anything here.
>>>>>
>>>>>
>>>>> One disadvantage is that 17 bytes is a very ugly number for
>>>>> structures. Compilers will add paddings that will make any structure
>>>>> containing a GFID noticeable bigger. This will also cause troubles on
>>>>> all binary formats where a GFID is used, making them incompatible. One
>>>>> clear case of this is the XDR encoding of the gluster protocol.
>>>>> Currently a GFID is defined this way in many places:
>>>>>
>>>>>         opaque gfid[16]
>>>>>
>>>>> This seems to make it quite complex to allow a mix of gluster versions
>>>>> in the same cluster (for example in a middle of an upgrade).
>>
>>
>> Totally agree with Xavier here. Not in support of adding one more byte.
>>  
>>>>>
>>>>>
>>>>> What about this alternative approach:
>>>>>
>>>>> Based on the RFC4122 [1] that describes the format of an UUID, we can
>>>>> define a new structure for new GFID's using the same length.
>>>>>
>>>>> Currently all GFID's are generated using the "random" method. This
>>>>> means that all GFID have this structure:
>>>>>
>>>>>         --Mxxx-Nxxx-
>>>>>
>>>>> Where N can be 8, 9, A or B, and M is 4.
>>>>>
>>&g

Re: [Gluster-devel] [DHT] The myth of two hops for linkto file resolution

2017-05-04 Thread Xavier Hernandez

Hi,

On 30/04/17 06:03, Raghavendra Gowdappa wrote:

All,

It's a common perception that the resolution of a file having a linkto 
file on the hashed-subvol requires two hops:

1. client to hashed-subvol.
2. client to the subvol where file actually resides.

While it is true that a fresh lookup behaves this way, the other fact 
that gets ignored is that fresh lookups on files are almost always 
prevented by readdirplus. Since readdirplus picks the dentry from the 
subvolume where the actual file (data-file) resides, the two-hop cost is 
most likely never witnessed by the application.


This is true for workloads that list directory contents before accessing 
the files, but there are other use cases that directly access the file 
without navigating through the file system. In this case fresh lookups 
are needed.


Xavi



A word of caution is that I've not done any testing to prove this observation 
:).

regards,
Raghavendra
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel





Re: [Gluster-devel] Pluggable interface for erasure coding?

2017-03-02 Thread Xavier Hernandez

Hi Niels,

On 02/03/17 07:58, Niels de Vos wrote:

Hi guys,

I think this is a topic/question that has come up before, but I can not
find any references or feature requests related to it. Because there are
different libraries for Erasure Coding, it would be interesting to be
able to select alternatives to the bundled implementation that Gluster
has.


I agree.


Are there any plans to make the current Erasure Coding
implementation more pluggable?


Yes. I've had this in my todo list for a long time. Once I even tried to 
implement the necessary infrastructure but didn't finish and now the 
code has changed too much to reuse it.



Would this be a possible feature request,
or would it require a major rewrite of the current interface?


At the time I tried it, it required major changes. Now that the code has 
been considerably restructured to incorporate the dynamic code 
generation feature, maybe it doesn't require so many changes, though I'm 
not sure.




Here at FAST [0] I have briefly spoken to Per Simonsen from MemoScale
[1]. This company offers a (proprietary) library for Erasure Coding,
optimized for different architectures, and with some unique(?) features
for recovering a failed fragment/disk. If Gluster allows alternative
implementations for the encoding, it would help organisations and
researchers get results for their work in a distributed filesystem.
And with that, spread the word about how easy Gluster is to adapt and
extend :-)


That could be interesting. Is there any place where I can find 
additional information about the features of this library ?


Xavi



Thanks,
Niels


0. https://www.usenix.org/conference/fast17
1. https://memoscale.com/



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] release-3.10: Final call for release notes updates

2017-02-20 Thread Xavier Hernandez

Hi Shyam,

I've added some comments [1] for the issue between disperse's dynamic 
code generator and SELinux. It assumes that [2] will be backported to 3.10.


Xavi

[1] https://review.gluster.org/16685
[2] https://review.gluster.org/16614

On 20/02/17 04:04, Shyam wrote:

Hi,

Please find the latest release notes for 3.10 here [1]

This mail is to request feature owners, or folks who have tested, to
update the release notes (by sending gerrit commits to the same) for any
updates that are desired (e.g. feature-related updates, known issues in
a feature, etc.).

The release notes serve as our first point of public-facing
documentation about what is in a release, so any and all feedback and
updates are welcome here.

The bug ID to use for updating the release notes would be [2]

Example release notes commits are at [3]

Thanks,
Shyam

[1] Current release notes:
https://github.com/gluster/glusterfs/blob/release-3.10/doc/release-notes/3.10.0.md


[2] Bug to use for release-notes updates:
https://bugzilla.redhat.com/show_bug.cgi?id=1417735

[3] Example release-note update:
https://review.gluster.org/#/q/topic:bug-1417735
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel




Re: [Gluster-devel] https://review.gluster.org/#/c/16643/

2017-02-20 Thread Xavier Hernandez

Hi Nithya,

I've merged it. However Vijay said in another email [1] that backports 
to 3.9 are not needed anymore.


Xavi

[1] 
http://lists.gluster.org/pipermail/gluster-devel/2017-February/052107.html


On 20/02/17 09:19, Nithya Balachandran wrote:

Hi,

Can this be merged ? This is holding up my 3.9 patch backports.

Regards,
Nithya


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel





[Gluster-devel] Reviews needed

2017-02-16 Thread Xavier Hernandez

Hi everyone,

I would need some reviews if you have some time:

A memory leak fix in fuse:
* Patch already merged in master and 3.10
* Backport to 3.9: https://review.gluster.org/16402
* Backport to 3.8: https://review.gluster.org/16403

A safe fallback for dynamic code generation in EC:
* Master: https://review.gluster.org/16614

A fix for incompatibilities with FreeBSD:
* Master: https://review.gluster.org/16417

A fix for FreeBSD's statvfs():
* Patch already merged in master
* Backport to 3.10: https://review.gluster.org/16631
* Backport to 3.9: https://review.gluster.org/16632
* Backport to 3.8: https://review.gluster.org/16634

I also have two reviews for 3.7 but I think it won't have any new 
releases, right ?


Thank you very much :)

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Release 3.10: Request fix status for RC1 tagging

2017-02-16 Thread Xavier Hernandez

Hi Shyam,

On 16/02/17 02:47, Shyam wrote:

Hi,

The 3.10 release tracker [1], shows 6 bugs needing a fix in 3.10. We
need to get RC1 out so that we can start tracking the same for a
potential release.

Request folks on these bugs to provide a date by when we can expect a
fix for these issues.

Request others to add any other bug to the tracker as appropriate.

Current bug list [2]:
  - 1415226: Kaleb/Niels do we need to do more for the python dependency
or is the last fix in?

  - 1417915: Vitaly Lipatov/Niels, I assume one of you would do the
backport for this one into 3.10

  - 1421590: Jeff, this needs a fix? Also, Samikshan can you provide
Jeff with a .t that can reproduce this (if possible)?

  - 1421649: Ashis/Niels when can we expect a fix to land for this?


I think this will require more thinking and participation from experts
on security and SELinux to come up with a good and clean solution. Not
sure if this can be done before the 3.10 release.


There is a workaround (set disperse.cpu-extensions = none) and a 
mitigating solution (patch https://review.gluster.org/16614) though.




  - 1421956: Xavi, I guess you would backport the fix on mainline once
that is merged into 3.10, right?


Yes. It's on review on master (https://review.gluster.org/16614). As 
soon as it's merged I'll backport it to 3.10.


Xavi



  - 1422363: Poornima, I am awaiting a merge of the same into mainline
and also an update of the commit message for the backport to 3.10,
before merging this into 3.10, request you to take care of the same.

Pranith, is a bug filed and added to the tracker for the mail below?
  -
http://lists.gluster.org/pipermail/maintainers/2017-February/002221.html

Thanks,
Shyam

[1] Tracker bug: https://bugzilla.redhat.com/show_bug.cgi?id=1416031

[2] Open bugs against the tracker:
https://bugzilla.redhat.com/buglist.cgi?quicksearch=1415226%201417915%201421590%201421649%201421956%201422363_id=7089913

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel




Re: [Gluster-devel] Creating new options for multiple gluster versions

2017-01-30 Thread Xavier Hernandez

Hi Atin,

On 31/01/17 05:45, Atin Mukherjee wrote:



On Mon, Jan 30, 2017 at 9:02 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:

Hi Atin,

On 30/01/17 15:25, Atin Mukherjee wrote:



On Mon, Jan 30, 2017 at 7:30 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:

Hi,

I'm wondering how a new option needs to be created to be
available
to different versions of gluster.

When a new option is created for 3.7 for example, it needs
to have a
GD_OP_VERSION referencing the next 3.7 release. This ensures
that
there won't be any problem with previous versions.

However what happens with 3.8 ?

3.8.0 is greater than any 3.7.x, however the new option won't be
available until the next 3.8 release. How this needs to be
handled ?


I'd discourage backporting any new volume options from mainline to the
stable release branches like 3.7 & 3.8. This creates a lot of
backward-compatibility issues w.r.t. clients. Any new option is actually
an RFE and is supposed to be slated only for upcoming releases.


Even if it's needed to solve an issue in all versions ?

For example, a hardcoded timeout has been seen to be insufficient in some
configurations, so it needs to be increased, but increasing it would
be too much for many of the environments where the current timeout
has worked fine. It could even be insufficient for other environments
not yet tried, needing a further increase in the future.

With a new option, this can be solved case by case and only when needed.

How can this be solved ?


Hi Xavi,

Let me try to explain this in a bit of detail. A new option with an
op-version of, say, 30721 (considering 3.7.21 is the next update of 3.7,
which is the oldest active branch) is introduced in mainline and then the
same is backported to the 3.7 branch (slated for 3.7.21) & the 3.8 branch
(slated for 3.8.9). Now say a user forms a cluster of three nodes with
gluster versions 3.7.21, 3.8.9 & 3.8.8 respectively and tries to set this
option: the volume set would always fail, as in 3.8.8 this option is not
defined. Also, any client running a 3.8 version would see a
compatibility issue here. And the op-version number of the new option
has to be the same across the different release branches.
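To make the failure concrete, here is a small illustrative sketch (invented names only, not the real glusterd option table or code) of the two checks involved: a node must know the key at all, and the cluster-wide op-version must be at least the option's op-version. A 3.8.8 node fails the first check, so the whole volume set fails:

#include <stdio.h>
#include <string.h>

/* Illustrative only: not the real glusterd structures or values. */
struct vol_option {
    const char   *key;
    unsigned int  op_version;   /* minimum cluster op-version for this key */
};

/* A 3.7.21 or 3.8.9 node would carry both entries; a 3.8.8 node would not
 * have "cluster.new-option" in its table at all, so it rejects the key. */
static const struct vol_option known_options[] = {
    { "cluster.old-option", 30700 },
    { "cluster.new-option", 30721 },
};

static int can_set(const char *key, unsigned int cluster_op_version)
{
    for (size_t i = 0; i < sizeof(known_options) / sizeof(known_options[0]); i++) {
        if (strcmp(known_options[i].key, key) == 0)
            return cluster_op_version >= known_options[i].op_version;
    }
    return 0;   /* unknown key on this node: the whole volume set fails */
}

int main(void)
{
    /* The cluster-wide op-version is held down by the oldest node. */
    printf("%d\n", can_set("cluster.new-option", 30712));   /* 0: rejected */
    printf("%d\n", can_set("cluster.new-option", 30721));   /* 1: accepted */
    return 0;
}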


Thanks for the explanation. This confirms what I already thought. So the
question is: now that 3.10 has already been branched, does it mean that
any new option won't be available for LTS users until 3.12 is released?
I think this is not acceptable, especially for changes intended to fix an
issue rather than to introduce new features.




With the current form of op-version management, I don't think this can
be solved, the only way is to ask users to upgrade to the latest.


As I said, someone using 3.10 LTS won't be able to upgrade until 3.12 is 
released. What would we say to them when we add a new option to 3.11 ?


Maybe we should add a new kind of option that causes no failure if it is
not recognized: it is simply ignored. Many options do not cause any
visible functional change, so they could be defined even if some nodes
of the cluster don't recognize them (for example performance-improvement
options or some timeout values).
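A rough sketch of that idea (hypothetical names and mechanism; nothing like this exists in gluster today): since an old node cannot look up a key it doesn't know, the "safe to ignore" property would have to travel with the set request itself:

#include <stdio.h>

/* Hypothetical sketch only (invented names; nothing like this exists in
 * gluster today). */
struct set_request {
    const char *key;
    const char *value;
    int         ignorable;   /* sender marks options that may be dropped */
};

/* 1 = applied, 0 = silently ignored, -1 = reject the whole volume set. */
static int handle_set(const struct set_request *req, int key_is_known)
{
    if (key_is_known)
        return 1;             /* normal path: apply the option            */
    if (req->ignorable)
        return 0;             /* unknown but harmless on this node: drop  */
    return -1;                /* unknown and mandatory: fail, as today    */
}

int main(void)
{
    struct set_request req = { "performance.some-new-timeout", "30", 1 };
    printf("old node: %d\n", handle_set(&req, 0));   /* 0: ignored */
    printf("new node: %d\n", handle_set(&req, 1));   /* 1: applied */
    return 0;
}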


Anyway, with the introduction of brick multiplexing, I think we should 
reconsider how options are negotiated. I think that the op-version 
infrastructure is not enough to handle multiplexed bricks because we 
won't be able to upgrade a single volume that needs some new feature. We 
would need to upgrade all volumes, including the clients, even if they 
do not need anything new.


What do you think ?

Xavi



Kaushal - if you have any other ideas, please suggest.


Thanks,

Xavi




Thanks,

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel




--

~ Atin (atinm)





--

~ Atin (atinm)


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious regression failure? tests/basic/ec/ec-background-heals.t

2017-01-26 Thread Xavier Hernandez

Hi Atin,

I don't clearly see what the problem is. Even if the truncate causes a
dirty flag to be set, it should eventually be removed before the
$HEAL_TIMEOUT expires.


For now I've marked the test as bad.

Patch is: https://review.gluster.org/16470

Xavi

On 25/01/17 17:24, Atin Mukherjee wrote:

Can we please address this as early as possible? My patch has hit this
failure in 3 out of 4 recheck attempts now. I'm guessing some recent
change has caused it.

On Wed, 25 Jan 2017 at 12:10, Ashish Pandey wrote:


Pranith,

In this test, tests/basic/ec/ec-background-heals.t, I think line
number 86 is actually creating a heal entry instead of
helping the data heal quickly. What if all the data was already healed
at that moment: the truncate came and in pre-op set the dirty flag, and
at the end, as part of the heal, the dirty flag was unset only on the
previously good bricks, while the brick which acted as heal-sink still
has the dirty flag set by the truncate.
That is why we are only seeing "1" as get_pending_heal_count. If a
file was actually not healed it should be "2".
If the heal on this file completes and the dirty flag is unset
before the truncate, everything will be fine.

I think we can wait for the file to heal without the truncate?

 71 #Test that disabling background-heals still drains the queue
 72 TEST $CLI volume set $V0 disperse.background-heals 1
 73 TEST touch $M0/{a,b,c,d}
 74 TEST kill_brick $V0 $H0 $B0/${V0}2
 75 EXPECT_WITHIN $CONFIG_UPDATE_TIMEOUT "1" mount_get_option_value
$M0 $V0-disperse-0 background-heals
 76 EXPECT_WITHIN $CONFIG_UPDATE_TIMEOUT "200"
mount_get_option_value $M0 $V0-disperse-0 heal-wait-qlength
 77 TEST truncate -s 1GB $M0/a
 78 echo abc > $M0/b
 79 echo abc > $M0/c
 80 echo abc > $M0/d
 81 TEST $CLI volume start $V0 force
 82 EXPECT_WITHIN $CHILD_UP_TIMEOUT "3" ec_child_up_count $V0 0
 83 TEST chown root:root $M0/{a,b,c,d}
 84 TEST $CLI volume set $V0 disperse.background-heals 0
 85 EXPECT_NOT "0" mount_get_option_value $M0 $V0-disperse-0
heal-waiters

 86 TEST truncate -s 0 $M0/a # This completes the heal fast ;-) <<<

 87 EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0


Ashish






*From: *"Raghavendra Gowdappa" >
*To: *"Nithya Balachandran" >
*Cc: *"Gluster Devel" >, "Pranith Kumar Karampuri"
>, "Ashish Pandey"
>
*Sent: *Wednesday, January 25, 2017 9:41:38 AM
*Subject: *Re: [Gluster-devel] Spurious regression
failure?tests/basic/ec/ec-background-heals.t


Found another failure on same test:
https://build.gluster.org/job/centos6-regression/2874/consoleFull

- Original Message -
> From: "Nithya Balachandran" >
> To: "Gluster Devel" >, "Pranith Kumar Karampuri"
>, "Ashish Pandey"
> >
> Sent: Tuesday, January 24, 2017 9:16:31 AM
> Subject: [Gluster-devel] Spurious regression
failure?tests/basic/ec/ec-background-heals.t
>
> Hi,
>
>
> Can you please take a look at
> https://build.gluster.org/job/centos6-regression/2859/console ?
>
> tests/basic/ec/ec-background-heals.t has failed.
>
> Thanks,
> Nithya
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org 
> http://lists.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

--
- Atin (atinm)


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel





Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse volume

2017-01-23 Thread Xavier Hernandez

Hi Ram,

On 23/01/17 09:29, Ankireddypalle Reddy wrote:

Xavi,
   For each of the files that were reported by self-heal I have
verified that size/version indeed do not match. How would the patch help in
this case?


It was only to remove noise.

I've been looking at the log and I've seen the following message:

D [MSGID: 0] [ec-heal.c:2275:ec_heal_data] 0-glusterfsProd-disperse-8: 
b331cbc0-bf99-4267-be2d-0686e467ff9d: Skipping heal as only 0 number of 
subvolumes could be locked


The most probable cause of this message is that another server is doing 
the self-heal of the file. In that case the reported error is ENOTCONN 
(maybe not the best option, but this explains where this error comes from).


Can you look at the self-heal logs of the other servers (or mount 
points) to see if there's something different related to self-heal ?


Xavi



Thanks and Regards,
Ram

Sent from my iPhone


On Jan 23, 2017, at 3:11 AM, Xavier Hernandez <xhernan...@datalab.es> wrote:

Hi Ram,


On 20/01/17 21:06, Ankireddypalle Reddy wrote:
Attachments (2):

1



glustershd.log
<https://imap.commvault.com/webconsole/embedded.do?url=https://imap.commvault.com/webconsole/api/drive/publicshare/346714/file/bf5c8c5cfa18417d813fd7cd50372165/action/preview=https://imap.commvault.com/webconsole/api/contentstore/publicshare/346714/file/bf5c8c5cfa18417d813fd7cd50372165/action/download>
[Download]
<https://imap.commvault.com/webconsole/api/contentstore/publicshare/346714/file/bf5c8c5cfa18417d813fd7cd50372165/action/download>(7.34
MB)

2



heal.txt
<https://imap.commvault.com/webconsole/embedded.do?url=https://imap.commvault.com/webconsole/api/drive/publicshare/346714/file/7e02ac52561a490980c3fc23030181bd/action/preview=https://imap.commvault.com/webconsole/api/contentstore/publicshare/346714/file/7e02ac52561a490980c3fc23030181bd/action/download>
[Download]
<https://imap.commvault.com/webconsole/api/contentstore/publicshare/346714/file/7e02ac52561a490980c3fc23030181bd/action/download>(4.72
KB)

Xavi/Ashish,
  Thanks for checking the issue.  For now I am
first focusing on trying to understand why the heals are failing. I
debugged further and had to make the following change in the code. Heals
were failing with EINVAL as the gf_uuid_is_null check was failing in
ec_heal.

+++ b/xlators/cluster/ec/src/ec-heald.c
@@ -145,7 +145,7 @@ ec_shd_inode_find (xlator_t *this, xlator_t *subvol,
-gf_uuid_copy (loc.gfid, gfid);
+gf_uuid_copy (loc.inode->gfid, gfid);


This change is not correct. The just created inode will be filled by a 
successful lookup. If it's not correctly filled, it means something is failing 
there.



After making the changes heals have now started
failing with error code ENOTCONN.  Manually triggering the heal or
self-heal  always shows the same files. I then enabled TRACE level
logging for a while and collected the logs.

The message "W [MSGID: 122056] [ec-combine.c:875:ec_combine_check]
0-glusterfsProd-disperse-1: Mismatching xdata in answers of 'LOOKUP' for
483d1b8a-bc3d-4c7b-8239-076df42465d4" repeated 3 times between
[2017-01-20 19:57:04.083827] and [2017-01-20 19:58:04.055500]
[2017-01-20 19:58:04.103848] W [MSGID: 122002]
[ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-0: Heal failed
[Transport endpoint is not connected]
[2017-01-20 19:58:04.109726] W [MSGID: 122053]
[ec-common.c:117:ec_check_status] 0-glusterfsProd-disperse-0: Operation
LOOKUP failed on some subvolumes for
d4198d6c-345e-4f6b-a511-c81ee767c80c (up=7, mask=7, remaining=0, good=3,
bad=4)
[2017-01-20 19:58:04.110554] W [MSGID: 122002]
[ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-1: Heal failed
[Transport endpoint is not connected]


Could you apply patch https://review.gluster.org/16435/ ? I think you are 
seeing a lot of false positive errors that are adding a lot of noise to the 
real problem.




   Please find attached the trace logs and heal info output.


I'll examine the logs to see if there's something, but the previous patch will 
help a lot.

Xavi



Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Friday, January 20, 2017 3:05 AM
To: Ankireddypalle Reddy; Ashish Pandey
Cc: gluster-us...@gluster.org; Gluster Devel (gluster-devel@gluster.org)
Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in
disperse volume


On 20/01/17 08:55, Ankireddypalle Reddy wrote:
Xavi,
  Thanks.  Please let me know the functions that we need to

track for any inconsistencies in the return codes from multiple bricks
for issue 1. I will start doing that.


 1. Why the write fails in first place


The best way would be to see the logs. Related functions already log
messages when this happens.

In ec_check_status() there's a message logged if something has failed,
but before that there should also be some error messages indicating the reason of the failure.

Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse volume

2017-01-23 Thread Xavier Hernandez

Hi Ram,

On 20/01/17 21:06, Ankireddypalle Reddy wrote:

Attachments (2):

1



glustershd.log
<https://imap.commvault.com/webconsole/embedded.do?url=https://imap.commvault.com/webconsole/api/drive/publicshare/346714/file/bf5c8c5cfa18417d813fd7cd50372165/action/preview=https://imap.commvault.com/webconsole/api/contentstore/publicshare/346714/file/bf5c8c5cfa18417d813fd7cd50372165/action/download>
[Download]
<https://imap.commvault.com/webconsole/api/contentstore/publicshare/346714/file/bf5c8c5cfa18417d813fd7cd50372165/action/download>(7.34
MB)

2



heal.txt
<https://imap.commvault.com/webconsole/embedded.do?url=https://imap.commvault.com/webconsole/api/drive/publicshare/346714/file/7e02ac52561a490980c3fc23030181bd/action/preview=https://imap.commvault.com/webconsole/api/contentstore/publicshare/346714/file/7e02ac52561a490980c3fc23030181bd/action/download>
[Download]
<https://imap.commvault.com/webconsole/api/contentstore/publicshare/346714/file/7e02ac52561a490980c3fc23030181bd/action/download>(4.72
KB)

Xavi/Ashish,
   Thanks for checking the issue.  For now I am
first focusing on trying to understand why the heals are failing. I
debugged further and had to make the following change in the code. Heals
were failing with EINVAL as the gf_uuid_is_null check was failing in
ec_heal.

+++ b/xlators/cluster/ec/src/ec-heald.c
@@ -145,7 +145,7 @@ ec_shd_inode_find (xlator_t *this, xlator_t *subvol,
-gf_uuid_copy (loc.gfid, gfid);
+gf_uuid_copy (loc.inode->gfid, gfid);


This change is not correct. The just-created inode will be filled in by a
successful lookup. If it's not correctly filled, it means something is
failing there.




 After making the changes heals have now started
failing with error code ENOTCONN.  Manually triggering the heal or
self-heal  always shows the same files. I then enabled TRACE level
logging for a while and collected the logs.

The message "W [MSGID: 122056] [ec-combine.c:875:ec_combine_check]
0-glusterfsProd-disperse-1: Mismatching xdata in answers of 'LOOKUP' for
483d1b8a-bc3d-4c7b-8239-076df42465d4" repeated 3 times between
[2017-01-20 19:57:04.083827] and [2017-01-20 19:58:04.055500]
[2017-01-20 19:58:04.103848] W [MSGID: 122002]
[ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-0: Heal failed
[Transport endpoint is not connected]
 [2017-01-20 19:58:04.109726] W [MSGID: 122053]
[ec-common.c:117:ec_check_status] 0-glusterfsProd-disperse-0: Operation
LOOKUP failed on some subvolumes for
d4198d6c-345e-4f6b-a511-c81ee767c80c (up=7, mask=7, remaining=0, good=3,
bad=4)
[2017-01-20 19:58:04.110554] W [MSGID: 122002]
[ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-1: Heal failed
[Transport endpoint is not connected]


Could you apply patch https://review.gluster.org/16435/ ? I think you 
are seeing a lot of false positive errors that are adding a lot of noise 
to the real problem.





Please find attached the trace logs and heal info output.


I'll examine the logs to see if there's something, but the previous 
patch will help a lot.


Xavi



Thanks and Regards,
Ram

-----Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Friday, January 20, 2017 3:05 AM
To: Ankireddypalle Reddy; Ashish Pandey
Cc: gluster-us...@gluster.org; Gluster Devel (gluster-devel@gluster.org)
Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in
disperse volume

On 20/01/17 08:55, Ankireddypalle Reddy wrote:

Xavi,
   Thanks.  Please let me know the functions that we need to

track for any inconsistencies in the return codes from multiple bricks
for issue 1. I will start doing that.


  1. Why the write fails in first place


The best way would be to see the logs. Related functions already log
messages when this happens.

In ec_check_status() there's a message logged if something has failed,
but before that there should also be some error messages indicating the
reason of the failure.

Please, note that some of the errors logged by ec_check_status() are not
real problems. See patch http://review.gluster.org/16435/ for more info.

Xavi



Thanks and Regards,
Ram

-----Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Friday, January 20, 2017 2:41 AM
To: Ankireddypalle Reddy; Ashish Pandey
Cc: gluster-us...@gluster.org; Gluster Devel (gluster-devel@gluster.org)
Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in

disperse volume


Hi Ram,

On 20/01/17 08:02, Ankireddypalle Reddy wrote:

Ashish,

 Thanks for looking in to the issue. In the given
example the size/version matches for file on glusterfs4 and glusterfs5
nodes. The file is empty on glusterfs6. Now what happens if glusterfs5
goes down. Though the SLA factor of 2 is met  still I will not be able
to access the data.


True, but having a brick with inconsistent data is the same as having it
down. You would have lost 2 bricks out of 3.

Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse volume

2017-01-20 Thread Xavier Hernandez

On 20/01/17 08:55, Ankireddypalle Reddy wrote:

Xavi,
   Thanks.  Please let me know the functions that we need to track for 
any inconsistencies in the return codes from multiple bricks for issue 1. I 
will start doing that.

  1. Why the write fails in first place


The best way would be to see the logs. Related functions already log 
messages when this happens.


In ec_check_status() there's a message logged if something has failed, 
but before that there should also be some error messages indicating the 
reason of the failure.


Please, note that some of the errors logged by ec_check_status() are not 
real problems. See patch http://review.gluster.org/16435/ for more info.


Xavi



Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Friday, January 20, 2017 2:41 AM
To: Ankireddypalle Reddy; Ashish Pandey
Cc: gluster-us...@gluster.org; Gluster Devel (gluster-devel@gluster.org)
Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse 
volume

Hi Ram,

On 20/01/17 08:02, Ankireddypalle Reddy wrote:

Ashish,

 Thanks for looking in to the issue. In the given
example the size/version matches for file on glusterfs4 and glusterfs5
nodes. The file is empty on glusterfs6. Now what happens if glusterfs5
goes down. Though the SLA factor of 2 is met  still I will not be able
to access the data.


True, but having a brick with inconsistent data is the same as having it
down. You would have lost 2 bricks out of 3.

The problem is how to detect the inconsistent data, what is causing it
and why self-heal (apparently) is not healing it.


The problem is that writes did not fail for the file
to indicate the issue to the application.


That's the expected behavior. Since we have redundancy 1, a loss or
failure on a single fragment is hidden from the application. However,
this triggers internal procedures to repair the problem in the
background. This also seems not to be working.

There are two issues to identify here:

1. Why the write fails in the first place
2. Why self-heal is unable to repair it

Probably the root cause is the same for both problems, but I'm not sure.

For 1 there should be some warning or error in the mount log, since one
of the bricks is reporting an error, even if ec is able to report
success to the application later.

For 2 the analysis could be more complex, but most probably there should
be some warning or error message in the mount log and/or self-heal log
of one of the servers.

Xavi


 It’s not that we are
encountering the issue for every file on the mount point. The issue happens
randomly for different files.



[root@glusterfs4 glusterfs]# gluster volume info



Volume Name: glusterfsProd

Type: Distributed-Disperse

Volume ID: 622aa3ee-958f-485f-b3c1-fb0f6c8db34c

Status: Started

Number of Bricks: 12 x (2 + 1) = 36

Transport-type: tcp

Bricks:

Brick1: glusterfs4sds:/ws/disk1/ws_brick

Brick2: glusterfs5sds:/ws/disk1/ws_brick

Brick3: glusterfs6sds:/ws/disk1/ws_brick

Brick4: glusterfs4sds:/ws/disk10/ws_brick

Brick5: glusterfs5sds:/ws/disk10/ws_brick

Brick6: glusterfs6sds:/ws/disk10/ws_brick

Brick7: glusterfs4sds:/ws/disk11/ws_brick

Brick8: glusterfs5sds:/ws/disk11/ws_brick

Brick9: glusterfs6sds:/ws/disk11/ws_brick

Brick10: glusterfs4sds:/ws/disk2/ws_brick

Brick11: glusterfs5sds:/ws/disk2/ws_brick

Brick12: glusterfs6sds:/ws/disk2/ws_brick

Brick13: glusterfs4sds:/ws/disk3/ws_brick

Brick14: glusterfs5sds:/ws/disk3/ws_brick

Brick15: glusterfs6sds:/ws/disk3/ws_brick

Brick16: glusterfs4sds:/ws/disk4/ws_brick

Brick17: glusterfs5sds:/ws/disk4/ws_brick

Brick18: glusterfs6sds:/ws/disk4/ws_brick

Brick19: glusterfs4sds:/ws/disk5/ws_brick

Brick20: glusterfs5sds:/ws/disk5/ws_brick

Brick21: glusterfs6sds:/ws/disk5/ws_brick

Brick22: glusterfs4sds:/ws/disk6/ws_brick

Brick23: glusterfs5sds:/ws/disk6/ws_brick

Brick24: glusterfs6sds:/ws/disk6/ws_brick

Brick25: glusterfs4sds:/ws/disk7/ws_brick

Brick26: glusterfs5sds:/ws/disk7/ws_brick

Brick27: glusterfs6sds:/ws/disk7/ws_brick

Brick28: glusterfs4sds:/ws/disk8/ws_brick

Brick29: glusterfs5sds:/ws/disk8/ws_brick

Brick30: glusterfs6sds:/ws/disk8/ws_brick

Brick31: glusterfs4sds:/ws/disk9/ws_brick

Brick32: glusterfs5sds:/ws/disk9/ws_brick

Brick33: glusterfs6sds:/ws/disk9/ws_brick

Brick34: glusterfs4sds:/ws/disk12/ws_brick

Brick35: glusterfs5sds:/ws/disk12/ws_brick

Brick36: glusterfs6sds:/ws/disk12/ws_brick

Options Reconfigured:

storage.build-pgfid: on

performance.readdir-ahead: on

nfs.export-dirs: off

nfs.export-volumes: off

nfs.disable: on

auth.allow: glusterfs4sds,glusterfs5sds,glusterfs6sds

diagnostics.client-log-level: INFO

[root@glusterfs4 glusterfs]#



Thanks and Regards,

Ram

*From:*Ashish Pandey [mailto:aspan...@redhat.com]
*Sent:* Thursday, January 19, 2017 10:36 PM
*To:* Ankireddypalle Reddy
*Cc:* Xavier Hernandez; gluster-us...@gluster.org; Gluster Devel
(gluster-devel@gluster.org)
*Subject:* Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse volume

2017-01-19 Thread Xavier Hernandez

Hi Ram,

On 20/01/17 08:02, Ankireddypalle Reddy wrote:

Ashish,

 Thanks for looking in to the issue. In the given
example the size/version matches for file on glusterfs4 and glusterfs5
nodes. The file is empty on glusterfs6. Now what happens if glusterfs5
goes down. Though the SLA factor of 2 is met  still I will not be able
to access the data.


True, but having a brick with inconsistent data is the same as having it 
down. You would have lost 2 bricks out of 3.


The problem is how to detect the inconsistent data, what is causing it 
and why self-heal (apparently) is not healing it.



The problem is that writes did not fail for the file
to indicate the issue to the application.


That's the expected behavior. Since we have redundancy 1, a loss or
failure on a single fragment is hidden from the application. However,
this triggers internal procedures to repair the problem in the
background. This also seems not to be working.


There are two issues to identify here:

1. Why the write fails in the first place
2. Why self-heal is unable to repair it

Probably the root cause is the same for both problems, but I'm not sure.

For 1 there should be some warning or error in the mount log, since one 
of the bricks is reporting an error, even if ec is able to report 
success to the application later.


For 2 the analysis could be more complex, but most probably there should 
be some warning or error message in the mount log and/or self-heal log 
of one of the servers.


Xavi


 It’s not that we are
encountering the issue for every file on the mount point. The issue happens
randomly for different files.



[root@glusterfs4 glusterfs]# gluster volume info



Volume Name: glusterfsProd

Type: Distributed-Disperse

Volume ID: 622aa3ee-958f-485f-b3c1-fb0f6c8db34c

Status: Started

Number of Bricks: 12 x (2 + 1) = 36

Transport-type: tcp

Bricks:

Brick1: glusterfs4sds:/ws/disk1/ws_brick

Brick2: glusterfs5sds:/ws/disk1/ws_brick

Brick3: glusterfs6sds:/ws/disk1/ws_brick

Brick4: glusterfs4sds:/ws/disk10/ws_brick

Brick5: glusterfs5sds:/ws/disk10/ws_brick

Brick6: glusterfs6sds:/ws/disk10/ws_brick

Brick7: glusterfs4sds:/ws/disk11/ws_brick

Brick8: glusterfs5sds:/ws/disk11/ws_brick

Brick9: glusterfs6sds:/ws/disk11/ws_brick

Brick10: glusterfs4sds:/ws/disk2/ws_brick

Brick11: glusterfs5sds:/ws/disk2/ws_brick

Brick12: glusterfs6sds:/ws/disk2/ws_brick

Brick13: glusterfs4sds:/ws/disk3/ws_brick

Brick14: glusterfs5sds:/ws/disk3/ws_brick

Brick15: glusterfs6sds:/ws/disk3/ws_brick

Brick16: glusterfs4sds:/ws/disk4/ws_brick

Brick17: glusterfs5sds:/ws/disk4/ws_brick

Brick18: glusterfs6sds:/ws/disk4/ws_brick

Brick19: glusterfs4sds:/ws/disk5/ws_brick

Brick20: glusterfs5sds:/ws/disk5/ws_brick

Brick21: glusterfs6sds:/ws/disk5/ws_brick

Brick22: glusterfs4sds:/ws/disk6/ws_brick

Brick23: glusterfs5sds:/ws/disk6/ws_brick

Brick24: glusterfs6sds:/ws/disk6/ws_brick

Brick25: glusterfs4sds:/ws/disk7/ws_brick

Brick26: glusterfs5sds:/ws/disk7/ws_brick

Brick27: glusterfs6sds:/ws/disk7/ws_brick

Brick28: glusterfs4sds:/ws/disk8/ws_brick

Brick29: glusterfs5sds:/ws/disk8/ws_brick

Brick30: glusterfs6sds:/ws/disk8/ws_brick

Brick31: glusterfs4sds:/ws/disk9/ws_brick

Brick32: glusterfs5sds:/ws/disk9/ws_brick

Brick33: glusterfs6sds:/ws/disk9/ws_brick

Brick34: glusterfs4sds:/ws/disk12/ws_brick

Brick35: glusterfs5sds:/ws/disk12/ws_brick

Brick36: glusterfs6sds:/ws/disk12/ws_brick

Options Reconfigured:

storage.build-pgfid: on

performance.readdir-ahead: on

nfs.export-dirs: off

nfs.export-volumes: off

nfs.disable: on

auth.allow: glusterfs4sds,glusterfs5sds,glusterfs6sds

diagnostics.client-log-level: INFO

[root@glusterfs4 glusterfs]#



Thanks and Regards,

Ram

*From:*Ashish Pandey [mailto:aspan...@redhat.com]
*Sent:* Thursday, January 19, 2017 10:36 PM
*To:* Ankireddypalle Reddy
*Cc:* Xavier Hernandez; gluster-us...@gluster.org; Gluster Devel
(gluster-devel@gluster.org)
*Subject:* Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in
disperse volume



Ram,



I don't understand what you mean by saying "redundancy factor of 2 is
met in a 3:1 disperse volume".
You have given the xattr's of only 3 bricks.

The above two sentences and the output of getxattr contradict each other.

In the given scenario, if you have a (2+1) ec configuration and 2 bricks
have the same size and version, then there should not be

any problem accessing this file. Run heal and the 3rd fragment will also
be healthy.



I think there has been a major gap in providing complete and correct
information about the volume and all the logs and activities.
Could you please provide the following -

1 - gluster v info - please give us the output of this command

2 - Let's consider only one file which you are not able to access and
find out the reason.

3 - Try to create and write some files on the mount point and see if there
is any issue with new file creation; if yes, explain why and provide logs.



Specific but enough information is required to fin

Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse volume

2017-01-16 Thread Xavier Hernandez

Hi Ram,

On 16/01/17 12:33, Ankireddypalle Reddy wrote:

Xavi,
  Thanks. Is there any other way to map from GFID to path?


The only way I know is to search all files from the bricks and look up
the trusted.gfid xattr.
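For illustration, a minimal standalone check using the Linux xattr API (the GFID below is the one discussed later in this thread; walking every file of a brick, e.g. with nftw(), is left out for brevity): it reads the 16-byte trusted.gfid xattr of a file and compares it against the GFID being searched for. Reading trusted.* xattrs normally requires root.

#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

/* Compare a file's trusted.gfid xattr (16 raw bytes) against a target GFID.
 * Returns 1 on match, 0 on mismatch or if the xattr cannot be read. */
static int gfid_matches(const char *path, const unsigned char want[16])
{
    unsigned char gfid[16];
    ssize_t len = getxattr(path, "trusted.gfid", gfid, sizeof(gfid));
    return len == 16 && memcmp(gfid, want, 16) == 0;
}

int main(int argc, char *argv[])
{
    /* 60b990ed-d741-4176-9c7b-4d3a25fb8252, the GFID discussed in this thread */
    static const unsigned char want[16] = {
        0x60, 0xb9, 0x90, 0xed, 0xd7, 0x41, 0x41, 0x76,
        0x9c, 0x7b, 0x4d, 0x3a, 0x25, 0xfb, 0x82, 0x52
    };
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file-on-brick>\n", argv[0]);
        return 1;
    }
    printf("%s: %s\n", argv[1], gfid_matches(argv[1], want) ? "match" : "no match");
    return 0;
}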



I will look for a way to share the TRACE logs. An easier way might be to add
some extra logging. I could do that if you could let me know the functions in
which you are interested.


The problem is that I don't know where the problem is. One possibility 
could be to track all return values from all bricks for all writes and 
then identify which ones belong to an inconsistent file.


But if this doesn't reveal anything interesting we'll need to look at 
some other place. And this can be very tedious and slow.


Anyway, what we are looking at now is not the source of an EIO, since there
are two bricks with a consistent state and the file should be perfectly
readable and writable. It's true that there's some problem here and it
could result in EIO if one of the healthy bricks degrades, but at least
this file shouldn't be giving EIO errors for now.


Xavi



Sent on from my iPhone


On Jan 16, 2017, at 6:23 AM, Xavier Hernandez <xhernan...@datalab.es> wrote:

Hi Ram,


On 13/01/17 18:41, Ankireddypalle Reddy wrote:
Xavi,
I enabled TRACE logging. The log grew up to 120GB and could not 
make much out of it. Then I started logging GFID in the code where we were 
seeing errors.

[2017-01-13 17:02:01.761349] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bc690 
((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761360] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 
((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761365] W [MSGID: 122056] 
[ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-13 17:02:01.761405] I [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-0: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-13 17:02:01.761417] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bbb14 
((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761428] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 
((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761433] W [MSGID: 122056] 
[ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-13 17:02:01.761442] W [MSGID: 122006] 
[ec-combine.c:214:ec_iatt_combine] 0-glusterfsProd-disperse-0: Failed to 
combine iatt (inode: 11275691004192850514-11275691004192850514, gfid: 
60b990ed-d741-4176-9c7b-4d3a25fb8252  -  60b990ed-d741-4176-9c7b-4d3a25fb8252,  
links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0,size: 406650880-406683648, mode: 
100775-100775)

The file for which we are seeing this error turns out to be having a GFID of 
60b990ed-d741-4176-9c7b-4d3a25fb8252

Then I tried looking for find out the file with this GFID. It pointed me to 
following path. I was expecting a real file system path from the following 
turorial:
https://gluster.readthedocs.io/en/latest/Troubleshooting/gfid-to-path/


I think this method only works if bricks have the inode cached.



getfattr -n trusted.glusterfs.pathinfo -e text 
/mnt/gfid/.gfid/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: mnt/gfid/.gfid/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.glusterfs.pathinfo="( ( 
<POSIX(/ws/disk1/ws_brick):glusterfs6:/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252>
 
<POSIX(/ws/disk1/ws_brick):glusterfs5:/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252>))"

Then I looked for the xatttrs for these files from all the 3 bricks

[root@glusterfs4 glusterfs]# getfattr -d -m . -e hex 
/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.bit-rot.version=0x02005877a8dc00041138
trusted.ec.config=0x080301000200
trusted.ec.size=0x
trusted.ec.version=0x2a38
trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252

[root@glusterfs5 bricks]# getfattr -d -m . -e hex 
/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.bit-rot.version=0x02005877a8dc000c92d0
trusted.ec.config=0x080301000200
trusted.ec.dirty=0x0016
trusted.ec.size=0x000

Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse volume

2017-01-16 Thread Xavier Hernandez

Hi Ram,

On 13/01/17 18:41, Ankireddypalle Reddy wrote:

Xavi,
 I enabled TRACE logging. The log grew up to 120GB and could not 
make much out of it. Then I started logging GFID in the code where we were 
seeing errors.

[2017-01-13 17:02:01.761349] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bc690 
((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761360] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 
((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761365] W [MSGID: 122056] 
[ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-13 17:02:01.761405] I [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-0: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-13 17:02:01.761417] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bbb14 
((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761428] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 
((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761433] W [MSGID: 122056] 
[ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-13 17:02:01.761442] W [MSGID: 122006] 
[ec-combine.c:214:ec_iatt_combine] 0-glusterfsProd-disperse-0: Failed to 
combine iatt (inode: 11275691004192850514-11275691004192850514, gfid: 
60b990ed-d741-4176-9c7b-4d3a25fb8252  -  60b990ed-d741-4176-9c7b-4d3a25fb8252,  
links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0,size: 406650880-406683648, mode: 
100775-100775)

The file for which we are seeing this error turns out to be having a GFID of 
60b990ed-d741-4176-9c7b-4d3a25fb8252

Then I tried looking for find out the file with this GFID. It pointed me to 
following path. I was expecting a real file system path from the following 
turorial:
https://gluster.readthedocs.io/en/latest/Troubleshooting/gfid-to-path/


I think this method only works if bricks have the inode cached.



getfattr -n trusted.glusterfs.pathinfo -e text 
/mnt/gfid/.gfid/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: mnt/gfid/.gfid/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.glusterfs.pathinfo="( ( 
<POSIX(/ws/disk1/ws_brick):glusterfs6:/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252>
 
<POSIX(/ws/disk1/ws_brick):glusterfs5:/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252>))"

Then I looked for the xatttrs for these files from all the 3 bricks

[root@glusterfs4 glusterfs]# getfattr -d -m . -e hex 
/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.bit-rot.version=0x02005877a8dc00041138
trusted.ec.config=0x080301000200
trusted.ec.size=0x
trusted.ec.version=0x2a38
trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252

[root@glusterfs5 bricks]# getfattr -d -m . -e hex 
/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.bit-rot.version=0x02005877a8dc000c92d0
trusted.ec.config=0x080301000200
trusted.ec.dirty=0x0016
trusted.ec.size=0x306b
trusted.ec.version=0x2a382a38
trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252

[root@glusterfs6 ee]# getfattr -d -m . -e hex 
/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.bit-rot.version=0x02005877a8dc000c9436
trusted.ec.config=0x080301000200
trusted.ec.dirty=0x0016
trusted.ec.size=0x306b
trusted.ec.version=0x2a382a38
trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252

It turns out that the size and version in fact do not match for one of the
files.


It seems as if the brick on glusterfs4 didn't receive any write requests
(or they failed for some reason). Do you still have the trace log? Is
there any way I could download it?


Xavi



Thanks and Regards,
Ram

-Original Message-
From: gluster-devel-boun...@gluster.org 
[mailto:gluster-devel-boun...@gluster.org] On Behalf Of Ankireddypalle Reddy
Sent: Friday, January 13, 2017 4:17 AM
To: Xavier Hernandez
Cc: gluster-us...@gluster.org; Gluste

Re: [Gluster-devel] Question about EC locking

2017-01-13 Thread Xavier Hernandez

Hi,

On 13/01/17 10:58, jayakrishnan mm wrote:

Hi Xavier,
I went through the source  code. Some questions remain.

1. If two clients try to write to the same file, it should succeed, even if
they overlap. (Locks should ensure it happens in sequence, in the bricks.)
from the source code
 lock->flock.l_type = F_WRLCK;
 lock->flock.l_whence = SEEK_SET;

fop->flock.l_len += ec_adjust_offset(fop->xl->private,
 >flock.l_start, 1);
fop->flock.l_len = ec_adjust_size(fop->xl->private,
  fop->flock.l_len, 1);
if flock.l_len is 0, the entire file is locked for writing

In my test case with 2 clients, I always get flock.l_len as 0. But
still I am able to write to the same file from both clients at the
same time.


How are you sure you are really writing at the same time? Do you get
partial writes from some of the clients?




If it is acquiring locks chunk by chunk, why am I getting l_len = 0
always?


EC doesn't acquire partial locks. The entire file is locked when a
modification is needed. This makes it possible to reuse locks for future
operations (eager locking).
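As an analogy only (standard POSIX record locks rather than gluster's internal inodelk), this standalone snippet shows the convention behind the l_len values being observed: a length of 0 means "from l_start to the end of the file", so start 0 / length 0 is a whole-file lock:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* l_len == 0 extends the lock up to EOF; combined with l_start == 0 this
     * covers the whole file, which is what EC does with its inodelk. */
    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,
    };
    if (fcntl(fd, F_SETLKW, &fl) < 0) {
        perror("fcntl(F_SETLKW)");
        close(fd);
        return 1;
    }
    printf("whole-file write lock acquired\n");

    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return 0;
}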



Why am I not getting the actual write size and offset (for
flock.l_len & flock.l_start respectively) for each write FOP?
(In AFR, they are set to transaction.len and transaction.start respectively,
which in turn are the write length & offset for the normal write case.)


Because an erasure code splits the data into smaller fragments, one for each
brick, so offsets and lengths need to be adjusted.
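A rough standalone sketch of that adjustment (simplified: it assumes aligned I/O and ignores the head/tail handling that the real ec_adjust_offset()/ec_adjust_size() also perform): with k data fragments and a fixed per-brick chunk, every stripe of k * chunk user bytes becomes one chunk on each brick, so per-brick offsets and lengths shrink by a factor of k:

#include <stdint.h>
#include <stdio.h>

/* Simplified EC layout model: k data fragments, 'chunk' bytes per brick per
 * stripe, so each stripe of k * chunk user bytes becomes one chunk on every
 * brick. */
static uint64_t brick_offset(uint64_t file_offset, unsigned k, uint64_t chunk)
{
    uint64_t stripe = (uint64_t)k * chunk;
    return (file_offset / stripe) * chunk;
}

static uint64_t brick_length(uint64_t file_length, unsigned k, uint64_t chunk)
{
    uint64_t stripe = (uint64_t)k * chunk;
    uint64_t stripes = (file_length + stripe - 1) / stripe;   /* round up */
    return stripes * chunk;
}

int main(void)
{
    /* e.g. a 2+1 volume: a 1 MiB write at file offset 4 MiB touches, on each
     * brick, 512 KiB starting at brick offset 2 MiB. */
    unsigned k = 2;
    uint64_t chunk = 512;   /* illustrative per-brick chunk size */
    printf("offset %u -> %llu\n", 4u << 20,
           (unsigned long long)brick_offset(4u << 20, k, chunk));
    printf("length %u -> %llu\n", 1u << 20,
           (unsigned long long)brick_length(1u << 20, k, chunk));
    return 0;
}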




2. As per the source code, a full file lock is taken by the shd also.

ec_heal_inodelk(heal, F_WRLCK, 1, 0, 0);
 which means offset=0 & size=0 in the ec_heal_lock() function in ec-heal.c
flock.l_start = offset;
flock.l_len = size;
Does it mean that, for a single file, a write cannot happen simultaneously with
healing?


Correct. The heal procedure is like an additional client. If a client and
the heal process try to write at the same time, they must be serialized,
like any other regular write. However, heal only takes the full lock for
some critical operations. Regular self-heal of file contents is done
locking chunk by chunk.


Xavi



Correct me , if I am wrong.

Best Regards
JK






On Wed, Dec 14, 2016 at 12:07 PM, jayakrishnan mm
<jayakrishnan...@gmail.com <mailto:jayakrishnan...@gmail.com>> wrote:

Thanks Xavier, for making it clear.
    Regards
    JK


On Dec 13, 2016 3:52 PM, "Xavier Hernandez" <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>> wrote:

Hi JK,


On 12/13/2016 08:34 AM, jayakrishnan mm wrote:

Dear Xavi,

How do I test  the locks, for example locks  for write fop.
I have two
clients(independent), both  are  trying to write to same file.


1. According to my understanding, both  can successfully
write  if the
offsets don't overlap . I mean, the WRITE FOP  takes a chunk
lock on the
file . As
long as the clients don't try  to write to the same chunk,
it should be
OK. If no locks  present, it can lead to inconsistency.


With locks all writes will be fine as defined by posix (i.e. the
final result will be equivalent to the sequential execution of
both operations, though in an undefined order), even if they
overlap. Without locks, there are chances that some bricks
execute the operations in one order and the remaining bricks
execute the same operations in the reverse order, causing data
corruption.




2.  Different FOPs can always run simultaneously. (Example
WRITE  and
READ FOPs, or  two READ FOPs).


All fops can be executed concurrently. If there's any chance
that two operations could interfere, locks are taken in the
appropriate places. For example, reads cannot be merged with
overlapping writes. Otherwise they could return inconsistent data.



3. WRITE & some metadata FOP (like setattr)  together .
Cannot happen
together with locks , even though chances  are very low.


As in 2, if there's any possible interference, the appropriate
locks will be taken.

You can look at the code to see which locks are taken for each
fop. See the corresponding ec_manager_() function, in the
EC_STATE_LOCK switch case. There you will see calls to
ec_lock_prepare_xxx() for each taken lock.

Xavi


Pls. clarify.

Best regards
JK



On Wed, Nov 30, 2016 at 5:49 PM, jayakrishnan mm
<jayakrishnan...@gmail.com
<mailto:jayakrishnan...@gmail.com>
<mailto:jayakrishnan...@

Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse volume

2017-01-13 Thread Xavier Hernandez
V_MAGNETIC/V_30970/CHUNK_390607
  trusted.ec.version=0x0009000b

The attribute value seems to be same on all the 3 bricks.


That's a clear indication that the ec warning is not related to this
directory, because trusted.ec.version always increases, never decreases,
and the directory has a value smaller than the one that appears in the
log message.


If you show all dict entries in the log, it seems that it does refer to
a directory, because trusted.ec.size is not present, but it must be a
directory other than the one you looked at. We would need to find
which one is having this issue. The TRACE log would be helpful here.





   Also please note that every single time the trusted.ec.version was
found to mismatch, the same values were logged.
Following are 2 more instances of trusted.ec.version mismatch.

[2017-01-12 20:14:25.554540] I [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two dicts (16, 
16)
[2017-01-12 20:14:25.554588] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-2: dict=0x7f0b6495a9f0 
((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))
[2017-01-12 20:14:25.554608] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-2: dict=0x7f0b6495903c 
((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
[2017-01-12 20:14:25.554624] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-2: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=3, bad=4)
[2017-01-12 20:14:25.554632] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-2: Heal failed [Invalid argument]
[2017-01-12 20:14:25.98] I [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two dicts (16, 
16)
[2017-01-12 20:14:25.555622] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-2: dict=0x7f0b64956c24 
((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
[2017-01-12 20:14:25.555638] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-2: dict=0x7f0b64964e8c 
((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))



I think that this refers to the same directory. This seems to be an attempt
to heal it that has failed, so it makes sense that it finds exactly the
same values.





In glustershd.log lot of similar errors are logged.

[2017-01-12 21:10:53.728770] I [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-0: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-12 21:10:53.728804] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7f21694b6f50 
((trusted.ec.size:0:0:0:0:42:3a:0:0:)(trusted.ec.version:0:0:0:0:0:0:37:5f:0:0:0:0:0:0:37:5f:))
[2017-01-12 21:10:53.728827] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7f21694b62bc 
((trusted.ec.size:0:0:0:0:0:ca:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:a1:0:0:0:0:0:0:37:5f:))
[2017-01-12 21:10:53.728842] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 21:10:53.728854] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-0: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
[2017-01-12 21:10:53.728876] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-0: Heal failed [Invalid argument]


This seems to be an attempt to heal a file, but I see a lot of differences
between both versions. The size on one brick is 13.238.272 bytes, but on
the other brick it's 1.111.097.344 bytes. That's a huge difference.


Looking at the trusted.ec.version, I see that the 'data' version is very
different (from 161 to 14.175), while the metadata version is exactly
the same. This really looks like a lot of writes while one brick was
down (or disconnected for some reason, or writes failed for some
reason). One brick has lost about 14.000 writes of ~80KB.
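For reference, a small standalone decoder for the two xattrs being compared here, following the byte dumps shown in this thread (trusted.ec.size is one big-endian 64-bit counter; trusted.ec.version holds two big-endian 64-bit counters, the data version followed by the metadata version). Feeding it the values from this log excerpt reproduces the numbers quoted: sizes 1.111.097.344 vs 13.238.272, data versions 14.175 vs 161, and identical metadata versions:

#include <stdint.h>
#include <stdio.h>

/* Decode a big-endian 64-bit counter from an xattr byte dump. */
static uint64_t be64(const unsigned char *b)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | b[i];
    return v;
}

int main(void)
{
    /* Byte values taken from the glustershd.log excerpt above. */
    const unsigned char size_a[8] = { 0, 0, 0, 0, 0x42, 0x3a, 0, 0 };
    const unsigned char size_b[8] = { 0, 0, 0, 0, 0x00, 0xca, 0, 0 };
    const unsigned char ver_a[16] = { 0, 0, 0, 0, 0, 0, 0x37, 0x5f,
                                      0, 0, 0, 0, 0, 0, 0x37, 0x5f };
    const unsigned char ver_b[16] = { 0, 0, 0, 0, 0, 0, 0x00, 0xa1,
                                      0, 0, 0, 0, 0, 0, 0x37, 0x5f };

    printf("size:     %llu vs %llu\n",
           (unsigned long long)be64(size_a), (unsigned long long)be64(size_b));
    printf("data ver: %llu vs %llu\n",
           (unsigned long long)be64(ver_a), (unsigned long long)be64(ver_b));
    printf("meta ver: %llu vs %llu\n",
           (unsigned long long)be64(ver_a + 8), (unsigned long long)be64(ver_b + 8));
    return 0;
}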


I think the most important thing right now would be to identify which 
files and directories are having these problems to be able to identify 
the cause. Again, the TRACE log will be really useful.


Xavi



Thanks and Regards,
Ram

-Original Message-----
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Thursday, January 12, 2017 6:40 AM
To: Ankireddypalle Reddy
Cc: Gluster Devel (gluster-devel@gluster.org); gluster-us...@gluster.org
Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse 
volume

Hi Ram,


On 12/01/17 11:49, Ankireddypalle Reddy wrote:

Xavi,
  As I mentioned before the error could happen for any FOP.

Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse volume

2017-01-12 Thread Xavier Hernandez

Hi Ram,


On 12/01/17 11:49, Ankireddypalle Reddy wrote:

Xavi,
  As I mentioned before, the error could happen for any FOP. Will try to
run with TRACE debug level. Is there a possibility that we are checking for
this attribute on a directory? A directory does not seem to have this
attribute set.


No, directories do not have this attribute and no one should be reading 
it from a directory.



Also, is the function that checks size and version called after it is decided
that a heal should be run, or is this check the one that decides whether a
heal should be run?


Almost all checks that trigger a heal are done in the lookup fop when 
some discrepancy is detected.


The function that checks size and version is called later once a lock on 
the inode is acquired (even if no heal is needed). However further 
failures in the processing of any fop can also trigger a self-heal.


Xavi



Thanks and Regards,
Ram

Sent from my iPhone


On Jan 12, 2017, at 2:25 AM, Xavier Hernandez <xhernan...@datalab.es> wrote:

Hi Ram,


On 12/01/17 02:36, Ankireddypalle Reddy wrote:
Xavi,
 I added some more logging information. The trusted.ec.size field 
values are in fact different.
  trusted.ec.sizel1 = 62719407423488l2 = 0


That's very weird. Directories do not have this attribute. It's only present on 
regular files. But you said that the error happens while creating the file, so 
it doesn't make much sense because file creation always sets trusted.ec.size to 
0.

Could you reproduce the problem with diagnostics.client-log-level set to TRACE
and send the log to me? It will create a big log, but I'll have much more
information about what's going on.

Do you have a mixed setup with nodes of different types? For example, mixed
32/64-bit architectures or different operating systems? I ask this because
62719407423488 in hex is 0x390B00000000, which has the lower 32 bits set to 0,
but has garbage above that.



  This is a fairly static setup with no brick/node failures.  Please
explain why a heal is being triggered and what could have actually
caused these size xattrs to differ.  This is causing random I/O failures and is
impacting the backup schedules.


The launch of self-heal is normal because it has detected an inconsistency. The 
real problem is what originates that inconsistency.

Xavi



[ 2017-01-12 01:19:18.256970] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 01:19:18.257015] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=3, bad=4)
[2017-01-12 01:19:18.257018] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-8: Heal failed [Invalid argument]
[2017-01-12 01:19:21.002028] E [dict.c:197:key_value_cmp] 
0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-12 01:19:21.002056] E [dict.c:166:log_value] 
0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 = 0 
i2 = 0 ]
[2017-01-12 01:19:21.002064] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 01:19:21.209640] E [dict.c:197:key_value_cmp] 
0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-12 01:19:21.209673] E [dict.c:166:log_value] 
0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 = 0 
i2 = 0 ]
[2017-01-12 01:19:21.209686] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 01:19:21.209719] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-4: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
[2017-01-12 01:19:21.209753] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-4: Heal failed [Invalid argument]

Thanks and Regards,
Ram

-Original Message-
From: Ankireddypalle Reddy
Sent: Wednesday, January 11, 2017 9:29 AM
To: Ankireddypalle Reddy; Xavier Hernandez; Gluster Devel 
(gluster-devel@gluster.org); gluster-us...@gluster.org
Subject: RE: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse 
volume

Xavi,
   I built a debug binary to log more information. This is what is 
getting logged. Looks like it is the attribute trusted.ec.size which is 
different among the bricks in a sub volume.

In glustershd.log :

[2017-01-11 14:19:45.023845] N [MSGID: 122029] 
[ec-generic.c:683:ec_combine_lookup] 0-glusterfsProd-disperse-8: Mismatching 
iatt in answers of 'GF_FOP_LOOKUP'
[2017-01-11 14:19:45.027718] E [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-11 14:19:45.027736] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine

Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse volume

2017-01-11 Thread Xavier Hernandez

Hi Ram,

On 12/01/17 02:36, Ankireddypalle Reddy wrote:

Xavi,
  I added some more logging information. The trusted.ec.size field 
values are in fact different.
   trusted.ec.size: l1 = 62719407423488, l2 = 0


That's very weird. Directories do not have this attribute. It's only 
present on regular files. But you said that the error happens while 
creating the file, so it doesn't make much sense because file creation 
always sets trusted.ec.size to 0.


Could you reproduce the problem with diagnostics.client-log-level set to 
TRACE and send the log to me ? It will create a big log, but I'll have 
much more information about what's going on.


Do you have a mixed setup with nodes of different types ? for example 
mixed 32/64 bits architectures or different operating systems ? I ask 
this because 62719407423488 in hex is 0x390B00000000, which has the 
lower 32 bits set to 0, but has garbage above that.
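
A quick way to see that (illustrative only, not gluster code):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint64_t v = 62719407423488ULL;

    assert(v == ((uint64_t)0x390B << 32));      /* 0x390B00000000 */
    assert((uint32_t)v == 0);                   /* lower 32 bits are zero */
    assert((uint32_t)(v >> 32) == 0x390B);      /* the "garbage" above them */
    return 0;
}

A value shaped like this is exactly what a wrong word order or a 32-bit 
truncation could produce, which is why the question about mixed 
architectures matters.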




   This is a fairly static setup with no brick/node failure. Please 
explain why it is that a heal is being triggered and what could have actually 
caused these size xattrs to differ. This is causing random I/O failures and is 
impacting the backup schedules.


The launch of self-heal is normal because it has detected an 
inconsistency. The real problem is what originates that inconsistency.


Xavi



[ 2017-01-12 01:19:18.256970] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 01:19:18.257015] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=3, bad=4)
[2017-01-12 01:19:18.257018] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-8: Heal failed [Invalid argument]
[2017-01-12 01:19:21.002028] E [dict.c:197:key_value_cmp] 
0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-12 01:19:21.002056] E [dict.c:166:log_value] 
0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 = 0 
i2 = 0 ]
[2017-01-12 01:19:21.002064] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 01:19:21.209640] E [dict.c:197:key_value_cmp] 
0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-12 01:19:21.209673] E [dict.c:166:log_value] 
0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 = 0 
i2 = 0 ]
[2017-01-12 01:19:21.209686] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 01:19:21.209719] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-4: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
[2017-01-12 01:19:21.209753] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-4: Heal failed [Invalid argument]

Thanks and Regards,
Ram

-Original Message-
From: Ankireddypalle Reddy
Sent: Wednesday, January 11, 2017 9:29 AM
To: Ankireddypalle Reddy; Xavier Hernandez; Gluster Devel 
(gluster-devel@gluster.org); gluster-us...@gluster.org
Subject: RE: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse 
volume

Xavi,
I built a debug binary to log more information. This is what is 
getting logged. Looks like it is the attribute trusted.ec.size which is 
different among the bricks in a sub volume.

In glustershd.log :

[2017-01-11 14:19:45.023845] N [MSGID: 122029] 
[ec-generic.c:683:ec_combine_lookup] 0-glusterfsProd-disperse-8: Mismatching 
iatt in answers of 'GF_FOP_LOOKUP'
[2017-01-11 14:19:45.027718] E [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-11 14:19:45.027736] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-11 14:19:45.027763] E [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-11 14:19:45.027781] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-11 14:19:45.027793] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
[2017-01-11 14:19:45.027815] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-6: Heal failed [Invalid argument]
[2017-01-11 14:19:45.029035] E [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-8: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-11 14:19:45.029057] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-11 14:19:45.029089] E [dict.c:166

Re: [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-10 Thread Xavier Hernandez

Hi Ram,

On 10/01/17 14:42, Ankireddypalle Reddy wrote:

Attachments (2): ec.txt (11.50 KB), ws-glus.log (3.48 MB)

Xavi,
  We are encountering errors for different kinds of FOPS.
  The open failed for the following file:

  cvd_2017_01_10_02_28_26.log:98182 1f9fe 01/10 00:57:10 8414465
[MEDIAFS] 20117519-52075477 SingleInstancer_FS::StartDataFile2:
Failed to create the data file
[/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720/SFILE_CONTAINER_062],
error=0xECCC0005:{CQiFile::Open(92)} +
{CQiUTFOSAPI::open(96)/ErrNo.5.(Input/output error)-Open failed,
File=/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720/SFILE_CONTAINER_062,
OperationFlag=0xC1, PermissionMode=0x1FF}

  I've attached the extended attributes for the directories
  /ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/ and

/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720
from all the bricks.

 The attributes look fine to me. I've also attached some log
cuts to illustrate the problem.


I need the extended attributes of the file itself, not the parent 
directories.


Xavi



Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 7:53 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-devel@gluster.org);
gluster-us...@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

the error is caused by an extended attribute that does not match on all
3 bricks of the disperse set. Most probable value is trusted.ec.version,
but could be others.

At first sight, I don't see any change from 3.7.8 that could have caused
this. I'll check again.

What kind of operations are you doing ? this can help me narrow the search.

Xavi

On 10/01/17 13:43, Ankireddypalle Reddy wrote:

Xavi,
  Thanks. If you could please explain what to look for in the

extended attributes then I will check and let you know if I find
anything suspicious.  Also we noticed that some of these operations
would succeed if retried. Do you know of any communication related errors
that are being reported/triaged?


Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 7:23 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-devel@gluster.org);
gluster-us...@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

On 10/01/17 13:14, Ankireddypalle Reddy wrote:

Attachment (1): ecxattrs.txt (5.92 KB)

Xavi,
 Please find attached the extended attributes for a
directory from all the bricks. Free space check failed for this with
error number EIO.


What do you mean ? what operation have you made to check the free

space on that directory ?


If it's a recursive check, I need the extended attributes from the

exact file that triggers the EIO. The attached attributes seem
consistent and that directory shouldn't cause any problem. Does an 'ls'
on that directory fail or does it show the contents ?


Xavi



Thanks and Regards,
Ram

-----Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 6:45 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-devel@gluster.org);
gluster-us...@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

can you execute the following command on all bricks on a file that is
giving EIO ?

getfattr -m. -e hex -d <file>

Xavi

On 10/01/17 12:4

Re: [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-10 Thread Xavier Hernandez

Hi Ram,

the error is caused by an extended attribute that does not match on all 
3 bricks of the disperse set. Most probable value is trusted.ec.version, 
but could be others.


At first sight, I don't see any change from 3.7.8 that could have caused 
this. I'll check again.


What kind of operations are you doing ? this can help me narrow the search.

Xavi

On 10/01/17 13:43, Ankireddypalle Reddy wrote:

Xavi,
  Thanks. If you could please explain what to look for in the extended 
attributes then I will check and let you know if I find anything suspicious.  
Also we noticed that some of these operations would succeed if retried. Do you 
know of any communication related errors that are being reported/triaged?

Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 7:23 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-devel@gluster.org); 
gluster-us...@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

On 10/01/17 13:14, Ankireddypalle Reddy wrote:

Attachment (1): ecxattrs.txt (5.92 KB)

Xavi,
 Please find attached the extended attributes for a
directory from all the bricks. Free space check failed for this with
error number EIO.


What do you mean ? what operation have you made to check the free space on that 
directory ?

If it's a recursive check, I need the extended attributes from the exact file 
that triggers the EIO. The attached attributes seem consistent and that 
directory shouldn't cause any problem. Does an 'ls' on that directory fail or 
does it show the contents ?

Xavi



Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 6:45 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-devel@gluster.org);
gluster-us...@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

can you execute the following command on all bricks on a file that is
giving EIO ?

getfattr -m. -e hex -d <file>

Xavi

On 10/01/17 12:41, Ankireddypalle Reddy wrote:

Xavi,
We have been running 3.7.8 on these servers. We upgraded

to 3.7.18 yesterday. We upgraded all the servers at a time.  The
volume was brought down during upgrade.


Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 6:35 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-devel@gluster.org);
gluster-us...@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

how did you upgrade gluster ? from which version ?

Did you upgrade one server at a time and waited until self-heal

finished before upgrading the next server ?


Xavi

On 10/01/17 11:39, Ankireddypalle Reddy wrote:

Hi,

  We upgraded to GlusterFS 3.7.18 yesterday.  We see lot of
failures in our applications. Most of the errors are EIO. The
following log lines are commonly seen in the logs:



The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-4: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069809] and
[2017-01-10 02:46:25.069835]

[2017-01-10 02:46:25.069852] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5:
Mismatching xdata in answers of 'LOOKUP'

The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069852] and
[2017-01-10 02:46:25.069873]

[2017-01-10 02:46:25.069910] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-6:
Mismatching xdata in answers of 'LOOKUP'

...

[2017-01-10 02:46:26.520774] I [MSGID: 109036]
[dht-common.c:9076:dht_log_new_layout_for_dir_selfheal]
0-StoragePool-dht: Setting layout of
/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854213/CHUNK_51334585 with
[Subvol_name: StoragePool-disperse-0, Err: -1 , Start: 3221225466 ,
Stop: 3758096376 , Hash: 1 ], [Subvol_name: StoragePool-disperse-1, Err:
-1 , Start: 3758096377 , Stop: 4294967295 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-2, Err: -1 , Start: 0 , Stop: 536870910 , Hash:
1 ], [Subvol_name: StoragePool-disperse-3, Err: -1 , Start:
536870911 ,
Stop: 1073741821 , Hash: 1 ], [Subvol_name: StoragePool-disperse-4, Err:
-1 , Start: 1073741822 , Stop: 1610612732 , Hash: 1 ], [Subvol_name:
Storage

Re: [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-10 Thread Xavier Hernandez

Hi Ram,

On 10/01/17 13:14, Ankireddypalle Reddy wrote:

Attachment (1): ecxattrs.txt (5.92 KB)

Xavi,
 Please find attached the extended attributes for a
directory from all the bricks. Free space check failed for this with
error number EIO.


What do you mean ? what operation have you made to check the free space 
on that directory ?


If it's a recursive check, I need the extended attributes from the exact 
file that triggers the EIO. The attached attributes seem consistent and 
that directory shouldn't cause any problem. Does an 'ls' on that 
directory fail or does it show the contents ?


Xavi



Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 6:45 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-devel@gluster.org);
gluster-us...@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

can you execute the following command on all bricks on a file that is
giving EIO ?

getfattr -m. -e hex -d <file>

Xavi

On 10/01/17 12:41, Ankireddypalle Reddy wrote:

Xavi,
We have been running 3.7.8 on these servers. We upgraded

to 3.7.18 yesterday. We upgraded all the servers at a time.  The volume
was brought down during upgrade.


Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 6:35 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-devel@gluster.org);
gluster-us...@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

how did you upgrade gluster ? from which version ?

Did you upgrade one server at a time and waited until self-heal

finished before upgrading the next server ?


Xavi

On 10/01/17 11:39, Ankireddypalle Reddy wrote:

Hi,

  We upgraded to GlusterFS 3.7.18 yesterday.  We see lot of
failures in our applications. Most of the errors are EIO. The
following log lines are commonly seen in the logs:



The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-4: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069809] and [2017-01-10
02:46:25.069835]

[2017-01-10 02:46:25.069852] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5:
Mismatching xdata in answers of 'LOOKUP'

The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069852] and [2017-01-10
02:46:25.069873]

[2017-01-10 02:46:25.069910] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-6:
Mismatching xdata in answers of 'LOOKUP'

...

[2017-01-10 02:46:26.520774] I [MSGID: 109036]
[dht-common.c:9076:dht_log_new_layout_for_dir_selfheal]
0-StoragePool-dht: Setting layout of
/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854213/CHUNK_51334585 with
[Subvol_name: StoragePool-disperse-0, Err: -1 , Start: 3221225466 ,
Stop: 3758096376 , Hash: 1 ], [Subvol_name: StoragePool-disperse-1, Err:
-1 , Start: 3758096377 , Stop: 4294967295 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-2, Err: -1 , Start: 0 , Stop: 536870910 , Hash:
1 ], [Subvol_name: StoragePool-disperse-3, Err: -1 , Start: 536870911
,
Stop: 1073741821 , Hash: 1 ], [Subvol_name: StoragePool-disperse-4, Err:
-1 , Start: 1073741822 , Stop: 1610612732 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-5, Err: -1 , Start: 1610612733 , Stop:
2147483643 ,
Hash: 1 ], [Subvol_name: StoragePool-disperse-6, Err: -1 , Start:
2147483644 , Stop: 2684354554 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-7, Err: -1 , Start: 2684354555 , Stop:
3221225465 ,
Hash: 1 ],

[2017-01-10 02:46:26.522841] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-3:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
0-StoragePool-disperse-3: Mismatching dictionary in answers of
'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10
02:46:26.522841] and [2017-01-10 02:46:26.522894]

[2017-01-10 02:46:26.522898] W [MSGID: 122040]
[ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-3:
Failed to get size and version [Input/output error]

[2017-01-10 02:46:26.523115] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-6:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The mes

Re: [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-10 Thread Xavier Hernandez

Hi Ram,

can you execute the following command on all bricks on a file that is 
giving EIO ?


getfattr -m. -e hex -d <file>

Xavi

On 10/01/17 12:41, Ankireddypalle Reddy wrote:

Xavi,
We have been running 3.7.8 on these servers. We upgraded to 3.7.18 
yesterday. We upgraded all the servers at a time.  The volume was brought down 
during upgrade.

Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 6:35 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-devel@gluster.org); 
gluster-us...@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

how did you upgrade gluster ? from which version ?

Did you upgrade one server at a time and waited until self-heal finished before 
upgrading the next server ?

Xavi

On 10/01/17 11:39, Ankireddypalle Reddy wrote:

Hi,

  We upgraded to GlusterFS 3.7.18 yesterday.  We see lot of
failures in our applications. Most of the errors are EIO. The
following log lines are commonly seen in the logs:



The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-4: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069809] and [2017-01-10
02:46:25.069835]

[2017-01-10 02:46:25.069852] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5:
Mismatching xdata in answers of 'LOOKUP'

The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069852] and [2017-01-10
02:46:25.069873]

[2017-01-10 02:46:25.069910] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-6:
Mismatching xdata in answers of 'LOOKUP'

...

[2017-01-10 02:46:26.520774] I [MSGID: 109036]
[dht-common.c:9076:dht_log_new_layout_for_dir_selfheal]
0-StoragePool-dht: Setting layout of
/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854213/CHUNK_51334585 with
[Subvol_name: StoragePool-disperse-0, Err: -1 , Start: 3221225466 ,
Stop: 3758096376 , Hash: 1 ], [Subvol_name: StoragePool-disperse-1, Err:
-1 , Start: 3758096377 , Stop: 4294967295 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-2, Err: -1 , Start: 0 , Stop: 536870910 , Hash: 1
], [Subvol_name: StoragePool-disperse-3, Err: -1 , Start: 536870911 ,
Stop: 1073741821 , Hash: 1 ], [Subvol_name: StoragePool-disperse-4, Err:
-1 , Start: 1073741822 , Stop: 1610612732 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-5, Err: -1 , Start: 1610612733 , Stop: 2147483643
,
Hash: 1 ], [Subvol_name: StoragePool-disperse-6, Err: -1 , Start:
2147483644 , Stop: 2684354554 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-7, Err: -1 , Start: 2684354555 , Stop: 3221225465
,
Hash: 1 ],

[2017-01-10 02:46:26.522841] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-3:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
0-StoragePool-disperse-3: Mismatching dictionary in answers of
'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10
02:46:26.522841] and [2017-01-10 02:46:26.522894]

[2017-01-10 02:46:26.522898] W [MSGID: 122040]
[ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-3:
Failed to get size and version [Input/output error]

[2017-01-10 02:46:26.523115] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-6:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
0-StoragePool-disperse-6: Mismatching dictionary in answers of
'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10
02:46:26.523115] and [2017-01-10 02:46:26.523143]

[2017-01-10 02:46:26.523147] W [MSGID: 122040]
[ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-6:
Failed to get size and version [Input/output error]

[2017-01-10 02:46:26.523302] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-2:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
0-StoragePool-disperse-2: Mismatching dictionary in answers of
'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10
02:46:26.523302] and [2017-01-10 02:46:26.523324]

[2017-01-10 02:46:26.523328] W [MSGID: 122040]
[ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-2:
Failed to get size and version [Input/output error]



[root@glusterfs3 Log_Files]# gluster --version

glusterfs 3.7.18 built on Dec  8 2016 06:34:26



[root@glusterfs3 Log_Files]# gluster volume info



Volume Name: StoragePool

Type: Distributed-Disperse

Volume ID: 149e976f-4e21-451c-bf0f-f5691208531f

Status: Started

Number of Bricks: 8 x (2 + 1) = 24

Transport-type: tcp

Bricks:

Brick1: glusterfs1sds:/ws/disk1/ws_brick

Brick2: glusterfs2sds:/ws/disk1/ws_brick

Re: [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-10 Thread Xavier Hernandez

Hi Ram,

how did you upgrade gluster ? from which version ?

Did you upgrade one server at a time and waited until self-heal finished 
before upgrading the next server ?


Xavi

On 10/01/17 11:39, Ankireddypalle Reddy wrote:

Hi,

  We upgraded to GlusterFS 3.7.18 yesterday.  We see lot of failures
in our applications. Most of the errors are EIO. The following log lines
are commonly seen in the logs:



The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-4: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069809] and [2017-01-10
02:46:25.069835]

[2017-01-10 02:46:25.069852] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5:
Mismatching xdata in answers of 'LOOKUP'

The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069852] and [2017-01-10
02:46:25.069873]

[2017-01-10 02:46:25.069910] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-6:
Mismatching xdata in answers of 'LOOKUP'

…

[2017-01-10 02:46:26.520774] I [MSGID: 109036]
[dht-common.c:9076:dht_log_new_layout_for_dir_selfheal]
0-StoragePool-dht: Setting layout of
/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854213/CHUNK_51334585 with
[Subvol_name: StoragePool-disperse-0, Err: -1 , Start: 3221225466 ,
Stop: 3758096376 , Hash: 1 ], [Subvol_name: StoragePool-disperse-1, Err:
-1 , Start: 3758096377 , Stop: 4294967295 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-2, Err: -1 , Start: 0 , Stop: 536870910 , Hash: 1
], [Subvol_name: StoragePool-disperse-3, Err: -1 , Start: 536870911 ,
Stop: 1073741821 , Hash: 1 ], [Subvol_name: StoragePool-disperse-4, Err:
-1 , Start: 1073741822 , Stop: 1610612732 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-5, Err: -1 , Start: 1610612733 , Stop: 2147483643 ,
Hash: 1 ], [Subvol_name: StoragePool-disperse-6, Err: -1 , Start:
2147483644 , Stop: 2684354554 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-7, Err: -1 , Start: 2684354555 , Stop: 3221225465 ,
Hash: 1 ],

[2017-01-10 02:46:26.522841] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-3:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
0-StoragePool-disperse-3: Mismatching dictionary in answers of
'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 02:46:26.522841]
and [2017-01-10 02:46:26.522894]

[2017-01-10 02:46:26.522898] W [MSGID: 122040]
[ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-3: Failed
to get size and version [Input/output error]

[2017-01-10 02:46:26.523115] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-6:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
0-StoragePool-disperse-6: Mismatching dictionary in answers of
'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 02:46:26.523115]
and [2017-01-10 02:46:26.523143]

[2017-01-10 02:46:26.523147] W [MSGID: 122040]
[ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-6: Failed
to get size and version [Input/output error]

[2017-01-10 02:46:26.523302] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-2:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
0-StoragePool-disperse-2: Mismatching dictionary in answers of
'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 02:46:26.523302]
and [2017-01-10 02:46:26.523324]

[2017-01-10 02:46:26.523328] W [MSGID: 122040]
[ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-2: Failed
to get size and version [Input/output error]



[root@glusterfs3 Log_Files]# gluster --version

glusterfs 3.7.18 built on Dec  8 2016 06:34:26



[root@glusterfs3 Log_Files]# gluster volume info



Volume Name: StoragePool

Type: Distributed-Disperse

Volume ID: 149e976f-4e21-451c-bf0f-f5691208531f

Status: Started

Number of Bricks: 8 x (2 + 1) = 24

Transport-type: tcp

Bricks:

Brick1: glusterfs1sds:/ws/disk1/ws_brick

Brick2: glusterfs2sds:/ws/disk1/ws_brick

Brick3: glusterfs3sds:/ws/disk1/ws_brick

Brick4: glusterfs1sds:/ws/disk2/ws_brick

Brick5: glusterfs2sds:/ws/disk2/ws_brick

Brick6: glusterfs3sds:/ws/disk2/ws_brick

Brick7: glusterfs1sds:/ws/disk3/ws_brick

Brick8: glusterfs2sds:/ws/disk3/ws_brick

Brick9: glusterfs3sds:/ws/disk3/ws_brick

Brick10: glusterfs1sds:/ws/disk4/ws_brick

Brick11: glusterfs2sds:/ws/disk4/ws_brick

Brick12: glusterfs3sds:/ws/disk4/ws_brick

Brick13: glusterfs1sds:/ws/disk5/ws_brick

Brick14: glusterfs2sds:/ws/disk5/ws_brick

Brick15: glusterfs3sds:/ws/disk5/ws_brick

Brick16: glusterfs1sds:/ws/disk6/ws_brick

Brick17: glusterfs2sds:/ws/disk6/ws_brick

Brick18: glusterfs3sds:/ws/disk6/ws_brick

Brick19: 

Re: [Gluster-devel] GFID2 - Proposal to add extra byte to existing GFID

2016-12-16 Thread Xavier Hernandez

On 12/16/2016 08:31 AM, Aravinda wrote:

Proposal to add one more byte to GFID to store "Type" information.
The extra byte will represent the type (directory: 00, file: 01, symlink: 02,
etc.)

For example, if a directory GFID is f4f18c02-0360-4cdc-8c00-0164e49a7afd
then, GFID2 will be 00f4f18c02-0360-4cdc-8c00-0164e49a7afd.

Changes to Backend store

Existing: .glusterfs/gfid[0:2]/gfid[2:4]/gfid
Proposed: .glusterfs/gfid2[0:2]/gfid2[2:4]/gfid2[4:6]/gfid2

Advantages:
---
- Automatic grouping in .glusterfs directory based on file Type.
- Easy identification of Type by looking at GFID in logs/status output
  etc.
- Crawling(Quota/AFR): List of directories can be easily fetched by
  crawling `.glusterfs/gfid2[0:2]/` directory. This enables easy
  parallel Crawling.
- Quota - Marker: Marker transator can mark xtime of current file and
  parent directory. No need to update xtime xattr of all directories
  till root.
- Geo-replication: - Crawl can be multithreaded during initial sync.
  With marker changes above it will be more effective in crawling.

Please add if any more advantageous.

Disadvantageous:

Functionality is not changed with the above change except the length
of the ID. I can't think of any disadvantages except the code changes
to accommodate this change. Let me know if I missed anything here.


One disadvantage is that 17 bytes is a very ugly number for structures. 
Compilers will add padding that will make any structure containing a 
GFID noticeably bigger. This will also cause trouble on all binary 
formats where a GFID is used, making them incompatible. One clear case 
of this is the XDR encoding of the gluster protocol. Currently a GFID is 
defined this way in many places:


opaque gfid[16]

This seems to make it quite complex to allow a mix of gluster versions 
in the same cluster (for example in a middle of an upgrade).
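
The padding point is easy to see with a couple of dummy structures 
(illustrative only; assumes a typical LP64 platform):

#include <stdio.h>
#include <stdint.h>

struct with_gfid16 { uint8_t gfid[16]; uint64_t other; };
struct with_gfid17 { uint8_t gfid[17]; uint64_t other; };

int main(void)
{
    /* typically prints 24 and 32: the 17-byte GFID drags 7 bytes of
       padding with it in every structure that embeds it */
    printf("%zu %zu\n", sizeof(struct with_gfid16), sizeof(struct with_gfid17));
    return 0;
}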


What about this alternative approach:

Based on the RFC4122 [1] that describes the format of an UUID, we can 
define a new structure for new GFID's using the same length.


Currently all GFID's are generated using the "random" method. This means 
that all GFID have this structure:


xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx

Where N can be 8, 9, A or B, and M is 4.

There are some special GFID's that have a M=0 and N=0, for example the 
root GFID.


What I propose is to use a new variant of GFID, for example E or F 
(officially marked as reserved for future definition) or even 0 to 7. We 
could use M as an internal version for the GFID structure (defined by 
ourselves when needed). Then we could use the first 4 or 8 bits of each 
GFID as you propose, without needing to extend current GFID length nor 
risking to collide with existing GFID's.


If we are concerned about the collision probability (quite small but 
still bigger than the current version) because we lose some random 
bits, we could use N = 0..7 and leave M random. This way we get 5 more 
random bits, from which we could use 4 to represent the inode type.


I think this way everything will work smoothly with older versions with 
minimal effort.
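
As a concrete sketch of the idea (hypothetical helper names, not a real 
patch; the exact bit layout would still need to be agreed on):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum gf_gfid_type { GF_GFID_DIR = 0, GF_GFID_REG = 1, GF_GFID_LNK = 2 };

/* Mark a freshly generated random GFID as "new style":
 *  - byte 8 high nibble (the 'N' position) set to the reserved variant E
 *  - byte 6 high nibble (the 'M' position) used as our own structure version
 *  - byte 0 high nibble reused to store the inode type
 */
static void gfid_set_type(uint8_t gfid[16], enum gf_gfid_type type)
{
    gfid[8] = (uint8_t)((gfid[8] & 0x0F) | 0xE0);          /* variant E */
    gfid[6] = (uint8_t)((gfid[6] & 0x0F) | 0x10);          /* version 1 */
    gfid[0] = (uint8_t)((gfid[0] & 0x0F) | (type << 4));   /* inode type */
}

/* Returns the stored type, or -1 for an old-style (fully random) GFID. */
static int gfid_get_type(const uint8_t gfid[16])
{
    if ((gfid[8] & 0xF0) != 0xE0)
        return -1;
    return gfid[0] >> 4;
}

int main(void)
{
    uint8_t gfid[16];

    memset(gfid, 0xAB, sizeof(gfid));           /* stand-in for random bytes */
    gfid_set_type(gfid, GF_GFID_DIR);
    printf("stored type: %d\n", gfid_get_type(gfid));   /* prints 0 */
    return 0;
}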


What do you think ?

Xavi

[1] https://www.ietf.org/rfc/rfc4122.txt



Changes:
-
- Code changes to accommodate 17 bytes GFID instead of 16 bytes(Read
  and Write)
- Migration Tool to upgrade GFIDs in Volume/Cluster

Let me know your thoughts.



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] 1402538 : Assertion failure during rebalance of symbolic links

2016-12-15 Thread Xavier Hernandez

On 12/15/2016 01:41 PM, Nithya Balachandran wrote:



On 15 December 2016 at 18:07, Xavier Hernandez <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>> wrote:

On 12/15/2016 12:48 PM, Raghavendra Gowdappa wrote:

I need to step back a little to understand the RCA correctly.

If I understand the code correctly, the callstack which resulted
in failed setattr is (in rebalance process):

dht_lookup -> dht_lookup_cbk -> dht_lookup_everywhere ->
dht_lookup_everywhere_cbk -> dht_lookup_everywhere_done ->
dht_linkfile_create -> dht_lookup_linkfile_create_cbk ->
dht_linkfile_attr_heal -> setattr

However, this setattr doesn't change the file type.



STACK_WIND (copy, dht_linkfile_setattr_cbk, subvol,
subvol->fops->setattr, _local->loc,
, (GF_SET_ATTR_UID | GF_SET_ATTR_GID),
xattr);



As can be seen above, the setattr call only changes UID/GID. So,
I am at loss to explain why the file type changed. Has anyone
has any other explanation?


Does the inode passed to setattr represent the regular file just
created ? or does it contain information about the previous file
(the one it's being replaced) that in this case is a symbolic link ?

Right, IIUC, the reason this fails is the inode for the actual sym link
has type LINK which does not match the stbuf returned in the setattr on
the linkto file. The file does _not_ change types.


That seems a big problem to me. All fops should receive consistent data, 
otherwise its behavior is undefined. Any xlator may rely on received 
data to decide what to do. In this particular case, ec could check the 
data from the answer, but maybe in the future another xlator needs to 
decide what to do before getting the answers. If we receive inconsistent 
data, that won't be possible.


It seems not right to me to share the same inode to represent two 
distinct files, even if they are related to the same file from the top 
view. I think that each DHT subvolume should have its private inode 
representation, specially if they represent different files.


Xavi




Xavi


regards,
Raghavendra





___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] 1402538 : Assertion failure during rebalance of symbolic links

2016-12-15 Thread Xavier Hernandez

On 12/15/2016 12:48 PM, Raghavendra Gowdappa wrote:

I need to step back a little to understand the RCA correctly.

If I understand the code correctly, the callstack which resulted in failed 
setattr is (in rebalance process):

dht_lookup -> dht_lookup_cbk -> dht_lookup_everywhere -> dht_lookup_everywhere_cbk -> 
dht_lookup_everywhere_done -> dht_linkfile_create -> dht_lookup_linkfile_create_cbk -> 
dht_linkfile_attr_heal -> setattr

However, this setattr doesn't change the file type.



STACK_WIND (copy, dht_linkfile_setattr_cbk, subvol,
subvol->fops->setattr, _local->loc,
, (GF_SET_ATTR_UID | GF_SET_ATTR_GID), xattr);



As can be seen above, the setattr call only changes UID/GID. So, I am at loss 
to explain why the file type changed. Has anyone has any other explanation?


Does the inode passed to setattr represent the regular file just created 
? or does it contain information about the previous file (the one it's 
being replaced) that in this case is a symbolic link ?


Xavi



regards,
Raghavendra



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] 1402538 : Assertion failure during rebalance of symbolic links

2016-12-14 Thread Xavier Hernandez

On 12/14/2016 10:28 AM, Pranith Kumar Karampuri wrote:



On Wed, Dec 14, 2016 at 2:54 PM, Xavier Hernandez <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>> wrote:

On 12/14/2016 10:17 AM, Pranith Kumar Karampuri wrote:



On Wed, Dec 14, 2016 at 1:48 PM, Xavier Hernandez <xhernan...@datalab.es>
wrote:

There's another issue with the patch that Ashish sent.

The original problem is that a setattr on a symbolic link gets
transformed to a regular file while the fop is being
executed. Even
if we apply the Ashish' patch to avoid the assert, the
setattr fop
will still succeed and incorrectly change the attributes of a
gluster special file that shouldn't change.

I think that's a bigger problem that needs to be addressed
globally.

I'm sure this is not an easy solution, but probably the best way
would be to have distinct inodes for the gluster link files
and the
original file. This way most of these problems should be solved.


Is there any reason why there is a difference in type of the file on
hashed/cached subvols? We can have the same type of file on both dht
subvolumes? That will prevent unlink of regular file and
recreate with
the actual type of the file?


I think the problem is not only the type of the inode. There are
more things involved. If we allow operations intended for regular
files to succeed on the dht link file itself, the operation won't be
visible and may affect future actions.

How it's prevented that the setattr modifies an already created link
file ? or at least, are these changes propagated to the real file
later and the link is restored to the original state ? if so, how
dht detects all this without any locks ? if it's able to detect
that, why does it send the setattr request anyway ?


I think the assert messages are coming at the time of marking the file
as a '---------T' file. DHT makes sure the actual fop happens on the
cached subvolume. But this linkto file will be present in the hashed
subvolume, indicating it is a linkto file (i.e. a '---------T' file, with an
extended attribute telling where the actual file is). With EC in the
picture, this marking of the linkto file by doing a setattr on the
'---------T' file is asserting.


If I understand correctly, the real file and the link file not only 
share the gfid but also share the inode structure itself (otherwise dht 
would send the new inode to setattr and that problem won't happen). This 
means that we have a single inode structure to represent a symbolic link 
and a regular file at the same time. This seems very bad to me.









Xavi


    On 12/14/2016 09:02 AM, Xavier Hernandez wrote:

On 12/14/2016 06:10 AM, Raghavendra Gowdappa wrote:



- Original Message -

From: "Pranith Kumar Karampuri"
<pkara...@redhat.com <mailto:pkara...@redhat.com>
<mailto:pkara...@redhat.com
<mailto:pkara...@redhat.com>>>
To: "Ashish Pandey" <aspan...@redhat.com
<mailto:aspan...@redhat.com>
<mailto:aspan...@redhat.com
<mailto:aspan...@redhat.com>>>
Cc: "Gluster Devel" <gluster-devel@gluster.org
<mailto:gluster-devel@gluster.org>
<mailto:gluster-devel@gluster.org
<mailto:gluster-devel@gluster.org>>>, "Shyam Ranganathan"
<srang...@redhat.com
<mailto:srang...@redhat.com> <mailto:srang...@redhat.com
<mailto:srang...@redhat.com>>>,
"Nithya Balachandran"
    <nbala...@redhat.com
<mailto:nbala...@redhat.com> <mailto:nbala...@redhat.com
<mailto:nbala...@redhat.com>>>,
"Xavier Hernandez" <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>
<mailto:xhernan...@datalab.es
<mailto:xhernan...@datalab.es>>>,
"Raghavendra Gowdappa" <rgowd...@redhat.com
<mailto:rgowd...@redhat.com>
<mailto:rgowd...@redhat.com
<mailto:rgowd...@redhat.com>>>
Sent: Tuesday, December 13, 2016 9:29:46 PM
Subject: Re: 1402538 : Assertion failure during
rebalance of symbolic
   

Re: [Gluster-devel] 1402538 : Assertion failure during rebalance of symbolic links

2016-12-14 Thread Xavier Hernandez

On 12/14/2016 10:17 AM, Pranith Kumar Karampuri wrote:



On Wed, Dec 14, 2016 at 1:48 PM, Xavier Hernandez <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>> wrote:

There's another issue with the patch that Ashish sent.

The original problem is that a setattr on a symbolic link gets
transformed to a regular file while the fop is being executed. Even
if we apply the Ashish' patch to avoid the assert, the setattr fop
will still succeed and incorrectly change the attributes of a
gluster special file that shouldn't change.

I think that's a bigger problem that needs to be addressed globally.

I'm sure this is not an easy solution, but probably the best way
would be to have distinct inodes for the gluster link files and the
original file. This way most of these problems should be solved.


Is there any reason why there is a difference in type of the file on
hashed/cached subvols? We can have the same type of file on both dht
subvolumes? That will prevent unlink of regular file and recreate with
the actual type of the file?


I think the problem is not only the type of the inode. There are more 
things involved. If we allow operations intended for regular files to 
succeed on the dht link file itself, the operation won't be visible and 
may affect future actions.


How it's prevented that the setattr modifies an already created link 
file ? or at least, are these changes propagated to the real file later 
and the link is restored to the original state ? if so, how dht detects 
all this without any locks ? if it's able to detect that, why does it 
send the setattr request anyway ?






Xavi


On 12/14/2016 09:02 AM, Xavier Hernandez wrote:

On 12/14/2016 06:10 AM, Raghavendra Gowdappa wrote:



- Original Message -

From: "Pranith Kumar Karampuri" <pkara...@redhat.com
<mailto:pkara...@redhat.com>>
To: "Ashish Pandey" <aspan...@redhat.com
<mailto:aspan...@redhat.com>>
Cc: "Gluster Devel" <gluster-devel@gluster.org
<mailto:gluster-devel@gluster.org>>, "Shyam Ranganathan"
<srang...@redhat.com <mailto:srang...@redhat.com>>,
"Nithya Balachandran"
<nbala...@redhat.com <mailto:nbala...@redhat.com>>,
"Xavier Hernandez" <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>>,
"Raghavendra Gowdappa" <rgowd...@redhat.com
<mailto:rgowd...@redhat.com>>
Sent: Tuesday, December 13, 2016 9:29:46 PM
Subject: Re: 1402538 : Assertion failure during
rebalance of symbolic
links

On Tue, Dec 13, 2016 at 2:45 PM, Ashish Pandey
<aspan...@redhat.com <mailto:aspan...@redhat.com>>
wrote:

Hi All,

We have been seeing an issue where re balancing
symbolic links leads
to an
assertion failure in EC volume.

The root cause of this is that while migrating
symbolic links to
other sub
volume, it creates a link file (with attributes
---------T).
This file is a regular file.
Now, during migration a setattr comes to this link
and because of
possible
race, posix_stat returns stats of this "T" file.
In ec_manager_setattr, we receive callbacks and check
the type of
entry. If
it is a regular file we try to get size and if it is
not there, we
raise an
assert.
So, basically we are checking a size of the link
(which will not have
size) which has been returned as regular file and we
are ending up when
this condition
becomes TRUE.

Now, this looks like a problem with re balance and
difficult to fix at
this point (as per the discussion).
We have an alternative to fix it in EC but that will
be more like a
hack
than an actual fix. We should not modify EC
to deal with an individual issue which is in other
translator.


I am afraid, dht doesn't have a better way of handling this.
While DHT
maintains abstraction (of a symbo

Re: [Gluster-devel] 1402538 : Assertion failure during rebalance of symbolic links

2016-12-14 Thread Xavier Hernandez

There's another issue with the patch that Ashish sent.

The original problem is that a setattr on a symbolic link gets 
transformed to a regular file while the fop is being executed. Even if 
we apply the Ashish' patch to avoid the assert, the setattr fop will 
still succeed and incorrectly change the attributes of a gluster special 
file that shouldn't change.


I think that's a bigger problem that needs to be addressed globally.

I'm sure this is not an easy solution, but probably the best way would 
be to have distinct inodes for the gluster link files and the original 
file. This way most of these problems should be solved.


Xavi

On 12/14/2016 09:02 AM, Xavier Hernandez wrote:

On 12/14/2016 06:10 AM, Raghavendra Gowdappa wrote:



- Original Message -

From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
To: "Ashish Pandey" <aspan...@redhat.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Shyam Ranganathan"
<srang...@redhat.com>, "Nithya Balachandran"
<nbala...@redhat.com>, "Xavier Hernandez" <xhernan...@datalab.es>,
"Raghavendra Gowdappa" <rgowd...@redhat.com>
Sent: Tuesday, December 13, 2016 9:29:46 PM
Subject: Re: 1402538 : Assertion failure during rebalance of symbolic
links

On Tue, Dec 13, 2016 at 2:45 PM, Ashish Pandey <aspan...@redhat.com>
wrote:


Hi All,

We have been seeing an issue where re balancing symbolic links leads
to an
assertion failure in EC volume.

The root cause of this is that while migrating symbolic links to
other sub
volume, it creates a link file (with attributes ---------T).
This file is a regular file.
Now, during migration a setattr comes to this link and because of
possible
race, posix_stat returns stats of this "T" file.
In ec_manager_setattr, we receive callbacks and check the type of
entry. If
it is a regular file we try to get size and if it is not there, we
raise an
assert.
So, basically we are checking a size of the link (which will not have
size) which has been returned as regular file and we are ending up when
this condition
becomes TRUE.

Now, this looks like a problem with re balance and difficult to fix at
this point (as per the discussion).
We have an alternative to fix it in EC but that will be more like a
hack
than an actual fix. We should not modify EC
to deal with an individual issue which is in other translator.


I am afraid, dht doesn't have a better way of handling this. While DHT
maintains abstraction (of a symbolic link) to layers above, the layers
below it cannot be shielded from seeing the details like a linkto file
etc.


That's ok, and I think it's the right thing to do. From the point of
view of EC, it's irrelevant how the file is seen by upper layers. It
only cares about the files below it.


If the concern really is that the file is changing its type in a span
of single fop, we can probably explore the option of locking (or other
synchronization mechanisms) to prevent migration taking place, while a
fop is in progress.


That's the real problem. Some operations receive an inode referencing a
symbolic link on input but the iatt structures from the callback
reference a regular file. It's even worse because it's an asynchronous
race so some of the bricks may return a regular file and some may return
a symbolic link. If there are more than redundancy bricks returning a
different type, the most probable result will be an I/O error caused by
inconsistent answers.

Ashish wrote a patch to check the type of the inode at the input instead
of relying on the answers. While this could avoid the assertion issued
by ec, it doesn't solve the race, leaving room for the I/O errors I
mentioned earlier.


But, I assume there will be performance penalties for that too.


Yes. I don't see any other way to really solve this problem. A lock is
needed.

In ec we already have a problem that will need an additional lock on
rmdir, unlink and rename to avoid some races. This change will also need
support from locks xlator to avoid granting locks on deleted inodes. If
dht is using one of these operations to replace the symbolic link by the
gluster link file, I think this change could solve the I/O errors, but
I'm not sure we could completely solve the problem.

I'm not sure how dht does the transform from a symbolic link to a
gluster link file, but if it involves more than one fop from the point
of view of ec, there's nothing that ec can do to solve the problem. If
another client accesses the file, ec can return any intermediate state.
DHT should take some lock to do all operations atomically and avoid
problems on other clients.

I think that the mid-term approach to completely solve the problem
without a performance impact should be to implement some kind of
transaction mechanism that will reuse lock requests. This would allow,
among other things, that multiple atomic operations could be performed
by different xlators but

Re: [Gluster-devel] 1402538 : Assertion failure during rebalance of symbolic links

2016-12-14 Thread Xavier Hernandez

On 12/14/2016 06:10 AM, Raghavendra Gowdappa wrote:



- Original Message -

From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
To: "Ashish Pandey" <aspan...@redhat.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Shyam Ranganathan" 
<srang...@redhat.com>, "Nithya Balachandran"
<nbala...@redhat.com>, "Xavier Hernandez" <xhernan...@datalab.es>, "Raghavendra 
Gowdappa" <rgowd...@redhat.com>
Sent: Tuesday, December 13, 2016 9:29:46 PM
Subject: Re: 1402538 : Assertion failure during rebalance of symbolic links

On Tue, Dec 13, 2016 at 2:45 PM, Ashish Pandey <aspan...@redhat.com> wrote:


Hi All,

We have been seeing an issue where re balancing symbolic links leads to an
assertion failure in EC volume.

The root cause of this is that while migrating symbolic links to other sub
volume, it creates a link file (with attributes ---------T).
This file is a regular file.
Now, during migration a setattr comes to this link and because of possible
race, posix_stat returns stats of this "T" file.
In ec_manager_setattr, we receive callbacks and check the type of entry. If
it is a regular file we try to get size and if it is not there, we raise an
assert.
So, basically we are checking a size of the link (which will not have
size) which has been returned as regular file and we are ending up when
this condition
becomes TRUE.

Now, this looks like a problem with re balance and difficult to fix at
this point (as per the discussion).
We have an alternative to fix it in EC but that will be more like a hack
than an actual fix. We should not modify EC
to deal with an individual issue which is in other translator.


I am afraid, dht doesn't have a better way of handling this. While DHT 
maintains abstraction (of a symbolic link) to layers above, the layers below it 
cannot be shielded from seeing the details like a linkto file etc.


That's ok, and I think it's the right thing to do. From the point of 
view of EC, it's irrelevant how the file is seen by upper layers. It 
only cares about the files below it.



If the concern really is that the file is changing its type in a span of single 
fop, we can probably explore the option of locking (or other synchronization 
mechanisms) to prevent migration taking place, while a fop is in progress.


That's the real problem. Some operations receive an inode referencing a 
symbolic link on input but the iatt structures from the callback 
reference a regular file. It's even worse because it's an asynchronous 
race so some of the bricks may return a regular file and some may return 
a symbolic link. If there are more than redundancy bricks returning a 
different type, the most probable result will be an I/O error caused by 
inconsistent answers.


Ashish wrote a patch to check the type of the inode at the input instead 
of relying on the answers. While this could avoid the assertion issued 
by ec, it doesn't solve the race, leaving room for the I/O errors I 
mentioned earlier.
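
The shape of that change is roughly the following (hypothetical names and 
structures, not the actual patch): the expectation of a size is derived from 
the inode type that came with the request, instead of from the type reported 
in the callback.

/* illustrative sketch only, not ec source code */
enum ftype { TYPE_REG = 1, TYPE_LNK = 2 };

struct req    { enum ftype inode_type; };            /* type known at the input */
struct answer { enum ftype ia_type; int has_size; }; /* what one brick reported */

/* old idea: if ans->ia_type == TYPE_REG, assert that a size is present     */
/* patched idea: derive the expectation from the request, not the answer    */
static int check_answer(const struct req *req, const struct answer *ans)
{
    if (req->inode_type == TYPE_REG && !ans->has_size)
        return -1;              /* real problem: regular file without a size */
    return 0;                   /* symlink being migrated: nothing to check */
}

int main(void)
{
    struct req    r = { TYPE_LNK };       /* setattr issued on a symlink */
    struct answer a = { TYPE_REG, 0 };    /* racy brick answer: linkto file */
    return check_answer(&r, &a);          /* returns 0, no assert triggered */
}

It avoids the crash, but as said above the underlying race is still there.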



But, I assume there will be performance penalties for that too.


Yes. I don't see any other way to really solve this problem. A lock is 
needed.


In ec we already have a problem that will need an additional lock on 
rmdir, unlink and rename to avoid some races. This change will also need 
support from locks xlator to avoid granting locks on deleted inodes. If 
dht is using one of these operations to replace the symbolic link by the 
gluster link file, I think this change could solve the I/O errors, but 
I'm not sure we could completely solve the problem.


I'm not sure how dht does the transform from a symbolic link to a 
gluster link file, but if it involves more than one fop from the point 
of view of ec, there's nothing that ec can do to solve the problem. If 
another client accesses the file, ec can return any intermediate state. 
DHT should take some lock to do all operations atomically and avoid 
problems on other clients.


I think that the mid-term approach to completely solve the problem 
without a performance impact should be to implement some kind of 
transaction mechanism that will reuse lock requests. This would allow, 
among other things, that multiple atomic operations could be performed 
by different xlators but sharing the locks instead of requiring each 
xlator to take an inodelk on its own.


Xavi





Now the question is how to proceed with this? Any suggestions?



Raghavendra/Nithya,
 Could one of you explain the difficulties in fixing this issue in
DHT so that Xavi will also be caught up with why we should add this change
in EC in the short term.




Details on this bug can be found here -
https://bugzilla.redhat.com/show_bug.cgi?id=1402538


Ashish







--
Pranith



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Question about EC locking

2016-12-12 Thread Xavier Hernandez

Hi JK,

On 12/13/2016 08:34 AM, jayakrishnan mm wrote:

Dear Xavi,

How do I test the locks, for example locks for the write fop? I have two
clients (independent), both are trying to write to the same file.


1. According to my understanding, both can successfully write if the
offsets don't overlap. I mean, the WRITE FOP takes a chunk lock on the
file. As long as the clients don't try to write to the same chunk, it should be
OK. If no locks are present, it can lead to inconsistency.


With locks all writes will be fine as defined by posix (i.e. the final 
result will be equivalent to the sequential execution of both 
operations, though in an undefined order), even if they overlap. Without 
locks, there are chances that some bricks execute the operations in one 
order and the remaining bricks execute the same operations in the 
reverse order, causing data corruption.
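
As a quick illustration of that ordering problem (a toy standalone C sketch, not gluster code), two overlapping writes applied in different orders leave two bricks with different contents:

#include <stdio.h>
#include <string.h>

/* a "write fop" applied to one brick's local copy of the file */
static void apply(char *brick, const char *data, size_t off, size_t len)
{
    memcpy(brick + off, data, len);
}

int main(void)
{
    char brick1[9] = "--------";
    char brick2[9] = "--------";

    /* Client A writes "AAAA" at offset 0, client B writes "BBBB" at offset 2.
     * Without a lock there is no guaranteed order across bricks. */
    apply(brick1, "AAAA", 0, 4); apply(brick1, "BBBB", 2, 4);  /* A then B */
    apply(brick2, "BBBB", 2, 4); apply(brick2, "AAAA", 0, 4);  /* B then A */

    /* brick1 ends up as "AABBBB--" while brick2 ends up as "AAAABB--" */
    printf("brick1: %s\nbrick2: %s\n", brick1, brick2);
    return 0;
}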





2. Different FOPs can always run simultaneously (for example, WRITE and
READ FOPs, or two READ FOPs).


All fops can be executed concurrently. If there's any chance that two 
operations could interfere, locks are taken in the appropriate places. 
For example, reads cannot be merged with overlapping writes. Otherwise 
they could return inconsistent data.




3. WRITE & some metadata FOP (like setattr) together. Cannot happen
together with locks, even though the chances are very low.


As in 2, if there's any possible interference, the appropriate locks 
will be taken.


You can look at the code to see which locks are taken for each fop. See 
the corresponding ec_manager_<fop>() function, in the EC_STATE_LOCK 
switch case. There you will see calls to ec_lock_prepare_xxx() for each 
lock taken.


Xavi



Pls. clarify.

Best regards
JK



On Wed, Nov 30, 2016 at 5:49 PM, jayakrishnan mm
<jayakrishnan...@gmail.com <mailto:jayakrishnan...@gmail.com>> wrote:

Hi Xavier,

Thank you very much for your explanation. This helped  me to
understand  more  about  locking in EC.

Best Regards
JK


On Mon, Nov 28, 2016 at 4:17 PM, Xavier Hernandez
<xhernan...@datalab.es <mailto:xhernan...@datalab.es>> wrote:

Hi,

On 11/28/2016 02:59 AM, jayakrishnan mm wrote:

Hi Xavier,

Notice  that EC xlator uses blocking locks. Any specific
reason for this?


In a distributed filesystem like gluster a synchronization
mechanism is a must to avoid data corruption.


Do you think this will  affect the  performance ?


Of course the need for locks has a performance impact, and we
cannot avoid them if we want to guarantee data integrity. However some
optimizations have been applied, especially the eager locking,
which allows a lock to be reused without unlocking/locking again.


(In comparison AFR  first tries  non blocking locks  and if not
successful, tries blocking locks then)


EC also tries a non-blocking lock first.


Also, why two locks  are  needed  per FOP ? One for normal
I/O and
another for self healing?


The only fop that currently needs two locks is 'rename', and
only when source and destination directories are different. All
other fops only take one lock at most.

Best regards,

Xavi


Best regards
JK


___
Gluster-devel mailing list
Gluster-devel@gluster.org <mailto:Gluster-devel@gluster.org>
http://www.gluster.org/mailman/listinfo/gluster-devel
<http://www.gluster.org/mailman/listinfo/gluster-devel>






___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Question about EC locking

2016-11-28 Thread Xavier Hernandez

Hi,

On 11/28/2016 02:59 AM, jayakrishnan mm wrote:

Hi Xavier,

Notice  that EC xlator uses blocking locks. Any specific reason for this?


In a distributed filesystem like gluster a synchronization mechanism is 
a must to avoid data corruption.




Do you think this will  affect the  performance ?


Of course the need for locks has a performance impact, and we cannot 
avoid them if we want to guarantee data integrity. However some optimizations have 
been applied, especially the eager locking, which allows a lock to be 
reused without unlocking/locking again.
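
As a rough illustration of what eager locking buys (a toy model, not the actual ec implementation): consecutive operations from the same owner reuse the held lock, and a lock/unlock cycle is only paid when ownership changes.

#include <stdio.h>
#include <string.h>

static char holder[16] = "";
static int  lock_calls = 0;

static void do_op(const char *owner)
{
    if (strcmp(holder, owner) != 0) {        /* contention: switch owner */
        strcpy(holder, owner);               /* pay one unlock + lock    */
        lock_calls++;
    }
    /* ... perform the fop under the lock ... */
}

int main(void)
{
    const char *ops[] = { "client1", "client1", "client1", "client2", "client1" };

    for (int i = 0; i < 5; i++)
        do_op(ops[i]);

    printf("lock/unlock cycles: %d (vs %d without eager locking)\n",
           lock_calls, 5);
    return 0;
}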




(In comparison AFR  first tries  non blocking locks  and if not
successful, tries blocking locks then)


EC also tries a non-blocking lock first.



Also, why two locks  are  needed  per FOP ? One for normal I/O and
another for self healing?


The only fop that currently needs two locks is 'rename', and only when 
source and destination directories are different. All other fops only 
take one lock at most.


Best regards,

Xavi



Best regards
JK


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Why vandermonde matrix is used in EC?

2016-11-27 Thread Xavier Hernandez

On 11/27/2016 09:58 PM, 한우형 wrote:

Hi,

Thank you so much for the speedy reply, but I have some more questions.

1) I understand non-systematic encoding/decoding doesn't alter
performance when one or more bricks are down, but why does the systematic
approach have service degradation?
I think when the parity part is down there's no performance degradation, and
when a non-parity part is down it needs to be encoded, but it is the same with
the non-systematic case.


If a systematic implementation does increase performance in a 
perceptible way, then a failure of one brick will give less performance 
to users. Even if that performance is the same as we currently have, 
it will be worse from the perspective of the users.


Note that there's no distinction between "data" bricks and "parity" 
bricks. Each file will use a different brick for its parity, so a 
failure of a brick will always cause trouble to some files. This would 
also allow a distribution of the read load among all available bricks.


Anyway, as I said in the other email, it's not so clear that a 
systematic implementation would really have an important improvement on 
performance.




2) In the systematic approach, what kind of metadata needs to be checked?
Can't we just try to read the non-parity part?


If a brick is down, it's clear that we'll need to read from parity, but 
when the brick comes up again it can contain old data (data modified 
while it was down), so we cannot simply read from that brick. We need to 
verify in some way that the other bricks do not contain updated data.


Best regards,

Xavi




Best regards,
Han

2016-11-24 17:26 GMT+09:00 Xavier Hernandez <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>>:

Hi Han,

On 11/24/2016 04:25 AM, 한우형 wrote:

Hi,

I'm working on the dispersed volume (ec) and I found the ec encode/decode
algorithm is using a non-systematic Vandermonde matrix.

My question is this: why is a non-systematic algorithm used?


Non-systematic encoding/decoding doesn't alter performance when one
or more bricks are down. This means that you won't have service
degradation when you are having troubles with one brick or you are
doing maintenance.

From the implementation perspective, a systematic approach would
need to talk to all bricks anyway to check for critical metadata
(gluster doesn't have a centralized metadata server). This means
that the theoretical benefit of a systematic decoding for reads
would be masked by the overhead needed for metadata operations
(involving additional network round-trips).

That said, it's true that a systematic approach would have some
benefits, like a little less CPU overhead. Not sure if the
performance would benefit significantly though.

If we use a
systematic algorithm (not a systematic Vandermonde; that's not MDS)


A non-systematic Vandermonde matrix *IS* MDS. In fact, pure
Vandermonde matrices are non-systematic by definition. Some
alterations need to be done to make them systematic, and these
transformations can lead to a non MDS matrix if not made with care.

we can
boost read performance. (no need to decode step in read)


Though it would probably have some benefits, I'm not so sure that
performance would improve significantly.

Current implementation of ec decoding can process 1GB/s of data per
CPU core on low end processors (Intel Atoms with SSE2) using block
sizes of 128KB and a 4+2 configuration. Currently this is much
faster than what a pure distributed volume on same hardware can read
for a single client/single thread.

So, for now, the non-systematic approach doesn't seem to be a bottleneck
for gluster. Anyway, there are plans to provide a systematic version,
but it's not a priority as of now.

Best regards,

Xavi


Best regards,
Han



___
Gluster-devel mailing list
Gluster-devel@gluster.org <mailto:Gluster-devel@gluster.org>
http://www.gluster.org/mailman/listinfo/gluster-devel
<http://www.gluster.org/mailman/listinfo/gluster-devel>





___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Why vandermonde matrix is used in EC?

2016-11-24 Thread Xavier Hernandez

Hi Han,

On 11/24/2016 04:25 AM, 한우형 wrote:

Hi,

I'm working on the dispersed volume (ec) and I found the ec encode/decode
algorithm is using a non-systematic Vandermonde matrix.

My question is this: why is a non-systematic algorithm used?


Non-systematic encoding/decoding doesn't alter performance when one or 
more bricks are down. This means that you won't have service degradation 
when you are having troubles with one brick or you are doing maintenance.


From the implementation perspective, a systematic approach would need 
to talk to all bricks anyway to check for critical metadata (gluster 
doesn't have a centralized metadata server). This means that the 
theoretical benefit of a systematic decoding for reads would be masked 
by the overhead needed for metadata operations (involving additional 
network round-trips).


That said, it's true that a systematic approach would have some 
benefits, like a little less CPU overhead. Not sure if the performance 
would benefit significantly though.



If we use a
systematic algorithm (not a systematic Vandermonde; that's not MDS)


A non-systematic Vandermonde matrix *IS* MDS. In fact, pure Vandermonde 
matrices are non-systematic by definition. Some alterations need to be 
done to make them systematic, and these transformations can lead to a 
non-MDS matrix if not made with care.
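
A small worked illustration of both points (a toy configuration with k = 2 data words and n = 4 fragments, using distinct, non-zero field elements a_1..a_4):

\[
V = \begin{pmatrix} 1 & a_1 \\ 1 & a_2 \\ 1 & a_3 \\ 1 & a_4 \end{pmatrix},
\qquad
\det \begin{pmatrix} 1 & a_i \\ 1 & a_j \end{pmatrix} = a_j - a_i \neq 0 \quad (i \neq j).
\]

Any two rows form an invertible submatrix, so any two fragments are enough to decode (the MDS property), yet no row is a unit vector, so no fragment stores the data words verbatim (non-systematic). Naively overwriting some rows with identity rows to get a systematic matrix changes the code and can break the MDS property, which is the caveat mentioned above.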



we can
boost read performance. (no need to decode step in read)


Though it would probably have some benefits, I'm not so sure that 
performance would improve significantly.


Current implementation of ec decoding can process 1GB/s of data per CPU 
core on low end processors (Intel Atoms with SSE2) using block sizes of 
128KB and a 4+2 configuration. Currently this is much faster than what a 
pure distributed volume on same hardware can read for a single 
client/single thread.


So, for now, the non-systematic approach doesn't seem to be a bottleneck for 
gluster. Anyway, there are plans to provide a systematic version, but 
it's not a priority as of now.


Best regards,

Xavi



Best regards,
Han



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Possible problem introduced by http://review.gluster.org/15573

2016-10-24 Thread Xavier Hernandez

Hi Soumya,

On 21/10/16 16:15, Soumya Koduri wrote:



On 10/21/2016 06:35 PM, Soumya Koduri wrote:

Hi Xavi,

On 10/21/2016 12:57 PM, Xavier Hernandez wrote:

Looking at the code, I think that the added fd_unref() should only be
called if the fop preparation fails. Otherwise the callback already
unreferences the fd.

Code flow:

* glfs_fsync_async_common() takes an fd ref and calls STACK_WIND passing
that fd.
* Just after that a ref is released.
* When glfs_io_async_cbk() is called another ref is released.

Note that if fop preparation fails, a single fd_unref() is called, but
on success two fd_unref() are called.


Sorry for the inconvenience caused. I think patch #15573 itself hasn't
caused the problem but has highlighted another ref leak in the code.

From the code I see that glfs_io_async_cbk() does fd_unref (glfd->fd)
but not the fd passed in STACK_WIND_COOKIE() of the fop.

If I take any fop, for eg.,
glfs_fsync_common() {

   fd = glfs_resolve_fd (glfd->fs, subvol, glfd);


}

Here in glfs_resolve_fd ()

fd_t *
__glfs_resolve_fd (struct glfs *fs, xlator_t *subvol, struct glfs_fd
*glfd)
{
fd_t *fd = NULL;

if (glfd->fd->inode->table->xl == subvol)
return fd_ref (glfd->fd);


Here we can see that we are taking an extra ref in addition to the
ref already taken for glfd->fd. That means the caller of this function
needs to fd_unref(fd) irrespective of the subsequent fd_unref (glfd->fd).

fd = __glfs_migrate_fd (fs, subvol, glfd);
if (!fd)
return NULL;


if (subvol == fs->active_subvol) {
fd_unref (glfd->fd);
glfd->fd = fd_ref (fd);
}


I think the issue is here during graph_switch(). You have
mentioned as well that the crash happens post graph_switch. Maybe here
we are missing an extra ref that should be taken on fd in addition to glfd->fd. I
need to look through __glfs_migrate_fd() to confirm that. But these are
my initial thoughts.


Looking into this, I think we should fix glfs_io_async_cbk() not to
fd_unref(glfd->fd). glfd->fd should be active throughout the lifetime of
glfd (i.e., until it is closed). Thoughts?


I don't know the gfapi internals in depth, but at first sight I think this 
would be the right thing to do. Assuming that glfd will keep a reference 
to the fd until it's destroyed, and that a glfd reference is taken 
during the lifetime of each request that needs it, the fd_unref() in 
glfs_io_async_cbk() seems unnecessary. I think it was there just to 
release the fd acquired in glfs_resolve_fd(), but it's better to place 
it where it is now.
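
As a toy illustration of the imbalance being discussed (plain C, nothing gfapi-specific): one extra reference is taken before dispatch, but both the dispatcher and the callback drop a reference, so the count reaches zero while the owner still holds the object.

#include <stdio.h>

struct obj { int refcount; };

static void ref(struct obj *o)   { o->refcount++; }
static void unref(struct obj *o)
{
    if (--o->refcount == 0)
        printf("object destroyed (refcount 0)\n");
}

static void async_cbk(struct obj *o) { unref(o); }   /* callback releases */

int main(void)
{
    struct obj fd = { .refcount = 1 };   /* held by the glfd-like owner     */

    ref(&fd);          /* extra ref taken when resolving/dispatching        */
    unref(&fd);        /* released right after dispatch ...                 */
    async_cbk(&fd);    /* ... and again in the callback: refcount hits 0,   */
                       /* even though the owner still holds a pointer.      */
    printf("final refcount: %d\n", fd.refcount);
    return 0;
}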


Another question is whether we really need to take an additional reference in 
glfs_resolve_fd().


Can an fd returned by this function outlive the associated 
glfd in some circumstances?



Also could you please check if it is the second/subsequent fsync_async()
call which results in crash?


I'll try to test it as soon as possible, but this is on a server that we 
need to put in production very soon and we have decided to go with fuse 
for now. We'll have a lot of work to do this week. Once I have some free 
time I'll build a test environment to check it, probably next week.


Xavi



Thanks,
Soumya



Please let me know your comments.

Thanks,
Soumya




Xavi

On 21/10/16 09:03, Xavier Hernandez wrote:

Hi,

I've just tried Gluster 3.8.5 with Proxmox using gfapi and I
consistently see a crash each time an attempt to connect to the volume
is made.

The backtrace of the crash shows this:

#0  pthread_spin_lock () at
../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
#1  0x7fe5345776a5 in fd_unref (fd=0x7fe523f7205c) at fd.c:553
#2  0x7fe53482ba18 in glfs_io_async_cbk (op_ret=,
op_errno=0, frame=, cookie=0x7fe526c67040,
iovec=iovec@entry=0x0, count=count@entry=0)
at glfs-fops.c:839
#3  0x7fe53482beed in glfs_fsync_async_cbk (frame=,
cookie=, this=, op_ret=,
op_errno=,
prebuf=, postbuf=0x7fe5217fe890, xdata=0x0) at
glfs-fops.c:1382
#4  0x7fe520be2eb7 in ?? () from
/usr/lib/x86_64-linux-gnu/glusterfs/3.8.5/xlator/debug/io-stats.so
#5  0x7fe5345d118a in default_fsync_cbk (frame=0x7fe52ceef3ac,
cookie=0x560ef95398e8, this=0x8, op_ret=0, op_errno=0, prebuf=0x1,
postbuf=0x7fe5217fe890, xdata=0x0) at defaults.c:1508
#6  0x7fe5345d118a in default_fsync_cbk (frame=0x7fe52ceef204,
cookie=0x560ef95398e8, this=0x8, op_ret=0, op_errno=0, prebuf=0x1,
postbuf=0x7fe5217fe890, xdata=0x0) at defaults.c:1508
#7  0x7fe525f78219 in dht_fsync_cbk (frame=0x7fe52ceef2d8,
cookie=0x560ef95398e8, this=0x0, op_ret=0, op_errno=0,
prebuf=0x7fe5217fe820, postbuf=0x7fe5217fe890, xdata=0x0)
at dht-inode-read.c:873
#8  0x7fe5261bbc7f in client3_3_fsync_cbk (req=0x7fe525f78030
, iov=0x7fe526c61040, count=8,
myframe=0x7fe52ceef130) at
client-rpc-fops.c:975
#9  0x7fe5343201f0 in rpc_clnt_handle_reply (clnt=0x18,
clnt@entry=0x7fe526fafac0, pollin=0x7fe526c3a1c0) at rpc-clnt.c:791
#10 0x7f

Re: [Gluster-devel] Possible problem introduced by http://review.gluster.org/15573

2016-10-24 Thread Xavier Hernandez



On 21/10/16 15:05, Soumya Koduri wrote:

Hi Xavi,

On 10/21/2016 12:57 PM, Xavier Hernandez wrote:

Looking at the code, I think that the added fd_unref() should only be
called if the fop preparation fails. Otherwise the callback already
unreferences the fd.

Code flow:

* glfs_fsync_async_common() takes an fd ref and calls STACK_WIND passing
that fd.
* Just after that a ref is released.
* When glfs_io_async_cbk() is called another ref is released.

Note that if fop preparation fails, a single fd_unref() is called, but
on success two fd_unref() are called.


Sorry for the inconvenience caused. I think patch #15573 itself hasn't
caused the problem but has highlighted another ref leak in the code.

From the code I see that glfs_io_async_cbk() does fd_unref (glfd->fd)
but not the fd passed in STACK_WIND_COOKIE() of the fop.


I think it's the same because the fd passed in STACK_WIND_COOKIE() also 
comes from glfd->fd.




If I take any fop, for eg.,
glfs_fsync_common() {

   fd = glfs_resolve_fd (glfd->fs, subvol, glfd);


}

Here in glfs_resolve_fd ()

fd_t *
__glfs_resolve_fd (struct glfs *fs, xlator_t *subvol, struct glfs_fd *glfd)
{
fd_t *fd = NULL;

if (glfd->fd->inode->table->xl == subvol)
return fd_ref (glfd->fd);


Here we can see that we are taking an extra ref in addition to the
ref already taken for glfd->fd. That means the caller of this function
needs to fd_unref(fd) irrespective of the subsequent fd_unref (glfd->fd).


I agree here. This additional ref must be released somewhere.



fd = __glfs_migrate_fd (fs, subvol, glfd);
if (!fd)
return NULL;


if (subvol == fs->active_subvol) {
fd_unref (glfd->fd);
glfd->fd = fd_ref (fd);
}


I think the issue is here during graph_switch(). You have
mentioned as well that the crash happens post graph_switch. Maybe here
we are missing an extra ref that should be taken on fd in addition to glfd->fd. I
need to look through __glfs_migrate_fd() to confirm that. But these are
my initial thoughts.


I think this is ok. The fd returned by __glfs_migrate_fd() already has a 
reference. We release the fd currently assigned to glfd->fd (that has 
only one reference) and assign the new fd to it, taking an additional 
reference (two in total) like in the previous case.


Xavi



Please let me know your comments.

Thanks,
Soumya




Xavi

On 21/10/16 09:03, Xavier Hernandez wrote:

Hi,

I've just tried Gluster 3.8.5 with Proxmox using gfapi and I
consistently see a crash each time an attempt to connect to the volume
is made.

The backtrace of the crash shows this:

#0  pthread_spin_lock () at
../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
#1  0x7fe5345776a5 in fd_unref (fd=0x7fe523f7205c) at fd.c:553
#2  0x7fe53482ba18 in glfs_io_async_cbk (op_ret=,
op_errno=0, frame=, cookie=0x7fe526c67040,
iovec=iovec@entry=0x0, count=count@entry=0)
at glfs-fops.c:839
#3  0x7fe53482beed in glfs_fsync_async_cbk (frame=,
cookie=, this=, op_ret=,
op_errno=,
prebuf=, postbuf=0x7fe5217fe890, xdata=0x0) at
glfs-fops.c:1382
#4  0x7fe520be2eb7 in ?? () from
/usr/lib/x86_64-linux-gnu/glusterfs/3.8.5/xlator/debug/io-stats.so
#5  0x7fe5345d118a in default_fsync_cbk (frame=0x7fe52ceef3ac,
cookie=0x560ef95398e8, this=0x8, op_ret=0, op_errno=0, prebuf=0x1,
postbuf=0x7fe5217fe890, xdata=0x0) at defaults.c:1508
#6  0x7fe5345d118a in default_fsync_cbk (frame=0x7fe52ceef204,
cookie=0x560ef95398e8, this=0x8, op_ret=0, op_errno=0, prebuf=0x1,
postbuf=0x7fe5217fe890, xdata=0x0) at defaults.c:1508
#7  0x7fe525f78219 in dht_fsync_cbk (frame=0x7fe52ceef2d8,
cookie=0x560ef95398e8, this=0x0, op_ret=0, op_errno=0,
prebuf=0x7fe5217fe820, postbuf=0x7fe5217fe890, xdata=0x0)
at dht-inode-read.c:873
#8  0x7fe5261bbc7f in client3_3_fsync_cbk (req=0x7fe525f78030
, iov=0x7fe526c61040, count=8, myframe=0x7fe52ceef130) at
client-rpc-fops.c:975
#9  0x7fe5343201f0 in rpc_clnt_handle_reply (clnt=0x18,
clnt@entry=0x7fe526fafac0, pollin=0x7fe526c3a1c0) at rpc-clnt.c:791
#10 0x7fe53432056c in rpc_clnt_notify (trans=,
mydata=0x7fe526fafaf0, event=, data=0x7fe526c3a1c0) at
rpc-clnt.c:962
#11 0x7fe53431c8a3 in rpc_transport_notify (this=,
event=, data=) at rpc-transport.c:541
#12 0x7fe5283e8d96 in socket_event_poll_in (this=0x7fe526c69440) at
socket.c:2267
#13 0x7fe5283eaf37 in socket_event_handler (fd=,
idx=5, data=0x7fe526c69440, poll_in=1, poll_out=0, poll_err=0) at
socket.c:2397
#14 0x7fe5345ab3f6 in event_dispatch_epoll_handler
(event=0x7fe5217fecc0, event_pool=0x7fe526ca2040) at event-epoll.c:571
#15 event_dispatch_epoll_worker (data=0x7fe527c0f0c0) at
event-epoll.c:674
#16 0x7fe5324140a4 in start_thread (arg=0x7fe5217ff700) at
pthread_create.c:309
#17 0x7fe53214962d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

The fd being unreferenced contains this:

(gdb) print *fd
$6 = {
  pid = 976

Re: [Gluster-devel] Possible problem introduced by http://review.gluster.org/15573

2016-10-21 Thread Xavier Hernandez

Hi Niels,

On 21/10/16 10:03, Niels de Vos wrote:

On Fri, Oct 21, 2016 at 09:03:30AM +0200, Xavier Hernandez wrote:

Hi,

I've just tried Gluster 3.8.5 with Proxmox using gfapi and I consistently
see a crash each time an attempt to connect to the volume is made.


Thanks, that likely is the same bug as
https://bugzilla.redhat.com/1379241 .


I'm not sure it's the same problem. The crash in my case always happens 
immediately. When creating an image, the file is created but its size is 
0. The stack trace is also quite different.


Xavi



Satheesaran, could you revert commit 7a50690 from the build that you
were testing, and see if that causes the problem to go away again? Let
me know of you want me to provide RPMs for testing.

Niels



The backtrace of the crash shows this:

#0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
#1  0x7fe5345776a5 in fd_unref (fd=0x7fe523f7205c) at fd.c:553
#2  0x7fe53482ba18 in glfs_io_async_cbk (op_ret=,
op_errno=0, frame=, cookie=0x7fe526c67040,
iovec=iovec@entry=0x0, count=count@entry=0)
at glfs-fops.c:839
#3  0x7fe53482beed in glfs_fsync_async_cbk (frame=,
cookie=, this=, op_ret=,
op_errno=,
prebuf=, postbuf=0x7fe5217fe890, xdata=0x0) at
glfs-fops.c:1382
#4  0x7fe520be2eb7 in ?? () from
/usr/lib/x86_64-linux-gnu/glusterfs/3.8.5/xlator/debug/io-stats.so
#5  0x7fe5345d118a in default_fsync_cbk (frame=0x7fe52ceef3ac,
cookie=0x560ef95398e8, this=0x8, op_ret=0, op_errno=0, prebuf=0x1,
postbuf=0x7fe5217fe890, xdata=0x0) at defaults.c:1508
#6  0x7fe5345d118a in default_fsync_cbk (frame=0x7fe52ceef204,
cookie=0x560ef95398e8, this=0x8, op_ret=0, op_errno=0, prebuf=0x1,
postbuf=0x7fe5217fe890, xdata=0x0) at defaults.c:1508
#7  0x7fe525f78219 in dht_fsync_cbk (frame=0x7fe52ceef2d8,
cookie=0x560ef95398e8, this=0x0, op_ret=0, op_errno=0,
prebuf=0x7fe5217fe820, postbuf=0x7fe5217fe890, xdata=0x0)
at dht-inode-read.c:873
#8  0x7fe5261bbc7f in client3_3_fsync_cbk (req=0x7fe525f78030
, iov=0x7fe526c61040, count=8, myframe=0x7fe52ceef130) at
client-rpc-fops.c:975
#9  0x7fe5343201f0 in rpc_clnt_handle_reply (clnt=0x18,
clnt@entry=0x7fe526fafac0, pollin=0x7fe526c3a1c0) at rpc-clnt.c:791
#10 0x7fe53432056c in rpc_clnt_notify (trans=,
mydata=0x7fe526fafaf0, event=, data=0x7fe526c3a1c0) at
rpc-clnt.c:962
#11 0x7fe53431c8a3 in rpc_transport_notify (this=,
event=, data=) at rpc-transport.c:541
#12 0x7fe5283e8d96 in socket_event_poll_in (this=0x7fe526c69440) at
socket.c:2267
#13 0x7fe5283eaf37 in socket_event_handler (fd=, idx=5,
data=0x7fe526c69440, poll_in=1, poll_out=0, poll_err=0) at socket.c:2397
#14 0x7fe5345ab3f6 in event_dispatch_epoll_handler
(event=0x7fe5217fecc0, event_pool=0x7fe526ca2040) at event-epoll.c:571
#15 event_dispatch_epoll_worker (data=0x7fe527c0f0c0) at event-epoll.c:674
#16 0x7fe5324140a4 in start_thread (arg=0x7fe5217ff700) at
pthread_create.c:309
#17 0x7fe53214962d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

The fd being unreferenced contains this:

(gdb) print *fd
$6 = {
  pid = 97649,
  flags = 2,
  refcount = 0,
  inode_list = {
next = 0x7fe523f7206c,
prev = 0x7fe523f7206c
  },
  inode = 0x0,
  lock = {
spinlock = 1,
mutex = {
  __data = {
__lock = 1,
__count = 0,
__owner = 0,
__nusers = 0,
__kind = 0,
__spins = 0,
__elision = 0,
__list = {
  __prev = 0x0,
  __next = 0x0
}
  },
  __size = "\001", '\000' ,
  __align = 1
}
  },
  _ctx = 0x7fe52ec31c40,
  xl_count = 11,
  lk_ctx = 0x7fe526c126a0,
  anonymous = _gf_false
}

fd->inode is NULL, explaining the cause of the crash. We also see that
fd->refcount is already 0. So I'm wondering if this couldn't be an extra
fd_unref() introduced by that patch.

The crash seems to happen immediately after a graph switch.

Xavi

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Possible problem introduced by http://review.gluster.org/15573

2016-10-21 Thread Xavier Hernandez
Looking at the code, I think that the added fd_unref() should only be 
called if the fop preparation fails. Otherwise the callback already 
unreferences the fd.


Code flow:

* glfs_fsync_async_common() takes an fd ref and calls STACK_WIND passing 
that fd.

* Just after that a ref is released.
* When glfs_io_async_cbk() is called another ref is released.

Note that if fop preparation fails, a single fd_unref() is called, but 
on success two fd_unref() are called.


Xavi

On 21/10/16 09:03, Xavier Hernandez wrote:

Hi,

I've just tried Gluster 3.8.5 with Proxmox using gfapi and I
consistently see a crash each time an attempt to connect to the volume
is made.

The backtrace of the crash shows this:

#0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
#1  0x7fe5345776a5 in fd_unref (fd=0x7fe523f7205c) at fd.c:553
#2  0x7fe53482ba18 in glfs_io_async_cbk (op_ret=,
op_errno=0, frame=, cookie=0x7fe526c67040,
iovec=iovec@entry=0x0, count=count@entry=0)
at glfs-fops.c:839
#3  0x7fe53482beed in glfs_fsync_async_cbk (frame=,
cookie=, this=, op_ret=,
op_errno=,
prebuf=, postbuf=0x7fe5217fe890, xdata=0x0) at
glfs-fops.c:1382
#4  0x7fe520be2eb7 in ?? () from
/usr/lib/x86_64-linux-gnu/glusterfs/3.8.5/xlator/debug/io-stats.so
#5  0x7fe5345d118a in default_fsync_cbk (frame=0x7fe52ceef3ac,
cookie=0x560ef95398e8, this=0x8, op_ret=0, op_errno=0, prebuf=0x1,
postbuf=0x7fe5217fe890, xdata=0x0) at defaults.c:1508
#6  0x7fe5345d118a in default_fsync_cbk (frame=0x7fe52ceef204,
cookie=0x560ef95398e8, this=0x8, op_ret=0, op_errno=0, prebuf=0x1,
postbuf=0x7fe5217fe890, xdata=0x0) at defaults.c:1508
#7  0x7fe525f78219 in dht_fsync_cbk (frame=0x7fe52ceef2d8,
cookie=0x560ef95398e8, this=0x0, op_ret=0, op_errno=0,
prebuf=0x7fe5217fe820, postbuf=0x7fe5217fe890, xdata=0x0)
at dht-inode-read.c:873
#8  0x7fe5261bbc7f in client3_3_fsync_cbk (req=0x7fe525f78030
, iov=0x7fe526c61040, count=8, myframe=0x7fe52ceef130) at
client-rpc-fops.c:975
#9  0x7fe5343201f0 in rpc_clnt_handle_reply (clnt=0x18,
clnt@entry=0x7fe526fafac0, pollin=0x7fe526c3a1c0) at rpc-clnt.c:791
#10 0x7fe53432056c in rpc_clnt_notify (trans=,
mydata=0x7fe526fafaf0, event=, data=0x7fe526c3a1c0) at
rpc-clnt.c:962
#11 0x7fe53431c8a3 in rpc_transport_notify (this=,
event=, data=) at rpc-transport.c:541
#12 0x7fe5283e8d96 in socket_event_poll_in (this=0x7fe526c69440) at
socket.c:2267
#13 0x7fe5283eaf37 in socket_event_handler (fd=,
idx=5, data=0x7fe526c69440, poll_in=1, poll_out=0, poll_err=0) at
socket.c:2397
#14 0x7fe5345ab3f6 in event_dispatch_epoll_handler
(event=0x7fe5217fecc0, event_pool=0x7fe526ca2040) at event-epoll.c:571
#15 event_dispatch_epoll_worker (data=0x7fe527c0f0c0) at event-epoll.c:674
#16 0x7fe5324140a4 in start_thread (arg=0x7fe5217ff700) at
pthread_create.c:309
#17 0x7fe53214962d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

The fd being unreferenced contains this:

(gdb) print *fd
$6 = {
  pid = 97649,
  flags = 2,
  refcount = 0,
  inode_list = {
next = 0x7fe523f7206c,
prev = 0x7fe523f7206c
  },
  inode = 0x0,
  lock = {
spinlock = 1,
mutex = {
  __data = {
__lock = 1,
__count = 0,
__owner = 0,
__nusers = 0,
__kind = 0,
__spins = 0,
__elision = 0,
__list = {
  __prev = 0x0,
  __next = 0x0
}
  },
  __size = "\001", '\000' ,
  __align = 1
}
  },
  _ctx = 0x7fe52ec31c40,
  xl_count = 11,
  lk_ctx = 0x7fe526c126a0,
  anonymous = _gf_false
}

fd->inode is NULL, explaining the cause of the crash. We also see that
fd->refcount is already 0. So I'm wondering if this couldn't be an extra
fd_unref() introduced by that patch.

The crash seems to happen immediately after a graph switch.

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Possible problem introduced by http://review.gluster.org/15573

2016-10-21 Thread Xavier Hernandez

Hi,

I've just tried Gluster 3.8.5 with Proxmox using gfapi and I 
consistently see a crash each time an attempt to connect to the volume 
is made.


The backtrace of the crash shows this:

#0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
#1  0x7fe5345776a5 in fd_unref (fd=0x7fe523f7205c) at fd.c:553
#2  0x7fe53482ba18 in glfs_io_async_cbk (op_ret=, 
op_errno=0, frame=, cookie=0x7fe526c67040, 
iovec=iovec@entry=0x0, count=count@entry=0)

at glfs-fops.c:839
#3  0x7fe53482beed in glfs_fsync_async_cbk (frame=, 
cookie=, this=, op_ret=, 
op_errno=,
prebuf=, postbuf=0x7fe5217fe890, xdata=0x0) at 
glfs-fops.c:1382
#4  0x7fe520be2eb7 in ?? () from 
/usr/lib/x86_64-linux-gnu/glusterfs/3.8.5/xlator/debug/io-stats.so
#5  0x7fe5345d118a in default_fsync_cbk (frame=0x7fe52ceef3ac, 
cookie=0x560ef95398e8, this=0x8, op_ret=0, op_errno=0, prebuf=0x1, 
postbuf=0x7fe5217fe890, xdata=0x0) at defaults.c:1508
#6  0x7fe5345d118a in default_fsync_cbk (frame=0x7fe52ceef204, 
cookie=0x560ef95398e8, this=0x8, op_ret=0, op_errno=0, prebuf=0x1, 
postbuf=0x7fe5217fe890, xdata=0x0) at defaults.c:1508
#7  0x7fe525f78219 in dht_fsync_cbk (frame=0x7fe52ceef2d8, 
cookie=0x560ef95398e8, this=0x0, op_ret=0, op_errno=0, 
prebuf=0x7fe5217fe820, postbuf=0x7fe5217fe890, xdata=0x0)

at dht-inode-read.c:873
#8  0x7fe5261bbc7f in client3_3_fsync_cbk (req=0x7fe525f78030 
, iov=0x7fe526c61040, count=8, myframe=0x7fe52ceef130) at 
client-rpc-fops.c:975
#9  0x7fe5343201f0 in rpc_clnt_handle_reply (clnt=0x18, 
clnt@entry=0x7fe526fafac0, pollin=0x7fe526c3a1c0) at rpc-clnt.c:791
#10 0x7fe53432056c in rpc_clnt_notify (trans=, 
mydata=0x7fe526fafaf0, event=, data=0x7fe526c3a1c0) at 
rpc-clnt.c:962
#11 0x7fe53431c8a3 in rpc_transport_notify (this=, 
event=, data=) at rpc-transport.c:541
#12 0x7fe5283e8d96 in socket_event_poll_in (this=0x7fe526c69440) at 
socket.c:2267
#13 0x7fe5283eaf37 in socket_event_handler (fd=, 
idx=5, data=0x7fe526c69440, poll_in=1, poll_out=0, poll_err=0) at 
socket.c:2397
#14 0x7fe5345ab3f6 in event_dispatch_epoll_handler 
(event=0x7fe5217fecc0, event_pool=0x7fe526ca2040) at event-epoll.c:571

#15 event_dispatch_epoll_worker (data=0x7fe527c0f0c0) at event-epoll.c:674
#16 0x7fe5324140a4 in start_thread (arg=0x7fe5217ff700) at 
pthread_create.c:309
#17 0x7fe53214962d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111


The fd being unreferenced contains this:

(gdb) print *fd
$6 = {
  pid = 97649,
  flags = 2,
  refcount = 0,
  inode_list = {
next = 0x7fe523f7206c,
prev = 0x7fe523f7206c
  },
  inode = 0x0,
  lock = {
spinlock = 1,
mutex = {
  __data = {
__lock = 1,
__count = 0,
__owner = 0,
__nusers = 0,
__kind = 0,
__spins = 0,
__elision = 0,
__list = {
  __prev = 0x0,
  __next = 0x0
}
  },
  __size = "\001", '\000' ,
  __align = 1
}
  },
  _ctx = 0x7fe52ec31c40,
  xl_count = 11,
  lk_ctx = 0x7fe526c126a0,
  anonymous = _gf_false
}

fd->inode is NULL, explaining the cause of the crash. We also see that 
fd->refcount is already 0. So I'm wondering if this couldn't be an extra 
fd_unref() introduced by that patch.


The crash seems to happen immediately after a graph switch.

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Multiplexing - good news, bad news, and a plea for help

2016-09-20 Thread Xavier Hernandez



On 19/09/16 15:26, Jeff Darcy wrote:

I have brick multiplexing[1] functional to the point that it passes all basic 
AFR, EC, and quota tests.  There are still some issues with tiering, and I 
wouldn't consider snapshots functional at all, but it seemed like a good point 
to see how well it works.  I ran some *very simple* tests with 20 volumes, each 
2x distribute on top of 2x replicate.

First, the good news: it worked!  Getting 80 bricks to come up in the same 
process, and then run I/O correctly across all of those, is pretty cool.  Also, 
memory consumption is *way* down.  RSS size went from 1.1GB before (total 
across 80 processes) to about 400MB (one process) with multiplexing.  Each 
process seems to consume approximately 8MB globally plus 5MB per brick, so 
(8+5)*80 = 1040 vs. 8+(5*80) = 408.  Just considering the amount of memory, 
this means we could support about three times as many bricks as before.  When 
memory *contention* is considered, the difference is likely to be even greater.

Bad news: some of our code doesn't scale very well in terms of CPU use.  To 
test performance I ran a test which would create 20,000 files across all 20 
volumes, then write and delete them, all using 100 client threads.  This is 
similar to what smallfile does, but deliberately constructed to use a minimum 
of disk space - at any given, only one file per thread (maximum) actually has 
4KB worth of data in it.  This allows me to run it against SSDs or even 
ramdisks even with high brick counts, to factor out slow disks in a study of 
CPU/memory issues.  Here are some results and observations.

* On my first run, the multiplexed version of the test took 77% longer to run 
than the non-multiplexed version (5:42 vs. 3:13).  And that was after I'd done 
some hacking to use 16 epoll threads.  There's something a bit broken about 
trying to set that option normally, so that the value you set doesn't actually 
make it to the place that tries to spawn the threads.  Bumping this up further 
to 32 threads didn't seem to help.

* A little profiling showed me that we're spending almost all of our time in 
pthread_spin_lock.  I disabled the code to use spinlocks instead of regular 
mutexes, which immediately improved performance and also reduced CPU time by 
almost 50%.

* The next round of profiling showed that a lot of the locking is in mem-pool 
code, and a lot of that in turn is from dictionary code.  Changing the dict 
code to use malloc/free instead of mem_get/mem_put gave another noticeable 
boost.


That's weird, since the only purpose of the mem-pool was precisely to 
improve performance of allocation of objects that are frequently 
allocated/released.




At this point run time was down to 4:50, which is 20% better than where I 
started but still far short of non-multiplexed performance.  I can drive that 
down still further by converting more things to use malloc/free.  There seems 
to be a significant opportunity here to improve performance - even without 
multiplexing - by taking a more careful look at our memory-management 
strategies:

* Tune the mem-pool implementation to scale better with hundreds of threads.

* Use mem-pools more selectively, or even abandon them altogether.

* Try a different memory allocator such as jemalloc.

I'd certainly appreciate some help/collaboration in studying these options 
further.  It's a great opportunity to make a large impact on overall 
performance without a lot of code or specialized knowledge.  Even so, however, 
I don't think memory management is our only internal scalability problem.  
There must be something else limiting parallelism, and quite severely at that.  
My first guess is io-threads, so I'll be looking into that first, but if 
anybody else has any ideas please let me know.  There's no *good* reason why 
running many bricks in one process should be slower than running them in 
separate processes.  If it remains slower, then the limit on the number of 
bricks and volumes we can support will remain unreasonably low.  Also, the 
problems I'm seeing here probably don't *only* affect multiplexing.  Excessive 
memory/CPU use and poor parallelism are issues that we kind of need to address 
anyway, so if anybody has any ideas please let me know.


You have made a really good job :)

Some points I would look into:

* Consider http://review.gluster.org/15036/. With all communications 
going through the same socket, the problem this patch tries to solve 
could become worse.


* We should consider the possibility of implementing a global thread 
pool, which would replace io-threads, epoll threads and maybe others. 
Synctasks should also rely on this thread pool. This has the benefit of 
better controlling the total number of threads. Otherwise when we have 
more threads than processor cores, we waste resources unnecessarily and 
we won't get a real gain. Even worse, it could start to degrade due to 
contention.


* There are *too many* mutexes in the code. We should 

Re: [Gluster-devel] Review request for 3.9 patches

2016-09-19 Thread Xavier Hernandez

Hi Poornima,

On 19/09/16 07:01, Poornima Gurusiddaiah wrote:

Hi All,

There are 3 more patches that we need for enabling md-cache invalidation in 3.9.
Request your help with the reviews:

http://review.gluster.org/#/c/15378/   - afr: Implement IPC fop
http://review.gluster.org/#/c/15387/   - ec: Implement IPC fop


The patch is OK for me. My only concern is whether it should reference a bug 
instead of having 'rfc' as the topic.


Xavi


http://review.gluster.org/#/c/15398/   - mdc/upcall/afr: Reduce the window of 
stale read


Thanks,
Poornima

- Original Message -

From: "Poornima Gurusiddaiah" 
To: "Gluster Devel" , "Raghavendra Gowdappa" 
, "Rajesh Joseph"
, "Raghavendra Talur" , "Soumya Koduri" 
, "Niels de Vos"
, "Anoop Chirayath Manjiyil Sajan" 
Sent: Tuesday, August 30, 2016 5:13:36 AM
Subject: Re: [Gluster-devel] Review request for 3.9 patches

Hi,

Few more patches, have addressed the review comments, could you please review
these patches:

http://review.gluster.org/15002   md-cache: Register the list of xattrs with
cache-invalidation
http://review.gluster.org/15300   dht, md-cache, upcall: Add invalidation of
IATT when the layout changes
http://review.gluster.org/15324   md-cache: Process all the cache
invalidation flags
http://review.gluster.org/15313   upcall: Mark the clients as accessed even
on readdir entries
http://review.gluster.org/15193   io-stats: Add stats for upcall
notifications

Regards,
Poornima

- Original Message -


From: "Poornima Gurusiddaiah" 
To: "Gluster Devel" , "Raghavendra Gowdappa"
, "Rajesh Joseph" , "Raghavendra
Talur" , "Soumya Koduri" , "Niels de
Vos" , "Anoop Chirayath Manjiyil Sajan"

Sent: Thursday, August 25, 2016 5:22:43 AM
Subject: Review request for 3.9 patches



Hi,



There are few patches that are part of the effort of integrating md-cache
with upcall.
Hope to take these patches for 3.9, it would be great if you can review
these
patches:



upcall patches:
http://review.gluster.org/#/c/15313/
http://review.gluster.org/#/c/15301/



md-cache patches:
http://review.gluster.org/#/c/15002/
http://review.gluster.org/#/c/15045/
http://review.gluster.org/#/c/15185/
http://review.gluster.org/#/c/15224/
http://review.gluster.org/#/c/15225/
http://review.gluster.org/#/c/15300/
http://review.gluster.org/#/c/15314/



Thanks,
Poornima

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Query regards to heal xattr heal in dht

2016-09-15 Thread Xavier Hernandez



On 15/09/16 11:31, Raghavendra G wrote:



On Thu, Sep 15, 2016 at 12:02 PM, Nithya Balachandran
> wrote:



On 8 September 2016 at 12:02, Mohit Agrawal > wrote:

Hi All,

   I have another solution to heal user xattrs, but before
implementing it I would like to discuss it with you.

   Can I call the function dht_dir_xattr_heal (internally it
calls syncop_setxattr) to heal xattrs at the end of dht_getxattr_cbk,
   after making sure we have a valid xattr?
   In dht_dir_xattr_heal it will blindly copy all user
xattrs to all subvolumes, or I can compare each subvol's xattrs with the valid
xattrs and, if there is any mismatch, call syncop_setxattr;
otherwise there is no need to call syncop_setxattr.



This can be problematic if a particular xattr is being removed - it
might still exist on some subvols. IIUC, the heal would go and reset
it again?

One option is to use the hash subvol for the dir as the source - so
perform xattr op on hashed subvol first and on the others only if it
succeeds on the hashed. This does have the problem of being unable
to set xattrs if the hashed subvol is unavailable. This might not be
such a big deal in case of distributed replicate or distribute
disperse volumes but will affect pure distribute. However, this way
we can at least be reasonably certain of the correctness (leaving
rebalance out of the picture).


* What is the behavior of getxattr when hashed subvol is down? Should we
succeed with values from non-hashed subvols or should we fail getxattr?
With hashed-subvol as the source of truth, it's difficult to determine the
correctness of xattrs and their values when it is down.

* setxattr is an inode operation (as opposed to entry operation). So, we
cannot calculate hashed-subvol as in (get)(set)xattr, parent layout and
"basename" is not available. This forces us to store hashed subvol in
inode-ctx. Now, when the hashed-subvol changes we need to update these
inode-ctxs too.

What do you think about a Quorum based solution to this problem?

1. setxattr succeeds only if it is successful on at least (n/2 + 1)
number of subvols.
2. getxattr succeeds only if it is successful and values match on at
least (n/2 + 1) number of subvols.

The flip-side of this solution is we are increasing the probability of
failure of (get)(set)xattr operations as opposed to the hashed-subvol as
source of truth solution. Or are we - how do we compare probability of
hashed-subvol going down with probability of (n/2 + 1) nodes going down
simultaneously? Is it 1/n vs (1/n*1/n*... (n/2+1 times)?. Is 1/n correct
probability for _a specific subvol (hashed-subvol)_ going down (as
opposed to _any one subvol_ going down)?


If we suppose p to be the probability of failure of a subvolume in a 
period of time (a year for example), all subvolumes have the same 
probability, and we have N subvolumes, then:


Probability of failure of hashed-subvol: p
Probability of failure of N/2 + 1 or more subvols: the sum, for k = N/2 + 1 to N, of C(N, k) * p^k * (1 - p)^(N - k)

Note that this probability says how probable it is that N/2 + 1 
subvols or more fail in the specified period of time, but not 
necessarily simultaneously. If we suppose that subvolumes are recovered 
as fast as possible, the real probability of simultaneous failure will 
be much smaller.


In the worst case (not recovering the failed subvolumes in the given period 
of time), if p < 0.5 or N = 2 (and p != 1), then it's always better to 
check N/2 + 1 subvolumes. Otherwise, it's better to check the hashed-subvol.


I think that p should always be much smaller than 0.5 for small periods 
of time where subvolume recovery could not be completed before other 
failures, so checking half plus one subvols should always be the best 
option in terms of probability. Performance can suffer, though, if some 
kind of synchronization is needed.
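
A small standalone sketch that makes the comparison concrete (toy C program, assuming independent failures with equal probability p; not gluster code):

/* compile with: cc prob.c -lm */
#include <stdio.h>
#include <math.h>

static double binom(int n, int k)            /* C(n, k) */
{
    double r = 1.0;
    for (int i = 1; i <= k; i++)
        r = r * (n - k + i) / i;
    return r;
}

static double p_majority_fails(int n, double p)
{
    double sum = 0.0;
    for (int k = n / 2 + 1; k <= n; k++)     /* N/2 + 1 or more failures */
        sum += binom(n, k) * pow(p, k) * pow(1.0 - p, n - k);
    return sum;
}

int main(void)
{
    int n = 6;                               /* number of subvols */
    for (double p = 0.01; p <= 0.5; p += 0.09)
        printf("p=%.2f  hashed-subvol=%.4f  N/2+1 or more=%.6f\n",
               p, p, p_majority_fails(n, p));
    return 0;
}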


Xavi







   Let me know if this approach is suitable.



Regards
Mohit Agrawal

On Wed, Sep 7, 2016 at 10:27 PM, Pranith Kumar Karampuri
> wrote:



On Wed, Sep 7, 2016 at 9:46 PM, Mohit Agrawal
> wrote:

Hi Pranith,


In current approach i am getting list of xattr from
first up volume and update the user attributes from that
xattr to
all other volumes.

I have assumed first up subvol is source and rest of
them are sink as we are doing same in dht_dir_attr_heal.


I think first up subvol is different for different mounts as
per my understanding, I could be wrong.



Regards
Mohit Agrawal

On Wed, Sep 7, 2016 at 9:34 PM, Pranith Kumar Karampuri
  

Re: [Gluster-devel] Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

2016-09-13 Thread Xavier Hernandez

Hi Sanoj,

On 13/09/16 09:41, Sanoj Unnikrishnan wrote:

Hi Xavi,

That explains a lot.
I see a couple of other scenarios which can lead to similar inconsistency.
1) simultaneous node/brick crash of 3 bricks.


Although this is a real problem, the 3 bricks should crash exactly at 
the same moment just after having successfully locked the inode being 
modified and queried some information, but before sending the write fop 
or any down notification. The probability of suffering this 
problem is really small.



2) if the disk space of the underlying filesystem on which the brick is hosted is 
exceeded for 3 bricks.


Yes. This is the same cause that makes quota fail.



I don't think we can address all the scenarios unless we have a log/journal 
mechanism like raid-5.


I completely agree. I don't see any solution valid for all cases. BTW 
RAID-5 *is not* a solution. It doesn't have any log/journal. Maybe 
something based on fdl xlator would work.



Should we look at a quota specific fix or let it get fixed whenever we 
introduce a log?


Not sure how to fix this in a way that doesn't seem too hacky.

One possibility is to request permission to write some data before 
actually writing it (specifying offset and size), and then be sure that 
the write will succeed if all bricks (or at least the minimum number of data 
bricks) have acknowledged the previous write permission request.


Another approach would be to queue writes in a server side xlator until 
a commit message is received, but sending back an answer saying if 
there's enough space to do the write (this is, in some way, a very 
primitive log/journal approach).


However both approaches will have a big performance impact if they 
cannot be executed in background.


Maybe it would be worth investing in fdl instead of trying to find a 
custom solution to this.


Xavi



Thanks and Regards,
Sanoj

- Original Message -
From: "Xavier Hernandez" <xhernan...@datalab.es>
To: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Sanoj Unnikrishnan" 
<sunni...@redhat.com>
Cc: "Pranith Kumar Karampuri" <pkara...@redhat.com>, "Ashish Pandey" <aspan...@redhat.com>, 
"Gluster Devel" <gluster-devel@gluster.org>
Sent: Tuesday, September 13, 2016 11:50:27 AM
Subject: Re: Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

Hi Sanoj,

I'm unable to see bug 1224180. Access is restricted.

Not sure what is the problem exactly, but I see that quota is involved.
Currently disperse doesn't play well with quota when the limit is near.

The reason is that not all bricks fail at the same time with EDQUOT due
to small differences in computed space. This causes a valid write to
succeed on some bricks and fail on others. If it fails simultaneously on
more than redundancy bricks but less than the number of data bricks,
there's no way to rollback the changes on the bricks that have
succeeded, so the operation is inconsistent and an I/O error is returned.

For example, on a 6:2 configuration (4 data bricks and 2 redundancy), if
3 bricks succeed and 3 fail, there are not enough bricks with the
updated version, but there aren't enough bricks with the old version either.

If you force 2 bricks to be down, the problem can appear more frequently
as only a single failure causes this problem.

Xavi

On 13/09/16 06:09, Raghavendra Gowdappa wrote:

+gluster-devel

- Original Message -

From: "Sanoj Unnikrishnan" <sunni...@redhat.com>
To: "Pranith Kumar Karampuri" <pkara...@redhat.com>, "Ashish Pandey" 
<aspan...@redhat.com>, xhernan...@datalab.es,
"Raghavendra Gowdappa" <rgowd...@redhat.com>
Sent: Monday, September 12, 2016 7:06:59 PM
Subject: Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

Hello Xavi/Pranith,

I have been able to reproduce the BZ with the following steps:

gluster volume create v_disp disperse 6 redundancy 2 $tm1:/export/sdb/br1
$tm2:/export/sdb/b2 $tm3:/export/sdb/br3  $tm1:/export/sdb/b4
$tm2:/export/sdb/b5 $tm3:/export/sdb/b6 force
#(Used only 3 nodes, should not matter here)
gluster volume start v_disp
mount -t glusterfs $tm1:v_disp /gluster_vols/v_disp
mkdir /gluster_vols/v_disp/dir1
dd if=/dev/zero of=/gluster_vols/v_disp/dir1/x bs=10k count=9 &
gluster v quota v_disp enable
gluster v quota v_disp limit-usage /dir1 200MB
gluster v quota v_disp soft-timeout 0
gluster v quota v_disp hard-timeout 0
#optional remove 2 bricks (reproduces more often with this)
#pgrep glusterfsd | xargs kill -9

IO error on stdout when Quota exceeds, followed by Disk Quota exceeded.

Also note the issue is seen when A flush happens simultaneous with quota
limit hit, Hence Its not seen only on some runs.

The following are the error in logs.
[2016-09-12 10:40:02.431568] E [MSGID: 122034]
[ec-common.c:488:ec_child_select] 0-v_disp-disperse-0: Insufficient
available childs for this request (have

Re: [Gluster-devel] Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

2016-09-13 Thread Xavier Hernandez

Hi Sanoj,

I'm unable to see bug 1224180. Access is restricted.

Not sure what is the problem exactly, but I see that quota is involved. 
Currently disperse doesn't play well with quota when the limit is near.


The reason is that not all bricks fail at the same time with EDQUOT due 
to small differences in computed space. This causes a valid write to 
succeed on some bricks and fail on others. If it fails simultaneously on 
more than redundancy bricks but less than the number of data bricks, 
there's no way to rollback the changes on the bricks that have 
succeeded, so the operation is inconsistent and an I/O error is returned.


For example, on a 6:2 configuration (4 data bricks and 2 redundancy), if 
3 bricks succeed and 3 fail, there are not enough bricks with the 
updated version, but there aren't enough bricks with the old version either.


If you force 2 bricks to be down, the problem can appear more frequently 
as only a single failure causes this problem.


Xavi

On 13/09/16 06:09, Raghavendra Gowdappa wrote:

+gluster-devel

- Original Message -

From: "Sanoj Unnikrishnan" 
To: "Pranith Kumar Karampuri" , "Ashish Pandey" 
, xhernan...@datalab.es,
"Raghavendra Gowdappa" 
Sent: Monday, September 12, 2016 7:06:59 PM
Subject: Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

Hello Xavi/Pranith,

I have been able to reproduce the BZ with the following steps:

gluster volume create v_disp disperse 6 redundancy 2 $tm1:/export/sdb/br1
$tm2:/export/sdb/b2 $tm3:/export/sdb/br3  $tm1:/export/sdb/b4
$tm2:/export/sdb/b5 $tm3:/export/sdb/b6 force
#(Used only 3 nodes, should not matter here)
gluster volume start v_disp
mount -t glusterfs $tm1:v_disp /gluster_vols/v_disp
mkdir /gluster_vols/v_disp/dir1
dd if=/dev/zero of=/gluster_vols/v_disp/dir1/x bs=10k count=9 &
gluster v quota v_disp enable
gluster v quota v_disp limit-usage /dir1 200MB
gluster v quota v_disp soft-timeout 0
gluster v quota v_disp hard-timeout 0
#optional remove 2 bricks (reproduces more often with this)
#pgrep glusterfsd | xargs kill -9

IO error on stdout when Quota exceeds, followed by Disk Quota exceeded.

Also note the issue is seen when A flush happens simultaneous with quota
limit hit, Hence Its not seen only on some runs.

The following are the error in logs.
[2016-09-12 10:40:02.431568] E [MSGID: 122034]
[ec-common.c:488:ec_child_select] 0-v_disp-disperse-0: Insufficient
available childs for this request (have 0, need 4)
[2016-09-12 10:40:02.431627] E [MSGID: 122037]
[ec-common.c:1830:ec_update_size_version_done] 0-Disperse: sku-debug:
pre-version=0/0, size=0post-version=1865/1865, size=209571840
[2016-09-12 10:40:02.431637] E [MSGID: 122037]
[ec-common.c:1835:ec_update_size_version_done] 0-v_disp-disperse-0: Failed
to update version and size [Input/output error]
[2016-09-12 10:40:02.431664] E [MSGID: 122034]
[ec-common.c:417:ec_child_select] 0-v_disp-disperse-0: sku-debug: mask: 36,
ec->xl_up 36, ec->node_mask 3f, parent->mask:36, fop->parent->healing:0,
id:29

[2016-09-12 10:40:02.431673] E [MSGID: 122034]
[ec-common.c:480:ec_child_select] 0-v_disp-disperse-0: sku-debug: mask: 36,
remaining: 36, healing: 0, ec->xl_up 36, ec->node_mask 3f, parent->mask:36,
num:4, minimum: 1, id:29

...
[2016-09-12 10:40:02.487302] W [fuse-bridge.c:2311:fuse_writev_cbk]
0-glusterfs-fuse: 41159: WRITE => -1
gfid=ee0b4aa1-1f44-486a-883c-acddc13ee318 fd=0x7f1d9c003edc (Input/output
error)
[2016-09-12 10:40:02.500151] W [MSGID: 122006]
[ec-combine.c:206:ec_iatt_combine] 0-v_disp-disperse-0: Failed to combine
iatt (inode: 9816911356190712600-9816911356190712600, links: 1-1, uid: 0-0,
gid: 0-0, rdev: 0-0, size: 52423680-52413440, mode: 100644-100644)
[2016-09-12 10:40:02.500188] N [MSGID: 122029]
[ec-combine.c:93:ec_combine_write] 0-v_disp-disperse-0: Mismatching iatt in
answers of 'WRITE'
[2016-09-12 10:40:02.504551] W [MSGID: 122006]
[ec-combine.c:206:ec_iatt_combine] 0-v_disp-disperse-0: Failed to combine
iatt (inode: 9816911356190712600-9816911356190712600, links: 1-1, uid: 0-0,
gid: 0-0, rdev: 0-0, size: 52423680-52413440, mode: 100644-100644)



[2016-09-12 10:40:02.571272] N [MSGID: 122029]
[ec-combine.c:93:ec_combine_write] 0-v_disp-disperse-0: Mismatching iatt in
answers of 'WRITE'
[2016-09-12 10:40:02.571510] W [MSGID: 122006]
[ec-combine.c:206:ec_iatt_combine] 0-v_disp-disperse-0: Failed to combine
iatt (inode: 9816911356190712600-9816911356190712600, links: 1-1, uid: 0-0,
gid: 0-0, rdev: 0-0, size: 52423680-52413440, mode: 100644-100644)
[2016-09-12 10:40:02.571544] N [MSGID: 122029]
[ec-combine.c:93:ec_combine_write] 0-v_disp-disperse-0: Mismatching iatt in
answers of 'WRITE'
[2016-09-12 10:40:02.571772] W [fuse-bridge.c:1290:fuse_err_cbk]
0-glusterfs-fuse: 41160: FLUSH() ERR => -1 (Input/output error)

Also, for some fops before the write I noticed the fop->mask field as 0, Its
not clear why this happens ??

[2016-09-12 

Re: [Gluster-devel] Spurious termination of fuse invalidation notifier thread

2016-09-06 Thread Xavier Hernandez

Hi Raghavendra,

On 06/09/16 06:11, Raghavendra Gowdappa wrote:



- Original Message -

From: "Xavier Hernandez" <xhernan...@datalab.es>
To: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Kaleb Keithley" <kkeit...@redhat.com>, 
"Pranith Kumar Karampuri"
<pkara...@redhat.com>
Cc: "Csaba Henk" <ch...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org>
Sent: Monday, September 5, 2016 12:46:43 PM
Subject: Re: Spurious termination of fuse invalidation notifier thread

Hi Raghavendra,

On 03/09/16 05:42, Raghavendra Gowdappa wrote:

Hi Xavi/Kaleb/Pranith,

During few of our older conversations (like [1], but not only one), some of
you had reported that the thread which writes invalidation notifications
(of inodes, entries) to /dev/fuse terminates spuriously. Csaba tried to
reproduce the issue, but without success. It would be helpful if you
provide any information on reproducer and/or possible reasons for the
behavior.


I didn't find what really caused the problem. I only saw the
termination message on a production server after some days working, but
I hadn't had the opportunity to debug it.

Looking at the code, the only conclusion I got is that the result from
the write to /dev/fuse was unexpected. The patch solves this and I
haven't seen the problem again.

The old code only manages ENOENT error. It exits the thread for any
other error. I guess that in some situations a write to /dev/fuse can
return other "non fatal" errors.


Thanks Xavi. Now I remember the changes. Since you have not seen spurious 
termination after the changes, I assume the issue is fixed.


Yes, I haven't seen the issue again since the patch was applied.





As a guess, I think it may be a failure in an entry invalidation.
Looking at the code of fuse, it may return ENOTDIR if the parent of the
entry is not a directory and some race happens doing rm/create while
sending invalidations in the background. Another possibility is
ENOTEMPTY if the entry references a non-empty directory (again probably
caused by races between user mode operations and background
invalidations). Anyway, this is only a guess; I have no more information.

Xavi



[1]
http://review.gluster.org/#/c/13274/1/xlators/mount/fuse/src/fuse-bridge.c

regards,
Raghavendra




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious termination of fuse invalidation notifier thread

2016-09-05 Thread Xavier Hernandez

Hi Raghavendra,

On 03/09/16 05:42, Raghavendra Gowdappa wrote:

Hi Xavi/Kaleb/Pranith,

During few of our older conversations (like [1], but not only one), some of you 
had reported that the thread which writes invalidation notifications (of 
inodes, entries) to /dev/fuse terminates spuriously. Csaba tried to reproduce 
the issue, but without success. It would be helpful if you provide any 
information on reproducer and/or possible reasons for the behavior.


I didn't find what really caused the problem. I only saw the 
termination message on a production server after some days of operation, 
but I didn't have the opportunity to debug it.


Looking at the code, the only conclusion I got is that the result from 
the write to /dev/fuse was unexpected. The patch solves this and I 
haven't seen the problem again.


The old code only handles the ENOENT error and exits the thread for any 
other error. I guess that in some situations a write to /dev/fuse can 
return other "non fatal" errors.


As a guess, I think it may be a failure in an entry invalidation. 
Looking at the fuse code, it may return ENOTDIR if the parent of the 
entry is not a directory and some race happens doing rm/create while 
sending invalidations in the background. Another possibility is 
ENOTEMPTY if the entry references a non-empty directory (again probably 
caused by races between user-mode operations and background 
invalidations). Anyway, this is only a guess; I have no more information.


Xavi



[1] http://review.gluster.org/#/c/13274/1/xlators/mount/fuse/src/fuse-bridge.c

regards,
Raghavendra


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Notifications (was Re: GF_PARENT_DOWN on SIGKILL)

2016-07-25 Thread Xavier Hernandez

Hi Jeff,

On 22/07/16 16:14, Jeff Darcy wrote:

I don't think we need any list traversal because notify sends it down
the graph.


Good point.  I think we need to change that, BTW.  Relying on
translators to propagate notifications has proven very fragile, as many
of those events are overloaded to mean very different things to
different translators (e.g. just being up vs. having quorum) with
different rules for when they should or should not be propagated.  Going
forward, I think we can save ourselves a lot of headaches by treating
notification as an infrastructure responsibility, and changing
translators to use something else (e.g. IPC fops or upcalls) for their
internal needs.


I partially agree. I think the gluster core should have more control over 
event propagation. However, some events, as they are currently used, need 
to be controlled by the xlator in some way.


For example, GF_EVENT_CHILD_UP cannot be sent immediately once the xlator 
itself has been completely initialized, because this would mean that 
upper xlators could start sending requests. In the specific case of ec, 
this is not good because with a single child up the volume is 
completely inoperable. On the other hand, waiting for all subvolumes to 
be up (or down) wastes time when there's a problem, because ec 
can begin working with fewer bricks.
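
To make it concrete, this is roughly the kind of control I mean (a
simplified sketch; the names, counters and threshold are illustrative,
not the real ec notify code):

/* Sketch of per-xlator control over CHILD_UP propagation. */
enum { EV_CHILD_UP = 1, EV_CHILD_DOWN = 2 };

typedef struct {
        int children;     /* total number of subvolumes              */
        int redundancy;   /* how many of them may be missing         */
        int up;           /* how many have reported CHILD_UP so far  */
        int notified;     /* whether we already told our parents     */
} ec_state_t;

/* Returns the event to propagate upwards, or 0 to swallow it. */
static int
ec_filter_child_event (ec_state_t *ec, int event)
{
        if (event == EV_CHILD_UP)
                ec->up++;
        else if (event == EV_CHILD_DOWN && ec->up > 0)
                ec->up--;

        /* The volume only becomes usable once at least
         * children - redundancy bricks are available. */
        if (!ec->notified && ec->up >= ec->children - ec->redundancy) {
                ec->notified = 1;
                return EV_CHILD_UP;
        }

        return 0;
}

With an infrastructure-driven model, this decision would be expressed as
state reported to the core instead of an event forwarded by the xlator,
but the condition itself would be the same.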


Maybe the infrastructure should include some way to receive specific 
events from the xlators and use them to propagate events between xlators 
when appropriate. This way an xlator would not be responsible for 
propagating any event, only for keeping the gluster core informed about 
its current state.


Xavi



But that's a different issue.  For now, just pushing one PARENT_DOWN
to the top of the graph should suffice.


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] GF_PARENT_DOWN on SIGKILL

2016-07-25 Thread Xavier Hernandez

Hi Jeff,

On 22/07/16 15:37, Jeff Darcy wrote:

Gah! sorry sorry, I meant to send the mail as SIGTERM. Not SIGKILL. So xavi
and I were wondering why cleanup_and_exit() is not sending GF_PARENT_DOWN
event.


OK, then that grinding sound you hear is my brain shifting gears.  ;)  It
seems that cleanup_and_exit will call xlator.fini in some few cases, but
it doesn't do anything that would send notify events.  I'll bet the answer
to "why" is just that nobody thought of it or got around to it.  The next
question I'd ask is: can you do what you need to do from ec.fini instead?
That would require enabling it in should_call_fini as well, but otherwise
seems pretty straightforward.


As far as I know, there's no explicit guarantee on the order in which 
fini is called, so we cannot rely on it to do cleanup, because ec needs 
all its underlying xlators to be fully functional to finish the cleanup.


If this can be explicitly enforced and maintained, I think it could be 
moved, but with some tricks, since fini is expected to be a synchronous 
operation and the ec cleanup is asynchronous.




If the answer to that question is no, then things get more complicated.
Can we do one loop that sends GF_EVENT_PARENT_DOWN events, then another
that calls fini?  Can we just do a basic list traversal (as we do now for
fini) or do we need to do something more complicated to deal with cluster
translators?  I think a separate loop doing basic list traversal would
work, even with brick multiplexing, so it's probably worth just coding it
up as an experiment.


The main "difficulty" here is the asynchronous behavior of the cleanup. 
Nothing else can be shut down until the cleanup finishes.


Maybe the GF_EVENT_PARENT_DOWN should account for this 
asynchronous/delayed operation, while the fini should be kept as a 
synchronous cleanup and resource release operation.
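
Something along these lines could work (only a sketch with invented names
to show the shape of the idea, not an actual libglusterfs API):

/* Sketch: making shutdown wait for asynchronous PARENT_DOWN cleanup. */
#include <pthread.h>

typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             pending;   /* xlators still cleaning up */
} shutdown_ctx_t;

/* Called by each xlator (e.g. ec) when its delayed cleanup finishes. */
static void
shutdown_ack (shutdown_ctx_t *ctx)
{
        pthread_mutex_lock (&ctx->lock);
        if (--ctx->pending == 0)
                pthread_cond_signal (&ctx->cond);
        pthread_mutex_unlock (&ctx->lock);
}

/* Called from the shutdown path: after sending PARENT_DOWN to the top
 * of the graph, wait until every xlator has acknowledged; only then is
 * it safe to run the synchronous fini() loop. */
static void
shutdown_wait (shutdown_ctx_t *ctx)
{
        pthread_mutex_lock (&ctx->lock);
        while (ctx->pending > 0)
                pthread_cond_wait (&ctx->cond, &ctx->lock);
        pthread_mutex_unlock (&ctx->lock);
}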


Xavi




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] performance issues Manoj found in EC testing

2016-06-28 Thread Xavier Hernandez

Hi Pranith,

On 28/06/16 08:08, Pranith Kumar Karampuri wrote:




On Tue, Jun 28, 2016 at 10:21 AM, Poornima Gurusiddaiah
<pguru...@redhat.com <mailto:pguru...@redhat.com>> wrote:

Regards,
Poornima



*From: *"Pranith Kumar Karampuri" <pkara...@redhat.com
<mailto:pkara...@redhat.com>>
*To: *"Xavier Hernandez" <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>>
*Cc: *"Gluster Devel" <gluster-devel@gluster.org
<mailto:gluster-devel@gluster.org>>
*Sent: *Monday, June 27, 2016 5:48:24 PM
*Subject: *Re: [Gluster-devel] performance issues Manoj found in
EC testing



On Mon, Jun 27, 2016 at 12:42 PM, Pranith Kumar Karampuri
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:



On Mon, Jun 27, 2016 at 11:52 AM, Xavier Hernandez
<xhernan...@datalab.es <mailto:xhernan...@datalab.es>> wrote:

Hi Manoj,

I always enable client-io-threads option for disperse
volumes. It improves performance sensibly, most probably
because of the problem you have detected.

I don't see any other way to solve that problem.


I agree. Updated the bug with same info.


I think it would be a lot better to have a true thread
pool (and maybe an I/O thread pool shared by fuse,
client and server xlators) in libglusterfs instead of
the io-threads xlator. This would allow each xlator to
decide when and what should be parallelized in a more
intelligent way, since basing the decision solely on the
fop type seems too simplistic to me.

In the specific case of EC, there are a lot of
operations to perform for a single high level fop, and
not all of them require the same priority. Also some of
them could be executed in parallel instead of sequentially.


I think it is high time we actually schedule(for which
release) to get this in gluster. May be you should send out
a doc where we can work out details? I will be happy to
explore options to integrate io-threads, syncop/barrier with
this infra based on the design may be.


I was just thinking why we can't reuse synctask framework. It
already scales up/down based on the tasks. At max it uses 16
threads. Whatever we want to be executed in parallel we can
create a synctask around it and run it. Would that be good enough?

Yes, synctask framework can be preferred over io-threads, else it
would mean 16 synctask threads + 16(?) io-threads for one instance
of mount, this will blow out the gfapi clients if they have many
mounts from the same process. Also using synctask would mean code
changes in EC?


Yes, it will need some changes, but I don't think they are big changes. I
think the functions to decode/encode already exist. We just need to
move encoding/decoding into tasks and run them as synctasks.


I was also thinking about sleeping fops. Currently, when they are resumed, 
they are processed in the same thread that was processing another fop. 
This can add latency to fops or unnecessary delays in lock 
management. If they could be scheduled for execution by another thread, 
these delays would be drastically reduced.


On the other hand, splitting the computation of EC encoding into 
multiple threads is bad because the current implementation takes advantage 
of the internal CPU memory cache, which is really fast. We should compute 
all fragments of a single request in the same thread. Multiple 
independent computations could be executed by different threads.
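
A rough sketch of what I mean (the task type and the encoder are
illustrative, not the real ec code): each queued task encodes all the
fragments of one request back to back, so the source block stays hot in
the CPU cache, while independent requests can still run on different
workers.

/* Sketch: one thread-pool task per request, encoding all of its
 * fragments together. Types and names are illustrative only. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
        const uint8_t *data;       /* original write buffer           */
        size_t         block;      /* size of the data block          */
        int            fragments;  /* data + redundancy bricks        */
        uint8_t      **out;        /* one output buffer per brick     */
} encode_task_t;

/* Hypothetical single-fragment encoder (matrix multiplication etc.). */
extern void encode_fragment (const uint8_t *data, size_t block,
                             int index, uint8_t *out);

/* Runs entirely inside one worker thread. */
static void
encode_request (encode_task_t *t)
{
        for (int i = 0; i < t->fragments; i++)
                encode_fragment (t->data, t->block, i, t->out[i]);
}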




Xavi,
  Long time back we chatted a bit about synctask code and you wanted
the scheduling to happen by kernel or something. Apart from that do you
see any other issues? At least if the tasks are synchronous i.e. nothing
goes out the wire, task scheduling = thread scheduling by kernel and it
works exactly like thread-pool you were referring to. It does
multi-tasking only if the tasks are asynchronous in nature.


How would this work? Would we have to create a new synctask for each 
background function we want to execute? I think this has a significant 
overhead, since each synctask requires its own stack, creates a frame 
that we don't really need in most cases, and causes context switches.


We could have hundreds or thousands of requests per second. They could 
even require more than one background task per request in some 
cases. I'm not sure synctasks are the right choice here.


I think that a thread pool is more lightweight.

Xavi






Xavi

Re: [Gluster-devel] performance issues Manoj found in EC testing

2016-06-27 Thread Xavier Hernandez

Hi Manoj,

I always enable the client-io-threads option for disperse volumes. It 
improves performance noticeably, most probably because of the problem you 
have detected.


I don't see any other way to solve that problem.

I think it would be a lot better to have a true thread pool (and maybe 
an I/O thread pool shared by fuse, client and server xlators) in 
libglusterfs instead of the io-threads xlator. This would allow each 
xlator to decide when and what should be parallelized in a more 
intelligent way, since basing the decision solely on the fop type seems 
too simplistic to me.
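
For reference, the kind of pool I'm thinking of is quite small (a minimal
sketch; a real implementation would need dynamic sizing, priorities and
shutdown handling):

/* Minimal sketch of a shared thread pool (illustrative only). */
#include <pthread.h>
#include <stdlib.h>

typedef struct task {
        void        (*fn) (void *);
        void         *arg;
        struct task  *next;
} task_t;

typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        task_t         *head;
        task_t         *tail;
} pool_t;

/* Worker loop: each task is just a function pointer, so there is no
 * per-task stack or frame, unlike a synctask. */
static void *
pool_worker (void *data)
{
        pool_t *p = data;

        for (;;) {
                pthread_mutex_lock (&p->lock);
                while (p->head == NULL)
                        pthread_cond_wait (&p->cond, &p->lock);
                task_t *t = p->head;
                p->head = t->next;
                if (p->head == NULL)
                        p->tail = NULL;
                pthread_mutex_unlock (&p->lock);

                t->fn (t->arg);
                free (t);
        }

        return NULL;
}

static void
pool_submit (pool_t *p, void (*fn) (void *), void *arg)
{
        task_t *t = malloc (sizeof (*t));

        if (t == NULL)
                return;         /* a real pool would handle this better */

        t->fn   = fn;
        t->arg  = arg;
        t->next = NULL;

        pthread_mutex_lock (&p->lock);
        if (p->tail != NULL)
                p->tail->next = t;
        else
                p->head = t;
        p->tail = t;
        pthread_cond_signal (&p->cond);
        pthread_mutex_unlock (&p->lock);
}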


In the specific case of EC, there are a lot of operations to perform for 
a single high level fop, and not all of them require the same priority. 
Also some of them could be executed in parallel instead of sequentially.


Xavi

On 25/06/16 19:42, Manoj Pillai wrote:


- Original Message -

From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
To: "Xavier Hernandez" <xhernan...@datalab.es>
Cc: "Manoj Pillai" <mpil...@redhat.com>, "Gluster Devel" 
<gluster-devel@gluster.org>
Sent: Thursday, June 23, 2016 8:50:44 PM
Subject: performance issues Manoj found in EC testing

hi Xavi,
  Meet Manoj from performance team Redhat. He has been testing EC
performance in his stretch clusters. He found some interesting things we
would like to share with you.

1) When we perform multiple streams of big file writes(12 parallel dds I
think) he found one thread to be always hot (99%CPU always). He was asking
me if fuse_reader thread does any extra processing in EC compared to
replicate. Initially I thought it would just lock and epoll threads will
perform the encoding but later realized that once we have the lock and
version details, next writes on the file would be encoded in the same
thread that comes to EC. write-behind could play a role and make the writes
come to EC in an epoll thread but we saw consistently there was just one
thread that is hot. Not multiple threads. We will be able to confirm this
in tomorrow's testing.

2) This is one more thing Raghavendra G found, that our current
implementation of epoll doesn't let other epoll threads pick messages from
a socket while one thread is processing one message from that socket. In
EC's case that can be encoding of the write/decoding read. This will not
let replies of operations on different files to be processed in parallel.
He thinks this can be fixed for 3.9.

Manoj will be raising a bug to gather all his findings. I just wanted to
introduce him and let you know the interesting things he is finding before
you see the bug :-).
--
Pranith


Thanks, Pranith :).

Here's the bug: https://bugzilla.redhat.com/show_bug.cgi?id=1349953

Comparing EC and replica-2 runs, the hot thread is seen in both cases, so
I have not opened this as an EC bug. But initial impression is that
performance impact for EC is particularly bad (details in the bug).

-- Manoj


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Wrong assumptions about disperse

2016-06-20 Thread Xavier Hernandez

Hi Shyam,

On 17/06/16 15:59, Shyam wrote:

On 06/17/2016 04:59 AM, Xavier Hernandez wrote:

Firstly, thanks for the overall post, was informative and helps clarify
some aspects of EC.


AFAIK the real problem of EC is the communications
layer. It adds a lot of latency and having to communicate simultaneously
and coordinate 6 or more bricks has a big impact.


Can you elaborate this more? Is this coordination cost lesser if EC is
coordinated from the server side graphs? (like leader/follower models in
JBR)? I have heard some remarks about a transaction manager in Gluster,
that you proposed, how does that help/fit in to resolve this issue?


I think one of the big problems is in the communications layer. I did 
some tests some time ago with unexpected results. On a pure distributed 
volume with a single brick mounted through FUSE on the same server that 
contains the brick (no physical network communications happen) I did the 
following tests:


* Modify protocol/server to immediately return a buffer of 0's for all 
readv requests (I virtually disable all server side xlators for readv 
requests).


Observed read speed for a dd with bs=128 KB: 349 MB/s
Observed read speed for a dd with bs=32 MB (multiple 128KB readv 
requests in parallel): 744 MB/s


* Modify protocol/client to immediately return a buffer of 0's for all 
readv requests (this avoids all RPC/networking code for readv requests).


Observed read speed for bs=128 KB: 428 MB/s
Observed read speed for bs=32 MB: 1530 MB/s

* An iperf reported a speed of 4.7 GB/s

The network layer seems to be adding a high overhead, especially when 
many requests are sent in parallel. This is very bad for disperse.


I think the coordination effort will be similar on the server side with 
the current implementation. If we use the leader approach, coordination 
will be much easier/faster in theory. However, all communications would be 
directed to a single server, which could make the communications problem 
worse (I haven't tested any of this, though).


The transaction approach was thought with the idea of moving fop sorting 
to the server side, without having to explicitly take locks on the 
client. This would reduce the number of network round-trips and should 
reduce the latency, improving overall performance.


This should have a perceptible impact on write requests, which currently 
are serialized on the client side. If we move the coordination to the 
server side, the client can send multiple write requests in parallel, 
making better use of the network bandwidth. This also gives the brick 
the opportunity to combine multiple write requests into a single disk 
write. This is especially important for ec, which splits big blocks into 
smaller ones for each brick.
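
Just to put numbers on that split (the arithmetic only; the helper is
illustrative): for an N+R disperse volume, a block of block_size bytes
written by the client becomes N+R fragments of block_size / N bytes, one
per brick. For example, with 4+2 a 128 KiB write produces 6 fragments of
32 KiB each.

#include <stddef.h>

/* Illustrative only: size of the fragment each brick receives. */
static inline size_t
ec_fragment_size (size_t block_size, int data_bricks)
{
        return block_size / data_bricks;
}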




I am curious here w.r.t DHT2, where we are breaking this down into DHT2
client and server pieces, and on the MDC (metadata cluster), the leader
brick of DHT2 subvolume is responsible for some actions, like in-memory
inode locking (say), which would otherwise be a cross subvolume lock
(costlier).


Unfortunately I haven't had time to read the details of the DHT2 
implementation, so I cannot say much here.




We also need transactions when we are going to update 2 different
objects with contents (simplest example is creating the inode for the
file and linking its name into the parent directory), IOW when we have a
composite operation.

The above xaction needs recording, which is a lesser evil when dealing
with a local brick, but will suffer performance penalties when dealing
with replication or EC. I am looking at ways where this xaction
recording can be compounded with the first real operation that needs to
be performed on the subvolume, but that may not always work.

So what are your thoughts in regard to improving the client side
coordination problem that you are facing?


My point of view is that *any* coordination will work much better on the 
server side. Additionally, one of the features of the transaction 
framework was that multiple xlators could share a single transaction on 
the same inode, reducing the number of operations needed in the general 
case (currently, if two xlators need an exclusive lock, each of them 
needs to issue an independent inodelk/entrylk fop). I know this is 
evolving towards the leader/follower pattern, and towards having data and 
metadata separated for gluster. I'm not a big fan of this approach, though.


Independently of all these changes, improving network performance will 
benefit *all* approaches.


Regards,

Xavi



Thanks,
Shyam

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Wrong assumptions about disperse

2016-06-17 Thread Xavier Hernandez

Hi all,

I've seen in many places the belief that disperse, or erasure coding in 
general, is slow because of the complex or costly math involved. It's 
true that there's an overhead compared to a simple copy like replica 
does, but this overhead is much smaller than many people think.


The math used by disperse, if tested alone outside gluster, is much 
faster than it seems. AFAIK the real problem of EC is the communications 
layer. It adds a lot of latency and having to communicate simultaneously 
and coordinate 6 or more bricks has a big impact.


Erasure coding also suffers from partial writes, which require a 
read-modify-write cycle. However, this is completely avoided in many 
situations where the volume is optimally configured and writes are 
aligned and in blocks that are multiples of 4096 bytes (typical for VMs, 
databases and many other workloads). It could even be avoided in other 
situations by taking advantage of the write-behind xlator (not done yet).
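
As a quick illustration of when the read-modify-write cycle can be
skipped (a sketch assuming each fragment covers 512 bytes, so the stripe
is #data-bricks * 512 bytes, e.g. 2048 bytes for a 4+2 volume):

#include <stdbool.h>
#include <stdint.h>

/* A write can be encoded directly, without reading the old contents,
 * only when it covers whole stripes: offset and length both multiples
 * of the stripe size. On an optimally configured volume, 4096-byte
 * aligned writes always satisfy this. */
static bool
is_full_stripe_write (uint64_t offset, uint64_t length, unsigned data_bricks)
{
        uint64_t stripe = (uint64_t) data_bricks * 512;

        return (offset % stripe == 0) && (length % stripe == 0);
}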


I've used a single core of two machines to test the raw math: one quite 
limited (Atom D525 1.8 GHz) and another more powerful but not a top CPU 
(Xeon E5-2630L 2.0 GHz).


Common parameters:

* non-systematic Vandermonde matrix (the same used by ec)
* algorithm slightly slower than the one used by ec (I haven't 
implemented some optimizations in the test program, but I think the 
difference should be very small)

* buffer size: 128 KiB
* number of iterations: 16384
* total size processed: 2 GiB
* results in MiB/s for a single core

Config   Atom   Xeon
  2+1     633   1856
  4+1     405   1203
  4+2     324    984
  4+3     275    807
  8+2     227    611
  8+3     202    545
  8+4     182    501
 16+3     116    303
 16+4     111    295

The same tests using Intel SSE2 extensions (not present in EC yet, but 
the patch is in review):


Config   Atom   Xeon
  2+1     821   3047
  4+1     767   2246
  4+2     629   1887
  4+3     535   1632
  8+2     466   1237
  8+3     423   1104
  8+4     388   1044
 16+3     289    675
 16+4     271    637

With AVX2 it should be faster, but my machines don't support it.

This is much faster still when a systematic matrix is used. For 
example, a 16+4 configuration using SSE on a Xeon core can encode at 3865 
MiB/s. However, this won't make a big difference inside gluster.


Currently, EC encoding/decoding for small/medium configurations is not 
the bottleneck of disperse. Maybe for big configurations on slow 
machines it could have some impact (I don't have the resources to test 
those big configurations properly).


Regards,

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Failure to release unusable file open fd_count on glusterfs v3.7.11

2016-06-09 Thread Xavier Hernandez
 

Hi,

thanks for testing it. I've identified an fd leak in the disperse
xlator. I've filed a bug [1] for this.

Xavi

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1344396

On 08.06.2016 05:00, 彭繼霆 wrote:

> Hi, I have a volume created with 3 bricks. After deleting a file that
> was created by "echo", the file has been moved to the unlink folder.
> Expected result: the opened fd count should be zero and the unlink
> folder should contain no file. But actually, the opened fd count is not
> zero and the unlink folder contains a file.
> Here are some examples:
>
> # gluster volume info ec2
>
> Volume Name: ec2
> Type: Disperse
> Volume ID: 47988520-0e18-4413-9e55-3ec3f3352600
> Status: Started
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: giting1:/export/ec2/fs
> Brick2: giting2:/export/ec2/fs
> Brick3: giting3:/export/ec2/fs
> Options Reconfigured:
> performance.readdir-ahead: on
>
> # gluster v status ec2
>
> Status of volume: ec2
> Gluster process                TCP Port  RDMA Port  Online  Pid
> --
> Brick giting1:/export/ec2/fs   49154     0          Y       10856
> Brick giting2:/export/ec2/fs   49154     0          Y       7967
> Brick giting3:/export/ec2/fs   49153     0          Y       7216
> NFS Server on localhost        N/A       N/A        N       N/A
> Self-heal Daemon on localhost  N/A       N/A        Y       10884
> NFS Server on giting3          2049      0          Y       7236
> Self-heal Daemon on giting3    N/A       N/A        Y       7244
> NFS Server on giting2          2049      0          Y       7987
> Self-heal Daemon on giting2    N/A       N/A        Y       7995
>
> Task Status of Volume ec2
> --
> There are no active volume tasks
>
> # mount -t glusterfs giting1:ec2 /ec2
> # df -h
> Filesystem               Size  Used  Avail  Use%  Mounted on
> /dev/mapper/centos-root  18G   12G   5.8G   67%   /
> devtmpfs                 1.9G  0     1.9G   0%    /dev
> tmpfs                    1.9G  0     1.9G   0%    /dev/shm
> tmpfs                    1.9G  41M   1.9G   3%    /run
> tmpfs                    1.9G  0     1.9G   0%    /sys/fs/cgroup
> /dev/sdb                 40G   33M   40G    1%    /export/bk1
> /dev/sda1                497M  168M  330M   34%   /boot
> tmpfs                    380M  0     380M   0%    /run/user/0
> giting1:dht              80G   66M   80G    1%    /dht
> giting1:/ec1             35G   24G   12G    67%   /volume/ec1
> giting1:ec2              35G   24G   12G    67%   /ec2
>
> # gluster v top ec2 open
>
> Brick: giting1:/export/ec2/fs
> Current open fds: 0, Max open fds: 0, Max openfd time: N/A
> Brick: giting2:/export/ec2/fs
> Current open fds: 0, Max open fds: 0, Max openfd time: N/A
> Brick: giting3:/export/ec2/fs
> Current open fds: 0, Max open fds: 0, Max openfd time: N/A
>
> # for ((i=0;i /ec2/test.txt; done
>
> # gluster v top ec2 open
> Brick: giting1:/export/ec2/fs
> Current open fds: 9, Max open fds: 10, Max openfd time: 2016-06-08 10:09:23.665717
> Count filename
> ===
> 10 /test.txt
> Brick: giting3:/export/ec2/fs
> Current open fds: 9, Max open fds: 10, Max openfd time: 2016-06-08 10:09:23.299795
> Count filename
> ===
> 10 /test.txt
> Brick: giting2:/export/ec2/fs
> Current open fds: 9, Max open fds: 10, Max openfd time: 2016-06-08 10:09:23.236294
> Count filename
> ===
> 10 /test.txt
>
> # ll /export/ec2/fs/.glusterfs/unlink/
> total 0
>
> # rm /ec2/test.txt
>
> # ls -l /export/ec2/fs/.glusterfs/unlink/
> total 8
> -rw-r--r-- 1 root root 512 Jun 8 18:09 a053b266-15c5-4ac7-ac44-841e177c7ebe
>
> # gluster v top ec2 open
> Brick: giting1:/export/ec2/fs
> Current open fds: 8, Max open fds: 10, Max openfd time: 2016-06-08 10:09:23.665717
> Count filename
> ===
> 10 /test.txt
> Brick: giting2:/export/ec2/fs
> Current open fds: 8, Max open fds: 10, Max openfd time: 2016-06-08 10:09:23.236294
> Count filename
> ===
> 10 /test.txt
> Brick: giting3:/export/ec2/fs
> Current open fds: 8, Max open fds: 10, Max openfd time: 2016-06-08 10:09:23.299795
> Count filename
> ===
> 10 /test.txt
>
> Reference: Commit: storage/posix: Implement .unlink directory
> https://github.com/gluster/glusterfs/commit/195548f55b09bf71db92929b7b734407b863093c [1]
>
> Regards,
>
> Gi-ting Peng

Links:
--
[1] https://github.com/gluster/glusterfs/commit/195548f55b09bf71db92929b7b734407b863093c
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] dht mkdir preop check, afr and (non-)readable afr subvols

2016-06-06 Thread Xavier Hernandez

Hi Raghavendra,

On 06/06/16 10:54, Raghavendra G wrote:



On Wed, Jun 1, 2016 at 12:50 PM, Xavier Hernandez <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>> wrote:

Hi,

On 01/06/16 08:53, Raghavendra Gowdappa wrote:



- Original Message -

    From: "Xavier Hernandez" <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>>
To: "Pranith Kumar Karampuri" <pkara...@redhat.com
<mailto:pkara...@redhat.com>>, "Raghavendra G"
<raghaven...@gluster.com <mailto:raghaven...@gluster.com>>
Cc: "Gluster Devel" <gluster-devel@gluster.org
<mailto:gluster-devel@gluster.org>>
Sent: Wednesday, June 1, 2016 11:57:12 AM
Subject: Re: [Gluster-devel] dht mkdir preop check, afr and
(non-)readable afr subvols

Oops, you are right. For entry operations the current
version of the
parent directory is not checked, just to avoid this problem.

This means that mkdir will be sent to all alive subvolumes.
However it
still selects the group of answers that have a minimum
quorum equal or
greater than #bricks - redundancy. So it should be still valid.


What if the quorum is met on "bad" subvolumes? and mkdir was
successful on bad subvolumes? Do we consider mkdir as
successful? If yes, even EC suffers from the problem described
in bz https://bugzilla.redhat.com/show_bug.cgi?id=1341429.


I don't understand the real problem. How a subvolume of EC could be
in bad state from the point of view of DHT ?

If you use xattrs to configure something in the parent directories,
you should have needed to use setxattr or xattrop to do that. These
operations do consider good/bad bricks because they touch inode
metadata. This will only succeed if enough (quorum) bricks have
successfully processed it. If quorum is met but for an error answer,
an error will be reported to DHT and the majority of bricks will be
left in the old state (these should be considered the good
subvolumes). If some brick has succeeded, it will be considered bad
and will be healed. If no quorum is met (even for an error answer),
EIO will be returned and the state of the directory should be
considered unknown/damaged.


Yes. Ideally, dht should use a getxattr for the layout xattr. But, for
performance reasons we thought of overloading mkdir by introducing
pre-operations (done by bricks). With plain dht it is a simple
comparison of xattrs passed as argument and xattrs stored on disk. But,
I failed to include afr and EC in the picture.


I'm still missing something. Looking at the patch that implements this 
(http://review.gluster.org/13885), it seems that mkdir fails if the 
parent xattr is not correctly set, so it's not possible to create a 
directory on a "bad" brick.


If the majority of the subvolumes of ec fail, the whole request will 
fail and this failure will be reported to DHT. If the majority succeed, 
it will be reported to DHT, even if some of the subvolumes have failed.


Maybe if you give me a specific example I may see the real problem.

Xavi


Hence this issue. How
difficult for EC and AFR to bring this kind of check? Is it even
possible for afr and EC to implement this kind of pre-op checks with
reasonable complexity?


If a later mkdir checks this value in storage/posix and succeeds in
enough bricks, it necessarily means that it has succeeded in good
bricks, because there cannot be enough bricks with the bad xattr value.

Note that quorum is always > #bricks/2 so we cannot have a quorum
with good and bad bricks at the same time.

Xavi




Xavi

On 01/06/16 06:51, Pranith Kumar Karampuri wrote:

Xavi,
But if we keep winding only to good subvolumes,
there is a case
where bad subvolumes will never catch up right? i.e. if
we keep creating
files in same directory and everytime self-heal
completes there are more
entries mounts would have created on the good subvolumes
alone. I think
I must have missed this in the reviews if this is the
current behavior.
It was not in the earlier releases. Right?

Pranith

On Tue, May 31, 2016 at 2:17 PM, Raghavendra G
<raghaven...@gluster.com <mailto:raghaven...@gluster.com>
<mailto:raghaven...@gluster.com
<mailto:raghaven...@gluster.com>>> wrote:



On Tue, May 31, 2016 at 12:37 PM, Xavier Hernandez
<xhernan...

Re: [Gluster-devel] dht mkdir preop check, afr and (non-)readable afr subvols

2016-06-01 Thread Xavier Hernandez
Oops, you are right. For entry operations the current version of the 
parent directory is not checked, precisely to avoid this problem.


This means that mkdir will be sent to all alive subvolumes. However, it 
still selects the group of answers that have a minimum quorum equal to or 
greater than #bricks - redundancy. So it should still be valid.


Xavi

On 01/06/16 06:51, Pranith Kumar Karampuri wrote:

Xavi,
But if we keep winding only to good subvolumes, there is a case
where bad subvolumes will never catch up right? i.e. if we keep creating
files in same directory and everytime self-heal completes there are more
entries mounts would have created on the good subvolumes alone. I think
I must have missed this in the reviews if this is the current behavior.
It was not in the earlier releases. Right?

Pranith

On Tue, May 31, 2016 at 2:17 PM, Raghavendra G <raghaven...@gluster.com
<mailto:raghaven...@gluster.com>> wrote:



On Tue, May 31, 2016 at 12:37 PM, Xavier Hernandez
<xhernan...@datalab.es <mailto:xhernan...@datalab.es>> wrote:

Hi,

On 31/05/16 07:05, Raghavendra Gowdappa wrote:

+gluster-devel, +Xavi

Hi all,

The context is [1], where bricks do pre-operation checks
before doing a fop and proceed with fop only if pre-op check
is successful.

@Xavi,

We need your inputs on behavior of EC subvolumes as well.


If I understand correctly, EC shouldn't have any problems here.

EC sends the mkdir request to all subvolumes that are currently
considered "good" and tries to combine the answers. Answers that
match in return code, errno (if necessary) and xdata contents
(except for some special xattrs that are ignored for combination
purposes), are grouped.

Then it takes the group with more members/answers. If that group
has a minimum size of #bricks - redundancy, it is considered the
good answer. Otherwise EIO is returned because bricks are in an
inconsistent state.

If there's any answer in another group, it's considered bad and
gets marked so that self-heal will repair it using the good
information from the majority of bricks.

xdata is combined and returned even if return code is -1.

Is that enough to cover the needed behavior ?


Thanks Xavi. That's sufficient for the feature in question. One of
the main cases I was interested in was what would be the behaviour
if mkdir succeeds on "bad" subvolume and fails on "good" subvolume.
Since you never wind mkdir to "bad" subvolume(s), this situation
never arises.




Xavi



[1] http://review.gluster.org/13885

regards,
Raghavendra

- Original Message -

From: "Pranith Kumar Karampuri" <pkara...@redhat.com
<mailto:pkara...@redhat.com>>
To: "Raghavendra Gowdappa" <rgowd...@redhat.com
<mailto:rgowd...@redhat.com>>
Cc: "team-quine-afr" <team-quine-...@redhat.com
<mailto:team-quine-...@redhat.com>>, "rhs-zteam"
<rhs-zt...@redhat.com <mailto:rhs-zt...@redhat.com>>
Sent: Tuesday, May 31, 2016 10:22:49 AM
Subject: Re: dht mkdir preop check, afr and
(non-)readable afr subvols

I think you should start a discussion on gluster-devel
so that Xavi gets a
chance to respond on the mails as well.

On Tue, May 31, 2016 at 10:21 AM, Raghavendra Gowdappa
<rgowd...@redhat.com <mailto:rgowd...@redhat.com>>
wrote:

Also note that we've plans to extend this pre-op
check to all dentry
operations which also depend parent layout. So, the
discussion need to
cover all dentry operations like:

1. create
2. mkdir
3. rmdir
4. mknod
5. symlink
6. unlink
7. rename

We also plan to have similar checks in lock codepath
for directories too
(planning to use hashed-subvolume as lock-subvolume
for directories). So,
more fops :)
8. lk (posix locks)
9. inodelk
10. entrylk

regards,
Raghavendra

- Original Message -

From: "Raghavendra Gowdappa"
<rg

Re: [Gluster-devel] dht mkdir preop check, afr and (non-)readable afr subvols

2016-05-31 Thread Xavier Hernandez

Hi,

On 31/05/16 07:05, Raghavendra Gowdappa wrote:

+gluster-devel, +Xavi

Hi all,

The context is [1], where bricks do pre-operation checks before doing a fop and 
proceed with fop only if pre-op check is successful.

@Xavi,

We need your inputs on behavior of EC subvolumes as well.


If I understand correctly, EC shouldn't have any problems here.

EC sends the mkdir request to all subvolumes that are currently 
considered "good" and tries to combine the answers. Answers that match 
in return code, errno (if necessary) and xdata contents (except for some 
special xattrs that are ignored for combination purposes), are grouped.


Then it takes the group with the most members/answers. If that group has a 
minimum size of #bricks - redundancy, it is considered the good answer. 
Otherwise EIO is returned because the bricks are in an inconsistent state.


If there's any answer in another group, it's considered bad and gets 
marked so that self-heal will repair it using the good information from 
the majority of bricks.
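
In pseudo-C, the grouping logic is roughly this (a simplified sketch; the
real ec_combine code also compares iatts and xdata):

typedef struct {
        int op_ret;
        int op_errno;
        /* iatt/xdata comparison omitted for brevity */
} answer_t;

/* Returns the index of the largest group of matching answers, or -1
 * (EIO for the caller) if no group reaches bricks - redundancy. */
static int
pick_good_group (const answer_t *ans, int bricks, int redundancy)
{
        int best = -1;
        int best_count = 0;

        for (int i = 0; i < bricks; i++) {
                int count = 0;

                for (int j = 0; j < bricks; j++) {
                        if (ans[j].op_ret == ans[i].op_ret &&
                            ans[j].op_errno == ans[i].op_errno)
                                count++;
                }
                if (count > best_count) {
                        best_count = count;
                        best = i;
                }
        }

        if (best_count < bricks - redundancy)
                return -1;      /* inconsistent answers: return EIO */

        return best;            /* answers outside this group are
                                   marked for self-heal */
}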


xdata is combined and returned even if return code is -1.

Is that enough to cover the needed behavior ?

Xavi



[1] http://review.gluster.org/13885

regards,
Raghavendra

- Original Message -

From: "Pranith Kumar Karampuri" 
To: "Raghavendra Gowdappa" 
Cc: "team-quine-afr" , "rhs-zteam" 

Sent: Tuesday, May 31, 2016 10:22:49 AM
Subject: Re: dht mkdir preop check, afr and (non-)readable afr subvols

I think you should start a discussion on gluster-devel so that Xavi gets a
chance to respond on the mails as well.

On Tue, May 31, 2016 at 10:21 AM, Raghavendra Gowdappa 
wrote:


Also note that we've plans to extend this pre-op check to all dentry
operations which also depend parent layout. So, the discussion need to
cover all dentry operations like:

1. create
2. mkdir
3. rmdir
4. mknod
5. symlink
6. unlink
7. rename

We also plan to have similar checks in lock codepath for directories too
(planning to use hashed-subvolume as lock-subvolume for directories). So,
more fops :)
8. lk (posix locks)
9. inodelk
10. entrylk

regards,
Raghavendra

- Original Message -

From: "Raghavendra Gowdappa" 
To: "team-quine-afr" 
Cc: "rhs-zteam" 
Sent: Tuesday, May 31, 2016 10:15:04 AM
Subject: dht mkdir preop check, afr and (non-)readable afr subvols

Hi all,

I have some queries related to the behavior of afr_mkdir with respect to
readable subvols.

1. While winding mkdir to subvols does afr check whether the subvolume is
good/readable? Or does it wind to all subvols irrespective of whether a
subvol is good/bad? In the latter case, what if
   a. mkdir succeeds on non-readable subvolume
   b. fails on readable subvolume

  What is the result reported to higher layers in the above scenario? If
  mkdir is failed, is it cleaned up on non-readable subvolume where it
  failed?

I am interested in this case as dht-preop check relies on layout xattrs

and I

assume layout xattrs in particular (and all xattrs in general) are
guaranteed to be correct only on a readable subvolume of afr. So, in

essence

we shouldn't be winding down mkdir on non-readable subvols as whatever

the

decision brick makes as part of pre-op check is inherently flawed.

regards,
Raghavendra

--
Pranith


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Possible bug in the communications layer ?

2016-05-09 Thread Xavier Hernandez

I've uploaded a patch for this problem:

http://review.gluster.org/14270

Any review will be very appreciated :)

Thanks,

Xavi

On 09/05/16 12:35, Raghavendra Gowdappa wrote:



- Original Message -

From: "Xavier Hernandez" <xhernan...@datalab.es>
To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>
Sent: Monday, May 9, 2016 3:07:16 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?

Hi Raghavendra,

I've finally found the bug. It was obvious but I didn't see it.


Same here :).



  1561 case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
  1562 default_read_size = xdr_sizeof ((xdrproc_t)
xdr_gfs3_read_rsp,
  1563 _rsp);
  1564
  1565 proghdr_buf = frag->fragcurrent;
  1566
  1567 __socket_proto_init_pending (priv,
default_read_size);
  1568
  1569 frag->call_body.reply.accepted_success_state
  1570 = SP_STATE_READING_PROC_HEADER;
  1571
  1572 /* fall through */
  1573
  1574 case SP_STATE_READING_PROC_HEADER:
  1575 __socket_proto_read (priv, ret);
  1576
  1577 gf_trace_add("xdrmem_create", default_read_size,
(uintptr_t)proghdr_buf);
  1578 /* there can be 'xdata' in read response, figure
it out */
  1579 xdrmem_create (, proghdr_buf, default_read_size,
  1580XDR_DECODE);
  1581
  1582 /* This will fail if there is xdata sent from
server, if not,
  1583well and good, we don't need to worry about  */
  1584 xdr_gfs3_read_rsp (, _rsp);
  1585
  1586 free (read_rsp.xdata.xdata_val);
  1587
  1588 /* need to round off to proper roof (%4), as XDR
packing pads
  1589the end of opaque object with '0' */
  1590 size = roof (read_rsp.xdata.xdata_len, 4);
  1591
  1592 if (!size) {
  1593 frag->call_body.reply.accepted_success_state
  1594 = SP_STATE_READ_PROC_OPAQUE;
  1595 goto read_proc_opaque;
  1596 }
  1597
  1598 __socket_proto_init_pending (priv, size);
  1599
  1600 frag->call_body.reply.accepted_success_state
  1601 = SP_STATE_READING_PROC_OPAQUE;

The main problem here is that we are using two local variables
(proghdr_buf and default_read_size) in two distinct states that might be
called at different times.

The particular case that is failing is the following:

1. In state SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT, everything is prepared
to read 116 bytes. default_read_size is set to 116 and proghdr_buf
points to the buffer where data will be written.

2. In state SP_STATE_READING_PROC_HEADER, a partial read of 88 bytes is
done. At this point the function returns and proghdr_buf and
default_read_size are lost.

3. When more data is available, this function is called again and it
starts executing at state SP_STATE_READING_PROC_HEADER.

4. The remaining 28 bytes are read.

5. When it checks the buffer and tries to decode it to see if there's
xdata present, it uses the default values of proghdr_buf and
default_read_size, that are 0. This causes the decode to leave
read_rsp.xdata.xdata_len set to 0.

6. The program interprets that xdata_len being 0 means that there's no
xdata, so it continues reading the remaining of the RPC packet into the
payload buffer.

If you want, I can send a patch for this.


Yes. That would be helpful. The analysis is correct and moving initialization 
of prog_hdrbuf to line 1578 will fix the issue. If you are too busy, please let 
me know and I can patch it up too :).

Thanks for debugging the issue :).

regards,
Raghavendra.



Xavi

On 05/05/16 10:21, Xavier Hernandez wrote:

I've undone all changes and now I'm unable to reproduce the problem, so
the modification I did is probably incorrect and not the root cause, as
you described.

I'll continue investigating...

Xavi

On 04/05/16 15:01, Xavier Hernandez wrote:

On 04/05/16 14:47, Raghavendra Gowdappa wrote:



- Original Message -

From: "Xavier Hernandez" <xhernan...@datalab.es>
To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>
Sent: Wednesday, May 4, 2016 5:37:56 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?

I think I've found the problem.

1567 case SP_STATE_READING_PROC_HEADER:
  1568 __socket_proto_read (priv, ret);
  1569
  1570 /* there can be 'xdata' in read response, figure
it out */
  1571 xdrmem_create (, proghdr_buf,
default_read_size,
  1572 

Re: [Gluster-devel] Possible bug in the communications layer ?

2016-05-09 Thread Xavier Hernandez

Hi Raghavendra,

I've finally found the bug. It was obvious but I didn't see it.

 1561 case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
 1562         default_read_size = xdr_sizeof ((xdrproc_t) xdr_gfs3_read_rsp,
 1563                                         &read_rsp);
 1564
 1565         proghdr_buf = frag->fragcurrent;
 1566
 1567         __socket_proto_init_pending (priv, default_read_size);
 1568
 1569         frag->call_body.reply.accepted_success_state
 1570                 = SP_STATE_READING_PROC_HEADER;
 1571
 1572         /* fall through */
 1573
 1574 case SP_STATE_READING_PROC_HEADER:
 1575         __socket_proto_read (priv, ret);
 1576
 1577         gf_trace_add("xdrmem_create", default_read_size, (uintptr_t)proghdr_buf);
 1578         /* there can be 'xdata' in read response, figure it out */
 1579         xdrmem_create (&xdr, proghdr_buf, default_read_size,
 1580                        XDR_DECODE);
 1581
 1582         /* This will fail if there is xdata sent from server, if not,
 1583            well and good, we don't need to worry about  */
 1584         xdr_gfs3_read_rsp (&xdr, &read_rsp);
 1585
 1586         free (read_rsp.xdata.xdata_val);
 1587
 1588         /* need to round off to proper roof (%4), as XDR packing pads
 1589            the end of opaque object with '0' */
 1590         size = roof (read_rsp.xdata.xdata_len, 4);
 1591
 1592         if (!size) {
 1593                 frag->call_body.reply.accepted_success_state
 1594                         = SP_STATE_READ_PROC_OPAQUE;
 1595                 goto read_proc_opaque;
 1596         }
 1597
 1598         __socket_proto_init_pending (priv, size);
 1599
 1600         frag->call_body.reply.accepted_success_state
 1601                 = SP_STATE_READING_PROC_OPAQUE;

The main problem here is that we are using two local variables 
(proghdr_buf and default_read_size) in two distinct states that might be 
called at different times.


The particular case that is failing is the following:

1. In state SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT, everything is prepared 
to read 116 bytes. default_read_size is set to 116 and proghdr_buf 
points to the buffer where data will be written.


2. In state SP_STATE_READING_PROC_HEADER, a partial read of 88 bytes is 
done. At this point the function returns and proghdr_buf and 
default_read_size are lost.


3. When more data is available, this function is called again and it 
starts executing at state SP_STATE_READING_PROC_HEADER.


4. The remaining 28 bytes are read.

5. When it checks the buffer and tries to decode it to see if there's 
xdata present, it uses the default values of proghdr_buf and 
default_read_size, which are 0. This causes the decode to leave 
read_rsp.xdata.xdata_len set to 0.


6. The program interprets xdata_len being 0 as meaning that there's no 
xdata, so it continues reading the remainder of the RPC packet into the 
payload buffer.


If you want, I can send a patch for this.
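
The general pattern that avoids this class of bug is to keep anything
computed before a partial read inside the per-fragment state instead of
in stack locals. A small sketch (invented names, not the actual socket.c
change):

/* Values needed after a partial read must be persisted in the
 * per-fragment state, because the state machine can return and
 * re-enter between the two states. */
typedef struct {
        int      state;
        char    *proghdr_buf;        /* persisted, not a stack local */
        unsigned default_read_size;  /* persisted, not a stack local */
} frag_state_t;

enum { ST_INIT, ST_READING_HEADER };

static void
on_init (frag_state_t *frag, char *cur, unsigned size)
{
        /* Save everything the next state will need... */
        frag->proghdr_buf       = cur;
        frag->default_read_size = size;
        frag->state             = ST_READING_HEADER;
}

static unsigned
on_header_complete (const frag_state_t *frag)
{
        /* ...so that after re-entering here following a partial read,
         * the buffer pointer and size are still valid, e.g.:
         *   xdrmem_create (&xdr, frag->proghdr_buf,
         *                  frag->default_read_size, XDR_DECODE);
         */
        return frag->default_read_size;
}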

Xavi

On 05/05/16 10:21, Xavier Hernandez wrote:

I've undone all changes and now I'm unable to reproduce the problem, so
the modification I did is probably incorrect and not the root cause, as
you described.

I'll continue investigating...

Xavi

On 04/05/16 15:01, Xavier Hernandez wrote:

On 04/05/16 14:47, Raghavendra Gowdappa wrote:



- Original Message -

From: "Xavier Hernandez" <xhernan...@datalab.es>
To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>
Sent: Wednesday, May 4, 2016 5:37:56 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?

I think I've found the problem.

1567 case SP_STATE_READING_PROC_HEADER:
  1568 __socket_proto_read (priv, ret);
  1569
  1570 /* there can be 'xdata' in read response, figure
it out */
  1571 xdrmem_create (, proghdr_buf,
default_read_size,
  1572XDR_DECODE);
  1573
  1574 /* This will fail if there is xdata sent from
server, if not,
  1575well and good, we don't need to worry
about  */
  1576 xdr_gfs3_read_rsp (, _rsp);
  1577
  1578 free (read_rsp.xdata.xdata_val);
  1579
  1580 /* need to round off to proper roof (%4), as XDR
packing pads
  1581the end of opaque object with '0' */
  1582 size = roof (read_rsp.xdata.xdata_len, 4);
  1583
  1584 if (!size) {
  1585
frag->call_body.reply.accepted_success_state
  1586 = SP_STATE_READ_PROC_OPAQUE;
  1587 goto read_proc_opaque;
  1588 }
  1589
  159

Re: [Gluster-devel] Bugs with incorrect status

2016-05-06 Thread Xavier Hernandez
I think there's a problem with the script that generates this report. 
The changes I2fac59 and Ie1934f are bound to bug 1332054, not 1236065.


Xavi

On 06/05/16 10:41, Niels de Vos wrote:

1236065 (mainline) MODIFIED: Disperse volume: FUSE I/O error after self healing 
the failed disk files
  [master] I2fac59 cluster/ec: Fix spurious failure of test bug-1236065.t (NEW)
  [master] Ie1934f disperse: mark bug-1236065.t as bad_test (MERGED)
  [master] I225e31 cluster/ec: Fix tracking of good bricks (MERGED)
  ** xhernan...@datalab.es: Bug 1236065 should be in POST, change I2fac59 under 
review **

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] Fwd: dht_is_subvol_filled messages on client

2016-05-05 Thread Xavier Hernandez

On 05/05/16 13:59, Kaushal M wrote:

On Thu, May 5, 2016 at 4:37 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:

On 05/05/16 11:31, Kaushal M wrote:


On Thu, May 5, 2016 at 2:36 PM, David Gossage
<dgoss...@carouselchecks.com> wrote:





On Thu, May 5, 2016 at 3:28 AM, Serkan Çoban <cobanser...@gmail.com>
wrote:



Hi,

You can find the output below link:
https://www.dropbox.com/s/wzrh5yp494ogksc/status_detail.txt?dl=0

Thanks,
Serkan




Maybe not issue, but playing one of these things is not like the other I
notice of all the bricks only one seems to be different at a quick glance

Brick: Brick 1.1.1.235:/bricks/20
TCP Port : 49170
RDMA Port: 0
Online   : Y
Pid  : 26736
File System  : ext4
Device   : /dev/mapper/vol0-vol_root
Mount Options: rw,relatime,data=ordered
Inode Size   : 256
Disk Space Free  : 86.1GB
Total Disk Space : 96.0GB
Inode Count  : 6406144
Free Inodes  : 6381374

Every other brick seems to be 7TB and xfs but this one.



Looks like the brick fs isn't mounted, and the root-fs is being used
instead. But that still leaves enough inodes free.

What I suspect is that one of the cluster translators is mixing up
stats when aggregating from multiple bricks.
From the log snippet you gave in the first mail, it seems like the
disperse translator is possibly involved.



Currently ec takes the number of potential files in the subvolume (f_files)
as the maximum of all its subvolumes, but it takes the available count
(f_ffree) as the minumum of all its volumes.

This causes max to be ~781.000.000, but free will be ~6.300.000. This gives
a ~0.8% available, i.e. almost 100% full.

Given the circumstances I think it's the correct thing to do.


Thanks for giving the reasoning Xavi.

But why is the number of potential files the maximum?
IIUC, a file (or parts of it) will be written to all subvolumes in the
disperse set.
So wouldn't the smallest subvolume limit the number of files that
could be possibly created?


I'm not very sure why this decision was taken. In theory ec only 
supports identical subvolumes because of the way it works. This means 
that all bricks should report the same maximum.


When this doesn't happen, I suppose the motivation was that this 
number should report the theoretical maximum number of files that the 
volume can contain.
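
For clarity, the aggregation being discussed is essentially this (an
illustrative sketch, not the real ec statfs combination code):

#include <stdint.h>

typedef struct {
        uint64_t f_files;   /* total inodes */
        uint64_t f_ffree;   /* free inodes  */
} inode_stats_t;

/* 'acc' should be seeded from the first subvolume before combining
 * the remaining ones. */
static void
combine_inode_stats (inode_stats_t *acc, const inode_stats_t *brick)
{
        /* total: the maximum reported by any subvolume */
        if (brick->f_files > acc->f_files)
                acc->f_files = brick->f_files;

        /* free: the minimum, so availability is never overestimated */
        if (brick->f_ffree < acc->f_ffree)
                acc->f_ffree = brick->f_ffree;
}

With one brick reporting only ~6.4 million total inodes while the others
report ~781 million, the combined view ends up as ~781 million total but
only ~6.3 million free, which is why dht sees the subvolume as almost
100% full.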




~kaushal



Xavi




BTW, how large is the volume you have? Those are a lot of bricks!

~kaushal









On Thu, May 5, 2016 at 9:33 AM, Xavier Hernandez <xhernan...@datalab.es>
wrote:


Can you post the result of 'gluster volume status v0 detail' ?


On 05/05/16 06:49, Serkan Çoban wrote:



Hi, Can anyone suggest something for this issue? df, du has no issue
for the bricks yet one subvolume not being used by gluster..

On Wed, May 4, 2016 at 4:40 PM, Serkan Çoban <cobanser...@gmail.com>
wrote:



Hi,

I changed cluster.min-free-inodes to "0". Remount the volume on
clients. inode full messages not coming to syslog anymore but I see
disperse-56 subvolume still not being used.
Anything I can do to resolve this issue? Maybe I can destroy and
recreate the volume but I am not sure It will fix this issue...
Maybe the disperse size 16+4 is too big should I change it to 8+2?

On Tue, May 3, 2016 at 2:36 PM, Serkan Çoban <cobanser...@gmail.com>
wrote:



I also checked the df output all 20 bricks are same like below:
/dev/sdu1 7.3T 34M 7.3T 1% /bricks/20

On Tue, May 3, 2016 at 1:40 PM, Raghavendra G
<raghaven...@gluster.com>
wrote:





On Mon, May 2, 2016 at 11:41 AM, Serkan Çoban
<cobanser...@gmail.com>
wrote:





1. What is the out put of du -hs ? Please get
this
information for each of the brick that are part of disperse.





Sorry. I needed df output of the filesystem containing brick. Not
du.
Sorry
about that.



There are 20 bricks in disperse-56 and the du -hs output is like:
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
1.8M /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20

I see that gluster is not writing to this disperse set. All other
disperse sets are filled 13GB but this one is empty. I see
directory
structure created but no files in directories.
How can I fix the issue? I will try to rebalance but I don't think
it
will write to this disperse set...



On Sat, Apr 30, 2016 at 9:22 AM, Raghavendra G
<raghaven...@gluster.com>
wrote:





On Fri, Apr 29, 2016 at 12:32 AM, Serkan Çoban
<cobanser...@gmail.com>
wrote:




Hi, I cannot get an answer from user list, so asking to devel
list.

I am getting [dht-diskusage.c:277:dht_is_subvol_filled]
0-v0-dht:
inodes on subvolume 'v0-disperse-56' are at (100.00 %), consider
adding more bricks.

message 

Re: [Gluster-devel] [Gluster-users] Fwd: dht_is_subvol_filled messages on client

2016-05-05 Thread Xavier Hernandez

On 05/05/16 11:31, Kaushal M wrote:

On Thu, May 5, 2016 at 2:36 PM, David Gossage
<dgoss...@carouselchecks.com> wrote:




On Thu, May 5, 2016 at 3:28 AM, Serkan Çoban <cobanser...@gmail.com> wrote:


Hi,

You can find the output below link:
https://www.dropbox.com/s/wzrh5yp494ogksc/status_detail.txt?dl=0

Thanks,
Serkan



Maybe not issue, but playing one of these things is not like the other I
notice of all the bricks only one seems to be different at a quick glance

Brick: Brick 1.1.1.235:/bricks/20
TCP Port : 49170
RDMA Port: 0
Online   : Y
Pid  : 26736
File System  : ext4
Device   : /dev/mapper/vol0-vol_root
Mount Options: rw,relatime,data=ordered
Inode Size   : 256
Disk Space Free  : 86.1GB
Total Disk Space : 96.0GB
Inode Count  : 6406144
Free Inodes  : 6381374

Every other brick seems to be 7TB and xfs but this one.


Looks like the brick fs isn't mounted, and the root-fs is being used
instead. But that still leaves enough inodes free.

What I suspect is that one of the cluster translators is mixing up
stats when aggregating from multiple bricks.
From the log snippet you gave in the first mail, it seems like the
disperse translator is possibly involved.


Currently ec takes the number of potential files in the subvolume 
(f_files) as the maximum of all its subvolumes, but it takes the 
available count (f_ffree) as the minimum of all its subvolumes.


This causes max to be ~781.000.000, but free will be ~6.300.000. This 
gives ~0.8% available, i.e. almost 100% full.


Given the circumstances I think it's the correct thing to do.

Xavi



BTW, how large is the volume you have? Those are a lot of bricks!

~kaushal









On Thu, May 5, 2016 at 9:33 AM, Xavier Hernandez <xhernan...@datalab.es>
wrote:

Can you post the result of 'gluster volume status v0 detail' ?


On 05/05/16 06:49, Serkan Çoban wrote:


Hi, Can anyone suggest something for this issue? df, du has no issue
for the bricks yet one subvolume not being used by gluster..

On Wed, May 4, 2016 at 4:40 PM, Serkan Çoban <cobanser...@gmail.com>
wrote:


Hi,

I changed cluster.min-free-inodes to "0". Remount the volume on
clients. inode full messages not coming to syslog anymore but I see
disperse-56 subvolume still not being used.
Anything I can do to resolve this issue? Maybe I can destroy and
recreate the volume but I am not sure It will fix this issue...
Maybe the disperse size 16+4 is too big should I change it to 8+2?

On Tue, May 3, 2016 at 2:36 PM, Serkan Çoban <cobanser...@gmail.com>
wrote:


I also checked the df output all 20 bricks are same like below:
/dev/sdu1 7.3T 34M 7.3T 1% /bricks/20

On Tue, May 3, 2016 at 1:40 PM, Raghavendra G
<raghaven...@gluster.com>
wrote:




On Mon, May 2, 2016 at 11:41 AM, Serkan Çoban
<cobanser...@gmail.com>
wrote:




1. What is the out put of du -hs ? Please get
this
information for each of the brick that are part of disperse.




Sorry. I needed df output of the filesystem containing brick. Not
du.
Sorry
about that.



There are 20 bricks in disperse-56 and the du -hs output is like:
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
1.8M /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20

I see that gluster is not writing to this disperse set. All other
disperse sets are filled 13GB but this one is empty. I see
directory
structure created but no files in directories.
How can I fix the issue? I will try to rebalance but I don't think
it
will write to this disperse set...



On Sat, Apr 30, 2016 at 9:22 AM, Raghavendra G
<raghaven...@gluster.com>
wrote:




On Fri, Apr 29, 2016 at 12:32 AM, Serkan Çoban
<cobanser...@gmail.com>
wrote:



Hi, I cannot get an answer from user list, so asking to devel
list.

I am getting [dht-diskusage.c:277:dht_is_subvol_filled] 0-v0-dht:
inodes on subvolume 'v0-disperse-56' are at (100.00 %), consider
adding more bricks.

message on client logs.My cluster is empty there are only a
couple
of
GB files for testing. Why this message appear in syslog?




dht uses disk usage information from backend export.

1. What is the out put of du -hs ? Please get
this
information for each of the brick that are part of disperse.
2. Once you get du information from each brick, the value seen by
dht
will
be based on how cluster/disperse aggregates du info (basically
statfs
fop).

The reason for 100% disk usage may be,
In case of 1, backend fs might be shared by data other than brick.
In case of 2, some issues with aggregation.


Is is safe to
ignore it?




dht will try not to have data files on the subvol in question
(v0-disperse-56). Hence lookup cost will be two hops for files
hashing
to
disperse-5

Re: [Gluster-devel] [Gluster-users] Fwd: dht_is_subvol_filled messages on client

2016-05-05 Thread Xavier Hernandez

Can you post the result of 'gluster volume status v0 detail' ?

On 05/05/16 06:49, Serkan Çoban wrote:

Hi, Can anyone suggest something for this issue? df, du has no issue
for the bricks yet one subvolume not being used by gluster..

On Wed, May 4, 2016 at 4:40 PM, Serkan Çoban  wrote:

Hi,

I changed cluster.min-free-inodes to "0". Remount the volume on
clients. inode full messages not coming to syslog anymore but I see
disperse-56 subvolume still not being used.
Anything I can do to resolve this issue? Maybe I can destroy and
recreate the volume but I am not sure It will fix this issue...
Maybe the disperse size 16+4 is too big should I change it to 8+2?

On Tue, May 3, 2016 at 2:36 PM, Serkan Çoban  wrote:

I also checked the df output all 20 bricks are same like below:
/dev/sdu1 7.3T 34M 7.3T 1% /bricks/20

On Tue, May 3, 2016 at 1:40 PM, Raghavendra G  wrote:



On Mon, May 2, 2016 at 11:41 AM, Serkan Çoban  wrote:



1. What is the out put of du -hs ? Please get this
information for each of the brick that are part of disperse.



Sorry. I needed df output of the filesystem containing brick. Not du. Sorry
about that.



There are 20 bricks in disperse-56 and the du -hs output is like:
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
1.8M /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20
80K /bricks/20

I see that gluster is not writing to this disperse set. All other
disperse sets are filled with 13GB but this one is empty. I see the directory
structure created but no files in the directories.
How can I fix the issue? I will try to rebalance but I don't think it
will write to this disperse set...



On Sat, Apr 30, 2016 at 9:22 AM, Raghavendra G 
wrote:



On Fri, Apr 29, 2016 at 12:32 AM, Serkan Çoban 
wrote:


Hi, I cannot get an answer from the user list, so asking the devel list.

I am getting [dht-diskusage.c:277:dht_is_subvol_filled] 0-v0-dht:
inodes on subvolume 'v0-disperse-56' are at (100.00 %), consider
adding more bricks.

message on client logs. My cluster is empty; there are only a couple of
GB of files for testing. Why does this message appear in syslog?



dht uses disk usage information from backend export.

1. What is the output of du -hs ? Please get this
information for each of the bricks that are part of disperse.
2. Once you get du information from each brick, the value seen by dht
will
be based on how cluster/disperse aggregates du info (basically statfs
fop).

The reason for 100% disk usage may be,
In case of 1, backend fs might be shared by data other than brick.
In case of 2, some issues with aggregation.


Is it safe to
ignore it?



dht will try not to have data files on the subvol in question
(v0-disperse-56). Hence lookup cost will be two hops for files hashing to
disperse-56 (note that other fops like read/write/open still have the cost
of a single hop and don't suffer from this penalty). Other than that there
is no significant harm unless disperse-56 is really running out of space.

regards,
Raghavendra


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel





--
Raghavendra G

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel





--
Raghavendra G

___
Gluster-users mailing list
gluster-us...@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Possible bug in the communications layer ?

2016-05-04 Thread Xavier Hernandez

On 04/05/16 14:47, Raghavendra Gowdappa wrote:



- Original Message -

From: "Xavier Hernandez" <xhernan...@datalab.es>
To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>
Sent: Wednesday, May 4, 2016 5:37:56 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?

I think I've found the problem.

1567                 case SP_STATE_READING_PROC_HEADER:
1568                         __socket_proto_read (priv, ret);
1569
1570                         /* there can be 'xdata' in read response, figure it out */
1571                         xdrmem_create (&xdr, proghdr_buf, default_read_size,
1572                                        XDR_DECODE);
1573
1574                         /* This will fail if there is xdata sent from server, if not,
1575                            well and good, we don't need to worry about  */
1576                         xdr_gfs3_read_rsp (&xdr, &read_rsp);
1577
1578                         free (read_rsp.xdata.xdata_val);
1579
1580                         /* need to round off to proper roof (%4), as XDR packing pads
1581                            the end of opaque object with '0' */
1582                         size = roof (read_rsp.xdata.xdata_len, 4);
1583
1584                         if (!size) {
1585                                 frag->call_body.reply.accepted_success_state
1586                                         = SP_STATE_READ_PROC_OPAQUE;
1587                                 goto read_proc_opaque;
1588                         }
1589
1590                         __socket_proto_init_pending (priv, size);
1591
1592                         frag->call_body.reply.accepted_success_state
1593                                 = SP_STATE_READING_PROC_OPAQUE;
1594
1595                 case SP_STATE_READING_PROC_OPAQUE:
1596                         __socket_proto_read (priv, ret);
1597
1598                         frag->call_body.reply.accepted_success_state
1599                                 = SP_STATE_READ_PROC_OPAQUE;

On line 1568 we read, at most, 116 bytes because we calculate the size
of a read response without xdata. Then we detect that we really need
more data for xdata (BTW, will read_rsp.xdata.xdata_val always be
allocated even if xdr_gfs3_read_rsp() fails?)


No. It need not be. It's guaranteed that only on a successful completion it is
allocated. However, _if_ decoding fails only because xdr stream doesn't include 
xdata bits, but xdata_len is zero (by initializing it to default_read_size), 
then xdr library would've filled read_rsp.xdata.xdata_len 
(read_rsp.xdata.xdata_val can still be NULL).


The question is: is it guaranteed that after an unsuccessful completion
xdata_val will be NULL (i.e. not touched by the function, even if
xdata_len is != 0)? Otherwise the free() could corrupt memory.






So we get into line 1596 with the pending info initialized to read the
remaining data. This is the __socket_proto_read macro:

 166 /* This will be used in a switch case and breaks from the switch case if all
 167  * the pending data is not read.
 168  */
 169 #define __socket_proto_read(priv, ret)                                  \
 170 {                                                                       \
 171         size_t bytes_read = 0;                                          \
 172         struct gf_sock_incoming *in;                                    \
 173         in = &priv->incoming;                                           \
 174                                                                         \
 175         __socket_proto_update_pending (priv);                           \
 176                                                                         \
 177         ret = __socket_readv (this,                                     \
 178                               in->pending_vector, 1,                    \
 179                               &in->pending_vector,                      \
 180                               &in->pending_count,                       \
 181                               &bytes_read);                             \
 182         if (ret == -1)                                                  \
 183                 break;                                                  \
 184         __socket_proto_update_priv_after_read (priv, ret, bytes_read);  \
 185 }

We read from the socket using __socket_readv(). If it fails, we quit.
However, if the socket doesn't have more data to read, this function does
not return -1:

 555                 ret = __socket_cached_read (this, opvector, opcount);
 556
 557                 if (ret == 0) {
 558                         gf_log(this->name,GF_LOG_DEBUG,"EOF on socket");
 559                         errno = ENODATA;
 560                         ret = -1;
 561                 }
 562                 if (ret == -1 && errno == EAGAIN) {
 563                         /* done for now */
 564                         break;
 565                 }
 566                 this->total_bytes_read += ret;

If __socket_cached_read() fails with errno == EAGAIN, we break and
return opcount, which is >= 0, causing the process to continue instead
of waiting for more data.


No. If you observe, there is a call to another macro 
__socket_proto_update_priv_after_rea

Re: [Gluster-devel] Possible bug in the communications layer ?

2016-05-04 Thread Xavier Hernandez

I think I've found the problem.

1567                 case SP_STATE_READING_PROC_HEADER:
1568                         __socket_proto_read (priv, ret);
1569
1570                         /* there can be 'xdata' in read response, figure it out */
1571                         xdrmem_create (&xdr, proghdr_buf, default_read_size,
1572                                        XDR_DECODE);
1573
1574                         /* This will fail if there is xdata sent from server, if not,
1575                            well and good, we don't need to worry about  */
1576                         xdr_gfs3_read_rsp (&xdr, &read_rsp);
1577
1578                         free (read_rsp.xdata.xdata_val);
1579
1580                         /* need to round off to proper roof (%4), as XDR packing pads
1581                            the end of opaque object with '0' */
1582                         size = roof (read_rsp.xdata.xdata_len, 4);
1583
1584                         if (!size) {
1585                                 frag->call_body.reply.accepted_success_state
1586                                         = SP_STATE_READ_PROC_OPAQUE;
1587                                 goto read_proc_opaque;
1588                         }
1589
1590                         __socket_proto_init_pending (priv, size);
1591
1592                         frag->call_body.reply.accepted_success_state
1593                                 = SP_STATE_READING_PROC_OPAQUE;
1594
1595                 case SP_STATE_READING_PROC_OPAQUE:
1596                         __socket_proto_read (priv, ret);
1597
1598                         frag->call_body.reply.accepted_success_state
1599                                 = SP_STATE_READ_PROC_OPAQUE;

On line 1568 we read, at most, 116 bytes because we calculate the size
of a read response without xdata. Then we detect that we really need
more data for xdata (BTW, will read_rsp.xdata.xdata_val always be
allocated even if xdr_gfs3_read_rsp() fails?)


So we get into line 1596 with the pending info initialized to read the 
remaining data. This is the __socket_proto_read macro:


 166 /* This will be used in a switch case and breaks from the switch case if all
 167  * the pending data is not read.
 168  */
 169 #define __socket_proto_read(priv, ret)                                  \
 170 {                                                                       \
 171         size_t bytes_read = 0;                                          \
 172         struct gf_sock_incoming *in;                                    \
 173         in = &priv->incoming;                                           \
 174                                                                         \
 175         __socket_proto_update_pending (priv);                           \
 176                                                                         \
 177         ret = __socket_readv (this,                                     \
 178                               in->pending_vector, 1,                    \
 179                               &in->pending_vector,                      \
 180                               &in->pending_count,                       \
 181                               &bytes_read);                             \
 182         if (ret == -1)                                                  \
 183                 break;                                                  \
 184         __socket_proto_update_priv_after_read (priv, ret, bytes_read);  \
 185 }

We read from the socket using __socket_readv(). If it fails, we quit.
However, if the socket doesn't have more data to read, this function does
not return -1:


 555                 ret = __socket_cached_read (this, opvector, opcount);
 556
 557                 if (ret == 0) {
 558                         gf_log(this->name,GF_LOG_DEBUG,"EOF on socket");
 559                         errno = ENODATA;
 560                         ret = -1;
 561                 }
 562                 if (ret == -1 && errno == EAGAIN) {
 563                         /* done for now */
 564                         break;
 565                 }
 566                 this->total_bytes_read += ret;

If __socket_cached_read() fails with errno == EAGAIN, we break and
return opcount, which is >= 0, causing the process to continue instead
of waiting for more data.


As a side note, there's another problem here: if errno is not EAGAIN,
we'll update this->total_bytes_read, subtracting one. This shouldn't be
done when ret < 0.
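
Something along these lines would avoid that (an illustrative sketch only,
not a tested patch; it just guards the accumulation discussed above):

                /* only account for bytes that were actually read */
                if (ret > 0)
                        this->total_bytes_read += ret;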


There are other places where ret is set to -1, but opcount is returned. 
I guess that we should also set opcount = -1 on these places, but I 
don't have a deep knowledge about this implementation.


I've done a quick test checking for (ret != 0) instead of (ret == -1) in 
__socket_proto_read() and it seemed to work.
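
For reference, the change I tried looks roughly like this (a sketch based on
the __socket_proto_read() macro quoted above, using the same variable names;
not a reviewed fix):

        ret = __socket_readv (this,
                              in->pending_vector, 1,
                              &in->pending_vector,
                              &in->pending_count,
                              &bytes_read);
        /* treat any non-zero return as "not finished yet" so the state
           machine breaks out and waits for more data, instead of going
           on with a partially read header */
        if (ret != 0)
                break;
        __socket_proto_update_priv_after_read (priv, ret, bytes_read);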


Could anyone with more knowledge about the communications layer verify 
this and explain what would be the best solution ?


Xavi

On 29/04/16 14:52, Xavier Hernandez wrote:

With your patch applied, it seems that the bug is not hit.

I guess it's a timing issue that the new logging hides. Maybe no more
data is available after reading the partial readv header? (it will arrive
later).

I'll continue testing...

Xavi

On 29/04/16 13:48, Raghavendra Gowdappa wrote:

Attaching the patch.

- Original Message -

From: "Raghavendra Gowdappa" <rgowd...@redhat.com

[Gluster-devel] Improve EXPECT/EXPECT_WITHIN result check in tests

2016-05-02 Thread Xavier Hernandez

Hi,

I've found a spurious failure caused by an incorrect check of the
expected value in EXPECT_WITHIN.


The problem is that the value passed to EXPECT_WITHIN (EXPECT also has 
the same problem) is considered a regular expression but most tests do 
not pass a full/valid regular expression.


For example, most tests expect a '0' result and they pass "0" as an 
argument to EXPECT/EXPECT_WITHIN. This will match with "0", that's ok, 
but it will also match with 10, 20, 102, ... and that's bad.


There are also some tests that do use a regular expression, like "^0$" 
to correctly match only "0", however current implementation of 
EXPECT_WITHIN uses the following check:


if [[ "$a" =~ "$e" ]]; then

Where "$e" is the regular expression. However putting $e between 
quotation marks (") makes special regular expression characters to not 
be considered. This means that "^0$" will be searched literally and it 
won't never match. When the timeout expires, test_expect_footer is 
called, which does the same check but without using quotation marks. At 
this time the check succeeds, but we have waited an unnecessary amount 
of time.


This is not the first time that I've found something similar.

Would it be ok to change all regular expression checks to something like 
this in include.rc ?


if [[ "$a" =~ ^$e$ ]]; then

This will allow using regular expressions in EXPECT and EXPECT_WITHIN, 
but will enforce full answer match in all cases, avoiding some possible 
side effects.


This needs some changes in many tests, but I think it's worth doing.

What do you think ?

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Possible bug in the communications layer ?

2016-04-29 Thread Xavier Hernandez

With your patch applied, it seems that the bug is not hit.

I guess it's a timing issue that the new logging hides. Maybe no more
data is available after reading the partial readv header? (it will arrive
later).


I'll continue testing...

Xavi

On 29/04/16 13:48, Raghavendra Gowdappa wrote:

Attaching the patch.

- Original Message -

From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
To: "Xavier Hernandez" <xhernan...@datalab.es>
Cc: "Gluster Devel" <gluster-devel@gluster.org>
Sent: Friday, April 29, 2016 5:14:02 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?



- Original Message -

From: "Xavier Hernandez" <xhernan...@datalab.es>
To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>
Sent: Friday, April 29, 2016 1:21:57 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?

Hi Raghavendra,

yes, the readv response contains xdata. The dict length is 38 (0x26)
and, at the moment of failure, rsp.xdata.xdata_len already contains 0x26.


rsp.xdata.xdata_len having 0x26 even when decoding failed indicates that the
approach used in socket.c to get the length of xdata is correct. However, I
cannot find any other way of xdata going into payload vector other than
xdata_len being zero. Just to be double sure, I've a patch containing debug
message printing xdata_len when decoding fails in socket.c. Can you please
apply the patch, run the tests and revert back with results?



Xavi

On 29/04/16 09:10, Raghavendra Gowdappa wrote:



- Original Message -

From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
To: "Xavier Hernandez" <xhernan...@datalab.es>
Cc: "Gluster Devel" <gluster-devel@gluster.org>
Sent: Friday, April 29, 2016 12:36:43 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?



- Original Message -

From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
To: "Xavier Hernandez" <xhernan...@datalab.es>
Cc: "Jeff Darcy" <jda...@redhat.com>, "Gluster Devel"
<gluster-devel@gluster.org>
Sent: Friday, April 29, 2016 12:07:59 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?



- Original Message -

From: "Xavier Hernandez" <xhernan...@datalab.es>
To: "Jeff Darcy" <jda...@redhat.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>
Sent: Thursday, April 28, 2016 8:15:36 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer
?



Hi Jeff,

On 28.04.2016 15:20, Jeff Darcy wrote:



This happens with Gluster 3.7.11 accessed through Ganesha and gfapi.
The
volume is a distributed-disperse 4*(4+2). I'm able to reproduce the
problem
easily doing the following test: iozone -t2 -s10g -r1024k -i0 -w
-F/iozone{1..2}.dat echo 3 >/proc/sys/vm/drop_caches iozone -t2 -s10g
-r1024k -i1 -w -F/iozone{1..2}.dat The error happens soon after
starting
the
read test. As can be seen in the data below, client3_3_readv_cbk() is
processing an iovec of 116 bytes, however it should be of 154 bytes
(the
buffer in memory really seems to contain 154 bytes). The data on the
network
seems ok (at least I haven't been able to identify any problem), so
this
must be a processing error on the client side. The last field in cut
buffer
of the sequentialized data corresponds to the length of the xdata
field:
0x26. So at least 38 more byte should be present.
Nice detective work, Xavi.  It would be *very* interesting to see what
the value of the "count" parameter is (it's unfortunately optimized
out).
I'll bet it's two, and iov[1].iov_len is 38.  I have a weak memory of
some problems with how this iov is put together, a couple of years
ago,
and it looks like you might have tripped over one more.
It seems you are right. The count is 2 and the first 38 bytes of the
second
vector contains the remaining data of xdata field.


This is the bug. client3_3_readv_cbk (and for that matter all the
actors/cbks) expects the response in at most two vectors:
1. Program header containing request or response. This is subjected to
decoding/encoding. This vector should point to a buffer that contains
the
entire program header/response contiguously.
2. If the procedure returns payload (like readv response or a write
request),
second vector contains the buffer pointing to the entire (contiguous)
payload. Note that this payload is raw and is not subjected to
encoding/decoding.

In your case, this _clean_ separation is broken with part of program
header
slipping into 2nd vector supposed to contain read data (may be because
of
rpc fragmentation). I think this is a bug in socket layer. I'll update
more
on this.


Does your read response include xdata too? I think the code related to
reading xdata in readv response

Re: [Gluster-devel] Regression-test-burn-in crash in EC test

2016-04-29 Thread Xavier Hernandez

Hi Jeff,

On 27/04/16 20:01, Jeff Darcy wrote:

One of the "rewards" of reviewing and merging people's patches is getting email 
if the next regression-test-burn-in should fail - even if it fails for a completely 
unrelated reason.  Today I got one that's not among the usual suspects.  The failure was 
a core dump in tests/bugs/disperse/bug-1304988.t, weighing in at a respectable 42 frames.

#0  0x7fef25976cb9 in dht_rename_lock_cbk
#1  0x7fef25955f62 in dht_inodelk_done
#2  0x7fef25957352 in dht_blocking_inodelk_cbk
#3  0x7fef32e02f8f in default_inodelk_cbk
#4  0x7fef25c029a3 in ec_manager_inodelk
#5  0x7fef25bf9802 in __ec_manager
#6  0x7fef25bf990c in ec_manager
#7  0x7fef25c03038 in ec_inodelk
#8  0x7fef25bee7ad in ec_gf_inodelk
#9  0x7fef25957758 in dht_blocking_inodelk_rec
#10 0x7fef25957b2d in dht_blocking_inodelk
#11 0x7fef2597713f in dht_rename_lock
#12 0x7fef25977835 in dht_rename
#13 0x7fef32e0f032 in default_rename
#14 0x7fef32e0f032 in default_rename
#15 0x7fef32e0f032 in default_rename
#16 0x7fef32e0f032 in default_rename
#17 0x7fef32e0f032 in default_rename
#18 0x7fef32e07c29 in default_rename_resume
#19 0x7fef32d8ed40 in call_resume_wind
#20 0x7fef32d98b2f in call_resume
#21 0x7fef24cfc568 in open_and_resume
#22 0x7fef24cffb99 in ob_rename
#23 0x7fef24aee482 in mdc_rename
#24 0x7fef248d68e5 in io_stats_rename
#25 0x7fef32e0f032 in default_rename
#26 0x7fef2ab1b2b9 in fuse_rename_resume
#27 0x7fef2ab12c47 in fuse_fop_resume
#28 0x7fef2ab107cc in fuse_resolve_done
#29 0x7fef2ab108a2 in fuse_resolve_all
#30 0x7fef2ab10900 in fuse_resolve_continue
#31 0x7fef2ab0fb7c in fuse_resolve_parent
#32 0x7fef2ab1077d in fuse_resolve
#33 0x7fef2ab10879 in fuse_resolve_all
#34 0x7fef2ab10900 in fuse_resolve_continue
#35 0x7fef2ab0fb7c in fuse_resolve_parent
#36 0x7fef2ab1077d in fuse_resolve
#37 0x7fef2ab10824 in fuse_resolve_all
#38 0x7fef2ab1093e in fuse_resolve_and_resume
#39 0x7fef2ab1b40e in fuse_rename
#40 0x7fef2ab2a96a in fuse_thread_proc
#41 0x7fef3204daa1 in start_thread

In other words we started at FUSE, went through a bunch of performance 
translators, through DHT to EC, and then crashed on the way back.  It seems a 
little odd that we turn the fop around immediately in EC, and that we have 
default_inodelk_cbk at frame 3.  Could one of the DHT or EC people please take 
a look at it?  Thanks!


The part regarding ec seems ok. This is uncommon, but can happen.
When ec_gf_inodelk() is called, it sends an inodelk request to all its
subvolumes. It may happen that the callbacks of all these requests are
received before returning from ec_gf_inodelk() itself. This executes the
callback in the same thread as the caller.
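
A toy illustration of that pattern (not gluster code; the names here are
made up):

static void done_cbk (int op_ret)
{
        (void) op_ret;          /* result is reported to the upper layer here */
}

typedef void (*cbk_t) (int op_ret);

static void wind_inodelk (cbk_t cbk)
{
        /* if all subvolume answers are already available, the final
           callback fires right here, in the caller's own thread */
        cbk (0);
}

static void caller (void)
{
        wind_inodelk (done_cbk);
        /* done_cbk() may already have run before this line executes,
           which is why the cbk appears below the call in the backtrace */
}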


The reason why default_inodelk_cbk() is seen is because ec uses this 
function to report the result back to the caller (instead of calling 
STACK_UNWIND() itself).


This seems to be what has happened here.

The frames returned by ec to upper xlators are the same used by them 
(the frame in dht_blocking_lock() is the same that receives 
dht_blocking_inodelk_cbk()) and ec doesn't touch them, however the frame 
at 0x7fef1003ca5c is absolutely corrupted.


We can see the call state from the core:

(gdb) f 4
#4  0x7fef25c029a3 in ec_manager_inodelk (fop=0x7fef1000d37c, 
state=5) at 
/home/jenkins/root/workspace/regression-test-burn-in/xlators/cluster/ec/src/ec-locks.c:645

645 fop->cbks.inodelk(fop->req_frame, fop, fop->xl,
(gdb) print fop->answer
$30 = (ec_cbk_data_t *) 0x7fef180094ac
(gdb) print fop->answer->op_ret
$31 = 0
(gdb) print fop->answer->op_errno
$32 = 0
(gdb) print fop->answer->count
$33 = 6
(gdb) print fop->answer->mask
$34 = 63

As we can see there's an actual answer to the request with a success 
result (op_ret == 0 and op_errno == 0) composed of the combination of 
answers from 6 subvolumes (count == 6).


Looking at the dht code I have been unable to see any possible cause either.

The test is doing renames where source and target directories are 
different. At the same time a new ec-set is added and rebalance started. 
Rebalance will cause dht to also move files between bricks. Maybe this 
is causing some race in dht ?


I'll try to continue investigating when I have some time.

Xavi




https://build.gluster.org/job/regression-test-burn-in/868/console
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Possible bug in the communications layer ?

2016-04-29 Thread Xavier Hernandez

Hi Raghavendra,

yes, the readv response contains xdata. The dict length is 38 (0x26) 
and, at the moment of failure, rsp.xdata.xdata_len already contains 0x26.


Xavi

On 29/04/16 09:10, Raghavendra Gowdappa wrote:



- Original Message -

From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
To: "Xavier Hernandez" <xhernan...@datalab.es>
Cc: "Gluster Devel" <gluster-devel@gluster.org>
Sent: Friday, April 29, 2016 12:36:43 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?



- Original Message -

From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
To: "Xavier Hernandez" <xhernan...@datalab.es>
Cc: "Jeff Darcy" <jda...@redhat.com>, "Gluster Devel"
<gluster-devel@gluster.org>
Sent: Friday, April 29, 2016 12:07:59 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?



- Original Message -

From: "Xavier Hernandez" <xhernan...@datalab.es>
To: "Jeff Darcy" <jda...@redhat.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>
Sent: Thursday, April 28, 2016 8:15:36 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?



Hi Jeff,

On 28.04.2016 15:20, Jeff Darcy wrote:



This happens with Gluster 3.7.11 accessed through Ganesha and gfapi. The
volume is a distributed-disperse 4*(4+2). I'm able to reproduce the
problem
easily doing the following test: iozone -t2 -s10g -r1024k -i0 -w
-F/iozone{1..2}.dat echo 3 >/proc/sys/vm/drop_caches iozone -t2 -s10g
-r1024k -i1 -w -F/iozone{1..2}.dat The error happens soon after starting
the
read test. As can be seen in the data below, client3_3_readv_cbk() is
processing an iovec of 116 bytes, however it should be of 154 bytes (the
buffer in memory really seems to contain 154 bytes). The data on the
network
seems ok (at least I haven't been able to identify any problem), so this
must be a processing error on the client side. The last field in cut
buffer
of the sequentialized data corresponds to the length of the xdata field:
0x26. So at least 38 more byte should be present.
Nice detective work, Xavi.  It would be *very* interesting to see what
the value of the "count" parameter is (it's unfortunately optimized out).
I'll bet it's two, and iov[1].iov_len is 38.  I have a weak memory of
some problems with how this iov is put together, a couple of years ago,
and it looks like you might have tripped over one more.
It seems you are right. The count is 2 and the first 38 bytes of the
second
vector contains the remaining data of xdata field.


This is the bug. client3_3_readv_cbk (and for that matter all the
actors/cbks) expects the response in at most two vectors:
1. Program header containing request or response. This is subjected to
decoding/encoding. This vector should point to a buffer that contains the
entire program header/response contiguously.
2. If the procedure returns payload (like readv response or a write
request),
second vector contains the buffer pointing to the entire (contiguous)
payload. Note that this payload is raw and is not subjected to
encoding/decoding.

In your case, this _clean_ separation is broken with part of program header
slipping into 2nd vector supposed to contain read data (may be because of
rpc fragmentation). I think this is a bug in socket layer. I'll update more
on this.


Does your read response include xdata too? I think the code related to
reading xdata in readv response is a bit murky.



case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
        default_read_size = xdr_sizeof ((xdrproc_t) xdr_gfs3_read_rsp,
                                        &read_rsp);

        proghdr_buf = frag->fragcurrent;

        __socket_proto_init_pending (priv, default_read_size);

        frag->call_body.reply.accepted_success_state
                = SP_STATE_READING_PROC_HEADER;

        /* fall through */

case SP_STATE_READING_PROC_HEADER:
        __socket_proto_read (priv, ret);


By this time we've read read response _minus_ the xdata


I meant we have read "readv response header"



        /* there can be 'xdata' in read response, figure it out */
        xdrmem_create (&xdr, proghdr_buf, default_read_size,
                       XDR_DECODE);


We created xdr stream above with "default_read_size" (this doesn't
include xdata)


/* This will fail if there is xdata sent from server, if not,
   well and good, we don't need to worry about  */


What if xdata is present and decoding failed (as the length of the xdr
stream above - default_read_size - doesn't include xdata)? Would we have a
valid value in read_rsp.xdata.xdata_len? This is the part I am
confused about. If read_rsp.xdata.xdata_len is not correct then there
is a possibility that 

Re: [Gluster-devel] Possible bug in the communications layer ?

2016-04-28 Thread Xavier Hernandez
 

I've filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1331502
and added Raghavendra Gowdappa in the CC list (he appears as a
maintainer of RPC). 

Xavi 

On 28.04.2016 18:42, Xavier Hernandez
wrote: 

> Hi Niels, 
> 
> On 28.04.2016 15:44, Niels de Vos wrote: 
>

>> On Thu, Apr 28, 2016 at 02:43:01PM +0200, Xavier Hernandez wrote:
>>

>>> Hi, I've seen what seems a bug in the communications layer. The
first sign is an "XDR decoding failed" error in the logs. This happens
with Gluster 3.7.11 accessed through Ganesha and gfapi. The volume is a
distributed-disperse 4*(4+2). I'm able to reproduce the problem easily
doing the following test: iozone -t2 -s10g -r1024k -i0 -w
-F/iozone{1..2}.dat echo 3 >/proc/sys/vm/drop_caches iozone -t2 -s10g
-r1024k -i1 -w -F/iozone{1..2}.dat The error happens soon after starting
the read test.
>> 
>> Do you know if this only happens on disperse
volumes, or also with
>> others?
> 
> I have only seen the problem with
EC. Not sure if it's because I do most of my tests with EC or because
other volumes do not manifest this problem. However I've only seen this
when I started testing Ganesha. I have never seen the problem with
FUSE.
> 
> I think the reason is that FUSE is slower than Ganesha
(little more than 120 MB/s vs 400 MB/s) and the combination of events
needed to cause this problem would be much more unlikely to happen on
FUSE. Since EC also talks to many bricks simultaneously (6 in this
case), maybe this makes it more sensible to communications problems
compared to a replicated volume.
> 
>> If you have captured a network
trace, could you provide it to
>> me? You can use 'editcap -s0 ...' to
copy only the relevant packets.
>> But, I dont have an issue to download
a few GB either if that is easier
>> for you.
>> 
>>> As can be seen in
the data below, client3_3_readv_cbk() is processing an iovec of 116
bytes, however it should be of 154 bytes (the buffer in memory really
seems to contain 154 bytes). The data on the network seems ok (at least
I haven't been able to identify any problem), so this must be a
processing error on the client side. The last field in cut buffer of the
sequentialized data corresponds to the length of the xdata field: 0x26.
So at least 38 more byte should be present. My guess is that some corner
case is hit reading fragmented network packets due to a high load. Debug
information: Breakpoint 1, client3_3_readv_cbk (req=0x7f540e64106c,
iov=0x7f540e6410ac, count=, myframe=0x7f54259a4d54) at
client-rpc-fops.c:3021 3021 gf_msg (this->name, GF_LOG_ERROR, EINVAL,
(gdb) print *iov $1 = {iov_base = 0x7f53e994e018, iov_len = 116} (gdb)
x/116xb 0x7f53e994e018 0x7f53e994e018: 0x00 0x00 0x80 0x00 0x00 0x00
0x00 0x00 0x7f53e994e020: 0xa8 0xbf 0xa3 0xe0 0x5f 0x48 0x4c 0x1e
>> 
>>
Hmm, I'm not sure how this is laid out in memory. 0x80 would be one
of
>> first bytes in RPC payload, it signals 'last record' for the
RPC
>> procedure, and we only send one record anyway. The four bytes
combined
>> like (0x80 0x.. 0x.. 0x..) should be (0x80 |
rpc-record-size). Reading
>> this in Wireshark from a .pcap.gz is much
easier :)
> 
> The RPC header is already decoded here. The 116 bytes are
only the content of the readv answer, as decoded by xdr_gfs3_read_rsp()
function.
> 
> This means that the first 4 bytes are the op_ret, next 4
are op_errno, followed by an encoded iatt (with gfid and ino number as
the first fields). Then a size field and the length of a stream of bytes
(corresponding to the encoded xdata).
> 
> I've a network capture. I can
upload it if you want, but I think this is what you are trying to see:
>

> (gdb) f 1
> #1 0x7fa88cb5bab0 in rpc_clnt_handle_reply
(clnt=clnt@entry=0x7fa870413e30, pollin=pollin@entry=0x7fa86400efe0) at
rpc-clnt.c:764
> 764 req->cbkfn (req, req->rsp, req->rspcnt,
saved_frame->frame);
> (gdb) print *pollin
> $1 = {vector = {{iov_base =
0x7fa7ef58, iov_len = 140}, {iov_base = 0x7fa7ef46, iov_len =
32808}, {iov_base = 0x0, iov_len = 0} }, count = 2,
> vectored = 1 '01',
private = 0x7fa86c00ec60, iobref = 0x7fa86400e560, hdr_iobuf =
0x7fa868022ee0, is_reply = 1 '01'}
> (gdb) x/140xb 0x7fa7ef58
>
0x7fa7ef58: 0x00 0x00 0x89 0x68 0x00 0x00 0x00 0x01
>
0x7fa7ef580008: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
>
0x7fa7ef580010: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
>
0x7fa7ef580018: 0x00 0x00 0x80 0x00 0x00 0x00 0x00 0x00
>
0x7fa7ef580020: 0xc3 0x8e 0x35 0xf0 0x31 0xa1 0x45 0x01
>
0x7fa7ef580028: 0x8a 0x21 0x06 0x4b 0x08 0x4c 0x59 0xdf
>
0x7fa7ef580030: 0x8a 0x21 0x06 0x4b 0x08 0x4c 0x59 0xdf
>
0x7fa7ef580038: 0x00 0x00 0x00 0x00 0x00 0x00 0x08 0x00
>
0x7fa7ef580040: 0x00 0x00 0x81 0xa0 0x00 0x00 0x00 0x01
>
0x7fa7ef580048: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
>
0x7fa7ef580050: 0x

Re: [Gluster-devel] Possible bug in the communications layer ?

2016-04-28 Thread Xavier Hernandez
 

Hi Niels, 

On 28.04.2016 15:44, Niels de Vos wrote: 

> On Thu, Apr
28, 2016 at 02:43:01PM +0200, Xavier Hernandez wrote:
> 
>> Hi, I've
seen what seems a bug in the communications layer. The first sign is an
"XDR decoding failed" error in the logs. This happens with Gluster
3.7.11 accessed through Ganesha and gfapi. The volume is a
distributed-disperse 4*(4+2). I'm able to reproduce the problem easily
doing the following test: iozone -t2 -s10g -r1024k -i0 -w
-F/iozone{1..2}.dat echo 3 >/proc/sys/vm/drop_caches iozone -t2 -s10g
-r1024k -i1 -w -F/iozone{1..2}.dat The error happens soon after starting
the read test.
> 
> Do you know if this only happens on disperse
volumes, or also with
> others?

I have only seen the problem with EC.
Not sure if it's because I do most of my tests with EC or because other
volumes do not manifest this problem. However I've only seen this when I
started testing Ganesha. I have never seen the problem with FUSE.

I
think the reason is that FUSE is slower than Ganesha (little more than
120 MB/s vs 400 MB/s) and the combination of events needed to cause this
problem would be much more unlikely to happen on FUSE. Since EC also
talks to many bricks simultaneously (6 in this case), maybe this makes
it more sensitive to communication problems compared to a replicated
volume.

> If you have captured a network trace, could you provide it
to
> me? You can use 'editcap -s0 ...' to copy only the relevant
packets.
> But, I dont have an issue to download a few GB either if that
is easier
> for you.
> 
>> As can be seen in the data below,
client3_3_readv_cbk() is processing an iovec of 116 bytes, however it
should be of 154 bytes (the buffer in memory really seems to contain 154
bytes). The data on the network seems ok (at least I haven't been able
to identify any problem), so this must be a processing error on the
client side. The last field in cut buffer of the sequentialized data
corresponds to the length of the xdata field: 0x26. So at least 38 more
byte should be present. My guess is that some corner case is hit reading
fragmented network packets due to a high load. Debug information:
Breakpoint 1, client3_3_readv_cbk (req=0x7f540e64106c,
iov=0x7f540e6410ac, count=, myframe=0x7f54259a4d54) at
client-rpc-fops.c:3021 3021 gf_msg (this->name, GF_LOG_ERROR, EINVAL,
(gdb) print *iov $1 = {iov_base = 0x7f53e994e018, iov_len = 116} (gdb)
x/116xb 0x7f53e994e018 0x7f53e994e018: 0x00 0x00 0x80 0x00 0x00 0x00
0x00 0x00 0x7f53e994e020: 0xa8 0xbf 0xa3 0xe0 0x5f 0x48 0x4c 0x1e
> 
>
Hmm, I'm not sure how this is laid out in memory. 0x80 would be one
of
> first bytes in RPC payload, it signals 'last record' for the RPC
>
procedure, and we only send one record anyway. The four bytes combined
>
like (0x80 0x.. 0x.. 0x..) should be (0x80 | rpc-record-size). Reading
>
this in Wireshark from a .pcap.gz is much easier :)

The RPC header is
already decoded here. The 116 bytes are only the content of the readv
answer, as decoded by xdr_gfs3_read_rsp() function.

This means that the
first 4 bytes are the op_ret, next 4 are op_errno, followed by an
encoded iatt (with gfid and ino number as the first fields). Then a size
field and the length of a stream of bytes (corresponding to the encoded
xdata).
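
For reference, the decoded structure being discussed looks roughly like this
(based on the gfs3_read_rsp XDR definition as I recall it; take it as an
illustration, not an authoritative copy of the .x file):

struct gfs3_read_rsp {
        int            op_ret;       /* first 4 bytes                  */
        int            op_errno;     /* next 4 bytes                   */
        struct gf_iatt stat;         /* encoded iatt: gfid, ino, ...   */
        unsigned int   size;         /* size of the read payload       */
        struct {
                u_int  xdata_len;    /* 0x26 (38) in the dump below    */
                char  *xdata_val;    /* serialized xdata dict          */
        } xdata;
};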

I've a network capture. I can upload it if you want, but I think this is
what you are trying to see:

(gdb) f 1
#1  0x7fa88cb5bab0 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fa870413e30,
    pollin=pollin@entry=0x7fa86400efe0) at rpc-clnt.c:764
764     req->cbkfn (req, req->rsp, req->rspcnt, saved_frame->frame);
(gdb) print *pollin
$1 = {vector = {{iov_base = 0x7fa7ef58, iov_len = 140},
      {iov_base = 0x7fa7ef46, iov_len = 32808},
      {iov_base = 0x0, iov_len = 0} }, count = 2,
      vectored = 1 '01', private = 0x7fa86c00ec60, iobref = 0x7fa86400e560,
      hdr_iobuf = 0x7fa868022ee0, is_reply = 1 '01'}
(gdb) x/140xb 0x7fa7ef58
0x7fa7ef58:     0x00 0x00 0x89 0x68 0x00 0x00 0x00 0x01
0x7fa7ef580008: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7fa7ef580010: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7fa7ef580018: 0x00 0x00 0x80 0x00 0x00 0x00 0x00 0x00
0x7fa7ef580020: 0xc3 0x8e 0x35 0xf0 0x31 0xa1 0x45 0x01
0x7fa7ef580028: 0x8a 0x21 0x06 0x4b 0x08 0x4c 0x59 0xdf
0x7fa7ef580030: 0x8a 0x21 0x06 0x4b 0x08 0x4c 0x59 0xdf
0x7fa7ef580038: 0x00 0x00 0x00 0x00 0x00 0x00 0x08 0x00
0x7fa7ef580040: 0x00 0x00 0x81 0xa0 0x00 0x00 0x00 0x01
0x7fa7ef580048: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7fa7ef580050: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7fa7ef580058: 0x00 0x00 0x00 0x00 0xa0 0x00 0x00 0x00
0x7fa7ef580060: 0x00 0x00 0x10 0x00 0x00 0x00 0x00 0x00
0x7fa7ef580068: 0x00 0x50 0x00 0x00 0x57 0x22 0x33 0x1f
0x7fa7ef580070: 0x24 0x0c 0xc2 0x58 0x57 0x22 0x33 0xa1
0x7fa7ef580078: 0x34 0x11 0x3b 0x54 0x57 0x22 0x33 0xa1
0x7fa7ef580080: 0x34 0x20 0x7d 0x7a 0x00 0x00 0x80 0x00
0x7fa7ef580088: 0x00 0x00 0x00 0x26

(gdb) f 3
#3
0x000

Re: [Gluster-devel] Possible bug in the communications layer ?

2016-04-28 Thread Xavier Hernandez
 

Hi Jeff, 

On 28.04.2016 15:20, Jeff Darcy wrote: 

>> This happens
with Gluster 3.7.11 accessed through Ganesha and gfapi. The volume is a
distributed-disperse 4*(4+2). I'm able to reproduce the problem easily
doing the following test: iozone -t2 -s10g -r1024k -i0 -w
-F/iozone{1..2}.dat echo 3 >/proc/sys/vm/drop_caches iozone -t2 -s10g
-r1024k -i1 -w -F/iozone{1..2}.dat The error happens soon after starting
the read test. As can be seen in the data below, client3_3_readv_cbk()
is processing an iovec of 116 bytes, however it should be of 154 bytes
(the buffer in memory really seems to contain 154 bytes). The data on
the network seems ok (at least I haven't been able to identify any
problem), so this must be a processing error on the client side. The
last field in cut buffer of the sequentialized data corresponds to the
length of the xdata field: 0x26. So at least 38 more byte should be
present.
> 
> Nice detective work, Xavi. It would be *very* interesting
to see what
> the value of the "count" parameter is (it's unfortunately
optimized out).
> I'll bet it's two, and iov[1].iov_len is 38. I have a
weak memory of
> some problems with how this iov is put together, a
couple of years ago,
> and it looks like you might have tripped over one
more.

It seems you are right. The count is 2 and the first 38 bytes of
the second vector contains the remaining data of xdata field. The rest
of the data in the second vector seems the payload of the readv fop,
plus a 2 bytes padding:

(gdb) f 0
#0  client3_3_readv_cbk (req=0x7fdc4051a31c, iov=0x7fdc4051a35c,
    count=, myframe=0x7fdc520d505c) at client-rpc-fops.c:3021
3021    gf_msg (this->name, GF_LOG_ERROR, EINVAL,
(gdb) print *iov
$2 = {iov_base = 0x7fdc14b0d018, iov_len = 116}
(gdb) f 1
#1  0x7fdc56dafab0 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fdc3c1f4bb0,
    pollin=pollin@entry=0x7fdc34010f20) at rpc-clnt.c:764
764     req->cbkfn (req, req->rsp, req->rspcnt, saved_frame->frame);
(gdb) print *pollin
$3 = {vector = {{iov_base = 0x7fdc14b0d000, iov_len = 140},
      {iov_base = 0x7fdc14a4d000, iov_len = 32808},
      {iov_base = 0x0, iov_len = 0} }, count = 2,
      vectored = 1 '01', private = 0x7fdc340106c0, iobref = 0x7fdc34006660,
      hdr_iobuf = 0x7fdc3c4c07c0, is_reply = 1 '01'}
(gdb) f 0
#0  client3_3_readv_cbk (req=0x7fdc4051a31c, iov=0x7fdc4051a35c,
    count=, myframe=0x7fdc520d505c) at client-rpc-fops.c:3021
3021    gf_msg (this->name, GF_LOG_ERROR, EINVAL,
(gdb) print iov[1]
$4 = {iov_base = 0x7fdc14a4d000, iov_len = 32808}
(gdb) print iov[2]
$5 = {iov_base = 0x2, iov_len = 140583741974112}
(gdb) x/128xb 0x7fdc14a4d000
0x7fdc14a4d000: 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x17
0x7fdc14a4d008: 0x00 0x00 0x00 0x02 0x67 0x6c 0x75 0x73
0x7fdc14a4d010: 0x74 0x65 0x72 0x66 0x73 0x2e 0x69 0x6e
0x7fdc14a4d018: 0x6f 0x64 0x65 0x6c 0x6b 0x2d 0x63 0x6f
0x7fdc14a4d020: 0x75 0x6e 0x74 0x00 0x31 0x00 0x00 0x00
0x7fdc14a4d028: 0x5c 0x5c 0x5c 0x5c 0x5c 0x5c 0x5c 0x5c
0x7fdc14a4d030: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7fdc14a4d038: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7fdc14a4d040: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7fdc14a4d048: 0x5c 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7fdc14a4d050: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7fdc14a4d058: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7fdc14a4d060: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7fdc14a4d068: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7fdc14a4d070: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7fdc14a4d078: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

> Maybe
it's related to all that epoll stuff.

I'm currently using 4 epoll
threads (this improves ec performance). I'll try to repeat the tests
with a single epoll thread, but I'm not sure if this will be enough to
get any conclusion if the problem doesn't manifest, since ec through
fuse with 4 epoll threads doesn't seem to trigger the problem.

Xavi
 ___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Possible bug in the communications layer ?

2016-04-28 Thread Xavier Hernandez

Hi,

I've seen what seems a bug in the communications layer. The first sign 
is an "XDR decoding failed" error in the logs.


This happens with Gluster 3.7.11 accessed through Ganesha and gfapi. The 
volume is a distributed-disperse 4*(4+2).


I'm able to reproduce the problem easily doing the following test:

iozone -t2 -s10g -r1024k -i0 -w -F /iozone{1..2}.dat
echo 3 >/proc/sys/vm/drop_caches
iozone -t2 -s10g -r1024k -i1 -w -F /iozone{1..2}.dat

The error happens soon after starting the read test.

As can be seen in the data below, client3_3_readv_cbk() is processing an 
iovec of 116 bytes, however it should be of 154 bytes (the buffer in 
memory really seems to contain 154 bytes). The data on the network seems 
ok (at least I haven't been able to identify any problem), so this must 
be a processing error on the client side.


The last field in cut buffer of the sequentialized data corresponds to 
the length of the xdata field: 0x26. So at least 38 more byte should be 
present.


My guess is that some corner case is hit reading fragmented network 
packets due to a high load.


Debug information:

Breakpoint 1, client3_3_readv_cbk (req=0x7f540e64106c, 
iov=0x7f540e6410ac, count=, myframe=0x7f54259a4d54) at 
client-rpc-fops.c:3021

3021    gf_msg (this->name, GF_LOG_ERROR, EINVAL,
(gdb) print *iov
$1 = {iov_base = 0x7f53e994e018, iov_len = 116}
(gdb) x/116xb 0x7f53e994e018
0x7f53e994e018: 0x00 0x00 0x80 0x00 0x00 0x00 0x00 0x00
0x7f53e994e020: 0xa8 0xbf 0xa3 0xe0 0x5f 0x48 0x4c 0x1e
0x7f53e994e028: 0x80 0xa3 0x8a 0xd8 0x9d 0xa1 0x1c 0x75
0x7f53e994e030: 0x80 0xa3 0x8a 0xd8 0x9d 0xa1 0x1c 0x75
0x7f53e994e038: 0x00 0x00 0x00 0x00 0x00 0x00 0x08 0x00
0x7f53e994e040: 0x00 0x00 0x81 0xa0 0x00 0x00 0x00 0x01
0x7f53e994e048: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7f53e994e050: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7f53e994e058: 0x00 0x00 0x00 0x00 0xa0 0x00 0x00 0x00
0x7f53e994e060: 0x00 0x00 0x10 0x00 0x00 0x00 0x00 0x00
0x7f53e994e068: 0x00 0x50 0x00 0x00 0x57 0x22 0x04 0x1f
0x7f53e994e070: 0x25 0x38 0x92 0x91 0x57 0x22 0x04 0xb3
0x7f53e994e078: 0x03 0x53 0x1b 0x13 0x57 0x22 0x04 0xb3
0x7f53e994e080: 0x06 0xf5 0xe1 0x99 0x00 0x00 0x80 0x00
0x7f53e994e088: 0x00 0x00 0x00 0x26
(gdb) x/154xb 0x7f53e994e018
0x7f53e994e018: 0x00 0x00 0x80 0x00 0x00 0x00 0x00 0x00
0x7f53e994e020: 0xa8 0xbf 0xa3 0xe0 0x5f 0x48 0x4c 0x1e
0x7f53e994e028: 0x80 0xa3 0x8a 0xd8 0x9d 0xa1 0x1c 0x75
0x7f53e994e030: 0x80 0xa3 0x8a 0xd8 0x9d 0xa1 0x1c 0x75
0x7f53e994e038: 0x00 0x00 0x00 0x00 0x00 0x00 0x08 0x00
0x7f53e994e040: 0x00 0x00 0x81 0xa0 0x00 0x00 0x00 0x01
0x7f53e994e048: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7f53e994e050: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7f53e994e058: 0x00 0x00 0x00 0x00 0xa0 0x00 0x00 0x00
0x7f53e994e060: 0x00 0x00 0x10 0x00 0x00 0x00 0x00 0x00
0x7f53e994e068: 0x00 0x50 0x00 0x00 0x57 0x22 0x04 0x1f
0x7f53e994e070: 0x25 0x38 0x92 0x91 0x57 0x22 0x04 0xb3
0x7f53e994e078: 0x03 0x53 0x1b 0x13 0x57 0x22 0x04 0xb3
0x7f53e994e080: 0x06 0xf5 0xe1 0x99 0x00 0x00 0x80 0x00
0x7f53e994e088: 0x00 0x00 0x00 0x26 0x00 0x00 0x00 0x01
0x7f53e994e090: 0x00 0x00 0x00 0x17 0x00 0x00 0x00 0x02
0x7f53e994e098: 0x67 0x6c 0x75 0x73 0x74 0x65 0x72 0x66
0x7f53e994e0a0: 0x73 0x2e 0x69 0x6e 0x6f 0x64 0x65 0x6c
0x7f53e994e0a8: 0x6b 0x2d 0x63 0x6f 0x75 0x6e 0x74 0x00
0x7f53e994e0b0: 0x31 0x00
(gdb) bt
#0  client3_3_readv_cbk (req=0x7f540e64106c, iov=0x7f540e6410ac, 
count=, myframe=0x7f54259a4d54) at client-rpc-fops.c:3021
#1  0x7f542a677ab0 in rpc_clnt_handle_reply 
(clnt=clnt@entry=0x7f54101cdef0, pollin=pollin@entry=0x7f5491f0) at 
rpc-clnt.c:764
#2  0x7f542a677d6f in rpc_clnt_notify (trans=, 
mydata=0x7f54101cdf20, event=, data=0x7f5491f0) at 
rpc-clnt.c:925
#3  0x7f542a673853 in rpc_transport_notify 
(this=this@entry=0x7f54101ddb70, 
event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, 
data=data@entry=0x7f5491f0) at rpc-transport.c:546
#4  0x7f541d881666 in socket_event_poll_in 
(this=this@entry=0x7f54101ddb70) at socket.c:2237
#5  0x7f541d8842c4 in socket_event_handler (fd=fd@entry=30, 
idx=idx@entry=20, data=0x7f54101ddb70, poll_in=1, poll_out=0, 
poll_err=0) at socket.c:2350
#6  0x7f542a90aa4a in event_dispatch_epoll_handler 
(event=0x7f540effc540, event_pool=0xd3d9a0) at 

Re: [Gluster-devel] disperse volume file to subvolume mapping

2016-04-19 Thread Xavier Hernandez

Hi Serkan,

On 19/04/16 09:18, Serkan Çoban wrote:

Hi, I just reinstalled fresh 3.7.11 and I am seeing the same behavior.
50 clients copying part-0- named files using mapreduce to gluster
using one thread per server and they are using only 20 servers out of
60. On the other hand fio tests use all the servers. Anything I can do
to solve the issue?


Distribution of files to ec sets is done by dht. In theory, if you create
many files, each ec set will receive roughly the same number of files.
However, when the number of files is small enough, the statistics can fail.
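
As a deliberately simplified illustration of why placement follows the file
name rather than a round-robin (this is NOT the real dht code, which uses
per-directory layout ranges and its own hash; the names and hash below are
made up):

#include <stdint.h>
#include <stdio.h>

static uint32_t toy_hash (const char *name)
{
        uint32_t h = 5381;                     /* djb2, a stand-in for dht's hash */
        while (*name)
                h = h * 33 + (unsigned char) *name++;
        return h;
}

int main (void)
{
        const unsigned subvols = 78;           /* 78 x (16+4) in the setup above */
        char name[64];
        for (int i = 0; i < 10; i++) {
                snprintf (name, sizeof (name), "part-0-%05d", i);
                printf ("%s -> ec set %u\n", name,
                        (unsigned) (toy_hash (name) % subvols));
        }
        return 0;
}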


Not sure what you are doing exactly, but a mapreduce procedure generally
only creates a single output. In that case it makes sense that only one
ec set is used. If you want to use all ec sets for a single file, you
should enable sharding (I haven't tested that) or split the result into
multiple files.


Xavi



Thanks,
Serkan


-- Forwarded message --
From: Serkan Çoban 
Date: Mon, Apr 18, 2016 at 2:39 PM
Subject: disperse volume file to subvolume mapping
To: Gluster Users 


Hi, I have a problem where clients are using only 1/3 of nodes in
disperse volume for writing.
I am testing from 50 clients using 1 to 10 threads with file names part-0-.
What I see is clients only use 20 nodes for writing. How is the file
name to sub volume hashing is done? Is this related to file names are
similar?

My cluster is 3.7.10 with 60 nodes each has 26 disks. Disperse volume
is 78 x (16+4). Only 26 out of 78 sub volumes used during writes..


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Fragment size in Systematic erasure code

2016-03-14 Thread Xavier Hernandez

Hi Ashish,

On 14/03/16 12:31, Ashish Pandey wrote:

Hi Xavi,

I think for the systematic erasure coded volume you are going to use a fragment
size of 512 bytes.
Will there be any CLI option to configure this block size?
We were having a discussion and Manoj was suggesting to have this option, which
might improve performance for some workloads.
For example, if we can configure it to 8K, all reads can be served from only
one brick in case a file is smaller than 8K.


I already considered using a configurable fragment size, and I plan to
have it. However, the benefits of larger block sizes are not so clear.
Having a fragment size of 8KB in a 4+2 configuration will use a stripe
of 32KB. Any write that is smaller, not aligned, or not a multiple of this
value will need a read-modify-write cycle, causing a performance
degradation for some workloads. It's also slower to encode/decode a
block of 32KB because it might not fully fit into processor caches,
making the computation slower.
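
A minimal sketch of that arithmetic (assuming a 4+2 layout, i.e. 4 data
fragments; illustrative only):

#include <stdint.h>
#include <stdio.h>

int main (void)
{
        uint64_t frag = 8192, data_frags = 4;
        uint64_t stripe = frag * data_frags;    /* 32KB stripe for 8KB fragments */
        uint64_t off = 4096, len = 12000;       /* an example unaligned write */
        int rmw = (off % stripe) != 0 || (len % stripe) != 0;

        printf ("stripe = %llu bytes, needs read-modify-write: %s\n",
                (unsigned long long) stripe, rmw ? "yes" : "no");
        return 0;
}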


On the other hand, a small read on multiple bricks should, in theory, be
executed in parallel, not causing a noticeable performance drop.


Anyway many of these things depend on the workload, so having a 
configurable fragment size will give enough control to choose the best 
solution for each environment.


Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] GlusterFS FUSE client leaks summary — part I

2016-02-02 Thread Xavier Hernandez
size=706766688
num_allocs=2454051


And after drop_caches: https://gist.github.com/5eab63bc13f78787ed19


[mount/fuse.fuse - usage-type gf_common_mt_inode_ctx memusage]
size=550996416
num_allocs=1913182

There isn't much significant drop in inode contexts. One of the
reasons could be because of dentrys holding a refcount on the inodes
which shall result in inodes not getting purged even after
fuse_forget.


pool-name=fuse:dentry_t
hot-count=32761

if  '32761' is the current active dentry count, it still doesn't seem
to match up to inode count.

Thanks,
Soumya


And here is Valgrind output:
https://gist.github.com/2490aeac448320d98596

On субота, 30 січня 2016 р. 22:56:37 EET Xavier Hernandez wrote:

There's another inode leak caused by an incorrect counting of
lookups on directory reads.

Here's a patch that solves the problem for
3.7:

http://review.gluster.org/13324

Hopefully with this patch the
memory leaks should disapear.

Xavi

On 29.01.2016 19:09, Oleksandr

Natalenko wrote:

Here is intermediate summary of current memory


leaks in FUSE client


investigation.

I use GlusterFS v3.7.6


release with the following patches:

===



Kaleb S KEITHLEY (1):

fuse: use-after-free fix in fuse-bridge, revisited


Pranith Kumar K


(1):

mount/fuse: Fix use-after-free crash



Soumya Koduri (3):

gfapi: Fix inode nlookup counts


inode: Retire the inodes from the lru


list in inode_table_destroy


upcall: free the xdr* allocations
===


With those patches we got API leaks fixed (I hope, brief tests show


that) and


got rid of "kernel notifier loop terminated" message.


Nevertheless, FUSE


client still leaks.

I have several test


volumes with several million of small files (100K…2M in


average). I


do 2 types of FUSE client testing:

1) find /mnt/volume -type d
2)


rsync -av -H /mnt/source_volume/* /mnt/target_volume/


And most


up-to-date results are shown below:

=== find /mnt/volume -type d


===


Memory consumption: ~4G



Statedump:

https://gist.github.com/10cde83c63f1b4f1dd7a


Valgrind:

https://gist.github.com/097afb01ebb2c5e9e78d


I guess,


fuse-bridge/fuse-resolve. related.


=== rsync -av -H


/mnt/source_volume/* /mnt/target_volume/ ===


Memory consumption:

~3.3...4G


Statedump (target volume):

https://gist.github.com/31e43110eaa4da663435


Valgrind (target volume):

https://gist.github.com/f8e0151a6878cacc9b1a


I guess,


DHT-related.


Give me more patches to test :).


___


Gluster-devel mailing


list


Gluster-devel@gluster.org


http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client leaks summary — part I

2016-02-01 Thread Xavier Hernandez

Hi,

On 01/02/16 09:54, Soumya Koduri wrote:



On 02/01/2016 01:39 PM, Oleksandr Natalenko wrote:

Wait. It seems to be my bad.

Before unmounting I do drop_caches (2), and glusterfs process CPU usage
goes to 100% for a while. I haven't waited for it to drop to 0%, and
instead perform unmount. It seems glusterfs is purging inodes and that's
why it uses 100% of CPU. I've re-tested it, waiting for CPU usage to
become normal, and got no leaks.

Will verify this once again and report more.

BTW, if that works, how could I limit inode cache for FUSE client? I do
not want it to go beyond 1G, for example, even if I have 48G of RAM on
my server.


It's hard-coded for now. For fuse the lru limit (of the inodes which are
not active) is (32*1024).


This is not exact for the current implementation. The inode memory pool is
configured with 32*1024 entries, but the lru limit is effectively infinite:
currently inode_table_prune() takes lru_limit == 0 as infinite, and the
inode table created by fuse is initialized with 0.
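
In other words, something like this (an illustrative sketch of the behaviour,
not the actual libglusterfs code):

#include <stdint.h>

/* lru_limit == 0 is taken as "unlimited", so the passive (lru) inodes of
 * the fuse inode table are never pruned on the table's own initiative. */
static int
over_lru_limit (uint32_t lru_limit, uint32_t lru_size)
{
        if (lru_limit == 0)
                return 0;               /* infinite: never prune */
        return lru_size > lru_limit;
}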


Anyway this should not be a big problem in normal conditions. After 
having fixed the incorrect nlookup count for "." and ".." directory 
entries, when the kernel detects memory pressure and sends inode 
forgets, the memory will be released.



One of the ways to address this (which we were discussing earlier) is to
have an option to configure inode cache limit.


I think this will need more thinking. I've made a quick test forcing
lru_limit to a small value and weird errors appeared (probably from
inodes being expected to exist when the kernel sends new requests). Anyway I
haven't spent time on this. I haven't tested it on master either.


Xavi


If that sounds good, we
can then check on if it has to be global/volume-level, client/server/both.

Thanks,
Soumya



01.02.2016 09:54, Soumya Koduri написав:

On 01/31/2016 03:05 PM, Oleksandr Natalenko wrote:

Unfortunately, this patch doesn't help.

RAM usage on "find" finish is ~9G.

Here is statedump before drop_caches: https://gist.github.com/
fc1647de0982ab447e20


[mount/fuse.fuse - usage-type gf_common_mt_inode_ctx memusage]
size=706766688
num_allocs=2454051



And after drop_caches: https://gist.github.com/5eab63bc13f78787ed19


[mount/fuse.fuse - usage-type gf_common_mt_inode_ctx memusage]
size=550996416
num_allocs=1913182

There isn't much significant drop in inode contexts. One of the
reasons could be because of dentrys holding a refcount on the inodes
which shall result in inodes not getting purged even after
fuse_forget.


pool-name=fuse:dentry_t
hot-count=32761

if  '32761' is the current active dentry count, it still doesn't seem
to match up to inode count.

Thanks,
Soumya


And here is Valgrind output:
https://gist.github.com/2490aeac448320d98596

On субота, 30 січня 2016 р. 22:56:37 EET Xavier Hernandez wrote:

There's another inode leak caused by an incorrect counting of
lookups on directory reads.

Here's a patch that solves the problem for
3.7:

http://review.gluster.org/13324

Hopefully with this patch the
memory leaks should disapear.

Xavi

On 29.01.2016 19:09, Oleksandr

Natalenko wrote:

Here is intermediate summary of current memory


leaks in FUSE client


investigation.

I use GlusterFS v3.7.6


release with the following patches:

===



Kaleb S KEITHLEY (1):

fuse: use-after-free fix in fuse-bridge, revisited


Pranith Kumar K


(1):

mount/fuse: Fix use-after-free crash



Soumya Koduri (3):

gfapi: Fix inode nlookup counts


inode: Retire the inodes from the lru


list in inode_table_destroy


upcall: free the xdr* allocations
===


With those patches we got API leaks fixed (I hope, brief tests show


that) and


got rid of "kernel notifier loop terminated" message.


Nevertheless, FUSE


client still leaks.

I have several test


volumes with several million of small files (100K…2M in


average). I


do 2 types of FUSE client testing:

1) find /mnt/volume -type d
2)


rsync -av -H /mnt/source_volume/* /mnt/target_volume/


And most


up-to-date results are shown below:

=== find /mnt/volume -type d


===


Memory consumption: ~4G



Statedump:

https://gist.github.com/10cde83c63f1b4f1dd7a


Valgrind:

https://gist.github.com/097afb01ebb2c5e9e78d


I guess,


fuse-bridge/fuse-resolve. related.


=== rsync -av -H


/mnt/source_volume/* /mnt/target_volume/ ===


Memory consumption:

~3.3...4G


Statedump (target volume):

https://gist.github.com/31e43110eaa4da663435


Valgrind (target volume):

https://gist.github.com/f8e0151a6878cacc9b1a


I guess,


DHT-related.


Give me more patches to test :).


___


Gluster-devel mailing


list


Gluster-devel@gluster.org


http://www.gluster.org/mailman/listinfo/gluster-devel



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mai

Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client leaks summary — part I

2016-02-01 Thread Xavier Hernandez

Hi Oleksandr,

On 01/02/16 09:09, Oleksandr Natalenko wrote:

Wait. It seems to be my bad.

Before unmounting I do drop_caches (2), and glusterfs process CPU usage
goes to 100% for a while.


That's the expected behavior after applying the nlookup count patch. As
it's configured now, gluster won't release memory until the kernel
requests it. Forcing a drop of caches causes this to happen massively,
consuming a lot of CPU. Under normal circumstances, when memory is low,
the kernel will start releasing cached entries. This will include requests
to gluster to release memory associated with those inodes in an
incremental way as it's needed.



I haven't waited for it to drop to 0%, and
instead perform unmount. It seems glusterfs is purging inodes and that's
why it uses 100% of CPU. I've re-tested it, waiting for CPU usage to
become normal, and got no leaks.


I've made the same experiment and I ended up with only 4 inodes still in
use (probably the root directory and some other special entries) after
having had several tens of thousands.


Xavi



Will verify this once again and report more.

BTW, if that works, how could I limit inode cache for FUSE client? I do
not want it to go beyond 1G, for example, even if I have 48G of RAM on
my server.

01.02.2016 09:54, Soumya Koduri написав:

On 01/31/2016 03:05 PM, Oleksandr Natalenko wrote:

Unfortunately, this patch doesn't help.

RAM usage on "find" finish is ~9G.

Here is statedump before drop_caches: https://gist.github.com/
fc1647de0982ab447e20


[mount/fuse.fuse - usage-type gf_common_mt_inode_ctx memusage]
size=706766688
num_allocs=2454051



And after drop_caches: https://gist.github.com/5eab63bc13f78787ed19


[mount/fuse.fuse - usage-type gf_common_mt_inode_ctx memusage]
size=550996416
num_allocs=1913182

There isn't a significant drop in inode contexts. One of the
reasons could be dentries holding a refcount on the inodes,
which results in the inodes not getting purged even after
fuse_forget.


pool-name=fuse:dentry_t
hot-count=32761

If '32761' is the current active dentry count, it still doesn't seem
to match up with the inode count.
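
To illustrate that relationship (a hypothetical sketch with made-up names,
not gluster's actual inode.c), an inode whose kernel lookups have all been
forgotten can still be pinned by a cached dentry holding a reference:

#include <stdint.h>
#include <stdlib.h>

struct sketch_inode {
        uint64_t nlookup;   /* kernel references, dropped by fuse_forget */
        uint64_t ref;       /* internal references (dentries, fds, ...) */
};

struct sketch_dentry {
        struct sketch_inode *inode;   /* each cached dentry pins its inode */
};

static void
sketch_inode_maybe_purge(struct sketch_inode *inode)
{
        /* with hot-count=32761 dentries alive, many inodes keep ref > 0
         * and therefore survive the forget, so their inode contexts
         * (gf_common_mt_inode_ctx) stay allocated */
        if (inode->nlookup == 0 && inode->ref == 0)
                free(inode);
}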

Thanks,
Soumya


And here is Valgrind output:
https://gist.github.com/2490aeac448320d98596

On Saturday, 30 January 2016, 22:56:37 EET, Xavier Hernandez wrote:

There's another inode leak caused by an incorrect counting of
lookups on directory reads.

Here's a patch that solves the problem for 3.7:

http://review.gluster.org/13324

Hopefully with this patch the memory leaks should disappear.

Xavi

On 29.01.2016 19:09, Oleksandr Natalenko wrote:

Here is an intermediate summary of current memory leaks in the FUSE client
investigation.

I use the GlusterFS v3.7.6 release with the following patches:

===
Kaleb S KEITHLEY (1):
fuse: use-after-free fix in fuse-bridge, revisited

Pranith Kumar K (1):
mount/fuse: Fix use-after-free crash

Soumya Koduri (3):
gfapi: Fix inode nlookup counts
inode: Retire the inodes from the lru list in inode_table_destroy
upcall: free the xdr* allocations
===

With those patches we got API leaks fixed (I hope, brief tests show that) and
got rid of the "kernel notifier loop terminated" message. Nevertheless, the FUSE
client still leaks.

I have several test volumes with several million small files (100K…2M on
average). I do 2 types of FUSE client testing:

1) find /mnt/volume -type d
2) rsync -av -H /mnt/source_volume/* /mnt/target_volume/

And the most up-to-date results are shown below:

=== find /mnt/volume -type d ===

Memory consumption: ~4G
Statedump: https://gist.github.com/10cde83c63f1b4f1dd7a
Valgrind: https://gist.github.com/097afb01ebb2c5e9e78d

I guess this is fuse-bridge/fuse-resolve related.

=== rsync -av -H /mnt/source_volume/* /mnt/target_volume/ ===

Memory consumption: ~3.3…4G
Statedump (target volume): https://gist.github.com/31e43110eaa4da663435
Valgrind (target volume): https://gist.github.com/f8e0151a6878cacc9b1a

I guess this is DHT-related.

Give me more patches to test :).

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-users mailing list
gluster-us...@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] GlusterFS FUSE client leaks summary — part I

2016-01-30 Thread Xavier Hernandez
 

There's another inode leak caused by an incorrect counting of
lookups on directory reads.

Here's a patch that solves the problem for 3.7:

http://review.gluster.org/13324

Hopefully with this patch the memory leaks should disappear.

Xavi

On 29.01.2016 19:09, Oleksandr Natalenko wrote:

> Here is an intermediate summary of current memory leaks in the FUSE client
> investigation.
> 
> I use the GlusterFS v3.7.6 release with the following patches:
> 
> ===
> Kaleb S KEITHLEY (1):
> fuse: use-after-free fix in fuse-bridge, revisited
> 
> Pranith Kumar K (1):
> mount/fuse: Fix use-after-free crash
> 
> Soumya Koduri (3):
> gfapi: Fix inode nlookup counts
> inode: Retire the inodes from the lru list in inode_table_destroy
> upcall: free the xdr* allocations
> ===
> 
> With those patches we got API leaks fixed (I hope, brief tests show that) and
> got rid of the "kernel notifier loop terminated" message. Nevertheless, the FUSE
> client still leaks.
> 
> I have several test volumes with several million small files (100K…2M on
> average). I do 2 types of FUSE client testing:
> 
> 1) find /mnt/volume -type d
> 2) rsync -av -H /mnt/source_volume/* /mnt/target_volume/
> 
> And the most up-to-date results are shown below:
> 
> === find /mnt/volume -type d ===
> 
> Memory consumption: ~4G
> Statedump: https://gist.github.com/10cde83c63f1b4f1dd7a
> Valgrind: https://gist.github.com/097afb01ebb2c5e9e78d
> 
> I guess this is fuse-bridge/fuse-resolve related.
> 
> === rsync -av -H /mnt/source_volume/* /mnt/target_volume/ ===
> 
> Memory consumption: ~3.3…4G
> Statedump (target volume): https://gist.github.com/31e43110eaa4da663435
> Valgrind (target volume): https://gist.github.com/f8e0151a6878cacc9b1a
> 
> I guess this is DHT-related.
> 
> Give me more patches to test :).
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
 ___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] distributed files/directories and [cm]time updates

2016-01-26 Thread Xavier Hernandez

Hi Joseph,

On 26/01/16 10:42, Joseph Fernandes wrote:

Hi Xavi,

Answer inline:

- Original Message -
From: "Xavier Hernandez" <xhernan...@datalab.es>
To: "Joseph Fernandes" <josfe...@redhat.com>
Cc: "Pranith Kumar Karampuri" <pkara...@redhat.com>, "Gluster Devel" 
<gluster-devel@gluster.org>
Sent: Tuesday, January 26, 2016 2:09:43 PM
Subject: Re: [Gluster-devel] distributed files/directories and [cm]time updates

Hi Joseph,

On 26/01/16 09:07, Joseph Fernandes wrote:

Answer inline:


- Original Message -
From: "Xavier Hernandez" <xhernan...@datalab.es>
To: "Pranith Kumar Karampuri" <pkara...@redhat.com>, "Gluster Devel" 
<gluster-devel@gluster.org>
Sent: Tuesday, January 26, 2016 1:21:37 PM
Subject: Re: [Gluster-devel] distributed files/directories and [cm]time updates

Hi Pranith,

On 26/01/16 03:47, Pranith Kumar Karampuri wrote:

hi,
Traditionally gluster has been using the ctime/mtime of the files/dirs on
the bricks as stat output. The problem we are seeing with this approach is
that software which depends on it gets confused when there are differences
in these times. Tar especially gives "file changed as we read it" whenever
it detects ctime differences when stat is served from different bricks. The
way we have been trying to solve it is to serve the stat structures from
the same brick in afr and the max-time in dht, but it doesn't avoid the
problem completely. Because there is no way to change ctime at the moment
(lutimes() only allows mtime and atime), there is little we can do to make
sure ctimes match after self-heals/xattr updates/rebalance. I am wondering
if any of you have solved these problems before; if yes, how did you go
about doing it? It seems that applications which depend on this for backups
get confused the same way. The only way out I see is to bring ctime into an
xattr, but that will need more iops and gluster has to keep updating it on
quite a few fops.


I did think about this when I was writing ec at the beginning. The idea
was that the point in time at which each fop is executed was controlled
by the client by adding a special xattr to each regular fop. Of course
this would require support inside the storage/posix xlator. At that
time, adding the needed support to other xlators seemed too complex for
me, so I decided to do something similar to afr.

Anyway, the idea was like this: for example, when a write fop needs to
be sent, dht/afr/ec sets the current time in a special xattr, for
example 'glusterfs.time'. It can be done in a way that if the time is
already set by a higher xlator, it's not modified. This way DHT could
set the time in fops involving multiple afr subvolumes. For other fops,
it would be afr that sets the time. It could also be set directly by the top
most xlator (fuse), but that time could be incorrect because lower
xlators could delay the fop execution and reorder it. This would need
more thinking.

That xattr will be received by storage/posix. This xlator will determine
what times need to be modified and will change them. In the case of a
write, it can decide to modify mtime and, maybe, atime. For a mkdir or
create, it will set the times of the new file/directory and also the
mtime of the parent directory. It depends on the specific fop being
processed.

mtime, atime and ctime (or even others) could be saved in a special
posix xattr instead of relying on the file system attributes that cannot
be modified (at least for ctime).

This solution doesn't require extra fops, so it seems quite clean to me.
The additional I/O needed in posix could be minimized by implementing a
metadata cache in storage/posix that would read all metadata on lookup
and update it on disk only at regular intervals and/or on invalidation.
All fops would read/write into the cache. This would even reduce the
number of I/O operations we are currently doing for each fop.


JOE: The idea of a metadata cache is cool for read workloads, but for writes we
would end up doing double writes to the disk, i.e. 1 for the actual write and 1
to update the xattr. IMHO we cannot have it in a write-back cache (periodic
flush to disk), as ctime/mtime/atime data loss or inconsistency will be a
problem. Your thoughts?


If we want to have everything in physical storage at all times, gluster will be
slow. We only need to be posix compliant, and posix allows some degree
of "inconsistency" here, i.e. we are not forced to write to physical
storage until the user application sends a flush or similar request.
Note that there are xlators that currently take advantage of this: for
example write-behind and md-cache.

Almost all file systems (if not all) rely on this to improve
performance, otherwise they would be really slow.

JOE: Agree


Of course this could cause a temporary inconsistency between bricks, but
since all cluster xlators (dht, afr and ec) use special xattrs to track
consistency, a crash before flushing the metadata could be detected and
repaired (with additional care, even a crash while flushing metadata
could be detected).

Re: [Gluster-devel] distributed files/directories and [cm]time updates

2016-01-26 Thread Xavier Hernandez

Hi Joseph,

On 26/01/16 09:07, Joseph Fernandes wrote:

Answer inline:


- Original Message -
From: "Xavier Hernandez" <xhernan...@datalab.es>
To: "Pranith Kumar Karampuri" <pkara...@redhat.com>, "Gluster Devel" 
<gluster-devel@gluster.org>
Sent: Tuesday, January 26, 2016 1:21:37 PM
Subject: Re: [Gluster-devel] distributed files/directories and [cm]time updates

Hi Pranith,

On 26/01/16 03:47, Pranith Kumar Karampuri wrote:

hi,
Traditionally gluster has been using the ctime/mtime of the files/dirs on
the bricks as stat output. The problem we are seeing with this approach is
that software which depends on it gets confused when there are differences
in these times. Tar especially gives "file changed as we read it" whenever
it detects ctime differences when stat is served from different bricks. The
way we have been trying to solve it is to serve the stat structures from
the same brick in afr and the max-time in dht, but it doesn't avoid the
problem completely. Because there is no way to change ctime at the moment
(lutimes() only allows mtime and atime), there is little we can do to make
sure ctimes match after self-heals/xattr updates/rebalance. I am wondering
if any of you have solved these problems before; if yes, how did you go
about doing it? It seems that applications which depend on this for backups
get confused the same way. The only way out I see is to bring ctime into an
xattr, but that will need more iops and gluster has to keep updating it on
quite a few fops.


I did think about this when I was writing ec at the beginning. The idea
was that the point in time at which each fop is executed was controlled
by the client by adding a special xattr to each regular fop. Of course
this would require support inside the storage/posix xlator. At that
time, adding the needed support to other xlators seemed too complex for
me, so I decided to do something similar to afr.

Anyway, the idea was like this: for example, when a write fop needs to
be sent, dht/afr/ec sets the current time in a special xattr, for
example 'glusterfs.time'. It can be done in a way that if the time is
already set by a higher xlator, it's not modified. This way DHT could
set the time in fops involving multiple afr subvolumes. For other fops,
it would be afr that sets the time. It could also be set directly by the top
most xlator (fuse), but that time could be incorrect because lower
xlators could delay the fop execution and reorder it. This would need
more thinking.

That xattr will be received by storage/posix. This xlator will determine
what times need to be modified and will change them. In the case of a
write, it can decide to modify mtime and, maybe, atime. For a mkdir or
create, it will set the times of the new file/directory and also the
mtime of the parent directory. It depends on the specific fop being
processed.

mtime, atime and ctime (or even others) could be saved in a special
posix xattr instead of relying on the file system attributes that cannot
be modified (at least for ctime).

This solution doesn't require extra fops, so it seems quite clean to me.
The additional I/O needed in posix could be minimized by implementing a
metadata cache in storage/posix that would read all metadata on lookup
and update it on disk only at regular intervals and/or on invalidation.
All fops would read/write into the cache. This would even reduce the
number of I/O operations we are currently doing for each fop.


JOE: The idea of a metadata cache is cool for read workloads, but for writes we
would end up doing double writes to the disk, i.e. 1 for the actual write and 1
to update the xattr. IMHO we cannot have it in a write-back cache (periodic
flush to disk), as ctime/mtime/atime data loss or inconsistency will be a
problem. Your thoughts?


If we want to have everything in physical storage at all times, gluster will be
slow. We only need to be posix compliant, and posix allows some degree
of "inconsistency" here, i.e. we are not forced to write to physical
storage until the user application sends a flush or similar request. 
Note that there are xlators that currently take advantage of this: for 
example write-behind and md-cache.


Almost all file systems (if not all) rely on this to improve 
performance, otherwise they would be really slow.


Of course this could cause a temporary inconsistency between bricks, but
since all cluster xlators (dht, afr and ec) use special xattrs to track
consistency, a crash before flushing the metadata could be detected and
repaired (with additional care, even a crash while flushing metadata
could be detected).
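
To sketch what that write-back could look like (illustrative names only,
not actual storage/posix code, and assuming flush/fsync are the only hard
persistence points we must honour):

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

typedef struct {
        struct timespec atime;
        struct timespec mtime;
        struct timespec ctime_;   /* would live in a posix xattr on disk */
        bool            dirty;    /* modified in memory, not yet flushed */
} md_cache_entry_t;

/* Every fop only touches the in-memory copy. */
static void
md_cache_set_mtime(md_cache_entry_t *e, struct timespec when)
{
        e->mtime  = when;
        e->ctime_ = when;
        e->dirty  = true;
}

/* Written back on flush/fsync, on invalidation or on a periodic timer.
 * A crash before this point is detectable through the cluster xlators'
 * own consistency xattrs and repairable by self-heal. */
static int
md_cache_writeback(md_cache_entry_t *e,
                   int (*store_xattr)(const md_cache_entry_t *))
{
        if (!e->dirty)
                return 0;
        if (store_xattr(e) != 0)
                return -1;
        e->dirty = false;
        return 0;
}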


Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] distributed files/directories and [cm]time updates

2016-01-25 Thread Xavier Hernandez

Hi Pranith,

On 26/01/16 03:47, Pranith Kumar Karampuri wrote:

hi,
Traditionally gluster has been using the ctime/mtime of the files/dirs on
the bricks as stat output. The problem we are seeing with this approach is
that software which depends on it gets confused when there are differences
in these times. Tar especially gives "file changed as we read it" whenever
it detects ctime differences when stat is served from different bricks. The
way we have been trying to solve it is to serve the stat structures from
the same brick in afr and the max-time in dht, but it doesn't avoid the
problem completely. Because there is no way to change ctime at the moment
(lutimes() only allows mtime and atime), there is little we can do to make
sure ctimes match after self-heals/xattr updates/rebalance. I am wondering
if any of you have solved these problems before; if yes, how did you go
about doing it? It seems that applications which depend on this for backups
get confused the same way. The only way out I see is to bring ctime into an
xattr, but that will need more iops and gluster has to keep updating it on
quite a few fops.


I did think about this when I was writing ec at the beginning. The idea 
was that the point in time at which each fop is executed was controlled
by the client by adding a special xattr to each regular fop. Of course
this would require support inside the storage/posix xlator. At that 
time, adding the needed support to other xlators seemed too complex for 
me, so I decided to do something similar to afr.


Anyway, the idea was like this: for example, when a write fop needs to 
be sent, dht/afr/ec sets the current time in a special xattr, for 
example 'glusterfs.time'. It can be done in a way that if the time is 
already set by a higher xlator, it's not modified. This way DHT could 
set the time in fops involving multiple afr subvolumes. For other fops, 
it would be afr that sets the time. It could also be set directly by the top
most xlator (fuse), but that time could be incorrect because lower 
xlators could delay the fop execution and reorder it. This would need 
more thinking.


That xattr will be received by storage/posix. This xlator will determine 
what times need to be modified and will change them. In the case of a 
write, it can decide to modify mtime and, maybe, atime. For a mkdir or 
create, it will set the times of the new file/directory and also the 
mtime of the parent directory. It depends on the specific fop being 
processed.


mtime, atime and ctime (or even others) could be saved in a special 
posix xattr instead of relying on the file system attributes that cannot 
be modified (at least for ctime).


This solution doesn't require extra fops, so it seems quite clean to me.
The additional I/O needed in posix could be minimized by implementing a
metadata cache in storage/posix that would read all metadata on lookup
and update it on disk only at regular intervals and/or on invalidation.
All fops would read/write into the cache. This would even reduce the
number of I/O operations we are currently doing for each fop.
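
As a rough sketch of the stamping step (assuming the libglusterfs dict API
from dict.h, and using "glusterfs.time" purely as the example key from this
discussion, not an existing xattr):

#include <stdint.h>
#include <sys/time.h>
/* #include "dict.h"  -- from libglusterfs; provides dict_t and
 *                       dict_get_uint64()/dict_set_uint64() */

static int
stamp_fop_time(dict_t *xdata)
{
        uint64_t       usec = 0;
        struct timeval tv   = {0, };

        /* if a higher xlator (e.g. dht above afr) already set the time,
         * leave it untouched */
        if (dict_get_uint64(xdata, "glusterfs.time", &usec) == 0)
                return 0;

        gettimeofday(&tv, NULL);
        usec = (uint64_t)tv.tv_sec * 1000000ULL + (uint64_t)tv.tv_usec;

        /* storage/posix would consume this key and decide which of
         * atime/mtime/ctime (kept in its own xattr) to update for the
         * fop being processed */
        return dict_set_uint64(xdata, "glusterfs.time", usec);
}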


Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client

2016-01-21 Thread Xavier Hernandez
If this message appears way before the volume is unmounted, can you try
to mount the volume manually using this command and repeat the tests?


glusterfs --fopen-keep-cache=off --volfile-server= 
--volfile-id=/ 


This will prevent invalidation requests from being sent to the kernel, so 
there shouldn't be any memory leak even if the worker thread exits 
prematurely.


If that solves the problem, we could try to determine the cause of the 
premature exit and solve it.


Xavi


On 20/01/16 10:08, Oleksandr Natalenko wrote:

Yes, there are a couple of messages like this in my logs too (I guess one
message per each remount):

===
[2016-01-18 23:42:08.742447] I [fuse-bridge.c:3875:notify_kernel_loop] 0-
glusterfs-fuse: kernel notifier loop terminated
===

On Wednesday, 20 January 2016, 09:51:23 EET, Xavier Hernandez wrote:

I'm seeing a similar problem with 3.7.6.

This latest statedump contains a lot of gf_fuse_mt_invalidate_node_t
objects in fuse. Looking at the code I see they are used to send
invalidations to kernel fuse, however this is done in a separate thread
that writes a log message when it exits. On the system I'm seeing the
memory leak, I can see that message in the log files:

[2016-01-18 23:04:55.384873] I [fuse-bridge.c:3875:notify_kernel_loop]
0-glusterfs-fuse: kernel notifier loop terminated

But the volume is still working at this moment, so any future inode
invalidations will leak memory because it was this thread that should
release it.

Can you check if you also see this message in the mount log?

It seems that this thread terminates if write returns any error
different than ENOENT. I'm not sure if there could be any other error
that can cause this.

Xavi

On 20/01/16 00:13, Oleksandr Natalenko wrote:

Here are more RAM usage stats and a statedump of the GlusterFS mount
approaching yet another OOM:

===
root 32495  1.4 88.3 4943868 1697316 ? Ssl  Jan13 129:18 /usr/sbin/glusterfs --volfile-server=server.example.com --volfile-id=volume /mnt/volume
===

https://gist.github.com/86198201c79e927b46bd

1.6G of RAM just for an almost idle mount (we occasionally store Asterisk
recordings there). Triple OOM for 69 days of uptime.

Any thoughts?

On Wednesday, 13 January 2016, 16:26:59 EET, Soumya Koduri wrote:

kill -USR1


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client

2016-01-20 Thread Xavier Hernandez

I'm seeing a similar problem with 3.7.6.

This latest statedump contains a lot of gf_fuse_mt_invalidate_node_t 
objects in fuse. Looking at the code I see they are used to send 
invalidations to kernel fuse, however this is done in a separate thread 
that writes a log message when it exits. On the system I'm seeing the 
memory leak, I can see that message in the log files:


[2016-01-18 23:04:55.384873] I [fuse-bridge.c:3875:notify_kernel_loop] 
0-glusterfs-fuse: kernel notifier loop terminated


But the volume is still working at this moment, so any future inode 
invalidations will leak memory because it was this thread that should 
release it.


Can you check if you also see this message in the mount log?

It seems that this thread terminates if write returns any error 
different than ENOENT. I'm not sure if there could be any other error 
that can cause this.
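
To make the failure mode easier to see, here is a simplified, hypothetical
sketch of such a worker loop (made-up names, not the real
notify_kernel_loop() from fuse-bridge.c):

#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* The worker frees every invalidation node it drains, but exits on the
 * first write error other than ENOENT. After that, producers keep
 * allocating and queueing nodes that nobody will ever dequeue or free,
 * which is the steady leak of invalidate nodes seen in the statedump. */

typedef struct inval_node {
        struct inval_node *next;
        char               buf[64];
        size_t             len;
} inval_node_t;

typedef struct {
        int           fuse_fd;   /* fd to /dev/fuse */
        inval_node_t *head;      /* queue filled by other threads */
} notifier_t;

/* placeholder: would block until a node is queued */
static inval_node_t *dequeue(notifier_t *n);

static void *
notify_loop_sketch(void *arg)
{
        notifier_t *n = arg;

        for (;;) {
                inval_node_t *node = dequeue(n);
                ssize_t       rv   = write(n->fuse_fd, node->buf, node->len);
                int           err  = errno;

                free(node);

                if (rv < 0 && err != ENOENT)
                        break;   /* "kernel notifier loop terminated" */
        }
        return NULL;
}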


Xavi

On 20/01/16 00:13, Oleksandr Natalenko wrote:

Here are more RAM usage stats and a statedump of the GlusterFS mount approaching
yet another OOM:

===
root 32495  1.4 88.3 4943868 1697316 ? Ssl  Jan13 129:18 /usr/sbin/glusterfs --volfile-server=server.example.com --volfile-id=volume /mnt/volume
===

https://gist.github.com/86198201c79e927b46bd

1.6G of RAM just for an almost idle mount (we occasionally store Asterisk
recordings there). Triple OOM for 69 days of uptime.

Any thoughts?

On Wednesday, 13 January 2016, 16:26:59 EET, Soumya Koduri wrote:

kill -USR1



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] tests/basic/ec/ec-3-1.t generated core

2016-01-14 Thread Xavier Hernandez
Is it possible that the volume mount returns before fuse_init() is executed?
If that's true, then the core is generated because, just after mounting
the volume, statedumps are requested to determine when all ec children are
up. The code in fuse's dump assumes that fuse_init() has already been
called when a statedump is generated.


#0  0x7f75dae83137 in fuse_itable_dump (this=0x2079be0) at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/mount/fuse/src/fuse-bridge.c:4988

4988        inode_table_dump(priv->active_subvol->itable,
(gdb) print priv->init_recvd
$14 = 0 '\000'
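
A possible mitigation, sketched only from the code shown here and not a
reviewed patch, would be to skip the itable dump while priv->active_subvol
is still NULL:

/* Sketch of a defensive check: if the statedump arrives before FUSE INIT
 * has been processed (priv->active_subvol still NULL), skip the itable
 * dump instead of dereferencing the NULL pointer. */
static int
fuse_itable_dump(xlator_t *this)
{
        fuse_private_t *priv = NULL;

        if (!this)
                return -1;

        priv = this->private;

        if (!priv || !priv->active_subvol)
                return -1;

        gf_proc_dump_add_section("xlator.mount.fuse.itable");
        inode_table_dump(priv->active_subvol->itable,
                         "xlator.mount.fuse.itable");

        return 0;
}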

Xavi

On 14/01/16 08:33, Xavier Hernandez wrote:

The failure happens when a statedump is generated. For some reason
priv->active_subvol is NULL, causing a segmentation fault:

(gdb) t 1
[Switching to thread 1 (LWP 4179)]
#0  0x7f75dae83137 in fuse_itable_dump (this=0x2079be0) at
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/mount/fuse/src/fuse-bridge.c:4988

4988        inode_table_dump(priv->active_subvol->itable,
(gdb) bt
#0  0x7f75dae83137 in fuse_itable_dump (this=0x2079be0) at
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/mount/fuse/src/fuse-bridge.c:4988

#1  0x7f75e30f8a11 in gf_proc_dump_xlator_info (top=0x2079be0) at
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/libglusterfs/src/statedump.c:506

#2  0x7f75e30f96e9 in gf_proc_dump_info (signum=10, ctx=0x2055010)
at
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/libglusterfs/src/statedump.c:832

#3  0x00409894 in glusterfs_sigwaiter (arg=0x7ffceb7dba50) at
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/glusterfsd/src/glusterfsd.c:2015

#4  0x7f75e23a6a51 in start_thread () from ./lib64/libpthread.so.0
#5  0x7f75e1d1093d in clone () from ./lib64/libc.so.6
(gdb) list
4983                return -1;
4984
4985        priv = this->private;
4986
4987        gf_proc_dump_add_section("xlator.mount.fuse.itable");
4988        inode_table_dump(priv->active_subvol->itable,
4989                         "xlator.mount.fuse.itable");
4990
4991        return 0;
4992    }
(gdb) print priv->active_subvol
$5 = (xlator_t *) 0x0

Does this sound familiar to anyone?

Xavi

On 14/01/16 08:08, Xavier Hernandez wrote:

I'm looking at it.

On 14/01/16 08:03, Atin Mukherjee wrote:

[1] has caused a regression failure with a core from the mentioned test.
Mind having a look?

[1]
https://build.gluster.org/job/rackspace-regression-2GB-triggered/17579/consoleFull



Thanks,
Atin


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] tests/basic/ec/ec-3-1.t generated core

2016-01-13 Thread Xavier Hernandez

I'm looking at it.

On 14/01/16 08:03, Atin Mukherjee wrote:

[1] has caused a regression failure with a core from the mentioned test.
Mind having a look?

[1]
https://build.gluster.org/job/rackspace-regression-2GB-triggered/17579/consoleFull

Thanks,
Atin


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] tests/basic/ec/ec-3-1.t generated core

2016-01-13 Thread Xavier Hernandez
The failure happens when a statedump is generated. For some reason 
priv->active_subvol is NULL, causing a segmentation fault:


(gdb) t 1
[Switching to thread 1 (LWP 4179)]
#0  0x7f75dae83137 in fuse_itable_dump (this=0x2079be0) at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/mount/fuse/src/fuse-bridge.c:4988

4988        inode_table_dump(priv->active_subvol->itable,
(gdb) bt
#0  0x7f75dae83137 in fuse_itable_dump (this=0x2079be0) at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/mount/fuse/src/fuse-bridge.c:4988
#1  0x7f75e30f8a11 in gf_proc_dump_xlator_info (top=0x2079be0) at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/libglusterfs/src/statedump.c:506
#2  0x7f75e30f96e9 in gf_proc_dump_info (signum=10, ctx=0x2055010) 
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/libglusterfs/src/statedump.c:832
#3  0x00409894 in glusterfs_sigwaiter (arg=0x7ffceb7dba50) at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/glusterfsd/src/glusterfsd.c:2015

#4  0x7f75e23a6a51 in start_thread () from ./lib64/libpthread.so.0
#5  0x7f75e1d1093d in clone () from ./lib64/libc.so.6
(gdb) list
4983                return -1;
4984
4985        priv = this->private;
4986
4987        gf_proc_dump_add_section("xlator.mount.fuse.itable");
4988        inode_table_dump(priv->active_subvol->itable,
4989                         "xlator.mount.fuse.itable");
4990
4991        return 0;
4992    }
(gdb) print priv->active_subvol
$5 = (xlator_t *) 0x0

Does this sound familiar to anyone?

Xavi

On 14/01/16 08:08, Xavier Hernandez wrote:

I'm looking at it.

On 14/01/16 08:03, Atin Mukherjee wrote:

[1] has caused a regression failure with a core from the mentioned test.
Mind having a look?

[1]
https://build.gluster.org/job/rackspace-regression-2GB-triggered/17579/consoleFull


Thanks,
Atin


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

