[Gluster-devel] Suggest how to recognize the time when heal is triggered by using events

2017-07-04 Thread Taehwa Lee
Hello, Karampuri.


My co-workers and I have been developing products based on GlusterFS for
almost two years.

We have run into a problem: our products cannot detect when a heal is
triggered.

Healing definitely affects the performance of a GlusterFS volume.

So we should monitor whether healing is in progress.


To monitor it, I think the events API is one of the best ways.

So I have created an issue, including a patch, on Bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1467543
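
To illustrate the idea, here is a minimal sketch of how such an event
could be emitted from the heal path using the gf_event() helper in
libglusterfs; EVENT_HEAL_STARTED is only a placeholder name, the
identifier in the actual patch may differ:

/* Sketch only: emitting a heal event through libglusterfs' events API.
 * EVENT_HEAL_STARTED is a placeholder; the real event type would be
 * added to the generated event list by the patch. */
#include "events.h"        /* gf_event() */

static void
notify_heal_started(const char *volname, const char *brick)
{
        /* gluster-eventsapi delivers this to registered webhooks as JSON */
        gf_event(EVENT_HEAL_STARTED, "volume=%s;brick=%s", volname, brick);
}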




Could I get some feedback?

Thanks in advance.


Best regards.


-
이 태 화
Taehwa Lee
Gluesys Co.,Ltd.
alghost@gmail.com
010-3420-6114, 070-8785-6591
-


[Gluster-devel] Review request for several Gluster/NFS changes

2017-07-04 Thread Niels de Vos
Hello,

I'd like to have some reviews for the following changes:

nfs: make nfs3_call_state_t refcounted
- https://review.gluster.org/17696

nfs/nlm: unref fds in nlm_client_free()
- https://review.gluster.org/17697

nfs/nlm: handle reconnect for non-NLM4_LOCK requests
- https://review.gluster.org/17698

nfs/nlm: use refcounting for nfs3_call_state_t
- https://review.gluster.org/17699

nfs/nlm: keep track of the call-state and frame for notifications
- https://review.gluster.org/17700


These changes prevent some unfortunate use-after-free bugs in certain
(un)lock situations that cthon04 can expose.
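
As a rough illustration of the refcounting pattern these changes apply
to nfs3_call_state_t (this is a generic sketch, not the actual gNFS
code; names are illustrative):

#include <stdlib.h>

/* The call-state is freed only when the last reference is dropped, so
 * a late NLM reply can no longer touch freed memory. Real code would
 * protect the counter with a lock or use atomics. */
struct call_state {
        int refcount;
        /* ... request data, fds, frame ... */
};

static struct call_state *
cs_new(void)
{
        struct call_state *cs = calloc(1, sizeof(*cs));
        if (cs)
                cs->refcount = 1;      /* creator holds one reference */
        return cs;
}

static struct call_state *
cs_ref(struct call_state *cs)
{
        cs->refcount++;
        return cs;
}

static void
cs_unref(struct call_state *cs)
{
        if (--cs->refcount == 0)
                free(cs);              /* release exactly once */
}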

Thanks!
Niels


[Gluster-devel] Fedora Package Request for gluster-block

2017-07-04 Thread Niels de Vos
Hi,

gluster-block is ready to be packaged into Fedora. For this, I have
posted a package review request:
  https://bugzilla.redhat.com/show_bug.cgi?id=1467677

In case there is a Gluster developer who would like to review it, now
is your chance!

Thanks,
Niels



[Gluster-devel] Coverity covscan for 2017-07-04-a14475fa (master branch)

2017-07-04 Thread staticanalysis
GlusterFS Coverity covscan results are available from
http://download.gluster.org/pub/gluster/glusterfs/static-analysis/master/glusterfs-coverity/2017-07-04-a14475fa


[Gluster-devel] Compilation with gcc 7.x

2017-07-04 Thread Csaba Henk
Hi list,

I've compiled glusterfs with gcc 7.x (to be precise, with 7.1.1),
which is soon to get its prime time as the C compiler of
Fedora 26.

The release notes (https://gcc.gnu.org/gcc-7/changes.html)
describe a broad list of new and improved warnings, and that
shows: while with gcc 6.x the only warning I got was "lchmod is
not implemented and will always fail", with gcc 7.x I got 218
warnings altogether. For reference, I attach the excerpted
warnings from the compilation output.
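
Many of the new diagnostics are of the -Wimplicit-fallthrough kind
(enabled by -Wextra in gcc 7); as a small illustration (not taken from
the attached log), gcc 7 warns on code like this unless the
fallthrough is annotated with a recognized comment:

/* gcc 7 warns on the implicit fallthrough from case 0 to case 1;
 * the "/* fall through * /" comment below silences it. */
int classify(int op)
{
        int flags = 0;

        switch (op) {
        case 0:
                flags |= 1;
                /* fall through */
        case 1:
                flags |= 2;
                break;
        default:
                break;
        }
        return flags;
}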

Are you aware of this? Is there any plan what to do about it?

Csaba


[Attachment: glusterfs-v3.12dev-187-g89faa4661-build-warnings.log.gz (GNU Zip compressed data)]

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-07-04 Thread Xavier Hernandez

Hi Pranith,

On 03/07/17 08:33, Pranith Kumar Karampuri wrote:

Xavi,
  Now that the change has been reverted, we can resume this
discussion and decide on the exact format that considers tier, dht,
afr and ec. The people working on geo-rep/dht/afr/ec had an internal
discussion and we all agreed that this proposal would be a good way
forward. I think once we agree on the format, decide on the initial
encoding/decoding functions for the xattr, and merge this change, we
can send patches for afr/ec/dht and geo-rep to take it to closure.

Could you propose the new format you have in mind that considers all of
the xlators?


My idea was to create a new xattr not bound to any particular function,
but one that could provide enough information to be used in many places.


Currently we have another attribute, glusterfs.pathinfo, that returns
hierarchical information about the location of a file. Maybe we can
extend this to unify all these attributes into a single feature that
can be used for multiple purposes.
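
For reference, pathinfo is already readable as a virtual xattr from a
client mount; a minimal sketch, assuming a file on a FUSE mount and the
usual trusted.glusterfs.pathinfo key:

#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Sketch: fetching the pathinfo virtual xattr. The mount path is an
 * assumption; the 64KB buffer matches the xattr size limit discussed
 * below. */
int main(void)
{
        static char buf[65536];
        ssize_t len = getxattr("/mnt/glustervol/file",
                               "trusted.glusterfs.pathinfo",
                               buf, sizeof(buf) - 1);
        if (len < 0) {
                perror("getxattr");
                return 1;
        }
        buf[len] = '\0';
        printf("%s\n", buf);
        return 0;
}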


Since we have time to discuss it, I would like to design it with more
information than we have talked about so far.


First of all, the amount of information this attribute can contain is
quite big if we expect to have volumes with thousands of bricks. Even
in the simplest case of returning only a UUID, we can easily go beyond
the 64KB limit.


Consider also, for example, what shard should return when pathinfo is
requested for a file. It should probably return a list of shards, each
one with all its associated pathinfo. We are talking about big amounts
of data here.


I don't think this kind of information fits very well in an extended
attribute. Another thing to consider is that the requester most
probably needs only a fragment of the data, so we would be generating
big amounts of data only for it to be parsed and reduced later,
discarding most of it.


What do you think about using a special virtual file to manage all
this information? It could be read using normal read fops, so it could
handle big amounts of data easily. Also, by accessing only some parts
of the file we could go directly to what we want, avoiding reading all
the remaining data.


A very basic idea could be this:

Each xlator would have a reserved area of the file. We can reserve up
to 4GB per xlator (32 bits); the remaining 32 bits of the offset would
indicate the xlator we want to access.
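
A sketch of this offset encoding (the helper names are made up):

#include <stdint.h>

/* Proposed split of the 64-bit file offset: the upper 32 bits select
 * the xlator, the lower 32 bits address up to 4GB within its area. */
static inline uint64_t
xlinfo_offset(uint32_t xlator_id, uint32_t local_off)
{
        return ((uint64_t)xlator_id << 32) | local_off;
}

static inline uint32_t
xlinfo_xlator_id(uint64_t off)
{
        return (uint32_t)(off >> 32);
}

static inline uint32_t
xlinfo_local_off(uint64_t off)
{
        return (uint32_t)off;
}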


At offset 0 we would have generic information about the volume. One of
the things this information should include is a basic hierarchy of the
whole volume and the offset of each xlator.


After reading this, the user would seek to the desired offset and read
the information related to the xlator it is interested in.


All the information should be stored in an easily extensible format
that remains compatible even if new information is added in the future
(for example, by doing special mappings of the 32-bit offsets reserved
for the xlator).


For example, we could reserve the first megabyte of each xlator area
for a mapping of attributes to their respective offsets.
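
For instance, that first-megabyte map could be a simple fixed-size
table; this layout is purely a strawman to make the idea concrete:

#include <stdint.h>

/* Strawman layout for the per-xlator attribute map kept in the first
 * megabyte of its area: each entry maps an attribute name to the
 * offset/length of its value within the same xlator area. */
struct xlinfo_attr_entry {
        char     name[56];      /* NUL-terminated attribute name */
        uint32_t value_off;     /* offset of the value in the xlator area */
        uint32_t value_len;     /* encoded length of the value */
};                              /* 64 bytes -> 16384 entries per MB */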


I think that using a binary format would simplify all this a lot.

Do you think this is worth exploring, or should I stop wasting time here?

Xavi





On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya wrote:



On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez wrote:

That's ok. I'm currently unable to write a patch for this on ec.

Sunil is working on this patch.

~Karthik

If no one can do it, I can try to do it in 6 - 7 hours...

Xavi


On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri wrote:




On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez wrote:

I'm ok with reverting the node-uuid content to the previous
format and creating a new xattr for the new format.
Currently, only rebalance will use it.

The only thing to consider is what can happen if we have a
half-upgraded cluster where some clients have this change
and some don't. Can rebalance work in this situation? If
so, could there be any issue?


I think there shouldn't be any problem, because this is an
in-memory xattr, so layers below afr/ec will only see the node-uuid
xattr.
This also gives us a chance to do whatever we want with this xattr
in the future, without any backward-compatibility problems.

You can check
https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
for how Karthik implemented this in AFR (this got

Re: [Gluster-devel] Disperse volume : Sequential Writes

2017-07-04 Thread Xavier Hernandez

Hi Pranith,

On 03/07/17 05:35, Pranith Kumar Karampuri wrote:

Ashish, Xavi,
   I think it is better to implement this change as a separate
read-after-write caching xlator which we can load between EC and the
client xlator. That way EC will not gain a lot more functionality than
necessary, and maybe this xlator can be used somewhere else in the
stack if possible.


While this seems a good way to separate functionalities, it has a big
problem. If we add a caching xlator between ec and *all* of its
subvolumes, it will only be able to cache encoded data. So, when ec
needs the "cached" data, it will still have to issue a request to each
of its subvolumes and compute the decoded data before being able to
use it, so we don't avoid the decoding overhead.


Also, if we want to make the xlator generic, it will probably cache a
lot more data than ec really needs, increasing the memory footprint
considerably for no real benefit.


Additionally, this new xlator will need to guarantee that the cached
data is current, so it will either need its own locking logic (another
copy of the logic that already exists in the current xlators), which is
slow and difficult to maintain, or it will need to intercept and reuse
locking calls from parent xlators, which can be quite complex since we
have multiple xlator levels where locks can be taken, not only ec.


This is a relatively simple change to make inside ec, but a very
complex one (IMO) if we want to do it as a stand-alone xlator that is
generic enough to be reused and to work safely in other places of the
stack.


If we want to separate functionalities, I think we should create a new
kind of xlator that is transversal to the "traditional" xlator stack.


Current xlators are linear in the sense that each one operates at only
one place (it can be moved by reconfiguration, but once instantiated it
always works at the same place) and passes data to the next one.


A transversal xlator (or maybe "service xlator" would be a better
name) would be one not bound to any place in the stack, but usable by
all other xlators to implement some service, like caching,
multithreading, locking, ... These are features that many xlators need
but cannot use easily (nor efficiently) if they are implicitly
implemented in some specific place of the stack outside their control.


The transaction framework we already talked about could be thought of
as one of these service xlators. Multithreading could also benefit from
this approach, because xlators would have more control over which
things can be processed by a background thread and which cannot.
Probably there are other features that could benefit from this approach
as well.


In the case of brick multiplexing, if some xlators were removed from
each stack and loaded as global services, the memory footprint would
most probably be lower and resource usage more optimized.


Just an idea...

Xavi



On Fri, Jun 16, 2017 at 4:19 PM, Ashish Pandey wrote:


I think it should be done as we have agreement on basic design.



Re: [Gluster-devel] Disperse volume : Sequential Writes

2017-07-04 Thread Ashish Pandey

I think it is a good idea.
Maybe we can add more enhancements to this xlator to improve things in the future.

- Original Message -

From: "Pranith Kumar Karampuri"  
To: "Ashish Pandey"  
Cc: "Xavier Hernandez" , "Gluster Devel" 
 
Sent: Monday, July 3, 2017 9:05:54 AM 
Subject: Re: [Gluster-devel] Disperse volume : Sequential Writes 

Ashish, Xavi,
I think it is better to implement this change as a separate
read-after-write caching xlator which we can load between EC and the
client xlator. That way EC will not gain a lot more functionality than
necessary, and maybe this xlator can be used somewhere else in the
stack if possible.

On Fri, Jun 16, 2017 at 4:19 PM, Ashish Pandey < aspan...@redhat.com > wrote: 




I think it should be done as we have agreement on basic design. 


From: "Pranith Kumar Karampuri" < pkara...@redhat.com > 
To: "Xavier Hernandez" < xhernan...@datalab.es > 
Cc: "Ashish Pandey" < aspan...@redhat.com >, "Gluster Devel" < 
gluster-devel@gluster.org > 
Sent: Friday, June 16, 2017 3:50:09 PM 
Subject: Re: [Gluster-devel] Disperse volume : Sequential Writes 




On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez < xhernan...@datalab.es > 
wrote: 


On 16/06/17 10:51, Pranith Kumar Karampuri wrote: 




On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez 
< xhernan...@datalab.es > wrote: 

On 15/06/17 11:50, Pranith Kumar Karampuri wrote: 



On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey <aspan...@redhat.com> wrote:

Hi All,

We have been facing some issues in disperse (EC) volumes. We know that
EC is currently not good for random IO, since it requires a
READ-MODIFY-WRITE fop cycle whenever an offset and offset+length fall
in the middle of a stripe.

Unfortunately, this can also happen with sequential writes. Consider an
EC volume with configuration 4+2. The stripe size for this would be
512 * 4 = 2048; that is, 2048 bytes of user data are stored in one
stripe. Let's say 2048 + 512 = 2560 bytes are already written on this
volume, so 512 bytes sit in the second stripe. Now, if there is a
sequential write at offset 2560 of size 1 byte, we have to read the
whole stripe, encode it with the 1 byte, and then write it back. The
next write, at offset 2561 with a size of 1 byte, will again
READ-MODIFY-WRITE the whole stripe. This causes bad performance.

There are some tools and scenarios where this kind of load occurs and
users are not aware of it. Examples: fio and zip.

Solution:
One possible solution to this issue is to keep the last stripe in
memory. This way, we don't need to read it again, and we save a READ
fop going over the network. Considering the above example, we would
have to keep at most the last 2048 bytes in memory per file. This
should not be a big deal, as we already keep some data like xattrs and
size info in memory and take decisions based on that.

Please provide your thoughts on this, and also any other solution you
may have.
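
As a minimal standalone sketch of the arithmetic above (the numbers
follow the 4+2 example; everything here is illustrative):

#include <stdint.h>
#include <stdio.h>

/* 4+2 disperse volume: 4 data fragments of 512 bytes form a 2048-byte
 * stripe. Any write that does not start and end on a stripe boundary
 * forces a read of the whole stripe before encoding and writing back. */
#define FRAGMENT_SIZE 512u
#define DATA_BRICKS 4u
#define STRIPE_SIZE (FRAGMENT_SIZE * DATA_BRICKS) /* 2048 */

int main(void)
{
        uint64_t offset = 2560, len = 1; /* the 1-byte write above */

        if (offset % STRIPE_SIZE != 0 || (offset + len) % STRIPE_SIZE != 0)
                printf("write [%llu,%llu) triggers READ-MODIFY-WRITE of "
                       "stripe %llu\n",
                       (unsigned long long)offset,
                       (unsigned long long)(offset + len),
                       (unsigned long long)(offset / STRIPE_SIZE));
        return 0;
}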


Just adding more details:
the stripe will be in memory only while a lock on the inode is active.


I think that's ok. 

One thing we are yet to decide on is: do we want to read the stripe
every time we get the lock, or only after an extending write is
performed? I think keeping the stripe in memory just after an extending
write is better, as it doesn't involve an extra network operation.


I wouldn't read the last stripe unconditionally every time we lock 
the inode. There's no benefit at all on random writes (in fact it's 
worse) and a sequential write will issue the read anyway when 
needed. The only difference is a small delay for the first operation 
after a lock. 


Yes, perfect. 



What I would do is keep the last stripe of every write (we could
consider doing it per fd), even if it's not the last stripe of the file
(to also optimize sequential rewrites).


Ah! Good point. But if we remember it per fd, one fd's cached data can
be overwritten on disk by another fd, so we also need to do cache
invalidation.



We only cache data if we hold the inodelk, so all related fds must be
from the same client, and we control all their writes, so cache
invalidation in this case is pretty easy.

There is also the possibility of two fds from the same client writing
to the same region. To handle this we would need some range checking on
the writes, but all of this is local, so it's easy to control.

Anyway, this is probably not a common case, so we could start by caching only 
the last stripe of the last write, ignoring the fd. 
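
A rough shape of the cached-stripe state we are discussing
(illustrative only, not actual ec code):

#include <stdint.h>

/* One cached stripe per inode, valid only while the inodelk is held
 * by this client. */
struct ec_stripe_cache {
        uint64_t offset;        /* stripe-aligned offset of cached data */
        uint32_t size;          /* stripe size, e.g. 2048 for 4+2 */
        uint8_t *data;          /* decoded (pre-encoding) stripe contents */
        int      valid;
};

/* A conflicting write from another fd that overlaps the cached range
 * would have to refresh or invalidate the cache. */
static int
stripe_cache_overlaps(const struct ec_stripe_cache *c,
                      uint64_t off, uint64_t len)
{
        return c->valid && off < c->offset + c->size &&
               off + len > c->offset;
}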



Maybe the implementation should consider this possibility. I have yet
to think about how to do it, but it is a good point; we should consider
it.


Maybe we could keep a list of cached stripes in the inode, sorted by
offset (if the maximum number of entries is small, we could keep the
list unsorted). Each fd should store the offset of the last write. Cached