Re: [Gluster-devel] Wanted - 3.7.5 release manager

2015-09-02 Thread Pranith Kumar Karampuri



On 09/02/2015 07:33 PM, Vijay Bellur wrote:

On Wednesday 02 September 2015 06:38 PM, Atin Mukherjee wrote:

IIRC, Pranith already volunteered for it in one of the last community
meetings?



Thanks Atin. I do recollect it now.

Pranith - can you confirm being the release manager for 3.7.5?

Yes, I can do this.

Pranith


-Vijay


-Atin
Sent from one plus one

On Sep 2, 2015 6:00 PM, "Vijay Bellur" wrote:

Hi All,

We have been rotating release managers for minor releases in the
3.7.x train. We just released 3.7.4 and are looking for volunteers
to be release managers for 3.7.5 (scheduled for 30th September). If
anybody is interested in volunteering, please drop a note here.

Thanks,
Vijay







Re: [Gluster-devel] Roadmap for afr, ec

2015-09-17 Thread Pranith Kumar Karampuri



On 09/16/2015 03:42 PM, fanghuang.d...@yahoo.com wrote:

Hi Pranith,

For the EC encoding/decoding algorithm, could we design a plug-in mechanism so
that users can choose their own algorithm or use a third-party library, just
like Ceph? I am also curious why the IDA algorithm was chosen originally,
instead of the commonly used Reed-Solomon algorithm?

Pluggability of algorithms is also in the plan. I never really bothered to
check which algorithm was used, and was under the impression that we are
using non-systematic Reed-Solomon erasure codes, as told to me by Dan (CCed).


Pranith
  
Best Regards,

Fang Huang



On Monday, 14 September 2015, 16:30, Pranith Kumar Karampuri 
<pkara...@redhat.com> wrote:

hi,

Here is a list of common improvements for both ec and afr planned over
the next few months:

1) Granular entry self-heals.
   Both afr and ec at the moment do a lot of readdirs and lookups to
figure out the differences between the directories to perform heals.
Kritika, Ravi, Anuradha and I are discussing how to prevent this.
The base algorithm is to store only the names that need heal in
.glusterfs/indices/entry-changes// as links to the base
file in .glusterfs/indices/entry-changes of the bricks. So only the
names that need to be healed will go through name heals.
We definitely want to complete this for 3.8.

2) Granular data self-heals.
   At the moment, even if a single byte changes in the file, afr and ec
read the entire file to fix the problem. We are thinking of preventing
this by remembering where the changes happened on the file in extended
attributes. There will be a new extended attribute on the file which
represents a bitmap of the changes, and each bit represents a range that
needs healing. This extended attribute has a maximum size it can
represent; the extra chunks will be represented like shards in
.glusterfs/indices/data-changes/<gfid->, and the extended attribute on
each such block will store the ranges that need heals.

For example: if the extended attribute value has a maximum size of 4KB and
each bit represents 128KB (i.e. the first bit represents changes done from
offset 0-128KB, the 2nd bit 128KB+1-256KB, etc.), then in a single extended
attribute we can store changes happening to a file up to 4GB (we are
thinking of dynamically increasing the size represented by each bit from
say 4k to 128k, but this is still in design). Changes happening from
offset 4GB+1 to 8GB will be stored in the extended attribute of
.glusterfs/indices/data-changes/. Changes happening
from offset 8GB+1 to 12GB will be stored in the extended attribute of
.glusterfs/indices/data-changes/ (please note that
these files are empty; they just contain extended attributes), etc.
We want to complete this for 3.8 (stretch goal).

3) Performance & throttling improvements for self-heal:
   We are also looking into the multi-threaded self-heal daemon patch
by Richard for inclusion in 3.8. We are waiting for the discussions by
Raghavendra G on QoS to be over before coming to any decisions on
throttling.

After we have compound fops:
The goal here is to come up with compound fops and prevent unnecessary
round trips:
4) Transaction latency improvements:
   On afr:
In the unoptimized version of the transaction we have: 1) Lock, 2)
Pre-op 3) op 4) Post-op 5) unlock.
We will have: 1) Lock, 2) Pre-op + op 3) post-op + unlock.
 This reduces round trips from 5 to 3 in the un-optimized version
of the afr transaction.
   On EC:
In the unoptimized version (worst case of an unaligned write) of the
transaction we have: 1) Lock, 2) get version, size xattrs 3) reads of
pre, post unaligned chunks 4) op 5) update version, size 6) unlock.
We will have: 1) Lock + get version, size xattrs + reads of pre, post
unaligned chunks, 2) op 3) update version, size + unlock.
 This reduces round trips from 6 to 3 in the un-optimized version
of the ec transaction.

5) Entry self-heal per name latency improvements:
  Before: 1) Lock, 2) lookup to determine if the file needs to be
deleted/created 3) create/delete 4) Unlock
  After: 1) Lock + lookup 2) delete/create + unlock

Roadmap that applies only for EC: for 3.8
- Use SSE2/AVX/NEON extensions when available to speed up Galois Field
calculations
- Use a systematic matrix to improve encoding performance (it will also
improve decoding performance when all bricks are healthy)
- Implement a new algorithm able to detect and repair chunks of data on
the fly.

Roadmap that applies only for AFR:
1) Once granular entry/data heals and throttling are in, we can look at
generalizing Richard's lazy replication patch to be used for near-
synchronous replication between data centers, and possibly just the
bricks; I haven't looked into the patch myself yet.

We will be sending out more mails as soon as design completes for each
of these items. We are eagerly waiting for Xavi to come back to get his
comments as well for how EC will be impacted by the common changes.

[Gluster-devel] tracker bug for 3.7.5 is created

2015-09-09 Thread Pranith Kumar Karampuri

hi,
 Please use 
https://bugzilla.redhat.com/show_bug.cgi?id=glusterfs-3.7.5 for tracking 
bug fixes that need to get into 3.7.5 release.


Pranith


[Gluster-devel] Roadmap for afr, ec

2015-09-14 Thread Pranith Kumar Karampuri

hi,

Here is a list of common improvements for both ec and afr planned over 
the next few months:


1) Granular entry self-heals.
 Both afr and ec at the moment do a lot of readdirs and lookups to
figure out the differences between the directories to perform heals.
Kritika, Ravi, Anuradha and I are discussing how to prevent this.
The base algorithm is to store only the names that need heal in
.glusterfs/indices/entry-changes// as links to the base
file in .glusterfs/indices/entry-changes of the bricks. So only the
names that need to be healed will go through name heals.

We definitely want to complete this for 3.8.

2) Granular data self-heals.
 At the moment, even if a single byte changes in the file, afr and ec
read the entire file to fix the problem. We are thinking of preventing
this by remembering where the changes happened on the file in extended
attributes. There will be a new extended attribute on the file which
represents a bitmap of the changes, and each bit represents a range that
needs healing. This extended attribute has a maximum size it can
represent; the extra chunks will be represented like shards in
.glusterfs/indices/data-changes/, and the extended attribute on
each such block will store the ranges that need heals.


For example: if the extended attribute value has a maximum size of 4KB and
each bit represents 128KB (i.e. the first bit represents changes done from
offset 0-128KB, the 2nd bit 128KB+1-256KB, etc.), then in a single extended
attribute we can store changes happening to a file up to 4GB (we are
thinking of dynamically increasing the size represented by each bit from
say 4k to 128k, but this is still in design). Changes happening from
offset 4GB+1 to 8GB will be stored in the extended attribute of
.glusterfs/indices/data-changes/. Changes happening
from offset 8GB+1 to 12GB will be stored in the extended attribute of
.glusterfs/indices/data-changes/ (please note that
these files are empty; they just contain extended attributes), etc.

We want to complete this for 3.8 (stretch goal).
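
To make the offset-to-bit arithmetic above concrete, here is a minimal,
hypothetical C sketch (not the actual heal code; the constants and names only
reflect the 4KB-xattr/128KB-per-bit assumptions used in the example) that maps
a changed byte offset to the data-changes chunk file index and the bit inside
its extended attribute:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define BIT_GRANULARITY (128ULL * 1024)                     /* bytes covered by one bit      */
#define BITS_PER_XATTR  (4ULL * 1024 * 8)                   /* 4KB xattr value => 32768 bits */
#define BYTES_PER_CHUNK (BIT_GRANULARITY * BITS_PER_XATTR)  /* 4GB tracked per chunk file    */

/* Which data-changes chunk file and which bit in its xattr cover 'offset'. */
static void
dirty_bit_for_offset (uint64_t offset, uint64_t *chunk, uint64_t *bit)
{
        *chunk = offset / BYTES_PER_CHUNK;
        *bit   = (offset % BYTES_PER_CHUNK) / BIT_GRANULARITY;
}

int
main (void)
{
        uint64_t chunk = 0, bit = 0;

        /* A write at offset 5GB lands in the second chunk file (index 1),
         * dirtying bit 8192 of that chunk's extended attribute. */
        dirty_bit_for_offset (5ULL * 1024 * 1024 * 1024, &chunk, &bit);
        printf ("chunk=%" PRIu64 " bit=%" PRIu64 "\n", chunk, bit);
        return 0;
}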

3) Performance & throttling improvements for self-heal:
 We are also looking into the multi-threaded self-heal daemon patch 
by Richard for inclusion in 3.8. We are waiting for the discussions by 
Raghavendra G on QoS to be over before coming to any decisions on 
throttling.


After we have compound fops:
The goal here is to come up with compound fops and prevent unnecessary
round trips:

4) Transaction latency improvements:
 On afr:
  In the unoptimized version of the transaction we have: 1) Lock, 2)
Pre-op 3) op 4) Post-op 5) unlock.
  We will have: 1) Lock, 2) Pre-op + op 3) post-op + unlock.
   This reduces round trips from 5 to 3 in the un-optimized version
of the afr transaction.

 On EC:
  In the unoptimized version (worst case of an unaligned write) of the
transaction we have: 1) Lock, 2) get version, size xattrs 3) reads of
pre, post unaligned chunks 4) op 5) update version, size 6) unlock.
  We will have: 1) Lock + get version, size xattrs + reads of pre, post
unaligned chunks, 2) op 3) update version, size + unlock.
   This reduces round trips from 6 to 3 in the un-optimized version
of the ec transaction.


5) Entry self-heal per name latency improvements:
Before: 1) Lock, 2) lookup to determine if the file needs to be 
deleted/created 3) create/delete 4) Unlock

After: 1) Lock + lookup 2) delete/create + unlock

Roadmap that applies only to EC (for 3.8):
- Use SSE2/AVX/NEON extensions when available to speed up Galois Field
calculations (a scalar sketch of the basic multiply follows this list)
- Use a systematic matrix to improve encoding performance (it will also 
improve decoding performance when all bricks are healthy)
- Implement a new algorithm able to detect and repair chunks of data on 
the fly.
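
For reference on the Galois Field item above, this is a purely illustrative
scalar multiply in GF(2^8) over the 0x11d polynomial commonly used by
Reed-Solomon codes; it is not gluster's ec code, just the basic operation that
the SSE2/AVX/NEON work would vectorize:

/* Illustrative only: scalar multiply in GF(2^8) modulo x^8+x^4+x^3+x^2+1. */
static unsigned char
gf256_mul (unsigned char a, unsigned char b)
{
        unsigned char p = 0;
        int i;

        for (i = 0; i < 8; i++) {
                if (b & 1)
                        p ^= a;                               /* add (xor) when the low bit of b is set */
                if (a & 0x80)
                        a = (unsigned char)((a << 1) ^ 0x1d); /* reduce on overflow past x^7 */
                else
                        a <<= 1;
                b >>= 1;
        }
        return p;
}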


Roadmap that applies only for AFR:
1) Once granular entry/data heals and throttling are in, we can look at
generalizing Richard's lazy replication patch to be used for near-
synchronous replication between data centers, and possibly just the
bricks; I haven't looked into the patch myself yet.


We will be sending out more mails as soon as design completes for each 
of these items. We are eagerly waiting for Xavi to come back to get his 
comments as well for how EC will be impacted by the common changes. 
Feedback on this plan is very welcome!


Pranith


[Gluster-devel] 3.7.5 tagging deadline approaching

2015-09-28 Thread Pranith Kumar Karampuri

Hi all,

3.7.5 is scheduled to be tagged in three days. This cannot be extended as it 
will break release schedules for others.

Please ensure that any changes you want to get into 3.7.5 gets merged
by the deadline. Also make sure to add those bugs to the tracker bug
[1].

There are around 10 recent (less than a month old) reviews open on release-3.7
[2]. Make sure you get your changes merged by the respective
maintainers. I will also merge changes if the component maintainers
have reviewed the change. If there are any changes among this list
that need to get merged, please inform me of them by replying to this
mail.

Pranith

[1] https://bugzilla.redhat.com/show_bug.cgi?id=glusterfs-3.7.5
[2] 
https://review.gluster.org/#/q/project:glusterfs+branch:release-3.7+status:open


PS: Thanks to Kaushal's earlier 3.7.4/3.7.3 mails, I just had to 
copy/paste the mail changing some info :-)

Re: [Gluster-devel] compound fop design first cut

2015-12-07 Thread Pranith Kumar Karampuri



On 12/08/2015 09:02 AM, Pranith Kumar Karampuri wrote:



On 12/08/2015 02:53 AM, Shyam wrote:

Hi,

Why not think along the lines of new FOPs like fop_compound(_cbk)
where the inargs to this FOP are a list of FOPs to execute (either in
order or in any order)?

That is the intent. The question is how we specify the fops that we
want to do and the arguments to each fop. In this approach, for example,
xl_fxattrop_writev() is a new FOP. The list of fops that need to be done
is fxattrop, writev in that order, and the arguments are a union of
the arguments needed to perform fxattrop and writev. The reason
why this fop is not implemented throughout the graph is to avoid changing
most of the stack on the brick side in the first cut of the
implementation, i.e. quota/barrier/geo-rep/io-threads
priorities/bit-rot may otherwise have to implement these new compound fops. We
still get the benefit of avoiding the network round trips.


With a scheme like the above we could,
 - compound any set of FOPs (of course, we need to take care here,
but still the feasibility exists)

It still exists, but the fop space will blow up for each combination.
 - Each xlator can inspect the compound relation and choose to
uncompound them. So if an xlator cannot perform FOPA+B as a single
compound FOP, it can choose to send FOPA and then FOPB and chain up
the responses back to the compound request sent to it. Also, the
intention here would be to leverage existing FOP code in any xlator,
to appropriately modify the inargs
 - The RPC payload is constructed based on existing FOP RPC
definitions, but compounded based on the compound FOP RPC definition

This will be done in phase-3, after learning a bit more about how best
to implement it so that we avoid stuffing arguments in xdata as much as
possible in future. After that we can choose to retire the
compound-fop-sender and receiver xlators.


Possibly on the brick graph as well, pass these down as compounded
FOPs, till someone decides to break it open and do it in phases
(ultimately POSIX xlator).

This will be done in phase-2. At the moment we are not giving any
choice to the xlators on the brick side.


The intention would be to break a compound FOP in case an xlator in
between cannot support it or, even expand a compound FOP request, say
the fxattropAndWrite is an AFR compounding decision, but a compound
request to AFR maybe WriteandClose, hence AFR needs to extend this
compound request.

Yes. There was a discussion with Krutika where, if shard wants to do a
write and then an xattrop in a single fop, then we need dht to implement
dht_writev_fxattrop(), which should look somewhat similar to
dht_writev(), and afr will need to implement afr_writev_fxattrop() as a
full-blown transaction where it takes data+metadata domain locks, does
the data+metadata pre-op, winds to compound_fop_sender_writev_fxattrop(),
and then does the data+metadata post-op and unlocks.


If we were to do writev, fxattrop separately, the fops will be (in the
unoptimized case):

1) finodelk for write
2) fxattrop for preop of write
3) write
4) fxattrop for post op of write
5) unlock for write
6) finodelk for fxattrop
7) fxattrop for preop of shard-fxattrop
8) shard-fxattrop
9) fxattrop for post op of shard-fxattrop
10) unlock for fxattrop

If AFR chooses to implement writev_fxattrop, meaning a data+metadata
transaction:
1) finodelk in data, metadata domain simultaneously (just like we take
multiple locks in rename)
2) preop for data, metadata parts as part of the compound fop
3) writev+fxattrop
4) postop for data, metadata parts as part of the compound fop
5) unlocks simultaneously.

So it is still a 2x reduction in the number of network fops, except
maybe for locking.


The above is just an off-the-cuff thought on the same.

We need to arrive at a consensus about how to specify the list of fops
and their arguments. The reason I went against list_of_fops is to
make it easier to discover possible optimizations per
compound fop (inspired by ec's implementation of multiplications by
all possible elements in the Galois field, where multiplication with a
different number has a different optimization). Could you elaborate
more on the idea you have about list_of_fops and its arguments? Maybe
we can come up with combinations of fops where we can employ this
technique of just list_of_fops and wind. I think the rest of the solutions
you mentioned are where this will converge over time. The intention
is to avoid network round trips without waiting for the whole stack to
change, as much as possible.
Maybe I am overthinking it. Not a lot of combinations could be
transactions. In any case, do let me know what you have in mind.




Pranith


The scheme below seems too specific to my eyes, and looks like we
would be defining specific compound FOPs rather than having the ability
to build generic ones.


On 12/07/2015 04:08 AM, Pranith Kumar Karampuri wrote:

hi,

Draft of the design doc:

Main motivation for the design of this feature

[Gluster-devel] compound fop design first cut

2015-12-07 Thread Pranith Kumar Karampuri

hi,

Draft of the design doc:

The main motivation for the design of this feature is to reduce network
round trips by sending more than one fop in a network operation, preferably
without introducing new rpcs.


There are 2 new xlators: compound-fop-sender and compound-fop-receiver.
compound-fop-sender is going to be loaded on top of each client xlator on
the mount/client, and compound-fop-receiver is going to be loaded below the
server xlator on the bricks. On the mount/client side, from the caller
xlator till the compound-fop-encoder xlator, the xlators can choose to
implement this extra compound fop handling. Once it reaches
"compound-fop-sender", it will try to choose a base fop on which it encodes
the other fop in the base fop's xdata, and winds the base fop to the client
xlator. The client xlator sends the base fop with the encoded xdata to the
server xlator on the brick using the rpc of the base fop.

Once the server xlator does resolve_and_resume(), it will wind the base fop
to the compound-fop-receiver xlator. This xlator will decode the extra fop
from the xdata of the base fop. Based on the order encoded in the xdata, it
executes the separate fops one after the other and stores the cbk response
arguments of both operations. It then encodes the response of the extra fop
on to the base fop's response xdata and unwinds the fop to the server
xlator, which sends the response using the base rpc's response structure.
The client xlator will unwind the base fop to compound-fop-sender, which
will decode the response into the compound fop's response arguments and
unwind to the parent xlators.

I will take the fxattrop+write operation that we want to implement in
afr as an example to explain how things may look.

compound_fop_sender_fxattrop_write (call_frame_t *frame, xlator_t *this,
                                    fd_t *fd,
                                    gf_xattrop_flags_t xattrop_flags,
                                    dict_t *fxattrop_dict,
                                    dict_t *fxattrop_xdata,
                                    struct iovec *vector,
                                    int32_t count,
                                    off_t off,
                                    uint32_t flags,
                                    struct iobref *iobref,
                                    dict_t *writev_xdata)
{
        0) Remember the compound fop and take the base fop as write()
        1) In writev_xdata add the following key,value pairs:
           "xattrop-flags" -> xattrop_flags
           for each fxattrop_dict key  -> "fxattrop-dict-<key>", value
           for each fxattrop_xdata key -> "fxattrop-xdata-<key>", value
           "order" -> "fxattrop, writev"
           "compound-fops" -> "fxattrop"
        2) Wind writev()
}

compound_fop_sender_fxattrop_write_cbk(...)
{
/*decode the response args and call parent_fxattrop_write_cbk*/
}

_fxattrop_write_cbk (call_frame_t *frame, void *cookie,
                     xlator_t *this,
                     int32_t fxattrop_op_ret,
                     int32_t fxattrop_op_errno,
                     dict_t *fxattrop_dict,
                     dict_t *fxattrop_xdata,
                     int32_t writev_op_ret,
                     int32_t writev_op_errno,
                     struct iatt *writev_prebuf,
                     struct iatt *writev_postbuf,
                     dict_t *writev_xdata)
{
        /* ... */
}

compound_fop_receiver_writev (call_frame_t *frame, xlator_t *this,
                              fd_t *fd,
                              struct iovec *vector,
                              int32_t count,
                              off_t off,
                              uint32_t flags,
                              struct iobref *iobref,
                              dict_t *writev_xdata)
{
        0) Check if writev_xdata has "compound-fops", else default_writev()
        1) Decode writev_xdata from the above encoding -> xattrop_flags,
           fxattrop_dict, fxattrop_xdata
        2) Get "order"
        3) Store all the above in 'local'
        4) Wind fxattrop() with
           compound_receiver_fxattrop_cbk_writev_wind() as cbk
}

compound_receiver_fxattrop_cbk_writev_wind (call_frame_t *frame,
                                            void *cookie, xlator_t *this,
                                            int32_t op_ret,
                                            int32_t op_errno,
                                            dict_t *dict, dict_t *xdata)
{
        0) Store the fxattrop cbk args
        1) Perform writev() with writev_params, with
           compound_receiver_writev_cbk() as the 'cbk'
}

compound_receiver_writev_cbk (call_frame_t *frame, void *cookie,
                              xlator_t *this, int32_t op_ret,
                              int32_t op_errno, struct iatt *prebuf,
                              struct iatt *postbuf, dict_t *xdata)
{
        0) Store the writev cbk args
        1) Encode the fxattrop response into writev_xdata with an encoding
           similar to the one in compound_fop_sender_fxattrop_write()
        2) Unwind writev()
}

This example is just to show how things may look, but the actual
implementation may just have all base fops calling a common function to
perform the operations in the order given in the receiver xlator. Yet to
think about that. It is probably better to encode the fop number from
glusterfs_fop_t rather than the fop string in the dictionary (a sketch of
this follows below).
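
Assuming the libglusterfs dict API (dict_set_int32()/dict_set_str()) and
purely illustrative key names, that encoding might look roughly like this:

/* Sketch only: tag the base fop's xdata with the compounded fop as a
 * glusterfs_fop_t number (GF_FOP_FXATTROP) instead of the string "fxattrop".
 * The key names are illustrative, not a settled protocol. */
static int
encode_compound_fop (dict_t *writev_xdata)
{
        int ret = dict_set_int32 (writev_xdata, "compound-fops",
                                  GF_FOP_FXATTROP);
        if (ret)
                return ret;

        /* the order key is still needed so the receiver knows what to wind first */
        return dict_set_str (writev_xdata, "order", "fxattrop,writev");
}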


This is phase-1 of the change because we don't 

Re: [Gluster-devel] glusterfs as dovecot backend storage

2015-12-07 Thread Pranith Kumar Karampuri
Do you mind adding gluster-users to that thread? It would be nice to
know what problems they ran into, so that we can fix them.


Pranith
On 12/07/2015 03:44 PM, Emmanuel Dreyfus wrote:

Hello

In case nobody noticed, there is an ongoing discussion on the dovecot
mailing list about using glusterfs as mail storage.  Some people
ran into trouble and I think knowledgeable hints would be of great
value there.

NB: I did not dare to attempt such a setup, regardless of how
appealing it is. I fear troubles too much. :-)





Re: [Gluster-devel] compound fop design first cut

2015-12-08 Thread Pranith Kumar Karampuri



On 12/09/2015 10:39 AM, Prashanth Pai wrote:
  

However, I’d be even more comfortable with an even simpler approach that
avoids the need to solve what the database folks (who have dealt with
complex transactions for years) would tell us is a really hard problem.
Instead of designing for every case we can imagine, let’s design for the
cases that we know would be useful for improving performance.  Open plus
read/write plus close is an obvious one.  Raghavendra mentions
create+inodelk as well.

 From object interface (Swift/S3) perspective, this is the fop order and flow 
for object operations:

GET: open(), fstat(), fgetxattr()s, read()s, close()

Krutika implemented fstat+fgetxattr (http://review.gluster.org/10180). In
posix there is an implementation of GF_CONTENT_KEY which is used to read
a file in lookup by quick-read. This needs to be exposed for fds as well,
I think. So you can do all this using fstat on an anon-fd.

HEAD: stat(), getxattr()s

Krutika already implemented this for sharding:
http://review.gluster.org/10158. You can do this using the stat fop.

PUT: creat(), write()s, setxattr(), fsync(), close(), rename()

This I think should be a new compound fop. Nothing similar exists.

DELETE: getxattr(), unlink()

This can be clubbed into unlink already, because xdata already exists on
the wire.


Compounding some of these ops and exposing them as consumable libgfapi APIs 
like glfs_get() and glfs_put() similar to librados compound APIs[1] would 
greatly improve performance for object based access.

[1]: https://github.com/ceph/ceph/blob/master/src/include/rados/librados.h#L2219

Thanks.

- Prashanth Pai



Re: [Gluster-devel] compound fop design first cut

2015-12-08 Thread Pranith Kumar Karampuri



On 12/09/2015 06:37 AM, Vijay Bellur wrote:

On 12/08/2015 03:45 PM, Jeff Darcy wrote:




On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) wrote:

Raghavendra Gowdappa writes:
I propose that we define a "compound op" that contains ops.

Within each op, there are fields that can be "inherited" from the
previous op, via use of a sentinel value.

Sentinel is -1, for all of these examples.

So:

LOOKUP (1, "foo") (Sets the gfid value to be picked up by 
compounding, 1

is the root directory, as a gfid, by convention.)
OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound value.)
WRITE(-1, "foo", 3) (Uses the glfd compound value.)
CLOSE(-1) (Uses the glfd compound value)


So, basically, what the programming-language types would call futures
and promises.  It’s a good and well studied concept, which is necessary
to solve the second-order problem of how to specify an argument in
sub-operation N+1 that’s not known until sub-operation N completes.

To be honest, some of the highly general approaches suggested here scare
me too.  Wrapping up the arguments for one sub-operation in xdata for
another would get pretty hairy if we ever try to go beyond two
sub-operations and have to nest sub-operation #3’s args within
sub-operation #2’s xdata which is itself encoded within sub-operation
#1’s xdata.  There’s also not much clarity about how to handle errors in
that model.  Encoding N sub-operations’ arguments in a linear structure
as Shyam proposes seems a bit cleaner that way.  If I were to continue
down that route I’d suggest just having start_compound and end-compound
fops, plus an extra field (or by-convention xdata key) that either the
client-side or server-side translator could use to build whatever
structure it wants and schedule sub-operations however it wants.

However, I’d be even more comfortable with an even simpler approach that
avoids the need to solve what the database folks (who have dealt with
complex transactions for years) would tell us is a really hard problem.
Instead of designing for every case we can imagine, let’s design for the
cases that we know would be useful for improving performance. Open plus
read/write plus close is an obvious one.  Raghavendra mentions
create+inodelk as well.  For each of those, we can easily define a
structure that contains the necessary fields, we don’t need a
client-side translator, and the server-side translator can take care of
“forwarding” results from one sub-operation to the next.  We could even
use GF_FOP_IPC to prototype this.  If we later find that the number of
“one-off” compound requests is growing too large, then at least we’ll
have some experience to guide our design of a more general alternative.
Right now, I think we’re trying to look further ahead than we can see
clearly.
Yes, I agree. This makes the implementation on the client side simpler as
well. So it is welcome.


Just updating the solution:
1) New RPCs are going to be implemented.
2) The client stack will use these new fops.
3) On the server side, the server xlator implements these new fops to
decode the RPC request, then does resolve_resume, and the
compound-op-receiver (a better name for this is welcome) sends one op
after the other and sends the compound fop response.


List of compound fops identified so far:
Swift/S3:
PUT: creat(), write()s, setxattr(), fsync(), close(), rename()

Dht:
mkdir + inodelk

Afr:
xattrop+writev, xattrop+unlock to begin with.

Could everyone who needs compound fops add to this list?

I see that Niels is back on 14th. Does anyone else know the list of 
compound fops he has in mind?


Pranith.


Starting with a well defined set of operations for compounding has its 
advantages. It would be easier to understand and maintain correctness 
across the stack. Some of our translators perform transactions & 
create/update internal metadata for certain fops. It would be easier 
for such translators if the compound operations are well defined and 
does not entail deep introspection of a generic representation to 
ensure that the right behavior gets reflected at the end of a compound 
operation.


-Vijay






Re: [Gluster-devel] compound fop design first cut

2015-12-09 Thread Pranith Kumar Karampuri



On 12/09/2015 08:11 PM, Shyam wrote:

On 12/09/2015 02:37 AM, Soumya Koduri wrote:



On 12/09/2015 11:44 AM, Pranith Kumar Karampuri wrote:



On 12/09/2015 06:37 AM, Vijay Bellur wrote:

On 12/08/2015 03:45 PM, Jeff Darcy wrote:




On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) 
wrote:

Raghavendra Gowdappa writes:
I propose that we define a "compound op" that contains ops.

Within each op, there are fields that can be "inherited" from the
previous op, via use of a sentinel value.

Sentinel is -1, for all of these examples.

So:

LOOKUP (1, "foo") (Sets the gfid value to be picked up by
compounding, 1
is the root directory, as a gfid, by convention.)
OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound 
value.)

WRITE(-1, "foo", 3) (Uses the glfd compound value.)
CLOSE(-1) (Uses the glfd compound value)


So, basically, what the programming-language types would call futures
and promises.  It’s a good and well studied concept, which is 
necessary

to solve the second-order problem of how to specify an argument in
sub-operation N+1 that’s not known until sub-operation N completes.

To be honest, some of the highly general approaches suggested here
scare
me too.  Wrapping up the arguments for one sub-operation in xdata for
another would get pretty hairy if we ever try to go beyond two
sub-operations and have to nest sub-operation #3’s args within
sub-operation #2’s xdata which is itself encoded within sub-operation
#1’s xdata.  There’s also not much clarity about how to handle
errors in
that model.  Encoding N sub-operations’ arguments in a linear 
structure
as Shyam proposes seems a bit cleaner that way.  If I were to 
continue
down that route I’d suggest just having start_compound and 
end-compound
fops, plus an extra field (or by-convention xdata key) that either 
the

client-side or server-side translator could use to build whatever
structure it wants and schedule sub-operations however it wants.

However, I’d be even more comfortable with an even simpler approach
that
avoids the need to solve what the database folks (who have dealt with
complex transactions for years) would tell us is a really hard 
problem.

Instead of designing for every case we can imagine, let’s design for
the
cases that we know would be useful for improving performance. Open 
plus

read/write plus close is an obvious one.  Raghavendra mentions
create+inodelk as well.  For each of those, we can easily define a
structure that contains the necessary fields, we don’t need a
client-side translator, and the server-side translator can take 
care of
“forwarding” results from one sub-operation to the next. We could 
even
use GF_FOP_IPC to prototype this.  If we later find that the 
number of

“one-off” compound requests is growing too large, then at least we’ll
have some experience to guide our design of a more general 
alternative.

Right now, I think we’re trying to look further ahead than we can see
clearly.
Yes Agree. This makes implementation on the client side simpler as 
well.

So it is welcome.

Just updating the solution.
1) New RPCs are going to be implemented.
2) client stack will use these new fops.
3) On the server side we have server xlator implementing these new fops
to decode the RPC request then resolve_resume and
compound-op-receiver(Better name for this is welcome) which sends 
one op

after other and send compound fop response.


@Pranith, I assume you would expand on this at a later date (something
along the lines of what Soumya has done below, right?)


I will talk to her tomorrow to know more about this. Not saying this is 
what I will be implementing (There doesn't seem to be any consensus 
yet). But I would love to know how it is implemented.


Pranith




List of compound fops identified so far:
Swift/S3:
PUT: creat(), write()s, setxattr(), fsync(), close(), rename()

Dht:
mkdir + inodelk

Afr:
xattrop+writev, xattrop+unlock to begin with.

Could everyone who needs compound fops add to this list?

I see that Niels is back on 14th. Does anyone else know the list of
compound fops he has in mind?


 From the discussions we had with Niels regarding the kerberos support
on GlusterFS, I think below are the set of compound fops which are
required.

set_uid +
set_gid +
set_lkowner (or kerberos principal name) +
actual_fop

Also gfapi does lookup (first time/to refresh inode) before performing
actual fops most of the times. It may really help if we can club such
fops -


@Soumya +5 (just a random number :) )

This came to my mind as well, and is a good candidate for compounding.



LOOKUP + FOP (OPEN etc)

Coming to the design proposed, I agree with Shyam, Ira and Jeff's
thoughts. Defining different compound fops for each specific set of
operations and wrapping up those arguments in xdata seems rather complex
and difficult to maintain going further. Having worked with NFS,
may I suggest why not we follow (or something along similar lines) the approa

Re: [Gluster-devel] compound fop design first cut

2015-12-11 Thread Pranith Kumar Karampuri



On 12/09/2015 11:48 PM, Pranith Kumar Karampuri wrote:



On 12/09/2015 08:11 PM, Shyam wrote:

On 12/09/2015 02:37 AM, Soumya Koduri wrote:



On 12/09/2015 11:44 AM, Pranith Kumar Karampuri wrote:



On 12/09/2015 06:37 AM, Vijay Bellur wrote:

On 12/08/2015 03:45 PM, Jeff Darcy wrote:




On December 8, 2015 at 12:53:04 PM, Ira Cooper (i...@redhat.com) 
wrote:

Raghavendra Gowdappa writes:
I propose that we define a "compound op" that contains ops.

Within each op, there are fields that can be "inherited" from the
previous op, via use of a sentinel value.

Sentinel is -1, for all of these examples.

So:

LOOKUP (1, "foo") (Sets the gfid value to be picked up by
compounding, 1
is the root directory, as a gfid, by convention.)
OPEN(-1, O_RDWR) (Uses the gfid value, sets the glfd compound 
value.)

WRITE(-1, "foo", 3) (Uses the glfd compound value.)
CLOSE(-1) (Uses the glfd compound value)


So, basically, what the programming-language types would call 
futures
and promises.  It’s a good and well studied concept, which is 
necessary

to solve the second-order problem of how to specify an argument in
sub-operation N+1 that’s not known until sub-operation N completes.

To be honest, some of the highly general approaches suggested here
scare
me too.  Wrapping up the arguments for one sub-operation in xdata 
for

another would get pretty hairy if we ever try to go beyond two
sub-operations and have to nest sub-operation #3’s args within
sub-operation #2’s xdata which is itself encoded within 
sub-operation

#1’s xdata.  There’s also not much clarity about how to handle
errors in
that model.  Encoding N sub-operations’ arguments in a linear 
structure
as Shyam proposes seems a bit cleaner that way.  If I were to 
continue
down that route I’d suggest just having start_compound and 
end-compound
fops, plus an extra field (or by-convention xdata key) that 
either the

client-side or server-side translator could use to build whatever
structure it wants and schedule sub-operations however it wants.

However, I’d be even more comfortable with an even simpler approach
that
avoids the need to solve what the database folks (who have dealt 
with
complex transactions for years) would tell us is a really hard 
problem.

Instead of designing for every case we can imagine, let’s design for
the
cases that we know would be useful for improving performance. 
Open plus

read/write plus close is an obvious one.  Raghavendra mentions
create+inodelk as well.  For each of those, we can easily define a
structure that contains the necessary fields, we don’t need a
client-side translator, and the server-side translator can take 
care of
“forwarding” results from one sub-operation to the next. We could 
even
use GF_FOP_IPC to prototype this.  If we later find that the 
number of
“one-off” compound requests is growing too large, then at least 
we’ll
have some experience to guide our design of a more general 
alternative.
Right now, I think we’re trying to look further ahead than we can 
see

clearly.
Yes Agree. This makes implementation on the client side simpler as 
well.

So it is welcome.

Just updating the solution.
1) New RPCs are going to be implemented.
2) client stack will use these new fops.
3) On the server side we have server xlator implementing these new 
fops

to decode the RPC request then resolve_resume and
compound-op-receiver(Better name for this is welcome) which sends 
one op

after other and send compound fop response.


@Pranith, I assume you would expand on this at a later date 
(something along the lines of what Soumya has done below, right?


I will talk to her tomorrow to know more about this. Not saying this 
is what I will be implementing (There doesn't seem to be any consensus 
yet). But I would love to know how it is implemented.


Soumya and I had a discussion about this, and it seems like the NFS way
of stuffing the args works out at a high level. Even the sentinel
value based approach may also be possible. What I will do now is take a
deeper look at the structure and work out how all the fops mentioned in
this thread can be implemented. I will update you about my findings
in a couple of days.


Pranith


Pranith




List of compound fops identified so far:
Swift/S3:
PUT: creat(), write()s, setxattr(), fsync(), close(), rename()

Dht:
mkdir + inodelk

Afr:
xattrop+writev, xattrop+unlock to begin with.

Could everyone who needs compound fops add to this list?

I see that Niels is back on 14th. Does anyone else know the list of
compound fops he has in mind?


 From the discussions we had with Niels regarding the kerberos support
on GlusterFS, I think below are the set of compound fops which are
required.

set_uid +
set_gid +
set_lkowner (or kerberos principal name) +
actual_fop

Also gfapi does lookup (first time/to refresh inode) before performing
actual fops most of the times. It may really help if we can club such
fops -


@Soumya +5 (just a 

Re: [Gluster-devel] compound fop design first cut

2015-12-09 Thread Pranith Kumar Karampuri



On 12/09/2015 08:08 PM, Shyam wrote:

On 12/09/2015 12:52 AM, Pranith Kumar Karampuri wrote:



On 12/09/2015 10:39 AM, Prashanth Pai wrote:
However, I’d be even more comfortable with an even simpler approach 
that

avoids the need to solve what the database folks (who have dealt with
complex transactions for years) would tell us is a really hard 
problem.
Instead of designing for every case we can imagine, let’s design 
for the
cases that we know would be useful for improving performance.  Open 
plus

read/write plus close is an obvious one.  Raghavendra mentions
create+inodelk as well.

 From object interface (Swift/S3) perspective, this is the fop order
and flow for object operations:

GET: open(), fstat(), fgetxattr()s, read()s, close()

Krutika implemented fstat+fgetxattr(http://review.gluster.org/10180). In
posix there is an implementation of GF_CONTENT_KEY which is used to read
a file in lookup by quick-read. This needs to be exposed for fds as well
I think. So you can do all this using fstat on anon-fd.

HEAD: stat(), getxattr()s

Krutika already implemented this for sharding
http://review.gluster.org/10158. You can do this using stat fop.


I believe we need to fork this part of the conversation, i.e the stat 
+ xattr information clubbing.


My view on a stat for gluster is: POSIX stat + gluster extended
information being returned. I state this because a file system, when it
stats its inode, should get all information regarding the inode, and
not just the POSIX fields. In the case of other local FSes, the inode
structure has more fields than just what POSIX needs, so when the
inode is *read* the FS can populate all its internal inode information
and return to the application/syscall the relevant fields that it needs.


I believe gluster should do the same, so in the cases above, we should 
actually extend our stat information (not elaborating how) to include 
all information from the brick, i.e stat from POSIX and all the 
extended attrs for the inode (file or dir). This can then be consumed 
by any layer as needed.


Currently, each layer adds what it needs in addition to the stat 
information in the xdata, as an xattr request, this can continue or go 
away, if the relevant FOPs return the whole inode information upward.


This also has useful outcomes in readdirp calls, where we get the 
extended stat information for each entry.

You can use "list-xattr" in xdata request to get this.


With the patches referred to, and older patches, this seems to be the
direction sought (around 2013). Any reason why this is not prevalent
across the stack and made so? Or am I mistaken?

No reason. We can revive it. There didn't seem to be any interest, so I
didn't follow up to get it in.


Pranith



PUT: creat(), write()s, setxattr(), fsync(), close(), rename()

This I think should be a new compound fop. Nothing similar exists.

DELETE: getxattr(), unlink()

This can also be clubbed in unlink already because xdata exists on the
wire already.


Compounding some of these ops and exposing them as consumable libgfapi
APIs like glfs_get() and glfs_put() similar to librados compound
APIs[1] would greatly improve performance for object based access.

[1]:
https://github.com/ceph/ceph/blob/master/src/include/rados/librados.h#L2219 




Thanks.

- Prashanth Pai





Re: [Gluster-devel] Interesting profile data on tests

2016-01-03 Thread Pranith Kumar Karampuri



On 01/02/2016 10:11 PM, Raghavendra Talur wrote:



On Jan 2, 2016 8:18 PM, "Atin Mukherjee" > wrote:

>
> -Atin
> Sent from one plus one
> On Jan 2, 2016 4:41 PM, "Raghavendra Talur" > wrote:

> >
> >
> >
> > On Sat, Jan 2, 2016 at 12:03 PM, Atin Mukherjee 
> wrote:

> >>
> >> -Atin
> >> Sent from one plus one
> >>
> >>
> >> On Jan 2, 2016 11:52 AM, "Vijay Bellur" > wrote:

> >> >
> >> > On 12/30/2015 10:36 AM, Raghavendra Talur wrote:
> >> >>
> >> >> This is not comprehensive data but some interesting bits
> >> >>
> >> >> Average time taken for various commands in our .t files.
> >> >>
> >> >> * glusterd - 2 second
> >> >> * gluster vol start/stop - 3 second
> >> >> * gluster vol set/info(any basic gluster cli command) -1 second
> >> >> * gluster mount - 2 second
> >> >> * gluster add brick - 2 second
> >> >> * gluster remove brick - 5 second
> >> >> * gluster rebalance start 5 second
> >> >> * gluster tier attach/detach - 6 second
> >> >>
> >> >> The only other single command which takes 1+ second is sleep. 
Most of

> >> >> the other
> >> >> external commands we use in bash scripts are not that time taking.
> >> >>
> >> >>
> >> >> Hence,
> >> >>
> >> >> 1. Don't stop/delete a gluster volume in .t file unless it is 
part of

> >> >> your test. Let the cleanup function handle that.
> >> >> 2. Don't call gluster vol info at the start of the test if not 
required
> >> >> 3. Merge as many tests as possible to reduce glusterd 
starts/vol starts

> >> >> and mounts.
> >> >> 4. Use sleep only if it is absolutely required.
> >> >>
> >> >> You can use this bug 
https://bugzilla.redhat.com/show_bug.cgi?id=1294826

> >> >> to send patches to improve test times.
> >> >>
> >> >
> >> > Thank you! These are good set of steps that can help in 
reducing the overall time consumed for a regression test run. I also 
think the larger latencies observed in volume operations could be 
related to the the set of fsync()s involved in making configuration 
state durable in glusterd's store. It would be interesting to see if 
we can use a ramdisk for /var/lib/glusterd and check if the latencies 
would improve.

> >> That's a good suggestion. It'd definitely improve the latency.
> >
> >
> > Tried this.
> > I saw improvement of 1 second with commands which took over 4 
second. May be there is something else which is taking more time?
> Did you observe this for clustered tests? I think apart from fsync() 
and n/w latency the rest of the things should be pretty light wight 
and shouldn't consume much time.


This data is true for non-clustered tests too. I am suspecting address
resolution.
Rafi had suggested we use IP addresses instead of host names for the HOST
variables.




I have seen latency because of address resolution as well.

Pranith


>
> >
> >>
> >> >
> >> > Regards,
> >> > Vijay
> >> >
> >
> >






[Gluster-devel] strace like utility for gluster

2016-01-03 Thread Pranith Kumar Karampuri

hi,
 I have been thinking of coming up with a utility similar to strace
for gluster. I am sending this mail to find out if anyone else is also
thinking about it, and how they are solving this problem, so that we can
come up with a concrete solution that we can implement.


These are my thoughts about the solution so far:

1) Doing a trace for a brick/mount to know just what that process is
winding/unwinding (or whatever else it wants to tell the trace process)
seems easier. We can launch a trace process which will open a listener
unix socket to which the glusterfs process can send whatever it needs to
(glfstrace -h  -p  -s
). We will have to
write the gf_trace() infra, which will do the job of sending the information
to this trace process only when there is a trace process trying to
listen in on what is happening inside the process (a minimal socket sketch
follows after this list).


2) Doing end-to-end tracing:
Unix socket based listening is not going to work anymore (Unless we send 
the trace information in xdata or something, which doesn't seem nice to 
me). We can use network sockets for sending the information from the 
bricks to the trace process. So at the time of starting a trace on the 
mount process, mount process will need to send an rpc to the brick 
process giving the trace process hostname/port information for that 
client, and bricks can send the trace information to the trace process 
directly. So we can have multiple trace processes tracing different 
mounts and bricks will be able to send trace information to different 
trace processes.


We will need to make sure gf_trace() is not going to send the 
information to trace process in the io-path.


3) Doing glfstrace -p . I have an approach for fuse 
based mounts. I am hoping nfs/smb folks will respond if we can do 
something similar in those mounts.


The glfstrace process now traces the client process with extra information,
i.e. the frame->root->pid information in fuse-bridge, which can be used to
filter only the fops executed by the application. The rest is similar to 2).


4) Doing glfstrace -c 

The glfstrace process forks; it knows the pid of the child, and the child
waits to hear from the parent before starting 'exec-of-cmd'. The glfstrace
process sets things up similar to 3); once tracing is set up, it tells the
child to exec the cmd.
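
A minimal sketch of the gf_trace() send path from 1) above; the function
name, message format and socket path are illustrative, and a real
implementation would keep the connection open rather than reconnecting per
message:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Send one preformatted trace line to a trace process listening on a
 * unix socket; if nobody is listening, silently do nothing. */
static int
gf_trace_send (const char *sock_path, const char *msg)
{
        struct sockaddr_un addr;
        int fd;
        ssize_t ret;

        fd = socket (AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0)
                return -1;

        memset (&addr, 0, sizeof (addr));
        addr.sun_family = AF_UNIX;
        strncpy (addr.sun_path, sock_path, sizeof (addr.sun_path) - 1);

        if (connect (fd, (struct sockaddr *)&addr, sizeof (addr)) < 0) {
                close (fd);        /* no trace process => no tracing */
                return 0;
        }

        ret = write (fd, msg, strlen (msg));
        close (fd);
        return (ret < 0) ? -1 : 0;
}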


Comments are welcome :-). Happy new year by the way!!

Pranith


Re: [Gluster-devel] strace like utility for gluster

2016-01-03 Thread Pranith Kumar Karampuri



On 01/04/2016 11:04 AM, Prashanth Pai wrote:

Hi,

FYI, ltrace is very helpful for libgfapi clients.

[root@hummingbird gfapi]# ltrace -c -ff -p 28906

^C
% time     seconds    usecs/call  calls  function
------ -----------  -----------  -----  --------------------
 43.27    5.952481        59524    100  glfs_rename
 21.39    2.942866        29428    100  glfs_fsync
  7.35    1.011715        10117    100  glfs_fsetxattr
  7.26    0.999125         9991    100  glfs_creat
  5.04    0.692812         1154    600  free
  4.95    0.681092          851    800  __errno_location
  4.67    0.641790         3208    200  glfs_stat
  3.96    0.545340          908    600  malloc
  1.26    0.173293         1732    100  glfs_close
  0.84    0.115284         1152    100  glfs_write
------ -----------  -----------  -----  --------------------
100.00   13.755798                2800  total

Nice, I didn't know about this. What I am looking for with this tool is
even more granularity, i.e. per-xlator information. It shouldn't be so
difficult to find information like the time spent in each xlator, and which
fop from fuse led to which other fops in each xlator, etc.


Pranith


Regards,
  -Prashanth Pai

- Original Message -

From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
To: "Gluster Devel" <gluster-devel@gluster.org>
Sent: Monday, January 4, 2016 10:49:50 AM
Subject: [Gluster-devel] strace like utility for gluster

hi,
   I have been thinking of coming up with a utility similar to strace
for gluster. I am sending this mail to find out if anyone else is also
thinking about it and how they are solving this problem to see if we can
come up with a concrete solution that we can implement.

These are my thoughts about the solution so far:

1) Doing trace for a brick/mount to know just what that process is
winding/unwinding (Or whatever else it wants to tell the trace process)
seems easier. We can launch a trace process which will open listener
unix socket to which the glusterfs process can send whatever it needs to
(glfstrace -h  -p  -s
). We will have to
write gf_trace() infra, which will do the job of sending the information
to this trace process only when there is a trace process trying to
listen in on what is happening inside the process.

2) Doing end-to-end tracing:
Unix socket based listening is not going to work anymore (Unless we send
the trace information in xdata or something, which doesn't seem nice to
me). We can use network sockets for sending the information from the
bricks to the trace process. So at the time of starting a trace on the
mount process, mount process will need to send an rpc to the brick
process giving the trace process hostname/port information for that
client, and bricks can send the trace information to the trace process
directly. So we can have multiple trace processes tracing different
mounts and bricks will be able to send trace information to different
trace processes.

We will need to make sure gf_trace() is not going to send the
information to trace process in the io-path.

3) Doing glfstrace -p . I have an approach for fuse
based mounts. I am hoping nfs/smb folks will respond if we can do
something similar in those mounts.

glfstrace process now traces the client process with extra information
i.e. frame->root->pid information in fuse-bridge which can be used to
filter only these fops executed by the application. Rest is similar to 2)

4) Doing glfstrace -c 

glfstrace process forks, it knows the pid of child, child should wait to
hear from parent to start 'exec-of-cmd'. glfstrace process sets things
up similar to 3), once it sets up tracing, tells child to exec the cmd.

Comments are welcome :-). Happy new year by the way!!

Pranith





[Gluster-devel] Wrong usage of dict functions

2016-01-06 Thread Pranith Kumar Karampuri

hi,
   It seems like having two ways to create a dictionary is causing problems.
There are quite a few mismatched dict_new()/dict_destroy() or
get_new_dict()/dict_unref() pairs in the code base. So I stopped exposing the
functions without ref/unref, i.e. get_new_dict()/dict_destroy(), as part
of http://review.gluster.org/13183 (a small sketch of the intended pairing
follows below).

Files changed as part of the patch:
 api/src/glfs-mgmt.c                            |  2 +-
 api/src/glfs.c                                 |  2 +-
 cli/src/cli-cmd-parser.c                       | 42 +-
 cli/src/cli-cmd-system.c                       |  6 +++---
 cli/src/cli-cmd-volume.c                       |  2 +-
 cli/src/cli-rpc-ops.c                          |  4 ++--
 cli/src/cli.c                                  |  2 +-
 glusterfsd/src/glusterfsd.c                    |  2 +-
 libglusterfs/src/dict.h                        |  5 -
 libglusterfs/src/graph.c                       |  2 +-
 libglusterfs/src/graph.y                       |  2 +-
 xlators/cluster/afr/src/afr-self-heal-common.c |  6 +++---
 xlators/cluster/afr/src/afr-self-heal-name.c   |  2 +-
 xlators/cluster/dht/src/dht-selfheal.c         | 15 +++
 xlators/cluster/dht/src/dht-shared.c           |  2 +-
 xlators/mgmt/glusterd/src/glusterd-geo-rep.c   |  4 ++--
 xlators/mgmt/glusterd/src/glusterd-op-sm.c     |  4 ++--
 xlators/mgmt/glusterd/src/glusterd-volgen.c    | 12 ++--
 xlators/mount/fuse/src/fuse-bridge.c           |  3 +--
 xlators/mount/fuse/src/fuse-bridge.h           |  2 --
 20 files changed, 56 insertions(+), 65 deletions(-)


Pranith


Re: [Gluster-devel] NetBSD tests not running to completion.

2016-01-07 Thread Pranith Kumar Karampuri



On 01/07/2016 02:39 PM, Emmanuel Dreyfus wrote:

On Wed, Jan 06, 2016 at 05:49:04PM +0530, Ravishankar N wrote:

I re triggered NetBSD regressions for http://review.gluster.org/#/c/13041/3
but they are being run in silent mode and are not completing. Can some one
from the infra-team take a look? The last 22 tests in
https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/ have
failed. Highly unlikely that something is wrong with all those patches.

I note your latest test completed with an error in mount-nfs-auth.t:
https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/13260/consoleFull

Would you have the jenkins build that did not complete, so that I can have a
look at it?

Generally speaking, I have to point out that NetBSD regression does shed light
on generic bugs; we had a recent example with quota-nfs.t. For now there
are no other well supported platforms, but if you want glusterfs to
be really portable, removing mandatory NetBSD regression is not a good idea:
portability bugs will crop up.

Even a daily or weekly regression run seems a bad idea to me. If you do not
prevent integration of patches that break NetBSD regression, they will get
in, and tests will break one by one over time. I have first hand
experience of this situation, from when I was actually trying to catch up with
NetBSD regression. Many times I reached something reliable enough to become
mandatory, and got broken by a new patch before it became actually mandatory.

IMO, relaxing the NetBSD regression requirement means the project drops the
goal of being portable.


hi Emmanuel,
 This Sunday I have some time I can spend helping to make the
tests better for NetBSD. I have recently seen bugs that are caught only by
the NetBSD regression, so I see value in making NetBSD more reliable.
Please let me know what things we can work on. It would help if
you give me something specific to glusterfs to make it more valuable in
the short term. Over time I would like to learn enough to share the load
with you, however little it may be (please bear with me, I sometimes go
quiet). Here are the initial things I would like to know to begin with:

1) How to set up NetBSD VMs on my laptop of the exact same version as
the ones that run on the build systems.
2) How to prevent NetBSD machines from hanging when things crash (at least I
used to see the machines hang when fuse crashed, not sure if
this is still the case)? (This failure needs manual intervention at the
moment on NetBSD regressions; if we make them report failures and pick up
the next job, that would be the best way forward.)
3) We should come up with a list of known problems and how to
troubleshoot them when things are not going smoothly in NetBSD.
Again, we really need to make things automatic; this should be a last
resort. Our top goal should be to make NetBSD machines report failures
and go on to execute the next job.
4) How can we make debugging better in NetBSD? In the worst case we can
make all tests execute in trace/debug mode on NetBSD.


I really want to appreciate the fine job you have done so far in making 
sure glusterfs is stable on NetBSD.


Infra team,
   I think we need to make some improvements to our infra. We need
to get information about the health of the Linux and NetBSD regression builds.
1) Something like: in the last 100 builds, how many builds succeeded on
Linux, and how many succeeded on NetBSD.
2) What tests failed in the last 100 builds, and how many times, on both
Linux and NetBSD. (I actually wrote parts of this, but the whole command
output has changed, making my scripts stale.)

Any other ideas you guys have?
3) Which components have the highest number of spurious failures.
4) How many builds did not complete / were manually aborted, etc.

Once we start measuring these things, the next step is to set up a process
to get the health of the project stable and keep it that way.


Please let me know if anyone wants to volunteer to make things better in 
this infra part. Most of the code will be in python.
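
As a rough illustration of what I have in mind, a small script along these 
lines could be a starting point (the job name, the Jenkins API endpoints and 
the 'bad status' log pattern below are guesses from memory and will need 
adjusting):

#!/usr/bin/env python
# Rough sketch: summarise the health of the last N regression runs.
# Assumptions: the Jenkins JSON API at build.gluster.org is reachable,
# the job name below exists, and the "bad status" lines in the console
# output look like the ones our TAP-based regression runs print.
import json
import re
import urllib.request
from collections import Counter

BASE = "https://build.gluster.org/job/rackspace-regression-2GB-triggered"

def last_builds(n=100):
    url = "%s/api/json?tree=builds[number,result]{0,%d}" % (BASE, n)
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))["builds"]

def failed_tests(number):
    with urllib.request.urlopen("%s/%d/consoleText" % (BASE, number)) as resp:
        console = resp.read().decode("utf-8", "replace")
    return re.findall(r"^(\./tests/\S+\.t): bad status", console, re.M)

def report(n=100):
    builds = last_builds(n)
    print("Build results:", dict(Counter(b["result"] for b in builds)))
    tests = Counter()
    for b in builds:
        if b["result"] not in ("SUCCESS", None):
            tests.update(failed_tests(b["number"]))
    print("Most frequently failing tests:")
    for test, count in tests.most_common(10):
        print("  %3d  %s" % (count, test))

if __name__ == "__main__":
    report()

Running the same report against the NetBSD job would give us the 
per-platform comparison for 1) and 2) above.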


Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] NetBSD tests not running to completion.

2016-01-08 Thread Pranith Kumar Karampuri



On 01/08/2016 03:25 PM, Emmanuel Dreyfus wrote:

On Fri, Jan 08, 2016 at 03:18:02PM +0530, Pranith Kumar Karampuri wrote:

Should the cleanup script needs to be manually executed on the NetBSD
machine?

You can run the script manually, but if the goal is to restore a
misbehaving machine, rebooting is probably the fastest way to sort
out the issue.

While thinking about it, I suspect there may be some benefit
in rebooting the machine if the regression does not finish
within a sane amount of time.


Rebooting upon a single test leading to a crash may not be a good idea. We 
need a reliable way of detecting that the mount hung because of a crash, and 
of executing this cleanup script when that situation happens. So the question 
is: can we detect this state?
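
As a very rough sketch of what such detection could look like on a slave (the 
mount point locations, the core file location and the cleanup script path 
below are all assumptions and need to match what the NetBSD slaves actually 
use):

#!/usr/bin/env python
# Rough sketch of a watchdog for the "mount hangs after a crash" case.
# Assumptions: regression mounts live under /mnt/glusterfs, core files
# land in /, and /opt/qa/cleanup.sh is the cleanup script on the slave.
import glob
import os
import subprocess
import time

CLEANUP = "/opt/qa/cleanup.sh"

def mount_is_hung(path, timeout=30):
    # A stat that does not return within `timeout` seconds is treated as hung.
    try:
        subprocess.run(["stat", path], timeout=timeout,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return False
    except subprocess.TimeoutExpired:
        return True

def crashed_recently(window=3600):
    # Fresh core files are a strong hint that a client or brick crashed.
    now = time.time()
    return any(now - os.path.getmtime(c) < window for c in glob.glob("/core*"))

for mount in glob.glob("/mnt/glusterfs/*"):
    if mount_is_hung(mount):
        if crashed_recently():
            subprocess.call(["sh", CLEANUP])   # kill perfused, unmount, clean up
        else:
            print("%s is hung but no recent core found; needs manual inspection" % mount)
        break

If something like this ran from cron or from the job wrapper before picking 
the next job, the slave could report the failure and recover on its own 
instead of waiting for manual intervention.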





First step could be to parse jenkins logs and find which tests fail or hang
most often in NetBSD regression

This work is under way. I will have to change some of the scripts I wrote to
get this information.

Great.


To avoid duplication of work, did you take any tests that you are
already investigating? If not that is the first thing I will try to find out.

No, I have not started investigating yet because I have no idea where
I should look. Your input will be very valuable.

Since we don't have the script now, I did this manually:

Here are the results for the last 15-20 runs:

Test                                              Number of times it happened
tests/basic/afr/arbiter-statfs.t (bad status 1)   5
tests/basic/afr/self-heal.t                       1
tests/basic/afr/entry-self-heal.t                 1
tests/basic/quota-nfs.t                           2



The following happened: 4 times
One example: 
https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/13283/console
https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/13280/console 
- this seems different compared to the one above.


+ '/opt/qa/build.sh'
Build timed out (after 300 minutes). Marking the build as failed.
Build was aborted
Finished: FAILURE



The following happened: 4 times
One example: 
https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/13279/console


ERROR: Connection was broken: java.io.IOException: Unexpected EOF
at 
hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:99)
at 
hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
at 
hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
at 
hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)



I can take a look at why tests are failing (on Sunday, not today :-)). 
Could you look at why the timeouts/'Connection broken' stuff is happening?


Once we find out what happened, the first goal is to detect and repair it 
automatically. If we can't, let us write up a wiki page or something to 
describe how to proceed when this happens.


Pranith




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] NetBSD tests not running to completion.

2016-01-08 Thread Pranith Kumar Karampuri



On 01/08/2016 02:08 PM, Emmanuel Dreyfus wrote:

On Fri, Jan 08, 2016 at 11:45:20AM +0530, Pranith Kumar Karampuri wrote:

1) How to set up NetBSD VMs on my laptop which is of exact version as the
ones that are run on build systems.

Well, the easier way is to pick the VM image we run at rackspace, which
relies on Xen. If you use a hardware virtualization system, we just need
to change the kernel and use the NetBSD-7.0 GENERIC one. What hypervisor
do you use?

Alternatively it is easy to make a fresh NetBSD install. The only trap
is that the glusterfs backing store filesystem must be formatted in FFSv1
format to get extended attribute support (this is obtained by newfs -O1).


2) How to prevent NetBSD machines hang when things crash (At least I used to
see that the machines hang when fuse crashes before, not sure if this is
still the case)? (This failure needs manual intervention at the moment on
NetBSD regressions, if we make it report failures and pick next job that
would be the best way forward)

It depends on what we are talking about. If this is a mount point that does
not want to unmount, killing the perfused daemon (which is the bridge
between FUSE and native PUFFS) will help. The cleanup script does it.
Do you have a hang example?


Should the cleanup script needs to be manually executed on the NetBSD 
machine?





3) We should come up with a list of known problems and how to troubleshoot
those problems, when things are not going smooth in NetBSD. Again, we really
need to make things automatic, this should be last resort. Our top goal
should be to make NetBSD machines report failures and go to execute next
job.

This is the frustrating point for me: we have complaints that things go bad,
but we do not have data about what tests caused troubles. Fixing the problem
underlying unbacked complaints means we will have to gather data on our own.

First step could be to parse jenkins logs and find which tests fail or hang
most often in NetBSD regression


This work is under way. I will have to change some of the scripts I 
wrote to get this information.





4) How can we make debugging better in NetBSD? In the worst case we can make
all tests execute in trace/debug mode on NetBSD.

I really want to appreciate the fine job you have done so far in making sure
glusterfs is stable on NetBSD.

Thanks! I must confess the idea of having the NetBSD port demoted is a bit
depressing given the amount of work I invested in it.
With your support I think we can make things better. To avoid 
duplication of work, did you take any tests that you are already 
investigating? If not that is the first thing I will try to find out.


Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] glusterfsd crash due to page allocation failure

2015-12-27 Thread Pranith Kumar Karampuri
After debugging with David, we found that the issue is already fixed for 
3.7.7 by the patch http://review.gluster.org/12312


Pranith

On 12/22/2015 10:45 PM, David Robinson wrote:

Niels,

> 1. how is infiniband involved/configured in this environment?

gfsib01bkp and gfs02bkp are connected via infiniband. We are using tcp 
transport as I never was able to get RDMA to work.


Volume Name: gfsbackup
Type: Distribute
Volume ID: e78d5123-d9bc-4d88-9c73-61d28abf0b41
Status: Started
Number of Bricks: 7
Transport-type: tcp
Bricks:
Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/gfsbackup
Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/gfsbackup
Brick3: gfsib02bkp.corvidtec.com:/data/brick01bkp/gfsbackup
Brick4: gfsib02bkp.corvidtec.com:/data/brick02bkp/gfsbackup
Brick5: gfsib02bkp.corvidtec.com:/data/brick03bkp/gfsbackup
Brick6: gfsib02bkp.corvidtec.com:/data/brick04bkp/gfsbackup
Brick7: gfsib02bkp.corvidtec.com:/data/brick05bkp/gfsbackup

> 2. was there a change/update of the driver (kernel update maybe?)
Before upgrading these servers from gluster 3.6.6 to 3.7.6, I did a 
'yum update' which did upgrade the kernel.

Current kernel is 2.6.32-573.12.1.el6.x86_64

> 3. do you get a coredump of the glusterfsd process when this happens?
There are a series of core files in / around the same time that this 
happens.

-rw---1 root root  168865792 Dec 22 10:45 core.3700
-rw---1 root root  168861696 Dec 22 10:45 core.3661
-rw---1 root root  168861696 Dec 22 10:45 core.3706
-rw---1 root root  168861696 Dec 22 10:45 core.3677
-rw---1 root root  168861696 Dec 22 10:45 core.3669
-rw---1 root root  168857600 Dec 22 10:45 core.3654
-rw---1 root root  254345216 Dec 22 10:45 core.3693
-rw---1 root root  254341120 Dec 22 10:45 core.3685

> 4. is this a fuse mount process, or a brick process? (check by PID?)
I have rebooted the machine as it was in a bad state and I could no 
longer write to the gluster volume.

When it does it again, I will check the PID.

This machine has both brick processes and fuse mounts.  The storage 
servers mount the volume through a fuse mount and then I use rsync to 
backup my primary storage system.


David




 Hello,

 We've recently upgraded from gluster 3.6.6 to 3.7.6 and have started
 encountering dmesg page allocation errors (stack trace is appended).

 It appears that glusterfsd now sometimes fills up the cache 
completely and
 crashes with a page allocation failure. I *believe* it mainly 
happens when
 copying lots of new data to the system, running a 'find', or 
similar. Hosts
 are all Scientific Linux 6.6 and these errors occur consistently on 
two

 separate gluster pools.

 Has anyone else seen this issue and are there any known fixes for 
it via

 sysctl kernel parameters or other means?

 Please let me know of any other diagnostic information that would 
help.


Could you explain a little more about this? The below is a message from
the kernel telling you that the mlx4_ib (Mellanox Infiniband?) driver is
requesting more contiguous memory than is immediately available.

So, the questions I have regarding this:

1. how is infiniband involved/configured in this environment?
2. was there a change/update of the driver (kernel update maybe?)
3. do you get a coredump of the glusterfsd process when this happens?
4. is this a fuse mount process, or a brick process? (check by PID?)

Thanks,
Niels




 Thanks,
 Patrick


 [1458118.134697] glusterfsd: page allocation failure. order:5, 
mode:0x20

 > [1458118.134701] Pid: 6010, comm: glusterfsd Not tainted
 > 2.6.32-573.3.1.el6.x86_64 #1
 > [1458118.134702] Call Trace:
 > [1458118.134714]  [] ? 
__alloc_pages_nodemask+0x7dc/0x950
 > [1458118.134728]  [] ? 
mlx4_ib_post_send+0x680/0x1f90

 > [mlx4_ib]
 > [1458118.134733]  [] ? kmem_getpages+0x62/0x170
 > [1458118.134735]  [] ? fallback_alloc+0x1ba/0x270
 > [1458118.134736]  [] ? cache_grow+0x2cf/0x320
 > [1458118.134738]  [] ? 
cache_alloc_node+0x99/0x160

 > [1458118.134743]  [] ? pskb_expand_head+0x62/0x280
 > [1458118.134744]  [] ? __kmalloc+0x199/0x230
 > [1458118.134746]  [] ? pskb_expand_head+0x62/0x280
 > [1458118.134748]  [] ? 
__pskb_pull_tail+0x2aa/0x360
 > [1458118.134751]  [] ? 
harmonize_features+0x29/0x70
 > [1458118.134753]  [] ? 
dev_hard_start_xmit+0x1c4/0x490

 > [1458118.134758]  [] ? sch_direct_xmit+0x15a/0x1c0
 > [1458118.134759]  [] ? dev_queue_xmit+0x228/0x320
 > [1458118.134762]  [] ? 
neigh_connected_output+0xbd/0x100
 > [1458118.134766]  [] ? 
ip_finish_output+0x287/0x360

 > [1458118.134767]  [] ? ip_output+0xb8/0xc0
 > [1458118.134769]  [] ? __ip_local_out+0x9f/0xb0
 > [1458118.134770]  [] ? ip_local_out+0x25/0x30
 > [1458118.134772]  [] ? ip_queue_xmit+0x190/0x420
 > [1458118.134773]  [] ? 
__alloc_pages_nodemask+0x129/0x950
 > [1458118.134776]  [] ? 
tcp_transmit_skb+0x4b4/0x8b0

 > [1458118.134778]  [] ? tcp_write_xmit+0x1da/0xa90
 > [1458118.134779]  [] ? __kmalloc_node+0x4d/0x60
 > 

Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client

2015-12-27 Thread Pranith Kumar Karampuri



On 12/26/2015 04:45 AM, Oleksandr Natalenko wrote:

Also, here is valgrind output with our custom tool, that does GlusterFS volume
traversing (with simple stats) just like find tool. In this case NFS-Ganesha
is not used.

https://gist.github.com/e4602a50d3c98f7a2766

hi Oleksandr,
  I went through the code. Both NFS Ganesha and the custom tool use 
gfapi and the leak is stemming from that. I am not very familiar with 
this part of the code, but there seems to be one inode_unref() that is 
missing in the failure path of resolution. Not sure if that corresponds 
to the leaks.


Soumya,
   Could this be the issue? review.gluster.org seems to be down, so I 
couldn't send the patch. Please ping me on IRC.

diff --git a/api/src/glfs-resolve.c b/api/src/glfs-resolve.c
index b5efcba..52b538b 100644
--- a/api/src/glfs-resolve.c
+++ b/api/src/glfs-resolve.c
@@ -467,9 +467,11 @@ priv_glfs_resolve_at (struct glfs *fs, xlator_t *subvol, inode_t *at,
                 }
         }
 
-        if (parent && next_component)
+        if (parent && next_component) {
+                inode_unref (parent);
+                parent = NULL;
                 /* resolution failed mid-way */
                 goto out;
+        }
 
         /* At this point, all components up to the last parent directory
            have been resolved successfully (@parent). Resolution of basename


Pranith


One may see GlusterFS-related leaks here as well.

On пʼятниця, 25 грудня 2015 р. 20:28:13 EET Soumya Koduri wrote:

On 12/24/2015 09:17 PM, Oleksandr Natalenko wrote:

Another addition: it seems to be GlusterFS API library memory leak
because NFS-Ganesha also consumes huge amount of memory while doing
ordinary "find . -type f" via NFSv4.2 on remote client. Here is memory
usage:

===
root  5416 34.2 78.5 2047176 1480552 ? Ssl  12:02 117:54
/usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f
/etc/ganesha/ganesha.conf -N NIV_EVENT
===

1.4G is too much for simple stat() :(.

Ideas?

nfs-ganesha also has a cache layer which can scale to millions of entries
depending on the number of files/directories being looked up. However
there are parameters to tune it. So either try stat with a few entries or
add the below block to the nfs-ganesha.conf file, set low limits and check the
difference. That may help us narrow down how much memory is actually
consumed by core nfs-ganesha and gfAPI.

CACHEINODE {
Cache_Size(uint32, range 1 to UINT32_MAX, default 32633); # cache size
Entries_HWMark(uint32, range 1 to UINT32_MAX, default 10); # max no. of entries in the cache
}

Thanks,
Soumya


24.12.2015 16:32, Oleksandr Natalenko написав:

Still actual issue for 3.7.6. Any suggestions?

24.09.2015 10:14, Oleksandr Natalenko написав:

In our GlusterFS deployment we've encountered something like memory
leak in GlusterFS FUSE client.

We use replicated (×2) GlusterFS volume to store mail (exim+dovecot,
maildir format). Here is inode stats for both bricks and mountpoint:

===
Brick 1 (Server 1):

Filesystem InodesIUsed

  IFree IUse% Mounted on

/dev/mapper/vg_vd1_misc-lv08_mail   578768144 10954918

  5678132262% /bricks/r6sdLV08_vd1_mail

Brick 2 (Server 2):

Filesystem InodesIUsed

  IFree IUse% Mounted on

/dev/mapper/vg_vd0_misc-lv07_mail   578767984 10954913

  5678130712% /bricks/r6sdLV07_vd0_mail

Mountpoint (Server 3):

Filesystem  InodesIUsed  IFree
IUse% Mounted on
glusterfs.xxx:mail   578767760 10954915  567812845
2% /var/spool/mail/virtual
===

glusterfs.xxx domain has two A records for both Server 1 and Server 2.

Here is volume info:

===
Volume Name: mail
Type: Replicate
Volume ID: f564e85c-7aa6-4170-9417-1f501aa98cd2
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: server1.xxx:/bricks/r6sdLV08_vd1_mail/mail
Brick2: server2.xxx:/bricks/r6sdLV07_vd0_mail/mail
Options Reconfigured:
nfs.rpc-auth-allow: 1.2.4.0/24,4.5.6.0/24
features.cache-invalidation-timeout: 10
performance.stat-prefetch: off
performance.quick-read: on
performance.read-ahead: off
performance.flush-behind: on
performance.write-behind: on
performance.io-thread-count: 4
performance.cache-max-file-size: 1048576
performance.cache-size: 67108864
performance.readdir-ahead: off
===

Soon enough after mounting and exim/dovecot start, glusterfs client
process begins to consume huge amount of RAM:

===
user@server3 ~$ ps aux | grep glusterfs | grep mail
root 28895 14.4 15.0 15510324 14908868 ?   Ssl  Sep03 4310:05
/usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable
--volfile-server=glusterfs.xxx --volfile-id=mail
/var/spool/mail/virtual
===

That is, ~15 GiB of RAM.

Also we've tried to use mountpoint withing separate KVM VM with 2 or 3
GiB of RAM, and soon after starting mail daemons got OOM killer for
glusterfs client process.

Mounting same share via 

Re: [Gluster-devel] glusterfsd crash due to page allocation failure

2015-12-21 Thread Pranith Kumar Karampuri

hi Glomski,
This is the second time I am hearing about memory allocation 
problems in 3.7.6 but this time on brick side. Are you able to recreate 
this issue? Will it be possible to get statedumps of the brick 
processes just before they crash?


Pranith

On 12/22/2015 02:25 AM, Glomski, Patrick wrote:

Hello,

We've recently upgraded from gluster 3.6.6 to 3.7.6 and have started 
encountering dmesg page allocation errors (stack trace is appended).


It appears that glusterfsd now sometimes fills up the cache completely 
and crashes with a page allocation failure. I *believe* it mainly 
happens when copying lots of new data to the system, running a 'find', 
or similar. Hosts are all Scientific Linux 6.6 and these errors occur 
consistently on two separate gluster pools.


Has anyone else seen this issue and are there any known fixes for it 
via sysctl kernel parameters or other means?


Please let me know of any other diagnostic information that would help.

Thanks,
Patrick


[1458118.134697] glusterfsd: page allocation failure. order:5,
mode:0x20
[1458118.134701] Pid: 6010, comm: glusterfsd Not tainted
2.6.32-573.3.1.el6.x86_64 #1
[1458118.134702] Call Trace:
[1458118.134714]  [] ?
__alloc_pages_nodemask+0x7dc/0x950
[1458118.134728]  [] ?
mlx4_ib_post_send+0x680/0x1f90 [mlx4_ib]
[1458118.134733]  [] ? kmem_getpages+0x62/0x170
[1458118.134735]  [] ? fallback_alloc+0x1ba/0x270
[1458118.134736]  [] ? cache_grow+0x2cf/0x320
[1458118.134738]  [] ?
cache_alloc_node+0x99/0x160
[1458118.134743]  [] ? pskb_expand_head+0x62/0x280
[1458118.134744]  [] ? __kmalloc+0x199/0x230
[1458118.134746]  [] ? pskb_expand_head+0x62/0x280
[1458118.134748]  [] ? __pskb_pull_tail+0x2aa/0x360
[1458118.134751]  [] ? harmonize_features+0x29/0x70
[1458118.134753]  [] ?
dev_hard_start_xmit+0x1c4/0x490
[1458118.134758]  [] ? sch_direct_xmit+0x15a/0x1c0
[1458118.134759]  [] ? dev_queue_xmit+0x228/0x320
[1458118.134762]  [] ?
neigh_connected_output+0xbd/0x100
[1458118.134766]  [] ? ip_finish_output+0x287/0x360
[1458118.134767]  [] ? ip_output+0xb8/0xc0
[1458118.134769]  [] ? __ip_local_out+0x9f/0xb0
[1458118.134770]  [] ? ip_local_out+0x25/0x30
[1458118.134772]  [] ? ip_queue_xmit+0x190/0x420
[1458118.134773]  [] ?
__alloc_pages_nodemask+0x129/0x950
[1458118.134776]  [] ? tcp_transmit_skb+0x4b4/0x8b0
[1458118.134778]  [] ? tcp_write_xmit+0x1da/0xa90
[1458118.134779]  [] ? __kmalloc_node+0x4d/0x60
[1458118.134780]  [] ? tcp_push_one+0x30/0x40
[1458118.134782]  [] ? tcp_sendmsg+0x9cc/0xa20
[1458118.134786]  [] ? sock_aio_write+0x19b/0x1c0
[1458118.134788]  [] ? sock_aio_write+0x0/0x1c0
[1458118.134791]  [] ?
do_sync_readv_writev+0xfb/0x140
[1458118.134797]  [] ?
autoremove_wake_function+0x0/0x40
[1458118.134801]  [] ?
selinux_file_permission+0xbf/0x150
[1458118.134804]  [] ?
security_file_permission+0x16/0x20
[1458118.134806]  [] ? do_readv_writev+0xd6/0x1f0
[1458118.134807]  [] ? vfs_writev+0x46/0x60
[1458118.134809]  [] ? sys_writev+0x51/0xd0
[1458118.134812]  [] ?
__audit_syscall_exit+0x25e/0x290
[1458118.134816]  [] ?
system_call_fastpath+0x16/0x1b




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] glusterfsd crash due to page allocation failure

2015-12-22 Thread Pranith Kumar Karampuri



On 12/22/2015 09:10 PM, David Robinson wrote:

Pranith,
This issue continues to happen.  If you could provide instructions for 
getting you the statedump, I would be happy to send that information.
I am not sure how to get a statedump just before the crash as the 
crash is intermittent.

Command: gluster volume statedump 

This generates statedump files in /var/run/gluster/ directory. Do you 
think you can execute this command once every 'X' time until the crash 
is hit? Post these files and hopefully that should be good enough to fix 
the problem.
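
For example, a small loop along these lines could be left running on one of 
the storage servers until the crash happens (the volume name and the archive 
directory are placeholders, adjust them for your setup):

#!/usr/bin/env python
# Rough sketch: take a statedump of the volume every INTERVAL seconds and
# archive the generated files so that we have samples from just before the
# crash. Volume name and archive directory below are placeholders.
import glob
import os
import shutil
import subprocess
import time

VOLUME = "gfsbackup"            # assumption: the affected volume
INTERVAL = 15 * 60              # one sample every 15 minutes
ARCHIVE = "/root/statedumps"

os.makedirs(ARCHIVE, exist_ok=True)
while True:
    subprocess.call(["gluster", "volume", "statedump", VOLUME])
    time.sleep(5)               # give the brick processes a moment to write
    for dump in glob.glob("/var/run/gluster/*.dump.*"):
        shutil.move(dump, "%s/%d-%s" % (ARCHIVE, int(time.time()),
                                        dump.rsplit("/", 1)[-1]))
    time.sleep(INTERVAL)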


Pranith

David
-- Original Message --
From: "Pranith Kumar Karampuri" <pkara...@redhat.com 
<mailto:pkara...@redhat.com>>
To: "Glomski, Patrick" <patrick.glom...@corvidtec.com 
<mailto:patrick.glom...@corvidtec.com>>; gluster-devel@gluster.org 
<mailto:gluster-devel@gluster.org>; gluster-us...@gluster.org 
<mailto:gluster-us...@gluster.org>
Cc: "David Robinson" <david.robin...@corvidtec.com 
<mailto:david.robin...@corvidtec.com>>

Sent: 12/21/2015 11:59:33 PM
Subject: Re: [Gluster-devel] glusterfsd crash due to page allocation 
failure

hi Glomski,
This is the second time I am hearing about memory allocation 
problems in 3.7.6 but this time on brick side. Are you able to 
recreate this issue? Will it be possible to get statedumps of the 
bricks processes just before they crash?


Pranith

On 12/22/2015 02:25 AM, Glomski, Patrick wrote:

Hello,

We've recently upgraded from gluster 3.6.6 to 3.7.6 and have started 
encountering dmesg page allocation errors (stack trace is appended).


It appears that glusterfsd now sometimes fills up the cache 
completely and crashes with a page allocation failure. I *believe* 
it mainly happens when copying lots of new data to the system, 
running a 'find', or similar. Hosts are all Scientific Linux 6.6 and 
these errors occur consistently on two separate gluster pools.


Has anyone else seen this issue and are there any known fixes for it 
via sysctl kernel parameters or other means?


Please let me know of any other diagnostic information that would help.

Thanks,
Patrick


[1458118.134697] glusterfsd: page allocation failure. order:5,
mode:0x20
[1458118.134701] Pid: 6010, comm: glusterfsd Not tainted
2.6.32-573.3.1.el6.x86_64 #1
[1458118.134702] Call Trace:
[1458118.134714]  [] ?
__alloc_pages_nodemask+0x7dc/0x950
[1458118.134728]  [] ?
mlx4_ib_post_send+0x680/0x1f90 [mlx4_ib]
[1458118.134733]  [] ? kmem_getpages+0x62/0x170
[1458118.134735]  [] ? fallback_alloc+0x1ba/0x270
[1458118.134736]  [] ? cache_grow+0x2cf/0x320
[1458118.134738]  [] ?
cache_alloc_node+0x99/0x160
[1458118.134743]  [] ? pskb_expand_head+0x62/0x280
[1458118.134744]  [] ? __kmalloc+0x199/0x230
[1458118.134746]  [] ? pskb_expand_head+0x62/0x280
[1458118.134748]  [] ?
__pskb_pull_tail+0x2aa/0x360
[1458118.134751]  [] ?
harmonize_features+0x29/0x70
[1458118.134753]  [] ?
dev_hard_start_xmit+0x1c4/0x490
[1458118.134758]  [] ? sch_direct_xmit+0x15a/0x1c0
[1458118.134759]  [] ? dev_queue_xmit+0x228/0x320
[1458118.134762]  [] ?
neigh_connected_output+0xbd/0x100
[1458118.134766]  [] ?
ip_finish_output+0x287/0x360
[1458118.134767]  [] ? ip_output+0xb8/0xc0
[1458118.134769]  [] ? __ip_local_out+0x9f/0xb0
[1458118.134770]  [] ? ip_local_out+0x25/0x30
[1458118.134772]  [] ? ip_queue_xmit+0x190/0x420
[1458118.134773]  [] ?
__alloc_pages_nodemask+0x129/0x950
[1458118.134776]  [] ?
tcp_transmit_skb+0x4b4/0x8b0
[1458118.134778]  [] ? tcp_write_xmit+0x1da/0xa90
[1458118.134779]  [] ? __kmalloc_node+0x4d/0x60
[1458118.134780]  [] ? tcp_push_one+0x30/0x40
[1458118.134782]  [] ? tcp_sendmsg+0x9cc/0xa20
[1458118.134786]  [] ? sock_aio_write+0x19b/0x1c0
[1458118.134788]  [] ? sock_aio_write+0x0/0x1c0
[1458118.134791]  [] ?
do_sync_readv_writev+0xfb/0x140
[1458118.134797]  [] ?
autoremove_wake_function+0x0/0x40
[1458118.134801]  [] ?
selinux_file_permission+0xbf/0x150
[1458118.134804]  [] ?
security_file_permission+0x16/0x20
[1458118.134806]  [] ? do_readv_writev+0xd6/0x1f0
[1458118.134807]  [] ? vfs_writev+0x46/0x60
[1458118.134809]  [] ? sys_writev+0x51/0xd0
[1458118.134812]  [] ?
__audit_syscall_exit+0x25e/0x290
[1458118.134816]  [] ?
system_call_fastpath+0x16/0x1b




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] glusterfsd crash due to page allocation failure

2015-12-22 Thread Pranith Kumar Karampuri



On 12/22/2015 10:45 PM, David Robinson wrote:

Niels,

> 1. how is infiniband involved/configured in this environment?

gfsib01bkp and gfs02bkp are connected via infiniband. We are using tcp 
transport as I never was able to get RDMA to work.


Volume Name: gfsbackup
Type: Distribute
Volume ID: e78d5123-d9bc-4d88-9c73-61d28abf0b41
Status: Started
Number of Bricks: 7
Transport-type: tcp
Bricks:
Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/gfsbackup
Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/gfsbackup
Brick3: gfsib02bkp.corvidtec.com:/data/brick01bkp/gfsbackup
Brick4: gfsib02bkp.corvidtec.com:/data/brick02bkp/gfsbackup
Brick5: gfsib02bkp.corvidtec.com:/data/brick03bkp/gfsbackup
Brick6: gfsib02bkp.corvidtec.com:/data/brick04bkp/gfsbackup
Brick7: gfsib02bkp.corvidtec.com:/data/brick05bkp/gfsbackup

> 2. was there a change/update of the driver (kernel update maybe?)
Before upgrading these servers from gluster 3.6.6 to 3.7.6, I did a 
'yum update' which did upgrade the kernel.

Current kernel is 2.6.32-573.12.1.el6.x86_64

> 3. do you get a coredump of the glusterfsd process when this happens?
There are a series of core files in / around the same time that this 
happens.

-rw---1 root root  168865792 Dec 22 10:45 core.3700
-rw---1 root root  168861696 Dec 22 10:45 core.3661
-rw---1 root root  168861696 Dec 22 10:45 core.3706
-rw---1 root root  168861696 Dec 22 10:45 core.3677
-rw---1 root root  168861696 Dec 22 10:45 core.3669
-rw---1 root root  168857600 Dec 22 10:45 core.3654
-rw---1 root root  254345216 Dec 22 10:45 core.3693
-rw---1 root root  254341120 Dec 22 10:45 core.3685

> 4. is this a fuse mount process, or a brick process? (check by PID?)
I have rebooted the machine as it was in a bad state and I could no 
longer write to the gluster volume.

When it does it again, I will check the PID.

Oh, you are observing cores? Then it is highly unlikely to be because of 
mem-leaks :-/. I think we need to proceed based on what Niels suggested. Let 
us see what you find out.


Pranith
This machine has both brick processes and fuse mounts.  The storage 
servers mount the volume through a fuse mount and then I use rsync to 
backup my primary storage system.


David




 Hello,

 We've recently upgraded from gluster 3.6.6 to 3.7.6 and have started
 encountering dmesg page allocation errors (stack trace is appended).

 It appears that glusterfsd now sometimes fills up the cache 
completely and
 crashes with a page allocation failure. I *believe* it mainly 
happens when
 copying lots of new data to the system, running a 'find', or 
similar. Hosts
 are all Scientific Linux 6.6 and these errors occur consistently on 
two

 separate gluster pools.

 Has anyone else seen this issue and are there any known fixes for 
it via

 sysctl kernel parameters or other means?

 Please let me know of any other diagnostic information that would 
help.


Could you explain a little more about this? The below is a message from
the kernel telling you that the mlx4_ib (Mellanox Infiniband?) driver is
requesting more contiguous memory than is immediately available.

So, the questions I have regarding this:

1. how is infiniband involved/configured in this environment?
2. was there a change/update of the driver (kernel update maybe?)
3. do you get a coredump of the glusterfsd process when this happens?
4. is this a fuse mount process, or a brick process? (check by PID?)

Thanks,
Niels




 Thanks,
 Patrick


 [1458118.134697] glusterfsd: page allocation failure. order:5, 
mode:0x20

 > [1458118.134701] Pid: 6010, comm: glusterfsd Not tainted
 > 2.6.32-573.3.1.el6.x86_64 #1
 > [1458118.134702] Call Trace:
 > [1458118.134714]  [] ? 
__alloc_pages_nodemask+0x7dc/0x950
 > [1458118.134728]  [] ? 
mlx4_ib_post_send+0x680/0x1f90

 > [mlx4_ib]
 > [1458118.134733]  [] ? kmem_getpages+0x62/0x170
 > [1458118.134735]  [] ? fallback_alloc+0x1ba/0x270
 > [1458118.134736]  [] ? cache_grow+0x2cf/0x320
 > [1458118.134738]  [] ? 
cache_alloc_node+0x99/0x160

 > [1458118.134743]  [] ? pskb_expand_head+0x62/0x280
 > [1458118.134744]  [] ? __kmalloc+0x199/0x230
 > [1458118.134746]  [] ? pskb_expand_head+0x62/0x280
 > [1458118.134748]  [] ? 
__pskb_pull_tail+0x2aa/0x360
 > [1458118.134751]  [] ? 
harmonize_features+0x29/0x70
 > [1458118.134753]  [] ? 
dev_hard_start_xmit+0x1c4/0x490

 > [1458118.134758]  [] ? sch_direct_xmit+0x15a/0x1c0
 > [1458118.134759]  [] ? dev_queue_xmit+0x228/0x320
 > [1458118.134762]  [] ? 
neigh_connected_output+0xbd/0x100
 > [1458118.134766]  [] ? 
ip_finish_output+0x287/0x360

 > [1458118.134767]  [] ? ip_output+0xb8/0xc0
 > [1458118.134769]  [] ? __ip_local_out+0x9f/0xb0
 > [1458118.134770]  [] ? ip_local_out+0x25/0x30
 > [1458118.134772]  [] ? ip_queue_xmit+0x190/0x420
 > [1458118.134773]  [] ? 
__alloc_pages_nodemask+0x129/0x950
 > [1458118.134776]  [] ? 
tcp_transmit_skb+0x4b4/0x8b0

 > [1458118.134778]  [] ? tcp_write_xmit+0x1da/0xa90
 > [1458118.134779] 

[Gluster-devel] 3.7.7 release

2015-12-24 Thread Pranith Kumar Karampuri

hi,
  I am going to make the 3.7.7 release early next week. Please make sure 
your patches are merged. If you have any patches that must go into 3.7.7, 
let me know. I will wait for them to be merged.


Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Help needed in understanding GlusterFS logs and debugging elasticsearch failures

2015-12-17 Thread Pranith Kumar Karampuri



On 12/17/2015 04:03 PM, Vijay Bellur wrote:

On 12/17/2015 05:09 AM, Sachidananda URS wrote:

Hi,

I tried the same use case with pure DHT (1 & 2 nodes). I don't see any
problems.
However, if I try the same tests with distributed replicate, the indices
go into red.

If any additional details are needed than the logs attached in the
earlier mails please let me know.



Can you please try with option "cluster.consistent-metadata" enabled 
on a distributed replicated volume?
I am talking to Sac and will look at the machines with him. Will post 
with updates.


Pranith


Thanks,
Vijay

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] NetBSD tests not running to completion.

2016-01-08 Thread Pranith Kumar Karampuri



On 01/08/2016 08:14 PM, Emmanuel Dreyfus wrote:

On Fri, Jan 08, 2016 at 10:56:22AM +, Emmanuel Dreyfus wrote:

On Fri, Jan 08, 2016 at 03:18:02PM +0530, Pranith Kumar Karampuri wrote:

With your support I think we can make things better. To avoid duplication of
work, did you take any tests that you are already investigating? If not that
is the first thing I will try to find out.

I will look at the ./tests/basic/afr/arbiter-statfs.t problem with
loopback device.

800 runs so far without a hitch. I suspect the problem is caused by leftovers
from another test.



I see the following lines in 'cleanup' function:
NetBSD)
vnd=`vnconfig -l | \
 awk '!/not in use/{printf("%s%s:%d ", $1, $2, $5);}'`

Can there be loopback devices that are still in use when this piece of 
code is executed, which could lead to the problems we ran into? I may be 
completely wrong; it is a wild guess about something I don't completely 
understand.


Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] NetBSD tests not running to completion.

2016-01-08 Thread Pranith Kumar Karampuri



Does it seems reasonable? That way nothing can hang more than 2 hours.

That addresses the technical issue of hanging tests.  It doesn't address
the process issue of the entire project and development team being held
hostage to one feature.


Guys,
I think we just need to come up with rules for considering a 
platform to have voting ability before merging a patch, which is not 
too hard to do if we all put our minds to it and come up with 
something that is agreeable to everyone. Just like glusterfs goes 
through ups and downs in stability during development, a platform may also 
go through the same. I do agree that platform stability shouldn't 
hinder patch acceptance, be it Linux or NetBSD (I hope FreeBSD can also 
become a voting member), so a platform may come and go in the list of 
platforms that can vote, based on its stability. We need both 
entry and exit criteria for a platform to be considered to have a vote.


Of course, if we agree to the things above, when some .t fails while the 
platform is considered stable, we need to fix it. But if it is something 
that happens only on the platform (maybe the loopback failure happening 
on NetBSD which Emmanuel is looking at now falls into that category), 
then until these kinds of issues are fixed we shouldn't let the project 
be slowed down. Setting clear expectations as to why some platform can 
vote for patch merging will go a long way in preventing these kinds of 
discussions in future; maybe it can even be automated, and the port 
maintainer will be notified when the platform health degrades to a point 
where it has to exit the list of platforms which have a vote. Even when 
the platform doesn't vote, it still runs the tests. Once the port 
maintainers solve the problems and the health for that port is better, 
it can be automatically added to the list of platforms which can vote.
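
As a rough illustration of the kind of automation I mean (the thresholds and 
the data source are placeholders):

# Rough sketch: compute the pass rate of each platform over its last N
# regression runs and flip its voting status based on entry/exit criteria.
ENTRY_THRESHOLD = 0.90   # a platform starts voting above this pass rate
EXIT_THRESHOLD = 0.75    # a platform stops voting below this pass rate

def update_voting(platform, recent_results, currently_voting):
    # recent_results: list of True/False for the platform's last N runs
    pass_rate = sum(recent_results) / float(len(recent_results))
    if currently_voting and pass_rate < EXIT_THRESHOLD:
        print("%s drops out of voting; notify the port maintainer" % platform)
        return False
    if not currently_voting and pass_rate > ENTRY_THRESHOLD:
        print("%s is healthy again; add it back to the voting list" % platform)
        return True
    return currently_voting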


For all this to happen we need data in place. Jeff's patch to find 
failures is a great addition in gathering such data. We need more such 
data points. All of us need to agree on the data points.


These are my thoughts on the matter. Comments welcome!

Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] tests/bugs/tier/bug-1286974.t failed and dropped a core

2016-01-12 Thread Pranith Kumar Karampuri



On 01/13/2016 09:14 AM, Dan Lambright wrote:


- Original Message -

From: "Niels de Vos" 
To: "Dan Lambright" , "Joseph Fernandes" 

Cc: gluster-devel@gluster.org
Sent: Tuesday, January 12, 2016 3:52:51 PM
Subject: tests/bugs/tier/bug-1286974.t failed and dropped a core

Hi guys,

could you please have a look at this regression test failure?

Pranith, could you or someone with EC expertise help us diagnose this problem?

The test script does:

TEST touch /mnt/glusterfs/0/file{1..100};

I see some number of errors such as:

[2016-01-12 20:14:26.888412] E [MSGID: 122063] 
[ec-common.c:943:ec_prepare_update_cbk] 0-patchy-disperse-0: Unable to get size 
xattr [No such file or directory]
[2016-01-12 20:14:26.888493] E [MSGID: 109031] 
[dht-linkfile.c:301:dht_linkfile_setattr_cbk] 0-patchy-tier-dht: Failed to set 
attr uid/gid on /file28 :  [No such file or directory]

.. right before the crash. The backtrace is in mnt-glusterfs-0.log, it failed 
in ec function ec_manager_setattr().

It appears to be an assert, if I found the code right.

        GF_ASSERT(ec_get_inode_size(fop,
                                    fop->locks[0].lock->loc.inode,
                                    &fop->iatt[0].ia_size));

/lib64/libc.so.6(+0x2b74e)[0x7f84c62a974e]
/lib64/libc.so.6(__assert_perror_fail+0x0)[0x7f84c62a9810]
/build/install/lib/glusterfs/3.8dev/xlator/cluster/disperse.so(+0x312f5)[0x7f84ba5ce2f5]
/build/install/lib/glusterfs/3.8dev/xlator/cluster/disperse.so(+0x14918)[0x7f84ba5b1918]
/build/install/lib/glusterfs/3.8dev/xlator/cluster/disperse.so(+0x10756)[0x7f84ba5ad756]
/build/install/lib/glusterfs/3.8dev/xlator/cluster/disperse.so(+0x1093c)[0x7f84ba5ad93c]
/build/install/lib/glusterfs/3.8dev/xlator/cluster/disperse.so(+0x2fbe0)[0x7f84ba5ccbe0]
/build/install/lib/glusterfs/3.8dev/xlator/cluster/disperse.so(+0x30ea9)[0x7f84ba5cdea9]
/build/install/lib/glusterfs/3.8dev/xlator/protocol/client.so(+0x1f706)[0x7f84ba854706]
/build/install/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x1b2)[0x7f84c74e542a]


This is because dht sends setattr on an inode that is not linked. For 
now we are addressing it in EC with http://review.gluster.org/13039. 
Regressions need to pass.


Pranith




 
https://build.gluster.org/job/rackspace-regression-2GB-triggered/17475/consoleFull

 [20:14:46] ./tests/bugs/tier/bug-1286974.t ..
 not ok 16
 Failed 1/24 subtests
 [20:14:46]
 
 Test Summary Report

 ---
 ./tests/bugs/tier/bug-1286974.t (Wstat: 0 Tests: 24 Failed: 1)
   Failed test:  16
 Files=1, Tests=24, 37 wallclock secs ( 0.03 usr  0.01 sys +  2.38 cusr
 0.77 csys =  3.19 CPU)
 Result: FAIL
 ./tests/bugs/tier/bug-1286974.t: bad status 1
 ./tests/bugs/tier/bug-1286974.t: 1 new core files
 Ignoring failure from known-bad test ./tests/bugs/tier/bug-1286974.t

Failures are ignored as mentioned in the last line, but cores are not
allowed. Please prevent this from happening :)

Thanks,
Niels



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-infra] NetBSD tests not running to completion.

2016-01-12 Thread Pranith Kumar Karampuri



On 01/11/2016 01:00 PM, Pranith Kumar Karampuri wrote:



On 01/09/2016 12:34 AM, Vijay Bellur wrote:

On 01/08/2016 08:18 AM, Jeff Darcy wrote:

  I think we just need to come up with rules for considering a
platform to have voting ability before merging the patch.


I totally agree, except for the "just" part.  ;)  IMO a platform is 
much

like a feature in terms of requiring commitment/accountability,
community agreement on cost/benefit, and so on.  You can see a lot of
that in the feature-page template.

https://github.com/gluster/glusterfs-specs/blob/master/in_progress/template.md 



That might provide a good starting point, even though some items won't
apply to a platform and others are surely missing.  It's new territory,
after all.  Also, I believe the bar for platforms should be higher than
for features, because a new platform multiplies our test load (and
associated burdens) instead of merely adding to it.  Also, new features
rarely impact all developers the way that new platforms do.

Nobody should be making assumptions or unilateral decisions about
something as important as when it is or is not OK to block all merges
throughout the project.  That needs to be the subject of an explicit 
and

carefully considered community decision.  That, in turn, requires some
clearly defined cost/benefit analysis and resource commitment. If we
don't get the process right this time, we'll end up having this same
conversation yet again, and I'm sure nobody wants that.


Agree here.

Pranith - can you please help come up with a governance process for 
platforms in consultation with Jeff and Emmanuel? Once it is ready we 
can propose that in the broader community and formalize it.


Sent the following rfc patch which will be updated based on the 
discussions.

http://review.gluster.org/13211

I left the decisions we need to come up with as TBD in the sections. 
Please feel free to suggest what you would like to see there as 
comments on the patch. If we need more sections that we need to 
consider, let us add them as comments too.
I will periodically refresh the patch based on the decisions that are 
agreed upon.
+ All the people who responded on the thread. Please update the patch 
with your suggestions.


Pranith


I hope the discussion will come to natural conclusions based on the 
discussions there.


Thanks
Pranith


Thanks,
Vijay



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-infra] NetBSD tests not running to completion.

2016-01-10 Thread Pranith Kumar Karampuri



On 01/09/2016 12:34 AM, Vijay Bellur wrote:

On 01/08/2016 08:18 AM, Jeff Darcy wrote:

  I think we just need to come up with rules for considering a
platform to have voting ability before merging the patch.


I totally agree, except for the "just" part.  ;)  IMO a platform is much
like a feature in terms of requiring commitment/accountability,
community agreement on cost/benefit, and so on.  You can see a lot of
that in the feature-page template.

https://github.com/gluster/glusterfs-specs/blob/master/in_progress/template.md 



That might provide a good starting point, even though some items won't
apply to a platform and others are surely missing.  It's new territory,
after all.  Also, I believe the bar for platforms should be higher than
for features, because a new platform multiplies our test load (and
associated burdens) instead of merely adding to it.  Also, new features
rarely impact all developers the way that new platforms do.

Nobody should be making assumptions or unilateral decisions about
something as important as when it is or is not OK to block all merges
throughout the project.  That needs to be the subject of an explicit and
carefully considered community decision.  That, in turn, requires some
clearly defined cost/benefit analysis and resource commitment. If we
don't get the process right this time, we'll end up having this same
conversation yet again, and I'm sure nobody wants that.


Agree here.

Pranith - can you please help come up with a governance process for 
platforms in consultation with Jeff and Emmanuel? Once it is ready we 
can propose that in the broader community and formalize it.


Sent the following rfc patch which will be updated based on the discussions.
http://review.gluster.org/13211

I left the decisions we need to come up with as TBD in the sections. 
Please feel free to suggest what you would like to see there as comments 
on the patch. If we need more sections that we need to consider, let us 
add them as comments too.
I will periodically refresh the patch based on the decisions that are 
agreed upon.


I hope the discussion will come to natural conclusions based on the 
discussions there.


Thanks
Pranith


Thanks,
Vijay



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] scripts to get incoming bugs on components, number of reviews from gerrit

2016-06-07 Thread Pranith Kumar Karampuri
hi,
Does anyone know/have any scripts to get this information from
bugzilla/gerrit?

-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Huge VSZ (VIRT) usage by glustershd on dummy node

2016-06-08 Thread Pranith Kumar Karampuri
Oleksandr,
Could you take a statedump of the shd process once every 5-10 minutes and send
maybe 5 samples of them when it starts to increase? This will help us find
what datatypes are being allocated a lot and can lead to coming up with
possible theories for the increase.
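
Once we have a few samples, a rough comparison like the one below should tell 
us which data types are growing between two dumps (the parsing assumes the 
usual '[... usage-type <type> memusage]' sections followed by 'size=' lines; 
adjust it if your dumps look different):

#!/usr/bin/env python
# Rough sketch: compare two statedump samples and print the data types
# whose allocated size grew the most between them.
import re
import sys
from collections import defaultdict

def sizes(path):
    per_type = defaultdict(int)
    current = None
    with open(path) as f:
        for line in f:
            m = re.match(r"\[.*usage-type (\S+) memusage\]", line)
            if m:
                current = m.group(1)
            elif current and line.startswith("size="):
                per_type[current] += int(line.split("=", 1)[1])
    return per_type

old, new = sizes(sys.argv[1]), sizes(sys.argv[2])
growth = {t: size - old.get(t, 0) for t, size in new.items()}
for t, delta in sorted(growth.items(), key=lambda kv: kv[1], reverse=True)[:15]:
    print("%12d  %s" % (delta, t))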

On Wed, Jun 8, 2016 at 12:03 PM, Oleksandr Natalenko <
oleksa...@natalenko.name> wrote:

> Also, I've checked shd log files, and found out that for some reason shd
> constantly reconnects to bricks: [1]
>
> Please note that suggested fix [2] by Pranith does not help, VIRT value
> still grows:
>
> ===
> root  1010  0.0  9.6 7415248 374688 ?  Ssl  чер07   0:14
> /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
> /var/lib/glusterd/glustershd/run/glustershd.pid -l
> /var/log/glusterfs/glustershd.log -S
> /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket --xlator-option
> *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7
> ===
>
> I do not know the reason why it is reconnecting, but I suspect leak to
> happen on that reconnect.
>
> CCing Pranith.
>
> [1] http://termbin.com/brob
> [2] http://review.gluster.org/#/c/14053/
>
> 06.06.2016 12:21, Kaushal M написав:
>
>> Has multi-threaded SHD been merged into 3.7.* by any chance? If not,
>>
>> what I'm saying below doesn't apply.
>>
>> We saw problems when encrypted transports were used, because the RPC
>> layer was not reaping threads (doing pthread_join) when a connection
>> ended. This lead to similar observations of huge VIRT and relatively
>> small RSS.
>>
>> I'm not sure how multi-threaded shd works, but it could be leaking
>> threads in a similar way.
>>
>> On Mon, Jun 6, 2016 at 1:54 PM, Oleksandr Natalenko
>>  wrote:
>>
>>> Hello.
>>>
>>> We use v3.7.11, replica 2 setup between 2 nodes + 1 dummy node for
>>> keeping
>>> volumes metadata.
>>>
>>> Now we observe huge VSZ (VIRT) usage by glustershd on dummy node:
>>>
>>> ===
>>> root 15109  0.0 13.7 76552820 535272 ? Ssl  тра26   2:11
>>> /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
>>> /var/lib/glusterd/glustershd/run/glustershd.pid -l
>>> /var/log/glusterfs/glustershd.log -S
>>> /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket --xlator-option
>>> *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7
>>> ===
>>>
>>> that is ~73G. RSS seems to be OK (~522M). Here is the statedump of
>>> glustershd process: [1]
>>>
>>> Also, here is sum of sizes, presented in statedump:
>>>
>>> ===
>>> # cat /var/run/gluster/glusterdump.15109.dump.1465200139 | awk -F '='
>>> 'BEGIN
>>> {sum=0} /^size=/ {sum+=$2} END {print sum}'
>>> 353276406
>>> ===
>>>
>>> That is ~337 MiB.
>>>
>>> Also, here are VIRT values from 2 replica nodes:
>>>
>>> ===
>>> root 24659  0.0  0.3 5645836 451796 ?  Ssl  тра24   3:28
>>> /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
>>> /var/lib/glusterd/glustershd/run/glustershd.pid -l
>>> /var/log/glusterfs/glustershd.log -S
>>> /var/run/gluster/44ec3f29003eccedf894865107d5db90.socket --xlator-option
>>> *replicate*.node-uuid=a19afcc2-e26c-43ce-bca6-d27dc1713e87
>>> root 18312  0.0  0.3 6137500 477472 ?  Ssl  тра19   6:37
>>> /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
>>> /var/lib/glusterd/glustershd/run/glustershd.pid -l
>>> /var/log/glusterfs/glustershd.log -S
>>> /var/run/gluster/1670a3abbd1eea968126eb6f5be20322.socket --xlator-option
>>> *replicate*.node-uuid=52dca21b-c81c-48b5-9de2-1ed37987fbc2
>>> ===
>>>
>>> Those are 5 to 6G, which is much less than dummy node has, but still look
>>> too big for us.
>>>
>>> Should we care about huge VIRT value on dummy node? Also, how one would
>>> debug that?
>>>
>>> Regards,
>>>   Oleksandr.
>>>
>>> [1] https://gist.github.com/d2cfa25251136512580220fcdb8a6ce6
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>
>>


-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Huge VSZ (VIRT) usage by glustershd on dummy node

2016-06-08 Thread Pranith Kumar Karampuri
On Wed, Jun 8, 2016 at 12:33 PM, Oleksandr Natalenko <
oleksa...@natalenko.name> wrote:

> Yup, I can do that, but please note that RSS does not change. Will
> statedump show VIRT values?
>
> Also, I'm looking at the numbers now, and see that on each reconnect VIRT
> grows by ~24M (once per ~10–15 mins). Probably, that could give you some
> idea what is going wrong.
>

That's interesting. I have never seen something like this happen. I would still
like to see if there are any clues in the statedump when all this happens. Maybe
what you said will be confirmed, that nothing new is allocated, but I
would just like to confirm.


> 08.06.2016 09:50, Pranith Kumar Karampuri написав:
>
> Oleksandr,
>> Could you take statedump of the shd process once in 5-10 minutes and
>> send may be 5 samples of them when it starts to increase? This will
>> help us find what datatypes are being allocated a lot and can lead to
>> coming up with possible theories for the increase.
>>
>> On Wed, Jun 8, 2016 at 12:03 PM, Oleksandr Natalenko
>> <oleksa...@natalenko.name> wrote:
>>
>> Also, I've checked shd log files, and found out that for some reason
>>> shd constantly reconnects to bricks: [1]
>>>
>>> Please note that suggested fix [2] by Pranith does not help, VIRT
>>> value still grows:
>>>
>>> ===
>>> root  1010  0.0  9.6 7415248 374688 ?  Ssl  чер07   0:14
>>> /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
>>> /var/lib/glusterd/glustershd/run/glustershd.pid -l
>>> /var/log/glusterfs/glustershd.log -S
>>> /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket
>>> --xlator-option
>>> *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7
>>> ===
>>>
>>> I do not know the reason why it is reconnecting, but I suspect leak
>>> to happen on that reconnect.
>>>
>>> CCing Pranith.
>>>
>>> [1] http://termbin.com/brob
>>> [2] http://review.gluster.org/#/c/14053/
>>>
>>> 06.06.2016 12:21, Kaushal M написав:
>>> Has multi-threaded SHD been merged into 3.7.* by any chance? If
>>> not,
>>>
>>> what I'm saying below doesn't apply.
>>>
>>> We saw problems when encrypted transports were used, because the RPC
>>> layer was not reaping threads (doing pthread_join) when a connection
>>> ended. This lead to similar observations of huge VIRT and relatively
>>> small RSS.
>>>
>>> I'm not sure how multi-threaded shd works, but it could be leaking
>>> threads in a similar way.
>>>
>>> On Mon, Jun 6, 2016 at 1:54 PM, Oleksandr Natalenko
>>> <oleksa...@natalenko.name> wrote:
>>> Hello.
>>>
>>> We use v3.7.11, replica 2 setup between 2 nodes + 1 dummy node for
>>> keeping
>>> volumes metadata.
>>>
>>> Now we observe huge VSZ (VIRT) usage by glustershd on dummy node:
>>>
>>> ===
>>> root 15109  0.0 13.7 76552820 535272 ? Ssl  тра26   2:11
>>> /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
>>> /var/lib/glusterd/glustershd/run/glustershd.pid -l
>>> /var/log/glusterfs/glustershd.log -S
>>> /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket
>>> --xlator-option
>>> *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7
>>> ===
>>>
>>> that is ~73G. RSS seems to be OK (~522M). Here is the statedump of
>>> glustershd process: [1]
>>>
>>> Also, here is sum of sizes, presented in statedump:
>>>
>>> ===
>>> # cat /var/run/gluster/glusterdump.15109.dump.1465200139 | awk -F
>>> '=' 'BEGIN
>>> {sum=0} /^size=/ {sum+=$2} END {print sum}'
>>> 353276406
>>> ===
>>>
>>> That is ~337 MiB.
>>>
>>> Also, here are VIRT values from 2 replica nodes:
>>>
>>> ===
>>> root 24659  0.0  0.3 5645836 451796 ?  Ssl  тра24   3:28
>>> /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
>>> /var/lib/glusterd/glustershd/run/glustershd.pid -l
>>> /var/log/glusterfs/glustershd.log -S
>>> /var/run/gluster/44ec3f29003eccedf894865107d5db90.socket
>>> --xlator-option
>>> *replicate*.node-uuid=a19afcc2-e26c-43ce-bca6-d27dc1713e87
>>> root 18312  0.0  0.3 6137500 477472 ?  Ssl  тра19   6:37
>>> /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
>>> /var/lib/glusterd/glustershd/run/glustershd.pid -l
>>> /var/log/glusterfs/glustershd.log -S
>>> /var/run/gluster/1670a3abbd1eea968126eb6f5be20322.socket
>>> --xlator-option
>>> *replicate*.node-uuid=52dca21b-c81c-48b5-9de2-1ed37987fbc2
>>> ===
>>>
>>> Those are 5 to 6G, which is much less than dummy node has, but still
>>> look
>>> too big for us.
>>>
>>> Should we care about huge VIRT value on dummy node? Also, how one
>>> would
>>> debug that?
>>>
>>> Regards,
>>> Oleksandr.
>>>
>>> [1] https://gist.github.com/d2cfa25251136512580220fcdb8a6ce6
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>
>>
>> --
>>
>> Pranith
>>
>


-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Gluster Community Newsletter, May 2016

2016-05-27 Thread Pranith Kumar Karampuri
Hey Amye,
The form doesn't seem to allow editing "Your Role within Gluster" and
"Why should you attend?" Could you let us know how to fill these fields?

Pranith

On Sat, May 28, 2016 at 12:45 AM, Amye Scavarda  wrote:

> Important happenings for Gluster this month:
> We're closing in on a 3.8 release, with release candidate 2 released on
> May 24th. (
> http://www.gluster.org/pipermail/gluster-devel/2016-May/049642.html)
> Our 3.8 roadmap of features is available at:
> https://www.gluster.org/community/roadmap/3.8/
> Our current timeline is to have a release in June, so update your release
> notes!
>
> Gluster Developers Summit:
> October 6, 7 directly following LinuxCon Berlin
> https://www.gluster.org/events/summit2016/
>
> This is an invite-only event, but you can apply for an invitation.
> Deadline for application is July 31, 2016.
> Apply for an invitation:
> http://goo.gl/forms/JOEzoimW9qVV4jdz1
>
>
> Gluster.org events page:
> Instead of having just a publicpad, we now have an area on the website
> where oyu can add upcoming talks, meetups and other events.
> https://www.gluster.org/events/
> -- To contribute an event, submit a pull request to the glusterweb github
> account.
> See something that should be moved over from the publicpad? Feel free to
> contribute!
>
> Change in presentation tracking:
> We have a new slideshare account to share Gluster-related presentations
> with.
> Find everything that was previously in the documentation site over at the
> new slideshare account.
> http://www.slideshare.net/GlusterCommunity/
>
> Top 5 contributors:
> Kaleb S. Keithley, Prasanna Kumar Kalever, Pranith Kumar K, Niels de Vos,
> Krutika Dhananjay
>
> Upcoming CFPs:
> LinuxCon Europe: (
> http://events.linuxfoundation.org/events/linuxcon-europe/program/cfp)
>  June 17
>
> --
> Amye Scavarda | a...@redhat.com | Gluster Community Lead
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] dht mkdir preop check, afr and (non-)readable afr subvols

2016-05-31 Thread Pranith Kumar Karampuri
Just checked the ec code. Looks okay. All entry fops are also updating the metadata
and data parts of the xattr.

On Tue, May 31, 2016 at 12:37 PM, Xavier Hernandez <xhernan...@datalab.es>
wrote:

> Hi,
>
> On 31/05/16 07:05, Raghavendra Gowdappa wrote:
>
>> +gluster-devel, +Xavi
>>
>> Hi all,
>>
>> The context is [1], where bricks do pre-operation checks before doing a
>> fop and proceed with fop only if pre-op check is successful.
>>
>> @Xavi,
>>
>> We need your inputs on behavior of EC subvolumes as well.
>>
>
> If I understand correctly, EC shouldn't have any problems here.
>
> EC sends the mkdir request to all subvolumes that are currently considered
> "good" and tries to combine the answers. Answers that match in return code,
> errno (if necessary) and xdata contents (except for some special xattrs
> that are ignored for combination purposes), are grouped.
>
> Then it takes the group with more members/answers. If that group has a
> minimum size of #bricks - redundancy, it is considered the good answer.
> Otherwise EIO is returned because bricks are in an inconsistent state.
>
> If there's any answer in another group, it's considered bad and gets
> marked so that self-heal will repair it using the good information from the
> majority of bricks.
>
> xdata is combined and returned even if return code is -1.
>
> Is that enough to cover the needed behavior ?
>
> Xavi
>
>
>
>> [1] http://review.gluster.org/13885
>>
>> regards,
>> Raghavendra
>>
>> - Original Message -
>>
>>> From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
>>> To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
>>> Cc: "team-quine-afr" <team-quine-...@redhat.com>, "rhs-zteam" <
>>> rhs-zt...@redhat.com>
>>> Sent: Tuesday, May 31, 2016 10:22:49 AM
>>> Subject: Re: dht mkdir preop check, afr and (non-)readable afr subvols
>>>
>>> I think you should start a discussion on gluster-devel so that Xavi gets
>>> a
>>> chance to respond on the mails as well.
>>>
>>> On Tue, May 31, 2016 at 10:21 AM, Raghavendra Gowdappa <
>>> rgowd...@redhat.com>
>>> wrote:
>>>
>>> Also note that we've plans to extend this pre-op check to all dentry
>>>> operations which also depend parent layout. So, the discussion need to
>>>> cover all dentry operations like:
>>>>
>>>> 1. create
>>>> 2. mkdir
>>>> 3. rmdir
>>>> 4. mknod
>>>> 5. symlink
>>>> 6. unlink
>>>> 7. rename
>>>>
>>>> We also plan to have similar checks in lock codepath for directories too
>>>> (planning to use hashed-subvolume as lock-subvolume for directories).
>>>> So,
>>>> more fops :)
>>>> 8. lk (posix locks)
>>>> 9. inodelk
>>>> 10. entrylk
>>>>
>>>> regards,
>>>> Raghavendra
>>>>
>>>> - Original Message -
>>>>
>>>>> From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
>>>>> To: "team-quine-afr" <team-quine-...@redhat.com>
>>>>> Cc: "rhs-zteam" <rhs-zt...@redhat.com>
>>>>> Sent: Tuesday, May 31, 2016 10:15:04 AM
>>>>> Subject: dht mkdir preop check, afr and (non-)readable afr subvols
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I have some queries related to the behavior of afr_mkdir with respect
>>>>> to
>>>>> readable subvols.
>>>>>
>>>>> 1. While winding mkdir to subvols does afr check whether the subvolume
>>>>> is
>>>>> good/readable? Or does it wind to all subvols irrespective of whether a
>>>>> subvol is good/bad? In the latter case, what if
>>>>>a. mkdir succeeds on non-readable subvolume
>>>>>b. fails on readable subvolume
>>>>>
>>>>>   What is the result reported to higher layers in the above scenario?
>>>>> If
>>>>>   mkdir is failed, is it cleaned up on non-readable subvolume where it
>>>>>   failed?
>>>>>
>>>>> I am interested in this case as the dht-preop check relies on layout
>>>>> xattrs, and I assume layout xattrs in particular (and all xattrs in
>>>>> general) are guaranteed to be correct only on a readable subvolume of
>>>>> afr. So, in essence, we shouldn't be winding down mkdir on non-readable
>>>>> subvols, as whatever decision the brick makes as part of the pre-op
>>>>> check is inherently flawed.
>>>>>
>>>>> regards,
>>>>> Raghavendra
>>>>>
>>>> --
>>> Pranith
>>>
>>>


-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] How to enable direct io??

2016-06-21 Thread Pranith Kumar Karampuri
There are two things you need to change for O_DIRECT to be handled properly
in the gluster stack:

1) gluster volume set <volname> performance.strict-o-direct on
   (for NFS the equivalent option is:
    gluster volume set <volname> performance.nfs.strict-o-direct on)

2) gluster volume set <volname> network.remote-dio off

Please note that we found a bug in O_DIRECT reads which happens sometimes;
it is fixed by http://review.gluster.org/14639

Without this patch you may occasionally get EINVAL for reads.

Pranith

On Fri, Jun 17, 2016 at 7:04 PM, Keiviw  wrote:

> By "mount -t glusterfs :/testvol -o direct-io-mode=true
> mountpoint",the GlusterFS client will enable the direct io, and the file
> will not cached in the GlusterFS client,but it won't work in the GlusterFS
> server. By defalut,the GlusterFS will ignore the direct io flag. How to
> make the server work in direct-io-mode??
>
>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Question on merging zfs snapshot support into the mainline glusterfs

2016-06-21 Thread Pranith Kumar Karampuri
hi,
  Is there a plan to come up with an interface for snapshot
functionality? For example, in handling different types of sockets in
gluster all we need to do is to specify which interface we want to use and
ib,network-socket,unix-domain sockets all implement the interface. The code
doesn't have to assume anything about underlying socket type. Do you guys
think it is a worthwhile effort to separate out the interface from the code
which uses snapshots? I see quite a few if (strcmp ("zfs", fstype)) checks
which could all be removed if we do this. Adding btrfs snapshots in the
future would be a breeze as well this way: all we need to do is implement
the snapshot interface using btrfs snapshot commands. I am not talking about
this patch per se; I just wanted to seek your inputs about future plans for
ease of maintaining the feature.
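
As an illustration of the interface idea above, a minimal sketch follows;
the struct and function names are made up for this example and do not exist
in the GlusterFS tree:

#include <stddef.h>
#include <string.h>

/* Sketch of a pluggable snapshot backend interface (invented names). */
typedef struct snap_backend_ops {
        const char *fstype;    /* "lvm", "zfs", "btrfs", ... */
        int (*create)   (const char *brick_path, const char *snap_name);
        int (*restore)  (const char *brick_path, const char *snap_name);
        int (*remove)   (const char *brick_path, const char *snap_name);
        int (*activate) (const char *brick_path, const char *snap_name);
} snap_backend_ops_t;

/* Each backend (lvm, zfs, later btrfs) fills in one ops table.  Callers
 * look the backend up once by fstype instead of sprinkling
 * if (strcmp ("zfs", fstype)) checks through the snapshot code. */
static snap_backend_ops_t *
snap_backend_lookup(snap_backend_ops_t **backends, const char *fstype)
{
        for (int i = 0; backends[i] != NULL; i++)
                if (strcmp(backends[i]->fstype, fstype) == 0)
                        return backends[i];
        return NULL;
}

Supporting btrfs would then mean adding one more ops table rather than
touching every call site.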

On Tue, Jun 21, 2016 at 11:46 AM, Atin Mukherjee 
wrote:

>
>
> On 06/21/2016 11:41 AM, Rajesh Joseph wrote:
> > What kind of locking issues you see? If you can provide some more
> > information I can be able to help you.
>
> That's related to stale lock issues on GlusterD which are there in 3.6.1
> since the fixes landed in the branch post 3.6.1. I have already provided
> the workaround/way to fix them [1]
>
> [1]
> http://www.gluster.org/pipermail/gluster-users/2016-June/thread.html#26995
>
> ~Atin
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Question on merging zfs snapshot support into the mainline glusterfs

2016-06-22 Thread Pranith Kumar Karampuri
Cool. Nice to know it is on the cards.

On Wed, Jun 22, 2016 at 11:45 AM, Rajesh Joseph <rjos...@redhat.com> wrote:

>
>
> On Tue, Jun 21, 2016 at 4:24 PM, Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>> hi,
>>   Is there a plan to come up with an interface for snapshot
>> functionality? For example, in handling different types of sockets in
>> gluster all we need to do is to specify which interface we want to use and
>> ib,network-socket,unix-domain sockets all implement the interface. The code
>> doesn't have to assume anything about underlying socket type. Do you guys
>> think it is a worthwhile effort to separate out the logic of interface and
>> the code which uses snapshots? I see quite a few of if (strcmp ("zfs",
>> fstype)) code which can all be removed if we do this. Giving btrfs
>> snapshots in future will be a breeze as well, this way? All we need to do
>> is implementing snapshot interface using btrfs snapshot commands. I am not
>> talking about this patch per se. Just wanted to seek your inputs about
>> future plans for ease of maintaining the feature.
>>
>
> As I said in my previous mail this is in plan and we will be doing it. But
> due to other priorities this was not taken in yet.
>
>
>>
>> On Tue, Jun 21, 2016 at 11:46 AM, Atin Mukherjee <amukh...@redhat.com>
>> wrote:
>>
>>>
>>>
>>> On 06/21/2016 11:41 AM, Rajesh Joseph wrote:
>>> > What kind of locking issues you see? If you can provide some more
>>> > information I can be able to help you.
>>>
>>> That's related to stale lock issues on GlusterD which are there in 3.6.1
>>> since the fixes landed in the branch post 3.6.1. I have already provided
>>> the workaround/way to fix them [1]
>>>
>>> [1]
>>>
>>> http://www.gluster.org/pipermail/gluster-users/2016-June/thread.html#26995
>>>
>>> ~Atin
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>
>>
>>
>>
>> --
>> Pranith
>>
>
>


-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Reduce memcpy in glfs read and write

2016-06-21 Thread Pranith Kumar Karampuri
On Wed, Jun 22, 2016 at 5:50 AM, Sachin Pandit <span...@commvault.com>
wrote:

> Hey Pranith, I am good, I hope you are doing good too.
>
> Please find the comments inline.
>
>
>
> *From:* Pranith Kumar Karampuri [mailto:pkara...@redhat.com]
> *Sent:* Tuesday, June 21, 2016 5:58 AM
> *To:* Sachin Pandit <span...@commvault.com>
> *Cc:* gluster-devel@gluster.org
> *Subject:* Re: [Gluster-devel] Reduce memcpy in glfs read and write
>
>
>
> Hey!!
>
>Hope you are doing good. I took a look at the bt. So when flush
> comes write-behind has to flush all the writes down. I see the following
> frame hung in iob_unref:
> Thread 7 (Thread 0x7fa601a30700 (LWP 16218)):
> #0  0x7fa60cc55225 in pthread_spin_lock () from
> /lib64/libpthread.so.0  << Does it always hang there?
>
> -
>
> >>It does always hang here.
>
> -
> #1  0x7fa60e1f373e in iobref_unref (iobref=0x19dc7e0) at iobuf.c:907
> #2  0x7fa60e246fb2 in args_wipe (args=0x19e70ec) at
> default-args.c:1593
> #3  0x7fa60e1ea534 in call_stub_wipe_args (stub=0x19e709c) at
> call-stub.c:2466
> #4  0x7fa60e1ea5de in call_stub_destroy (stub=0x19e709c) at
> call-stub.c:2482
>
> Is this on top of master branch? It seems like we missed an unlock of the
> spin-lock or the iobref has junk value which gives the feeling that it is
> in locked state (May be double free?). Do you have any extra patches you
> have in your repo which make changes in iobuf?
>
> --
>
> >>I have implemented a method to reduce memcpy in libgfapi (My patch is on
> top of master branch), by making use of buffer from iobuf pool and passing
> the buffer to application. However, I have not made any changes in iobuf
> core feature. I don’t  think double free is happening anywhere in the code
> (I did check this using logs)
>
>
>
>  Method that I have implemented:
>
> 1)  Application asks for a buffer of specific size, and the buffer is
> allocated from the iobuf pool.
>
> 2)  Buffer is passed on to application, and the application writes
> the data into that buffer.
>
> 3)  Buffer with data in it is passed from application to libgfapi and
> the underlying translators (no memcpy in glfs_write)
>
>
>
> I have couple of questions, and observations:
>
>
>
> Observations:
>
> --
>
> 1)  For every write if I get a fresh buffer then I don’t see any
> problem. All the writes are going through.
>
> 2)  If I try to make use of buffer for consecutive writes, then I am
> seeing the hang in flush.
>
>
>
> Question1: Is it fine if I reuse the buffer for consecutive writes??
>
> Question2: Is it always ensured that the data is written to the file when
> I get a response from syncop_writev.
>

Will it be possible to share the patch on master and a test program which
can recreate this issue?
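
For reference, a rough sketch of the flow described in steps 1)-3) above;
glfs_buf_alloc(), glfs_write_buf() and glfs_buf_free() are placeholder names
for the proposed zero-copy calls and are not existing gfapi APIs:

#include <string.h>
#include <sys/types.h>
#include <glusterfs/api/glfs.h>   /* installed gfapi header; path may vary */

/* Placeholder prototypes for the proposed zero-copy APIs discussed in this
 * thread.  They are NOT part of upstream gfapi; the real patch may use
 * different names and signatures. */
void   *glfs_buf_alloc (glfs_t *fs, size_t size);
ssize_t glfs_write_buf (glfs_fd_t *fd, void *buf, size_t size, int flags);
void    glfs_buf_free  (glfs_t *fs, void *buf);

/* Rough sketch of the flow described in steps 1)-3) above. */
static int
write_one_chunk(glfs_t *fs, glfs_fd_t *fd, const char *src, size_t len)
{
        /* 1) ask libgfapi for a buffer carved out of the iobuf pool */
        void *buf = glfs_buf_alloc(fs, len);
        if (buf == NULL)
                return -1;

        /* 2) the application fills that buffer directly */
        memcpy(buf, src, len);

        /* 3) the same buffer is handed down the stack, so there is no
         *    extra memcpy inside the write path */
        int ret = (glfs_write_buf(fd, buf, len, 0) < 0) ? -1 : 0;

        /* Open question in this thread: can `buf` be reused for the next
         * write, or must a fresh buffer be requested per write?  (Reusing
         * it is the case that ends up hanging in flush.) */
        glfs_buf_free(fs, buf);
        return ret;
}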


>
>
> Thanks,
>
> Sachin Pandit.
>
>
>
> --
>
>
>
> On Tue, Jun 21, 2016 at 4:07 AM, Sachin Pandit <span...@commvault.com>
> wrote:
>
> Hi all,
>
>
>
> I bid adieu to you all with the hope of crossing paths again, and the time
> has come rather quickly. It feels great to work on GlusterFS again.
>
>
>
> Currently we are trying to write data backed up by Commvault Simpana to
> glusterfs volume (Disperse volume). To improve the performance, I have
> implemented the proposal put forward by Rafi K C [1]. I have some
> questions regarding libgfapi and iobuf pool.
>
>
>
> To reduce an extra level of copy in glfs read and write, I have
> implemented few APIs to request a buffer (similar to the one represented in
>  [1]) from iobuf pool which can be used by the application to write data
> to. With this implementation, when I try to reuse the buffer for
> consecutive writes, I could see a hang in syncop_flush of glfs_close (BT of
> the hang can be found in [2]). I wanted to know if reusing the buffer is
> recommended. If not, do we need to request buffer for each writes?
>
>
>
> Setup : Distributed-Disperse ( 4 * (2+1)). Bricks scattered over 3 nodes.
>
>
>
> [1]
> http://www.gluster.org/pipermail/gluster-devel/2015-February/043966.html
>
> [2] Attached file -  bt.txt
>
>
>
> Thanks & Regards,
>
> Sachin Pandit.
>
>
>

Re: [Gluster-devel] Reduce memcpy in glfs read and write

2016-06-21 Thread Pranith Kumar Karampuri
Hey!!
   Hope you are doing good. I took a look at the bt. So when flush
comes, write-behind has to flush all the writes down. I see the following
frame hung in iobref_unref:
Thread 7 (Thread 0x7fa601a30700 (LWP 16218)):
#0  0x7fa60cc55225 in pthread_spin_lock () from /lib64/libpthread.so.0
<< Does it always hang there?
#1  0x7fa60e1f373e in iobref_unref (iobref=0x19dc7e0) at iobuf.c:907
#2  0x7fa60e246fb2 in args_wipe (args=0x19e70ec) at default-args.c:1593
#3  0x7fa60e1ea534 in call_stub_wipe_args (stub=0x19e709c) at
call-stub.c:2466
#4  0x7fa60e1ea5de in call_stub_destroy (stub=0x19e709c) at
call-stub.c:2482

Is this on top of the master branch? It seems like we missed an unlock of the
spin-lock, or the iobref has a junk value which gives the feeling that it is
in a locked state (maybe a double free?). Do you have any extra patches in
your repo which make changes in iobuf?

On Tue, Jun 21, 2016 at 4:07 AM, Sachin Pandit 
wrote:

> Hi all,
>
>
>
> I bid adieu to you all with the hope of crossing paths again, and the time
> has come rather quickly. It feels great to work on GlusterFS again.
>
>
>
> Currently we are trying to write data backed up by Commvault Simpana to
> glusterfs volume (Disperse volume). To improve the performance, I have
> implemented the proposal put forward by Rafi K C [1]. I have some
> questions regarding libgfapi and iobuf pool.
>
>
>
> To reduce an extra level of copy in glfs read and write, I have
> implemented few APIs to request a buffer (similar to the one represented in
>  [1]) from iobuf pool which can be used by the application to write data
> to. With this implementation, when I try to reuse the buffer for
> consecutive writes, I could see a hang in syncop_flush of glfs_close (BT of
> the hang can be found in [2]). I wanted to know if reusing the buffer is
> recommended. If not, do we need to request buffer for each writes?
>
>
>
> Setup : Distributed-Disperse ( 4 * (2+1)). Bricks scattered over 3 nodes.
>
>
>
> [1]
> http://www.gluster.org/pipermail/gluster-devel/2015-February/043966.html
>
> [2] Attached file -  bt.txt
>
>
>
> Thanks & Regards,
>
> Sachin Pandit.
>
>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] performance issues Manoj found in EC testing

2016-06-23 Thread Pranith Kumar Karampuri
hi Xavi,
  Meet Manoj from the performance team at Red Hat. He has been testing EC
performance in his stretch clusters. He found some interesting things we
would like to share with you.

1) When we perform multiple streams of big-file writes (12 parallel dds, I
think) he found one thread to be always hot (99% CPU). He was asking me
if the fuse_reader thread does any extra processing in EC compared to
replicate. Initially I thought it would just take the lock and the epoll
threads would perform the encoding, but later realized that once we have
the lock and version details, subsequent writes on the file are encoded in
the same thread that comes to EC. Write-behind could play a role and make
the writes come to EC in an epoll thread, but we consistently saw just one
hot thread, not multiple threads. We will be able to confirm this in
tomorrow's testing.

2) One more thing Raghavendra G found is that our current implementation of
epoll doesn't let other epoll threads pick messages from a socket while one
thread is processing a message from that socket. In EC's case that can be
the encoding of a write or decoding of a read. This does not let replies of
operations on different files be processed in parallel. He thinks this can
be fixed for 3.9.

Manoj will be raising a bug to gather all his findings. I just wanted to
introduce him and let you know the interesting things he is finding before
you see the bug :-).
-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Gluster Solution for Non Shared Persistent Storage in Docker Container

2016-06-23 Thread Pranith Kumar Karampuri
In case you missed the post on Gluster twitter/facebook,

https://pkalever.wordpress.com/2016/06/23/gluster-solution-for-non-shared-persistent-storage-in-docker-container/

We would love to hear your feedback on this.

-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Need help from FreeBSD developers

2016-06-24 Thread Pranith Kumar Karampuri
hi,
Based on the debugging done by Niels on the bug
https://bugzilla.redhat.com/show_bug.cgi?id=1181500#c5, we need a
confirmation about what listxattr returns on FreeBSD. Could someone please
help?

-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] NetBSD tests not running to completion.

2016-01-08 Thread Pranith Kumar Karampuri



On 01/08/2016 08:50 PM, Emmanuel Dreyfus wrote:

On Fri, Jan 08, 2016 at 08:37:16PM +0530, Pranith Kumar Karampuri wrote:

 NetBSD)
 vnd=`vnconfig -l | \
  awk '!/not in use/{printf("%s%s:%d ", $1, $2, $5);}'`

Can there be Loopback devices that are in use when this piece of the code is
executed, which can lead to the problems we ran into? I may be completely
wrong. It is a wild guess about something I don't completely understand.

This lists loopback devices in use. For instance:
vnd0:/d:180225 vnd1:/d:180226 vnd2:/d:180227

Next step is to look for loopback devices which backing store are in $B0
and unconfigure them.
Oops, I misread the code. Is it possible to have loopback devices that are
not in use, which we miss out on destroying? Could be a stupid question,
but still asking.


Pranith





___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] NetBSD tests not running to completion.

2016-01-10 Thread Pranith Kumar Karampuri



On 01/10/2016 02:04 PM, Pranith Kumar Karampuri wrote:



On 01/10/2016 11:08 AM, Emmanuel Dreyfus wrote:

Pranith Kumar Karampuri <pkara...@redhat.com> wrote:


tests/basic/afr/arbiter-statfs.t

I posted patches to fix this one (but it seems Jenkins is down? No
regression is running)


tests/basic/afr/self-heal.t
It seems like in this run, self-heal.t and quota.t are running at the 
same time. Not sure why that can happen. So for now not going to 
investigate more.
[2016-01-08 07:58:55.6N]:++ 
G_LOG:./tests/basic/afr/self-heal.t: TEST: 88 88 test -d 
/d/backends/brick0/file ++
[2016-01-08 07:58:55.6N]:++ 
G_LOG:./tests/basic/afr/self-heal.t: TEST: 89 89 diff /dev/fd/63 
/dev/fd/62 ++
[2016-01-08 07:58:55.6N]:++ G_LOG:./tests/basic/quota.t: TEST: 
124 124 gluster --mode=script --wignore volume quota patchy 
limit-usage /addbricktest/dir8 100MB ++
[2016-01-08 07:58:55.6N]:++ 
G_LOG:./tests/basic/afr/self-heal.t: TEST: 92 92 rm -rf 
/mnt/glusterfs/0/addbricktest ++
[2016-01-08 07:58:55.6N]:++ G_LOG:./tests/basic/quota.t: TEST: 
124 124 gluster --mode=script --wignore volume quota patchy 
limit-usage /addbricktest/dir9 100MB ++



tests/basic/afr/entry-self-heal.t
This seems to have a bit of history. We have more data points that it
keeps failing once in a while, considering that Michael posted a patch:
http://review.gluster.org/12938


I tried to look into 3 instances of this failure:
1) 
https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/12574/consoleFull


same issue as above, two tests are running in parallel.
[2015-12-10 07:03:52.6N]:++ 
G_LOG:./tests/basic/afr/arbiter-statfs.t: TEST: 27 27 gluster 
--mode=script --wignore volume start patchy ++
[2015-12-10 07:03:06.6N]:++ 
G_LOG:./tests/basic/glusterd/heald.t: TEST: 58 58 [0-9][0-9]* 
get_shd_process_pid ++

[2015-12-10 07:03:58.047476]  : volume start patchy : SUCCESS
[2015-12-10 07:03:58.6N]:++ 
G_LOG:./tests/basic/afr/arbiter-statfs.t: TEST: 28 28 glusterfs 
--volfile-server=nbslave74.cloud.gluster.org --volfile-id=patchy 
/mnt/glusterfs/0 ++


2) 
https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/12569/consoleFull


same issue, self-heald.t and entry-self-heal.t are executing in parallel:
[2015-12-10 05:00:05.6N]:++ 
G_LOG:./tests/basic/afr/entry-self-heal.t: TEST: 167 167 1 
afr_child_up_status patchy 0 ++
[2015-12-10 05:00:07.6N]:++ 
G_LOG:./tests/basic/afr/self-heald.t: TEST: 30 30 1 
afr_child_up_status_in_shd patchy 4 ++
[2015-12-10 05:00:08.401698] I [rpc-clnt.c:1834:rpc_clnt_reconfig] 
0-patchy-client-0: changing port to 49152 (from 0)
[2015-12-10 05:00:08.403526] I [MSGID: 114057] 
[client-handshake.c:1421:select_server_supported_programs] 
0-patchy-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)


3) 
https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/13285/consoleFull


Looks same again: quota.t and entry-self-heal.t are executing at the 
same time.


[2016-01-08 07:58:07.6N]:++ G_LOG:./tests/basic/quota.t: TEST: 
75 75 8.0MB quotausage /test_dir ++
[2016-01-08 07:58:08.294126] I [MSGID: 108006] 
[afr-common.c:4136:afr_local_init] 0-patchy-replicate-0: no subvolumes up
[2016-01-08 07:58:08.6N]:++ 
G_LOG:./tests/basic/afr/entry-self-heal.t: TEST: 280 280 rm -rf 
/d/backends/patchy0/.glusterfs/indices/xattrop/29bc252c-3f32-4e3e-b3a9-31478c04bb7f 
/d/backends/patchy0/.glusterfs/indices/xattrop/50adf186-8323-4f01-98fb-5621b8d9edee 
/d/backends/patchy0/.glusterfs/indices/xattrop/690c83b4-3e17-4558-a025-d08775742814 
/d/backends/patchy0/.glusterfs/indices/xattrop/952e518c-aaa3-4697-a2a7-a25c906635bc 
/d/backends/patchy0/.glusterfs/indices/xattrop/be2d1bee-a81c-4c63-8fcc-f06f0fc40e9b 
/d/backends/patchy0/.glusterfs/indices/xattrop/dfa00115-a11a-4b6e-93cd-b03e02ac8727 
/d/backends/patchy0/.glusterfs/indices/xattrop/e25b0f17-aac0-4f0f-b2d5-23f3a6493c0d 
/d/backends/patchy0/.glusterfs/indices/xattrop/fb2b4f42-fe9f-48dc-a8d3-1c4419166bf0 
/d/backends/patchy0/.glusterfs/indices/xattrop/fc89b498-fb47-4218-8304-693bbdc6bfc6 
/d/backends/patchy0/.glusterfs/indices/xattrop/xattrop-10fe0390-68cf-42f6-9838-ca243fe26635 
/d/backends/patchy0/.glusterfs/indices/xattrop/xattrop-f4d7f633-fec7-4cbc-829b-5e54c66f60b1 
/d/backends/patchy1/.glusterfs/indices/xattrop/1ed0b466-4f82-4e89-8aa0-d33f3cbec8bf 
/d/backends/patchy1/.glusterfs/indices/xattrop/29bc252c-3f32-4e3e-b3a9-31478c04bb7f 
/d/backends/patchy1/.glusterfs/indices/xattrop/338a302d-8e5a-4276-966d-3479aa3051ed 
/d/backends/patchy1/.glusterfs/indices/xattrop/4291d3cb-7c96-41d9-8cb7-25360398590b 
/d/backends/patchy1/.glusterfs/indices/xattrop/48f788c0-48b1-4072-97aa-e136c97c1d88 
/d/backends/patchy1/.glusterfs/indices/xattrop/50adf186-8323-4f01-98fb-5621b8d9edee 
/d/backends/patchy1/.glusterfs/indices/xattrop/592847dd-2592-4fab-bc6a-25a771b89e

Re: [Gluster-devel] NetBSD tests not running to completion.

2016-01-10 Thread Pranith Kumar Karampuri



On 01/10/2016 11:08 AM, Emmanuel Dreyfus wrote:

Pranith Kumar Karampuri <pkara...@redhat.com> wrote:


tests/basic/afr/arbiter-statfs.t

I posted patches to fix this one (but it seems Jenkins is down? No
regression is running)


tests/basic/afr/self-heal.t
It seems like in this run, self-heal.t and quota.t are running at the 
same time. Not sure why that can happen. So for now not going to 
investigate more.
[2016-01-08 07:58:55.6N]:++ G_LOG:./tests/basic/afr/self-heal.t: 
TEST: 88 88 test -d /d/backends/brick0/file ++
[2016-01-08 07:58:55.6N]:++ G_LOG:./tests/basic/afr/self-heal.t: 
TEST: 89 89 diff /dev/fd/63 /dev/fd/62 ++
[2016-01-08 07:58:55.6N]:++ G_LOG:./tests/basic/quota.t: TEST: 
124 124 gluster --mode=script --wignore volume quota patchy limit-usage 
/addbricktest/dir8 100MB ++
[2016-01-08 07:58:55.6N]:++ G_LOG:./tests/basic/afr/self-heal.t: 
TEST: 92 92 rm -rf /mnt/glusterfs/0/addbricktest ++
[2016-01-08 07:58:55.6N]:++ G_LOG:./tests/basic/quota.t: TEST: 
124 124 gluster --mode=script --wignore volume quota patchy limit-usage 
/addbricktest/dir9 100MB ++



tests/basic/afr/entry-self-heal.t
This seems to have a bit of history. We have more data points that it keeps
failing once in a while, considering that Michael posted a patch:
http://review.gluster.org/12938


Will be looking into this more now.

Those two are still to be investigated, and it seems
tests/basic/afr/split-brain-resolution.t is now reliably broken as
well.

Will take a look at this today after entry-self-heal.t

Pranith



tests/basic/quota-nfs.t

That one is marked as bad test and should not cause harm on spurious
failure as its result is ignored.

I am trying to reproduce a spurious VM reboot during tests by looping on
the whole test suite on nbslave70, with reboot on panic disabled (it
will drop into kernel debugger instead). No result so far.



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Linux regression tests are hanging too

2016-01-19 Thread Pranith Kumar Karampuri

Result: PASS
Build timed out (after 300 minutes). Marking the build as failed.
Build was aborted
Finished: FAILURE

https://build.gluster.org/job/rackspace-regression-2GB-triggered/17664/console

Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] performance issues Manoj found in EC testing

2016-06-28 Thread Pranith Kumar Karampuri
On Tue, Jun 28, 2016 at 10:21 AM, Poornima Gurusiddaiah <pguru...@redhat.com
> wrote:

> Regards,
> Poornima
>
> --
>
> *From: *"Pranith Kumar Karampuri" <pkara...@redhat.com>
> *To: *"Xavier Hernandez" <xhernan...@datalab.es>
> *Cc: *"Gluster Devel" <gluster-devel@gluster.org>
> *Sent: *Monday, June 27, 2016 5:48:24 PM
> *Subject: *Re: [Gluster-devel] performance issues Manoj found in EC
> testing
>
>
>
> On Mon, Jun 27, 2016 at 12:42 PM, Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>>
>>
>> On Mon, Jun 27, 2016 at 11:52 AM, Xavier Hernandez <xhernan...@datalab.es
>> > wrote:
>>
>>> Hi Manoj,
>>>
>>> I always enable client-io-threads option for disperse volumes. It
>>> improves performance sensibly, most probably because of the problem you
>>> have detected.
>>>
>>> I don't see any other way to solve that problem.
>>>
>>
>> I agree. Updated the bug with same info.
>>
>>>
>>> I think it would be a lot better to have a true thread pool (and maybe
>>> an I/O thread pool shared by fuse, client and server xlators) in
>>> libglusterfs instead of the io-threads xlator. This would allow each xlator
>>> to decide when and what should be parallelized in a more intelligent way,
>>> since basing the decision solely on the fop type seems too simplistic to me.
>>>
>>> In the specific case of EC, there are a lot of operations to perform for
>>> a single high level fop, and not all of them require the same priority.
>>> Also some of them could be executed in parallel instead of sequentially.
>>>
>>
>> I think it is high time we actually schedule(for which release) to get
>> this in gluster. May be you should send out a doc where we can work out
>> details? I will be happy to explore options to integrate io-threads,
>> syncop/barrier with this infra based on the design may be.
>>
>
> I was just thinking why we can't reuse synctask framework. It already
> scales up/down based on the tasks. At max it uses 16 threads. Whatever we
> want to be executed in parallel we can create a synctask around it and run
> it. Would that be good enough?
>
> Yes, synctask framework can be preferred over io-threads, else it would
> mean 16 synctask threads + 16(?) io-threads for one instance of mount, this
> will blow out the gfapi clients if they have many mounts from the same
> process. Also using synctask would mean code changes in EC?
>

Yes, it will need some changes, but I don't think they are big changes. I
think the functions to decode/encode already exist. We just need to move
encoding/decoding into tasks and run them as synctasks.

Xavi,
  A long time back we chatted a bit about the synctask code and you wanted
the scheduling to happen by the kernel or something. Apart from that, do you
see any other issues? At least if the tasks are synchronous, i.e. nothing
goes out on the wire, task scheduling = thread scheduling by the kernel and
it works exactly like the thread-pool you were referring to. It does
multi-tasking only if the tasks are asynchronous in nature.


>
>
>>> Xavi
>>>
>>>
>>> On 25/06/16 19:42, Manoj Pillai wrote:
>>>
>>>>
>>>> - Original Message -
>>>>
>>>>> From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
>>>>> To: "Xavier Hernandez" <xhernan...@datalab.es>
>>>>> Cc: "Manoj Pillai" <mpil...@redhat.com>, "Gluster Devel" <
>>>>> gluster-devel@gluster.org>
>>>>> Sent: Thursday, June 23, 2016 8:50:44 PM
>>>>> Subject: performance issues Manoj found in EC testing
>>>>>
>>>>> hi Xavi,
>>>>>   Meet Manoj from performance team Redhat. He has been testing
>>>>> EC
>>>>> performance in his stretch clusters. He found some interesting things
>>>>> we
>>>>> would like to share with you.
>>>>>
>>>>> 1) When we perform multiple streams of big file writes(12 parallel dds
>>>>> I
>>>>> think) he found one thread to be always hot (99%CPU always). He was
>>>>> asking
>>>>> me if fuse_reader thread does any extra processing in EC compared to
>>>>> replicate. Initially I thought it would just lock and epoll threads
>>>>> will
>>>>> perform the encoding but later realized that once we have the lock and
>>>>> 

Re: [Gluster-devel] performance issues Manoj found in EC testing

2016-06-28 Thread Pranith Kumar Karampuri
>
>> Yes, it will need some changes, but I don't think they are big changes. I
>> think the functions to decode/encode already exist. We just need to
>> move encoding/decoding into tasks and run them as synctasks.
>>
>
> I was also thinking in sleeping fops. Currently when they are resumed,
> they are processed in the same thread that was processing another fop. This
> could add latencies to fops or unnecessary delays in lock management. If
> they can be scheduled to be executed by another thread, these delays are
> drastically reduced.
>
> On the other hand, splitting the computation of EC encoding into multiple
> threads is bad because current implementation takes advantage of internal
> CPU memory cache, which is really fast. We should compute all fragments of
> a single request in the same thread. Multiple independent computations
> could be executed by different threads.
>
>
>> Xavi,
>>   Long time back we chatted a bit about synctask code and you wanted
>> the scheduling to happen by kernel or something. Apart from that do you
>> see any other issues? At least if the tasks are synchronous i.e. nothing
>> goes out the wire, task scheduling = thread scheduling by kernel and it
>> works exactly like thread-pool you were referring to. It does
>> multi-tasking only if the tasks are asynchronous in nature.
>>
>
> How would this work ? should we have to create a new synctask for each
> background function we want to execute ? I think this has an important
> overhead, since each synctask requires its own stack, creates a frame that
> we don't really need in most cases, and it causes context switches.
>

Yes, we will have to create a synctask. Yes, it does have the overhead of its
own stack because it assumes the task will pause at some point. I think when
the synctask framework was written, the smallest thing that would be executed
was a fop over the network. It was mainly written to do replace-brick using
the 'pump' xlator, which is now deprecated. If we know upfront that the task
will never pause, there is absolutely no need to create a new stack; in that
case it just executes the function and moves on to the next task.


>
> We could have hundreds or thousands of requests per second. they could
> even require more than one background task for each request in some cases.
> I'm not sure if synctasks are the right choice in this case.
>

For each request we need to create a new synctask. It will be placed among the
tasks that are ready to execute. There will be 16 threads (in the stressful
scenario) waiting for new tasks; one of them will pick it up and execute it.



>
> I think that a thread pool is more lightweight.
>

I think a small write-up of your thoughts on how it should be would be a
good start for us.

In my head a thread-pool is a set of threads waiting for incoming tasks.
Each thread picks up a new task and executes it; upon completion it moves
on to the next task that needs to be executed.

The synctask framework is also a thread-pool waiting for incoming tasks. Each
thread picks up a task from the ready queue and executes it. If the task has
to pause in the middle, the thread puts it in the wait queue and moves on to
the next one. If the task never pauses, the thread completes the task and
moves on to the next task.

So synctask is more complex than a plain thread-pool because it assumes the
tasks will pause. I am wondering if we can 1) break the complexity into a
thread-pool, which is more light-weight, and add the synctask framework on
top of it, or alternatively 2) optimize the synctask framework to perform
synchronous tasks without any stack creation and execute them on the thread's
own stack.
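
To make the comparison concrete, here is a minimal thread-pool sketch (plain
pthreads, invented names, not gluster code): workers block on a condition
variable, pop tasks off a queue and run each one to completion, with no
per-task stack and no pausing.

#include <pthread.h>
#include <stdlib.h>

struct task {
        void        (*fn)(void *);
        void         *arg;
        struct task  *next;
};

static struct task     *queue_head;                 /* LIFO for brevity */
static pthread_mutex_t  queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   queue_cond = PTHREAD_COND_INITIALIZER;

/* Producers (e.g. one whole EC encode request per task) enqueue work. */
void
pool_submit(void (*fn)(void *), void *arg)
{
        struct task *t = calloc(1, sizeof(*t));
        if (t == NULL)
                return;   /* error handling omitted in this sketch */
        t->fn  = fn;
        t->arg = arg;

        pthread_mutex_lock(&queue_lock);
        t->next    = queue_head;
        queue_head = t;
        pthread_mutex_unlock(&queue_lock);
        pthread_cond_signal(&queue_cond);
}

/* Workers run each task to completion on their own stack; nothing here
 * ever pauses, so no per-task stack or context switching is needed. */
static void *
pool_worker(void *data)
{
        (void) data;
        for (;;) {
                pthread_mutex_lock(&queue_lock);
                while (queue_head == NULL)
                        pthread_cond_wait(&queue_cond, &queue_lock);
                struct task *t = queue_head;
                queue_head = t->next;
                pthread_mutex_unlock(&queue_lock);

                t->fn(t->arg);
                free(t);
        }
        return NULL;
}

void
pool_start(int nthreads)
{
        pthread_t tid;
        for (int i = 0; i < nthreads; i++)
                pthread_create(&tid, NULL, pool_worker, NULL);
}

pool_start(16) would give the 16 always-ready workers mentioned above, and
submitting one whole request per task (rather than splitting its fragments
across workers) keeps each encode within a single thread, in line with the
CPU-cache point raised earlier in the thread.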


>
> Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [FAILED] NetBSD-regression for ./tests/basic/afr/self-heald.t

2016-02-08 Thread Pranith Kumar Karampuri



On 02/08/2016 04:16 PM, Ravishankar N wrote:

[Removing Milind, adding Pranith]

On 02/08/2016 04:09 PM, Emmanuel Dreyfus wrote:

On Mon, Feb 08, 2016 at 04:05:44PM +0530, Ravishankar N wrote:
The patch to add it to bad tests has already been merged, so I guess 
this

.t's failure won't pop up again.

IMo that was a bit too quick.
I guess Pranith merged it because of last week's complaint for the 
same .t and not wanting to block other patches from being merged.


Yes, two people came to my desk and said their patches are blocked 
because of this. So had to merge until we figure out the problem.


Pranith

  What is the procedure to get out of the
list?

Usually, you just fix the problem with the testcase and send a patch 
with the fix and removing it from bad_tests. (For example 
http://review.gluster.org/13233)




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [FAILED] NetBSD-regression for ./tests/basic/afr/self-heald.t

2016-02-08 Thread Pranith Kumar Karampuri



On 02/08/2016 04:22 PM, Pranith Kumar Karampuri wrote:



On 02/08/2016 04:16 PM, Ravishankar N wrote:

[Removing Milind, adding Pranith]

On 02/08/2016 04:09 PM, Emmanuel Dreyfus wrote:

On Mon, Feb 08, 2016 at 04:05:44PM +0530, Ravishankar N wrote:
The patch to add it to bad tests has already been merged, so I 
guess this

.t's failure won't pop up again.

IMo that was a bit too quick.
I guess Pranith merged it because of last week's complaint for the 
same .t and not wanting to block other patches from being merged.


Yes, two people came to my desk and said their patches are blocked 
because of this. So had to merge until we figure out the problem.


Patch is from last week though.

Pranith


Pranith

  What is the procedure to get out of the
list?

Usually, you just fix the problem with the testcase and send a patch 
with the fix and removing it from bad_tests. (For example 
http://review.gluster.org/13233)




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-infra] Code-Review+2 and Verified+1 cause multiple retriggers on Jenkins

2016-02-04 Thread Pranith Kumar Karampuri



On 02/04/2016 03:39 PM, Kaushal M wrote:

I'm okay with this.

+1



On Thu, Feb 4, 2016 at 3:34 PM, Raghavendra Talur  wrote:

Hi,

We recently changed the jenkins builds to be triggered on the following
triggers.

1. Verified+1
2. Code-review+2
3. recheck (netbsd|centos|smoke)

There is a bug in 1 and 2.

Multiple triggers of 1 or 2 would result in re-runs even when not intended.

I would like to replace 1 and 2 with a comment "run-all-regression" or
something like that.
Thoughts?


Thanks
Raghavendra Talur


___
Gluster-infra mailing list
gluster-in...@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-infra

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Cores on NetBSD of brick https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/14100/consoleFull

2016-02-08 Thread Pranith Kumar Karampuri



On 02/08/2016 08:20 PM, Emmanuel Dreyfus wrote:

On Mon, Feb 08, 2016 at 07:27:46PM +0530, Pranith Kumar Karampuri wrote:

   I don't see any logs in the archive. Did we change something?

I think they are in a different tarball, in /archives/logs
I think the regression run is not giving that link anymore when the
crash happens? Could you please add that also as a link in the regression run?


Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Cores on NetBSD of brick https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/14100/consoleFull

2016-02-08 Thread Pranith Kumar Karampuri

Emmanuel,
  I don't see any logs in the archive. Did we change something?

Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [FAILED] NetBSD-regression for ./tests/basic/afr/self-heald.t

2016-02-08 Thread Pranith Kumar Karampuri



On 02/08/2016 05:04 PM, Michael Scherer wrote:

Le lundi 08 février 2016 à 16:22 +0530, Pranith Kumar Karampuri a
écrit :

On 02/08/2016 04:16 PM, Ravishankar N wrote:

[Removing Milind, adding Pranith]

On 02/08/2016 04:09 PM, Emmanuel Dreyfus wrote:

On Mon, Feb 08, 2016 at 04:05:44PM +0530, Ravishankar N wrote:

The patch to add it to bad tests has already been merged, so I guess
this
.t's failure won't pop up again.

IMo that was a bit too quick.

I guess Pranith merged it because of last week's complaint for the
same .t and not wanting to block other patches from being merged.

Yes, two people came to my desk and said their patches are blocked
because of this. So had to merge until we figure out the problem.

I suspect it would be better if people did use the list rather than
going to the desk, as it would help others who are either absent, in
another office or even not working in the same company be aware of the
issue.

next time this happens, can you direct people to gluster-devel?

Will do :-).

Pranith




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Cores on NetBSD of brick https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/14100/consoleFull

2016-02-09 Thread Pranith Kumar Karampuri



On 02/09/2016 04:13 PM, Emmanuel Dreyfus wrote:

On Tue, Feb 09, 2016 at 11:56:37AM +0530, Pranith Kumar Karampuri wrote:

I think the regression run is not giving that link anymore when the crash
happens? Could you please add that also as a link in regression run?

There was the path of the archive; I changed it to an http:// link

Oops, sorry, it is there.

Pranith




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Google Summer of Code Application opens Feb 8th - Call For Mentors

2016-02-04 Thread Pranith Kumar Karampuri



On 02/04/2016 08:48 PM, Kaushal M wrote:

I'm still up to mentor the sub-directory mount support idea.

this one? http://review.gluster.org/10186

Pranith


On Thu, Feb 4, 2016 at 2:38 PM, Amye Scavarda  wrote:

Hi all,
Google Summer of Code opens up their organization application on Feb 8th,
and I wanted to reach out and check to see if there is interest within the
Gluster community to participate this year.

We'd need to have an application in place by Feb 19th, with a list of
possible ideas, as well as dedicated mentors.
https://gluster.readthedocs.org/en/latest/Developer-guide/Projects/ is a
current list of ideas, but not all of these have a mentor/owner.

If you're interested, please reach out to me by Feb 10th before our next
Gluster Community Meeting and we can discuss next steps at that meeting.


Thanks!
-- amye

--
Amye Scavarda | a...@redhat.com | Gluster Community Lead

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] crash in dht in https://build.gluster.org/job/rackspace-regression-2GB-triggered/18134/consoleFull

2016-02-09 Thread Pranith Kumar Karampuri

hi,
   I see the following crash. Is this a known issue?
(gdb) bt
#0  0x7f3f8c339fb4 in dht_selfheal_dir_setattr 
(frame=0x7f3f6c002a0c, loc=0x7f3f6c000944, stbuf=0x7f3f6c0009d4, 
valid=16777215,
layout=0x7f3f6c004140) at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/cluster/dht/src/dht-selfheal.c:1087
#1  0x7f3f8c33a4f8 in dht_selfheal_dir_mkdir_cbk 
(frame=0x7f3f6c002a0c, cookie=0x7f3f9004201c, this=0x7f3f8803bad0, 
op_ret=-1, op_errno=5,

inode=0x0, stbuf=0x0, preparent=0x0, postparent=0x0, xdata=0x0)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/cluster/dht/src/dht-selfheal.c:1147
#2  0x7f3f8fb81f31 in dht_mkdir (frame=0x7f3f9004201c, 
this=0x7f3f8803aa80, loc=0x7f3f6c000944, mode=16877, umask=0, 
params=0x7f3f9003594c)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/cluster/dht/src/dht-common.c:6710
#3  0x7f3f8c33ad1f in dht_selfheal_dir_mkdir (frame=0x7f3f6c002a0c, 
loc=0x7f3f6c000944, layout=0x7f3f6c004140, force=0)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/cluster/dht/src/dht-selfheal.c:1256
#4  0x7f3f8c33c4c8 in dht_selfheal_directory (frame=0x7f3f6c002a0c, 
dir_cbk=0x7f3f8c349588 , loc=0x7f3f6c000944,
layout=0x7f3f6c004140) at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/cluster/dht/src/dht-selfheal.c:1837
#5  0x7f3f8c34c0ac in dht_lookup_dir_cbk (frame=0x7f3f6c002a0c, 
cookie=0x7f3f6c0064bc, this=0x7f3f8803bad0, op_ret=-1, op_errno=2, 
inode=0x0,

stbuf=0x7f3f6c0069d4, xattr=0x0, postparent=0x7f3f6c006c04)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/cluster/dht/src/dht-common.c:724
#6  0x7f3f8fb5ea98 in dht_lookup_dir_cbk (frame=0x7f3f6c0064bc, 
cookie=0x7f3f6c0084dc, this=0x7f3f8803aa80, op_ret=-1, op_errno=2,
inode=0x7f3f6c0013ec, stbuf=0x7f3f94bc2860, xattr=0x0, 
postparent=0x7f3f94bc27f0)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/cluster/dht/src/dht-common.c:714
#7  0x7f3f8fddb23a in client3_3_lookup_cbk (req=0x7f3f6c0090ac, 
iov=0x7f3f6c0090ec, count=1, myframe=0x7f3f6c0084dc)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/protocol/client/src/client-rpc-fops.c:3028
#8  0x7f3fa0c2d42a in rpc_clnt_handle_reply (clnt=0x7f3f8806cdb0, 
pollin=0x7f3f90041c90)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/rpc/rpc-lib/src/rpc-clnt.c:759
#9  0x7f3fa0c2d8c8 in rpc_clnt_notify (trans=0x7f3f8806d240, 
mydata=0x7f3f8806cde0, event=RPC_TRANSPORT_MSG_RECEIVED, 
data=0x7f3f90041c90)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/rpc/rpc-lib/src/rpc-clnt.c:900
#10 0x7f3fa0c29b5a in rpc_transport_notify (this=0x7f3f8806d240, 
event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f3f90041c90)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/rpc/rpc-lib/src/rpc-transport.c:541

#11 0x7f3f961eadcb in socket_event_poll_in (this=0x7f3f8806d240)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/rpc/rpc-transport/socket/src/socket.c:2231
#12 0x7f3f961eb321 in socket_event_handler (fd=18, idx=12, 
data=0x7f3f8806d240, poll_in=1, poll_out=0, poll_err=0)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/rpc/rpc-transport/socket/src/socket.c:2344
#13 0x7f3fa0ec61a8 in event_dispatch_epoll_handler 
(event_pool=0x121fce0, event=0x7f3f94bc2e70)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/libglusterfs/src/event-epoll.c:571

#14 0x7f3fa0ec6596 in event_dispatch_epoll_worker (data=0x125e1a0)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/libglusterfs/src/event-epoll.c:674

#15 0x7f3fa0144a51 in start_thread () from ./lib64/libpthread.so.0
#16 0x7f3f9faae93d in clone () from ./lib64/libc.so.6


Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] regarding glusterd crash on NetBSD https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/14100/consoleFull

2016-02-10 Thread Pranith Kumar Karampuri

hi Atin, Kaushal,
  Is this a known issue?

(gdb) #1  0xbb789fb7 in __synclock_unlock (lock=0xbb1d4ac0)
(gdb) at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/libglusterfs/src/syncop.c:1056

#2  0xbb789ffd in synclock_unlock (lock=0xbb1d4ac0)
at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/libglusterfs/src/syncop.c:1071
#3  0xb9b803ff in glusterd_big_locked_notify (rpc=0xb8bc2070, 
mydata=0xb950efa0, event=RPC_CLNT_DISCONNECT, data=0x0,

notify_fn=0xb9b8ec28 <__glusterd_brick_rpc_notify>)
at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/mgmt/glusterd/src/glusterd-handler.c:72
#4  0xb9b8f0d7 in glusterd_brick_rpc_notify (rpc=0xb8bc2070, 
mydata=0xb950efa0, event=RPC_CLNT_DISCONNECT, data=0x0)
at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/mgmt/glusterd/src/glusterd-handler.c:4993
#5  0xbb708c22 in rpc_clnt_notify (trans=0xb8bc8030, mydata=0xb8bc2090, 
event=RPC_TRANSPORT_DISCONNECT, data=0xb8bc8030)
at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-lib/src/rpc-clnt 
at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-lib/src/rpc-clnIllegal 
process-id: 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/libglusterfs/src/syncop.c:1056.

(gdb) #2  0xbb789ffd in synclock_unlock (lock=0xbb1d4ac0)
(gdb) at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/libglusterfs/src/syncop.c:1071
Illegal process-id: 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/libglusterfs/src/syncop.c:1071.
(gdb) #3  0xb9b803ff in glusterd_big_locked_notify (rpc=0xb8bc2070, 
mydata=0xb950efa0, event=RPC_CLNT_DISCONNECT, data=0x0,

(gdb) notify_fn=0xb9b8ec28 <__glusterd_brick_rpc_notify>)
Undefined command: "notify_fn".  Try "help".
(gdb) at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/mgmt/glusterd/src/glusterd-handler.c:72
Illegal process-id: 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/mgmt/glusterd/src/glusterd-handler.c:72.
(gdb) #4  0xb9b8f0d7 in glusterd_brick_rpc_notify (rpc=0xb8bc2070, 
mydata=0xb950efa0, event=RPC_CLNT_DISCONNECT, data=0x0)
(gdb) at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/mgmt/glusterd/src/glusterd-handler.c:4993
Illegal process-id: 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/mgmt/glusterd/src/glusterd-handler.c:4993.
#5  0xbb708c22 in rpc_clnt_notify (trans=0xb8bc8030, mydata=0xb8bc2090, 
event=RPC_TRANSPORT_DISCONNECT, data=0xb8bc8030)
/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-lib/src/rpc-clnt 
at 8030)
at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-lib/src/rpc-transport.c:546

#7  0xbb231847 in socket_event_poll_err (this=0xb8bc8030)
at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-transport/socket/src/socket.c:1151
#8  0xbb2359a2 in socket_event_handler (fd=15, idx=9, data=0xb8bc8030, 
poll_in=1, poll_out=4, poll_err=0)
at /home/jenkins/root/workspace/rackspace-n(gdb) #5  0xbb708c22 in 
rpc_clnt_notify (trans=0xb8bc8030, mydata=0xb8bc2090, 
event=RPC_TRANSPORT_DISCONNECT, data=0xb8bc8030)
(gdb) 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-lib/src/rpc-clnt 
at 8030)

Undefined command: "".  Try "help".
(gdb) at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-lib/src/rpc-transport.c:546
Illegal process-id: 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-lib/src/rpc-transport.c:546.

(gdb) #7  0xbb231847 in socket_event_poll_err (this=0xb8bc8030)
(gdb) at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-transport/socket/src/socket.c:1151
Illegal process-id: 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-transport/socket/src/socket.c:1151.
(gdb) #8  0xbb2359a2 in socket_event_handler (fd=15, idx=9, 
data=0xb8bc8030, poll_in=1, poll_out=4, poll_err=0)
(gdb) at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-transport/socket/src/socket.c:2356
Illegal process-id: 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-transport/socket/src/socket.c:2356.
(gdb) #9  0xbb7a572e in event_dispatch_poll_handler 
(event_pool=0xbb143030, ufds=0xbb18f370, i=9)
(gdb) at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/libglusterfs/src/event-poll.c:393
Illegal process-id: 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/libglusterfs/src/event-poll.c:393.

(gdb) #10 0xbb7a5a75 in event_dispatch_poll (event_pool=0xbb143030)
(gdb) at 
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/libglusterfs/src/event-poll.c:489
Illegal process-id: 

Re: [Gluster-devel] regarding glusterd crash on NetBSD https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/14100/consoleFull

2016-02-10 Thread Pranith Kumar Karampuri



On 02/10/2016 06:01 PM, Atin Mukherjee wrote:

Not that I am aware of. Do you have backtrace of all the threads?


it doesn't seem to give proper output for 'thread apply all bt':
(gdb) thread apply all bt

Thread 6 (process 2):
#0  0xbb354977 in ?? ()
#1  0xbb682b67 in ?? ()
#2  0xba4fff98 in ?? ()
Cannot access memory at address 0xba4fffd4

I tried 'info threads' and went to each thread; all of them except the
thread that crashed have '??' in all frames. You can check the core on
the NetBSD machine mentioned in the console output in the subject.


Pranith


~Atin

On 02/10/2016 05:58 PM, Pranith Kumar Karampuri wrote:

hi Atin, Kaushal,
   Is this a known issue?

(gdb) #1  0xbb789fb7 in __synclock_unlock (lock=0xbb1d4ac0)
(gdb) at
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/libglusterfs/src/syncop.c:1056

#2  0xbb789ffd in synclock_unlock (lock=0xbb1d4ac0)
 at
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/libglusterfs/src/syncop.c:1071

#3  0xb9b803ff in glusterd_big_locked_notify (rpc=0xb8bc2070,
mydata=0xb950efa0, event=RPC_CLNT_DISCONNECT, data=0x0,
 notify_fn=0xb9b8ec28 <__glusterd_brick_rpc_notify>)
 at
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/mgmt/glusterd/src/glusterd-handler.c:72

#4  0xb9b8f0d7 in glusterd_brick_rpc_notify (rpc=0xb8bc2070,
mydata=0xb950efa0, event=RPC_CLNT_DISCONNECT, data=0x0)
 at
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/mgmt/glusterd/src/glusterd-handler.c:4993

#5  0xbb708c22 in rpc_clnt_notify (trans=0xb8bc8030, mydata=0xb8bc2090,
event=RPC_TRANSPORT_DISCONNECT, data=0xb8bc8030)
 at
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-lib/src/rpc-clnt
at
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-lib/src/rpc-clnIllegal
process-id:
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/libglusterfs/src/syncop.c:1056.

(gdb) #2  0xbb789ffd in synclock_unlock (lock=0xbb1d4ac0)
(gdb) at
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/libglusterfs/src/syncop.c:1071

Illegal process-id:
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/libglusterfs/src/syncop.c:1071.

(gdb) #3  0xb9b803ff in glusterd_big_locked_notify (rpc=0xb8bc2070,
mydata=0xb950efa0, event=RPC_CLNT_DISCONNECT, data=0x0,
(gdb) notify_fn=0xb9b8ec28 <__glusterd_brick_rpc_notify>)
Undefined command: "notify_fn".  Try "help".
(gdb) at
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/mgmt/glusterd/src/glusterd-handler.c:72

Illegal process-id:
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/mgmt/glusterd/src/glusterd-handler.c:72.

(gdb) #4  0xb9b8f0d7 in glusterd_brick_rpc_notify (rpc=0xb8bc2070,
mydata=0xb950efa0, event=RPC_CLNT_DISCONNECT, data=0x0)
(gdb) at
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/mgmt/glusterd/src/glusterd-handler.c:4993

Illegal process-id:
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/mgmt/glusterd/src/glusterd-handler.c:4993.

#5  0xbb708c22 in rpc_clnt_notify (trans=0xb8bc8030, mydata=0xb8bc2090,
event=RPC_TRANSPORT_DISCONNECT, data=0xb8bc8030)
/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-lib/src/rpc-clnt
at 8030)
 at
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-lib/src/rpc-transport.c:546

#7  0xbb231847 in socket_event_poll_err (this=0xb8bc8030)
 at
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-transport/socket/src/socket.c:1151

#8  0xbb2359a2 in socket_event_handler (fd=15, idx=9, data=0xb8bc8030,
poll_in=1, poll_out=4, poll_err=0)
 at /home/jenkins/root/workspace/rackspace-n(gdb) #5  0xbb708c22 in
rpc_clnt_notify (trans=0xb8bc8030, mydata=0xb8bc2090,
event=RPC_TRANSPORT_DISCONNECT, data=0xb8bc8030)
(gdb)
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-lib/src/rpc-clnt
at 8030)
Undefined command: "".  Try "help".
(gdb) at
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-lib/src/rpc-transport.c:546

Illegal process-id:
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-lib/src/rpc-transport.c:546.

(gdb) #7  0xbb231847 in socket_event_poll_err (this=0xb8bc8030)
(gdb) at
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-transport/socket/src/socket.c:1151

Illegal process-id:
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-transport/socket/src/socket.c:1151.

(gdb) #8  0xbb2359a2 in socket_event_handler (fd=15, idx=9,
data=0xb8bc8030, poll_in=1, poll_out=4, poll_err=0)
(gdb) at
/home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/rpc/rpc-transport/socket/src/socket.c:2356

Illegal process-id:
/home

Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file

2016-01-29 Thread Pranith Kumar Karampuri



On 01/28/2016 05:05 PM, Pranith Kumar Karampuri wrote:
With baul jianguo's help I am able to see that FLUSH fops are hanging 
for some reason.


pk1@localhost - ~/Downloads
17:02:13 :) ⚡ grep "unique=" client-dump1.txt
unique=3160758373
unique=2073075682
unique=1455047665
unique=0

pk1@localhost - ~/Downloads
17:02:21 :) ⚡ grep "unique=" client-dump-0.txt
unique=3160758373
unique=2073075682
unique=1455047665
unique=0

I will be debugging a bit more and post my findings.

+Raghavendra G

All the stubs are hung in write-behind. I checked that the statedumps
don't have any writes in progress. Maybe because of some race, the flush
fop is not resumed after the write calls are complete? It seems this issue
happens only when io-threads is enabled on the client.


Pranith


Pranith
On 01/28/2016 03:18 PM, baul jianguo wrote:

The client glusterfs gdb info: the main thread id is 70800.
  In the top output, thread 70800 has time 1263:30 and thread 70810 has time
1321:10; the other threads' times are too small.
(gdb) thread apply all bt



Thread 9 (Thread 0x7fc21acaf700 (LWP 70801)):

#0  0x7fc21cc0c535 in sigwait () from /lib64/libpthread.so.0

#1  0x0040539b in glusterfs_sigwaiter (arg=) at glusterfsd.c:1653

#2  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#3  0x7fc21c56e93d in clone () from /lib64/libc.so.6



Thread 8 (Thread 0x7fc21a2ae700 (LWP 70802)):

#0  0x7fc21cc08a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0

#1  0x7fc21ded02bf in syncenv_task (proc=0x121ee60) at syncop.c:493

#2  0x7fc21ded6300 in syncenv_processor (thdata=0x121ee60) at 
syncop.c:571


#3  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#4  0x7fc21c56e93d in clone () from /lib64/libc.so.6



Thread 7 (Thread 0x7fc2198ad700 (LWP 70803)):

#0  0x7fc21cc08a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0

#1  0x7fc21ded02bf in syncenv_task (proc=0x121f220) at syncop.c:493

#2  0x7fc21ded6300 in syncenv_processor (thdata=0x121f220) at 
syncop.c:571


#3  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#4  0x7fc21c56e93d in clone () from /lib64/libc.so.6



Thread 6 (Thread 0x7fc21767d700 (LWP 70805)):

#0  0x7fc21cc0bfbd in nanosleep () from /lib64/libpthread.so.0

#1  0x7fc21deb16bc in gf_timer_proc (ctx=0x11f2010) at timer.c:170

#2  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#3  0x7fc21c56e93d in clone () from /lib64/libc.so.6



Thread 5 (Thread 0x7fc20fb1e700 (LWP 70810)):

#0  0x7fc21c566987 in readv () from /lib64/libc.so.6

#1  0x7fc21accbc55 in fuse_thread_proc (data=0x120f450) at
fuse-bridge.c:4752

#2  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#3  0x7fc21c56e93d in clone () from /lib64/libc.so.6  (most CPU time)



Thread 4 (Thread 0x7fc20f11d700 (LWP 70811)):  (a bit less CPU time)

#0  0x7fc21cc0b7dd in read () from /lib64/libpthread.so.0

#1  0x7fc21acc0e73 in read (data=) at
/usr/include/bits/unistd.h:45

#2  notify_kernel_loop (data=) at 
fuse-bridge.c:3786


#3  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#4  0x7fc21c56e93d in clone () from /lib64/libc.so.6



Thread 3 (Thread 0x7fc1b16fe700 (LWP 206224)):


#0  0x7fc21cc08a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0

#1  0x7fc20e515e60 in iot_worker (data=0x19eeda0) at 
io-threads.c:157


#2  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#3  0x7fc21c56e93d in clone () from /lib64/libc.so.6



Thread 2 (Thread 0x7fc1b0bfb700 (LWP 214361)):

#0  0x7fc21cc08a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0

#1  0x7fc20e515e60 in iot_worker (data=0x19eeda0) at 
io-threads.c:157


#2  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#3  0x7fc21c56e93d in clone () from /lib64/libc.so.6



Thread 1 (Thread 0x7fc21e31e700 (LWP 70800)):

#0  0x7fc21c56ef33 in epoll_wait () from /lib64/libc.so.6

#1  0x7fc21deea3e7 in event_dispatch_epoll (event_pool=0x120dec0)
at event-epoll.c:428

#2  0x004075e4 in main (argc=4, argv=0x7fff3dc93698) at
glusterfsd.c:1983

On Thu, Jan 28, 2016 at 5:29 PM, baul jianguo <roidi...@gmail.com> 
wrote:

http://pastebin.centos.org/38941/
Client statedump: only the pids 27419, 168030, 208655 hang; you can search
for these pids in the statedump file.

On Wed, Jan 27, 2016 at 4:35 PM, Pranith Kumar Karampuri
<pkara...@redhat.com> wrote:

Hi,
   If the hang appears on enabling client side io-threads then 
it could

be because of some race that is seen when io-threads is enabled on the
client side. 2 things will help us debug this issue:
1) thread apply all bt inside gdb (with debuginfo rpms/debs 
installed )
2) Complete statedump of the mount at two intervals preferably 10 
seconds
apart. It becomes difficult to find out which ones are stuck vs the 
ones
that ar

Re: [Gluster-devel] [Gluster-users] Determining Connected Client Version

2016-01-26 Thread Pranith Kumar Karampuri



On 01/27/2016 09:21 AM, Atin Mukherjee wrote:


On 01/27/2016 07:21 AM, Vijay Bellur wrote:

On 01/26/2016 01:19 PM, Marc Eisenbarth wrote:

I'm trying to set a parameter on a volume, but am unable to due to the
following message. I have a large number of connected clients and it's
likely that some clients have updated packages but haven't remounted the
volume. Is there an easier way to find the offending client?


You could grep for "accepted client from" in /var/log/glusterfs/bricks
to get an idea of the versions of connected clients.

I had sent a patch [1] to improve the error message to indicate which
client is the culprit here. This is not the first time I've heard a
user complaining about it, so I will try to get it into the release stream.

Anyone up for review?

[1] http://review.gluster.org/#/c/11831/


This is definitely better than the state we have now. I looked at the 
patch. Looks fine. If you could resend the patch with a BUG-id, let's take 
it in.


What would be even better is if we could enhance "gluster volume status 
clients", which prints the clients connected at the moment, to also print 
the min and max op-versions, so that users can get the versions of all 
clients in one command.


Pranith

HTH,
Vijay

___
Gluster-users mailing list
gluster-us...@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


___
Gluster-users mailing list
gluster-us...@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] Determining Connected Client Version

2016-01-26 Thread Pranith Kumar Karampuri



On 01/27/2016 12:49 PM, Atin Mukherjee wrote:


On 01/27/2016 12:40 PM, Pranith Kumar Karampuri wrote:


On 01/27/2016 09:21 AM, Atin Mukherjee wrote:

On 01/27/2016 07:21 AM, Vijay Bellur wrote:

On 01/26/2016 01:19 PM, Marc Eisenbarth wrote:

I'm trying to set a parameter on a volume, but am unable to due to the
following message. I have a large number of connected clients and it's
likely that some clients have updated packages but haven't remounted the
volume. Is there an easier way to find the offending client?


You could grep for "accepted client from" in /var/log/glusterfs/bricks
to get an idea of the versions of connected clients.

I had sent a patch [1] to improve the error message to indicate which
client is the culprit here. This is not the first time I've heard a
user complaining about it, so I will try to get it into the release stream.

Anyone up for review?

[1] http://review.gluster.org/#/c/11831/

This is definitely better than the state we have now. I looked at the
patch. Looks fine. If you could resend the patch with a BUG-id, let's take
it in.

What would be even better is if we could enhance "gluster volume status
clients", which prints the clients connected at the moment, to also print
the min and max op-versions, so that users can get the versions of all
clients in one command.

Yes, doable. Actually I tried that earlier for the exact op-version, but
it seems like glusterd doesn't come to know the exact op-version
the client is running. But having max and min op-version can
certainly be done. Do you want to file an enhancement for this?


Marc,
  Do you want to do it? I can take it up if you don't have time. I 
will wait till tomorrow.


Pranith



~Atin

Pranith

HTH,
Vijay

___
Gluster-users mailing list
gluster-us...@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


___
Gluster-users mailing list
gluster-us...@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file

2016-01-28 Thread Pranith Kumar Karampuri



On 01/28/2016 02:59 PM, baul jianguo wrote:

http://pastebin.centos.org/38941/
Client statedump: only the pids 27419, 168030, 208655 hang; you can search
for these pids in the statedump file.

Could you take one more statedump please?

Pranith


On Wed, Jan 27, 2016 at 4:35 PM, Pranith Kumar Karampuri
<pkara...@redhat.com> wrote:

Hi,
   If the hang appears on enabling client side io-threads then it could
be because of some race that is seen when io-threads is enabled on the
client side. 2 things will help us debug this issue:
1) thread apply all bt inside gdb (with debuginfo rpms/debs installed )
2) Complete statedump of the mount at two intervals preferably 10 seconds
apart. It becomes difficult to find out which ones are stuck vs the ones
that are on-going when we have just one statedump. If we have two, we can
find which frames are common in both of the statedumps and then take a
closer look there.

Feel free to ping me on #gluster-dev (nick: pranithk) if you have the process
hung in that state and you don't mind me doing a live debugging session with
you. This option is the best of the lot!

Thanks a lot baul, Oleksandr for the debugging so far!

Pranith


On 01/25/2016 01:03 PM, baul jianguo wrote:

3.5.7 also hangs; only the flush op hung. Yes, with
performance.client-io-threads off, there is no hang.

The hang does not relate to the client kernel version.

One client statedump about the flush op; anything abnormal?

[global.callpool.stack.12]

uid=0

gid=0

pid=14432

unique=16336007098

lk-owner=77cb199aa36f3641

op=FLUSH

type=1

cnt=6



[global.callpool.stack.12.frame.1]

ref_count=1

translator=fuse

complete=0



[global.callpool.stack.12.frame.2]

ref_count=0

translator=datavolume-write-behind

complete=0

parent=datavolume-read-ahead

wind_from=ra_flush

wind_to=FIRST_CHILD (this)->fops->flush

unwind_to=ra_flush_cbk



[global.callpool.stack.12.frame.3]

ref_count=1

translator=datavolume-read-ahead

complete=0

parent=datavolume-open-behind

wind_from=default_flush_resume

wind_to=FIRST_CHILD(this)->fops->flush

unwind_to=default_flush_cbk



[global.callpool.stack.12.frame.4]

ref_count=1

translator=datavolume-open-behind

complete=0

parent=datavolume-io-threads

wind_from=iot_flush_wrapper

wind_to=FIRST_CHILD(this)->fops->flush

unwind_to=iot_flush_cbk



[global.callpool.stack.12.frame.5]

ref_count=1

translator=datavolume-io-threads

complete=0

parent=datavolume

wind_from=io_stats_flush

wind_to=FIRST_CHILD(this)->fops->flush

unwind_to=io_stats_flush_cbk



[global.callpool.stack.12.frame.6]

ref_count=1

translator=datavolume

complete=0

parent=fuse

wind_from=fuse_flush_resume

wind_to=xl->fops->flush

unwind_to=fuse_err_cbk



On Sun, Jan 24, 2016 at 5:35 AM, Oleksandr Natalenko
<oleksa...@natalenko.name> wrote:

With "performance.client-io-threads" set to "off" no hangs occurred in 3
rsync/rm rounds. Could that be some fuse-bridge lock race? Will bring
that
option to "on" back again and try to get full statedump.

On четвер, 21 січня 2016 р. 14:54:47 EET Raghavendra G wrote:

On Thu, Jan 21, 2016 at 10:49 AM, Pranith Kumar Karampuri <

pkara...@redhat.com> wrote:

On 01/18/2016 02:28 PM, Oleksandr Natalenko wrote:

XFS. Server side works OK, I'm able to mount volume again. Brick is
30%
full.

Oleksandr,

Will it be possible to get the statedump of the client and bricks 
next time it happens?


https://github.com/gluster/glusterfs/blob/master/doc/debugging/statedump.m
d#how-to-generate-statedump

We also need to dump inode information. To do that you have to add
"all=yes" to /var/run/gluster/glusterdump.options before you issue the
commands to get the statedump.


Pranith


On понеділок, 18 січня 2016 р. 15:07:18 EET baul jianguo wrote:

What is your brick file system? And what is the status of the glusterfsd
process and all its threads?
I met the same issue when a client app such as rsync stayed in D status, and
the brick process and related threads were also in D status.
And the brick device disk utilization was 100%.

On Sun, Jan 17, 2016 at 6:13 AM, Oleksandr Natalenko

<oleksa...@natalenko.name> wrote:

Wrong assumption, rsync hung again.

On субота, 16 січня 2016 р. 22:53:04 EET Oleksandr Natalenko wrote:

One possible reason:

cluster.lookup-optimize: on
cluster.readdir-optimize: on

I've disabled both optimizations, and at least as of now rsync
still
does
its job with no issues. I would like to find out what option causes
such
a
behavior and why. Will test more.

On пʼятниця, 15 січня 2016 р. 16:09:51 EET Oleksandr Natalenko
wrote:

Another observation: if rsyncing is resumed after hang, rsync
itself
hangs a lot faster because it does stat of already copied files.
So,
the
reason may be not writing itself, but massive stat on GlusterFS
volume
as well.

15.01.2016 09:40, Oleksandr Natalenko написав:

While doing rsync over millions of files from ordinary partition
to
GlusterFS volume, just after approx. first 2 million rsync hang
happens, 

Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file

2016-01-28 Thread Pranith Kumar Karampuri
With baul jianguo's help I am able to see that FLUSH fops are hanging 
for some reason.


pk1@localhost - ~/Downloads
17:02:13 :) ⚡ grep "unique=" client-dump1.txt
unique=3160758373
unique=2073075682
unique=1455047665
unique=0

pk1@localhost - ~/Downloads
17:02:21 :) ⚡ grep "unique=" client-dump-0.txt
unique=3160758373
unique=2073075682
unique=1455047665
unique=0

I will be debugging a bit more and post my findings.

Pranith
On 01/28/2016 03:18 PM, baul jianguo wrote:

The client glusterfs gdb info; the main thread id is 70800.
In the top output, thread 70800 has CPU time 1263:30 and thread 70810 has
1321:10; the other threads' times are too small.
(gdb) thread apply all bt



Thread 9 (Thread 0x7fc21acaf700 (LWP 70801)):

#0  0x7fc21cc0c535 in sigwait () from /lib64/libpthread.so.0

#1  0x0040539b in glusterfs_sigwaiter (arg=) at glusterfsd.c:1653

#2  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#3  0x7fc21c56e93d in clone () from /lib64/libc.so.6



Thread 8 (Thread 0x7fc21a2ae700 (LWP 70802)):

#0  0x7fc21cc08a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0

#1  0x7fc21ded02bf in syncenv_task (proc=0x121ee60) at syncop.c:493

#2  0x7fc21ded6300 in syncenv_processor (thdata=0x121ee60) at syncop.c:571

#3  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#4  0x7fc21c56e93d in clone () from /lib64/libc.so.6



Thread 7 (Thread 0x7fc2198ad700 (LWP 70803)):

#0  0x7fc21cc08a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0

#1  0x7fc21ded02bf in syncenv_task (proc=0x121f220) at syncop.c:493

#2  0x7fc21ded6300 in syncenv_processor (thdata=0x121f220) at syncop.c:571

#3  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#4  0x7fc21c56e93d in clone () from /lib64/libc.so.6



Thread 6 (Thread 0x7fc21767d700 (LWP 70805)):

#0  0x7fc21cc0bfbd in nanosleep () from /lib64/libpthread.so.0

#1  0x7fc21deb16bc in gf_timer_proc (ctx=0x11f2010) at timer.c:170

#2  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#3  0x7fc21c56e93d in clone () from /lib64/libc.so.6



Thread 5 (Thread 0x7fc20fb1e700 (LWP 70810)):

#0  0x7fc21c566987 in readv () from /lib64/libc.so.6

#1  0x7fc21accbc55 in fuse_thread_proc (data=0x120f450) at
fuse-bridge.c:4752

#2  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#3  0x7fc21c56e93d in clone () from /lib64/libc.so.6  (most CPU time)



Thread 4 (Thread 0x7fc20f11d700 (LWP 70811)):  (a bit less CPU time)

#0  0x7fc21cc0b7dd in read () from /lib64/libpthread.so.0

#1  0x7fc21acc0e73 in read (data=) at
/usr/include/bits/unistd.h:45

#2  notify_kernel_loop (data=) at fuse-bridge.c:3786

#3  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#4  0x7fc21c56e93d in clone () from /lib64/libc.so.6



Thread 3 (Thread 0x7fc1b16fe700 (LWP 206224)):


#0  0x7fc21cc08a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0

#1  0x7fc20e515e60 in iot_worker (data=0x19eeda0) at io-threads.c:157

#2  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#3  0x7fc21c56e93d in clone () from /lib64/libc.so.6



Thread 2 (Thread 0x7fc1b0bfb700 (LWP 214361)):

#0  0x7fc21cc08a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0

#1  0x7fc20e515e60 in iot_worker (data=0x19eeda0) at io-threads.c:157

#2  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0

#3  0x7fc21c56e93d in clone () from /lib64/libc.so.6



Thread 1 (Thread 0x7fc21e31e700 (LWP 70800)):

#0  0x7fc21c56ef33 in epoll_wait () from /lib64/libc.so.6

#1  0x7fc21deea3e7 in event_dispatch_epoll (event_pool=0x120dec0)
at event-epoll.c:428

#2  0x004075e4 in main (argc=4, argv=0x7fff3dc93698) at
glusterfsd.c:1983

On Thu, Jan 28, 2016 at 5:29 PM, baul jianguo <roidi...@gmail.com> wrote:

http://pastebin.centos.org/38941/
Client statedump: only the pids 27419, 168030, 208655 hang; you can search
for these pids in the statedump file.

On Wed, Jan 27, 2016 at 4:35 PM, Pranith Kumar Karampuri
<pkara...@redhat.com> wrote:

Hi,
   If the hang appears on enabling client side io-threads then it could
be because of some race that is seen when io-threads is enabled on the
client side. 2 things will help us debug this issue:
1) thread apply all bt inside gdb (with debuginfo rpms/debs installed )
2) Complete statedump of the mount at two intervals preferably 10 seconds
apart. It becomes difficult to find out which ones are stuck vs the ones
that are on-going when we have just one statedump. If we have two, we can
find which frames are common in both of the statedumps and then take a
closer look there.

Feel free to ping me on #gluster-dev (nick: pranithk) if you have the process
hung in that state and you don't mind me doing a live debugging session with
you. This option is the best of the lot!

Thanks a lot baul, Ol

Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-25 Thread Pranith Kumar Karampuri



On 01/26/2016 08:14 AM, Vijay Bellur wrote:

On 01/25/2016 12:36 AM, Ravishankar N wrote:

Hi,

We are planning to introduce a throttling xlator on the server (brick)
process to regulate FOPS. The main motivation is to solve complaints about
AFR selfheal taking too much CPU (due to too many fops for entry
self-heal, rchecksums for data self-heal, etc.).



I am wondering if we can re-use the same xlator for throttling 
bandwidth, iops, etc. in addition to fops. Based on admin-configured 
policies we could provide different upper thresholds to different 
clients/tenants, and this could prove to be a useful feature in 
multitenant deployments to avoid starvation/noisy-neighbor classes of 
problems. Has any thought gone into this direction?


Nope. It was mainly about internal processes at the moment.





The throttling is achieved using the Token Bucket Filter algorithm
(TBF). TBF
is already used by bitrot's bitd signer (which is a client process) in
gluster to regulate the CPU intensive check-sum calculation. By 
putting the
logic on the brick side, multiple clients- selfheal, bitrot, 
rebalance or

even the mounts themselves can avail the benefits of throttling.

The TBF algorithm in a nutshell is as follows: There is a bucket which
is filled
at a steady (configurable) rate with tokens. Each FOP will need a fixed
amount
of tokens to be processed. If the bucket has that many tokens, the 
FOP is
allowed and that many tokens are removed from the bucket. If not, the 
FOP is

queued until the bucket is filled.
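
To make the bucket/token idea above concrete, here is a minimal, standalone
token-bucket sketch in plain C. This is not the bitd or xlator code; the
struct and function names (tbf_bucket, tbf_refill, tbf_consume) and the rates
are made up for illustration. A filler adds tokens at a steady, configurable
rate, every FOP costs a fixed number of tokens, and a FOP that finds the
bucket short would be queued:

/* minimal token-bucket sketch (illustrative only, not GlusterFS code) */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

struct tbf_bucket {
        double capacity;       /* maximum tokens the bucket can hold     */
        double tokens;         /* tokens currently available             */
        double rate;           /* tokens added per second (configurable) */
        struct timespec last;  /* time of the last refill                */
};

static double
elapsed_sec (struct timespec *a, struct timespec *b)
{
        return (b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) / 1e9;
}

static void
tbf_refill (struct tbf_bucket *b)
{
        struct timespec now;

        clock_gettime (CLOCK_MONOTONIC, &now);
        b->tokens += b->rate * elapsed_sec (&b->last, &now);
        if (b->tokens > b->capacity)
                b->tokens = b->capacity;
        b->last = now;
}

/* returns 1 if the fop can proceed now, 0 if it has to be queued */
static int
tbf_consume (struct tbf_bucket *b, double cost)
{
        tbf_refill (b);
        if (b->tokens < cost)
                return 0;
        b->tokens -= cost;
        return 1;
}

int
main (void)
{
        struct tbf_bucket heal = { .capacity = 10, .tokens = 10, .rate = 5 };
        int i;

        clock_gettime (CLOCK_MONOTONIC, &heal.last);
        for (i = 0; i < 20; i++) {
                if (tbf_consume (&heal, 3))
                        printf ("fop %2d: admitted\n", i);
                else
                        printf ("fop %2d: queued, waiting for tokens\n", i);
                usleep (200000);     /* 0.2s between simulated fops */
        }
        return 0;
}

In the proposed xlator the same refill/consume logic would presumably sit
behind one bucket per (internal) client, with queued FOPs resumed by a
dequeue thread once the filler has topped the bucket up, as described below.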

The xlator will need to reside above io-threads and can have different buckets,
one per client. There has to be a communication mechanism between the client and
the brick (IPC?) to tell what FOPS need to be regulated from it, the no. of
tokens needed, etc. These need to be reconfigurable via appropriate mechanisms.
Each bucket will have a token-filler thread which will fill the tokens in it.


If there is one bucket per client and one thread per bucket, it would 
be difficult to scale as the number of clients increases. How can we do 
this better?


It is the same thread for all the buckets, because the number of internal 
clients at the moment is in single digits. The problem statement we have 
right now doesn't cover what you are looking for.




The main thread will enqueue heals in a list in the bucket if there 
aren't

enough tokens. Once the token filler detects some FOPS can be serviced,
it will
send a cond-broadcast to a dequeue thread which will process (stack
wind) all
the FOPS that have the required no. of tokens from all buckets.

This is just a high-level abstraction; we are requesting feedback on any aspect of
this feature. What kind of mechanism is best between the client/bricks for
tuning various parameters? What other requirements do you foresee?



I am in favor of having administrator defined policies or templates 
(collection of policies) being used to provide the tuning parameter 
per client or a set of clients. We could even have a default template 
per use case etc. Is there a specific need to have this negotiation 
between clients and servers?


Thanks,
Vijay

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-02 Thread Pranith Kumar Karampuri



On 02/03/2016 09:20 AM, Shyam wrote:

On 02/02/2016 06:22 PM, Jeff Darcy wrote:
   Background: Quick-read + open-behind xlators are developed to 
help

in small file workload reads like apache webserver, tar etc to get the
data of the file in lookup FOP itself. What happens is, when a lookup
FOP is executed, GF_CONTENT_KEY is added in xdata with max-length and
posix xlator reads the file and fills the data in xdata response if 
this
key is present as long as the file-size is less than max-length 
given in

the xdata. So when we do a tar of something like a kernel tree with
small files, if we look at profile of the bricks all we see are 
lookups.

OPEN + READ fops will not be sent at all over the network.

   With dht2 because data is present on a different cluster. We 
can't
get the data in lookup. Shyam was telling me that opens are also 
sent to

metadata cluster. That will make perf in this usecase back to where it
was before introducing these two features i.e. 1/3 of current perf
(Lookup vs lookup+open+read)


This is interesting thanks for the heads up.



Is "1/3 of current perf" based on actual measurements?  My understanding
was that the translators in question exist to send requests *in 
parallel*
with the original lookup stream.  That means it might be 3x the 
messages,

but it will only be 1/3 the performance if the network is saturated.
Also, the lookup is not guaranteed to be only one message.  It might be
as many as N (the number of bricks), so by the reasoning above the
performance would only drop to N/N+2.  I think the real situation is a
bit more complicated - and less dire - than you suggest.


I suggest that we send some fop at the
time of open to data cluster and change quick-read to cache this 
data on

open (if not already) then we can reduce the perf hit to 1/2 of current
perf, i.e. lookup+open.


At first glance, it seems pretty simple to do something like this, and
pretty obvious that we should.  The tricky question is: where should we
send that other op, before lookup has told us where the partition
containing that file is?  If there's some reasonable guess we can make,
the sending an open+read in parallel with the lookup will be helpful.
If not, then it will probably be a waste of time and network resources.
Shyam, is enough of this information being cached *on the clients* to
make this effective?



The file data would be located based on its GFID, so before the 
*first* lookup/stat for a file, there is no way to know its GFID.
NOTE: Instead of a name hash, the GFID hash is used, to get immunity 
against renames and the like, as a name hash could change the location 
information for the file (among other reasons).


The open+read can be done as a single FOP,
  - open for a read only case can do access checking on the client to 
allow the FOP to proceed to the DS without hitting the MDS for an open 
token


The client side cache is important from this and other such 
perspectives. It should also leverage upcall infra to keep the cache 
loosely coherent.


One thing to note here would be, for the client to do a lookup (where 
the file name should be known before hand), either a readdir/(p) has 
to have happened, or the client knows the name already (say 
application generated names). For the former (readdir case), there is 
enough information on the client to not need a lookup, but rather just 
do the open+read on the DS. For the latter the first lookup cannot be 
avoided, degrading this to a lookup+(open+read).


Some further tricks can be done to do readdir prefetching on such 
workloads, as the MDS runs on a DB (eventually), piggybacking more 
entries than requested on a lookup. I would possibly leave that for 
later, based on performance numbers in the small file area.


I strongly suggest that we don't postpone this to later as I think this 
is a solved problem. http://www.ietf.org/rfc/rfc4122.txt section 4.3 may 
be of help here. i.e. create UUID based on string, namespace. So we can 
use pgfid as namespace and filename as string. I understand that we will 
get into 2 hops if the file is renamed, but it is the best we can do 
right now. We can take help from crypto team in Redhat to make sure we 
do the right thing. If we get this implementation in dht2 after the code 
is released all the files created with old gfid-generation will work 
with half the possible perf.
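
For reference, below is a small standalone sketch of the RFC 4122 section 4.3
mechanism being referred to, i.e. a name-based (version 5, SHA-1) UUID with
the parent gfid as the namespace and the file name as the name. It uses
OpenSSL's SHA1(), and the pgfid value is a made-up example; note that further
down the thread this suggestion is withdrawn because it can lead to gfid
collisions, so this is only to illustrate what "create UUID based on string,
namespace" means.

/* sketch: RFC 4122 v5 (name-based) UUID from (pgfid, name); build: gcc x.c -lcrypto */
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

static void
uuid5 (const unsigned char ns[16], const char *name, unsigned char out[16])
{
        unsigned char buf[16 + 256];
        unsigned char md[SHA_DIGEST_LENGTH];
        size_t        len = strlen (name);

        if (len > sizeof (buf) - 16)
                len = sizeof (buf) - 16;  /* sketch only; real code would not truncate */
        memcpy (buf, ns, 16);             /* namespace UUID bytes first ...            */
        memcpy (buf + 16, name, len);     /* ... then the name                         */
        SHA1 (buf, 16 + len, md);

        memcpy (out, md, 16);             /* first 128 bits of the SHA-1               */
        out[6] = (out[6] & 0x0f) | 0x50;  /* set version 5                             */
        out[8] = (out[8] & 0x3f) | 0x80;  /* set RFC 4122 variant                      */
}

int
main (void)
{
        /* hypothetical parent gfid acting as the namespace */
        unsigned char pgfid[16] = { 0xde, 0xad, 0xbe, 0xef, 0, 1, 2, 3,
                                    4, 5, 6, 7, 8, 9, 10, 11 };
        unsigned char gfid[16];
        int i;

        uuid5 (pgfid, "index.html", gfid);
        for (i = 0; i < 16; i++)
                printf ("%02x%s", gfid[i],
                        (i == 3 || i == 5 || i == 7 || i == 9) ? "-" : "");
        printf ("\n");
        return 0;
}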


Pranith


Shyam


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-03 Thread Pranith Kumar Karampuri



The file data would be located based on its GFID, so before the *first*
lookup/stat for a file, there is no way to know its GFID.
NOTE: Instead of a name hash, the GFID hash is used, to get immunity
against renames and the like, as a name hash could change the location
information for the file (among other reasons).


Another manner of achieving the same when the GFID of the file is 
known (from a readdir) is to wind the lookup and read of size to the 
respective MDS and DS, where the lookup would be responded to once the 
MDS responds, and the DS response is cached for the subsequent 
open+read case. So on the wire we would have a fan out of 2 FOPs, but 
still satisfy the quick read requirements.


Tar kind of workload doesn't have a problem because we know the gfid 
after readdirp.




I would assume the above resolves the problem posted, are there cases 
where we do not know the GFID of the file? i.e no readdir performed 
and client knows the file name that it wants to operate on? Do we have 
traces of the webserver workload to see if it generates names on the 
fly or does a readdir prior to that?


The problem is with workloads which know the files that need to be read 
without readdir, like hyperlinks (webserver), swift objects, etc. Those 
are the two I know of which will have this problem, which can't be improved 
because we don't have metadata and data co-located. I have been trying to 
think of a solution for the past few days. Nothing good is coming up :-/


Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-04 Thread Pranith Kumar Karampuri



On 02/03/2016 07:54 PM, Jeff Darcy wrote:

Problem is with workloads which know the files that need to be read
without readdir, like hyperlinks (webserver), swift objects etc. These
are two I know of which will have this problem, which can't be improved
because we don't have metadata, data co-located. I have been trying to
think of a solution for past few days. Nothing good is coming up :-/

In those cases, caching (at the MDS) would certainly help a lot.  Some
variation of the compounding infrastructure under development for Samba
etc. might also apply, since this really is a compound operation.
Even with compound fops it will still require two sequential network 
operations from dht2, one to the MDC and one to the DC, so I don't think it helps.


Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-02 Thread Pranith Kumar Karampuri



On 02/02/2016 06:22 PM, Jeff Darcy wrote:

   Background: Quick-read + open-behind xlators are developed to help
in small file workload reads like apache webserver, tar etc to get the
data of the file in lookup FOP itself. What happens is, when a lookup
FOP is executed, GF_CONTENT_KEY is added in xdata with max-length and
posix xlator reads the file and fills the data in xdata response if this
key is present as long as the file-size is less than max-length given in
the xdata. So when we do a tar of something like a kernel tree with
small files, if we look at profile of the bricks all we see are lookups.
OPEN + READ fops will not be sent at all over the network.

   With dht2 because data is present on a different cluster. We can't
get the data in lookup. Shyam was telling me that opens are also sent to
metadata cluster. That will make perf in this usecase back to where it
was before introducing these two features i.e. 1/3 of current perf
(Lookup vs lookup+open+read)

Is "1/3 of current perf" based on actual measurements?  My understanding
was that the translators in question exist to send requests *in parallel*
with the original lookup stream.  That means it might be 3x the messages,
but it will only be 1/3 the performance if the network is saturated.
Also, the lookup is not guaranteed to be only one message.  It might be
as many as N (the number of bricks), so by the reasoning above the
performance would only drop to N/N+2.  I think the real situation is a
bit more complicated - and less dire - than you suggest.


As per what I heard, when quick-read (now divided into open-behind and 
quick-read) was introduced, webserver use-case users reported a 300% to 
400% perf improvement.
We should definitely test it once we have enough code to do so. I am 
just giving a heads up.


Having said that, for 'tar' I think we can most probably do a better job 
in dht2, because even after readdirp a nameless lookup comes. If it has 
GF_CONTENT_KEY we should send it to the data cluster directly. For the 
webserver usecase I don't have any ideas.


At least on my laptop this is what I saw; on a setup with separate 
client and server machines the situation could be worse. This is a distribute 
volume with one brick.


root@localhost - /mnt/d1
19:42:52 :) ⚡ time tar cf a.tgz a

real0m6.987s
user0m0.089s
sys0m0.481s

root@localhost - /mnt/d1
19:43:22 :) ⚡ cd

root@localhost - ~
19:43:25 :) ⚡ umount /mnt/d1

root@localhost - ~
19:43:27 :) ⚡ gluster volume set d1 open-behind off
volume set: success

root@localhost - ~
19:43:47 :) ⚡ gluster volume set d1 quick-read off
volume set: success

root@localhost - ~
19:44:03 :( ⚡ gluster volume stop d1
Stopping volume will make its data inaccessible. Do you want to 
continue? (y/n) y

volume stop: d1: success

root@localhost - ~
19:44:09 :) ⚡ gluster volume start d1
volume start: d1: success

root@localhost - ~
19:44:13 :) ⚡ mount -t glusterfs localhost.localdomain:/d1 /mnt/d1

root@localhost - ~
19:44:29 :) ⚡ cd /mnt/d1

root@localhost - /mnt/d1
19:44:30 :) ⚡ time tar cf b.tgz a

real0m12.176s
user0m0.098s
sys0m0.582s

Pranith



I suggest that we send some fop at the
time of open to data cluster and change quick-read to cache this data on
open (if not already) then we can reduce the perf hit to 1/2 of current
perf, i.e. lookup+open.

At first glance, it seems pretty simple to do something like this, and
pretty obvious that we should.  The tricky question is: where should we
send that other op, before lookup has told us where the partition
containing that file is?  If there's some reasonable guess we can make,
the sending an open+read in parallel with the lookup will be helpful.
If not, then it will probably be a waste of time and network resources.
Shyam, is enough of this information being cached *on the clients* to
make this effective?

Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-01 Thread Pranith Kumar Karampuri

hi,
 Background: The quick-read + open-behind xlators were developed to help 
small-file-workload reads, like apache webserver, tar, etc., get the 
data of the file in the lookup FOP itself. What happens is, when a lookup 
FOP is executed, GF_CONTENT_KEY is added in xdata with a max-length, and the 
posix xlator reads the file and fills the data in the xdata response if this 
key is present, as long as the file size is less than the max-length given in 
the xdata. So when we do a tar of something like a kernel tree with 
small files, if we look at the profile of the bricks all we see are lookups; 
OPEN + READ fops will not be sent at all over the network.
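
The mechanism is roughly analogous to the standalone sketch below (plain C,
not the actual posix xlator code; the function name and the 64KB limit are
made up for illustration): the caller asks for at most max_len bytes along
with the metadata, and the content is returned inline only when the whole
file fits, so that no separate open+read round trip is needed.

/* sketch of "piggyback small-file content on lookup" (illustrative only) */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

/* returns bytes filled into *data (caller frees), or -1 when the caller
 * must fall back to a normal open+read */
static ssize_t
lookup_with_content (const char *path, size_t max_len,
                     struct stat *stbuf, char **data)
{
        int     fd;
        ssize_t nread = -1;

        *data = NULL;
        if (stat (path, stbuf) != 0)
                return -1;
        if ((size_t) stbuf->st_size > max_len)
                return -1;                 /* too large: content not piggybacked */

        fd = open (path, O_RDONLY);
        if (fd < 0)
                return -1;
        *data = malloc (stbuf->st_size ? stbuf->st_size : 1);
        if (*data)
                nread = read (fd, *data, stbuf->st_size);
        close (fd);
        return nread;
}

int
main (int argc, char **argv)
{
        struct stat st;
        char       *content = NULL;
        ssize_t     got;

        if (argc < 2)
                return 1;
        got = lookup_with_content (argv[1], 64 * 1024, &st, &content);
        if (got >= 0)
                printf ("lookup returned %zd bytes inline, no open+read needed\n", got);
        else
                printf ("file too large or error, client falls back to open+read\n");
        free (content);
        return 0;
}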


 With dht2, because data is present on a different cluster, we can't 
get the data in lookup. Shyam was telling me that opens are also sent to 
the metadata cluster. That will take perf in this usecase back to where it 
was before introducing these two features, i.e. 1/3 of the current perf 
(lookup vs lookup+open+read). I suggest that we send some fop at the 
time of open to the data cluster and change quick-read to cache this data on 
open (if not already); then we can reduce the perf hit to 1/2 of the current 
perf, i.e. lookup+open.


 Sorry if this was already discussed and I didn't pay attention.

Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] glustershd fail to start after 3.7.7 upgrade

2016-02-01 Thread Pranith Kumar Karampuri



On 02/01/2016 10:16 PM, Joe Julian wrote:

WTF?

        if (!xattrs_list) {
                ret = -EINVAL;
                gf_msg (this->name, GF_LOG_ERROR, -ret, AFR_MSG_NO_CHANGELOG,
                        "Unable to fetch afr pending changelogs. Is op-version"
                        " >= 30707?");
                goto out;
        }

This is not going to work. Look at the server upgrade process: Upgrade 
one side of the replica, wait for self-heals to finish, upgrade the 
other side. This breaks that process.


Sorry guys, this is my screw up, didn't think of this case in the 
design/review. I will fix this immediately. I hope the rpms are not yet 
available for 3.7.7?


Pranith


On 02/01/2016 07:08 AM, Emmanuel Dreyfus wrote:

Hi

After upgrading to 3.7.7, glustershd will not start anymore. Here are
the logs. It seems that there are queued ops with an older version number
that kill it.

[2016-02-01 15:03:04.832962] I [MSGID: 100030] 
[glusterfsd.c:2318:main] 0-/usr/pkg/sbin/glusterfs: Started running 
/usr/pkg/sbin/glusterfs version 3.7.7 (args: /usr/pkg/sbin/glusterfs 
-s localhost --volfile-id gluster/glustershd -p 
/var/lib/glusterd/glustershd/run/glustershd.pid -l 
/var/log/glusterfs/glustershd.log -S 
/var/run/gluster/c7e5574af4b5b4ffdcb61b1d5e63d8da.socket 
--xlator-option 
*replicate*.node-uuid=85eb78cd-8ffa-49ca-b3e7-d5030bc3124d)
[2016-02-01 15:03:05.065456] I [graph.c:269:gf_add_cmdline_options] 
0-gfs-replicate-3: adding option 'node-uuid' for volume 
'gfs-replicate-3' with value '85eb78cd-8ffa-49ca-b3e7-d5030bc3124d'
[2016-02-01 15:03:05.065510] I [graph.c:269:gf_add_cmdline_options] 
0-gfs-replicate-2: adding option 'node-uuid' for volume 
'gfs-replicate-2' with value '85eb78cd-8ffa-49ca-b3e7-d5030bc3124d'
[2016-02-01 15:03:05.065532] I [graph.c:269:gf_add_cmdline_options] 
0-gfs-replicate-1: adding option 'node-uuid' for volume 
'gfs-replicate-1' with value '85eb78cd-8ffa-49ca-b3e7-d5030bc3124d'
[2016-02-01 15:03:05.065552] I [graph.c:269:gf_add_cmdline_options] 
0-gfs-replicate-0: adding option 'node-uuid' for volume 
'gfs-replicate-0' with value '85eb78cd-8ffa-49ca-b3e7-d5030bc3124d'
[2016-02-01 15:03:05.065866] E [MSGID: 108040] [afr.c:418:init] 
0-gfs-replicate-3: Unable to fetch afr pending changelogs. Is 
op-version >= 30707? [Invalid argument]
[2016-02-01 15:03:05.066002] E [MSGID: 101019] 
[xlator.c:433:xlator_init] 0-gfs-replicate-3: Initialization of 
volume 'gfs-replicate-3' failed, review your volfile again
[2016-02-01 15:03:05.066024] E [graph.c:322:glusterfs_graph_init] 
0-gfs-replicate-3: initializing translator failed
[2016-02-01 15:03:05.066045] E [graph.c:661:glusterfs_graph_activate] 
0-graph: init failed
[2016-02-01 15:03:05.066348] W [glusterfsd.c:1236:cleanup_and_exit] 
(-->0xbbbd9d64  at 
/usr/pkg/lib/libgfrpc.so.0 -->0x8055f28  at 
/usr/pkg/sbin/glusterfs -->0x80518a4  at 
/usr/pkg/sbin/glusterfs ) 0-: received signum (0), shutting down




___
Gluster-users mailing list
gluster-us...@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-02 Thread Pranith Kumar Karampuri



On 02/03/2016 11:49 AM, Pranith Kumar Karampuri wrote:



On 02/03/2016 09:20 AM, Shyam wrote:

On 02/02/2016 06:22 PM, Jeff Darcy wrote:
   Background: Quick-read + open-behind xlators are developed 
to help

in small file workload reads like apache webserver, tar etc to get the
data of the file in lookup FOP itself. What happens is, when a lookup
FOP is executed, GF_CONTENT_KEY is added in xdata with max-length and
posix xlator reads the file and fills the data in xdata response if 
this
key is present as long as the file-size is less than max-length 
given in

the xdata. So when we do a tar of something like a kernel tree with
small files, if we look at profile of the bricks all we see are 
lookups.

OPEN + READ fops will not be sent at all over the network.

   With dht2 because data is present on a different cluster. We 
can't
get the data in lookup. Shyam was telling me that opens are also 
sent to

metadata cluster. That will make perf in this usecase back to where it
was before introducing these two features i.e. 1/3 of current perf
(Lookup vs lookup+open+read)


This is interesting thanks for the heads up.



Is "1/3 of current perf" based on actual measurements?  My 
understanding
was that the translators in question exist to send requests *in 
parallel*
with the original lookup stream.  That means it might be 3x the 
messages,

but it will only be 1/3 the performance if the network is saturated.
Also, the lookup is not guaranteed to be only one message.  It might be
as many as N (the number of bricks), so by the reasoning above the
performance would only drop to N/N+2.  I think the real situation is a
bit more complicated - and less dire - than you suggest.


I suggest that we send some fop at the
time of open to data cluster and change quick-read to cache this 
data on
open (if not already) then we can reduce the perf hit to 1/2 of 
current

perf, i.e. lookup+open.


At first glance, it seems pretty simple to do something like this, and
pretty obvious that we should.  The tricky question is: where should we
send that other op, before lookup has told us where the partition
containing that file is?  If there's some reasonable guess we can make,
the sending an open+read in parallel with the lookup will be helpful.
If not, then it will probably be a waste of time and network resources.
Shyam, is enough of this information being cached *on the clients* to
make this effective?



The file data would be located based on its GFID, so before the 
*first* lookup/stat for a file, there is no way to know its GFID.
NOTE: Instead of a name hash, the GFID hash is used, to get immunity 
against renames and the like, as a name hash could change the 
location information for the file (among other reasons).


The open+read can be done as a single FOP,
  - open for a read only case can do access checking on the client to 
allow the FOP to proceed to the DS without hitting the MDS for an 
open token


The client side cache is important from this and other such 
perspectives. It should also leverage upcall infra to keep the cache 
loosely coherent.


One thing to note here would be, for the client to do a lookup (where 
the file name should be known before hand), either a readdir/(p) has 
to have happened, or the client knows the name already (say 
application generated names). For the former (readdir case), there is 
enough information on the client to not need a lookup, but rather 
just do the open+read on the DS. For the latter the first lookup 
cannot be avoided, degrading this to a lookup+(open+read).


Some further tricks can be done to do readdir prefetching on such 
workloads, as the MDS runs on a DB (eventually), piggybacking more 
entries than requested on a lookup. I would possibly leave that for 
later, based on performance numbers in the small file area.


I strongly suggest that we don't postpone this to later as I think 
this is a solved problem. http://www.ietf.org/rfc/rfc4122.txt section 
4.3 may be of help here. i.e. create UUID based on string, namespace. 
So we can use pgfid as namespace and filename as string. I understand 
that we will get into 2 hops if the file is renamed, but it is the 
best we can do right now. We can take help from crypto team in Redhat 
to make sure we do the right thing. If we get this implementation in 
dht2 after the code is released all the files created with old 
gfid-generation will work with half the possible perf.

Gah! ignore, it will lead to gfid collisions :-/

Pranith


Pranith


Shyam


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] 3.7 pending patches

2016-01-28 Thread Pranith Kumar Karampuri



On 01/28/2016 07:05 PM, Venky Shankar wrote:

Hey folks,

I just merged patch #13302 (and its 3.7 equivalent), which fixes a scrubber 
crash.
This was causing other patches to fail regression.

Requesting a rebase of patches (especially 3.7 pending) that were blocked due to
this.

Thanks a lot for this Venky, Kotresh, Emmanuel. I re-triggered the builds.

I observed the following crash in one of the runs for 
https://build.gluster.org/job/rackspace-regression-2GB-triggered/17819/console 
(3.7):

(gdb) bt
#0  0x0040ecff in glusterfs_rebalance_event_notify_cbk (
req=0x7f0e58006dbc, iov=0x7f0e6cadb5d0, count=1, 
myframe=0x7f0e58003a7c)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/glusterfsd/src/glusterfsd-mgmt.c:1812

#1  0x7f0e79a1274b in saved_frames_unwind (saved_frames=0x19ffe70)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/rpc/rpc-lib/src/rpc-clnt.c:366

#2  0x7f0e79a127ea in saved_frames_destroy (frames=0x19ffe70)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/rpc/rpc-lib/src/rpc-clnt.c:383

#3  0x7f0e79a12c41 in rpc_clnt_connection_cleanup (conn=0x19fea20)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/rpc/rpc-lib/src/rpc-clnt.c:532
#4  0x7f0e79a136cb in rpc_clnt_notify (trans=0x19fee70, 
mydata=0x19fea20,

event=RPC_TRANSPORT_DISCONNECT, data=0x19fee70)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/rpc/rpc-lib/src/rpc-clnt.c:854

#5  0x7f0e79a0fb76 in rpc_transport_notify (this=0x19fee70,
event=RPC_TRANSPORT_DISCONNECT, data=0x19fee70)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/rpc/rpc-lib/src/rpc-transport.c:546

#6  0x7f0e6f1fd621 in socket_event_poll_err (this=0x19fee70)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/rpc/rpc-transport/socket/src/socket.c:1151
#7  0x7f0e6f20234c in socket_event_handler (fd=9, idx=1, 
data=0x19fee70,

poll_in=1, poll_out=0, poll_err=24)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/rpc/rpc-transport/socket/src/socket.c:2356
#8  0x7f0e79cc386c in event_dispatch_epoll_handler 
(event_pool=0x19c3c90,

event=0x7f0e6cadbe70)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/libglusterfs/src/event-epoll.c:575

#9  0x7f0e79cc3c5a in event_dispatch_epoll_worker (data=0x7f0e68014970)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/libglusterfs/src/event-epoll.c:678

#10 0x7f0e78f2aa51 in start_thread () from ./lib64/libpthread.so.0
#11 0x7f0e7889493d in clone () from ./lib64/libc.so.6
(gdb) fr 0
#0  0x0040ecff in glusterfs_rebalance_event_notify_cbk (
req=0x7f0e58006dbc, iov=0x7f0e6cadb5d0, count=1, 
myframe=0x7f0e58003a7c)
at 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/glusterfsd/src/glusterfsd-mgmt.c:1812
1812in 
/home/jenkins/root/workspace/rackspace-regression-2GB-triggered/glusterfsd/src/glusterfsd-mgmt.c

(gdb) info locals
rsp = {op_ret = 0, op_errno = 0, dict = {dict_len = 0, dict_val = 0x0}}
frame = 0x7f0e58003a7c
ctx = 0x0
ret = 0
__FUNCTION__ = "glusterfs_rebalance_event_notify_cbk"
(gdb) p frame->this
$1 = (xlator_t *) 0x3a6000
(gdb) p frame->this->name
Cannot access memory at address 0x3a6000

Pranith


Thanks,

 Venky


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] Trashcan issue with vim editor

2016-01-28 Thread Pranith Kumar Karampuri

+Anoop, Jiffin

On 01/27/2016 03:25 PM, PankaJ Singh wrote:


Hi,

We are using gluster 3.7.6 on ubuntu 14.04. We are facing an issue 
with trashcan feature.

Our scenario is as follow:

1. 2 node server (ubuntu 14.04 with glusterfs 3.7.6)
2. 1 client node (ubuntu 14.04)
3. I have created one volume vol1 with 2 bricks in replica and with 
transport = tcp mode.

4. I have enabled quota on vol1
5. Now I have enabled trashcan feature on vol1
6. Now I have mounted vol1 on client's home directory "mount -t 
glusterfs -o transport=tcp server-1:/vol1 /home/"
7. Now when I log in as any existing non-root user and perform any 
editing via the vim editor, I get this error "E200: *ReadPre 
autocommands made the file unreadable" and my user's home 
directory permission gets changed to 000. After some time these 
permissions get reverted back automatically.


(NOTE: users' home directories are copied into the mounted glusterfs 
volume vol1)



Thanks & Regards
PankaJ Singh


___
Gluster-users mailing list
gluster-us...@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] 3.7.7. patch freeze

2016-01-28 Thread Pranith Kumar Karampuri
Unless the patches fix data loss or crashes, I will not take any more, other 
than the ones which help make regressions consistent:


Final set:
http://review.gluster.org/#/c/12768/
http://review.gluster.org/#/c/13305/  << user asked for this on gluster-users.

http://review.gluster.org/#/c/13119/
http://review.gluster.org/13292
http://review.gluster.org/13071
http://review.gluster.org/13312
http://review.gluster.org/#/c/13127/

Pranith

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] NSR: Suggestions for a new name

2016-01-20 Thread Pranith Kumar Karampuri



On 01/19/2016 08:00 PM, Avra Sengupta wrote:

Hi,

The leader election based replication has been called NSR or "New 
Style Replication" for a while now. We would like to have a new name 
for the same that's less generic. It can be something like "Leader 
Driven Replication" or something more specific that would make sense a 
few years down the line too.


We would love to hear more suggestions from the community. Thanks


If I had a chance to name AFR (Automatic File Replication) I would have 
named it Automatic Data replication. Feel free to use it if you like it.


Pranith


Regards,
Avra
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] heal hanging

2016-01-20 Thread Pranith Kumar Karampuri

hey,
   Which process is consuming so much cpu? I went through the logs 
you gave me. I see that the following files are in gfid mismatch state:


<066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>,
<1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>,
,

Could you give me the output of "ls /indices/xattrop | wc -l" 
on all the bricks which are acting this way? This will tell 
us the number of pending self-heals on the system.


Pranith

On 01/20/2016 09:26 PM, David Robinson wrote:

resending with parsed logs...
I am having issues with 3.6.6 where the load will spike up to 800% 
for one of the glusterfsd processes and the users can no longer 
access the system.  If I reboot the node, the heal will finish 
normally after a few minutes and the system will be responsive, 
but a few hours later the issue will start again.  It looks like it 
is hanging in a heal and spinning up the load on one of the bricks.  
The heal gets stuck and says it is crawling and never returns.  
After a few minutes of the heal saying it is crawling, the load 
spikes up and the mounts become unresponsive.
Any suggestions on how to fix this?  It has us stopped cold as the 
user can no longer access the systems when the load spikes... Logs 
attached.

System setup info is:
[root@gfs01a ~]# gluster volume info homegfs

Volume Name: homegfs
Type: Distributed-Replicate
Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
Options Reconfigured:
performance.io-thread-count: 32
performance.cache-size: 128MB
performance.write-behind-window-size: 128MB
server.allow-insecure: on
network.ping-timeout: 42
storage.owner-gid: 100
geo-replication.indexing: off
geo-replication.ignore-pid-check: on
changelog.changelog: off
changelog.fsync-interval: 3
changelog.rollover-time: 15
server.manage-gids: on
diagnostics.client-log-level: WARNING
[root@gfs01a ~]# rpm -qa | grep gluster
gluster-nagios-common-0.1.1-0.el6.noarch
glusterfs-fuse-3.6.6-1.el6.x86_64
glusterfs-debuginfo-3.6.6-1.el6.x86_64
glusterfs-libs-3.6.6-1.el6.x86_64
glusterfs-geo-replication-3.6.6-1.el6.x86_64
glusterfs-api-3.6.6-1.el6.x86_64
glusterfs-devel-3.6.6-1.el6.x86_64
glusterfs-api-devel-3.6.6-1.el6.x86_64
glusterfs-3.6.6-1.el6.x86_64
glusterfs-cli-3.6.6-1.el6.x86_64
glusterfs-rdma-3.6.6-1.el6.x86_64
samba-vfs-glusterfs-4.1.11-2.el6.x86_64
glusterfs-server-3.6.6-1.el6.x86_64
glusterfs-extra-xlators-3.6.6-1.el6.x86_64



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file

2016-01-20 Thread Pranith Kumar Karampuri



On 01/18/2016 02:28 PM, Oleksandr Natalenko wrote:

XFS. Server side works OK, I'm able to mount volume again. Brick is 30% full.


Oleksandr,
  Will it be possible to get the statedump of the client and bricks 
next time it happens?

https://github.com/gluster/glusterfs/blob/master/doc/debugging/statedump.md#how-to-generate-statedump

Pranith



On понеділок, 18 січня 2016 р. 15:07:18 EET baul jianguo wrote:

What is your brick file system? and the glusterfsd process and all
thread status?
I met same issue when client app such as rsync stay in D status,and
the brick process and relate thread also be in the D status.
And the brick dev disk util is 100% .

On Sun, Jan 17, 2016 at 6:13 AM, Oleksandr Natalenko

 wrote:

Wrong assumption, rsync hung again.

On субота, 16 січня 2016 р. 22:53:04 EET Oleksandr Natalenko wrote:

One possible reason:

cluster.lookup-optimize: on
cluster.readdir-optimize: on

I've disabled both optimizations, and at least as of now rsync still does
its job with no issues. I would like to find out what option causes such
a
behavior and why. Will test more.

On пʼятниця, 15 січня 2016 р. 16:09:51 EET Oleksandr Natalenko wrote:

Another observation: if rsyncing is resumed after hang, rsync itself
hangs a lot faster because it does stat of already copied files. So,
the
reason may be not writing itself, but massive stat on GlusterFS volume
as well.

15.01.2016 09:40, Oleksandr Natalenko написав:

While doing rsync over millions of files from an ordinary partition to a
GlusterFS volume, just after approximately the first 2 million files an rsync
hang happens, and the following info appears in dmesg:

===
[17075038.924481] INFO: task rsync:10310 blocked for more than 120
seconds.
[17075038.931948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[17075038.940748] rsync   D 88207fc13680 0 10310
10309 0x0080
[17075038.940752]  8809c578be18 0086 8809c578bfd8
00013680
[17075038.940756]  8809c578bfd8 00013680 880310cbe660
881159d16a30
[17075038.940759]  881e3aa25800 8809c578be48 881159d16b10
88087d553980
[17075038.940762] Call Trace:
[17075038.940770]  [] schedule+0x29/0x70
[17075038.940797]  []
__fuse_request_send+0x13d/0x2c0
[fuse]
[17075038.940801]  [] ?
fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
[17075038.940805]  [] ? wake_up_bit+0x30/0x30
[17075038.940809]  [] fuse_request_send+0x12/0x20
[fuse]
[17075038.940813]  [] fuse_flush+0xff/0x150 [fuse]
[17075038.940817]  [] filp_close+0x34/0x80
[17075038.940821]  [] __close_fd+0x78/0xa0
[17075038.940824]  [] SyS_close+0x23/0x50
[17075038.940828]  []
system_call_fastpath+0x16/0x1b
===

rsync blocks in D state, and to kill it, I have to do umount --lazy
on
GlusterFS mountpoint, and then kill corresponding client glusterfs
process. Then rsync exits.

Here is GlusterFS volume info:

===
Volume Name: asterisk_records
Type: Distributed-Replicate
Volume ID: dc1fe561-fa3a-4f2e-8330-ec7e52c75ba4
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: server1:/bricks/10_megaraid_0_3_9_x_0_4_3_hdd_r1_nolvm_hdd_storage_01/asterisk/records
Brick2: server2:/bricks/10_megaraid_8_5_14_x_8_6_16_hdd_r1_nolvm_hdd_storage_01/asterisk/records
Brick3: server1:/bricks/11_megaraid_0_5_4_x_0_6_5_hdd_r1_nolvm_hdd_storage_02/asterisk/records
Brick4: server2:/bricks/11_megaraid_8_7_15_x_8_8_20_hdd_r1_nolvm_hdd_storage_02/asterisk/records
Brick5: server1:/bricks/12_megaraid_0_7_6_x_0_13_14_hdd_r1_nolvm_hdd_storage_03/asterisk/records
Brick6: server2:/bricks/12_megaraid_8_9_19_x_8_13_24_hdd_r1_nolvm_hdd_storage_03/asterisk/records
Options Reconfigured:
cluster.lookup-optimize: on
cluster.readdir-optimize: on
client.event-threads: 2
network.inode-lru-limit: 4096
server.event-threads: 4
performance.client-io-threads: on
storage.linux-aio: on
performance.write-behind-window-size: 4194304
performance.stat-prefetch: on
performance.quick-read: on
performance.read-ahead: on
performance.flush-behind: on
performance.write-behind: on
performance.io-thread-count: 2
performance.cache-max-file-size: 1048576
performance.cache-size: 33554432
features.cache-invalidation: on
performance.readdir-ahead: on
===

The issue reproduces each time I rsync such an amount of files.

How could I debug this issue better?
___
Gluster-users mailing list
gluster-us...@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] [Gluster-users] heal hanging

2016-01-21 Thread Pranith Kumar Karampuri



On 01/21/2016 08:25 PM, Glomski, Patrick wrote:
Hello, Pranith. The typical behavior is that the %cpu on a glusterfsd 
process jumps to number of processor cores available (800% or 1200%, 
depending on the pair of nodes involved) and the load average on the 
machine goes very high (~20). The volume's heal statistics output 
shows that it is crawling one of the bricks and trying to heal, but 
this crawl hangs and never seems to finish.


The number of files in the xattrop directory varies over time, so I 
ran a wc -l as you requested periodically for some time and then 
started including a datestamped list of the files that were in the 
xattrops directory on each brick to see which were persistent. All 
bricks had files in the xattrop folder, so all results are attached.
Thanks, this info is helpful. I don't see a lot of files. Could you give the 
output of "gluster volume heal  info"? Is there any directory 
in there which is LARGE?


Pranith


Please let me know if there is anything else I can provide.

Patrick


On Thu, Jan 21, 2016 at 12:01 AM, Pranith Kumar Karampuri 
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:


hey,
   Which process is consuming so much cpu? I went through the
logs you gave me. I see that the following files are in gfid
mismatch state:

<066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>,
<1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>,
,

Could you give me the output of "ls /indices/xattrop | wc -l"
on all the bricks which are acting this way? This
will tell us the number of pending self-heals on the system.

Pranith


On 01/20/2016 09:26 PM, David Robinson wrote:

resending with parsed logs...

I am having issues with 3.6.6 where the load will spike up to
800% for one of the glusterfsd processes and the users can no
longer access the system.  If I reboot the node, the heal will
finish normally after a few minutes and the system will be
responsive, but a few hours later the issue will start again. 
It looks like it is hanging in a heal and spinning up the load

on one of the bricks.  The heal gets stuck and says it is
crawling and never returns.  After a few minutes of the heal
saying it is crawling, the load spikes up and the mounts become
unresponsive.
Any suggestions on how to fix this?  It has us stopped cold as
the user can no longer access the systems when the load
spikes... Logs attached.
System setup info is:
[root@gfs01a ~]# gluster volume info homegfs

Volume Name: homegfs
Type: Distributed-Replicate
Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
Options Reconfigured:
performance.io-thread-count: 32
performance.cache-size: 128MB
performance.write-behind-window-size: 128MB
server.allow-insecure: on
network.ping-timeout: 42
storage.owner-gid: 100
geo-replication.indexing: off
geo-replication.ignore-pid-check: on
changelog.changelog: off
changelog.fsync-interval: 3
changelog.rollover-time: 15
server.manage-gids: on
diagnostics.client-log-level: WARNING
[root@gfs01a ~]# rpm -qa | grep gluster
gluster-nagios-common-0.1.1-0.el6.noarch
glusterfs-fuse-3.6.6-1.el6.x86_64
glusterfs-debuginfo-3.6.6-1.el6.x86_64
glusterfs-libs-3.6.6-1.el6.x86_64
glusterfs-geo-replication-3.6.6-1.el6.x86_64
glusterfs-api-3.6.6-1.el6.x86_64
glusterfs-devel-3.6.6-1.el6.x86_64
glusterfs-api-devel-3.6.6-1.el6.x86_64
glusterfs-3.6.6-1.el6.x86_64
glusterfs-cli-3.6.6-1.el6.x86_64
glusterfs-rdma-3.6.6-1.el6.x86_64
samba-vfs-glusterfs-4.1.11-2.el6.x86_64
glusterfs-server-3.6.6-1.el6.x86_64
glusterfs-extra-xlators-3.6.6-1.el6.x86_64



___
Gluster-devel mailing list
Gluster-devel@gluster.org  <mailto:Gluster-devel@gluster.org>
http://www.gluster.org/mailman/listinfo/gluster-devel



___
Gluster-users mailing list
gluster-us...@gluster.org <mailto:gluster-us...@gluster.org>
http://www.gluster.org/mailman/listinfo/gluster-users




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] heal hanging

2016-01-21 Thread Pranith Kumar Karampuri



On 01/21/2016 09:26 PM, Glomski, Patrick wrote:
I should mention that the problem is not currently occurring and there 
are no heals (output appended). By restarting the gluster services, we 
can stop the crawl, which lowers the load for a while. Subsequent 
crawls seem to finish properly. For what it's worth, files/folders 
that show up in the 'volume info' output during a hung crawl don't 
seem to be anything out of the ordinary.


Over the past four days, the typical time before the problem recurs 
after suppressing it in this manner is an hour. Last night when we 
reached out to you was the last time it happened and the load has been 
low since (a relief).  David believes that recursively listing the 
files (ls -alR or similar) from a client mount can force the issue to 
happen, but obviously I'd rather not unless we have some precise thing 
we're looking for. Let me know if you'd like me to attempt to drive 
the system unstable like that and what I should look for. As it's a 
production system, I'd rather not leave it in this state for long.


Will it be possible to send glustershd, mount logs of the past 4 days? I 
would like to see if this is because of directory self-heal going wild 
(Ravi is working on a throttling feature for 3.8, which will allow putting 
brakes on self-heal traffic)


Pranith


[root@gfs01a xattrop]# gluster volume heal homegfs info
Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
Number of entries: 0

Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
Number of entries: 0

Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
Number of entries: 0

Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
Number of entries: 0

Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
Number of entries: 0

Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
Number of entries: 0

Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
Number of entries: 0

Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
Number of entries: 0




On Thu, Jan 21, 2016 at 10:40 AM, Pranith Kumar Karampuri 
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:




On 01/21/2016 08:25 PM, Glomski, Patrick wrote:

Hello, Pranith. The typical behavior is that the %cpu on a
glusterfsd process jumps to number of processor cores available
(800% or 1200%, depending on the pair of nodes involved) and the
load average on the machine goes very high (~20). The volume's
heal statistics output shows that it is crawling one of the
bricks and trying to heal, but this crawl hangs and never seems
to finish.

The number of files in the xattrop directory varies over time, so
I ran a wc -l as you requested periodically for some time and
then started including a datestamped list of the files that were
in the xattrops directory on each brick to see which were
persistent. All bricks had files in the xattrop folder, so all
results are attached.

Thanks this info is helpful. I don't see a lot of files. Could you
give output of "gluster volume heal <volname> info"? Is there any
directory in there which is LARGE?

Pranith



Please let me know if there is anything else I can provide.

Patrick


On Thu, Jan 21, 2016 at 12:01 AM, Pranith Kumar Karampuri
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:

hey,
   Which process is consuming so much cpu? I went through
the logs you gave me. I see that the following files are in
gfid mismatch state:

<066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>,
<1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>,
,

Could you give me the output of "ls
/indices/xattrop | wc -l" output on all the
bricks which are acting this way? This will tell us the
number of pending self-heals on the system.

Pranith


On 01/20/2016 09:26 PM, David Robinson wrote:

resending with parsed logs...

I am having issues with 3.6.6 where the load will spike up
to 800% for one of the glusterfsd processes and the users
can no longer access the system.  If I reboot the node,
the heal will finish normally after a few minutes and the
system will be responsive, but a few hours later the issue
will start again.  It looks like it is hanging in a heal
and spinning up the load on one of the bricks.  The heal
gets stuck and says it is crawling and never returns.
After a few minutes of the heal saying it is crawling, the
load spikes up and the mounts become unresponsive.
Any suggestions on how to fix this?  It has us stopped
cold as the user can no longer access the systems when the
load spikes... Logs attached.
System setup info is:
[root@gfs01a ~]# gluster volume info homegfs

Volume Name: homegfs
Type: Distributed-Replicate
Volume

Re: [Gluster-devel] [Gluster-users] heal hanging

2016-01-21 Thread Pranith Kumar Karampuri
** Use 'gluster volume heal homegfs info' until bug is fixed ***

Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
Number of entries: 0

Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
Number of entries: 0

Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
Number of entries: 0

Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
Number of entries: 0

Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
/users/bangell/.gconfd - Is in split-brain

Number of entries: 1

Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
/users/bangell/.gconfd - Is in split-brain

/users/bangell/.gconfd/saved_state
Number of entries: 2

Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
Number of entries: 0

Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
Number of entries: 0




On Thu, Jan 21, 2016 at 11:10 AM, Pranith Kumar Karampuri
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:



On 01/21/2016 09:26 PM, Glomski, Patrick wrote:

I should mention that the problem is not currently occurring
and there are no heals (output appended). By restarting the
gluster services, we can stop the crawl, which lowers the
load for a while. Subsequent crawls seem to finish properly.
For what it's worth, files/folders that show up in the
'volume info' output during a hung crawl don't seem to be
anything out of the ordinary.

Over the past four days, the typical time before the problem
recurs after suppressing it in this manner is an hour. Last
night when we reached out to you was the last time it
happened and the load has been low since (a relief). David
believes that recursively listing the files (ls -alR or
similar) from a client mount can force the issue to happen,
but obviously I'd rather not unless we have some precise
thing we're looking for. Let me know if you'd like me to
attempt to drive the system unstable like that and what I
should look for. As it's a production system, I'd rather not
leave it in this state for long.


Will it be possible to send glustershd, mount logs of the past
4 days? I would like to see if this is because of directory
self-heal going wild (Ravi is working on a throttling feature
for 3.8, which will allow putting brakes on self-heal traffic)

Pranith



[root@gfs01a xattrop]# gluster volume heal homegfs info
Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
Number of entries: 0

Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
Number of entries: 0

Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
Number of entries: 0

Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
Number of entries: 0

Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
Number of entries: 0

Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
Number of entries: 0

Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
Number of entries: 0

Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
Number of entries: 0




On Thu, Jan 21, 2016 at 10:40 AM, Pranith Kumar Karampuri
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:



On 01/21/2016 08:25 PM, Glomski, Patrick wrote:

Hello, Pranith. The typical behavior is that the %cpu on
a glusterfsd process jumps to number of processor cores
available (800% or 1200%, depending on the pair of nodes
involved) and the load average on the machine goes very
high (~20). The volume's heal statistics output shows
that it is crawling one of the bricks and trying to
heal, but this crawl hangs and never seems to finish.

The number of files in the xattrop directory varies over
time, so I ran a wc -l as you requested periodically for
some time and then started including a datestamped list
of the files that were in the xattrops directory on each
brick to see which were persistent. All bricks had files
in the xattrop folder, so all results are attached.

Thanks this info is helpful. I don't see a lot of files.
Could you give output of "gluster volume heal <volname>
info"? Is there any directory in there which is LARGE?

Pranith



Please let me know if there is anything else I can provide.

Patrick


    On Thu, Jan 21, 2016 at 12:01 AM, Pranith Kumar
Karampuri <pkara...@redhat.com
<mailto:pkara...@redhat.com>> wrote:

hey,
   Which process is consuming so much cpu? I
went through the logs you gave me. I see

Re: [Gluster-devel] [Gluster-users] heal hanging

2016-01-21 Thread Pranith Kumar Karampuri



On 01/22/2016 07:19 AM, Pranith Kumar Karampuri wrote:



On 01/22/2016 07:13 AM, Glomski, Patrick wrote:
We use the samba glusterfs virtual filesystem (the current version 
provided on download.gluster.org <http://download.gluster.org>), but 
no windows clients connecting directly.


Hmm.. Is there a way to disable using this and check if the CPU% still 
increases? What getxattr of "glusterfs.get_real_filename <file-name>"
does is to scan the entire directory looking for strcasecmp(<entry-name>,
<file-name>). If anything matches then it will return the
<entry-name>. But the problem is the scan is costly. So I wonder 
if this is the reason for the CPU spikes.

+Raghavendra Talur, +Poornima

Raghavendra, Poornima,
When are these getxattrs triggered? Did you guys see any 
brick CPU spikes before? I initially thought it could be because of big 
directory heals. But this is happening even when no self-heals are 
required. So I had to move away from that theory.


Pranith


Pranith


On Thu, Jan 21, 2016 at 8:37 PM, Pranith Kumar Karampuri 
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:


Do you have any windows clients? I see a lot of getxattr calls
for "glusterfs.get_real_filename" which lead to full readdirs of
the directories on the brick.

Pranith

On 01/22/2016 12:51 AM, Glomski, Patrick wrote:

Pranith, could this kind of behavior be self-inflicted by us
deleting files directly from the bricks? We have done that in
the past to clean up an issues where gluster wouldn't allow us
to delete from the mount.

If so, is it feasible to clean them up by running a search on
the .glusterfs directories directly and removing files with a
reference count of 1 that are non-zero size (or directly
checking the xattrs to be sure that it's not a DHT link).

find /data/brick01a/homegfs/.glusterfs -type f -not -empty
-links -2 -exec rm -f "{}" \;

Is there anything I'm inherently missing with that approach that
will further corrupt the system?
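
As a safer first pass, a sketch that only lists candidates and skips anything
carrying the DHT linkto xattr (xattr name assumed to be
trusted.glusterfs.dht.linkto; review the output before deleting anything):

find /data/brick01a/homegfs/.glusterfs -type f -not -empty -links -2 -print0 |
while IFS= read -r -d '' f; do
    # skip DHT link-to files; everything else is only reported, not removed
    if getfattr --absolute-names -d -m . -e hex "$f" 2>/dev/null | grep -q 'dht.linkto'; then
        continue
    fi
    echo "candidate: $f"
done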


On Thu, Jan 21, 2016 at 1:02 PM, Glomski, Patrick
<patrick.glom...@corvidtec.com
<mailto:patrick.glom...@corvidtec.com>> wrote:

Load spiked again: ~1200%cpu on gfs02a for glusterfsd. Crawl
has been running on one of the bricks on gfs02b for 25 min
or so and users cannot access the volume.

I re-listed the xattrop directories as well as a 'top' entry
and heal statistics. Then I restarted the gluster services
on gfs02a.

=== top ===
PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+
COMMAND
 8969 root  20   0 2815m 204m 3588 S 1181.0  0.6
591:06.93 glusterfsd

=== xattrop ===
/data/brick01a/homegfs/.glusterfs/indices/xattrop:
xattrop-41f19453-91e4-437c-afa9-3b25614de210
xattrop-9b815879-2f4d-402b-867c-a6d65087788c

/data/brick02a/homegfs/.glusterfs/indices/xattrop:
xattrop-70131855-3cfb-49af-abce-9d23f57fb393
xattrop-dfb77848-a39d-4417-a725-9beca75d78c6

/data/brick01b/homegfs/.glusterfs/indices/xattrop:
e6e47ed9-309b-42a7-8c44-28c29b9a20f8
xattrop-5c797a64-bde7-4eac-b4fc-0befc632e125
xattrop-38ec65a1-00b5-4544-8a6c-bf0f531a1934
xattrop-ef0980ad-f074-4163-979f-16d5ef85b0a0

/data/brick02b/homegfs/.glusterfs/indices/xattrop:
xattrop-7402438d-0ee7-4fcf-b9bb-b561236f99bc
xattrop-8ffbf5f7-ace3-497d-944e-93ac85241413

/data/brick01a/homegfs/.glusterfs/indices/xattrop:
xattrop-0115acd0-caae-4dfd-b3b4-7cc42a0ff531

/data/brick02a/homegfs/.glusterfs/indices/xattrop:
xattrop-7e20fdb1-5224-4b9a-be06-568708526d70

/data/brick01b/homegfs/.glusterfs/indices/xattrop:
8034bc06-92cd-4fa5-8aaf-09039e79d2c8
c9ce22ed-6d8b-471b-a111-b39e57f0b512
94fa1d60-45ad-4341-b69c-315936b51e8d
xattrop-9c04623a-64ce-4f66-8b23-dbaba49119c7

/data/brick02b/homegfs/.glusterfs/indices/xattrop:
xattrop-b8c8f024-d038-49a2-9a53-c54ead09111d


=== heal stats ===

homegfs [b0-gfsib01a] : Starting time of crawl   : Thu
Jan 21 12:36:45 2016
homegfs [b0-gfsib01a] : Ending time of crawl : Thu
Jan 21 12:36:45 2016
homegfs [b0-gfsib01a] : Type of crawl: INDEX
homegfs [b0-gfsib01a] : No. of entries healed: 0
homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
homegfs [b0-gfsib01a] : No. of heal failed entries   : 0

homegfs [b1-gfsib01b] : Starting time of crawl   : Thu
Jan 21 12:36:19 2016
homegfs [b1-gfsib01b] : Ending time of crawl : Thu
Jan 21 12:36:19 2016
homegfs [b1-gfsib01b] : Type of crawl: INDEX
homegfs [b1-gfsib01b] : No. of entries healed   

Re: [Gluster-devel] [Gluster-users] heal hanging

2016-01-21 Thread Pranith Kumar Karampuri



On 01/22/2016 07:13 AM, Glomski, Patrick wrote:
We use the samba glusterfs virtual filesystem (the current version 
provided on download.gluster.org <http://download.gluster.org>), but 
no windows clients connecting directly.


Hmm.. Is there a way to disable using this and check if the CPU% still 
increases? What getxattr of "glusterfs.get_real_filename <file-name>" does 
is to scan the entire directory looking for strcasecmp(<entry-name>,
<file-name>). If anything matches then it will return the
<entry-name>. But the problem is the scan is costly. So I wonder 
if this is the reason for the CPU spikes.


Pranith


On Thu, Jan 21, 2016 at 8:37 PM, Pranith Kumar Karampuri 
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:


Do you have any windows clients? I see a lot of getxattr calls for
"glusterfs.get_real_filename" which lead to full readdirs of the
directories on the brick.

Pranith

On 01/22/2016 12:51 AM, Glomski, Patrick wrote:

Pranith, could this kind of behavior be self-inflicted by us
deleting files directly from the bricks? We have done that in the
past to clean up an issue where gluster wouldn't allow us to
delete from the mount.

If so, is it feasible to clean them up by running a search on the
.glusterfs directories directly and removing files with a
reference count of 1 that are non-zero size (or directly checking
the xattrs to be sure that it's not a DHT link).

find /data/brick01a/homegfs/.glusterfs -type f -not -empty -links
-2 -exec rm -f "{}" \;

Is there anything I'm inherently missing with that approach that
will further corrupt the system?


On Thu, Jan 21, 2016 at 1:02 PM, Glomski, Patrick
<patrick.glom...@corvidtec.com
<mailto:patrick.glom...@corvidtec.com>> wrote:

Load spiked again: ~1200%cpu on gfs02a for glusterfsd. Crawl
has been running on one of the bricks on gfs02b for 25 min or
so and users cannot access the volume.

I re-listed the xattrop directories as well as a 'top' entry
and heal statistics. Then I restarted the gluster services on
gfs02a.

=== top ===
PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+
COMMAND
 8969 root  20   0 2815m 204m 3588 S 1181.0  0.6
591:06.93 glusterfsd

=== xattrop ===
/data/brick01a/homegfs/.glusterfs/indices/xattrop:
xattrop-41f19453-91e4-437c-afa9-3b25614de210
xattrop-9b815879-2f4d-402b-867c-a6d65087788c

/data/brick02a/homegfs/.glusterfs/indices/xattrop:
xattrop-70131855-3cfb-49af-abce-9d23f57fb393
xattrop-dfb77848-a39d-4417-a725-9beca75d78c6

/data/brick01b/homegfs/.glusterfs/indices/xattrop:
e6e47ed9-309b-42a7-8c44-28c29b9a20f8
xattrop-5c797a64-bde7-4eac-b4fc-0befc632e125
xattrop-38ec65a1-00b5-4544-8a6c-bf0f531a1934
xattrop-ef0980ad-f074-4163-979f-16d5ef85b0a0

/data/brick02b/homegfs/.glusterfs/indices/xattrop:
xattrop-7402438d-0ee7-4fcf-b9bb-b561236f99bc
xattrop-8ffbf5f7-ace3-497d-944e-93ac85241413

/data/brick01a/homegfs/.glusterfs/indices/xattrop:
xattrop-0115acd0-caae-4dfd-b3b4-7cc42a0ff531

/data/brick02a/homegfs/.glusterfs/indices/xattrop:
xattrop-7e20fdb1-5224-4b9a-be06-568708526d70

/data/brick01b/homegfs/.glusterfs/indices/xattrop:
8034bc06-92cd-4fa5-8aaf-09039e79d2c8
c9ce22ed-6d8b-471b-a111-b39e57f0b512
94fa1d60-45ad-4341-b69c-315936b51e8d
xattrop-9c04623a-64ce-4f66-8b23-dbaba49119c7

/data/brick02b/homegfs/.glusterfs/indices/xattrop:
xattrop-b8c8f024-d038-49a2-9a53-c54ead09111d


=== heal stats ===

homegfs [b0-gfsib01a] : Starting time of crawl   : Thu
Jan 21 12:36:45 2016
homegfs [b0-gfsib01a] : Ending time of crawl : Thu
Jan 21 12:36:45 2016
homegfs [b0-gfsib01a] : Type of crawl: INDEX
homegfs [b0-gfsib01a] : No. of entries healed: 0
homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
homegfs [b0-gfsib01a] : No. of heal failed entries   : 0

homegfs [b1-gfsib01b] : Starting time of crawl   : Thu
Jan 21 12:36:19 2016
homegfs [b1-gfsib01b] : Ending time of crawl : Thu
Jan 21 12:36:19 2016
homegfs [b1-gfsib01b] : Type of crawl: INDEX
homegfs [b1-gfsib01b] : No. of entries healed: 0
homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
homegfs [b1-gfsib01b] : No. of heal failed entries   : 1

homegfs [b2-gfsib01a] : Starting time of crawl   : Thu
Jan 21 12:36:48 2016
homegfs [b2-gfsib01a] : Ending time of crawl : Thu
Jan 21 12:36:48 2016
homegfs [b2-gfsib01a] : Type of crawl: INDEX

Re: [Gluster-devel] [Gluster-users] heal hanging

2016-01-21 Thread Pranith Kumar Karampuri
homegfs info' until bug is fixed ***

Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
Number of entries: 0

Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
Number of entries: 0

Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
Number of entries: 0

Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
Number of entries: 0

Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
/users/bangell/.gconfd - Is in split-brain

Number of entries: 1

Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
/users/bangell/.gconfd - Is in split-brain

/users/bangell/.gconfd/saved_state
Number of entries: 2

Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
Number of entries: 0

Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
Number of entries: 0




On Thu, Jan 21, 2016 at 11:10 AM, Pranith Kumar Karampuri
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:



On 01/21/2016 09:26 PM, Glomski, Patrick wrote:

I should mention that the problem is not currently occurring
and there are no heals (output appended). By restarting the
gluster services, we can stop the crawl, which lowers the
load for a while. Subsequent crawls seem to finish properly.
For what it's worth, files/folders that show up in the
'volume info' output during a hung crawl don't seem to be
anything out of the ordinary.

Over the past four days, the typical time before the problem
recurs after suppressing it in this manner is an hour. Last
night when we reached out to you was the last time it
happened and the load has been low since (a relief). David
believes that recursively listing the files (ls -alR or
similar) from a client mount can force the issue to happen,
but obviously I'd rather not unless we have some precise
thing we're looking for. Let me know if you'd like me to
attempt to drive the system unstable like that and what I
should look for. As it's a production system, I'd rather not
leave it in this state for long.


Will it be possible to send glustershd, mount logs of the past
4 days? I would like to see if this is because of directory
self-heal going wild (Ravi is working on a throttling feature
for 3.8, which will allow putting brakes on self-heal traffic)

Pranith



[root@gfs01a xattrop]# gluster volume heal homegfs info
Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
Number of entries: 0

Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
Number of entries: 0

Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
Number of entries: 0

Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
Number of entries: 0

Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
Number of entries: 0

Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
Number of entries: 0

Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
Number of entries: 0

Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
Number of entries: 0




    On Thu, Jan 21, 2016 at 10:40 AM, Pranith Kumar Karampuri
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:



On 01/21/2016 08:25 PM, Glomski, Patrick wrote:

Hello, Pranith. The typical behavior is that the %cpu on
a glusterfsd process jumps to number of processor cores
available (800% or 1200%, depending on the pair of nodes
involved) and the load average on the machine goes very
high (~20). The volume's heal statistics output shows
that it is crawling one of the bricks and trying to
heal, but this crawl hangs and never seems to finish.

The number of files in the xattrop directory varies over
time, so I ran a wc -l as you requested periodically for
some time and then started including a datestamped list
of the files that were in the xattrops directory on each
brick to see which were persistent. All bricks had files
in the xattrop folder, so all results are attached.

Thanks this info is helpful. I don't see a lot of files.
Could you give output of "gluster volume heal <volname>
info"? Is there any directory in there which is LARGE?

Pranith



Please let me know if there is anything else I can provide.

        Patrick


    On Thu, Jan 21, 2016 at 12:01 AM, Pranith Kumar
Karampuri <pkara...@redhat.com
<mailto:pkara...@redhat.com>> wrote:

hey,
   Which process is consuming so much cpu? I
went through the logs you gave me. I see that the
following files are 

Re: [Gluster-devel] [Gluster-users] heal hanging

2016-01-21 Thread Pranith Kumar Karampuri



On 01/22/2016 07:25 AM, Glomski, Patrick wrote:
Unfortunately, all samba mounts to the gluster volume through the 
gfapi vfs plugin have been disabled for the last 6 hours or so and 
the frequency of %cpu spikes has increased. We had switched to sharing a 
fuse mount through samba, but I just disabled that as well. There are 
no samba shares of this volume now. The spikes now happen every thirty 
minutes or so. We've resorted to just rebooting the machine with high 
load for the present.


Next time this CPU spike happens, could you collect samples of gstack 
<pid> every second for 10-20 seconds? That helps in finding the 
heavily hit function calls.


Something like "for i in {1..20}; do gstack <pid> > 
sample-$i.txt; sleep 1; done"


Pranith


On Thu, Jan 21, 2016 at 8:49 PM, Pranith Kumar Karampuri 
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:




On 01/22/2016 07:13 AM, Glomski, Patrick wrote:

We use the samba glusterfs virtual filesystem (the current
version provided on download.gluster.org
<http://download.gluster.org>), but no windows clients connecting
directly.


Hmm.. Is there a way to disable using this and check if the CPU%
still increases? What getxattr of "glusterfs.get_real_filename
<file-name>" does is to scan the entire directory looking for
strcasecmp(<entry-name>, <file-name>). If anything matches
then it will return the <entry-name>. But the problem is the
scan is costly. So I wonder if this is the reason for the CPU spikes.

Pranith



    On Thu, Jan 21, 2016 at 8:37 PM, Pranith Kumar Karampuri
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:

Do you have any windows clients? I see a lot of getxattr
calls for "glusterfs.get_real_filename" which lead to full
readdirs of the directories on the brick.

Pranith

On 01/22/2016 12:51 AM, Glomski, Patrick wrote:

Pranith, could this kind of behavior be self-inflicted by us
deleting files directly from the bricks? We have done that
in the past to clean up an issue where gluster wouldn't
allow us to delete from the mount.

If so, is it feasible to clean them up by running a search
on the .glusterfs directories directly and removing files
with a reference count of 1 that are non-zero size (or
directly checking the xattrs to be sure that it's not a DHT
link).

find /data/brick01a/homegfs/.glusterfs -type f -not -empty
-links -2 -exec rm -f "{}" \;

Is there anything I'm inherently missing with that approach
that will further corrupt the system?


On Thu, Jan 21, 2016 at 1:02 PM, Glomski, Patrick
<patrick.glom...@corvidtec.com
<mailto:patrick.glom...@corvidtec.com>> wrote:

Load spiked again: ~1200%cpu on gfs02a for glusterfsd.
Crawl has been running on one of the bricks on gfs02b
for 25 min or so and users cannot access the volume.

I re-listed the xattrop directories as well as a 'top'
entry and heal statistics. Then I restarted the gluster
services on gfs02a.

=== top ===
PID USER  PR  NI VIRT  RES  SHR S %CPU %MEMTIME+
COMMAND
 8969 root  20   0 2815m 204m 3588 S 1181.0 0.6
591:06.93 glusterfsd

=== xattrop ===
/data/brick01a/homegfs/.glusterfs/indices/xattrop:
xattrop-41f19453-91e4-437c-afa9-3b25614de210
xattrop-9b815879-2f4d-402b-867c-a6d65087788c

/data/brick02a/homegfs/.glusterfs/indices/xattrop:
xattrop-70131855-3cfb-49af-abce-9d23f57fb393
xattrop-dfb77848-a39d-4417-a725-9beca75d78c6

/data/brick01b/homegfs/.glusterfs/indices/xattrop:
e6e47ed9-309b-42a7-8c44-28c29b9a20f8
xattrop-5c797a64-bde7-4eac-b4fc-0befc632e125
xattrop-38ec65a1-00b5-4544-8a6c-bf0f531a1934
xattrop-ef0980ad-f074-4163-979f-16d5ef85b0a0

/data/brick02b/homegfs/.glusterfs/indices/xattrop:
xattrop-7402438d-0ee7-4fcf-b9bb-b561236f99bc
xattrop-8ffbf5f7-ace3-497d-944e-93ac85241413

/data/brick01a/homegfs/.glusterfs/indices/xattrop:
xattrop-0115acd0-caae-4dfd-b3b4-7cc42a0ff531

/data/brick02a/homegfs/.glusterfs/indices/xattrop:
xattrop-7e20fdb1-5224-4b9a-be06-568708526d70

/data/brick01b/homegfs/.glusterfs/indices/xattrop:
8034bc06-92cd-4fa5-8aaf-09039e79d2c8
c9ce22ed-6d8b-471b-a111-b39e57f0b512
94fa1d60-45ad-4341-b69c-315936b51e8d
xattrop-9c04623a-64ce-4f66-8b23-dbaba49119c7

/data/brick02b/homegfs/.glusterfs/indices/xattrop:
xattrop-b8c8f024-d038-49a2-9a53-c54ead09111d



Re: [Gluster-devel] [Gluster-users] heal hanging

2016-01-21 Thread Pranith Kumar Karampuri



On 01/22/2016 07:25 AM, Glomski, Patrick wrote:
Unfortunately, all samba mounts to the gluster volume through the 
gfapi vfs plugin have been disabled for the last 6 hours or so and 
the frequency of %cpu spikes has increased. We had switched to sharing a 
fuse mount through samba, but I just disabled that as well. There are 
no samba shares of this volume now. The spikes now happen every thirty 
minutes or so. We've resorted to just rebooting the machine with high 
load for the present.


Could you see if the logs of following type are not at all coming?
[2016-01-21 15:13:00.005736] E 
[server-rpc-fops.c:768:server_getxattr_cbk] 0-homegfs-server: 110: 
GETXATTR /wks_backup (40e582d6-b0c7-4099-ba88-9168a3c32ca6) 
(glusterfs.get_real_filename:desktop.ini) ==> (Permission denied)

These are operations that failed. Operations that succeed are the ones 
that will scan the directory. But I don't have a way to find them other 
than using tcpdumps.


At the moment I have 2 theories:
1) these get_real_filename calls
2) [2016-01-21 16:10:38.017828] E [server-helpers.c:46:gid_resolve] 
0-gid-cache: getpwuid_r(494) failed

"

Yessir they are.  Normally, sssd would look to the local cache file in 
/var/lib/sss/db/ first, to get any group or userid information, then go 
out to the domain controller.  I put the options that we are using on 
our GFS volumes below…  Thanks for your help.


We had been running sssd with sssd_nss and sssd_be sub-processes on 
these systems for a long time, under the GFS 3.5.2 code, and not run 
into the problem that David described with the high cpu usage on sssd_nss.


*"
*That was Tom Young's email 1.5 years back when we debugged it. But the 
process which was consuming lot of cpu is sssd_nss. So I am not sure if 
it is same issue. Let us debug to see '1)' doesn't happen. The gstack 
traces I asked for should also help.
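
A quick sanity check for theory 2 on an affected node is simply to time the
uid lookup the log complains about, e.g.:

time getent passwd 494    # the uid from the gid_resolve error above; a slow or empty reply points at sssd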


Pranith


On Thu, Jan 21, 2016 at 8:49 PM, Pranith Kumar Karampuri 
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:




On 01/22/2016 07:13 AM, Glomski, Patrick wrote:

We use the samba glusterfs virtual filesystem (the current
version provided on download.gluster.org
<http://download.gluster.org>), but no windows clients connecting
directly.


Hmm.. Is there a way to disable using this and check if the CPU%
still increases? What getxattr of "glusterfs.get_real_filename
<file-name>" does is to scan the entire directory looking for
strcasecmp(<entry-name>, <file-name>). If anything matches
then it will return the <entry-name>. But the problem is the
scan is costly. So I wonder if this is the reason for the CPU spikes.

Pranith



    On Thu, Jan 21, 2016 at 8:37 PM, Pranith Kumar Karampuri
<pkara...@redhat.com <mailto:pkara...@redhat.com>> wrote:

Do you have any windows clients? I see a lot of getxattr
calls for "glusterfs.get_real_filename" which lead to full
readdirs of the directories on the brick.

Pranith

On 01/22/2016 12:51 AM, Glomski, Patrick wrote:

Pranith, could this kind of behavior be self-inflicted by us
deleting files directly from the bricks? We have done that
in the past to clean up an issue where gluster wouldn't
allow us to delete from the mount.

If so, is it feasible to clean them up by running a search
on the .glusterfs directories directly and removing files
with a reference count of 1 that are non-zero size (or
directly checking the xattrs to be sure that it's not a DHT
link).

find /data/brick01a/homegfs/.glusterfs -type f -not -empty
-links -2 -exec rm -f "{}" \;

Is there anything I'm inherently missing with that approach
that will further corrupt the system?


On Thu, Jan 21, 2016 at 1:02 PM, Glomski, Patrick
<patrick.glom...@corvidtec.com
<mailto:patrick.glom...@corvidtec.com>> wrote:

Load spiked again: ~1200%cpu on gfs02a for glusterfsd.
Crawl has been running on one of the bricks on gfs02b
for 25 min or so and users cannot access the volume.

I re-listed the xattrop directories as well as a 'top'
entry and heal statistics. Then I restarted the gluster
services on gfs02a.

=== top ===
PID USER  PR  NI VIRT  RES  SHR S %CPU %MEMTIME+
COMMAND
 8969 root  20   0 2815m 204m 3588 S 1181.0 0.6
591:06.93 glusterfsd

=== xattrop ===
/data/brick01a/homegfs/.glusterfs/indices/xattrop:
xattrop-41f19453-91e4-437c-afa9-3b25614de210
xattrop-9b815879-2f4d-402b-867c-a6d65087788c

/data/brick02a/homegfs/.glusterfs/indices/xattrop:
xattrop-70131855-3cfb-49af-abce-9d23f57fb393
xattrop-dfb77848-a39d-

Re: [Gluster-devel] 答复: Re: Gluster AFR volume write performance has been seriously affected by GLUSTERFS_WRITE_IS_APPEND in afr_writev

2016-01-22 Thread Pranith Kumar Karampuri



On 01/22/2016 07:14 AM, li.ping...@zte.com.cn wrote:

Hi Pranith, it is appreciated for your reply.

Pranith Kumar Karampuri <pkara...@redhat.com> wrote on 2016/01/20 18:51:19:

> From:  Pranith Kumar Karampuri <pkara...@redhat.com>
> To:  li.ping...@zte.com.cn, gluster-devel@gluster.org,
> Date:  2016/01/20 18:51
> Subject: Re: [Gluster-devel] Gluster AFR volume write performance has
> been seriously affected by GLUSTERFS_WRITE_IS_APPEND in afr_writev
>
> Sorry for the delay in response.

> On 01/15/2016 02:34 PM, li.ping...@zte.com.cn wrote:
> GLUSTERFS_WRITE_IS_APPEND Setting in afr_writev function at
> glusterfs client end makes the posix_writev in the server end  deal
> IO write fops from parallel  to serial in consequence.
>
> i.e.  multiple io-worker threads carrying out IO write fops are
> blocked in posix_writev to execute final write fop pwrite/pwritev in
> __posix_writev function ONE AFTER ANOTHER.
>
> For example:
>
> thread1: iot_worker -> ...  -> posix_writev()   |
> thread2: iot_worker -> ...  -> posix_writev()   |
> thread3: iot_worker -> ...  -> posix_writev() -> __posix_writev()
> thread4: iot_worker -> ...  -> posix_writev()   |
>
> there are 4 iot_worker thread doing the 128KB IO write fops as
> above, but only one can execute __posix_writev function and the
> others have to wait.
>
> however, if the afr volume is configured on with storage.linux-aio
> which is off in default,  the iot_worker will use posix_aio_writev
> instead of posix_writev to write data.
> the posix_aio_writev function won't be affected by
> GLUSTERFS_WRITE_IS_APPEND, and the AFR volume write performance goes 
up.

> I think this is a bug :-(.

Yeah, I agree with you. I suppose the GLUSTERFS_WRITE_IS_APPEND is a 
misuse in afr_writev.
I checked the original intent of the GLUSTERFS_WRITE_IS_APPEND change at 
the review website:

http://review.gluster.org/#/c/5501/

The initial purpose seems to be to avoid an unnecessary fsync() in the
afr_changelog_post_op_safe function if the writing data position
was currently at the end of the file, detected by
(preop.ia_size == offset || (fd->flags & O_APPEND)) in posix_writev.

In comparison with the afr write performance loss, I think
it costs too much.

I suggest making the GLUSTERFS_WRITE_IS_APPEND setting configurable,
just as ensure-durability is in afr.


You are right, it doesn't make sense to put this option in dictionary if 
ensure-durability is off. http://review.gluster.org/13285 addresses 
this. Do you want to try this out?
Thanks for doing most of the work :-). Do let me know if you want to 
raise a bug for this. Or I can take that up if you don't have time.
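
If you want to try it, a rough before/after comparison could look like this
(assuming the knob is exposed as cluster.ensure-durability, and reusing the
fio job quoted below):

gluster volume set <volname> cluster.ensure-durability off
fio --filename=/mnt/afr/20G.dat --direct=1 --rw=write --bs=128k --size=20G \
    --numjobs=8 --runtime=60 --group_reporting --name=afr_test --iodepth=1 --ioengine=libaio
gluster volume set <volname> cluster.ensure-durability on   # then re-run fio to compare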


Pranith


>
> So, my question is whether  AFR volume could work fine with
> storage.linux-aio configuration which bypass the
> GLUSTERFS_WRITE_IS_APPEND setting in afr_writev,
> and why glusterfs keeps posix_aio_writev different from posix_writev ?
>
> Any replies to clear my confusion would be grateful, and thanks in 
advance.

> What is the workload you have? multiple writers on same file workloads?

I test the afr gluster volume by fio like this:
fio --filename=/mnt/afr/20G.dat --direct=1 --rw=write --bs=128k 
--size=20G --numjobs=8
--runtime=60 --group_reporting --name=afr_test  --iodepth=1 
--ioengine=libaio


The Glusterfs BRICKS are two IBM X3550 M3.

The local disk direct-write performance for a 128KB IO request block size is 
about 18MB/s in a single thread and 80MB/s with 8 threads.

If GLUSTERFS_WRITE_IS_APPEND is in effect, the afr gluster volume 
write performance is 18MB/s, the same as the single-thread case; if not, 
the performance is nearly 75MB/s (network bandwidth is sufficient).


>
> Pranith
>
>
> 
> ZTE Information Security Notice: The information contained in this
> mail (and any attachment transmitted herewith) is privileged and
> confidential and is intended for the exclusive use of the addressee
> (s).  If you are not an intended recipient, any disclosure,
> reproduction, distribution or other dissemination or use of the
> information contained is strictly prohibited.  If you have received
> this mail in error, please delete it and notify us immediately.
>

>
>

> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel



[Gluster-devel] Netbsd regressions are failing because of connection problems?

2016-01-20 Thread Pranith Kumar Karampuri

/origin/*
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git -c core.askpass=true fetch 
--tags --progress git://review.gluster.org/glusterfs.git 
+refs/heads/*:refs/remotes/origin/*" returned status code 128:

stdout:
stderr: fatal: unable to connect to review.gluster.org:
review.gluster.org[0: 184.107.76.10]: errno=Connection refused


at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1640)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1388)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$300(CliGitAPIImpl.java:62)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:313)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl$2.execute(CliGitAPIImpl.java:505)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:152)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:145)

at hudson.remoting.UserRequest.perform(UserRequest.java:120)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)

https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/13574/console

Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Feature: Automagic lock-revocation for features/locks xlator (v3.7.x)

2016-01-24 Thread Pranith Kumar Karampuri



On 01/25/2016 02:17 AM, Richard Wareing wrote:

Hello all,

Just gave a talk at SCaLE 14x today and I mentioned our new locks 
revocation feature which has had a significant impact on our GFS 
cluster reliability.  As such I wanted to share the patch with the 
community, so here's the bugzilla report:


https://bugzilla.redhat.com/show_bug.cgi?id=1301401

=
Summary:
Mis-behaving brick clients (gNFSd, FUSE, gfAPI) can cause cluster 
instability and eventual complete unavailability due to failures in 
releasing entry/inode locks in a timely manner.


Classic symptoms of this are increased brick (and/or gNFSd) memory 
usage due to the high number of (lock request) frames piling up in the 
processes.  The failure-mode results in bricks eventually slowing down 
to a crawl due to swapping, or OOMing due to complete memory 
exhaustion; during this period the entire cluster can begin to fail. 
 End-users will experience this as hangs on the filesystem, first in a 
specific region of the file-system and ultimately the entire 
filesystem as the offending brick begins to turn into a zombie (i.e. 
not quite dead, but not quite alive either).


Currently, these situations must be handled by an administrator 
detecting & intervening via the "clear-locks" CLI command. 
 Unfortunately this doesn't scale for large numbers of clusters, and 
it depends on the correct (external) detection of the locks piling up 
(for which there is little signal other than state dumps).


This patch introduces two features to remedy this situation:

1. Monkey-unlocking - This is a feature targeted at developers (only!) 
to help track down crashes due to stale locks, and prove the utility 
of he lock revocation feature.  It does this by silently dropping 1% 
of unlock requests; simulating bugs or mis-behaving clients.


The feature is activated via:
features.locks-monkey-unlocking [on/off]

You'll see the message
"[] W [inodelk.c:653:pl_inode_setlk] 0-groot-locks: MONKEY 
LOCKING (forcing stuck lock)!" ... in the logs indicating a request 
has been dropped.


2. Lock revocation - Once enabled, this feature will revoke a 
*contended* lock (i.e. if nobody else asks for the lock, we will not 
revoke it) either by the amount of time the lock has been held, how 
many other lock requests are waiting on the lock to be freed, or some 
combination of both.  Clients which are losing their locks will be 
notified by receiving EAGAIN (sent back to their callback function).


The feature is activated via these options:
features.locks-revocation-secs <seconds>
features.locks-revocation-clear-all [on/off]
features.locks-revocation-max-blocked <count>

Recommended settings are: 1800 seconds for a time-based timeout (give 
clients the benefit of the doubt). Choosing a max-blocked value requires some 
experimentation depending on your workload, but generally values of 
hundreds to low thousands work (it's normal for many tens of locks to be 
taken out when files are being written @ high throughput).
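
Assuming the patch is applied and the options land under exactly the names
above, wiring it up would look something like this (values illustrative):

gluster volume set <volname> features.locks-revocation-secs 1800
gluster volume set <volname> features.locks-revocation-max-blocked 500
gluster volume set <volname> features.locks-revocation-clear-all off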


I really like this feature. One question though: self-heal and rebalance 
domain locks are active until self-heal/rebalance is complete, which can 
take more than 30 minutes if the files are in TBs. I will try to see 
what we can do to handle these without increasing the revocation-secs 
too much. Maybe we can come up with per-domain revocation timeouts. 
Comments are welcome.


Pranith


=

The patch supplied will apply cleanly to the v3.7.6 release tag, and 
probably to any 3.7.x release & master (the posix locks xlator is rarely 
touched).


Richard





___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] 3.7.7 update

2016-01-19 Thread Pranith Kumar Karampuri
https://public.pad.fsfe.org/p/glusterfs-3.7.7 is the final list of 
patches I am waiting for before making 3.7.7 release.


Please let me know if I need to wait for any other patches. It would be 
great if we make the tag tomorrow.


Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Few details needed about *any* recent or upcoming feature

2016-01-20 Thread Pranith Kumar Karampuri

http://www.gluster.org/pipermail/gluster-devel/2015-September/046773.html

Pranith

On 01/20/2016 04:11 PM, Niels de Vos wrote:

Hi all,

on Saturday the 30th of January I am scheduled to give a presentation
titled "Gluster roadmap, recent improvements and upcoming features":

   https://fosdem.org/2016/schedule/event/gluster_roadmap/

I would like to ask from all feature owners/developers to reply to this
email with a short description and a few keywords about their features.
My plan is to have at most one slide for each feature, so keep it short.

Thanks,
Niels


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Gluster AFR volume write performance has been seriously affected by GLUSTERFS_WRITE_IS_APPEND in afr_writev

2016-01-20 Thread Pranith Kumar Karampuri

Sorry for the delay in response.

On 01/15/2016 02:34 PM, li.ping...@zte.com.cn wrote:
The GLUSTERFS_WRITE_IS_APPEND setting in the afr_writev function at the glusterfs 
client end makes posix_writev at the server end handle IO write 
fops serially instead of in parallel.


i.e.  multiple io-worker threads carrying out IO write fops are 
blocked in posix_writev to execute final write fop pwrite/pwritev in 
__posix_writev function ONE AFTER ANOTHER.


For example:

thread1: iot_worker -> ...  -> posix_writev()   |
thread2: iot_worker -> ...  -> posix_writev()   |
thread3: iot_worker -> ...  -> posix_writev()   -> __posix_writev()
thread4: iot_worker -> ...  -> posix_writev()   |

there are 4 iot_worker thread doing the 128KB IO write fops as above, 
but only one can execute __posix_writev function and the others have 
to wait.


however, if the afr volume is configured with storage.linux-aio 
(which is off by default), the iot_worker will use posix_aio_writev 
instead of posix_writev to write data.
the posix_aio_writev function won't be affected by 
GLUSTERFS_WRITE_IS_APPEND, and the AFR volume write performance goes up.

I think this is a bug :-(.


So, my question is whether  AFR volume could work fine with 
storage.linux-aio configuration which bypass the 
GLUSTERFS_WRITE_IS_APPEND setting in afr_writev,

and why glusterfs keeps posix_aio_writev different from posix_writev ?

Any replies to clear my confusion would be grateful, and thanks in 
advance.

What is the workload you have? multiple writers on same file workloads?

Pranith




ZTE Information Security Notice: The information contained in this mail (and 
any attachment transmitted herewith) is privileged and confidential and is 
intended for the exclusive use of the addressee(s).  If you are not an intended 
recipient, any disclosure, reproduction, distribution or other dissemination or 
use of the information contained is strictly prohibited.  If you have received 
this mail in error, please delete it and notify us immediately.




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file

2016-01-27 Thread Pranith Kumar Karampuri

Hi,
  If the hang appears on enabling client side io-threads then it 
could be because of some race that is seen when io-threads is enabled on 
the client side. 2 things will help us debug this issue:

1) thread apply all bt inside gdb (with debuginfo rpms/debs installed)
2) Complete statedump of the mount at two intervals, preferably 10 
seconds apart. It becomes difficult to find out which ones are stuck vs 
the ones that are on-going when we have just one statedump. If we have 
two, we can find which frames are common in both of the statedumps and 
then take a closer look there.
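
A minimal sketch for taking the two statedumps, assuming a FUSE mount whose
client process can be located via its mount point and that dumps land in the
default /var/run/gluster/ directory:

pid=$(pgrep -f "glusterfs.*<mount-point>")   # <mount-point> is a placeholder
kill -USR1 "$pid"
sleep 10
kill -USR1 "$pid"
ls -lt /var/run/gluster/ | head              # the two newest dump files are the ones to send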


Feel free to ping me on #gluster-dev, nick: pranithk, if you have the 
process hung in that state and you don't mind me doing a live 
debugging session with you. This option is the best of the lot!


Thanks a lot baul, Oleksandr for the debugging so far!

Pranith

On 01/25/2016 01:03 PM, baul jianguo wrote:

3.5.7 also hangs.only the flush op hung. Yes,off the
performance.client-io-threads ,no hang.

The hang does not relate the client kernel version.

One client statdump about flush op,any abnormal?

[global.callpool.stack.12]

uid=0

gid=0

pid=14432

unique=16336007098

lk-owner=77cb199aa36f3641

op=FLUSH

type=1

cnt=6



[global.callpool.stack.12.frame.1]

ref_count=1

translator=fuse

complete=0



[global.callpool.stack.12.frame.2]

ref_count=0

translator=datavolume-write-behind

complete=0

parent=datavolume-read-ahead

wind_from=ra_flush

wind_to=FIRST_CHILD (this)->fops->flush

unwind_to=ra_flush_cbk



[global.callpool.stack.12.frame.3]

ref_count=1

translator=datavolume-read-ahead

complete=0

parent=datavolume-open-behind

wind_from=default_flush_resume

wind_to=FIRST_CHILD(this)->fops->flush

unwind_to=default_flush_cbk



[global.callpool.stack.12.frame.4]

ref_count=1

translator=datavolume-open-behind

complete=0

parent=datavolume-io-threads

wind_from=iot_flush_wrapper

wind_to=FIRST_CHILD(this)->fops->flush

unwind_to=iot_flush_cbk



[global.callpool.stack.12.frame.5]

ref_count=1

translator=datavolume-io-threads

complete=0

parent=datavolume

wind_from=io_stats_flush

wind_to=FIRST_CHILD(this)->fops->flush

unwind_to=io_stats_flush_cbk



[global.callpool.stack.12.frame.6]

ref_count=1

translator=datavolume

complete=0

parent=fuse

wind_from=fuse_flush_resume

wind_to=xl->fops->flush

unwind_to=fuse_err_cbk



On Sun, Jan 24, 2016 at 5:35 AM, Oleksandr Natalenko
<oleksa...@natalenko.name> wrote:

With "performance.client-io-threads" set to "off" no hangs occurred in 3
rsync/rm rounds. Could that be some fuse-bridge lock race? Will bring that
option to "on" back again and try to get full statedump.

On Thursday, 21 January 2016 14:54:47 EET Raghavendra G wrote:

On Thu, Jan 21, 2016 at 10:49 AM, Pranith Kumar Karampuri <

pkara...@redhat.com> wrote:

On 01/18/2016 02:28 PM, Oleksandr Natalenko wrote:

XFS. Server side works OK, I'm able to mount volume again. Brick is 30%
full.

Oleksandr,

   Will it be possible to get the statedump of the client, bricks

output next time it happens?

https://github.com/gluster/glusterfs/blob/master/doc/debugging/statedump.m
d#how-to-generate-statedump

We also need to dump inode information. To do that you've to add "all=yes"
to /var/run/gluster/glusterdump.options before you issue commands to get
statedump.
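
i.e. something like:

echo "all=yes" >> /var/run/gluster/glusterdump.options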


Pranith


On Monday, 18 January 2016 15:07:18 EET baul jianguo wrote:

What is your brick file system? and the glusterfsd process and all
thread status?
I met same issue when client app such as rsync stay in D status,and
the brick process and relate thread also be in the D status.
And the brick dev disk util is 100% .

On Sun, Jan 17, 2016 at 6:13 AM, Oleksandr Natalenko

<oleksa...@natalenko.name> wrote:

Wrong assumption, rsync hung again.

On Saturday, 16 January 2016 22:53:04 EET Oleksandr Natalenko wrote:

One possible reason:

cluster.lookup-optimize: on
cluster.readdir-optimize: on

I've disabled both optimizations, and at least as of now rsync still
does
its job with no issues. I would like to find out what option causes
such
a
behavior and why. Will test more.

On Friday, 15 January 2016 16:09:51 EET Oleksandr Natalenko wrote:

Another observation: if rsyncing is resumed after hang, rsync itself
hangs a lot faster because it does stat of already copied files. So,
the
reason may be not writing itself, but massive stat on GlusterFS
volume
as well.

15.01.2016 09:40, Oleksandr Natalenko написав:

While doing rsync over millions of files from ordinary partition to
GlusterFS volume, just after approx. first 2 million rsync hang
happens, and the following info appears in dmesg:

===
[17075038.924481] INFO: task rsync:10310 blocked for more than 120
seconds.
[17075038.931948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[17075038.940748] rsync   D 88207fc13680 0 10310
10309 0x0080
[1

[Gluster-devel] distributed files/directories and [cm]time updates

2016-01-25 Thread Pranith Kumar Karampuri

hi,
  Traditionally gluster has been using ctime/mtime of the 
files/dirs on the bricks as stat output. Problem we are seeing with this 
approach is that, software which depends on it gets confused when there 
are differences in these times. Tar especially gives "file changed as we 
read it" whenever it detects ctime differences when stat is served from 
different bricks. The way we have been trying to solve it is to serve 
the stat structures from same brick in afr, max-time in dht. But it 
doesn't avoid the problem completely. Because there is no way to change 
ctime at the moment(lutimes() only allows mtime, atime), there is little 
we can do to make sure ctimes match after self-heals/xattr 
updates/rebalance. I am wondering if any of you have solved these problems 
before; if yes, how did you go about doing it? It seems like applications 
which depend on this for backups get confused the same way. The only way 
out I see is to bring ctime into an xattr, but that will need more iops 
and gluster has to keep updating it on quite a few fops.
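
For illustration (plain POSIX behaviour on any local filesystem): mtime can be
set explicitly, but ctime always jumps to the current time on such a change,
which is exactly why heals/rebalance can never reproduce the original ctime:

touch testfile
stat -c 'mtime=%y  ctime=%z' testfile
touch -m -d '2015-01-01 00:00' testfile    # move mtime into the past
stat -c 'mtime=%y  ctime=%z' testfile      # mtime went back, ctime moved forward to "now"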


Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Reverse brick order in tier volume- Why?

2016-01-22 Thread Pranith Kumar Karampuri



On 01/22/2016 03:48 PM, Ravishankar N wrote:

On 01/19/2016 06:44 PM, Ravishankar N wrote:


1) Is there is a compelling reason as to why the bricks of hot-tier 
are in the reverse order ?
2) If there isn't one, should we spend time to fix it so that the 
bricks appear in the order in which they were given at the time of 
volume creation/attach-tier *OR*  just continue with the way things 
are currently because it is not that much of an issue?

Dan / Joseph - any pointers?

+Nitya, Rafi as well.

-Ravi


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Reverse brick order in tier volume- Why?

2016-01-22 Thread Pranith Kumar Karampuri



On 01/23/2016 10:02 AM, Dan Lambright wrote:


- Original Message -

From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
To: "Ravishankar N" <ravishan...@redhat.com>, "Gluster Devel" 
<gluster-devel@gluster.org>, "Dan Lambright"
<dlamb...@redhat.com>, "Joseph Fernandes" <josfe...@redhat.com>, "Nithya 
Balachandran" <nbala...@redhat.com>,
"Mohammed Rafi K C" <rkavu...@redhat.com>
Sent: Friday, January 22, 2016 10:48:15 PM
Subject: Re: [Gluster-devel] Reverse brick order in tier volume- Why?



On 01/22/2016 03:48 PM, Ravishankar N wrote:

On 01/19/2016 06:44 PM, Ravishankar N wrote:

1) Is there is a compelling reason as to why the bricks of hot-tier
are in the reverse order ?
2) If there isn't one, should we spend time to fix it so that the
bricks appear in the order in which they were given at the time of
volume creation/attach-tier *OR*  just continue with the way things
are currently because it is not that much of an issue?

Dan / Joseph - any pointers?

This order was an artifact of how the volume is created using legacy code and 
data structures in glusterd-volgen.c. Two volume graphs are built (the hot and 
the cold). The two graphs are built and combined in a single list. As far as I 
know, nobody has run into trouble with this. Refactoring the code would be fine 
to ease maintainability.
Cool. The reason we ask is that in arbiter volumes, the 3rd brick is going 
to be the arbiter. If the bricks are in reverse order, it will lead to 
confusion. We will change it with our implementation of attach-tier for 
replica+arbiter bricks.


Pranith




+Nitya, Rafi as well.

-Ravi


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Throttling xlator on the bricks

2016-02-12 Thread Pranith Kumar Karampuri



On 02/13/2016 12:13 AM, Richard Wareing wrote:

Hey Ravi,

I'll ping Shreyas about this today.  There's also a patch we'll need for 
multi-threaded SHD to fix the least-pri queuing.  The PID of the process wasn't 
tagged correctly via the call frame in my original patch.  The patch below 
fixes this (for 3.6.3). I didn't see multi-threaded self-heal on github/master 
yet, so let me know what branch you need this patch on and I can come up with a 
clean patch.


Hi Richard,
     I reviewed the patch and found that the same needs to be done 
for ec as well. So I am thinking of splitting it into two patches: one 
patch in syncop-utils which builds the parallelization functionality, 
and another patch which uses it in afr and ec. Do you mind if I give 
it a go? I can complete it by the end of Wednesday.
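
As a rough, self-contained sketch of the shape such a shared helper could
take (plain pthreads and made-up names purely for illustration; the actual
patch would build on synctasks and gluster types, and none of these names
are from the real code):

/* Hypothetical sketch of a generic "process entries in parallel" helper
 * that could live in syncop-utils and be shared by afr and ec, with each
 * caller supplying only the per-entry callback. */
#include <pthread.h>
#include <stdio.h>

#define N_WORKERS 4
#define N_ENTRIES 16

typedef int (*entry_fn_t) (int entry, void *data);

struct parallel_ctx {
        entry_fn_t       fn;
        void            *data;
        int              next;   /* next entry to hand out */
        int              total;
        pthread_mutex_t  lock;
};

static void *worker (void *arg)
{
        struct parallel_ctx *ctx = arg;

        for (;;) {
                int entry;

                pthread_mutex_lock (&ctx->lock);
                entry = ctx->next < ctx->total ? ctx->next++ : -1;
                pthread_mutex_unlock (&ctx->lock);

                if (entry < 0)
                        break;
                ctx->fn (entry, ctx->data);   /* e.g. heal one entry */
        }
        return NULL;
}

static int heal_one (int entry, void *data)
{
        printf ("healing entry %d\n", entry);
        return 0;
}

int main (void)
{
        struct parallel_ctx ctx = { .fn = heal_one, .data = NULL,
                                    .next = 0, .total = N_ENTRIES };
        pthread_t tids[N_WORKERS];

        pthread_mutex_init (&ctx.lock, NULL);
        for (int i = 0; i < N_WORKERS; i++)
                pthread_create (&tids[i], NULL, worker, &ctx);
        for (int i = 0; i < N_WORKERS; i++)
                pthread_join (tids[i], NULL);
        pthread_mutex_destroy (&ctx.lock);
        return 0;
}
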


Pranith


Richard


=


diff --git a/xlators/cluster/afr/src/afr-self-heald.c b/xlators/cluster/afr/src/afr-self-heald.c
index 028010d..b0f6248 100644
--- a/xlators/cluster/afr/src/afr-self-heald.c
+++ b/xlators/cluster/afr/src/afr-self-heald.c
@@ -532,6 +532,9 @@ afr_mt_process_entries_done (int ret, call_frame_t *sync_frame,
                 pthread_cond_signal (&mt_data->task_done);
         }
         pthread_mutex_unlock (&mt_data->lock);
+
+        if (task_ctx->frame)
+                AFR_STACK_DESTROY (task_ctx->frame);
         GF_FREE (task_ctx);
         return 0;
 }
@@ -787,6 +790,7 @@ _afr_mt_create_process_entries_task (xlator_t *this,
         int                                ret = -1;
         afr_mt_process_entries_task_ctx_t *task_ctx;
         afr_mt_data_t                     *mt_data;
+        call_frame_t                      *frame = NULL;
 
         mt_data = &healer->mt_data;
 
@@ -799,6 +803,8 @@ _afr_mt_create_process_entries_task (xlator_t *this,
         if (!task_ctx)
                 goto err;
 
+        task_ctx->frame = afr_frame_create (this);
+
         INIT_LIST_HEAD (&task_ctx->list);
         task_ctx->readdir_xl = this;
         task_ctx->healer = healer;
@@ -812,7 +818,7 @@ _afr_mt_create_process_entries_task (xlator_t *this,
         // This returns immediately, and afr_mt_process_entries_done will
         // be called when the task is completed e.g. our queue is empty
         ret = synctask_new (this->ctx->env, afr_mt_process_entries_task,
-                            afr_mt_process_entries_done, NULL,
+                            afr_mt_process_entries_done, task_ctx->frame,
                             (void *)task_ctx);
 
         if (!ret) {
diff --git a/xlators/cluster/afr/src/afr-self-heald.h b/xlators/cluster/afr/src/afr-self-heald.h
index 817e712..1588fc8 100644
--- a/xlators/cluster/afr/src/afr-self-heald.h
+++ b/xlators/cluster/afr/src/afr-self-heald.h
@@ -74,6 +74,7 @@ typedef struct afr_mt_process_entries_task_ctx_ {
         subvol_healer_t *healer;
         xlator_t        *readdir_xl;
         inode_t         *idx_inode;  /* inode ref for xattrop dir */
+        call_frame_t    *frame;
         unsigned int     entries_healed;
         unsigned int     entries_processed;
         unsigned int     already_healed;


Richard

From: Ravishankar N [ravishan...@redhat.com]
Sent: Sunday, February 07, 2016 11:15 PM
To: Shreyas Siravara
Cc: Richard Wareing; Vijay Bellur; Gluster Devel
Subject: Re: [Gluster-devel] Throttling xlator on the bricks

Hello,

On 01/29/2016 06:51 AM, Shreyas Siravara wrote:

So the way our throttling works is (intentionally) very simplistic.

(1) When someone mounts an NFS share, we tag the frame with a 32 bit hash of 
the export name they were authorized to mount.
(2) io-stats keeps track of the "current rate" of fops we're seeing for that 
particular mount, using a sampling of fops and a moving average over a short period of 
time.
(3) Based on whether the share violated its allowed rate (which is defined in a config 
file), we tag the FOP as "least-pri". Of course this makes the assumption that 
all NFS endpoints are receiving roughly the same # of FOPs. The rate defined in the 
config file is a *per* NFS endpoint number. So if your cluster has 10 NFS endpoints, and 
you've pre-computed that it can do roughly 1000 FOPs per second, the rate in the config 
file would be 100.
(4) IO-Threads then shoves the FOP into the least-pri queue, rather than its 
default. The value is honored all the way down to the bricks.
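
A minimal sketch of the rate check in steps (2)-(3) above, assuming a
per-mount moving average compared against the configured per-endpoint rate;
the structure, field names, and averaging constant here are illustrative,
not the actual io-stats code:

/* Keep a moving average of fops/sec per export hash and mark a fop
 * least-pri once the configured per-endpoint rate is exceeded. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ALPHA 0.2          /* weight of the newest sample */

struct mount_stats {
        uint32_t export_hash;   /* 32-bit hash tagged on the frame at mount */
        double   avg_rate;      /* moving average, fops per second */
        double   allowed_rate;  /* per NFS endpoint, from the config file */
        uint64_t fops_in_window;
        time_t   window_start;
};

/* Returns 1 if this fop should be queued least-pri, 0 otherwise. */
static int fop_is_least_pri (struct mount_stats *ms, time_t now)
{
        ms->fops_in_window++;

        if (now > ms->window_start) {
                double sample = (double)ms->fops_in_window /
                                (double)(now - ms->window_start);

                /* fold the newest sample into the moving average */
                ms->avg_rate = ALPHA * sample + (1.0 - ALPHA) * ms->avg_rate;
                ms->fops_in_window = 0;
                ms->window_start = now;
        }

        return ms->avg_rate > ms->allowed_rate;
}

int main (void)
{
        struct mount_stats ms = { .export_hash = 0xdeadbeef,
                                  .allowed_rate = 100.0,
                                  .window_start = time (NULL) };

        /* Simulate a burst of fops arriving within one second. */
        for (int i = 0; i < 500; i++)
                ms.fops_in_window++;

        if (fop_is_least_pri (&ms, ms.window_start + 1))
                printf ("over %g fops/s - tag as least-pri\n", ms.allowed_rate);
        return 0;
}
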

The code is actually complete, and I'll put it up for review after we iron out 
a few minor issues.

Did you get a chance to send the patch? Just wanted to run some tests
and see if this is all we need at the moment to regulate shd traffic,
especially with Richard's multi-threaded heal patch
http://review.gluster.org/#/c/13329/ being revived 
