[Gluster-devel] GlusterFS-3.7.6-2 packages for Debian Wheezy now available

2016-02-03 Thread Kaleb Keithley

Hi,

If you're a Debian Wheezy user, please give the new packages a try.

Thanks

--

Kaleb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] Trashcan issue with vim editor

2016-02-03 Thread Anoop C S
On Fri, 2016-01-29 at 18:59 +0530, PankaJ Singh wrote:
> Hi,
> 
> Thanks, Anoop, for the help.
> Could you please tell me when we can expect a new release with
> this bug fix?
> 

Please find the corresponding patch posted for mainline at [1]. I am
not sure whether we can backport it and include it in 3.7.8. I will
update the thread ASAP.

[1] https://review.gluster.org/#/c/13346/

--Anoop C S.

> 
> Thanks & Regards
> 
> 
> 
> On Fri, Jan 29, 2016 at 12:42 PM, Anoop C S 
> wrote:
> > On Wed, 2016-01-27 at 15:25 +0530, PankaJ Singh wrote:
> > >
> > > Hi,
> > >
> > > We are using Gluster 3.7.6 on Ubuntu 14.04. We are facing an issue
> > > with the trashcan feature.
> > > Our scenario is as follows:
> > >
> > > 1. 2 server nodes (Ubuntu 14.04 with GlusterFS 3.7.6)
> > > 2. 1 client node (Ubuntu 14.04)
> > > 3. I have created one volume, vol1, with 2 bricks in replica and
> > >    with transport = tcp mode.
> > > 4. I have enabled quota on vol1.
> > > 5. I have enabled the trashcan feature on vol1.
> > > 6. I have mounted vol1 on the client's home directory: "mount -t
> > >    glusterfs -o transport=tcp server-1:/vol1 /home/"
> > > 7. Now, when I log in as any existing non-root user and perform
> > >    any editing via the vim editor, I get the error "E200: *ReadPre
> > >    autocommands made the file unreadable", and my user's home
> > >    directory permissions get changed to 000. After some time these
> > >    permissions revert automatically.
> > >
> > > (NOTE: users' home directories are copied into the mounted
> > > glusterfs volume vol1)
> > >
> > 
> > As discussed over IRC, we will definitely look into this issue [1]
> > and get back ASAP. That said, I have some solid reasons for
> > recommending against using the swap/backup files created/used by Vim
> > when trash is enabled for a volume (assuming you have a basic vimrc
> > config where swap/backup files are enabled by default):
> > 
> > 1. You will see a lot of foo.swpx/foo.swp files (with timestamps
> >    appended to their filenames) inside the trashcan, as Vim creates
> >    and removes these swap files every now and then.
> > 
> > 2. Regarding backup files, you will notice a number of files named
> >    4913 inside .trashcan. These files are created and deleted by Vim
> >    to make sure that it can create files in the current directory,
> >    and of course every time you save with :w.
> > 
> > 3. The same is true of undo files like .foo.un~.
> > 
> > 4. Last but not least, every time you do a :w, Vim performs a
> >    truncate operation which will cause the previous version of the
> >    file to be moved to .trashcan.
> > 
> > Having said that, you can add the following lines to your vimrc file
> > to prevent the unnecessary files described in the first 3 points
> > from landing inside .trashcan.
> > 
> > set noundofile
> > set noswapfile
> > set nobackup
> > set nowritebackup
> > 
> > As per the current implementation, we cannot prevent previous
> > versions of a file from being created inside the trash directory,
> > and I think these files can serve as backups in the future, which is
> > a good feature to have.
> > 
> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1302307
> > 
> > --Anoop C S
> > 
> > >
> > > Thanks & Regards
> > > PankaJ Singh
> > > ___
> > > Gluster-users mailing list
> > > gluster-us...@gluster.org
> > > http://www.gluster.org/mailman/listinfo/gluster-users
> > 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Non-blocking lock for renames

2016-02-03 Thread Raghavendra Gowdappa


- Original Message -
> From: "Raghavendra Gowdappa" 
> To: "Vijay Bellur" 
> Cc: "Gluster Devel" 
> Sent: Thursday, February 4, 2016 11:28:29 AM
> Subject: Re: [Gluster-devel] Non-blocking lock for renames
> 
> 
> 
> - Original Message -
> > From: "Vijay Bellur" 
> > To: "Shyamsundar Ranganathan" , "Raghavendra Gowdappa"
> > 
> > Cc: "Gluster Devel" 
> > Sent: Thursday, February 4, 2016 9:55:04 AM
> > Subject: Non-blocking lock for renames
> > 
> > DHT developers,
> > 
> > In 3.6 we introduced a non-blocking lock prior to a rename operation in
> > dht, and the rename fails if the lock acquisition is not successful. I
> > ran into a user on IRC yesterday who is affected by this behavior change:
> > 
> > "We're seeing a behavior in Gluster 3.7.x that we did not see in 3.4.x
> > and we're not sure how to fix it. When multiple processes are attempting
> > to rename a file to the same destination at once, we're now seeing
> > "Device or resource busy" and "Stale file handle" errors. Here's the
> > command to replicate it: cd /mnt/glustermount; while true; do
> > FILE=$RANDOM; touch $FILE; mv $FILE file-fv; done. The above command
> > would be run on two or three servers within the same gluster cluster. In
> > the output, one would always be successful in the rename, while the two
> > other ones would fail with the above error."
> > 
> > The use case for concurrent renames was described as:
> > 
> > "we generate files and push them to the gluster cluster. Some are
> > generated multiple times and end up being pushed to the cluster at the
> > same time by different data generators; resulting in the 'rename
> > collision'. We also use the cluster.extra-hash-regex to make sure the
> > data is written in place. And this does the rename."
> > 
> > Is a non-blocking lock essential? Can we not use a blocking lock instead
> > of a non-blocking lock, or fall back to a blocking lock if the original
> > non-blocking lock acquisition fails?
> 
> This lock synchronizes:
> 1. a rename from the application with file migration by the rebalance
> process [1].
> 2. multiple renames from the application on the same file.
> 
> I think the lock is still required for 1. However, since migration can
> potentially take a long time, we chose a non-blocking lock to make sure
> the application is not blocked for a long period.
> 
> Case 2 is what is causing the issue mentioned in this thread. We did see
> some files being removed with parallel renames on the same file. But by the
> time we had identified that it is a bug in 'mv' (mv issues an unlink on src
> if src and dst happen to be hardlinks [2], but the hardlink check and the
> unlink are not atomic; DHT breaks a rename into a series of links and
> unlinks), we had already introduced synchronization between renames. So we
> have two options:
> 
> 1. Use different domains for use cases 1 and 2 above. With different domains,
> use case 2 above can be changed to use blocking locks. It might not be
> advisable to use blocking locks for use case 1.
> 2. Since we identified that the issue is with mv (I couldn't find the other
> bug we filed on mv, but [2] is close to it), probably we don't need locking
> in 2 at all.
> 
> Suggestions?
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=969298#c8
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=438076

Found the bug we had filed on mv:
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1141368

> 
> regards,
> Raghavendra
> > 
> > Thanks,
> > Vijay
> > 
> > 
> > 
> > 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-03 Thread Shyam

On 02/04/2016 09:38 AM, Vijay Bellur wrote:

On 02/03/2016 11:34 AM, Venky Shankar wrote:

On Wed, Feb 03, 2016 at 09:24:06AM -0500, Jeff Darcy wrote:

Problem is with workloads which know the files that need to be read
without readdir, like hyperlinks (webserver), swift objects etc. These
are two I know of which will have this problem, which can't be improved
because we don't have metadata, data co-located. I have been trying to
think of a solution for past few days. Nothing good is coming up :-/


In those cases, caching (at the MDS) would certainly help a lot.  Some
variation of the compounding infrastructure under development for Samba
etc. might also apply, since this really is a compound operation.


Compounding in this case can help, but still, without the cache, the read 
has to go to the DS, and with such a compounding the MDS would reach out 
to the DS for the information rather than the client. This is another 
possibility, depending on what we decide on as the cache mechanism.




When a client is done modifying a file, MDS would refresh its size,
mtime
attributes by fetching it from the DS. As part of this refresh, DS could
additionally send back the content if the file size falls in range, with
MDS persisting it, sending it back for subsequent lookup calls as it does
now. The content (on MDS) can be zapped once the file size crosses the
defined limit.


Venky, when you say persisting, I assume on disk, is that right?

If so, then the MDS storage size requirements would increase (based on 
the amount of file data that needs to be stored). As of now it stores only 
inodes (and, as we move to a DB, records). In this case we may have 
*fatter* MDS partitions. Any comments/thoughts on that?


As with memory, I would assume some form of eviction of data from the MDS 
is a possibility, to control the space utilization here.






I like the idea. However, the memory implications of maintaining content
in the MDS are something to watch out for. quick-read is interested in
files up to 64k in size by default, and with a reasonable number of files
in that range we might end up consuming significant memory with this scheme.


Vijay, I think what Venky suggests is to stash the file on local 
storage and not in memory. If it were in memory, then brick process 
restarts would nuke the cache, and we would either need mechanisms to 
rebuild/warm the cache or just start caching afresh.


If we were caching in memory, then yes the concern is valid, and one 
possibility is  some form of LRU for the same, to keep memory 
consumption in check.


Overall I would steer away from memory for this use case and use the 
disk, as we do not know which files to cache (true in either case, but 
disk offers us more space, letting us possibly punt on that issue). For 
files where the cache is missing and the file is small enough, either 
perform an async read from the client (gaining some overlap time with the 
app) or just let it be, as we would get the open/read anyway, though it 
would slow things down.




-Vijay
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Non-blocking lock for renames

2016-02-03 Thread Raghavendra Gowdappa


- Original Message -
> From: "Vijay Bellur" 
> To: "Shyamsundar Ranganathan" , "Raghavendra Gowdappa" 
> 
> Cc: "Gluster Devel" 
> Sent: Thursday, February 4, 2016 9:55:04 AM
> Subject: Non-blocking lock for renames
> 
> DHT developers,
> 
> In 3.6 we introduced a non-blocking lock prior to a rename operation in
> dht, and the rename fails if the lock acquisition is not successful. I
> ran into a user on IRC yesterday who is affected by this behavior change:
> 
> "We're seeing a behavior in Gluster 3.7.x that we did not see in 3.4.x
> and we're not sure how to fix it. When multiple processes are attempting
> to rename a file to the same destination at once, we're now seeing
> "Device or resource busy" and "Stale file handle" errors. Here's the
> command to replicate it: cd /mnt/glustermount; while true; do
> FILE=$RANDOM; touch $FILE; mv $FILE file-fv; done. The above command
> would be run on two or three servers within the same gluster cluster. In
> the output, one would always be successful in the rename, while the two
> other ones would fail with the above error."
> 
> The use case for concurrent renames was described as:
> 
> "we generate files and push them to the gluster cluster. Some are
> generated multiple times and end up being pushed to the cluster at the
> same time by different data generators; resulting in the 'rename
> collision'. We also use the cluster.extra-hash-regex to make sure the
> data is written in place. And this does the rename."
> 
> Is a non-blocking lock essential? Can we not use a blocking lock instead
> of a non-blocking lock, or fall back to a blocking lock if the original
> non-blocking lock acquisition fails?

This lock synchronizes:
1. a rename from the application with file migration by the rebalance
process [1].
2. multiple renames from the application on the same file.

I think the lock is still required for 1. However, since migration can
potentially take a long time, we chose a non-blocking lock to make sure the
application is not blocked for a long period.

Case 2 is what is causing the issue mentioned in this thread. We did see some
files being removed with parallel renames on the same file. But by the time we
had identified that it is a bug in 'mv' (mv issues an unlink on src if src and
dst happen to be hardlinks [2], but the hardlink check and the unlink are not
atomic; DHT breaks a rename into a series of links and unlinks), we had
already introduced synchronization between renames. So we have two options:

1. Use different domains for use cases 1 and 2 above (a rough sketch follows
below). With different domains, use case 2 above can be changed to use
blocking locks. It might not be advisable to use blocking locks for use
case 1.
2. Since we identified that the issue is with mv (I couldn't find the other
bug we filed on mv, but [2] is close to it), probably we don't need locking
in 2 at all.
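
As an illustration only (a Python sketch with local threading locks standing
in for the per-domain lock acquisitions dht would actually take; this is not
gluster code), option 1 would look roughly like this:

    import threading

    # Illustrative per-"domain" locks; in dht these would be lock
    # acquisitions in two distinct lock domains, not Python locks.
    migration_domain = threading.Lock()  # use case 1: rename vs. rebalance migration
    rename_domain = threading.Lock()     # use case 2: rename vs. rename on the same file

    def locked_rename(do_rename, src, dst):
        # Use case 1 stays non-blocking: a migration can take a long time
        # and the application should not be stalled behind it.
        if not migration_domain.acquire(blocking=False):
            raise BlockingIOError("EBUSY: file is being migrated")
        try:
            # Use case 2 can block: a competing rename finishes quickly,
            # so waiting is cheaper than returning EBUSY to the caller.
            with rename_domain:
                do_rename(src, dst)
        finally:
            migration_domain.release()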

Suggestions?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=969298#c8
[2] https://bugzilla.redhat.com/show_bug.cgi?id=438076

regards,
Raghavendra
> 
> Thanks,
> Vijay
> 
> 
> 
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Contributing to Gluster

2016-02-03 Thread Sakshi Bansal
Hi Willy,

It's great to see your interest in GlusterFS. However, due to some changes in 
the lookup-related area, the current lookup-optimize design no longer holds 
good. Are there any other areas in GlusterFS that you would like to work on? 
Please let us know so that we can help you further.

- Original Message -
From: "Kaushal M" 
To: "Willy Soesanto" 
Cc: gluster-devel@gluster.org, "Shyam" , 
saban...@redhat.com
Sent: Wednesday, February 3, 2016 8:15:32 PM
Subject: Re: [Gluster-devel] Contributing to Gluster

Maybe Shyam and Sakshi (in cc) can be helpful on this topic. They've
been involved in implementation of lookup-optimize.

~kaushal

On Tue, Feb 2, 2016 at 5:10 PM, Willy Soesanto  wrote:
> Hi Gluster-Devs,
>
> My name is Willy. I am a final year undergraduate student from Bandung
> Institute of Technology. My final year project is about Gluster. After
> researching Gluster for a while, I would like to take on the task of
> working on lookup self-heal
> (https://public.pad.fsfe.org/p/dht_lookup_optimize). Are there any steps I
> should follow beforehand?
>
> Thanks,
>
> Willy
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Non-blocking lock for renames

2016-02-03 Thread Vijay Bellur

DHT developers,

In 3.6 we introduced a non-blocking lock prior to a rename operation in dht, 
and the rename fails if the lock acquisition is not successful. I ran into a 
user on IRC yesterday who is affected by this behavior change:


"We're seeing a behavior in Gluster 3.7.x that we did not see in 3.4.x 
and we're not sure how to fix it. When multiple processes are attempting 
to rename a file to the same destination at once, we're now seeing 
"Device or resource busy" and "Stale file handle" errors. Here's the 
command to replicate it: cd /mnt/glustermount; while true; do 
FILE=$RANDOM; touch $FILE; mv $FILE file-fv; done. The above command 
would be run on two or three servers within the same gluster cluster. In 
the output, one would always be successful in the rename, while the two 
other ones would fail with the above error."


The use case for concurrent renames was described as:

"we generate files and push them to the gluster cluster. Some are 
generated multiple times and end up being pushed to the cluster at the 
same time by different data generators; resulting in the 'rename 
collision'. We also use the cluster.extra-hash-regex to make sure the 
data is written in place. And this does the rename."


Is a non-blocking lock essential? Can we not use a blocking lock instead 
of a non-blocking lock, or fall back to a blocking lock if the original 
non-blocking lock acquisition fails?
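
Purely as an illustration of the fallback idea (a minimal Python sketch using
a local threading lock as a stand-in for the cluster-wide lock dht takes
around a rename; this is not gluster code):

    import threading

    # Stand-in for the cluster-wide lock dht acquires around a rename.
    rename_lock = threading.Lock()

    def rename_with_fallback(do_rename, src, dst):
        # Try the non-blocking acquisition first, as dht does today; if it
        # fails, fall back to a blocking acquisition instead of returning
        # EBUSY to the application.
        if not rename_lock.acquire(blocking=False):
            rename_lock.acquire()
        try:
            do_rename(src, dst)
        finally:
            rename_lock.release()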


Thanks,
Vijay



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-03 Thread Vijay Bellur

On 02/03/2016 11:34 AM, Venky Shankar wrote:

On Wed, Feb 03, 2016 at 09:24:06AM -0500, Jeff Darcy wrote:

Problem is with workloads which know the files that need to be read
without readdir, like hyperlinks (webserver), swift objects etc. These
are two I know of which will have this problem, which can't be improved
because we don't have metadata, data co-located. I have been trying to
think of a solution for past few days. Nothing good is coming up :-/


In those cases, caching (at the MDS) would certainly help a lot.  Some
variation of the compounding infrastructure under development for Samba
etc. might also apply, since this really is a compound operation.


When a client is done modifying a file, MDS would refresh its size, mtime
attributes by fetching it from the DS. As part of this refresh, DS could
additionally send back the content if the file size falls in range, with
MDS persisting it, sending it back for subsequent lookup calls as it does
now. The content (on MDS) can be zapped once the file size crosses the
defined limit.



I like the idea. However, the memory implications of maintaining content 
in the MDS are something to watch out for. quick-read is interested in 
files up to 64k in size by default, and with a reasonable number of files 
in that range we might end up consuming significant memory with this scheme.


-Vijay
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-03 Thread Venky Shankar
On Wed, Feb 03, 2016 at 09:24:06AM -0500, Jeff Darcy wrote:
> > Problem is with workloads which know the files that need to be read
> > without readdir, like hyperlinks (webserver), swift objects etc. These
> > are two I know of which will have this problem, which can't be improved
> > because we don't have metadata, data co-located. I have been trying to
> > think of a solution for past few days. Nothing good is coming up :-/
> 
> In those cases, caching (at the MDS) would certainly help a lot.  Some
> variation of the compounding infrastructure under development for Samba
> etc. might also apply, since this really is a compound operation.

When a client is done modifying a file, the MDS would refresh its size and
mtime attributes by fetching them from the DS. As part of this refresh, the DS
could additionally send back the content if the file size falls in range, with
the MDS persisting it and sending it back for subsequent lookup calls as it
does now. The content (on the MDS) can be zapped once the file size crosses
the defined limit.

But when there are open file descriptors on an inode (O_RDWR || O_WRONLY on a
file), the size cannot be trusted (as the MDS only knows about the updated
size after the last close), which would be the degraded case.
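
A rough sketch of that refresh path (Python pseudocode; the names and the
64k limit are invented here for illustration, this is not dht2 code):

    CONTENT_LIMIT = 64 * 1024  # e.g. the range quick-read cares about today

    class MDSInode:
        def __init__(self):
            self.size = 0
            self.mtime = 0
            self.cached_content = None  # persisted alongside the inode record

        def refresh_from_ds(self, size, mtime, content=None):
            """Called after the last close, when the DS reports back."""
            self.size, self.mtime = size, mtime
            if content is not None and size <= CONTENT_LIMIT:
                self.cached_content = content   # served in lookup replies
            else:
                self.cached_content = None      # zapped once the file outgrows the limit

        def lookup_reply(self):
            reply = {"size": self.size, "mtime": self.mtime}
            if self.cached_content is not None:
                reply["content"] = self.cached_content
            return reply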

Thanks,

Venky
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] REMINDER: Weekly Gluster Community meeting starts in 1 hour

2016-02-03 Thread Kaushal M
I missed doing a #startmeeting, so we don't have meeting minutes for
this week's meeting. I've opened a request with fedora-infra to
manually import the log [1], and I'll update the list with the minutes
once that's complete.

~kaushal

[1] https://fedorahosted.org/fedora-infrastructure/ticket/5091

On Wed, Feb 3, 2016 at 4:34 PM, Niels de Vos  wrote:
>
> Hi all,
>
> The weekly Gluster community meeting is starting in 1 hour at 12:00 UTC.
> The current agenda for the meeting is below. Add any further topics to
> the agenda at https://public.pad.fsfe.org/p/gluster-community-meetings
>
> Meeting details:
> - location: #gluster-meeting on Freenode IRC
> - date: every Wednesday
> - time: 8:00 EDT, 12:00 UTC, 13:00 CET, 17:30 IST
> (in your terminal, run: date -d "12:00 UTC")
>
> Current Agenda:
>  * Roll Call
>  * AIs from last meeting
>  * GlusterFS 3.7
>  * GlusterFS 3.6
>  * GlusterFS 3.5
>  * GlusterFS 3.8
>  * GlusterFS 4.0
>  * Open Floor
>
> See you there,
> Niels
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-03 Thread Shyam

On 02/03/2016 07:54 PM, Jeff Darcy wrote:

Problem is with workloads which know the files that need to be read
without readdir, like hyperlinks (webserver), swift objects etc. These
are two I know of which will have this problem, which can't be improved
because we don't have metadata, data co-located. I have been trying to
think of a solution for past few days. Nothing good is coming up :-/


In those cases, caching (at the MDS) would certainly help a lot.  Some
variation of the compounding infrastructure under development for Samba
etc. might also apply, since this really is a compound operation.



The above is certainly an option; I need to process it a bit more to 
respond sanely.


Another option is to generate the GFID for a file with parGFID+basename as 
input (which was something Pranith brought up a few mails back in this 
chain). There was concern that we would have GFID clashes, but further 
reasoning suggests that we would not. An example follows:


Good cases:
- /D1/File is created, with the top 2 bytes of the file's GFID as the bucket 
(same as D1's bucket), and the rest of the GFID as some UUID generated from 
the pGFID (the GFID of D1) + basename
- When this file is looked up by name, its GFID can be generated on the 
client side as a hint, and the same fan out of a lookup to the MDS and a 
read to the DS can be initiated
* The READ data is valid only when the lookup agrees on the same GFID for 
the file


Bad cases:
- On a rename, the GFID of the file does not change, so if /D1/File 
was renamed to /D2/File1, then a subsequent lookup could fail to 
prefetch the read, as the GFID hint generated is now based on the GFID of 
D2 and the new name File1
- If, after such a rename, /D1/File is created again, the GFID 
generated/requested by the client for this file would clash with the 
already generated GFID, hence the DHT server would decide to return a 
new GFID that has no relation to the one generated by the hint, again 
resulting in the hint failing


So with the above scheme, as long as files are not renamed, the hint 
serves its purpose of prefetching even with just the name and parGFID.
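
As an editor's illustration of such a deterministic hint (not the dht2
implementation; uuid5 is just one convenient name-based scheme):

    import uuid

    def gfid_hint(parent_gfid: str, basename: str) -> uuid.UUID:
        # A name-based UUID lets the client and server derive the same
        # hint for /D1/File without any prior lookup; in the scheme above
        # the top bytes would additionally be forced to the parent's bucket.
        return uuid.uuid5(uuid.UUID(parent_gfid), basename)

    d1_gfid = "a6f2c3d4-0000-0000-0000-000000000001"   # made-up parent GFID
    hint = gfid_hint(d1_gfid, "File")
    # The hint is only trusted if the real lookup returns the same GFID.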


One gotcha is that I see a pattern where applications create a tmp 
file and then rename it to the real file name, sort of a swap file that 
is renamed to the real file as needed. For all such applications the 
hints above would fail.


I believe Swift also uses a similar trick on the FS, renaming an 
object once it is considered fully written. Another case would be a 
compile workload. So overall the above scheme could work to 
alleviate the problem somewhat, but may cause harm in other cases (where 
the GFID hint is incorrect and so we end up sending a read for no reason).


The above could easily be prototyped with DHT2 to see its benefits, so 
we will try that out at some point in the future.


Shyam
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Contributing to Gluster

2016-02-03 Thread Kaushal M
Maybe Shyam and Sakshi (in cc) can be helpful on this topic. They've
been involved in implementation of lookup-optimize.

~kaushal

On Tue, Feb 2, 2016 at 5:10 PM, Willy Soesanto  wrote:
> Hi Gluster-Devs,
>
> My name is Willy. I am a final year undergraduate student from Bandung
> Institute of Technology. My final year project is about Gluster. After
> researching Gluster for a while, I would like to take on the task of
> working on lookup self-heal
> (https://public.pad.fsfe.org/p/dht_lookup_optimize). Are there any steps I
> should follow beforehand?
>
> Thanks,
>
> Willy
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-03 Thread Jeff Darcy
> Problem is with workloads which know the files that need to be read
> without readdir, like hyperlinks (webserver), swift objects etc. These
> are two I know of which will have this problem, which can't be improved
> because we don't have metadata, data co-located. I have been trying to
> think of a solution for past few days. Nothing good is coming up :-/

In those cases, caching (at the MDS) would certainly help a lot.  Some
variation of the compounding infrastructure under development for Samba
etc. might also apply, since this really is a compound operation.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] REMINDER: Weekly Gluster Community meeting starts in 1 hour

2016-02-03 Thread Niels de Vos

Hi all,

The weekly Gluster community meeting is starting in 1 hour at 12:00 UTC.
The current agenda for the meeting is below. Add any further topics to
the agenda at https://public.pad.fsfe.org/p/gluster-community-meetings

Meeting details:
- location: #gluster-meeting on Freenode IRC
- date: every Wednesday
- time: 8:00 EDT, 12:00 UTC, 13:00 CET, 17:30 IST
(in your terminal, run: date -d "12:00 UTC")

Current Agenda:
 * Roll Call
 * AIs from last meeting
 * GlusterFS 3.7
 * GlusterFS 3.6
 * GlusterFS 3.5
 * GlusterFS 3.8
 * GlusterFS 4.0
 * Open Floor

See you there,
Niels


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-03 Thread Pranith Kumar Karampuri



The file data would be located based on its GFID, so before the *first*
lookup/stat for a file, there is no way to know its GFID.
NOTE: Instead of a name hash the GFID hash is used, to get immunity
against renames and the like, as a name hash could change the location
information for the file (among other reasons).


Another manner of achieving the same when the GFID of the file is 
known (from a readdir) is to wind the lookup and read of size to the 
respective MDS and DS, where the lookup would be responded to once the 
MDS responds, and the DS response is cached for the subsequent 
open+read case. So on the wire we would have a fan out of 2 FOPs, but 
still satisfy the quick read requirements.


A tar kind of workload doesn't have a problem because we know the GFID 
after readdirp.




I would assume the above resolves the problem posted, are there cases 
where we do not know the GFID of the file? i.e no readdir performed 
and client knows the file name that it wants to operate on? Do we have 
traces of the webserver workload to see if it generates names on the 
fly or does a readdir prior to that?


The problem is with workloads which know the files that need to be read 
without a readdir, like hyperlinks (webserver), swift objects, etc. These 
are the two I know of which will have this problem, which can't be improved 
because we don't have metadata and data co-located. I have been trying to 
think of a solution for the past few days. Nothing good is coming up :-/


Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-03 Thread Shyam

On 02/03/2016 09:20 AM, Shyam wrote:

On 02/02/2016 06:22 PM, Jeff Darcy wrote:

   Background: The quick-read + open-behind xlators were developed to help
with small-file workload reads (apache webserver, tar, etc.) by getting the
data of the file in the lookup FOP itself. What happens is: when a lookup
FOP is executed, GF_CONTENT_KEY is added to xdata with a max-length, and the
posix xlator reads the file and fills the data into the xdata response if
this key is present, as long as the file size is less than the max-length
given in the xdata. So when we do a tar of something like a kernel tree with
small files, if we look at the profile of the bricks all we see are lookups.
OPEN + READ fops will not be sent over the network at all.
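
(A hedged sketch of that round trip, in Python pseudocode with invented
names; the real xdata key string and dict plumbing live in the gluster
sources and are not reproduced here:)

    import os

    MAX_INLINE = 64 * 1024  # the "max-length" the client advertises in xdata

    def brick_lookup(path, xdata):
        # Stat as a normal lookup would, then inline the content for small
        # files so the client never needs to send OPEN + READ afterwards.
        st = os.stat(path)
        reply = {"size": st.st_size, "mtime": st.st_mtime}
        limit = xdata.get("content-max-length")   # stand-in for GF_CONTENT_KEY
        if limit is not None and st.st_size <= limit:
            with open(path, "rb") as f:
                reply["content"] = f.read()
        return reply

    # Client side: one round trip returns both metadata and data.
    reply = brick_lookup(__file__, {"content-max-length": MAX_INLINE})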

   With dht2, because the data is present on a different cluster, we can't
get the data in the lookup. Shyam was telling me that opens are also sent to
the metadata cluster. That will take perf in this use case back to where it
was before these two features were introduced, i.e. 1/3 of current perf
(lookup vs. lookup+open+read).


This is interesting thanks for the heads up.



Is "1/3 of current perf" based on actual measurements?  My understanding
was that the translators in question exist to send requests *in parallel*
with the original lookup stream.  That means it might be 3x the messages,
but it will only be 1/3 the performance if the network is saturated.
Also, the lookup is not guaranteed to be only one message.  It might be
as many as N (the number of bricks), so by the reasoning above the
performance would only drop to N/(N+2). I think the real situation is a
bit more complicated - and less dire - than you suggest.


I suggest that we send some fop at the
time of open to the data cluster and change quick-read to cache this data on
open (if it does not already); then we can reduce the perf hit to 1/2 of
current perf, i.e. lookup+open.


At first glance, it seems pretty simple to do something like this, and
pretty obvious that we should.  The tricky question is: where should we
send that other op, before lookup has told us where the partition
containing that file is?  If there's some reasonable guess we can make,
then sending an open+read in parallel with the lookup will be helpful.
If not, then it will probably be a waste of time and network resources.
Shyam, is enough of this information being cached *on the clients* to
make this effective?



The file data would be located based on its GFID, so before the *first*
lookup/stat for a file, there is no way to know its GFID.
NOTE: Instead of a name hash the GFID hash is used, to get immunity
against renames and the like, as a name hash could change the location
information for the file (among other reasons).


Another manner of achieving the same when the GFID of the file is known 
(from a readdir) is to wind the lookup and read of size to the 
respective MDS and DS, where the lookup would be responded to once the 
MDS responds, and the DS response is cached for the subsequent open+read 
case. So on the wire we would have a fan out of 2 FOPs, but still 
satisfy the quick read requirements.
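
A sketch of that 2-FOP fan out (Python pseudocode; the mds/ds helpers and
the cache are invented for illustration, this is not dht2 code):

    import concurrent.futures

    _pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

    def lookup_with_prefetch(gfid, mds, ds, read_cache):
        # With the GFID already known from readdir, wind the lookup and a
        # speculative small-file read in parallel.
        lookup_future = _pool.submit(mds.lookup, gfid)          # to the MDS
        read_cache[gfid] = _pool.submit(ds.read_small, gfid)    # to the DS
        # The lookup is answered as soon as the MDS responds; the cached DS
        # response is consumed by the subsequent open+read.
        return lookup_future.result()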


I would assume the above resolves the problem posted. Are there cases 
where we do not know the GFID of the file, i.e. no readdir performed and 
the client knows the file name that it wants to operate on? Do we have 
traces of the webserver workload to see if it generates names on the fly 
or does a readdir prior to that?




The open+read can be done as a single FOP:
   - open for a read-only case can do access checking on the client to
allow the FOP to proceed to the DS without hitting the MDS for an open
token

The client side cache is important from this and other such
perspectives. It should also leverage upcall infra to keep the cache
loosely coherent.

One thing to note here: for the client to do a lookup (where the file name
should be known beforehand), either a readdir(p) has to have happened, or
the client knows the name already (say, application-generated names). For
the former (readdir case), there is enough information on the client to not
need a lookup, but rather just do the open+read on the DS. For the latter,
the first lookup cannot be avoided, degrading this to a lookup+(open+read).

Some further tricks can be done to do readdir prefetching on such
workloads, as the MDS runs on a DB (eventually), piggybacking more
entries than requested on a lookup. I would possibly leave that for
later, based on performance numbers in the small file area.

Shyam

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel