Re: [Gluster-devel] Disabling read-ahead and io-cache for native fuse mounts

2019-02-12 Thread Manoj Pillai
On Wed, Feb 13, 2019 at 10:51 AM Raghavendra Gowdappa 
wrote:

>
>
> On Tue, Feb 12, 2019 at 5:38 PM Raghavendra Gowdappa 
> wrote:
>
>> All,
>>
>> We've found that the perf xlators io-cache and read-ahead do not add any
>> performance improvement. At best, read-ahead is redundant due to kernel
>> read-ahead
>>
>
> One thing we are still figuring out is whether kernel read-ahead is
> tunable. From what we've explored, it _looks_ like (may not be entirely
> correct) ra is capped at 128KB. If that's the case, I am interested in a
> few things:
> * Are there any real-world applications/use cases that would benefit from
> larger read-ahead (Manoj says block devices can do ra of 4MB)?
>

Kernel read-ahead is adaptive but influenced by the read-ahead setting on
the block device (/sys/block/<device>/queue/read_ahead_kb), which can be
tuned. For RHEL specifically, the default is 128KB (last I checked), but the
default RHEL tuned profile, throughput-performance, bumps that up to 4MB.
It should be fairly easy to rig up a test where 4MB read-ahead on the
block device gives better performance than 128KB read-ahead.
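
A minimal sketch of such a test (purely illustrative -- the device name,
file path and sizes below are assumptions, and it needs root):

# Hypothetical A/B test: buffered sequential reads at 128KB vs 4MB
# block-device read-ahead. Adjust DEV and the filename for the system
# under test; run as root.
import subprocess

DEV = "sdb"  # assumed backing device
RA = "/sys/block/%s/queue/read_ahead_kb" % DEV

def run(ra_kb):
    with open(RA, "w") as f:          # set block-device read-ahead
        f.write(str(ra_kb))
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3")                  # drop caches so read-ahead matters
    subprocess.run(["fio", "--name=seqread", "--rw=read", "--bs=1M",
                    "--size=8g", "--numjobs=1",
                    "--filename=/mnt/test/raf"], check=True)

for kb in (128, 4096):
    print("read_ahead_kb =", kb)
    run(kb)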

-- Manoj

> * Is the limit on kernel ra tunable a hard one? IOW, what does it take to
> make it do higher ra? If it's difficult, can glusterfs read-ahead provide
> the expected performance improvement for these applications that would
> benefit from aggressive ra (as glusterfs can support larger ra sizes)?
>
> I am still inclined to prefer kernel ra as I think it's more intelligent
> and can identify more sequential patterns than Glusterfs read-ahead [1][2].
> [1] https://www.kernel.org/doc/ols/2007/ols2007v2-pages-273-284.pdf
> [2] https://lwn.net/Articles/155510/
>
> and at worst io-cache is degrading performance for workloads that
>> don't involve re-read. Given that VFS already has both of these
>> functionalities, I am proposing to have these two translators turned off by
>> default for native fuse mounts.
>>
>> For access paths other than native fuse mounts, like gfapi
>> (NFS-ganesha/samba), we can keep these xlators on via custom profiles.
>> Comments?
>>
>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1665029
>>
>> regards,
>> Raghavendra
>>
>

Re: [Gluster-devel] optimizing gluster fuse

2018-04-09 Thread Manoj Pillai
On Tue, Apr 10, 2018 at 10:02 AM, riya khanna 
wrote:

> On Mon, Apr 9, 2018 at 10:42 PM, Raghavendra Gowdappa wrote:
>
>> +Manoj.
>>
>> On Mon, Apr 9, 2018 at 10:18 PM, riya khanna 
>> wrote:
>>
>>> Hi All,
>>>
>>> I'm trying to use the new framework to speed up lookups/attr/xattr
>>> operations by splitting functionality between fast/slow execution paths.
>>> I'd highly appreciate it if you could suggest experiments to evaluate the
>>> performance improvement.
>>>
>>
How about a software build workload, varying the number of source files?
Especially the case where nothing needs to be done because no files have
changed since the last build -- this case should be all metadata operations.
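
A rough sketch of the kind of rig I mean (paths and file count are
placeholders; assumes make and gcc are available on the client):

# Rough sketch: time a full build vs. a no-op rebuild on the fuse mount.
# The no-op rebuild is almost pure metadata traffic (lookup/stat of every
# source and object file). MNT and N are placeholders.
import os, subprocess, time

MNT = "/mnt/glusterfs/buildtest"   # assumed fuse mount point
N = 1000                           # number of source files to generate

os.makedirs(MNT, exist_ok=True)
for i in range(N):
    with open(os.path.join(MNT, "f%04d.c" % i), "w") as f:
        f.write("int func_%d(void) { return %d; }\n" % (i, i))
with open(os.path.join(MNT, "Makefile"), "w") as f:
    f.write("OBJS := $(patsubst %.c,%.o,$(wildcard *.c))\nall: $(OBJS)\n")

def timed_make():
    t0 = time.time()
    subprocess.run(["make", "-C", MNT], check=True,
                   stdout=subprocess.DEVNULL)
    return time.time() - t0

print("full build:    %.1fs" % timed_make())
print("no-op rebuild: %.1fs" % timed_make())  # nothing changed -> metadata only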

-- Manoj


>> As you've pointed out already, this is a good place for read caches (both
>> data and metadata). While there is an overlap between things cached by
>> the kernel and things cached by glusterfs, there are some things which are
>> cached only by glusterfs and not by VFS/kernel. I think this is the area we
>> can explore to move these caches into the kernel. Things I can think of:
>>
>>
> Even if things are cached by VFS (e.g., dir entries, attributes, etc.),
> the size of the VFS dcache is limited and can affect performance when under
> pressure. Have you ever experienced such a case? Nevertheless, the
> new framework can help create your own dir/attr cache managed by the
> user-space daemon - let's call it a self-managed dcache.
>
>
>> * xattr caching - done by md-cache in glusterfs. I am not sure whether
>> VFS caches xattrs. If not, this can yield good returns for workloads
>> involving xattrs (like POSIX ACLs etc.).
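
One quick way to probe whether the kernel absorbs these: time repeated
getxattr calls on a file on the fuse mount -- if later calls cost as much
as the first, every call is reaching the user-space daemon. A hypothetical
microbenchmark; the path and xattr name are placeholders:

# Assumes Python 3 on Linux and a file on the fuse mount that has a
# POSIX ACL set; getxattr returns ENODATA otherwise.
import os, time

PATH = "/mnt/glusterfs/somefile"   # placeholder path on the fuse mount
NAME = "system.posix_acl_access"   # xattr typical of POSIX ACL workloads

for i in range(3):
    t0 = time.perf_counter()
    try:
        os.getxattr(PATH, NAME)
    except OSError as e:
        print("getxattr failed:", e)
        break
    print("call %d: %.1f us" % (i, (time.perf_counter() - t0) * 1e6))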
>>
>
> Thanks! Similar to attr, xattr caching should be doable as well. I can
> start by looking at the existing implementation in md-cache.
>
>
>> * GET kind of interface for small files - done by quick-read in
>> glusterfs. Note that we fetch the file in lookup. If we couple this with
>> pushing open-behind into the kernel, we can prevent open/readv/flush/release
>> from reaching glusterfs at all in suitable workloads (we had earlier found
>> that this boosts performance for webserver use cases). I think in the lookup
>> response, we would have to populate the page cache. Also, the lookup response
>> signature doesn't provide for holding this data. Not sure whether this can
>> be done.
>>
>
> This one is tricky. There are some limitations imposed by the framework.
> Let me think about it.
>
>
>> * Dirent prefetching for directories - done by readdir-ahead.
>>
> The user space daemon in readdir() can populate the self-managed dcache.
> Future lookups can be served from this cache entirely within the kernel.
> What kind of workload can benefit from this?
>
>
>> * As you've already pointed out, we can improve on our invalidation
>> strategies.
>> * since page cache is already present in VFS, I don't think
>> read-ahead/io-cache might have any benefits.
>>
>
> The framework can also bypass the fuse user space daemon during data I/O
> (e.g., read, write) if the file is locally stored by the lower file system.
> This design is called pass-through I/O and has been discussed numerous times
> on the fuse-devel mailing list. Recent discussion:
> https://lwn.net/Articles/674286/
> Does this apply to glusterfs as well, perhaps when a file is cached by the
> client locally?
>
>
>>> As I mentioned in my previous email, I'm caching replies from fuse
>>> daemon (hashed key/value blobs) in the kernel so that for the same key
>>> (e.g.,  in case of FUSE_LOOKUP), the reply (e.g.,
>>> fuse_entry_out) is served from the kernel itself and no call is delivered
>>> to user-space.
>>>
>>> While this may seem redundant due to the entry_timeout/attr_timeout caching
>>> that already exists in FUSE, this design provides more control to the
>>> user-space daemon over when/what to invalidate. For instance, entry_timeout
>>> caching is only valid until a timeout or until the kernel removes the dentry
>>> from its dcache.
>>>
>>> For invalidation, fuse_lowlevel_notify_inval_entry() can also remove
>>> entries from the hash table. Please refer to the figure attached in my last
>>> email.
>>>
>>> Thanks,
>>> Riya
>>>
>>> On Tue, Apr 3, 2018 at 1:45 PM, riya khanna 
>>> wrote:
>>>
 I'm attaching a figure that depicts the architecture of my optimized
 fuse framework. Kindly let me know if you have any questions.

 On Mon, Apr 2, 2018 at 10:57 AM, riya khanna 
 wrote:

> Thanks Amar! Please see my answers inline.
>
> On Mon, Apr 2, 2018 at 5:41 AM, Amar Tumballi 
> wrote:
>
>> Hi Riya,
>>
>> Thanks for writing to us. Some questions before we start on this.
>>
>> * Where can we see your work of modifying the fuse module to cache
>> the calls? Some reference would help us to provide more specific 
>> pointers.
>> (or ask better questions).
>>
>> I've created a fast 

Re: [Gluster-devel] 2 way with Arbiter degraded behavior

2018-02-21 Thread Manoj Pillai
On Wed, Feb 21, 2018 at 9:13 PM, Jeff Applewhite 
wrote:

> Hi All
>
> When you have a setup with 2-way replication + Arbiter backed by two
> large RAID 6 volumes, what happens from a client perspective when there
> is a disk failure and a rebuild in progress in one of those RAID sets?
>
> Does the FUSE client know how to prioritize the quicker disk (the RAID
> set that is not in rebuild)? If not, could it be made smart in this
> way? I ask because with large disks, rebuild priority could be set
> very high on the controller card if Gluster could auto-detect or in some
> way work around the relatively slow performance of one of the two
> backends. The likelihood of having rebuilds in progress on two
> different RAID sets on two different servers is very low.
>
> Another way to state this is "is there some advantage we can gain from
> having double replication (RAID 6 + Gluster file replication)?"
>
> Thanks,
>
>
> Jeff Applewhite
>

I opened an issue some time back with this and some other scenarios in
mind: https://github.com/gluster/glusterfs/issues/363.

-- Manoj


Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-09 Thread Manoj Pillai
So comparing the key latency, ∆ (2,3), in the two cases:

iodepth=1: 171 us
iodepth=8: 1453 us (in the ballpark of 171*8=1368). That's not good! (I
wonder if that relation roughly holds up for other values of iodepth).

This data doesn't conclusively establish that the problem is in gluster.
You'd see similar results if the network were saturated, like Vijay
suggested. But from what I remember of this test, the throughput here is
far too low for that to be the case.
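
To spell out the reasoning with a toy model: if concurrent requests are
fully serialized at some bottleneck, the latency seen above that point
should scale with iodepth; if they proceed in parallel, it should stay
roughly flat. Illustrative arithmetic only:

# Toy model, not a measurement: latency under full serialization vs.
# full parallelism at the bottleneck.
def expected_latency_us(per_request_us, iodepth, serialized):
    return per_request_us * (iodepth if serialized else 1)

base = 171  # measured delta(2,3) at iodepth=1, in us
for qd in (1, 8):
    print("iodepth=%d: serialized=%dus parallel=%dus"
          % (qd, expected_latency_us(base, qd, True),
                 expected_latency_us(base, qd, False)))
# the measured 1453us at iodepth=8 is close to the serialized
# prediction of 171*8 = 1368us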

-- Manoj


On Thu, Jun 8, 2017 at 6:37 PM, Krutika Dhananjay <kdhan...@redhat.com>
wrote:

> Indeed the latency on the client side dropped with iodepth=1. :)
> I ran the test twice and the results were consistent.
>
> Here are the exact numbers:
>
> *Translator Position*   *Avg Latency of READ fop as
> seen by this translator*
>
> 1. parent of client-io-threads        437us
>
> ∆ (1,2) = 69us
>
> 2. parent of protocol/client-0        368us
>
> ∆ (2,3) = 171us
>
> - end of client stack -
> - beginning of brick stack --
>
> 3. child of protocol/server   197us
>
> ∆ (3,4) = 4us
>
> 4. parent of io-threads               193us
>
> ∆ (4,5) = 32us
>
> 5. child-of-io-threads  161us
>
> ∆ (5,6) = 11us
>
> 6. parent of storage/posix   150us
> ...
>  end of brick stack 
>
> Will continue reading the code and get back when I find something concrete.
>
> -Krutika
>
>
> On Thu, Jun 8, 2017 at 12:22 PM, Manoj Pillai <mpil...@redhat.com> wrote:
>
>> Thanks. So I was suggesting a repeat of the test, but this time with
>> iodepth=1 in the fio job. If reducing the number of concurrent requests
>> drastically reduces the high latency you're seeing from the client side,
>> that would strengthen the hypothesis that serialization/contention among
>> concurrent requests at the n/w layers is the root cause here.
>>
>> -- Manoj
>>
>>
>> On Thu, Jun 8, 2017 at 11:46 AM, Krutika Dhananjay <kdhan...@redhat.com>
>> wrote:
>>
>>> Hi,
>>>
>>> This is what my job file contains:
>>>
>>> [global]
>>> ioengine=libaio
>>> #unified_rw_reporting=1
>>> randrepeat=1
>>> norandommap=1
>>> group_reporting
>>> direct=1
>>> runtime=60
>>> thread
>>> size=16g
>>>
>>>
>>> [workload]
>>> bs=4k
>>> rw=randread
>>> iodepth=8
>>> numjobs=1
>>> file_service_type=random
>>> filename=/perf5/iotest/fio_5
>>> filename=/perf6/iotest/fio_6
>>> filename=/perf7/iotest/fio_7
>>> filename=/perf8/iotest/fio_8
>>>
>>> I have 3 vms reading from one mount, and each of these vms is running
>>> the above job in parallel.
>>>
>>> -Krutika
>>>
>>> On Tue, Jun 6, 2017 at 9:14 PM, Manoj Pillai <mpil...@redhat.com> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Jun 6, 2017 at 5:05 PM, Krutika Dhananjay <kdhan...@redhat.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> As part of identifying performance bottlenecks within gluster stack
>>>>> for VM image store use-case, I loaded io-stats at multiple points on the
>>>>> client and brick stack and ran randrd test using fio from within the 
>>>>> hosted
>>>>> vms in parallel.
>>>>>
>>>>> Before I get to the results, a little bit about the configuration ...
>>>>>
>>>>> 3 node cluster; 1x3 plain replicate volume with group virt settings,
>>>>> direct-io.
>>>>> 3 FUSE clients, one per node in the cluster (which implies reads are
>>>>> served from the replica that is local to the client).
>>>>>
>>>>> io-stats was loaded at the following places:
>>>>> On the client stack: Above client-io-threads and above
>>>>> protocol/client-0 (the first child of AFR).
>>>>> On the brick stack: Below protocol/server, above and below io-threads
>>>>> and just above storage/posix.
>>>>>
>>>>> Based on a 60-second run of randrd test and subsequent analysis of the
>>>>> stats dumped by the individual io-stats instances, the following is what I
>>>>> found:
>>>>>
>>>>> *Translator Position*   *Avg Latency of READ
>>>>>

Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-08 Thread Manoj Pillai
Thanks. So I was suggesting a repeat of the test, but this time with
iodepth=1 in the fio job. If reducing the number of concurrent requests
drastically reduces the high latency you're seeing from the client side,
that would strengthen the hypothesis that serialization/contention among
concurrent requests at the n/w layers is the root cause here.

-- Manoj

On Thu, Jun 8, 2017 at 11:46 AM, Krutika Dhananjay <kdhan...@redhat.com>
wrote:

> Hi,
>
> This is what my job file contains:
>
> [global]
> ioengine=libaio
> #unified_rw_reporting=1
> randrepeat=1
> norandommap=1
> group_reporting
> direct=1
> runtime=60
> thread
> size=16g
>
>
> [workload]
> bs=4k
> rw=randread
> iodepth=8
> numjobs=1
> file_service_type=random
> filename=/perf5/iotest/fio_5
> filename=/perf6/iotest/fio_6
> filename=/perf7/iotest/fio_7
> filename=/perf8/iotest/fio_8
>
> I have 3 vms reading from one mount, and each of these vms is running the
> above job in parallel.
>
> -Krutika
>
> On Tue, Jun 6, 2017 at 9:14 PM, Manoj Pillai <mpil...@redhat.com> wrote:
>
>>
>>
>> On Tue, Jun 6, 2017 at 5:05 PM, Krutika Dhananjay <kdhan...@redhat.com>
>> wrote:
>>
>>> Hi,
>>>
>>> As part of identifying performance bottlenecks within gluster stack for
>>> VM image store use-case, I loaded io-stats at multiple points on the client
>>> and brick stack and ran randrd test using fio from within the hosted vms in
>>> parallel.
>>>
>>> Before I get to the results, a little bit about the configuration ...
>>>
>>> 3 node cluster; 1x3 plain replicate volume with group virt settings,
>>> direct-io.
>>> 3 FUSE clients, one per node in the cluster (which implies reads are
>>> served from the replica that is local to the client).
>>>
>>> io-stats was loaded at the following places:
>>> On the client stack: Above client-io-threads and above protocol/client-0
>>> (the first child of AFR).
>>> On the brick stack: Below protocol/server, above and below io-threads
>>> and just above storage/posix.
>>>
>>> Based on a 60-second run of randrd test and subsequent analysis of the
>>> stats dumped by the individual io-stats instances, the following is what I
>>> found:
>>>
>>> *Translator Position*   *Avg Latency of READ fop
>>> as seen by this translator*
>>>
>>> 1. parent of client-io-threads        1666us
>>>
>>> ∆ (1,2) = 50us
>>>
>>> 2. parent of protocol/client-0        1616us
>>>
>>> ∆ (2,3) = 1453us
>>>
>>> - end of client stack -
>>> - beginning of brick stack ---
>>>
>>> 3. child of protocol/server   163us
>>>
>>> ∆ (3,4) = 7us
>>>
>>> 4. parent of io-threads               156us
>>>
>>> ∆ (4,5) = 20us
>>>
>>> 5. child-of-io-threads  136us
>>>
>>> ∆ (5,6) = 11us
>>>
>>> 6. parent of storage/posix   125us
>>> ...
>>>  end of brick stack 
>>>
>>> So it seems like the biggest bottleneck here is a combination of the
>>> network + epoll, rpc layer?
>>> I must admit I am no expert with networks, but I'm assuming that if the
>>> client is reading from the local brick, then
>>> even the latency contribution from the actual network won't be much, in
>>> which case the bulk of the latency is coming from epoll, rpc layer, etc.
>>> at both the client and brick end? Please correct me if I'm wrong.
>>>
>>> I will, of course, do some more runs and confirm if the pattern is
>>> consistent.
>>>
>>> -Krutika
>>>
>>>
>> Really interesting numbers! How many concurrent requests are in flight in
>> this test? Could you post the fio job? I'm wondering if/how these latency
>> numbers change if you reduce the number of concurrent requests.
>>
>> -- Manoj
>>
>>
>

Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-06 Thread Manoj Pillai
On Tue, Jun 6, 2017 at 5:05 PM, Krutika Dhananjay 
wrote:

> Hi,
>
> As part of identifying performance bottlenecks within gluster stack for VM
> image store use-case, I loaded io-stats at multiple points on the client
> and brick stack and ran randrd test using fio from within the hosted vms in
> parallel.
>
> Before I get to the results, a little bit about the configuration ...
>
> 3 node cluster; 1x3 plain replicate volume with group virt settings,
> direct-io.
> 3 FUSE clients, one per node in the cluster (which implies reads are
> served from the replica that is local to the client).
>
> io-stats was loaded at the following places:
> On the client stack: Above client-io-threads and above protocol/client-0
> (the first child of AFR).
> On the brick stack: Below protocol/server, above and below io-threads and
> just above storage/posix.
>
> Based on a 60-second run of randrd test and subsequent analysis of the
> stats dumped by the individual io-stats instances, the following is what I
> found:
>
> *Translator Position*   *Avg Latency of READ fop as
> seen by this translator*
>
> 1. parent of client-io-threads        1666us
>
> ∆ (1,2) = 50us
>
> 2. parent of protocol/client-0        1616us
>
> ∆ (2,3) = 1453us
>
> - end of client stack -
> - beginning of brick stack ---
>
> 3. child of protocol/server   163us
>
> ∆ (3,4) = 7us
>
> 4. parent of io-threads               156us
>
> ∆ (4,5) = 20us
>
> 5. child-of-io-threads  136us
>
> ∆ (5,6) = 11us
>
> 6. parent of storage/posix   125us
> ...
>  end of brick stack 
>
> So it seems like the biggest bottleneck here is a combination of the
> network + epoll, rpc layer?
> I must admit I am no expert with networks, but I'm assuming that if the
> client is reading from the local brick, then
> even the latency contribution from the actual network won't be much, in
> which case the bulk of the latency is coming from epoll, rpc layer, etc.
> at both the client and brick end? Please correct me if I'm wrong.
>
> I will, of course, do some more runs and confirm if the pattern is
> consistent.
>
> -Krutika
>
>
Really interesting numbers! How many concurrent requests are in flight in
this test? Could you post the fio job? I'm wondering if/how these latency
numbers change if you reduce the number of concurrent requests.

-- Manoj

Re: [Gluster-devel] CFP for Gluster Developer Summit

2016-08-19 Thread Manoj Pillai

Here's a proposal ...

Title: State of Gluster Performance
Theme: Stability and Performance

I hope to achieve the following in this talk:

* present a brief overview of current performance for the broad
workload classes: large-file sequential and random workloads,
small-file and metadata-intensive workloads.

* highlight some use-cases where we are seeing really good
performance.

* highlight some of the areas of concern, covering in some detail
the state of analysis and work in progress.

Regards,
Manoj

- Original Message -
> Hey All,
> 
> Gluster Developer Summit 2016 is fast approaching [1]. We are
> looking to have talks and discussions related to the following themes in
> the summit:
> 
> 1. Gluster.Next - focusing on features shaping the future of Gluster
> 
> 2. Experience - Description of real world experience and feedback from:
> a> Devops and Users deploying Gluster in production
> b> Developers integrating Gluster with other ecosystems
> 
> 3. Use cases  - focusing on key use cases that drive Gluster.today and
> Gluster.Next
> 
> 4. Stability & Performance - focusing on current improvements to reduce
> our technical debt backlog
> 
> 5. Process & infrastructure  - focusing on improving current workflow,
> infrastructure to make life easier for all of us!
> 
> If you have a talk/discussion proposal that can be part of these themes,
> please send out your proposal(s) by replying to this thread. Please
> clearly mention the theme for which your proposal is relevant when you
> do so. We will be ending the CFP by 12 midnight PDT on August 31st, 2016.
> 
> If you have other topics that do not fit in the themes listed, please
> feel free to propose and we might be able to accommodate some of them as
> lightning talks or something similar.
> 
> Please do reach out to me or Amye if you have any questions.
> 
> Thanks!
> Vijay
> 
> [1] https://www.gluster.org/events/summit2016/


Re: [Gluster-devel] performance issues Manoj found in EC testing

2016-06-27 Thread Manoj Pillai

- Original Message -
> From: "Sankarshan Mukhopadhyay" <sankarshan.mukhopadh...@gmail.com>
> To: "Gluster Devel" <gluster-devel@gluster.org>
> Sent: Monday, June 27, 2016 5:54:19 PM
> Subject: Re: [Gluster-devel] performance issues Manoj found in EC testing
> 
> On Mon, Jun 27, 2016 at 2:38 PM, Manoj Pillai <mpil...@redhat.com> wrote:
> > Thanks, folks! As a quick update, throughput on a single client test jumped
> > from ~180 MB/s to 700+MB/s after enabling client-io-threads. Throughput is
> > now more in line with what is expected for this workload based on
> > back-of-the-envelope calculations.
> 
> Is it possible to provide additional detail about this exercise in
> terms of setup; tests executed; data sets generated?

Yes, in the bz. The "before" number is from here:
https://bugzilla.redhat.com/show_bug.cgi?id=1349953#c1

-- Manoj


> 
> --
> sankarshan mukhopadhyay
> <https://about.me/sankarshan.mukhopadhyay>


Re: [Gluster-devel] performance issues Manoj found in EC testing

2016-06-27 Thread Manoj Pillai


- Original Message -
> From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> To: "Pranith Kumar Karampuri" <pkara...@redhat.com>
> Cc: "Gluster Devel" <gluster-devel@gluster.org>
> Sent: Monday, June 27, 2016 12:48:49 PM
> Subject: Re: [Gluster-devel] performance issues Manoj found in EC testing
> 
> 
> 
> - Original Message -
> > From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
> > To: "Xavier Hernandez" <xhernan...@datalab.es>
> > Cc: "Gluster Devel" <gluster-devel@gluster.org>
> > Sent: Monday, June 27, 2016 12:42:35 PM
> > Subject: Re: [Gluster-devel] performance issues Manoj found in EC testing
> > 
> > 
> > 
> > On Mon, Jun 27, 2016 at 11:52 AM, Xavier Hernandez <xhernan...@datalab.es>
> > wrote:
> > 
> > 
> > Hi Manoj,
> > 
> > I always enable the client-io-threads option for disperse volumes. It
> > improves performance noticeably, most probably because of the problem you
> > have detected.
> > 
> > I don't see any other way to solve that problem.
> > 
> > I agree. Updated the bug with same info.
> > 
> > 
> > 
> > I think it would be a lot better to have a true thread pool (and maybe an
> > I/O
> > thread pool shared by fuse, client and server xlators) in libglusterfs
> > instead of the io-threads xlator. This would allow each xlator to decide
> > when and what should be parallelized in a more intelligent way, since
> > basing
> > the decision solely on the fop type seems too simplistic to me.
> > 
> > In the specific case of EC, there are a lot of operations to perform for a
> > single high level fop, and not all of them require the same priority. Also
> > some of them could be executed in parallel instead of sequentially.
> > 
> > I think it is high time we actually schedule (for which release) to get
> > this in gluster. Maybe you should send out a doc where we can work out
> > details? I will be happy to explore options to integrate io-threads,
> > syncop/barrier with this infra based on the design, maybe.
> 
> +1. I can volunteer too.

Thanks, folks! As a quick update, throughput on a single client test jumped 
from ~180 MB/s to 700+MB/s after enabling client-io-threads. Throughput is 
now more in line with what is expected for this workload based on 
back-of-the-envelope calculations.
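
For reference, the kind of back-of-the-envelope arithmetic I mean -- the
link speed and EC layout below are assumptions for illustration, not the
exact test configuration:

# Rough throughput ceiling for a single-client EC write test.
link_MBps = 1250.0            # assumed 10GbE client link, ~1.25 GB/s raw
k, m = 4, 2                   # assumed disperse 4+2 layout
amplification = (k + m) / k   # client sends k+m fragments per k data units
print("ceiling ~ %.0f MB/s" % (link_MBps / amplification))  # ~833 MB/s
# 700+ MB/s is in that ballpark; ~180 MB/s clearly was not.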

Are there any reservations about recommending client-io-threads=on as 
"default" tuning, until the enhancement discussed above becomes reality? 

-- Manoj

> 
> > 
> > 
> > 
> > Xavi
> > 
> > 
> > On 25/06/16 19:42, Manoj Pillai wrote:
> > 
> > 
> > 
> > - Original Message -
> > 
> > 
> > From: "Pranith Kumar Karampuri" < pkara...@redhat.com >
> > To: "Xavier Hernandez" < xhernan...@datalab.es >
> > Cc: "Manoj Pillai" < mpil...@redhat.com >, "Gluster Devel" <
> > gluster-devel@gluster.org >
> > Sent: Thursday, June 23, 2016 8:50:44 PM
> > Subject: performance issues Manoj found in EC testing
> > 
> > hi Xavi,
> > Meet Manoj from performance team Redhat. He has been testing EC
> > performance in his stretch clusters. He found some interesting things we
> > would like to share with you.
> > 
> > 1) When we perform multiple streams of big file writes (12 parallel dds I
> > think) he found one thread to be always hot (99% CPU always). He was asking
> > me if the fuse_reader thread does any extra processing in EC compared to
> > replicate. Initially I thought it would just lock and epoll threads will
> > perform the encoding but later realized that once we have the lock and
> > version details, next writes on the file would be encoded in the same
> > thread that comes to EC. write-behind could play a role and make the writes
> > come to EC in an epoll thread but we saw consistently there was just one
> > thread that is hot. Not multiple threads. We will be able to confirm this
> > in tomorrow's testing.
> > 
> > 2) This is one more thing Raghavendra G found, that our current
> > implementation of epoll doesn't let other epoll threads pick messages from
> > a socket while one thread is processing one message from that socket. In
> > EC's case that can be encoding of the write/decoding read. This will not
> > let replies of operations on different files to be processed in parallel.
> > He thinks this can be fixed for 3.9.
> > 
> > Manoj will be raising a bug to gather a

Re: [Gluster-devel] performance issues Manoj found in EC testing

2016-06-25 Thread Manoj Pillai

- Original Message -
> From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
> To: "Xavier Hernandez" <xhernan...@datalab.es>
> Cc: "Manoj Pillai" <mpil...@redhat.com>, "Gluster Devel" 
> <gluster-devel@gluster.org>
> Sent: Thursday, June 23, 2016 8:50:44 PM
> Subject: performance issues Manoj found in EC testing
> 
> hi Xavi,
>   Meet Manoj from performance team Redhat. He has been testing EC
> performance in his stretch clusters. He found some interesting things we
> would like to share with you.
> 
> 1) When we perform multiple streams of big file writes (12 parallel dds I
> think) he found one thread to be always hot (99% CPU always). He was asking
> me if the fuse_reader thread does any extra processing in EC compared to
> replicate. Initially I thought it would just lock and epoll threads will
> perform the encoding but later realized that once we have the lock and
> version details, next writes on the file would be encoded in the same
> thread that comes to EC. write-behind could play a role and make the writes
> come to EC in an epoll thread but we saw consistently there was just one
> thread that is hot. Not multiple threads. We will be able to confirm this
> in tomorrow's testing.
> 
> 2) This is one more thing Raghavendra G found, that our current
> implementation of epoll doesn't let other epoll threads pick messages from
> a socket while one thread is processing one message from that socket. In
> EC's case that can be encoding of the write/decoding read. This will not
> let replies of operations on different files to be processed in parallel.
> He thinks this can be fixed for 3.9.
> 
> Manoj will be raising a bug to gather all his findings. I just wanted to
> introduce him and let you know the interesting things he is finding before
> you see the bug :-).
> --
> Pranith

Thanks, Pranith :).

Here's the bug: https://bugzilla.redhat.com/show_bug.cgi?id=1349953

Comparing EC and replica-2 runs, the hot thread is seen in both cases, so 
I have not opened this as an EC bug. But initial impression is that 
performance impact for EC is particularly bad (details in the bug).

-- Manoj


Re: [Gluster-devel] Fragment size in Systematic erasure code

2016-03-14 Thread Manoj Pillai
Hi Xavi,

- Original Message -
> From: "Xavier Hernandez" <xhernan...@datalab.es>
> To: "Ashish Pandey" <aspan...@redhat.com>
> Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Manoj Pillai" 
> <mpil...@redhat.com>
> Sent: Monday, March 14, 2016 5:25:53 PM
> Subject: Re: Fragment size in Systematic erasure code
> 
> Hi Ashish,
> 
> On 14/03/16 12:31, Ashish Pandey wrote:
> > Hi Xavi,
> >
> > I think for the systematic erasure coded volume you are going to use a
> > fragment size of 512 bytes.
> > Will there be any CLI option to configure this block size?
> > We were having a discussion and Manoj was suggesting having this option,
> > which might improve performance for some workloads.
> > For example, if we can configure it to 8K, all reads can be served from
> > just one brick when a file is smaller than 8K.
> 
> I already considered using a configurable fragment size, and I plan to
> have it.

Good to hear!

> However, the benefits of larger block sizes are not so clear.
> Having a fragment size of 8KB in a 4+2 configuration will use a stripe
> of 32KB. Any write that is smaller, not aligned, or not a multiple of this
> value will need a read-modify-write cycle, causing a performance
> degradation for some workloads. It's also slower to encode/decode a
> block of 32KB because it might not fully fit into processor caches,
> making the computation slower.
> 
> On the other hand, a small read on multiple bricks should, in theory, be
> executed in parallel, not causing a noticeable performance drop.

Yes, we're primarily looking to see if certain read-intensive workloads 
can benefit from a larger fragment size.
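
To spell out the alignment arithmetic Xavi describes -- a small sketch,
assuming a k+m systematic layout:

# In a k+m systematic code the stripe is k * fragment_size; any write
# that is not stripe-aligned and a whole number of stripes needs a
# read-modify-write cycle.
def needs_rmw(offset, length, fragment_size, k):
    stripe = k * fragment_size
    return offset % stripe != 0 or length % stripe != 0

frag, k = 8192, 4                        # 8KB fragments, 4+2 -> 32KB stripe
print(needs_rmw(0, 32768, frag, k))      # False: one full aligned stripe
print(needs_rmw(0, 4096, frag, k))       # True: small write
print(needs_rmw(16384, 32768, frag, k))  # True: misaligned write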

> 
> Anyway many of these things depend on the workload, so having a
> configurable fragment size will give enough control to choose the best
> solution for each environment.

Right. Once the option is available, we can do some testing with different 
workloads, varying the fragment size, and post our findings.

Thanks,
Manoj

> Xavi
> 


Re: [Gluster-devel] NSR design document

2015-10-15 Thread Manoj Pillai


- Original Message -
> October 14 2015 3:11 PM, "Manoj Pillai" <mpil...@redhat.com> wrote:
> > E.g. 3x number of bricks could be a problem if workload has
> > operations that don't scale well with brick count.
> 
> Fortunately we have DHT2 to address that.
> 
> > Plus the brick
> > configuration guidelines would not exactly be elegant.
> 
> And we have Heketi to address that.
> 
> > FWIW, if I look at the performance and perf regressions tests
> > that are run at my place of work (as these tests stand today), I'd
> > expect AFR to significantly outperform this design on reads.
> 
> Reads tend to be absorbed by caches above us, *especially* in read-only
> workloads.  See Rosenblum and Ousterhout's 1992 log-structured file
> system paper, and about a bazillion others ever since.  

Yes, their point was that read absorption means the request 
stream at the secondary storage is dominated by writes, so you 
optimize for that. Plus, the non-overwrite mode of update has 
additional benefits, like easier implementation of snapshots 
or versioning, better recovery guarantees. And I think these 
additional benefits still hold true today, which is why there 
is continued interest in similar solutions. But a lot of data has 
flowed over the wires since 1992, and with the explosion in 
data set sizes, read performance at the lower storage layers 
continues to be the determinant of overall performance for many use 
cases, I think. Particularly among those shopping 
for a scale-out storage solution to fit their large data 
sets and modern workloads. Update-in-place file systems 
like XFS have endured quite well. 

> We need to be
> concerned at least as much about write performance, and NSR's write
> performance will *far* exceed AFR's because AFR uses neither networks
> nor disks efficiently.  It splits client bandwidth between N replicas,
> and it sprays writes all over the disk (data blocks plus inode plus
> index).  Most other storage systems designed in the last ten years can
> turn that into nice sequential journal writes, which can even be on a
> separate SSD or NVMe device (something AFR can't leverage at all).
> Before work on NSR ever started, I had already compared AFR to other
> file systems using these same methods and data flows (e.g. Ceph and
> MooseFS) many times.  Consistently, I'd see that the difference was
> quite a bit more than theoretical.  Despite all of the optimization work
> we've done on it, AFR's write behavior is still a huge millstone around
> our necks.
> 
> OK, let's bring some of these thoughts together.  If you've read
> Hennessy and Patterson, you've probably seen this formula before.
> 
> value (of an optimization) =
> benefit_when_applicable * probability -
> penalty_when_inapplicable * (1 - probability)
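
Plugging purely illustrative numbers into the formula above, just to show
the shape of the tradeoff -- these are not measurements:

# value > 0 means the optimization pays off on balance.
def value(benefit, penalty, probability):
    return benefit * probability - penalty * (1 - probability)

print(value(benefit=10.0, penalty=2.0, probability=0.7))  # 6.4: worth it
print(value(benefit=10.0, penalty=2.0, probability=0.1))  # -0.8: not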
> 
> If NSR's write performance is significantly better than AFR's, and write
> performance is either dominant or at least highly relevant for most real
> workloads, what does that mean for performance overall?  As prototyping
> showed long ago, it means a significant improvement.  Is it *possible*
> to construct a read-dominant workload that shows something different?
> Of course it is.  It's even possible that write performance will degrade
> in certain (increasingly rare) physical configurations.  No design is
> best for every configuration and workload.  Some people tried to focus
> on the outliers when NSR was first proposed.  Our competitors will be
> glad to do the same, for the same reason - to keep their own pet designs
> from looking too bad.  The important question is whether performance
> improves for *most* real-world configurations and workloads.  NSR is
> quite deliberately somewhat write-optimized, because it's where we were
> the furthest behind and because it's the harder problem to solve.
> Optimizing for read-only workloads leaves users with any other kind of
> workload in a permanent hole.
> 
> Also, even for read-heavy workloads where we might see a deficit, we
> have not one but two workarounds.  One (brick splitting) we've just
> discussed, and it is quite deliberately being paired with other
> technologies in 4.0 to make it more effective.  The other (read from
> non-leaders) is also perfectly viable.  It's not the default because it
> reduces consistency to AFR levels, which I don't think serves our users
> very well.  However, if somebody's determined to make AFR comparisons,
> then it's only fair to compare at the same consistency level.  Giving
> users the ability to decide on such tradeoffs, instead of forcing one
> choice on everyone, has been part of NSR's design since day one.

And if there are improvements that can make the non-default option 
(read from non-leaders

Re: [Gluster-devel] NSR design document

2015-10-14 Thread Manoj Pillai


- Original Message -
> > "The reads will also be sent to, and processed by the current
> > leader."
> > 
> > So, at any given time, only one brick in the replica group is
> > handling read requests? For a read-only workload-phase,
> > all except one will be idle in any given term?
> 
> By default and in theory, yes.  The question is: does it matter in practice?
> If you only have one replica set, and if you haven't turned on the option
> to allow reads from non-leaders (which is not the default because it does
> imply a small reduction in consistency across failure scenarios), and if the
> client(s) bandwidth isn't already saturated, then yeah, it might be slower
> than AFR.  Then again, even that might be outweighed by gains in cache
> efficiency and avoidance of any need for locking.  In the absolute worst
> case, we can split bricks and create multiple replica sets across the same
> bricks, each with their own leader.  That would parallelize reads as much as
> AFR, while still gaining all of the other NSR advantages.
> 
> In other words, yes, in theory it could be a problem.  In practice?  No.

Or maybe: in theory, it shouldn't be a problem in practice :).

We _could_ split bricks to distribute the load more or less evenly. So 
what would naturally be a replica-3/JBOD configuration (i.e. each 
disk is a brick, multiple bricks per server) could be changed 
to carve out 3 bricks from each disk to distribute load 
(otherwise 2/3 of the disks would be idle in said read-only 
workload phase, IIUC). Such carving could have its downsides though. 
E.g. 3x the number of bricks could be a problem if the workload has 
operations that don't scale well with brick count. Plus the brick 
configuration guidelines would not exactly be elegant.
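
To make the carving concrete, a toy sketch -- the round-robin leader
placement is an assumption for illustration, since NSR actually elects
leaders per term:

# replica-3 across 3 servers with N disks each; carving 3 bricks from
# each disk gives 3 replica groups per disk index, and if each group's
# leader lands on a different server, every disk hosts one leader.
servers, disks_per_server = 3, 4
for d in range(disks_per_server):       # one replica "column" per disk index
    for split in range(3):              # 3 bricks carved from each disk
        leader = split % servers        # assumed round-robin leader placement
        print("disk%d group%d -> leader on server%d" % (d, split, leader))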

FWIW, if I look at the performance and perf regressions tests 
that are run at my place of work (as these tests stand today), I'd 
expect AFR to significantly outperform this design on reads.

-- Manoj




Re: [Gluster-devel] NSR design document

2015-10-14 Thread Manoj Pillai

- Original Message -
> From: "Avra Sengupta" 
> To: "Gluster Devel" 
> Sent: Wednesday, October 14, 2015 2:10:33 PM
> Subject: [Gluster-devel] NSR design document
> 
> Hi,
> 
> Please find attached the NSR design document. It captures the
> architecture of NSR, and is in continuation from the discussions that
> happened during the GLuster Next community hangout. I would like to
> request you to kindly go through the document, and share any queries or
> concerns regarding the same.

>From "2. Introduction":
"The reads will also be sent to, and processed by the current
leader."

So, at any given time, only one brick in the replica group is 
handling read requests? For a read-only workload-phase, 
all except one will be idle in any given term?

-- Manoj


> Currently gluster.readdocs.org is down, and I will upload this document
> there once it is back online.
> 
> Regards,
> Avra
> 