On Tue, Jul 4, 2017 at 1:39 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:
> Hi Pranith,
>
> On 03/07/17 05:35, Pranith Kumar Karampuri wrote:
>> Ashish, Xavi,
>>       I think it is better to implement this change as a separate
>> read-after-write caching xlator which we can load between EC and the
>> client xlator. That way EC will not get a lot more functionality than
>> necessary, and maybe this xlator can be used somewhere else in the stack
>> if possible.
>
> While this seems a good way to separate functionalities, it has a big
> problem. If we add a caching xlator between ec and *all* of its
> subvolumes, it will only be able to cache encoded data. So, when ec needs
> the "cached" data, it will still need to issue a request to each of its
> subvolumes and compute the decoded data before being able to use it, so
> we don't avoid the decoding overhead.
>
> Also, if we want to make the xlator generic, it will probably cache a lot
> more data than ec really needs, increasing the memory footprint
> considerably for no real benefit.
>
> Additionally, this new xlator will need to guarantee that the cached data
> is current, so it will either need its own locking logic (which would be
> another copy&paste of the existing logic in one of the current xlators),
> which is slow and difficult to maintain, or it will need to intercept and
> reuse locking calls from parent xlators, which can be quite complex since
> we have multiple xlator levels where locks can be taken, not only ec.
>
> This is a relatively simple change to make inside ec, but a very complex
> change (IMO) if we want to do it as a stand-alone xlator that is generic
> enough to be reused and work safely in other places of the stack.
>
> If we want to separate functionalities, I think we should create a new
> concept of xlator which is transversal to the "traditional" xlator stack.
>
> Current xlators are linear in the sense that each one operates only at
> one place (it can be moved by reconfiguration, but once instantiated, it
> always works at the same place) and passes data to the next one.
>
> A transversal xlator (or maybe "service xlator" would be a better name)
> would be one not bound to any place of the stack, but usable by all other
> xlators to implement some service, like caching, multithreading, locking,
> ... These are features that many xlators need but cannot use easily (nor
> efficiently) if they are implicitly implemented in some specific place of
> the stack outside their control.
>
> The transaction framework we already talked about could be thought of as
> one of these service xlators. Multithreading could also benefit from this
> approach, because xlators would have more control over what can be
> processed by a background thread and what cannot. Probably there are
> other features that could benefit from this approach.
>
> In the case of brick multiplexing, if some xlators are removed from each
> stack and loaded as global services, most probably the memory footprint
> will be lower and the resource usage more optimized.

I like the service xlator approach, but I don't think we have enough time
to make it operational in the short term. Let us go with the implementation
of this feature in EC for now. I didn't realize the extra cost of decoding
when I thought about the separation, so I guess we will stick to the old
idea for now.

> Just an idea...
>
> Xavi

>> On Fri, Jun 16, 2017 at 4:19 PM, Ashish Pandey <aspan...@redhat.com> wrote:
>>
>> I think it should be done as we have agreement on the basic design.
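(To make the decoding overhead mentioned above concrete, here is a minimal
toy sketch in C. It uses XOR parity in a 2+1 layout instead of the 4+2
Reed-Solomon style encoding ec actually uses, and none of these names are
Gluster APIs. The point is only that a cache below ec stores per-subvolume
fragments, so even a cache hit requires gathering and rebuilding the stripe
before ec can use a single byte of it.)

    /* toy_frag_cache.c - why a cache below ec still pays for decoding.
     * XOR parity in a 2+1 layout stands in for the real 4+2 Reed-Solomon
     * encoding; all names are illustrative, not Gluster APIs. */
    #include <stdio.h>
    #include <string.h>

    #define K         2        /* data fragments per stripe (ec: 4)  */
    #define FRAG_SIZE 4        /* toy fragment size (ec: 512 bytes)  */

    /* What a cache below ec holds: one encoded fragment per subvolume,
     * never the user's contiguous data. */
    struct cached_fragments {
        unsigned char frag[K + 1][FRAG_SIZE];   /* K data + 1 parity */
    };

    /* Encode: scatter user data across fragments and compute parity. */
    static void encode(const unsigned char *data, struct cached_fragments *c)
    {
        int i, b;
        for (i = 0; i < K; i++)
            memcpy(c->frag[i], data + i * FRAG_SIZE, FRAG_SIZE);
        for (b = 0; b < FRAG_SIZE; b++) {
            c->frag[K][b] = 0;
            for (i = 0; i < K; i++)
                c->frag[K][b] ^= c->frag[i][b];
        }
    }

    /* Even on a cache hit, ec must gather all K fragments and rebuild
     * the stripe before any byte of it is usable. */
    static void rebuild(const struct cached_fragments *c, unsigned char *data)
    {
        int i;
        for (i = 0; i < K; i++)
            memcpy(data + i * FRAG_SIZE, c->frag[i], FRAG_SIZE);
    }

    int main(void)
    {
        struct cached_fragments c;
        unsigned char out[K * FRAG_SIZE + 1] = { 0 };

        encode((const unsigned char *)"ABCDEFGH", &c);  /* cache fills up   */
        rebuild(&c, out);                               /* paid on each hit */
        printf("stripe rebuilt from cached fragments: %s\n", out);
        return 0;
    }

A cache above ec, or inside it as proposed in the thread below, can instead
hold the decoded stripe directly and skip this step.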
>> ----------------------------------------------------------------------
>> From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
>> To: "Xavier Hernandez" <xhernan...@datalab.es>
>> Cc: "Ashish Pandey" <aspan...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org>
>> Sent: Friday, June 16, 2017 3:50:09 PM
>> Subject: Re: [Gluster-devel] Disperse volume : Sequential Writes
>>
>> On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:
>>
>> On 16/06/17 10:51, Pranith Kumar Karampuri wrote:
>>
>> On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:
>>
>> On 15/06/17 11:50, Pranith Kumar Karampuri wrote:
>>
>> On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey <aspan...@redhat.com> wrote:
>>
>> Hi All,
>>
>> We have been facing some issues in disperse (EC) volumes. We know that
>> currently EC is not good for random IO, as it requires a
>> READ-MODIFY-WRITE fop cycle if offset or offset+length falls in the
>> middle of a stripe.
>>
>> Unfortunately, this can also happen with sequential writes. Consider an
>> EC volume with configuration 4+2. The stripe size for this would be
>> 512 * 4 = 2048; that is, 2048 bytes of user data stored in one stripe.
>> Let's say 2048 + 512 = 2560 bytes are already written on this volume, so
>> 512 bytes are in the second stripe. Now, if there is a sequential write
>> with offset 2560 and size 1 byte, we have to read the whole stripe,
>> encode it with the 1 byte, and then write it back again. The next write,
>> with offset 2561 and size 1 byte, will again READ-MODIFY-WRITE the whole
>> stripe. This is causing bad performance.
>>
>> There are some tools and scenarios where such a load occurs and users
>> are not aware of it. Examples: fio and zip.
>>
>> Solution:
>> One possible solution to deal with this issue is to keep the last stripe
>> in memory. This way, we need not read it again and we can save a READ
>> fop going over the network. Considering the above example, we have to
>> keep the last 2048 bytes (maximum) in memory per file. This should not
>> be a big deal, as we already keep some data like xattrs and size info in
>> memory and take decisions based on that.
>>
>> Please provide your thoughts on this, and also whether you have any
>> other solution.
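(For reference, the arithmetic above as a small stand-alone C sketch; the
helper names are illustrative, not ec's actual functions:)

    #include <stdio.h>
    #include <stdint.h>

    #define FRAGMENT_SIZE 512u                          /* bytes per brick fragment */
    #define DATA_BRICKS   4u                            /* the "4" in 4+2           */
    #define STRIPE_SIZE   (FRAGMENT_SIZE * DATA_BRICKS) /* 512 * 4 = 2048           */

    /* A write triggers READ-MODIFY-WRITE unless it covers whole stripes. */
    static int needs_rmw(uint64_t offset, uint64_t len)
    {
        return (offset % STRIPE_SIZE) != 0 || ((offset + len) % STRIPE_SIZE) != 0;
    }

    /* First byte of the stripe containing 'offset': what must be read back. */
    static uint64_t stripe_start(uint64_t offset)
    {
        return offset - (offset % STRIPE_SIZE);
    }

    int main(void)
    {
        /* The two sequential 1-byte writes from the example. */
        uint64_t offs[] = { 2560, 2561 };
        int i;

        for (i = 0; i < 2; i++) {
            if (needs_rmw(offs[i], 1))
                printf("write(offset=%llu, len=1): read %u bytes at %llu, "
                       "modify, re-encode, write back\n",
                       (unsigned long long)offs[i], STRIPE_SIZE,
                       (unsigned long long)stripe_start(offs[i]));
        }
        return 0;
    }

Both 1-byte writes re-read the same 2048-byte stripe at offset 2048; with
that stripe cached, the second write would find it in memory and skip the
network READ entirely.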
>> Just adding more details. The stripe will be in memory only when the
>> lock on the inode is active.
>>
>> I think that's ok.
>>
>> One thing we are yet to decide on is: do we want to read the stripe
>> every time we get the lock, or only after an extending write is
>> performed? I am thinking that keeping the stripe in memory just after an
>> extending write is better, as it doesn't involve an extra network
>> operation.
>>
>> I wouldn't read the last stripe unconditionally every time we lock the
>> inode. There's no benefit at all on random writes (in fact it's worse),
>> and a sequential write will issue the read anyway when needed. The only
>> difference is a small delay for the first operation after a lock.
>>
>> Yes, perfect.
>>
>> What I would do is to keep the last stripe of every write (we can
>> consider doing it per fd), even if it's not the last stripe of the file
>> (to also optimize sequential rewrites).
>>
>> Ah! Good point. But if we remember it per fd, one fd's cached data can
>> be over-written on disk by another fd, so we also need to do cache
>> invalidation.
>>
>> We only cache data if we have the inodelk, so all related fd's must be
>> from the same client, and we'll control all its writes, so cache
>> invalidation in this case is pretty easy.
>>
>> There exists the possibility of two fd's from the same client writing to
>> the same region. To control this we would need some range checking in
>> the writes, but all this is local, so it's easy to control.
>>
>> Anyway, this is probably not a common case, so we could start by caching
>> only the last stripe of the last write, ignoring the fd.
>>
>> Maybe the implementation should consider this possibility. Yet to think
>> about how to do this, but it is a good point. We should consider it.
>>
>> Maybe we could keep a list of cached stripes sorted by offset in the
>> inode (if the maximum number of entries is small, we could keep the list
>> unsorted). Each fd should store the offset of its last write. Cached
>> stripes should have a ref counter, just to account for the case where
>> two fd's point to the same offset.
>>
>> When a new write arrives, we check the offset stored in the fd and see
>> if it corresponds to a sequential write. If so, we look at the inode
>> list to find the cached stripe; otherwise we can release the cached
>> stripe.
>>
>> We can limit the number of cached entries and release the least recently
>> used one when we reach some maximum.
>>
>> Yeah, this works :-).
>> Ashish, can all of this be implemented by 3.12?
>>
>> One thing I've observed is that a 'dd' with a block size of 1MB gets
>> split into multiple 128KB blocks that are sent in parallel and not
>> necessarily processed in sequential order. This means that big block
>> sizes won't benefit much from this optimization, since they will be seen
>> as partially non-sequential writes. Anyway, the change won't hurt.
>>
>> In this case, as per the solution, we won't cache anything, right?
>> Because we didn't request anything from the disk. We will only keep data
>> in the cache if it is a non-aligned write at the current EOF. At least
>> that is what I had in mind.
>>
>> Suppose we are writing multiple 1MB blocks at offset 1. If each write is
>> split into 8 blocks of 128KB, no write will be aligned, and they can be
>> received in any order. Suppose the first write happens to be at offset
>> 128K + 1. We don't have anything cached, so we read the needed stripes
>> and cache the last one. Now the next write is at offset 1. In this case
>> we won't get any benefit from the previous write, since the stripe we
>> need is not cached. However, from the user's point of view the write is
>> sequential.
>>
>> It won't hurt, but it won't get the full benefit of the new caching
>> mechanism.
>>
>> As a mitigating factor, we could consider extending the previous
>> solution I've explained to allow caching multiple stripes per fd. A
>> small number like 8 would be enough.
>>
>> Xavi
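(A hedged sketch of the structure Xavi describes, in stand-alone C: a
per-inode list of refcounted decoded stripes with an LRU cap, plus a per-fd
last-offset check. Names and types are illustrative, not ec's actual code;
the limit of 8 follows the suggestion above.)

    /* stripe_cache.c - per-inode cache of decoded stripes, per the thread. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define STRIPE_SIZE 2048u   /* 4+2 config: 512 * 4 bytes of user data */
    #define CACHE_MAX   8u      /* "a small number like 8"                */

    struct stripe {
        struct stripe *next;    /* per-inode list, most recently used first */
        uint64_t offset;        /* stripe-aligned file offset               */
        int refs;               /* two fd's at the same offset share one    */
        unsigned char data[STRIPE_SIZE];   /* decoded user data             */
    };

    struct inode_cache {        /* only populated while the inodelk is held */
        struct stripe *mru;
        unsigned count;
    };

    struct fd_state {
        uint64_t next_offset;   /* end offset of this fd's last write       */
    };

    /* Find a cached stripe by aligned offset and move it to the front. */
    static struct stripe *cache_get(struct inode_cache *c, uint64_t aligned)
    {
        struct stripe **p;
        for (p = &c->mru; *p; p = &(*p)->next) {
            if ((*p)->offset == aligned) {
                struct stripe *s = *p;
                *p = s->next;             /* unlink ...              */
                s->next = c->mru;         /* ... and relink at front */
                c->mru = s;
                return s;
            }
        }
        return NULL;
    }

    /* Release the least recently used entry (the tail of the list). */
    static void cache_evict_lru(struct inode_cache *c)
    {
        struct stripe **p = &c->mru;
        if (!*p)
            return;
        while ((*p)->next)
            p = &(*p)->next;
        free(*p);
        *p = NULL;
        c->count--;
    }

    /* Before a write: if it continues where this fd left off, the boundary
     * stripe may be cached and the RMW read can be skipped.  NULL means
     * "read the stripe from the bricks as before". */
    static struct stripe *cache_check(struct inode_cache *c,
                                      struct fd_state *fd, uint64_t offset)
    {
        if (offset != fd->next_offset)    /* not sequential for this fd */
            return NULL;
        return cache_get(c, offset - offset % STRIPE_SIZE);
    }

    /* After a write ending at 'end' (> 0): remember the decoded stripe
     * that contains its last byte. */
    static void cache_store(struct inode_cache *c, struct fd_state *fd,
                            uint64_t end, const unsigned char *stripe_data)
    {
        uint64_t aligned = (end - 1) - (end - 1) % STRIPE_SIZE;
        struct stripe *s = cache_get(c, aligned);

        if (!s) {
            if (c->count >= CACHE_MAX)
                cache_evict_lru(c);       /* cap the number of stripes */
            s = calloc(1, sizeof(*s));
            if (!s)
                return;                   /* the cache is best-effort  */
            s->offset = aligned;
            s->refs = 1;
            s->next = c->mru;
            c->mru = s;
            c->count++;
        }
        memcpy(s->data, stripe_data, STRIPE_SIZE);
        fd->next_offset = end;            /* next sequential write here */
    }

A limit of 8 also lines up with the dd behaviour described above: a 1MB
block split into eight concurrent 128KB sub-writes can keep one boundary
stripe cached per sub-write, even when they arrive out of order.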
--
Pranith
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel