On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:
> On 16/06/17 10:51, Pranith Kumar Karampuri wrote:
>> On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez
>> <xhernan...@datalab.es> wrote:
>>> On 15/06/17 11:50, Pranith Kumar Karampuri wrote:
>>>> On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey
>>>> <aspan...@redhat.com> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> We have been facing some issues with disperse (EC) volumes.
>>>>> We know that EC is currently not good for random IO, as it requires a
>>>>> READ-MODIFY-WRITE fop cycle whenever the offset or offset+length falls
>>>>> in the middle of a stripe.
>>>>>
>>>>> Unfortunately, this can also happen with sequential writes.
>>>>> Consider an EC volume with a 4+2 configuration. The stripe size for
>>>>> this would be 512 * 4 = 2048, i.e. 2048 bytes of user data stored in
>>>>> one stripe.
>>>>> Let's say 2048 + 512 = 2560 bytes have already been written to this
>>>>> volume, so 512 bytes are in the second stripe.
>>>>> Now, if there is a sequential write at offset 2560 of size 1 byte, we
>>>>> have to read the whole stripe, encode it with that 1 byte, and then
>>>>> write it back.
>>>>> The next write, at offset 2561 with size 1 byte, will again
>>>>> READ-MODIFY-WRITE the whole stripe. This causes bad performance.
>>>>>
>>>>> There are tools and scenarios that produce this kind of load without
>>>>> the users being aware of it. Examples: fio and zip.
>>>>>
>>>>> Solution:
>>>>> One possible solution to deal with this issue is to keep the last
>>>>> stripe in memory.
>>>>> This way we need not read it again, and we save a READ fop going over
>>>>> the network.
>>>>> Considering the above example, we would have to keep the last 2048
>>>>> bytes (maximum) in memory per file. This should not be a big deal, as
>>>>> we already keep some data such as xattrs and size info in memory and
>>>>> take decisions based on it.
>>>>>
>>>>> Please provide your thoughts on this, and also any other solution you
>>>>> may have.
>>>>
>>>> Just adding more details.
>>>> The stripe will be in memory only while the lock on the inode is active.
>>>
>>> I think that's ok.
>>>
>>>> One thing we are yet to decide on is: do we want to read the stripe
>>>> every time we get the lock, or only after an extending write is
>>>> performed? I am thinking that keeping the stripe in memory just after
>>>> an extending write is better, as it doesn't involve an extra network
>>>> operation.
>>>
>>> I wouldn't read the last stripe unconditionally every time we lock the
>>> inode. There's no benefit at all for random writes (in fact it's worse),
>>> and a sequential write will issue the read anyway when needed. The only
>>> difference is a small delay for the first operation after a lock.
>>
>> Yes, perfect.
>>
>>> What I would do is keep the last stripe of every write (we can consider
>>> doing it per fd), even if it's not the last stripe of the file (to also
>>> optimize sequential rewrites).
>>
>> Ah! Good point. But if we remember it per fd, one fd's cached data can be
>> overwritten on disk by another fd, so we also need cache invalidation.
>
> We only cache data if we have the inodelk, so all related fd's must be
> from the same client, and we'll control all its writes, so cache
> invalidation in this case is pretty easy.
>
> There exists the possibility of two fd's from the same client writing to
> the same region. To control this we would need some range checking on the
> writes, but all of this is local, so it's easy to control.
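
To make the READ-MODIFY-WRITE condition from the 4+2 example above concrete,
here is a minimal, self-contained sketch. It is illustrative only, not code
from the EC xlator, and it ignores the case where the tail of a write
extends past EOF (which would not need a read):

/* Illustrative sketch only, not the EC xlator code. For a 4+2 volume the
 * fragment size is 512 bytes per data brick, so one stripe holds
 * 512 * 4 = 2048 bytes of user data. A write triggers READ-MODIFY-WRITE
 * whenever its start or end is not aligned to a stripe boundary. */
#include <stdio.h>
#include <inttypes.h>
#include <stdbool.h>

#define FRAGMENT_SIZE 512u                          /* bytes per data brick */
#define DATA_BRICKS   4u                            /* the "4" in 4+2 */
#define STRIPE_SIZE   (FRAGMENT_SIZE * DATA_BRICKS) /* 2048 bytes */

static bool needs_read_modify_write(uint64_t offset, uint64_t size)
{
    /* Head or tail of the write falls in the middle of a stripe. */
    return (offset % STRIPE_SIZE) != 0 || ((offset + size) % STRIPE_SIZE) != 0;
}

int main(void)
{
    /* The example from the thread: 2560 bytes already written, followed by
     * sequential 1-byte writes at offsets 2560, 2561, ... Every one of them
     * re-reads and re-encodes the whole 2048-byte stripe. */
    for (uint64_t offset = 2560; offset < 2564; offset++)
        printf("write(offset=%" PRIu64 ", size=1) -> %s\n", offset,
               needs_read_modify_write(offset, 1) ? "READ-MODIFY-WRITE"
                                                  : "aligned, no read needed");
    return 0;
}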
> Anyway, this is probably not a common case, so we could start by caching
> only the last stripe of the last write, ignoring the fd.
>
>> Maybe the implementation should consider this possibility. Yet to think
>> about how to do this, but it is a good point. We should consider it.
>
> Maybe we could keep a list of cached stripes in the inode, sorted by
> offset (if the maximum number of entries is small, we could keep the list
> unsorted). Each fd should store the offset of its last write. Cached
> stripes should have a ref counter, just to account for the case where two
> fd's point to the same offset.
>
> When a new write arrives, we check the offset stored in the fd to see
> whether it corresponds to a sequential write. If so, we look in the inode
> list for the cached stripe; otherwise we can release the cached stripe.
>
> We can limit the number of cached entries and release the least recently
> used one when we reach some maximum.

Yeah, this works :-). Ashish, can all of this be implemented by 3.12?

>>> One thing I've observed is that a 'dd' with a block size of 1MB gets
>>> split into multiple 128KB blocks that are sent in parallel and not
>>> necessarily processed in sequential order. This means that big block
>>> sizes won't benefit much from this optimization, since they will be seen
>>> as partially non-sequential writes. Anyway, the change won't hurt.
>>
>> In this case, as per the solution, we won't cache anything, right?
>> Because we didn't request anything from the disk. We will only keep data
>> in the cache if it is an unaligned write at the current EOF. At least
>> that is what I had in mind.
>
> Suppose we are writing multiple 1MB blocks starting at offset 1. If each
> write is split into 8 blocks of 128KB, all writes will be unaligned and
> can be received in any order. Suppose the first write happens to land at
> offset 128K + 1. We don't have anything cached, so we read the needed
> stripes and cache the last one. Now the next write is at offset 1. In this
> case we won't get any benefit from the previous write, since the stripe we
> need is not cached, even though the write is sequential from the user's
> point of view.
>
> It won't hurt, but it won't get the full benefit of the new caching
> mechanism.
>
> As a mitigating factor, we could consider extending the solution I've
> explained above to allow caching multiple stripes per fd. A small number
> like 8 would be enough.
>
> Xavi
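
For what it's worth, here is a rough sketch of how the cache Xavi describes
above could be structured. All names (ec_stripe_cache, ec_cached_stripe,
ec_fd_ctx) are hypothetical and not taken from the EC xlator; the point is
just the per-fd sequential-write detection plus a small, ref-counted,
LRU-limited list of cached stripes hanging off the inode:

/* Hypothetical sketch of the caching idea discussed above. The inode holds
 * a small MRU-ordered list of decoded stripes, each fd remembers where its
 * last write ended, and a write is served from the cache only when it
 * continues sequentially from that point. */
#include <stddef.h>
#include <stdint.h>

#define STRIPE_SIZE          2048  /* 4+2 example: 512 * 4 bytes */
#define MAX_CACHED_STRIPES      8  /* small per-inode limit, LRU evicted */

struct ec_cached_stripe {
    uint64_t offset;                /* stripe-aligned file offset */
    char     data[STRIPE_SIZE];     /* decoded user data of this stripe */
    int      refs;                  /* two fd's may point at the same stripe */
    struct ec_cached_stripe *next;  /* MRU-first singly linked list */
};

struct ec_stripe_cache {            /* kept in the inode context while the
                                     * inodelk is held; dropped on unlock */
    struct ec_cached_stripe *mru;
    int count;
};

struct ec_fd_ctx {
    uint64_t next_offset;           /* offset right after this fd's last write */
};

/* Find a cached stripe and move it to the front of the MRU list. */
static struct ec_cached_stripe *
stripe_cache_lookup(struct ec_stripe_cache *cache, uint64_t stripe_offset)
{
    struct ec_cached_stripe **pp;

    for (pp = &cache->mru; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->offset == stripe_offset) {
            struct ec_cached_stripe *hit = *pp;
            *pp = hit->next;        /* unlink and re-link at the front */
            hit->next = cache->mru;
            cache->mru = hit;
            return hit;
        }
    }
    return NULL;
}

/* Called for every write: returns the cached stripe if the write continues
 * sequentially from the fd's previous write, NULL on a cache miss (the
 * caller then reads the stripe from the bricks, inserts it, and evicts the
 * least recently used entry once MAX_CACHED_STRIPES is exceeded). */
static struct ec_cached_stripe *
stripe_cache_on_write(struct ec_stripe_cache *cache, struct ec_fd_ctx *fd_ctx,
                      uint64_t offset, uint64_t size)
{
    uint64_t stripe_offset = offset - (offset % STRIPE_SIZE);
    struct ec_cached_stripe *hit = NULL;

    if (offset == fd_ctx->next_offset)      /* sequential continuation */
        hit = stripe_cache_lookup(cache, stripe_offset);

    fd_ctx->next_offset = offset + size;    /* remember where this write ends */
    return hit;
}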
--
Pranith

_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel