Ashish, Xavi, I think it is better to implement this change as a separate read-after-write caching xlator that we can load between EC and the client xlator. That way EC does not take on more functionality than necessary, and maybe this xlator can be reused elsewhere in the stack if possible.
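To make the placement concrete, a rough client-side volfile fragment could look like the one below. The xlator type "performance/last-stripe-cache" and all volume names are made up for illustration (no such translator exists today), and only one of the six bricks of a 4+2 volume is shown; the other five bricks would follow the same client -> cache pattern.

volume patchy-client-0
    type protocol/client
    option remote-host server0
    option remote-subvolume /bricks/brick0
end-volume

volume patchy-last-stripe-cache-0
    type performance/last-stripe-cache
    subvolumes patchy-client-0
end-volume

volume patchy-disperse-0
    type cluster/disperse
    option redundancy 2
    subvolumes patchy-last-stripe-cache-0 patchy-last-stripe-cache-1 patchy-last-stripe-cache-2 patchy-last-stripe-cache-3 patchy-last-stripe-cache-4 patchy-last-stripe-cache-5
end-volume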
On Fri, Jun 16, 2017 at 4:19 PM, Ashish Pandey <aspan...@redhat.com> wrote:

> I think it should be done, as we have agreement on the basic design.
>
> ------------------------------
> *From: *"Pranith Kumar Karampuri" <pkara...@redhat.com>
> *To: *"Xavier Hernandez" <xhernan...@datalab.es>
> *Cc: *"Ashish Pandey" <aspan...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org>
> *Sent: *Friday, June 16, 2017 3:50:09 PM
> *Subject: *Re: [Gluster-devel] Disperse volume : Sequential Writes
>
> On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:
>
>> On 16/06/17 10:51, Pranith Kumar Karampuri wrote:
>>
>>> On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:
>>>
>>> On 15/06/17 11:50, Pranith Kumar Karampuri wrote:
>>>
>>> On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey <aspan...@redhat.com> wrote:
>>>
>>> Hi All,
>>>
>>> We have been facing some issues with disperse (EC) volumes.
>>> We know that EC is currently not good for random IO, because it requires a
>>> READ-MODIFY-WRITE fop cycle whenever an offset or offset+length falls in
>>> the middle of a stripe.
>>>
>>> Unfortunately, this can also happen with sequential writes.
>>> Consider an EC volume with a 4+2 configuration. The stripe size for this
>>> would be 512 * 4 = 2048; that is, 2048 bytes of user data are stored in
>>> one stripe.
>>> Let's say 2048 + 512 = 2560 bytes have already been written to this
>>> volume, so 512 bytes are in the second stripe.
>>> Now, if a sequential write arrives at offset 2560 with a size of 1 byte,
>>> we have to read the whole stripe, encode it with that 1 byte, and then
>>> write it back again.
>>> The next write, at offset 2561 with a size of 1 byte, will again
>>> READ-MODIFY-WRITE the whole stripe. This causes bad performance.
>>>
>>> There are tools and scenarios where this kind of load shows up and users
>>> are not aware of it. Examples: fio and zip.
>>>
>>> Solution:
>>> One possible solution to this issue is to keep the last stripe in memory.
>>> This way, we don't need to read it again and we save a READ fop going
>>> over the network.
>>> Considering the above example, we would have to keep the last 2048 bytes
>>> (maximum) in memory per file. This should not be a big deal, as we
>>> already keep some data, such as xattrs and size info, in memory and take
>>> decisions based on it.
>>>
>>> Please provide your thoughts on this, and also share any other solution
>>> you may have.
>>>
>>> Just adding more details:
>>> the stripe will be in memory only while the lock on the inode is active.
>>>
>>> I think that's ok.
>>>
>>> One thing we are yet to decide is whether we want to read the stripe
>>> every time we get the lock or only after an extending write is performed.
>>> I think keeping the stripe in memory just after an extending write is
>>> better, as it doesn't involve an extra network operation.
>>>
>>> I wouldn't read the last stripe unconditionally every time we lock the
>>> inode. There's no benefit at all for random writes (in fact it's worse),
>>> and a sequential write will issue the read anyway when needed. The only
>>> difference is a small delay for the first operation after a lock.
>>>
>>> Yes, perfect.
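For illustration only, here is a small standalone C sketch (not ec-xlator code; the constants simply mirror the 4+2 example above) of the alignment check that makes each of these 1-byte appends a full READ-MODIFY-WRITE unless the tail stripe is already cached:

/* Standalone sketch (not ec-xlator code) of the arithmetic in the example
 * above: with 4 data bricks of 512-byte fragments, one stripe holds 2048
 * bytes of user data, and any write that does not start and end on a stripe
 * boundary forces a READ-MODIFY-WRITE unless the tail stripe is cached. */
#include <inttypes.h>
#include <stdio.h>

#define FRAGMENT_SIZE 512                           /* bytes per brick      */
#define DATA_BRICKS   4                             /* the "4" in 4+2       */
#define STRIPE_SIZE   (FRAGMENT_SIZE * DATA_BRICKS) /* 2048 bytes           */

/* A write must read existing data first when it doesn't cover whole stripes. */
static int
needs_read_modify_write(uint64_t offset, uint64_t size)
{
        return (offset % STRIPE_SIZE) != 0 || ((offset + size) % STRIPE_SIZE) != 0;
}

int
main(void)
{
        /* The example from the mail: 2560 bytes already written, then the
         * application appends one byte at a time. */
        uint64_t offset;

        for (offset = 2560; offset < 2564; offset++) {
                uint64_t stripe_start = offset - (offset % STRIPE_SIZE);

                printf("write(offset=%" PRIu64 ", size=1): stripe [%" PRIu64
                       ", %" PRIu64 ") %s\n", offset, stripe_start,
                       stripe_start + STRIPE_SIZE,
                       needs_read_modify_write(offset, 1) ?
                       "read back unless cached" : "fully overwritten");
        }
        return 0;
}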
>>>
>>> What I would do is to keep the last stripe of every write (we can
>>> consider doing it per fd), even if it's not the last stripe of the file
>>> (to also optimize sequential rewrites).
>>>
>>> Ah! Good point. But if we remember it per fd, one fd's cached data can be
>>> overwritten on disk by another fd, so we also need cache invalidation.
>>
>> We only cache data if we hold the inodelk, so all related fd's must be
>> from the same client and we control all its writes, so cache invalidation
>> in this case is pretty easy.
>>
>> There is also the possibility of two fd's from the same client writing to
>> the same region. To handle this we would need some range checking on the
>> writes, but all of this is local, so it's easy to control.
>>
>> Anyway, this is probably not a common case, so we could start by caching
>> only the last stripe of the last write, ignoring the fd.
>>
>>> Maybe the implementation should consider this possibility.
>>> I'm yet to think about how to do this, but it is a good point. We should
>>> consider it.
>>
>> Maybe we could keep a list of cached stripes in the inode, sorted by
>> offset (if the maximum number of entries is small, we could keep the list
>> unsorted). Each fd should store the offset of its last write. Cached
>> stripes should have a ref counter to account for the case where two fd's
>> point to the same offset.
>>
>> When a new write arrives, we check the offset stored in the fd to see if
>> it corresponds to a sequential write. If so, we look at the inode list to
>> find the cached stripe; otherwise we can release the cached stripe.
>>
>> We can limit the number of cached entries and release the least recently
>> used one when we reach some maximum.
>
> Yeah, this works :-).
> Ashish,
> Can all of this be implemented by 3.12?
>
>>> One thing I've observed is that a 'dd' with a block size of 1MB gets
>>> split into multiple 128KB blocks that are sent in parallel and not
>>> necessarily processed in sequential order. This means that big block
>>> sizes won't benefit much from this optimization, since they will be seen
>>> as partially non-sequential writes. Anyway, the change won't hurt.
>>>
>>> In this case, as per the solution, we won't cache anything, right?
>>> Because we didn't request anything from the disk. We will only keep data
>>> in the cache if it is an unaligned write at the current EOF. At least
>>> that is what I had in mind.
>>
>> Suppose we are writing multiple 1MB blocks starting at offset 1. If each
>> write is split into 8 blocks of 128KB, none of the writes will be aligned
>> and they can be received in any order. Suppose the first write happens to
>> be at offset 128K + 1. We don't have anything cached, so we read the
>> needed stripes and cache the last one. Now the next write is at offset 1.
>> In this case we won't get any benefit from the previous write, since the
>> stripe we need is not cached. However, from the user's point of view the
>> write is sequential.
>>
>> It won't hurt, but it won't take full advantage of the new caching
>> mechanism.
>>
>> As a mitigating factor, we could consider extending the solution I
>> explained above to allow caching multiple stripes per fd. A small number
>> like 8 would be enough.
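A minimal standalone sketch of the cache idea discussed above, under the assumption of a small per-inode LRU list plus a per-fd "end of last write" offset; every name here is hypothetical (this is not actual ec code), and the ref counter, locking and release-on-non-sequential-write details from the discussion are omitted for brevity:

/* Standalone sketch (hypothetical names, not actual ec code): a small
 * per-inode LRU list of cached stripes plus a per-fd "end of last write"
 * offset used to detect sequential writes. Refcounting, locking and
 * invalidation on non-sequential writes are omitted for brevity. */
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define STRIPE_SIZE 2048        /* 4+2 configuration: 4 * 512 bytes */
#define MAX_CACHED  8           /* small per-inode limit            */

struct cached_stripe {
        uint64_t              offset;   /* stripe-aligned offset        */
        char                  data[STRIPE_SIZE];
        struct cached_stripe *next;     /* list head = most recent      */
};

struct inode_cache { struct cached_stripe *mru; int count; };
struct fd_ctx      { uint64_t next_offset; };  /* end of fd's last write */

/* Find a cached stripe and move it to the front of the LRU list. */
static struct cached_stripe *
cache_lookup(struct inode_cache *ic, uint64_t stripe_offset)
{
        struct cached_stripe **pp, *hit;

        for (pp = &ic->mru; *pp != NULL; pp = &(*pp)->next) {
                if ((*pp)->offset != stripe_offset)
                        continue;
                hit = *pp;
                *pp = hit->next;
                hit->next = ic->mru;
                ic->mru = hit;
                return hit;
        }
        return NULL;
}

/* Remember the stripe touched by a completed write; evict the least
 * recently used entry when over the limit. */
static void
cache_store(struct inode_cache *ic, uint64_t stripe_offset, const char *data)
{
        struct cached_stripe *entry = cache_lookup(ic, stripe_offset);
        struct cached_stripe **pp;

        if (entry == NULL) {
                entry = calloc(1, sizeof(*entry));
                if (entry == NULL)
                        return;
                entry->offset = stripe_offset;
                entry->next = ic->mru;
                ic->mru = entry;
                ic->count++;
        }
        memcpy(entry->data, data, STRIPE_SIZE);

        if (ic->count > MAX_CACHED) {               /* drop the LRU tail */
                for (pp = &ic->mru; (*pp)->next != NULL; pp = &(*pp)->next)
                        ;
                free(*pp);
                *pp = NULL;
                ic->count--;
        }
}

/* On a new write: if it continues where this fd's previous write ended,
 * the tail stripe may already be cached and no READ fop is needed. */
static struct cached_stripe *
on_write(struct inode_cache *ic, struct fd_ctx *fctx, uint64_t offset,
         uint64_t size)
{
        struct cached_stripe *hit = NULL;

        if (offset == fctx->next_offset)            /* sequential write  */
                hit = cache_lookup(ic, offset - (offset % STRIPE_SIZE));
        fctx->next_offset = offset + size;
        return hit;                                 /* NULL => must READ */
}

int
main(void)
{
        struct inode_cache ic = { NULL, 0 };
        struct fd_ctx fctx = { 0 };
        char stripe[STRIPE_SIZE] = { 0 };
        uint64_t offset;

        /* Sequential 1-byte appends from the example above: the first
         * write misses and must read, the following ones hit the cache. */
        for (offset = 2560; offset < 2564; offset++) {
                printf("offset %" PRIu64 ": %s\n", offset,
                       on_write(&ic, &fctx, offset, 1) ?
                       "cache hit, READ avoided" : "cache miss, READ needed");
                cache_store(&ic, offset - (offset % STRIPE_SIZE), stripe);
        }
        return 0;
}

The small fixed limit and LRU eviction follow the suggestion above of capping the entries per fd/inode at something like 8, so memory use stays bounded even with many open files.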
>>
>> Xavi
>>
>>> Xavi
>>>
>>> ---
>>> Ashish
>>>
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>> --
>>> Pranith
>>>
>>> --
>>> Pranith
>
> --
> Pranith

--
Pranith
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel