Hi Rakesh,

Thanks for sharing your thoughts and updates.
(a) In your last email, I'm sure you meant "... submitting read requests to
fetch *any* k chunks (out of the k+m-x surviving chunks)", rather than all of
them? Do you have any optimization in place to decide which data nodes will
serve those k chunks?

(b) Is any caching being done (as proposed for QFS in the previously attached
"PPR" paper)?

(c) When you mentioned that striping is being done, I assume it is probably to
reduce the chunk sizes and hence k*c? Now, if my objects are large (e.g. super
HD images), so that I would have to read from multiple stripes to rebuild an
image before I can display it to the client, do you think striping would still
help? Is it possible that, since I know all the segments of an HD image will
always be read together, striping and distributing it across different nodes
ignores its spatial/temporal locality and further increases the associated
delays?

Just wanted to know your thoughts. I am looking forward to the future
performance improvements in HDFS.

Regards,
R.

On Fri, Jul 22, 2016 at 8:52 AM, Rakesh Radhakrishnan <rake...@apache.org> wrote:

> I'm adding one more point to the above. In my previous mail reply, I
> explained the striped block reconstruction task, which is triggered by the
> Namenode on identifying a missing/bad block. Similarly, in the case of an
> HDFS client read failure, the client currently submits read requests
> internally to fetch all 'k' chunks (belonging to the same stripe as the
> failed chunk) from k data nodes, and performs the decoding to rebuild the
> lost data chunk on the client side.
>
> Regards,
> Rakesh
>
> On Fri, Jul 22, 2016 at 5:43 PM, Rakesh Radhakrishnan <rake...@apache.org> wrote:
>
>> Hi Roy,
>>
>> Thanks for your interest in the HDFS erasure coding feature, and for
>> helping us make it more attractive to users by sharing performance
>> improvement ideas.
>> Presently, the reconstruction work is implemented in a centralized
>> manner, in which the reconstruction task is given to one data node (the
>> first in the pipeline). For example, with a (k, m) erasure code schema,
>> if one chunk (say c bytes) is lost because of a disk or server failure,
>> then k * c bytes of data need to be retrieved from k servers to recover
>> the lost data. The reconstructing data node fetches k chunks (belonging
>> to the same stripe as the failed chunk) from k different servers and
>> performs the decoding to rebuild the lost data chunk. Yes, this k-fold
>> increase in network traffic causes reconstruction to be very slow. IIUC,
>> this point came up during implementation, but I think priority was given
>> to supporting the basic functionality first. I could see quite a few
>> JIRA tasks (HDFS-7717, HDFS-7344) discussing distributing the coding
>> work across data nodes, covering converting a file to a striped layout,
>> reconstruction, error handling, etc. But I feel there is still room for
>> discussing/implementing new approaches to get better performance
>> results.
>>
>> The shared doc mentions that the Partial-Parallel-Repair technique was
>> successfully implemented on top of the Quantcast File System (QFS) [30],
>> which supports RS-based erasure-coded storage, with promising results.
>> That's really encouraging for us. I haven't gone through the doc deeply;
>> it would be really great if you (or I, or some other folks) could come
>> up with thoughts on discussing/implementing similar mechanisms in HDFS
>> as well. Most likely, we will kick-start the performance improvement
>> activities after the much-awaited 3.0.0-alpha release :)
>>
>>>> Also, I would like to know what others have done to sustain good
>>>> performance even under failures (other than keeping fail-over
>>>> replicas).
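The centralized-versus-tree trade-off described above can be sketched in a few lines. This is a hedged toy illustration, not HDFS code: plain XOR stands in for the Galois-field multiply-and-add of a real Reed-Solomon decode, and every name and number below is made up. The point it shows is structural: combining partial results pairwise up a binary tree (the PPR idea) yields the same chunk as the flat combine, while the busiest node receives roughly ceil(log2 k) chunks instead of all k.

```python
# Toy sketch of conventional repair vs. PPR-style tree aggregation.
# XOR stands in for GF(2^8) arithmetic; all names/values are illustrative.
import math
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def flat_repair(chunks):
    """Conventional repair: one node pulls all k chunks and combines them."""
    return reduce(xor_bytes, chunks)

def tree_repair(chunks):
    """PPR-style repair: combine partial results pairwise up a binary tree."""
    level = list(chunks)
    while len(level) > 1:
        nxt = [xor_bytes(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element carries up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

k, c = 6, 1024 * 1024                    # 6 surviving chunks of 1 MiB (toy)
chunks = [bytes([i]) * c for i in range(k)]

# XOR is associative/commutative, so both schedules rebuild the same chunk.
assert tree_repair(chunks) == flat_repair(chunks)

print(k * c)                             # conventional: busiest node gets k * c bytes
print(math.ceil(math.log2(k)) * c)       # tree: about ceil(log2 k) * c bytes
```

The traffic totals across the cluster are similar in both schemes; what the tree changes is the bottleneck link into the reconstructing node.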
>> I don't have much insight into this part; probably some other folks can
>> pitch in and share their thoughts.
>>
>> Regards,
>> Rakesh
>>
>> On Fri, Jul 22, 2016 at 2:03 PM, Roy Leonard <roy.leonard...@gmail.com> wrote:
>>
>>> Greetings!
>>>
>>> We are evaluating erasure coding on HDFS to reduce storage cost.
>>> However, degraded read latency seems like a crucial bottleneck for our
>>> system. After exploring some strategies for alleviating the pain of
>>> degraded reads, I found that a "tree-like recovery" technique might be
>>> useful, as described in the following paper:
>>>
>>> "Partial-parallel-repair (PPR): a distributed technique for repairing
>>> erasure coded storage" (EuroSys 2016)
>>> http://dl.acm.org/citation.cfm?id=2901328
>>>
>>> My questions are:
>>>
>>> Do you already have such a tree-like recovery implemented in HDFS-EC?
>>> If not, do you have any plans to add a similar technique in the near
>>> future?
>>>
>>> Also, I would like to know what others have done to sustain good
>>> performance even under failures (other than keeping fail-over
>>> replicas).
>>>
>>> Regards,
>>> R.
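The client-side degraded read discussed throughout this thread (fetch the k surviving chunks of the stripe, then decode the missing one) can be illustrated with a toy single-parity code. This is a hedged sketch, not HDFS code: a real RS(k, m) decode uses Galois-field arithmetic and works for up to m losses, whereas with one XOR parity any single lost chunk is simply the XOR of the k survivors. All names here are illustrative.

```python
# Toy degraded read with one XOR parity chunk (a stand-in for RS(k, m)).
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

k = 4
data = [bytes([i + 1]) * 8 for i in range(k)]   # k toy 8-byte data chunks
parity = reduce(xor_bytes, data)                # single parity chunk

# Chunk 2 is unreadable: the client fetches the remaining k chunks of the
# stripe (3 data + 1 parity) and decodes the missing one locally.
survivors = [c for i, c in enumerate(data) if i != 2] + [parity]
rebuilt = reduce(xor_bytes, survivors)
assert rebuilt == data[2]
```

Note the cost this makes visible: a read that would normally transfer one chunk transfers k of them on the degraded path, which is exactly the amplification the tree-like (PPR) repair schedule aims to spread across nodes.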