Hi Roy,

>>>> (a) In your last email, I am sure you meant => "... submitting read
requests to fetch "any" (instead of all) the 'k' chunk (out of k+m-x
surviving chunks)  ?
>>>> Do you have any optimization in place to decide which data-nodes will
be part of those "k" ?

Answer:-
I hope you already know the write path; I'm just adding a few details here to
support the read explanation. While writing to an EC file, the DFS client
writes the data stripe by stripe (e.g. with a 64KB cell size) across multiple
datanodes. For a (k, m) schema, the client writes the data cells to the first
k datanodes and the parity cells to the remaining m datanodes, so one stripe
carries (k * cellSize + m * cellSize) bytes. While reading, the client fetches
in the same order, stripe by stripe. The datanodes holding data blocks are
read first, rather than the datanodes holding parity blocks, because that
requires less EC block reconstruction work. Internally, the DFS client reads
the stripes one by one and contacts the k data datanodes in parallel for each
stripe. If there are any failures, it contacts the parity datanodes and
performs the reconstruction on the fly. 'DFSStripedInputStream' supports both
positional reads and reading an entire buffer (e.g. a buffer the size of the
file).
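
For reference, here is a minimal sketch of how a client would read an EC file
through the normal FileSystem API; the striped input stream is picked
transparently, so no EC-specific code is needed. The path and buffer sizes
below are just placeholders.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EcReadExample {
  public static void main(String[] args) throws IOException {
    // Hypothetical file under an erasure-coded directory.
    Path file = new Path("/ec/images/sample.jpg");

    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf);
         FSDataInputStream in = fs.open(file)) {

      // Stateful read: for an EC file the client reads stripe by stripe,
      // fetching cells from the k data datanodes in parallel underneath.
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) > 0) {
        // consume n bytes ...
      }

      // Positional read is also supported: fetch bytes starting at an
      // arbitrary logical offset without changing the stream position.
      byte[] header = new byte[4096];
      in.readFully(0L, header, 0, header.length);
    }
  }
}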

>>>> (b) Are there any caching being done (as proposed for QFS in the
previously attached "PPR" paper) ?
Answer:-
HDFS-9879 is an open jira to discuss caching of striped blocks at the
datanode. Perhaps caching logic similar to QFS could be used, and
reconstruction could then prefer the datanodes that already have the data
cached in memory. This is an open improvement task as of now.

>>>> (c) When you mentioned stripping is being done, I assume it is
probably to reduce the chunk sizes and hence k*c ?
Answer:-
Yes, striping is done by dividing the block into several chunks; we call each
chunk a cell, and its size is the cellSize (e.g. 64KB). (k * c + m * c) is one
stripe, and a block group comprises several stripes. I'd suggest reading the
blog -
http://blog.cloudera.com/blog/2015/09/introduction-to-hdfs-erasure-coding-in-apache-hadoop/
- to understand more about the stripe, cell and block group terminology before
reading the answer below.
   blk_0       blk_1       blk_2
     |           |           |
     v           v           v
  +------+    +------+    +------+
  |cell_0|    |cell_1|    |cell_2|
  +------+    +------+    +------+
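
To make the cell/stripe mapping concrete, below is a rough, self-contained
sketch (not actual HDFS code, and it ignores block-group boundaries) of how a
logical byte offset maps onto stripes and cells, assuming an RS(6,3) schema
with a 64KB cell:

public class StripeLayoutSketch {
  // Assumed schema and cell size; both are configurable in practice.
  static final int K = 6;                   // data blocks per group
  static final int M = 3;                   // parity blocks per group
  static final long CELL_SIZE = 64 * 1024;  // 64KB cell

  public static void main(String[] args) {
    long offset = 1_000_000L;  // arbitrary logical offset within the file

    long cellIndex    = offset / CELL_SIZE;  // which cell overall
    long stripeIndex  = cellIndex / K;       // which stripe it belongs to
    long dataBlockIdx = cellIndex % K;       // which of the k data blocks holds it
    long offsetInCell = offset % CELL_SIZE;

    System.out.printf("offset %d -> stripe %d, data block %d, offset in cell %d%n",
        offset, stripeIndex, dataBlockIdx, offsetInCell);

    // One full stripe carries K * CELL_SIZE bytes of user data plus
    // M * CELL_SIZE bytes of parity, matching (k * c + m * c) above.
  }
}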

>>>> Now, if my object sizes are large (e.g. super HD images) where I would
have to get data from multiple stripes to rebuild the images before I can
display to the
>>>> client, do you think stripping would still help ?
>>>> Is there a possibility that since I know that all the segments of the
HD image would always be read together, by stripping and distributing it on
different nodes, I am ignoring
>>>> its special/temporal locality and further increase any associated
delays ?

Answer:-
Since for each stripe the client contacts all k data datanodes, slow or dead
datanodes in any of the stripes will hurt read performance. AFAIK, a
contiguous layout is more suitable for large files; that will be supported in
phase-2, and the design discussions are still going on, please see the
HDFS-8030 jira. On the other hand, in theory the striped layout has the
benefit of letting the client work with multiple datanodes in parallel,
greatly enhancing the aggregate throughput (assuming all datanodes are healthy
servers). But this needs to be tested in your cluster to understand the
impact.
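
As a rough back-of-envelope illustration of that parallelism (the schema, cell
size and image size below are assumptions, not recommendations):

public class LargeReadEstimate {
  public static void main(String[] args) {
    int k = 6, m = 3;                     // assumed RS(6,3) schema
    long cellSize = 64 * 1024;            // assumed 64KB cell
    long imageSize = 12L * 1024 * 1024;   // e.g. a 12MB high-resolution image

    long stripeDataSize = k * cellSize;   // user data carried by one stripe
    long stripes = (imageSize + stripeDataSize - 1) / stripeDataSize;

    System.out.println("stripes to read           : " + stripes);  // 32
    System.out.println("datanodes read per stripe : " + k + " (in parallel)");
    System.out.println("extra datanodes on failure: up to " + m + " parity nodes");
  }
}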


Thanks,
Rakesh
Intel

On Sun, Jul 24, 2016 at 12:00 PM, Roy Leonard <roy.leonard...@gmail.com>
wrote:

> Hi Rakesh,
>
> Thanks for sharing your thoughts and updates.
>
> (a) In your last email, I am sure you meant => "... submitting read
> requests to fetch "any" (instead of all) the 'k' chunk (out of k+m-x
> surviving chunks)  ?
> Do you have any optimization in place to decide which data-nodes will be
> part of those "k" ?
>
> (b) Are there any caching being done (as proposed for QFS in the previously
> attached "PPR" paper) ?
>
> (c) When you mentioned stripping is being done, I assume it is probably to
> reduce the chunk sizes and hence k*c ?
> Now, if my object sizes are large (e.g. super HD images) where I would have
> to get data from multiple stripes to rebuild the images before I can
> display to the client, do you think stripping would still help ?
> Is there a possibility that since I know that all the segments of the HD
> image would always be read together, by stripping and distributing it on
> different nodes, I am ignoring its special/temporal locality and further
> increase any associated delays ?
>
> Just wanted to know your thoughts.
> I am looking forward to the future performance improvements in HDFS.
>
> Regards,
> R.
>
> On Fri, Jul 22, 2016 at 8:52 AM, Rakesh Radhakrishnan <rake...@apache.org>
> wrote:
>
> > I'm adding one more point to the above. In my previous mail reply, I've
> > explained the striped block reconstruction task which will be triggered
> by
> > the Namenode on identifying a missing/bad block. Similarly, in case of
> hdfs
> > client read failure, currently hdfs client internally submitting read
> > requests to fetch all the 'k' chunks(belonging to the same stripe as the
> > failed chunk) from k data nodes and perform decoding to rebuild the lost
> > data chunk at the client side.
> >
> > Regards,
> > Rakesh
> >
> > On Fri, Jul 22, 2016 at 5:43 PM, Rakesh Radhakrishnan <
> rake...@apache.org>
> > wrote:
> >
> >> Hi Roy,
> >>
> >> Thanks for the interest in hdfs erasure coding feature and helping us in
> >> making the feature more attractive to the users by sharing performance
> >> improvement ideas.
> >>
> >> Presently, the reconstruction work has been implemented in a centralized
> >> manner in which the reconstruction task will be given to one data
> >> node(first in the pipeline). For example, we have (k, m) erasure code
> >> schema, assume one chunk (say c bytes) is lost because of a disk or
> server
> >> failure, k * c bytes of data need to be retrieved from k servers to
> recover
> >> the lost data. The reconstructing data node will fetch k chunks
> (belonging
> >> to the same stripe as the failed chunk) from k different servers and
> >> perform decoding to rebuild the lost data chunk. Yes, this k-factor
> >> increases the network traffic causes reconstruction to be very slow.
> IIUC,
> >> during the implementation time this point has come up but I think the
> >> priority has given for supporting the basic functionality first. I could
> >> see quite few jira tasks HDFS-7717, HDFS-7344 where it discussed about
> >> distributing the coding works to data nodes which includes - converting
> a
> >> file to a striped layout, reconstruction, error handling etc. But I
> feel,
> >> there is still room for discussing/implementing new approaches to get
> >> better performance results.
> >>
> >> In the shared doc, its mentioned that Partial-Parallel-Repair technique
> >> is successfully implemented on top of the Quantcast File System (QFS)
> [30],
> >> which supports RS-based erasure coded storage and got promising results.
> >> Its really an encouraging factor for us. I haven't gone through this doc
> >> deeply, it would be really great if you (or me or some other folks)
> could
> >> come up with the thoughts to discuss/implement similar mechanisms in
> HDFS
> >> as well. Mostly, will kick start the performance improvement activities
> >> after the much awaiting 3.0.0-alpha release:)
> >>
> >> >>>> Also, I would like to know what others have done to sustain good
> >> >>>> performance even under failures (other than keeping fail-over
> >> replicas).
> >> I'm not having much idea about this part, probably some other folks can
> >> pitch in and share thoughts.
> >>
> >> Regards,
> >> Rakesh
> >>
> >> On Fri, Jul 22, 2016 at 2:03 PM, Roy Leonard <roy.leonard...@gmail.com>
> >> wrote:
> >>
> >>> Greetings!
> >>>
> >>> We are evaluating erasure coding on HDFS to reduce storage cost.
> >>> However, the degraded read latency seems like a crucial bottleneck for
> >>> our
> >>> system.
> >>> After exploring some strategies for alleviating the pain of degraded
> read
> >>> latency,
> >>> I found a "tree-like recovery" technique might be useful, as described
> in
> >>> the following paper:
> >>> "Partial-parallel-repair (PPR): a distributed technique for repairing
> >>> erasure coded storage" (Eurosys-2016)
> >>> http://dl.acm.org/citation.cfm?id=2901328
> >>>
> >>> My question is:
> >>>
> >>> Do you already have such tree-like recovery implemented in HDFS-EC if
> >>> not,
> >>> do you have any plans to add similar technique is near future ?
> >>>
> >>> Also, I would like to know what others have done to sustain good
> >>> performance even under failures (other than keeping fail-over
> replicas).
> >>>
> >>> Regards,
> >>> R.
> >>>
> >>
> >>
> >
>
