[jira] [Commented] (HADOOP-11867) FS API: Add a high-performance vectored Read to FSDataInputStream API

Owen O'Malley (JIRA) Wed, 05 Dec 2018 14:58:07 -0800


    [ 
https://issues.apache.org/jira/browse/HADOOP-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710718#comment-16710718
 ]


Owen O'Malley commented on HADOOP-11867:
----------------------------------------

Currently, the implementation of PositionedReadable.readFully(long, byte[], 
int, int) locks the stream, so multiple reads are not processed in parallel 
unless a file system provides a specialized implementation. For the REST-based 
file systems, the goal is absolutely to convert the call into a single read 
with multiple ranges in the request.

I agree completely that implementing a prototype is a good first step, before 
locking down the exact semantics. My current thoughts:
 * You have no guarantees about the order in which the results are returned.
 * If the file system has mutable files, it is the application's responsibility 
to perform adequate locking prior to calling the read operations. (So yes, you 
get no guarantees about consistency of reads.) Since this case doesn't apply to 
the vast majority of users, I wouldn't want to complicate the API for it.
 * Overlapping ranges are permitted.
 * It is up to the file system whether the reads lock the stream. The current 
implementation locks because it uses seek/read/seek to implement the 
positioned read. We should allow implementations to do it more directly. I 
don't think there should be any guarantees about ordering between async reads 
or sync reads on the same stream.
 * Since the future contains the FileRange that the caller passed in, callers 
could pass in an extension that tracks any additional information they need.
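
To make the points above concrete, here is a minimal Java sketch. It is not the Hadoop API and not the prototype: FileRange's fields, TaggedRange, and readAsync are hypothetical names chosen for illustration, and an in-memory byte array stands in for the file. It only shows futures that hand back the caller's own range object, overlapping ranges, and reads that share no stream position and therefore need no lock.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch only; none of these names are taken from Hadoop.
final class VectoredReadSketch {

  /** Hypothetical range descriptor: where to read and how many bytes. */
  static class FileRange {
    final long offset;
    final int length;
    ByteBuffer data; // set by the reader before the future completes

    FileRange(long offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /** Caller-defined extension that carries extra bookkeeping (here, a tag). */
  static class TaggedRange extends FileRange {
    final String tag;

    TaggedRange(long offset, int length, String tag) {
      super(offset, length);
      this.tag = tag;
    }
  }

  /**
   * Starts one asynchronous read per range. Each future completes with the
   * same range object the caller passed in. Completion order is unspecified.
   */
  static <R extends FileRange> List<CompletableFuture<R>> readAsync(
      byte[] file, List<R> ranges) {
    List<CompletableFuture<R>> futures = new ArrayList<>(ranges.size());
    for (R range : ranges) {
      CompletableFuture<R> future = CompletableFuture.supplyAsync(() -> {
        if (range.offset < 0 || range.length < 0
            || range.offset + range.length > file.length) {
          throw new IndexOutOfBoundsException("range outside file");
        }
        // No shared position is touched, so nothing needs to be locked.
        range.data = ByteBuffer.wrap(file, (int) range.offset, range.length)
            .slice().asReadOnlyBuffer();
        return range;
      });
      futures.add(future);
    }
    return futures;
  }

  /** Reads two overlapping ranges and formats each as tag=bytes. */
  static List<String> demo() {
    byte[] file = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
    List<TaggedRange> ranges = Arrays.asList(
        new TaggedRange(0, 4, "header"),
        new TaggedRange(2, 6, "overlap"));
    List<String> out = new ArrayList<>();
    // Joining in list order is a choice made by this demo, not a guarantee
    // about the order in which the reads finish.
    for (CompletableFuture<TaggedRange> f : readAsync(file, ranges)) {
      TaggedRange r = f.join();
      out.add(r.tag + "=" + StandardCharsets.US_ASCII.decode(r.data));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [header=0123, overlap=234567]
  }
}
```

Because each future returns the caller's TaggedRange, the application gets its own bookkeeping back without the API needing a separate context parameter.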

> FS API: Add a high-performance vectored Read to FSDataInputStream API
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-11867
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11867
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Gopal V
>            Assignee: Owen O'Malley
>            Priority: Major
>              Labels: performance
>
> The most significant way to read from a filesystem efficiently is to let 
> the FileSystem implementation handle the seek behaviour underneath the API 
> so that it is as efficient as possible.
> A better approach to the seek problem is to provide a sequence of read 
> locations as part of a single call, while letting the system schedule/plan 
> the reads ahead of time.
> This is exceedingly useful for seek-heavy readers on HDFS, since this allows 
> for potentially optimizing away the seek-gaps within the FSDataInputStream 
> implementation.
> For seek+read systems with even more latency than locally-attached disks, 
> something like a {{readFully(long[] offsets, ByteBuffer[] chunks)}} would 
> take care of the seeks internally while reading chunk.remaining() bytes into each 
> chunk (which may be {{slice()}}ed off a bigger buffer).
> The base implementation can stub this in as a sequence of seeks + read() into 
> ByteBuffers, without forcing each FS implementation to override this in any 
> way.
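
The fallback described in the issue text (seek, then read() into each ByteBuffer) could look roughly like the following. This is a sketch under assumptions, not Hadoop code: java.nio's SeekableByteChannel stands in for FSDataInputStream, the class name SeekReadFallback is hypothetical, and only the readFully(long[], ByteBuffer[]) shape is taken from the description.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch only; SeekableByteChannel stands in for the stream.
final class SeekReadFallback {

  /**
   * For each offset, seeks there and fills the matching chunk with
   * chunk.remaining() bytes. The original position is restored afterwards.
   */
  static void readFully(SeekableByteChannel in, long[] offsets,
      ByteBuffer[] chunks) throws IOException {
    if (offsets.length != chunks.length) {
      throw new IllegalArgumentException("offsets and chunks differ in length");
    }
    long saved = in.position();
    try {
      for (int i = 0; i < offsets.length; i++) {
        in.position(offsets[i]);
        ByteBuffer chunk = chunks[i];
        while (chunk.hasRemaining()) {
          if (in.read(chunk) < 0) {
            throw new EOFException("EOF before chunk " + i + " was filled");
          }
        }
      }
    } finally {
      in.position(saved);
    }
  }

  /** Reads two chunks that are slices of one bigger buffer. */
  static String demo() {
    try {
      Path tmp = Files.createTempFile("vectored", ".bin");
      try {
        Files.write(tmp,
            "0123456789abcdef".getBytes(StandardCharsets.US_ASCII));
        ByteBuffer big = ByteBuffer.allocate(8);
        big.limit(3);
        ByteBuffer first = big.slice();  // 3 bytes backed by big[0..3)
        big.limit(8);
        big.position(3);
        ByteBuffer second = big.slice(); // 5 bytes backed by big[3..8)
        try (SeekableByteChannel in = Files.newByteChannel(tmp)) {
          readFully(in, new long[] {10, 0}, new ByteBuffer[] {first, second});
        }
        return new String(big.array(), StandardCharsets.US_ASCII);
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints abc01234
  }
}
```

A file system with a cheaper way to serve several ranges at once would override this loop, while everything else inherits it unchanged.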



