mukund-thakur commented on code in PR #6604:
URL: https://github.com/apache/hadoop/pull/6604#discussion_r1536087673


##########
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/VectoredReadUtils.java:
##########
@@ -133,26 +172,42 @@ private static void readNonByteBufferPositionedReadable(PositionedReadable strea
   /**
    * Read bytes from stream into a byte buffer using an
    * intermediate byte array.
-   * @param length number of bytes to read.
+   *   <pre>
+   *     (position, buffer, buffer-offset, length): Void
+   *     position:= the position within the file to read data.
+   *     buffer := a buffer to read fully `length` bytes into.
+   *     buffer-offset := the offset within the buffer to write data
+   *     length := the number of bytes to read.
+   *   </pre>
+   * The passed in function MUST block until the required length of
+   * data is read, or an exception is thrown.
+   * @param range range to read
    * @param buffer buffer to fill.
    * @param operation operation to use for reading data.
    * @throws IOException any IOE.
    */
-  public static void readInDirectBuffer(int length,
-                                        ByteBuffer buffer,
-                                        Function4RaisingIOE<Integer, byte[], Integer,
-                                                Integer, Void> operation) throws IOException {
+  public static void readInDirectBuffer(FileRange range,
+      ByteBuffer buffer,
+      Function4RaisingIOE<Long, byte[], Integer, Integer, Void> operation)
+      throws IOException {
+
+    LOG.debug("Reading {} into a direct buffer", range);
+    validateRangeRequest(range);

Review Comment:
   We would have already validated this by this point; why do it again?
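   
   For illustration, a caller of the new signature might look like the following; a minimal sketch, assuming `stream` is some `PositionedReadable` and that `range` and `buffer` come from the surrounding vectored-read code.
   
   ```java
   // Sketch only: adapting PositionedReadable.readFully() to the
   // four-argument operation readInDirectBuffer() expects. The lambda
   // blocks until `length` bytes from `position` have been copied into
   // buf[offset..offset+length), which satisfies the "MUST block"
   // contract in the javadoc above.
   VectoredReadUtils.readInDirectBuffer(range, buffer,
       (position, buf, offset, length) -> {
         stream.readFully(position, buf, offset, length);
         return null;
       });
   ```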



##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java:
##########
@@ -885,19 +904,26 @@ public int maxReadSizeForVectorReads() {
    * @throws IOException IOE if any.
    */
   @Override
-  public void readVectored(List<? extends FileRange> ranges,
+  public synchronized void readVectored(List<? extends FileRange> ranges,

Review Comment:
   Moving to synchronized? We support multiple readVectored() calls on the same stream, and that should be fine.
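   
   To make the concern concrete, here is a sketch of the pattern that `synchronized` now serializes; `fs`, `path`, `rangesA`, and `rangesB` are illustrative names, not code from this PR.
   
   ```java
   // Sketch: two vectored reads issued concurrently against one stream.
   // With readVectored() synchronized, the second task blocks at the
   // call until the first returns, even though the per-range reads
   // behind it are asynchronous.
   FSDataInputStream in = fs.open(path);
   ExecutorService pool = Executors.newFixedThreadPool(2);
   pool.submit(() -> {
     in.readVectored(rangesA, ByteBuffer::allocate);
     return null;
   });
   pool.submit(() -> {
     in.readVectored(rangesB, ByteBuffer::allocate);
     return null;
   });
   ```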



##########
hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md:
##########
@@ -459,51 +459,119 @@ The position returned by `getPos()` after `readVectored()` is undefined.
 If a file is changed while the `readVectored()` operation is in progress, the output is
 undefined. Some ranges may have old data, some may have new, and some may have both.
 
-While a `readVectored()` operation is in progress, normal read api calls may block.
-
-Note: Don't use direct buffers for reading from ChecksumFileSystem as that may
-lead to memory fragmentation explained in HADOOP-18296.
+While a `readVectored()` operation is in progress, normal read API calls MAY block;
+the value of `getPos()` is also undefined. Applications SHOULD NOT make such requests
+while waiting for the results of a vectored read.
 
+Note: Don't use direct buffers for reading from `ChecksumFileSystem` as that may
+lead to memory fragmentation explained in
+[HADOOP-18296](https://issues.apache.org/jira/browse/HADOOP-18296)
+_Memory fragmentation in ChecksumFileSystem Vectored IO implementation_
 
 #### Preconditions
 
-For each requested range:
+No empty lists.
+
+```python
+if ranges = null raise NullPointerException
+if ranges.len() = 0 raise IllegalArgumentException
+if allocate = null raise NullPointerException
+```
+
+For each requested range `range[i]` in the list of ranges `range[0..n]` sorted
+on `getOffset()` ascending such that
+
+for all `i where i > 0`:
 
-    range.getOffset >= 0 else raise IllegalArgumentException
-    range.getLength >= 0 else raise EOFException
+    range[i].getOffset() > range[i-1].getOffset()
+
+For all ranges `0..n` the preconditions are:
+
+```python
+ranges[i] != null else raise IllegalArgumentException
+ranges[i].getOffset() >= 0 else raise EOFException
+ranges[i].getLength() >= 0 else raise IllegalArgumentException
+if i > 0 and ranges[i].getOffset() < (ranges[i-1].getOffset() + ranges[i-1].getLength()):
+   raise IllegalArgumentException
+```
+If the length of the file is known during the validation phase:
+
+```python
+if ranges[i].getOffset() + ranges[i].getLength() > data.length() raise EOFException
+```
 
 #### Postconditions
 
-For each requested range:
+For each requested range `range[i]` in the list of ranges `range[0..n]`
+
+```
+ranges[i]'.getData() = CompletableFuture<buffer: ByteBuffer>
+```
 
-    range.getData() returns CompletableFuture<ByteBuffer> which will have data
-    from range.getOffset to range.getLength.
+ and when `getData().get()` completes:
+```
+let buffer = getData().get()
+let len = ranges[i].getLength()
+let data = new byte[len]
+(buffer.limit() - buffer.position()) = len
+buffer.get(data, 0, len) = readFully(ranges[i].getOffset(), data, 0, len)
+```
 
-### `minSeekForVectorReads()`
+That is: the result of every ranged read is the result of the (possibly asynchronous)
+call to `PositionedReadable.readFully()` for the same offset and length.
+
+#### `minSeekForVectorReads()`
 
 The smallest reasonable seek. Two ranges won't be merged together if the difference between
 end of first and start of next range is more than this value.
 
-### `maxReadSizeForVectorReads()`
+#### `maxReadSizeForVectorReads()`
 
 Maximum number of bytes which can be read in one go after merging the ranges.
-Two ranges won't be merged if the combined data to be read is more than this value.
+Two ranges won't be merged if the combined data to be read It's okay we have a look at what we do right now for readOkayis more than this value.

Review Comment:
   Typo: some random lines crept in here, I guess.
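   
   Setting the garbled line aside, the preconditions in this hunk map naturally onto code. A sketch under assumed names (`validateRanges` is illustrative, not the hadoop-common implementation):
   
   ```java
   import java.io.EOFException;
   import java.util.List;
   import org.apache.hadoop.fs.FileRange;
   
   // Sketch of the checks the preconditions describe: non-null,
   // non-empty list; each range non-null with non-negative offset and
   // length; and, for a list sorted by offset, no overlap between
   // consecutive ranges.
   static void validateRanges(List<? extends FileRange> ranges)
       throws EOFException {
     if (ranges == null) {
       throw new NullPointerException("null range list");
     }
     if (ranges.isEmpty()) {
       throw new IllegalArgumentException("empty range list");
     }
     FileRange prev = null;
     for (FileRange r : ranges) {
       if (r == null) {
         throw new IllegalArgumentException("null range");
       }
       if (r.getOffset() < 0) {
         throw new EOFException("negative offset: " + r.getOffset());
       }
       if (r.getLength() < 0) {
         throw new IllegalArgumentException("negative length: " + r.getLength());
       }
       if (prev != null
           && r.getOffset() < prev.getOffset() + prev.getLength()) {
         throw new IllegalArgumentException("overlapping ranges at " + r);
       }
       prev = r;
     }
   }
   ```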



##########
hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md:
##########
@@ -441,9 +441,9 @@ The semantics of this are exactly equivalent to
     readFully(position, buffer, 0, len(buffer))
 
 That is, the buffer is filled entirely with the contents of the input source
-from position `position`
+from position `position`.
 
-### `default void readVectored(List<? extends FileRange> ranges, IntFunction<ByteBuffer> allocate)`
+### `void readVectored(List<? extends FileRange> ranges, IntFunction<ByteBuffer> allocate)`

Review Comment:
   Are we removing the `default`?
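   
   For context, the method is declared `default` on the interface itself, so implementations inherit a fallback. A sketch of the shape only, not copied from the hadoop-common source:
   
   ```java
   package org.apache.hadoop.fs;
   
   import java.io.IOException;
   import java.nio.ByteBuffer;
   import java.util.List;
   import java.util.function.IntFunction;
   
   public interface PositionedReadable {
     // ... existing positioned-read methods ...
   
     // Declared default, so existing implementations inherit a fallback
     // that reads each range via the utility class. Removing "default"
     // from the spec heading changes the documentation, not this
     // declaration.
     default void readVectored(List<? extends FileRange> ranges,
         IntFunction<ByteBuffer> allocate) throws IOException {
       VectoredReadUtils.readVectored(this, ranges, allocate);
     }
   }
   ```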



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


