[jira] [Work logged] (HADOOP-15245) S3AInputStream.skip() to use lazy seek

ASF GitHub Bot (Jira) Tue, 15 Mar 2022 21:47:08 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-15245?focusedWorklogId=742024&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-742024
 ]


ASF GitHub Bot logged work on HADOOP-15245:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Mar/22 04:46
            Start Date: 16/Mar/22 04:46
    Worklog Time Spent: 10m 
      Work Description: mehakmeet commented on a change in pull request #3927:
URL: https://github.com/apache/hadoop/pull/3927#discussion_r827622055



##########
File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
##########
@@ -781,6 +781,46 @@ public void readFully(long position, byte[] buffer, int offset, int length)
     }
   }
 
+  /**
+   * {@inheritDoc}
+   *
+   * This implements a more efficient method for skip. It calls lazy seek
+   * which will either make a new get request or do a default skip.
+   * If lazy seek fails, try doing a default skip.
+   *
+   * @param n Number of bytes to be skipped
+   * @return Number of bytes skipped
+   * @throws IOException on any problem
+   */
+  @Override
+  @Retries.OnceTranslated
+  public long skip(final long n) throws IOException {
+
+    if (n <= 0) {
+      return 0;
+    }
+
+    checkNotClosed();
+    streamStatistics.skipOperationStarted();
+
+    long targetPos = pos + n;

Review comment:
       Okay. Looking at the Javadoc for "pos", it seemed like getPos() should be 
returning that; maybe we changed that recently. Let's write a test first to 
verify that the position is updated correctly before skipping. If it 
works, let's not change it and keep it as pos.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 742024)
    Time Spent: 3h 10m  (was: 3h)

> S3AInputStream.skip() to use lazy seek
> --------------------------------------
>
>                 Key: HADOOP-15245
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15245
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.1.0
>            Reporter: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> The default skip() does a read and discard of all bytes, no matter how far 
> ahead the skip is. This is very inefficient if the skip() is being done on 
> S3A random IO, though exactly what to do when in sequential mode is unclear.
> Proposed: 
> * add an optimized version of S3AInputStream.skip() which does a lazy seek, 
> which itself will decide when to skip() vs. issue a new GET.
> * add some more instrumentation to measure how often this gets used
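
The proposal quoted above can be illustrated with a toy model. The following is a minimal sketch only, not the code from the pull request and not the real S3AInputStream: the class LazySkipSketch, the readahead threshold, and the reopenCount/bytesDrained counters are all hypothetical names chosen for illustration. It assumes that skip() merely records the new target position, and that the next read() decides whether to drain forward on the existing connection or to simulate a fresh GET. The getPos() method here returns the logical position, which is the property the review comment suggests verifying with a test.

```java
import java.io.InputStream;

/** Toy model of a lazy-seek skip(); hypothetical, not the Hadoop implementation. */
class LazySkipSketch extends InputStream {
  private final byte[] data;      // stands in for the remote object
  private final long readahead;   // assumed threshold: drain forward if the gap is this small
  private long pos;               // position of the simulated open connection
  private long nextReadPos;       // logical position seen by the caller
  private int reopenCount;        // instrumentation: simulated new GET requests
  private long bytesDrained;      // instrumentation: bytes read and discarded

  LazySkipSketch(byte[] data, long readahead) {
    this.data = data;
    this.readahead = readahead;
  }

  long getPos() { return nextReadPos; }
  int getReopenCount() { return reopenCount; }
  long getBytesDrained() { return bytesDrained; }

  /** Lazy skip: only moves the logical position; no I/O happens here. */
  @Override
  public long skip(long n) {
    if (n <= 0) {
      return 0;
    }
    long target = Math.min(nextReadPos + n, data.length);
    long skipped = target - nextReadPos;
    nextReadPos = target;
    return skipped;
  }

  /** Called before a read: reconcile the connection with the logical position. */
  private void lazySeek() {
    long diff = nextReadPos - pos;
    if (diff == 0) {
      return;
    }
    if (diff > 0 && diff <= readahead) {
      bytesDrained += diff;       // short forward gap: read and discard
    } else {
      reopenCount++;              // long gap or backward move: simulate a new GET
    }
    pos = nextReadPos;
  }

  @Override
  public int read() {
    if (nextReadPos >= data.length) {
      return -1;                  // end of the simulated object
    }
    lazySeek();
    int b = data[(int) pos] & 0xff;
    pos++;
    nextReadPos = pos;
    return b;
  }

  public static void main(String[] args) {
    byte[] data = new byte[100];
    for (int i = 0; i < data.length; i++) {
      data[i] = (byte) i;
    }
    LazySkipSketch in = new LazySkipSketch(data, 16);
    in.read();
    long before = in.getPos();
    long skipped = in.skip(50);
    System.out.println("position advanced by skip: " + (in.getPos() - before == skipped));
    in.read();
    System.out.println("reopens=" + in.getReopenCount() + " drained=" + in.getBytesDrained());
  }
}
```

In this sketch the cost of a skip is deferred: a skip that is never followed by a read causes no I/O at all, and the two counters give a rough picture of how often each path is taken.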



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
