[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-20 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=759124&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759124
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 20/Apr/22 11:10
Start Date: 20/Apr/22 11:10
Worklog Time Spent: 10m 
  Work Description: ahmarsuhail opened a new pull request, #4205:
URL: https://github.com/apache/hadoop/pull/4205

   ### Description of PR
   
   Documents usage and architecture of the prefetching input stream. 




Issue Time Tracking
---

Worklog Id: (was: 759124)
Remaining Estimate: 0h
Time Spent: 10m

> document use and architecture design of prefetching s3a input stream
> 
>
> Key: HADOOP-18177
> URL: https://issues.apache.org/jira/browse/HADOOP-18177
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: documentation, fs/s3
>Affects Versions: 3.4.0
>Reporter: Steve Loughran
>Assignee: Ahmar Suhail
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Document S3PrefetchingInputStream for users  (including any new failure modes 
> in troubleshooting) and the architecture for maintainers
> there's some markdown in 
> hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/read/README.md 
> already



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-20 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=759187&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759187
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 20/Apr/22 12:45
Start Date: 20/Apr/22 12:45
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4205:
URL: https://github.com/apache/hadoop/pull/4205#issuecomment-1103889770

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 52s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  markdownlint  |   0m  0s |  |  markdownlint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   |||| _ feature-HADOOP-18028-s3a-prefetch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  41m 30s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 56s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  shadedclient  |  66m 18s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 39s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | [/blanks-tabs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/1/artifact/out/blanks-tabs.txt) |  The patch 5 line(s) with tabs.  |
   | +1 :green_heart: |  mvnsite  |   0m 39s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  23m 31s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  asflicense  |   0m 44s | [/results-asflicense.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/1/artifact/out/results-asflicense.txt) |  The patch generated 1 ASF License warnings.  |
   |  |   |  93m 51s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4205 |
   | Optional Tests | dupname asflicense mvnsite codespell markdownlint |
   | uname | Linux 1c8ff3f57b1d 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | feature-HADOOP-18028-s3a-prefetch / b089cf68fb46f7fdbb5d1d88b1b7d87d3988ebfc |
   | Max. process+thread count | 601 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/1/console |
   | versions | git=2.25.1 maven=3.6.3 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
---

Worklog Id: (was: 759187)
Time Spent: 20m  (was: 10m)

> document use and architecture design of prefetching s3a input stream
> 
>
> Key: HADOOP-18177
> URL: https://issues.apache.org/jira/browse/HADOOP-18177
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: documentation, fs/s3
>Affects Versions: 3.4.0
>Reporter: Steve Loughran
>Assignee: Ahmar Suhail
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Document S3PrefetchingInputStream for users  (including any new failure modes 
> in troubleshooting) and the architecture for maintainers
> there's some markdown in 
> hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/read/README.md 
> already




[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-20 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=759259&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759259
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 20/Apr/22 14:25
Start Date: 20/Apr/22 14:25
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4205:
URL: https://github.com/apache/hadoop/pull/4205#issuecomment-1103997065

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 49s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  markdownlint  |   0m  0s |  |  markdownlint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   |||| _ feature-HADOOP-18028-s3a-prefetch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  42m  8s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 55s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  shadedclient  |  66m 19s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 38s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | [/blanks-tabs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/2/artifact/out/blanks-tabs.txt) |  The patch 1 line(s) with tabs.  |
   | +1 :green_heart: |  mvnsite  |   0m 40s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  23m 36s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not generate ASF License warnings.  |
   |  |   |  93m 33s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/2/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4205 |
   | Optional Tests | dupname asflicense mvnsite codespell markdownlint |
   | uname | Linux dfc2242ac9ef 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | feature-HADOOP-18028-s3a-prefetch / e5e9ea3075349df41fb52b5ae48b9f9cc3ed2d78 |
   | Max. process+thread count | 522 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/2/console |
   | versions | git=2.25.1 maven=3.6.3 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
---

Worklog Id: (was: 759259)
Time Spent: 0.5h  (was: 20m)


[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-21 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=759851&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759851
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 21/Apr/22 09:04
Start Date: 21/Apr/22 09:04
Worklog Time Spent: 10m 
  Work Description: dannycjones commented on code in PR #4205:
URL: https://github.com/apache/hadoop/pull/4205#discussion_r854934280


##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md:
##
@@ -0,0 +1,151 @@
+
+
+# S3A Prefetching
+
+
+This document explains the `S3PrefetchingInputStream` and the various 
components it uses.
+
+This input stream implements prefetching and caching to improve read 
performance of the input stream. A high level overview of this feature can also 
be found on 
[this](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0)
 blogpost.

Review Comment:
   Let's put the title in the link somehow for screenreaders.
   
   ```suggestion
   This input stream implements prefetching and caching to improve read performance of the input stream. A high level overview of this feature was published in [Pinterest Engineering's blog post titled "Improving efficiency and reducing runtime using S3 read optimization"](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0).
   ```



##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md:
##
@@ -0,0 +1,151 @@
+
+
+# S3A Prefetching
+
+
+This document explains the `S3PrefetchingInputStream` and the various 
components it uses.
+
+This input stream implements prefetching and caching to improve read 
performance of the input stream. A high level overview of this feature can also 
be found on 
[this](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0)
 blogpost.
+
+With prefetching, we divide the file into blocks of a fixed size (default is 
8MB), associate buffers to these blocks, and then read data into these buffers 
asynchronously. We also potentially cache these blocks.
+
+### Basic Concepts
+
+* **File** : A binary blob of data stored on some storage device.
+* **Block :** A file is divided into a number of blocks. The default size of a 
block is 8MB, but can be configured. The size of the first n-1 blocks is same,  
and the size of the last block may be same or smaller.
+* **Block based reading** : The granularity of read is one block. That is, we 
read an entire block and return or none at all. Multiple blocks may be read in 
parallel.

Review Comment:
   colon is inside the bold here but not for others
   
   ```suggestion
   * **File**: A binary blob of data stored on some storage device.
   * **Block**: A file is divided into a number of blocks. The default size of a block is 8MB, but can be configured. The size of the first n-1 blocks is same, and the size of the last block may be same or smaller.
   * **Block based reading**: The granularity of read is one block. That is, we read an entire block and return or none at all. Multiple blocks may be read in parallel.
   ```



##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md:
##
@@ -0,0 +1,151 @@
+
+
+# S3A Prefetching
+
+
+This document explains the `S3PrefetchingInputStream` and the various 
components it uses.
+
+This input stream implements prefetching and caching to improve read 
performance of the input stream. A high level overview of this feature can also 
be found on 
[this](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0)
 blogpost.
+
+With prefetching, we divide the file into blocks of a fixed size (default is 
8MB), associate buffers to these blocks, and then read data into these buffers 
asynchronously. We also potentially cache these blocks.
+
+### Basic Concepts
+
+* **File** : A binary blob of data stored on some storage device.
+* **Block :** A file is divided into a number of blocks. The default size of a 
block is 8MB, but can be configured. The size of the first n-1 blocks is same,  
and the size of the last block may be same or smaller.
+* **Block based reading** : The granularity of read is one block. That is, we 
read an entire block and return or none at all. Multiple blocks may be read in 
parallel.
+
+### Configuring the stream
+
+|Property|Meaning|Default|
+|---   |---|---|
+|fs.s3a.prefetch.enabled|Enable the prefetch input stream|TRUE |
+|fs.s3a.prefetch.block.size|Size of a block|8MB|
+|fs.s3a.prefetch.block.count|Number of blocks to prefetch|8|
+
+### Key Components:
+
+`S3PrefetchingInputStream`

[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-22 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=760945&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-760945
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 22/Apr/22 16:58
Start Date: 22/Apr/22 16:58
Worklog Time Spent: 10m 
  Work Description: steveloughran commented on code in PR #4205:
URL: https://github.com/apache/hadoop/pull/4205#discussion_r856383998


##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md:
##
@@ -0,0 +1,151 @@
+
+
+# S3A Prefetching
+
+
+This document explains the `S3PrefetchingInputStream` and the various 
components it uses.
+
+This input stream implements prefetching and caching to improve read 
performance of the input stream. A high level overview of this feature can also 
be found on 
[this](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0)
 blogpost.
+
+With prefetching, we divide the file into blocks of a fixed size (default is 
8MB), associate buffers to these blocks, and then read data into these buffers 
asynchronously. We also potentially cache these blocks.
+
+### Basic Concepts
+
+* **File** : A binary blob of data stored on some storage device.
+* **Block :** A file is divided into a number of blocks. The default size of a 
block is 8MB, but can be configured. The size of the first n-1 blocks is same,  
and the size of the last block may be same or smaller.
+* **Block based reading** : The granularity of read is one block. That is, we 
read an entire block and return or none at all. Multiple blocks may be read in 
parallel.
+
+### Configuring the stream
+
+|Property|Meaning|Default|
+|---   |---|---|
+|fs.s3a.prefetch.enabled|Enable the prefetch input stream|TRUE |

Review Comment:
   1. use backticks around the configuration names and values; all values must 
be valid if passed in as the config strings. they will be.
   2. use `true` for true, rather than the capitalised value



##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md:
##
@@ -0,0 +1,151 @@
+
+
+# S3A Prefetching
+
+
+This document explains the `S3PrefetchingInputStream` and the various 
components it uses.
+
+This input stream implements prefetching and caching to improve read 
performance of the input stream. A high level overview of this feature can also 
be found on 
[this](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0)
 blogpost.
+
+With prefetching, we divide the file into blocks of a fixed size (default is 
8MB), associate buffers to these blocks, and then read data into these buffers 
asynchronously. We also potentially cache these blocks.
+

Review Comment:
   can you replace "we" in the docs to refer instead to the class/component 
which is doing the work



##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md:
##
@@ -0,0 +1,151 @@
+
+
+# S3A Prefetching
+
+
+This document explains the `S3PrefetchingInputStream` and the various 
components it uses.
+
+This input stream implements prefetching and caching to improve read 
performance of the input stream. A high level overview of this feature can also 
be found on 
[this](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0)
 blogpost.
+
+With prefetching, we divide the file into blocks of a fixed size (default is 
8MB), associate buffers to these blocks, and then read data into these buffers 
asynchronously. We also potentially cache these blocks.
+
+### Basic Concepts
+
+* **File** : A binary blob of data stored on some storage device.
+* **Block :** A file is divided into a number of blocks. The default size of a 
block is 8MB, but can be configured. The size of the first n-1 blocks is same,  
and the size of the last block may be same or smaller.
+* **Block based reading** : The granularity of read is one block. That is, we 
read an entire block and return or none at all. Multiple blocks may be read in 
parallel.
+
+### Configuring the stream
+
+|Property|Meaning|Default|
+|---   |---|---|
+|fs.s3a.prefetch.enabled|Enable the prefetch input stream|TRUE |
+|fs.s3a.prefetch.block.size|Size of a block|8MB|
+|fs.s3a.prefetch.block.count|Number of blocks to prefetch|8|
+
+### Key Components:
+
+`S3PrefetchingInputStream` - When prefetching is enabled, S3AFileSystem will 
return an instance of this class as the input stream. Depending on the file 
size, it will either use the `S3InMemoryInputStream` or the 
`S3CachingInputStream` as the underlying input stream.
+
+`S3InMemoryInputStream` - Underlying input stream used when the file size < 
configured

[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-25 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=761912&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-761912
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 25/Apr/22 17:00
Start Date: 25/Apr/22 17:00
Worklog Time Spent: 10m 
  Work Description: ahmarsuhail commented on code in PR #4205:
URL: https://github.com/apache/hadoop/pull/4205#discussion_r857842762


##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md:
##
@@ -0,0 +1,151 @@
+
+
+# S3A Prefetching
+
+
+This document explains the `S3PrefetchingInputStream` and the various 
components it uses.
+
+This input stream implements prefetching and caching to improve read 
performance of the input stream. A high level overview of this feature can also 
be found on 
[this](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0)
 blogpost.
+
+With prefetching, we divide the file into blocks of a fixed size (default is 
8MB), associate buffers to these blocks, and then read data into these buffers 
asynchronously. We also potentially cache these blocks.

Review Comment:
   good point, have removed 





Issue Time Tracking
---

Worklog Id: (was: 761912)
Time Spent: 1h 10m  (was: 1h)


[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-25 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=761910&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-761910
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 25/Apr/22 17:00
Start Date: 25/Apr/22 17:00
Worklog Time Spent: 10m 
  Work Description: ahmarsuhail commented on code in PR #4205:
URL: https://github.com/apache/hadoop/pull/4205#discussion_r857842464


##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md:
##
@@ -0,0 +1,151 @@
+
+
+# S3A Prefetching
+
+
+This document explains the `S3PrefetchingInputStream` and the various 
components it uses.
+
+This input stream implements prefetching and caching to improve read 
performance of the input stream. A high level overview of this feature can also 
be found on 
[this](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0)
 blogpost.
+
+With prefetching, we divide the file into blocks of a fixed size (default is 
8MB), associate buffers to these blocks, and then read data into these buffers 
asynchronously. We also potentially cache these blocks.
+
+### Basic Concepts
+
+* **File** : A binary blob of data stored on some storage device.
+* **Block :** A file is divided into a number of blocks. The default size of a 
block is 8MB, but can be configured. The size of the first n-1 blocks is same,  
and the size of the last block may be same or smaller.
+* **Block based reading** : The granularity of read is one block. That is, we 
read an entire block and return or none at all. Multiple blocks may be read in 
parallel.
+
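As an illustration of the block layout described under "Basic Concepts", the arithmetic can be sketched in Java as below. This is a sketch only: `BlockLayout` and its method names are hypothetical and are not taken from the patch or from hadoop-aws. The 8MB block size in `main` is the default quoted above; the 20MB file size is an arbitrary example value.

```java
/** Hypothetical helper for illustration; not part of the Hadoop code base. */
final class BlockLayout {
  private BlockLayout() {
  }

  /** Number of blocks needed to cover the file: ceil(fileSize / blockSize). */
  static int numBlocks(long fileSize, long blockSize) {
    if (fileSize < 0 || blockSize <= 0) {
      throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
    }
    return (int) ((fileSize + blockSize - 1) / blockSize);
  }

  /** Zero-based number of the block that contains the given file offset. */
  static int blockNumber(long offset, long blockSize) {
    return (int) (offset / blockSize);
  }

  /** Size of one block: blockSize for the first n-1 blocks, the remainder for the last. */
  static long blockLength(int blockNumber, long fileSize, long blockSize) {
    if (blockNumber < 0 || blockNumber >= numBlocks(fileSize, blockSize)) {
      throw new IllegalArgumentException("no such block: " + blockNumber);
    }
    long start = (long) blockNumber * blockSize;
    return Math.min(blockSize, fileSize - start);
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    long fileSize = 20 * mb;   // example value, not from the patch
    long blockSize = 8 * mb;   // default block size quoted in the document
    int n = numBlocks(fileSize, blockSize);
    for (int i = 0; i < n; i++) {
      System.out.println("block " + i + ": " + blockLength(i, fileSize, blockSize) + " bytes");
    }
  }
}
```

With these example values the last block is smaller than the others; when the file size is an exact multiple of the block size, all blocks have the same size.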
+### Configuring the stream
+
+|Property|Meaning|Default|
+|---   |---|---|
+|fs.s3a.prefetch.enabled|Enable the prefetch input stream|TRUE |
+|fs.s3a.prefetch.block.size|Size of a block|8MB|
+|fs.s3a.prefetch.block.count|Number of blocks to prefetch|8|
+
+### Key Components:
+
+`S3PrefetchingInputStream` - When prefetching is enabled, S3AFileSystem will 
return an instance of this class as the input stream. Depending on the file 
size, it will either use the `S3InMemoryInputStream` or the 
`S3CachingInputStream` as the underlying input stream.
+
+`S3InMemoryInputStream` - Underlying input stream used when the file size < 
configured block size. Will read the entire file into memory.
+
+`S3CachingInputStream` - Underlying input stream used when file size > 
configured block size. Uses asynchronous prefetching of blocks and caching to 
improve performance.
+
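The size-based choice between the two underlying streams could be sketched as below. `StreamSelection` and `Kind` are hypothetical names used only for illustration. The text covers file size < block size and file size > block size; how a file of exactly one block is handled is not stated, so the sketch assumes it goes to the in-memory stream, and that assumption is marked in the code.

```java
/** Hypothetical illustration of the size-based choice; not the patch's code. */
final class StreamSelection {
  /** Stand-ins for S3InMemoryInputStream and S3CachingInputStream. */
  enum Kind { IN_MEMORY, CACHING }

  private StreamSelection() {
  }

  static Kind choose(long fileSize, long blockSize) {
    if (fileSize < 0 || blockSize <= 0) {
      throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
    }
    // ASSUMPTION: fileSize == blockSize is treated as in-memory, since the
    // whole file still fits in a single block-sized buffer. The document
    // only specifies the strictly-less and strictly-greater cases.
    return fileSize <= blockSize ? Kind.IN_MEMORY : Kind.CACHING;
  }
}
```

For example, `StreamSelection.choose(5L << 20, 8L << 20)` selects the in-memory variant, matching the 5MB file and 8MB block size example used later in the document.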
+`BlockData` - Holds information about the blocks in a file, such as:
+
+* Number of blocks in the file
+* Block size
+* State of each block (initially all blocks have state *NOT_READY*). Other 
states are: Queued, Ready, Cached.
+
+`BufferData` - Holds the buffer and additional information about it such as:
+
+* The block number this buffer is for
+* State of the buffer (Unknown, Blank, Prefetching, Caching, Ready, Done). 
Initial state of a buffer is blank.
+
+`CachingBlockManager` - Implements reading data into the buffer, prefetching 
and caching.
+
+`BufferPool` - Manages a fixed sized pool of buffers. It’s used by 
`CachingBlockManager` to acquire buffers.
+
+`S3File` - Implements operations to interact with S3 such as opening and 
closing the input stream to the S3 file.
+
+`S3Reader` - Implements reading from the stream opened by `S3File`. Reads from 
this input stream in blocks of 64KB.
+
+`FilePosition` - Provides functionality related to tracking the position in 
the file. Also gives access to the current buffer in use.
+
+`SingleFilePerBlockCache` - Responsible for caching blocks to the local file 
system. Each cache block is stored on the local disk as a separate file.
+
+### Operation
+
+### S3InMemoryInputStream
+
+If we have a file with size 5MB, and block size = 8MB. Since file size is less 
than the block size, the `S3InMemoryInputStream` will be used.
+
+If the caller makes the following read calls:
+
+
+```
+in.read(buffer, 0, 3MB);
+in.read(buffer, 0, 2MB);
+```
+
+When the first read is issued, there is no buffer in use yet. We get the data 
in this file by calling the `ensureCurrentBuffer()` method, which ensures that 
a buffer with data is available to be read from.
+
+The `ensureCurrentBuffer()` then:
+
+* Reads data into a buffer by calling `S3Reader.read(ByteBuffer buffer, long 
offset, int size)`
+*  `S3Reader` uses `S3File` to open an input stream to the S3 file by making a 
`getObject()` request with range as `(0, filesize)`.
+*  The S3Reader reads the entire file into the provided buffer, and once 
reading is complete closes the S3 stream and frees all underlying resources.
+* Now the entire file is in a buffer, set this data in `FilePosition` so it 
can be accessed by the input stream.
+
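The steps listed above can be sketched as a small, self-contained Java class. It is a simplified model, not the patch's implementation: `InMemoryReadSketch` is a hypothetical name, the `Supplier<byte[]>` stands in for the `S3Reader`/`S3File` pair and the `getObject()` request, and a plain `ByteBuffer` stands in for `FilePosition`. The point it shows is that the whole file is fetched once, on the first read, and later reads are served from the buffer.

```java
import java.nio.ByteBuffer;
import java.util.function.Supplier;

/** Hypothetical model of the in-memory read path; not the patch's code. */
final class InMemoryReadSketch {
  // Stand-in for S3Reader/S3File: returns the full file contents in one call.
  private final Supplier<byte[]> remoteFetch;
  // Stand-in for the buffer tracked by FilePosition; null until first read.
  private ByteBuffer current;
  // Counts how often the (simulated) remote store was contacted.
  int fetchCount;

  InMemoryReadSketch(Supplier<byte[]> remoteFetch) {
    this.remoteFetch = remoteFetch;
  }

  /** Loads the entire file into a buffer if no buffer is in use yet. */
  private void ensureCurrentBuffer() {
    if (current == null) {
      current = ByteBuffer.wrap(remoteFetch.get());
      fetchCount++;
    }
  }

  /** Copies up to len bytes into dest at offset; returns -1 at end of file. */
  int read(byte[] dest, int offset, int len) {
    if (len == 0) {
      return 0;
    }
    ensureCurrentBuffer();
    if (!current.hasRemaining()) {
      return -1;
    }
    int n = Math.min(len, current.remaining());
    current.get(dest, offset, n);
    return n;
  }
}
```

In this model, two consecutive reads that together cover the file trigger a single fetch; the second read only copies bytes out of the buffer that the first read populated.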
+The read operation now just

[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-25 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=761957&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-761957
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 25/Apr/22 18:23
Start Date: 25/Apr/22 18:23
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4205:
URL: https://github.com/apache/hadoop/pull/4205#issuecomment-1108897168

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |  18m 10s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  markdownlint  |   0m  0s |  |  markdownlint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   |||| _ feature-HADOOP-18028-s3a-prefetch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  41m 49s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 55s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  shadedclient  |  65m 51s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 39s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | [/blanks-tabs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/3/artifact/out/blanks-tabs.txt) |  The patch 1 line(s) with tabs.  |
   | +1 :green_heart: |  mvnsite  |   0m 39s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  23m  6s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   0m 42s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 109m 59s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/3/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4205 |
   | Optional Tests | dupname asflicense mvnsite codespell markdownlint |
   | uname | Linux f50a219ad738 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | feature-HADOOP-18028-s3a-prefetch / 24380d95f1da4aab12af6183f4302e513e87f6dc |
   | Max. process+thread count | 520 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/3/console |
   | versions | git=2.25.1 maven=3.6.3 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
---

Worklog Id: (was: 761957)
Time Spent: 1h 20m  (was: 1h 10m)


[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-25 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=761973&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-761973
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 25/Apr/22 18:37
Start Date: 25/Apr/22 18:37
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4205:
URL: https://github.com/apache/hadoop/pull/4205#issuecomment-1108910254

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |  18m  3s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  markdownlint  |   0m  0s |  |  markdownlint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   |||| _ feature-HADOOP-18028-s3a-prefetch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  41m 46s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 56s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  shadedclient  |  66m  3s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 38s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | [/blanks-tabs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/4/artifact/out/blanks-tabs.txt) |  The patch 1 line(s) with tabs.  |
   | +1 :green_heart: |  mvnsite  |   0m 41s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  23m 22s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   0m 42s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 110m 18s |  |  |


   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/4/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4205 |
   | Optional Tests | dupname asflicense mvnsite codespell markdownlint |
   | uname | Linux ba91b572c6c8 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | feature-HADOOP-18028-s3a-prefetch / 24380d95f1da4aab12af6183f4302e513e87f6dc |
   | Max. process+thread count | 521 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/4/console |
   | versions | git=2.25.1 maven=3.6.3 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
---

Worklog Id: (was: 761973)
Time Spent: 1.5h  (was: 1h 20m)


[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-26 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=762184&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762184
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 26/Apr/22 09:04
Start Date: 26/Apr/22 09:04
Worklog Time Spent: 10m 
  Work Description: dannycjones commented on code in PR #4205:
URL: https://github.com/apache/hadoop/pull/4205#discussion_r858455675


##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md:
##
@@ -14,94 +14,114 @@
 
 # S3A Prefetching
 
-
 This document explains the `S3PrefetchingInputStream` and the various components it uses.
 
-This input stream implements prefetching and caching to improve read performance of the input stream. A high level overview of this feature can also be found on [this](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0) blogpost.
+This input stream implements prefetching and caching to improve read performance of the input
+stream. A high level overview of this feature was published in
+[Pinterest Engineering's blog post titled "Improving efficiency and reducing runtime using S3 read optimization"](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0)
+blogpost.
 
-With prefetching, we divide the file into blocks of a fixed size (default is 8MB), associate buffers to these blocks, and then read data into these buffers asynchronously. We also potentially cache these blocks.
+With prefetching, the input stream divides the remote file into blocks of a fixed size, associates
+buffers to these blocks and then reads data into these buffers asynchronously. It also potentially
+caches these blocks.
 
 ### Basic Concepts
 
-* **File** : A binary blob of data stored on some storage device.
-* **Block :** A file is divided into a number of blocks. The default size of a block is 8MB, but can be configured. The size of the first n-1 blocks is same,  and the size of the last block may be same or smaller.
-* **Block based reading** : The granularity of read is one block. That is, we read an entire block and return or none at all. Multiple blocks may be read in parallel.
+* **Remote File**: A binary blob of data stored on some storage device.
+* **Block**: A file is divided into a number of blocks. The size of the first n-1 blocks is same,
+  and the size of the last block may be same or smaller.
+* **Block based reading**: The granularity of read is one block. That is, either an entire block is
+  read and returned or none at all. Multiple blocks may be read in parallel.
 
 ### Configuring the stream
 
 |Property|Meaning|Default|
 |---   |---|---|
-|fs.s3a.prefetch.enabled|Enable the prefetch input stream|TRUE |
-|fs.s3a.prefetch.block.size|Size of a block|8MB|
-|fs.s3a.prefetch.block.count|Number of blocks to prefetch|8|
+|fs.s3a.prefetch.enabled|Enable the prefetch input stream|`true` |
+|fs.s3a.prefetch.block.size|Size of a block|`8M`|
+|fs.s3a.prefetch.block.count|Number of blocks to prefetch|`8`|

Review Comment:
   We want backticks around the configuration keys too.
   
   ```suggestion
   |`fs.s3a.prefetch.enabled`|Enable the prefetch input stream|`true` |
   |`fs.s3a.prefetch.block.size`|Size of a block|`8M`|
   |`fs.s3a.prefetch.block.count`|Number of blocks to prefetch|`8`|
   ```
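
As a rough illustration of what the defaults in this table imply (a sketch only, not code from the PR): with a block size of `8M`, a remote file is split into blocks where the first n-1 are full-sized and the last may be smaller. The helper name `block_layout`, the reading of `8M` as 8 * 1024 * 1024 bytes, and the 20 MB example file are all assumptions made for illustration.

```python
# Illustrative sketch only; `block_layout` is a hypothetical helper,
# not part of hadoop-aws.

# Assumed meaning of the `8M` default for fs.s3a.prefetch.block.size.
BLOCK_SIZE = 8 * 1024 * 1024


def block_layout(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of `file_size` bytes.

    Every block except possibly the last has length `block_size`.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    # A 20 MB file yields two full blocks and one shorter final block.
    for number, (offset, length) in enumerate(block_layout(20 * 1024 * 1024)):
        print(number, offset, length)
```

A file smaller than one block produces a single short block, which matches the case the document describes as being read entirely into memory.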



##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md:
##
@@ -14,94 +14,114 @@
 
 # S3A Prefetching
 
-
 This document explains the `S3PrefetchingInputStream` and the various components it uses.
 
-This input stream implements prefetching and caching to improve read performance of the input stream. A high level overview of this feature can also be found on [this](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0) blogpost.
+This input stream implements prefetching and caching to improve read performance of the input
+stream. A high level overview of this feature was published in
+[Pinterest Engineering's blog post titled "Improving efficiency and reducing runtime using S3 read optimization"](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0)
+blogpost.

Review Comment:
   drop `blogpost` as it is already mentioned earlier



##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md:
##
@@ -14,94 +14,114 @@
 
 # S3A Prefetching
 
-
 This document explains the `S3PrefetchingInputStream` and the various components it uses.
 
-This input stream implements prefetching

[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-26 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=762219&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762219
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 26/Apr/22 10:38
Start Date: 26/Apr/22 10:38
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4205:
URL: https://github.com/apache/hadoop/pull/4205#issuecomment-1109635269

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 52s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  markdownlint  |   0m  0s |  |  markdownlint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   |||| _ feature-HADOOP-18028-s3a-prefetch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  41m 38s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 56s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  shadedclient  |  66m 33s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 41s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | [/blanks-tabs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/5/artifact/out/blanks-tabs.txt) |  The patch 1 line(s) with tabs.  |
   | +1 :green_heart: |  mvnsite  |   0m 45s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  24m 10s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   0m 41s |  |  The patch does not generate ASF License warnings.  |
   |  |   |  94m 31s |  |  |


   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/5/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4205 |
   | Optional Tests | dupname asflicense mvnsite codespell markdownlint |
   | uname | Linux 9472cd5389b4 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | feature-HADOOP-18028-s3a-prefetch / ba1d26a0f77b6d3795cdeb923c1403dfa31e7ded |
   | Max. process+thread count | 518 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/5/console |
   | versions | git=2.25.1 maven=3.6.3 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
---

Worklog Id: (was: 762219)
Time Spent: 1h 50m  (was: 1h 40m)


[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-26 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=762237&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762237
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 26/Apr/22 11:18
Start Date: 26/Apr/22 11:18
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4205:
URL: https://github.com/apache/hadoop/pull/4205#issuecomment-1109670724

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 54s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  markdownlint  |   0m  1s |  |  markdownlint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   |||| _ feature-HADOOP-18028-s3a-prefetch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  42m 23s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 53s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  shadedclient  |  66m 45s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 39s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/6/artifact/out/blanks-eol.txt) |  The patch has 30 line(s) that end in blanks. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply  |
   | -1 :x: |  blanks  |   0m  0s | [/blanks-tabs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/6/artifact/out/blanks-tabs.txt) |  The patch 1 line(s) with tabs.  |
   | +1 :green_heart: |  mvnsite  |   0m 40s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  23m 36s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not generate ASF License warnings.  |
   |  |   |  94m 11s |  |  |


   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/6/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4205 |
   | Optional Tests | dupname asflicense mvnsite codespell markdownlint |
   | uname | Linux afff6318d18d 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | feature-HADOOP-18028-s3a-prefetch / 125152ef64d24083abb7308978302285c1720fb6 |
   | Max. process+thread count | 589 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/6/console |
   | versions | git=2.25.1 maven=3.6.3 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
---

Worklog Id: (was: 762237)
Time Spent: 2h  (was: 1h 50m)


[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-26 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=762239&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762239
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 26/Apr/22 11:28
Start Date: 26/Apr/22 11:28
Worklog Time Spent: 10m 
  Work Description: dannycjones commented on code in PR #4205:
URL: https://github.com/apache/hadoop/pull/4205#discussion_r858597582


##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md:
##
@@ -0,0 +1,192 @@
+
+
+# S3A Prefetching
+
+This document explains the `S3PrefetchingInputStream` and the various components it uses.
+
+This input stream implements prefetching and caching to improve read performance of the input
+stream.
+A high level overview of this feature was published in
+[Pinterest Engineering's blog post titled "Improving efficiency and reducing runtime using S3 read optimization"](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0).
+
+With prefetching, the input stream divides the remote file into blocks of a fixed size, associates
+buffers to these blocks and then reads data into these buffers asynchronously. 
+It also potentially caches these blocks.
+
+### Basic Concepts
+
+* **Remote File**: A binary blob of data stored on some storage device.
+* **Block File**: Local file containing a block of the remote file.
+* **Block**: A file is divided into a number of blocks. 
+The size of the first n-1 blocks is same, and the size of the last block may be same or smaller.
+* **Block based reading**: The granularity of read is one block. 
+That is, either an entire block is read and returned or none at all. 
+Multiple blocks may be read in parallel.
+
+### Configuring the stream
+
+|Property|Meaning|Default|
+|---   |---|---|
+|`fs.s3a.prefetch.enabled`|Enable the prefetch input stream|`true` |
+|`fs.s3a.prefetch.block.size`|Size of a block|`8M`|
+|`fs.s3a.prefetch.block.count`|Number of blocks to prefetch|`8`|
+
+### Key Components
+
+`S3PrefetchingInputStream` - When prefetching is enabled, S3AFileSystem will return an instance of
+this class as the input stream. 
+Depending on the remote file size, it will either use
+the `S3InMemoryInputStream` or the `S3CachingInputStream` as the underlying input stream.
+
+`S3InMemoryInputStream` - Underlying input stream used when the remote file size < configured block
+size. 
+Will read the entire remote file into memory.
+
+`S3CachingInputStream` - Underlying input stream used when remote file size > configured block size.
+Uses asynchronous prefetching of blocks and caching to improve performance.
+
+`BlockData` - Holds information about the blocks in a remote file, such as:
+
+* Number of blocks in the remote file
+* Block size
+* State of each block (initially all blocks have state *NOT_READY*). 
+Other states are: Queued, Ready, Cached.
+
+`BufferData` - Holds the buffer and additional information about it such as:
+
+* The block number this buffer is for
+* State of the buffer (Unknown, Blank, Prefetching, Caching, Ready, Done). 
+Initial state of a buffer is blank.
+
+`CachingBlockManager` - Implements reading data into the buffer, prefetching and caching.
+
+`BufferPool` - Manages a fixed sized pool of buffers. 
+It’s used by `CachingBlockManager` to acquire buffers.
+
+`S3File` - Implements operations to interact with S3 such as opening and closing the input stream to
+the remote file in S3.
+
+`S3Reader` - Implements reading from the stream opened by `S3File`. 
+Reads from this input stream in blocks of 64KB.
+
+`FilePosition` - Provides functionality related to tracking the position in the file. 
+Also gives access to the current buffer in use.
+
+`SingleFilePerBlockCache` - Responsible for caching blocks to the local file system. 
+Each cache block is stored on the local disk as a separate block file.
+
+### Operation
+
+#### S3InMemoryInputStream
+
+For a remote file with size 5MB, and block size = 8MB, since file size is less than the block size,
+the `S3InMemoryInputStream` will be used.
+
+If the caller makes the following read calls:
+
+```
+in.read(buffer, 0, 3MB);
+in.read(buffer, 0, 2MB);
+```
+
+When the first read is issued, there is no buffer in use yet. 
+The `S3InMemoryInputStream` gets the data in this remote file by calling the `ensureCurrentBuffer()` 
+method, which ensures that a buffer with data is available to be read from.
+
+The `ensureCurrentBuffer()` then:
+
+* Reads data into a buffer by calling `S3Reader.read(ByteBuffer buffer, long offset, int size)`.
+* `S3Reader` uses `S3File` to open an input stream to the remote file in S3 by making
+  a `getObject()` request with range as `(0, filesize)`.
+* The `S3Reader` reads the entire re

[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-26 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=762317&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762317
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 26/Apr/22 13:49
Start Date: 26/Apr/22 13:49
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4205:
URL: https://github.com/apache/hadoop/pull/4205#issuecomment-1109822081

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 50s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  markdownlint  |   0m  0s |  |  markdownlint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   |||| _ feature-HADOOP-18028-s3a-prefetch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  42m  6s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 55s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  shadedclient  |  66m 12s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 38s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  mvnsite  |   0m 40s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  23m 34s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not generate ASF License warnings.  |
   |  |   |  93m 28s |  |  |


   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/7/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4205 |
   | Optional Tests | dupname asflicense mvnsite codespell markdownlint |
   | uname | Linux 918724e3c41c 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | feature-HADOOP-18028-s3a-prefetch / 658396acb384ba50103e240e890f31cf1355388a |
   | Max. process+thread count | 601 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/7/console |
   | versions | git=2.25.1 maven=3.6.3 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
---

Worklog Id: (was: 762317)
Time Spent: 2h 20m  (was: 2h 10m)


[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-26 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=762347&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762347
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 26/Apr/22 14:22
Start Date: 26/Apr/22 14:22
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4205:
URL: https://github.com/apache/hadoop/pull/4205#issuecomment-1109860466

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 54s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  markdownlint  |   0m  0s |  |  markdownlint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   |||| _ feature-HADOOP-18028-s3a-prefetch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  41m 59s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 56s |  |  feature-HADOOP-18028-s3a-prefetch passed  |
   | +1 :green_heart: |  shadedclient  |  66m 18s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 38s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  mvnsite  |   0m 41s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  23m 32s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not generate ASF License warnings.  |
   |  |   |  93m 36s |  |  |


   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/8/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4205 |
   | Optional Tests | dupname asflicense mvnsite codespell markdownlint |
   | uname | Linux 416a76a29726 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | feature-HADOOP-18028-s3a-prefetch / 9558361263d879b0bef5f452528dca5f18abdc15 |
   | Max. process+thread count | 525 (vs. ulimit of 5500) |
   | modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/8/console |
   | versions | git=2.25.1 maven=3.6.3 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
---

Worklog Id: (was: 762347)
Time Spent: 2.5h  (was: 2h 20m)


[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

2022-04-26 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=762448&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762448
 ]

ASF GitHub Bot logged work on HADOOP-18177:
---

Author: ASF GitHub Bot
Created on: 26/Apr/22 17:36
Start Date: 26/Apr/22 17:36
Worklog Time Spent: 10m 
  Work Description: steveloughran merged PR #4205:
URL: https://github.com/apache/hadoop/pull/4205




Issue Time Tracking
---

Worklog Id: (was: 762448)
Time Spent: 2h 40m  (was: 2.5h)

