[
https://issues.apache.org/jira/browse/HADOOP-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931191#comment-17931191
]
ASF GitHub Bot commented on HADOOP-19348:
-----------------------------------------
steveloughran commented on code in PR #7433:
URL: https://github.com/apache/hadoop/pull/7433#discussion_r1973722346
##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:
##########
@@ -636,7 +636,8 @@ public void initialize(URI name, Configuration originalConf)
     // If encryption method is set to CSE-KMS or CSE-CUSTOM then CSE is enabled.
     isCSEEnabled = CSEUtils.isCSEEnabled(getS3EncryptionAlgorithm().getMethod());
-    isAnalyticsAccelaratorEnabled = StreamIntegration.determineInputStreamType(conf).equals(InputStreamType.Analytics);
+    isAnalyticsAccelaratorEnabled =
+        StreamIntegration.determineInputStreamType(conf)
Review Comment:
nit: spelling
##########
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AAnalyticsAcceleratorStreamReading.java:
##########
@@ -141,37 +144,37 @@ public void testMalformedParquetFooter() throws IOException {
    * can contain multiple row groups, this allows for further parallelisation, as each row group
    * can be processed independently.
    */
-  @Test
-  public void testMultiRowGroupParquet() throws Throwable {
+  @Test
+  public void testMultiRowGroupParquet() throws Throwable {
     describe("A parquet file is read successfully");
     Path dest = path("multi_row_group.parquet");
-    File file = new File("src/test/resources/multi_row_group.parquet");
-    Path sourcePath = new Path(file.toURI().getPath());
-    getFileSystem().copyFromLocalFile(false, true, sourcePath, dest);
+    File file = new File("src/test/resources/multi_row_group.parquet");
+    Path sourcePath = new Path(file.toURI().getPath());
+    getFileSystem().copyFromLocalFile(false, true, sourcePath, dest);

-    FileStatus fileStatus = getFileSystem().getFileStatus(dest);
+    FileStatus fileStatus = getFileSystem().getFileStatus(dest);

-    byte[] buffer = new byte[3000];
-    IOStatistics ioStats;
+    byte[] buffer = new byte[3000];
+    IOStatistics ioStats;

-    try (FSDataInputStream inputStream = getFileSystem().open(dest)) {
-      ioStats = inputStream.getIOStatistics();
-      inputStream.readFully(buffer, 0, (int) fileStatus.getLen());
-    }
+    try (FSDataInputStream inputStream = getFileSystem().open(dest)) {
+      ioStats = inputStream.getIOStatistics();
+      inputStream.readFully(buffer, 0, (int) fileStatus.getLen());
+    }

-    verifyStatisticCounterValue(ioStats, STREAM_READ_ANALYTICS_OPENED, 1);
+    verifyStatisticCounterValue(ioStats, STREAM_READ_ANALYTICS_OPENED, 1);

-    try (FSDataInputStream inputStream = getFileSystem().openFile(dest)
-        .must(FS_OPTION_OPENFILE_READ_POLICY,FS_OPTION_OPENFILE_READ_POLICY_PARQUET)
-        .build().get()) {
-      ioStats = inputStream.getIOStatistics();
-      inputStream.readFully(buffer, 0, (int) fileStatus.getLen());
-    }
+    try (FSDataInputStream inputStream = getFileSystem().openFile(dest)
+        .must(FS_OPTION_OPENFILE_READ_POLICY,
+            FS_OPTION_OPENFILE_READ_POLICY_PARQUET)
+        .build().get()) {
+      ioStats = inputStream.getIOStatistics();
+      inputStream.readFully(buffer, 0, (int) fileStatus.getLen());
+    }

-    verifyStatisticCounterValue(ioStats, STREAM_READ_ANALYTICS_OPENED, 1);
-  }
+    verifyStatisticCounterValue(ioStats, STREAM_READ_ANALYTICS_OPENED, 1);
Review Comment:
add a check for the filesystem iostats too, to make sure it trickles up
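One possible shape for that extra assertion, sketched here as a fragment for the end of the test method. It assumes the test's filesystem instance aggregates stream statistics and exposes them through `getIOStatistics()`; the expected value of 2 (one per stream opened above) is an assumption about how the counters aggregate, not verified output:

```java
// Hedged sketch: also assert the counter surfaces in the
// filesystem-level statistics, i.e. that stream stats trickle up.
// The expected value (2: one per analytics stream opened above)
// is an assumption about aggregation, not a verified result.
verifyStatisticCounterValue(getFileSystem().getIOStatistics(),
    STREAM_READ_ANALYTICS_OPENED, 2);
```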
> S3A: Add initial support for analytics-accelerator-s3
> -----------------------------------------------------
>
> Key: HADOOP-19348
> URL: https://issues.apache.org/jira/browse/HADOOP-19348
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.4.2
> Reporter: Ahmar Suhail
> Assignee: Ahmar Suhail
> Priority: Major
> Labels: pull-request-available
>
> S3 recently released [Analytics Accelerator Library for Amazon
> S3|https://github.com/awslabs/analytics-accelerator-s3] as an Alpha release,
> which is an input stream, with an initial goal of improving performance for
> Apache Spark workloads on Parquet datasets.
> For example, it implements optimisations such as footer prefetching, and so
> avoids the multiple GETs S3AInputStream currently makes for the footer bytes
> and PageIndex structures.
> The library also tracks columns currently being read by a query using the
> parquet metadata, and then prefetches these bytes when parquet files with the
> same schema are opened.
> This ticket tracks the work required for the basic initial integration. There
> is still more work to be done, such as VectoredIO support etc, which we will
> identify and follow up with.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)