[jira] [Updated] (HADOOP-19559) S3A: Analytics accelerator for S3 to be enabled by default

Steve Loughran (Jira) Wed, 09 Jul 2025 04:39:19 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-19559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Steve Loughran updated HADOOP-19559:
------------------------------------
    Description: 
Make "analytics" the default input stream in S3A. 

Goals
* Parquet performance in applications running queries over the data (Spark 
etc.)
* Performance for other formats as good as or better than today. Examples: Avro 
manifests in Iceberg, ORC in Hive/Spark
* Performance for other uses as good as today (whole-file/sequential reads of 
Parquet data in distcp etc.)
* Better resilience to bad uses (incomplete reads not retaining HTTP streams, 
buffer allocations on long-retained data)
* Efficient on applications like Impala, which caches Parquet footers itself 
and uses unbuffer() to discard all stream-side resources. Maybe just throw 
away all state on unbuffer() and stop trying to be sophisticated, or support 
some new openFile flag which can be used to disable footer parsing
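
The "throw away all state on unbuffer()" option in the last goal can be 
sketched roughly as below. This is a minimal illustration only, not the S3A or 
analytics stream implementation: the class CachingStreamSketch, its fields and 
its methods are all hypothetical names, and the "remote object" is just an 
in-memory byte array. Hadoop's real unbuffer() contract lives in the 
CanUnbuffer interface, which this sketch deliberately does not depend on.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical stand-in for a caching input stream (assumption: not real
 * S3A code). It keeps a cached footer and prefetched blocks, and drops
 * every one of them when unbuffer() is called.
 */
class CachingStreamSketch {
    private final byte[] object;          // pretend remote object
    private final int footerLength;       // assumed fixed footer size
    private final Map<Integer, byte[]> prefetched = new HashMap<>();
    private byte[] cachedFooter;          // stream-side footer cache

    CachingStreamSketch(byte[] object, int footerLength) {
        this.object = object;
        this.footerLength = footerLength;
    }

    /** Read the trailing footer bytes, caching them on first use. */
    byte[] readFooter() {
        if (cachedFooter == null) {
            cachedFooter = Arrays.copyOfRange(
                object, object.length - footerLength, object.length);
        }
        return cachedFooter;
    }

    /** Read a block at the given offset, caching it by offset. */
    byte[] readBlock(int offset, int length) {
        return prefetched.computeIfAbsent(offset, start ->
            Arrays.copyOfRange(
                object, start, Math.min(object.length, start + length)));
    }

    /** The simple policy: release all stream-side state, no exceptions. */
    void unbuffer() {
        prefetched.clear();
        cachedFooter = null;
    }

    /** Bytes currently held by the stream's caches. */
    int retainedBytes() {
        int total = (cachedFooter == null) ? 0 : cachedFooter.length;
        for (byte[] block : prefetched.values()) {
            total += block.length;
        }
        return total;
    }

    public static void main(String[] args) {
        CachingStreamSketch in = new CachingStreamSketch(new byte[100], 8);
        in.readFooter();
        in.readBlock(0, 16);
        System.out.println("retained before unbuffer: " + in.retainedBytes());
        in.unbuffer();
        System.out.println("retained after unbuffer: " + in.retainedBytes());
    }
}
```

The trade-off is that a caller which does not cache footers itself pays for a 
second footer read after unbuffer(); for a caller like Impala, which does, the 
stream holds nothing between reads.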


  was:
This tracks work required to make AAL default on in S3A. 

The initial focus will be to make it default on for Spark + Parquet workloads 
only. 


> S3A: Analytics accelerator for S3 to be enabled by default
> ----------------------------------------------------------
>
>                 Key: HADOOP-19559
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19559
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs/s3
>    Affects Versions: 3.5.0, 3.4.2
>            Reporter: Ahmar Suhail
>            Priority: Major
>              Labels: pull-request-available



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

