[ https://issues.apache.org/jira/browse/HADOOP-19559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran updated HADOOP-19559:
------------------------------------
    Description:
Make "analytics" the default input stream in S3A.

Goals
* Parquet performance through applications running queries over the data (Spark etc.)
* Performance for other formats as good as or better than today. Examples: Avro manifests in Iceberg, ORC in Hive/Spark
* Performance for other uses as good as today (whole-file/sequential reads of Parquet data in distcp etc.)
* Better resilience to bad uses (incomplete reads not retaining HTTP streams, buffer allocations on long-retained data)
* Efficient on applications like Impala, which caches Parquet footers itself and uses unbuffer() to discard all stream-side resources. Maybe just throw away all state on unbuffer() and stop trying to be sophisticated, or support some new openFile flag which can be used to disable footer parsing

  was:
This tracks work required to make AAL default on in S3A. The initial focus will be to make it default on for Spark + Parquet workloads only.

> S3A: Analytics accelerator for S3 to be enabled by default
> ----------------------------------------------------------
>
>                 Key: HADOOP-19559
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19559
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs/s3
>    Affects Versions: 3.5.0, 3.4.2
>            Reporter: Ahmar Suhail
>            Priority: Major
>              Labels: pull-request-available
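
The "throw away all state on unbuffer()" option in the last goal can be pictured with a small model. This is only a sketch under stated assumptions, not S3A or AAL code: the class DiscardOnUnbufferStream and all of its members are hypothetical, and an in-memory byte array stands in for the remote object. In Hadoop itself, unbuffer() comes from the org.apache.hadoop.fs.CanUnbuffer interface.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

/**
 * Hypothetical model of a stream that drops all stream-side state on unbuffer().
 * Illustrative only: none of these names exist in S3A or the analytics stream.
 */
class DiscardOnUnbufferStream extends InputStream {
    private final byte[] object;          // stands in for the remote object
    private ByteArrayInputStream inner;   // stands in for the HTTP stream and buffers
    private int pos;                      // the only state kept across unbuffer()
    private int opens;                    // how often the inner stream was (re)opened

    DiscardOnUnbufferStream(byte[] object) {
        this.object = object;
    }

    /** Lazily (re)open the inner stream at the current read position. */
    private ByteArrayInputStream ensureOpen() {
        if (inner == null) {
            inner = new ByteArrayInputStream(object, pos, object.length - pos);
            opens++;
        }
        return inner;
    }

    @Override
    public int read() {
        int b = ensureOpen().read();
        if (b >= 0) {
            pos++;
        }
        return b;
    }

    /** Drop everything except the position; no cached data is retained. */
    public void unbuffer() {
        // A real stream would also release its connection and buffers here.
        inner = null;
    }

    @Override
    public void close() {
        unbuffer();
    }

    public int getPos() { return pos; }
    public int getOpens() { return opens; }
    public boolean holdsResources() { return inner != null; }
}
```

In this model a caller that keeps its own footer cache loses nothing on unbuffer(): the next read() reopens at the saved position, at the cost of one extra open.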