[ https://issues.apache.org/jira/browse/DRILL-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe L. Korn updated DRILL-4976:
-------------------------------
    Summary: Querying Parquet files on S3 pulls too much data  (was: Querying Parquet files on S3 pulls )

> Querying Parquet files on S3 pulls too much data
> -------------------------------------------------
>
>                 Key: DRILL-4976
>                 URL: https://issues.apache.org/jira/browse/DRILL-4976
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.8.0
>            Reporter: Uwe L. Korn
>
> Currently (Drill 1.8, Hadoop 2.7.2), when queries are executed on files stored
> in S3, the underlying s3a implementation requests orders of magnitude too much
> data. Given sufficient seek sizes, the following HTTP pattern is observed:
> * GET bytes=8k-100M
> * GET bytes=2M-100M
> * GET bytes=4M-100M
> Although the HTTP requests were normally aborted before all the data was
> sent by the server, the volume that went over the network was still about
> 10-15x the size of the input files, i.e. for a file of 100M in size, sometimes
> 1G of data is transferred over the network.
> A fix for this is the new {{fs.s3a.experimental.input.fadvise=random}} mode,
> which will be introduced with Hadoop 3.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
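For reference, the {{fs.s3a.experimental.input.fadvise}} property mentioned in the issue is a standard Hadoop configuration key and would typically be set in {{core-site.xml}}. A minimal sketch, assuming a Hadoop version in which the property exists:

{code:xml}
<!-- core-site.xml: switch S3A input streams to random-access mode,
     which avoids the large speculative GET ranges described above.
     Suited to columnar formats like Parquet that seek within files. -->
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
{code}

Note the property is marked experimental; the default ({{normal}}/sequential behavior) remains better suited to whole-file reads.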