Uwe L. Korn created DRILL-4976:
----------------------------------

             Summary: Querying Parquet files on S3 pulls
                 Key: DRILL-4976
                 URL: https://issues.apache.org/jira/browse/DRILL-4976
             Project: Apache Drill
          Issue Type: Improvement
          Components: Storage - Parquet
    Affects Versions: 1.8.0
            Reporter: Uwe L. Korn
Currently (Drill 1.8, Hadoop 2.7.2), when queries are executed on files stored in S3, the underlying s3a implementation requests orders of magnitude too much data. Given sufficiently large seeks, the following HTTP pattern is observed:

* GET bytes=8k-100M
* GET bytes=2M-100M
* GET bytes=4M-100M

Although the HTTP requests are normally aborted before all the data is sent by the server, roughly 10-15x the size of the input files still goes over the network, i.e. for a file of size 100M, sometimes 1G of data is transferred. A fix for this is the newly introduced {{fs.s3a.experimental.input.fadvise=random}} mode, which will be introduced with Hadoop 3.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
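For reference, on a Hadoop version that ships this option, the setting would go into the client's Hadoop configuration, e.g. {{core-site.xml}}. This is a sketch only; the property name is taken from the description above, and the surrounding file layout is the standard Hadoop configuration format:

```xml
<configuration>
  <!-- Hint to s3a that the read pattern is random/seek-heavy
       (e.g. Parquet footer then individual column chunks),
       so it should not request the rest of the object on every seek. -->
  <property>
    <name>fs.s3a.experimental.input.fadvise</name>
    <value>random</value>
  </property>
</configuration>
```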