Uwe L. Korn created DRILL-4976:
----------------------------------

             Summary: Querying Parquet files on S3 pulls 
                 Key: DRILL-4976
                 URL: https://issues.apache.org/jira/browse/DRILL-4976
             Project: Apache Drill
          Issue Type: Improvement
          Components: Storage - Parquet
    Affects Versions: 1.8.0
            Reporter: Uwe L. Korn


Currently (Drill 1.8, Hadoop 2.7.2), when queries are executed on files stored 
in S3, the underlying s3a implementation requests orders of magnitude more data 
than needed. Given sufficiently large seeks, the following HTTP pattern is 
observed:

* GET bytes=8k-100M
* GET bytes=2M-100M
* GET bytes=4M-100M

Although the HTTP requests were normally aborted before all the data was
sent by the server, about 10-15x the size of the input files still went
over the network, i.e. for a file of 100M in size, sometimes 1G of data
was transferred.

A fix for this is the new 
{{fs.s3a.experimental.input.fadvise=random}} mode, which will be introduced 
with Hadoop 3.
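A minimal sketch of how that setting would be applied, assuming the usual Hadoop {{core-site.xml}} convention (the property name is taken from this issue; whether it is honored depends on the Hadoop version Drill is built against):

```xml
<!-- core-site.xml: switch s3a to random-access reads, so that short
     seeks (e.g. Parquet footer and column-chunk reads) issue bounded
     range requests instead of open-ended GETs to end-of-file. -->
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
```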




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
