[ https://issues.apache.org/jira/browse/DRILL-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe L. Korn updated DRILL-4976:
-------------------------------
    Summary: Querying Parquet files on S3 pulls too much data  (was: Querying Parquet files on S3 pulls )

> Querying Parquet files on S3 pulls too much data
> -------------------------------------------------
>
>                 Key: DRILL-4976
>                 URL: https://issues.apache.org/jira/browse/DRILL-4976
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.8.0
>            Reporter: Uwe L. Korn
>
> Currently (Drill 1.8, Hadoop 2.7.2), when queries are executed on files stored
> in S3, the underlying s3a implementation requests orders of magnitude too much
> data. Given sufficient seek sizes, the following HTTP pattern is observed:
> * GET bytes=8k-100M
> * GET bytes=2M-100M
> * GET bytes=4M-100M
> Although the HTTP requests were normally aborted before all the data was
> sent by the server, the volume that went over the network was still about
> 10-15x the size of the input files, i.e. for a file of 100M in size, sometimes
> 1G of data is transferred over the network.
> A fix for this is the new {{fs.s3a.experimental.input.fadvise=random}} mode,
> which will be introduced with Hadoop 3.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
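For reference, the {{fs.s3a.experimental.input.fadvise}} property mentioned in the issue is a standard Hadoop configuration key and would typically be set in {{core-site.xml}}. A minimal sketch, assuming a Hadoop version in which the property exists:

{code:xml}
<!-- core-site.xml: switch S3A input streams to random-access mode,
     which avoids the large speculative GET ranges described above.
     Suited to columnar formats like Parquet that seek within files. -->
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
{code}

Note the property is marked experimental; the default ({{normal}}/sequential behavior) remains better suited to whole-file reads.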