Sahil Takiar created IMPALA-8523:
------------------------------------

             Summary: Migrate hdfsOpen to builder-based openFile API
                 Key: IMPALA-8523
                 URL: https://issues.apache.org/jira/browse/IMPALA-8523
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
            Reporter: Sahil Takiar
            Assignee: Sahil Takiar


When opening files via libhdfs we call {{hdfsOpen}}, which ultimately calls 
{{FileSystem#open(Path f, int bufferSize)}}. As of HADOOP-15229, the 
HDFS client exposes a new API for opening files called {{openFile}}. The 
new API has two advantages: (1) it can specify file-specific 
configuration values in a builder-based manner (see {{o.a.h.fs.FSBuilder}} for 
details), and (2) it can open files asynchronously (see 
{{o.a.h.fs.FutureDataInputStreamBuilder}} for details).
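
The shape of the builder-based call can be sketched without a Hadoop dependency. The class below is a hypothetical stand-in ({{ToyOpenFile}} is not a Hadoop class, and the path in the usage note is made up); it only mirrors the pattern described above: per-file options are accumulated with {{opt}}, and {{build}} returns a future, so the caller decides how long to wait. In Hadoop itself the corresponding chain is {{FileSystem#openFile(Path)}}, then {{opt(...)}}, then {{build()}}, which returns a {{CompletableFuture<FSDataInputStream>}}.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Toy model of a builder-based, asynchronous open. NOT a Hadoop class. */
final class ToyOpenFile {
    private final String path;
    private final Map<String, String> options = new LinkedHashMap<>();

    ToyOpenFile(String path) {
        this.path = path;
    }

    /** Records a per-file option and returns this builder for chaining. */
    ToyOpenFile opt(String key, String value) {
        options.put(key, value);
        return this;
    }

    /** Starts the simulated open on another thread and returns at once. */
    CompletableFuture<String> build() {
        // Snapshot the options so later opt() calls cannot change this open.
        Map<String, String> snapshot = new LinkedHashMap<>(options);
        return CompletableFuture.supplyAsync(
            () -> "opened " + path + " with " + snapshot);
    }

    /** Example caller: waits at most five seconds for the open to finish. */
    static String openWithTimeout(String path) {
        CompletableFuture<String> pending = new ToyOpenFile(path)
            .opt("fs.s3a.experimental.input.fadvise", "random")
            .build();
        try {
            return pending.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            // Covers timeout, interruption, and failure of the open itself.
            throw new IllegalStateException("open failed for " + path, e);
        }
    }
}
```

Calling {{ToyOpenFile.openWithTimeout("s3a://bucket/t/part-0.parq")}} returns a string that names the path and the options that were set. The point of the sketch is that the option travels with a single open call rather than with the whole filesystem configuration.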

The async file opens are similar to IMPALA-7738 (Implement timeouts for HDFS 
open calls). To avoid overlap between IMPALA-7738 and the async file opens in 
{{openFile}}, HADOOP-15691 can be used to check which filesystems open files 
asynchronously and which ones don't (currently only S3A opens files 
asynchronously).

The main use case for the new {{openFile}} API is Impala-S3 performance. 
Performance benchmarks have shown that setting 
{{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} for Parquet files can 
significantly improve performance. However, this setting also adversely affects 
scans of non-splittable file formats such as gzipped files (see HADOOP-13203). 
One solution is simply to document that setting 
{{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} improves performance for 
Parquet. A better solution would be to use the new {{openFile}} API to specify 
a different fadvise value depending on the file type.
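
A minimal sketch of the per-file-type choice follows. Everything in it is an assumption made for illustration: {{FadvisePolicy}} is a hypothetical helper, the file type is guessed from the file extension (Impala would more likely take it from table metadata), and {{sequential}} is assumed to be the right fallback for non-Parquet files. The values {{random}} and {{sequential}} are the lowercase S3A spellings of the fadvise policy.

```java
import java.util.Locale;

/** Hypothetical helper: picks an S3A fadvise value from the file name. */
final class FadvisePolicy {
    /** Configuration key named in this issue. */
    static final String FADVISE_KEY = "fs.s3a.experimental.input.fadvise";

    private FadvisePolicy() {}

    /** Returns the fadvise value to request when opening the given file. */
    static String forFile(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        // Assumption: Parquet files can be recognized by their extension.
        if (lower.endsWith(".parq") || lower.endsWith(".parquet")) {
            return "random";
        }
        // Assumption: everything else, including gzipped text, is read
        // front to back, so random access would only hurt.
        return "sequential";
    }
}
```

With the builder API, the result would be passed per file, along the lines of {{opt(FadvisePolicy.FADVISE_KEY, FadvisePolicy.forFile(name))}}, so Parquet scans and gzip scans in the same query could use different policies.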

This work depends on exposing the new {{openFile}} API via libhdfs 
(HDFS-14478).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
