[jira] [Created] (IMPALA-8523) Migrate hdfsOpen to builder-based openFile API

2019-05-08 Thread Sahil Takiar (JIRA)
Sahil Takiar created IMPALA-8523:


 Summary: Migrate hdfsOpen to builder-based openFile API
 Key: IMPALA-8523
 URL: https://issues.apache.org/jira/browse/IMPALA-8523
 Project: IMPALA
  Issue Type: Improvement
  Components: Backend
Reporter: Sahil Takiar
Assignee: Sahil Takiar


When opening files via libhdfs we call {{hdfsOpen}}, which ultimately calls 
{{FileSystem#open(Path f, int bufferSize)}}. As of HADOOP-15229, the 
HDFS client exposes a new API for opening files called {{openFile}}. The 
new API has two advantages: (1) it can specify file-specific 
configuration values in a builder-based manner (see {{o.a.h.fs.FSBuilder}} for 
details), and (2) it can open files asynchronously (see 
{{o.a.h.fs.FutureDataInputStreamBuilder}} for details).

The async file opens are similar to IMPALA-7738 (Implement timeouts for HDFS 
open calls). To avoid overlap between IMPALA-7738 and the async file opens in 
{{openFile}}, HADOOP-15691 can be used to check which filesystems open files 
asynchronously and which ones don't (currently only S3A opens files 
asynchronously).

The main use case for the new {{openFile}} API is Impala-S3 performance. 
Performance benchmarks have shown that setting 
{{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} for Parquet files can 
significantly improve performance. However, this setting also adversely affects 
scans of non-splittable file formats such as gzipped files (see HADOOP-13203). 
One solution is to just document that setting 
{{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} improves performance for 
Parquet. A better solution would be to use the new {{openFile}} 
API to specify different values of fadvise depending on the file type.
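
A minimal sketch of the per-file-type selection described above, written in 
Java for illustration only. The class name, the extension-to-policy table, and 
the use of file extensions to detect the format are all assumptions and are not 
part of this issue (Impala would more likely use the table's file format). The 
policy names {{normal}}, {{sequential}} and {{random}} are the values S3A 
accepts for this key. The chosen value would then be passed per file, roughly as 
{{fs.openFile(path).opt(key, value).build()}}, once that API is reachable from 
libhdfs.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: picks an S3A fadvise policy from a file name.
class FadviseChooser {
    // Configuration key quoted from the issue description.
    static final String FADVISE_KEY = "fs.s3a.experimental.input.fadvise";

    // Assumed mapping for illustration: columnar formats get random IO,
    // non-splittable compressed files get sequential IO.
    private static final Map<String, String> POLICY_BY_EXTENSION = Map.of(
            "parquet", "random",
            "parq", "random",
            "gz", "sequential");

    // Returns the policy for the last path component's extension,
    // falling back to "normal" when the extension is unknown or absent.
    static String policyFor(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        String ext = dot < 0
                ? ""
                : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return POLICY_BY_EXTENSION.getOrDefault(ext, "normal");
    }

    public static void main(String[] args) {
        String[] samples = {
            "s3a://bucket/tbl/part-0.parquet",
            "s3a://bucket/tbl/data.csv.gz",
            "s3a://bucket/tbl/file"
        };
        for (String p : samples) {
            System.out.println(p + " -> " + FADVISE_KEY + "=" + policyFor(p));
        }
    }
}
```

Keeping the choice in one lookup makes it easy to leave the cluster-wide 
default untouched and override fadvise only for the files that benefit.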

This work is dependent on exposing the new {{openFile}} API via libhdfs 
(HDFS-14478).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)



