[jira] [Created] (IMPALA-8523) Migrate hdfsOpen to builder-based openFile API
Sahil Takiar created IMPALA-8523:

Summary: Migrate hdfsOpen to builder-based openFile API
Key: IMPALA-8523
URL: https://issues.apache.org/jira/browse/IMPALA-8523
Project: IMPALA
Issue Type: Improvement
Components: Backend
Reporter: Sahil Takiar
Assignee: Sahil Takiar

When opening files via libhdfs, we call {{hdfsOpen}}, which ultimately calls {{FileSystem#open(Path f, int bufferSize)}}. As of HADOOP-15229, the HDFS client exposes a new API for opening files called {{openFile}}. The new API has two advantages: (1) it can specify file-specific configuration values in a builder-based manner (see {{o.a.h.fs.FSBuilder}} for details), and (2) it can open files asynchronously (see {{o.a.h.fs.FutureDataInputStreamBuilder}} for details).

The async file opens are similar to IMPALA-7738 (Implement timeouts for HDFS open calls). To avoid overlap between IMPALA-7738 and the async file opens in {{openFile}}, HADOOP-15691 can be used to check which filesystems open files asynchronously and which ones don't (currently only S3A opens files asynchronously).

The main use case for the new {{openFile}} API is Impala-S3 performance. Performance benchmarks have shown that setting {{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} for Parquet files can significantly improve performance. However, this setting also adversely affects scans of non-splittable file formats such as gzipped files (see HADOOP-13203). One solution is simply to document that setting {{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} improves performance for Parquet. A better solution would be to use the new {{openFile}} API to specify different values of fadvise depending on the file type.

This work depends on exposing the new {{openFile}} API via libhdfs (HDFS-14478).
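
As a rough illustration of the per-file-type idea, the sketch below picks an fadvise value for each file and returns the option that would be passed to the {{openFile}} builder. It is only a sketch under assumptions: the function names, the extension sets, and the extension-based detection are hypothetical (Impala would more likely take the file format from table metadata), and the lowercase policy names are assumed to be the values S3A accepts for this key.

```python
import os

# Configuration key named in this issue.
FADVISE_KEY = "fs.s3a.experimental.input.fadvise"

# Hypothetical extension sets, for illustration only.
RANDOM_IO_EXTENSIONS = {".parquet"}   # columnar, seek-heavy reads
SEQUENTIAL_EXTENSIONS = {".gz"}       # non-splittable, read front to back


def fadvise_for(path: str) -> str:
    """Return the assumed S3A input policy name for a file path."""
    ext = os.path.splitext(path)[1].lower()
    if ext in RANDOM_IO_EXTENSIONS:
        return "random"
    if ext in SEQUENTIAL_EXTENSIONS:
        return "sequential"
    # Fall back to the filesystem default for unknown formats.
    return "normal"


def open_options(path: str) -> dict:
    """Build the per-file options that a builder-based open would receive."""
    return {FADVISE_KEY: fadvise_for(path)}


if __name__ == "__main__":
    for p in ("s3a://bucket/t/part-0.parquet", "s3a://bucket/t/data.csv.gz"):
        print(p, open_options(p))
```

The point of the design is that the policy is decided per open call rather than once per filesystem, so Parquet scans and gzip scans in the same cluster no longer share a single setting.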