[
https://issues.apache.org/jira/browse/PARQUET-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058231#comment-14058231
]
Cheng Lian commented on PARQUET-16:
-----------------------------------
Hi [~dvryaboy], thanks for pointing out that PR; the background information is
very helpful. In fact, I can make my change compatible with that PR.
Essentially, the problem we are facing is much simpler.
Currently, calling {{ParquetInputFormat.getSplits(JobContext jobContext)}}
results in a call chain like this:
{code}
List<InputSplit> getSplits(JobContext jobContext)
  List<Footer> getFooters(JobContext jobContext)
    List<FileStatus> listStatus(JobContext jobContext)    <-- (1)
  List<ParquetInputSplit> getSplits(Configuration configuration, List<Footer> footers)
    ...    <-- (2)
{code}
Basically, all the {{FileStatus}} objects are already fetched at (1) but
discarded immediately, and then fetched again at (2). The bad part is that
(2) fetches them by calling {{getFileStatus()}} on every part-file
sequentially. So we only need to pass the objects fetched at (1) on to (2);
caching is not required to solve this performance issue.
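To make the idea concrete, here is a minimal, self-contained sketch. It does not use the real Hadoop or parquet-mr classes: {{Status}}, {{CountingFs}}, and the two {{totalLength...}} methods are hypothetical stand-ins that I made up for illustration, and summing file lengths stands in for the actual split computation. The only point is to show how reusing the statuses from the listing at (1) removes the per-file round trips at (2).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only; none of these are Hadoop or parquet-mr types.
class StatusReuseSketch {

    // Plays the role of a FileStatus: a path plus its length.
    static final class Status {
        final String path;
        final long length;

        Status(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    // Plays the role of a remote file system and counts simulated round trips.
    static final class CountingFs {
        private final Map<String, Long> files = new LinkedHashMap<>();
        int listCalls = 0;
        int statusCalls = 0;

        void add(String path, long length) {
            files.put(path, length);
        }

        // One call returns the status of every part-file, as at (1).
        List<Status> listStatus() {
            listCalls++;
            List<Status> result = new ArrayList<>();
            for (Map.Entry<String, Long> e : files.entrySet()) {
                result.add(new Status(e.getKey(), e.getValue()));
            }
            return result;
        }

        // One call per part-file, as at (2).
        Status getFileStatus(String path) {
            statusCalls++;
            return new Status(path, files.get(path));
        }
    }

    // Problematic pattern: keep only the path, then look the status up again.
    static long totalLengthRefetching(CountingFs fs) {
        long total = 0;
        for (Status listed : fs.listStatus()) {
            String path = listed.path;              // status is dropped here
            total += fs.getFileStatus(path).length; // and fetched again here
        }
        return total;
    }

    // Proposed pattern: hand the already-fetched statuses to the next step.
    static long totalLengthReusing(CountingFs fs) {
        long total = 0;
        for (Status listed : fs.listStatus()) {
            total += listed.length;
        }
        return total;
    }

    public static void main(String[] args) {
        CountingFs refetch = new CountingFs();
        CountingFs reuse = new CountingFs();
        for (int i = 0; i < 3; i++) {
            refetch.add("part-" + i, 100L);
            reuse.add("part-" + i, 100L);
        }
        totalLengthRefetching(refetch);
        totalLengthReusing(reuse);
        System.out.println("refetching: per-file status calls = " + refetch.statusCalls);
        System.out.println("reusing:    per-file status calls = " + reuse.statusCalls);
    }
}
```

In this toy model the refetching variant makes one per-file call for each part-file on top of the listing, while the reusing variant makes only the single listing call. With thousands of part-files on S3, each of those extra calls would be a blocking HTTP request.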
> Unnecessary getFileStatus() calls on all part-files in
> ParquetInputFormat.getSplits
> -----------------------------------------------------------------------------------
>
> Key: PARQUET-16
> URL: https://issues.apache.org/jira/browse/PARQUET-16
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Reporter: Cheng Lian
>
> When testing Spark SQL Parquet support, we found that accessing large Parquet
> files located in S3 can be very slow. To be more specific, we have an S3
> Parquet file with over 3,000 part-files, and calling
> {{ParquetInputFormat.getSplits}} on it takes several minutes. (We were
> accessing this file from our office network rather than AWS.)
> After some investigation, we found that {{ParquetInputFormat.getSplits}}
> calls {{getFileStatus()}} on all part-files one by one, sequentially
> ([here|https://github.com/apache/incubator-parquet-mr/blob/parquet-1.5.0/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java#L370]).
> And in the case of S3, each {{getFileStatus()}} call issues an HTTP request
> and waits for the reply in a blocking manner, which is quite expensive.
> Actually all these {{FileStatus}} objects have already been fetched when
> footers are retrieved
> ([here|https://github.com/apache/incubator-parquet-mr/blob/parquet-1.5.0/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java#L443]).
> Caching these {{FileStatus}} objects can greatly improve our S3 case
> (reduced from over 5 minutes to about 1.4 minutes).
> Will submit a PR for this issue soon.
--
This message was sent by Atlassian JIRA
(v6.2#6252)