Grzegorz Liter created FLINK-35704:
--------------------------------------
Summary: ForkJoinPool introduction to
NonSplittingRecursiveEnumerator to vastly improve enumeration performance
Key: FLINK-35704
URL: https://issues.apache.org/jira/browse/FLINK-35704
Project: Flink
Issue Type: Improvement
Components: Connectors / FileSystem
Reporter: Grzegorz Liter
Attachments: ParallelNonSplittingRecursiveEnumerator.java
In the current implementation of NonSplittingRecursiveEnumerator, files and
directories are enumerated sequentially. When accessing remote storage such as
S3, the vast majority of that time is spent waiting for responses.
What is worse, the enumeration is done by the JM itself, which is unresponsive
to RPC calls while it runs. When accessing many (thousands or more) files, the
wait time quickly adds up and can cause a Pekko timeout.
Performance can be improved by enumerating files in parallel, e.g. with a
ForkJoinPool and parallel streams. I am attaching an example implementation that
I am happy to contribute to the Flink repository.
In my tests it cuts the enumeration time by at least 10x.
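
To illustrate the idea only, here is a minimal sketch of parallel recursive
enumeration. It is not the attached ParallelNonSplittingRecursiveEnumerator.java
and does not use Flink's FileSystem API: it assumes a local directory tree read
through java.nio.file as a stand-in for remote storage, and it uses a
RecursiveTask on a dedicated ForkJoinPool rather than parallel streams. The
names ParallelEnumerationSketch, ListTask and enumerate are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelEnumerationSketch {

    /** Lists one directory and forks a sub-task for every sub-directory found. */
    static final class ListTask extends RecursiveTask<List<Path>> {
        private final Path dir;

        ListTask(Path dir) {
            this.dir = dir;
        }

        @Override
        protected List<Path> compute() {
            List<Path> files = new ArrayList<>();
            List<ListTask> subTasks = new ArrayList<>();
            // The listing call is the slow, latency-bound step on remote storage.
            try (Stream<Path> children = Files.list(dir)) {
                for (Path child : children.collect(Collectors.toList())) {
                    if (Files.isDirectory(child)) {
                        subTasks.add(new ListTask(child));
                    } else {
                        files.add(child);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            // Run all sub-directory listings concurrently, then merge their results.
            for (ListTask task : invokeAll(subTasks)) {
                files.addAll(task.join());
            }
            return files;
        }
    }

    /** Returns all regular files under root, using a dedicated pool of the given size. */
    static List<Path> enumerate(Path root, int parallelism) {
        if (!Files.isDirectory(root)) {
            return List.of(root);
        }
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new ListTask(root));
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling ParallelEnumerationSketch.enumerate(root, 32) would list up to 32
directories at a time. A dedicated pool is used here so that blocking listing
calls do not occupy the JVM-wide common pool; the right parallelism for a given
object store is something that would need measuring.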