Hi Janek,
I think you hit a deficiency in the FlinkRunner's SDF implementation.
AFAIK the runner is unable to do dynamic splitting, which is what you
are probably looking for. What you describe essentially works in the
model, but the FlinkRunner does not implement the complete contract
needed to split a large file into multiple parts and process them
in parallel. I'm afraid there is currently no simple solution other
than what you described. Dynamic splitting might be introduced in a
future release, so that it starts working as you expect out of the
box. This mostly affects users with a few large files; if you can
parallelize on the files themselves, it should work fine (which is what
you observe).
Jan
On 3/9/22 16:44, Janek Bevendorff wrote:
I went through all Flink and Beam documentation I could find to see if
I overlooked something, but I could not get the text input source to
unfuse the file input splits. This creates a huge input bottleneck,
because one worker is busy reading records from a huge input file
while 99 others wait for input and I can only shuffle the generated
records, not the actual file input splits.
To fix it, I wrote a custom PTransform that globs files, optionally
shuffles the file names, generates fixed-size split markers, shuffles
the markers, and then reads the lines from these splits. This works
well, but it feels like a hack, because I think Beam should be doing
that out of the box. My implementation of the file reader is also much
simpler and relies on IOBase.readline(), which keeps the code short
and concise, but also loses a lot of flexibility compared to the Beam
file reader implementation (such as support for custom line delimiters).
Any other ideas on how I can solve this without writing custom PTransforms?
Janek
On 08/03/2022 14:11, Janek Bevendorff wrote:
Hey there,
According to the docs, when using a FileBasedSource or a splittable
DoFn, the runner is free to initiate splits that can be run in
parallel. As far as I can tell, the former is actually happening on
my Apache Flink cluster, but the latter is not. This causes a single
TaskManager to process all splits of an input text file. Is this
known behaviour, and how can I fix it?
I have a pipeline that looks like this (Python SDK):
(pipeline
 | 'Read Input File' >> textio.ReadFromText(input_glob, min_bundle_size=1)
 | 'Reshuffle Lines' >> beam.Reshuffle()
 | 'Map Records' >> beam.ParDo(map_func))
The input file is a large, uncompressed plaintext file from a shared
drive containing millions of newline-separated data records. I am
running this job with a parallelism of 100, but it is bottlenecked by
a single worker running ReadFromText(). The reshuffling in between
was added to force Beam/Flink to parallelize the processing, but this
has no effect on the preceding stage. Only the following map
operation is being run in parallel. The stage itself is marked as
having a parallelism of 100, but 99 workers finish immediately.
I had the same issue earlier with another input source, in which I
match a bunch of WARC file globs and then iterate over them in a
splittable DoFn. I solved the missing parallelism by adding an
explicit reshuffle in between matching input globs and actually
reading the individual files:
class WarcInput(beam.PTransform):
    def expand(self, pcoll):
        return (pcoll
                | MatchFiles(self._file_pattern)
                | beam.Reshuffle()
                | beam.ParDo(WarcReader()))
This way I can at least achieve parallelism on file level. This
doesn't work with a single splittable input file, of course, for
which I would have to reshuffle somewhere inside of ReadFromText().
Do I really have to write a custom PTransform that generates initial
splits, shuffles them, and then reads from those splits? I consider
this somewhat essential functionality.
Any hints appreciated.
Janek