Re: Beam on Flink not processing input splits in parallel

Janek Bevendorff Wed, 09 Mar 2022 08:27:07 -0800

Thanks for the response! That's what I feared was going on.

I consider this a huge shortcoming, particularly because it does not only affect users with large files like you said. The same happens with many small files, because file globs are also fused to one worker. The only way to process files in parallel is to write a PTransform that does MatchFiles(file_glob) | Reshuffle() | ReadAllFromText(). A simple ReadFromText(file_glob) would not be run in parallel.

In fact, if you feed multiple text file sources to your job, not only will each of these inputs process its files on one worker, but even the inputs are fused together. So even if you resolved the glob locally and then added one input for each individual file, all of that would still run sequentially.



On 09/03/2022 17:14, Jan Lukavský wrote:

Hi Janek,
I think you hit a deficiency in the FlinkRunner's SDF implementation. AFAIK the runner is unable to do dynamic splitting, which is what you are probably looking for. What you describe essentially works in the model, but FlinkRunner does not implement the complete contract to make use of the ability to split a large file into multiple parts and process them in parallel. I'm afraid there is no simple solution currently, other than what you described. Dynamic splitting might be introduced in some future release so that it starts working as you expect out of the box. This should affect mostly users with few large files; if you can parallelize on the files themselves, then it should work fine (which is what you observe).
 Jan

On 3/9/22 16:44, Janek Bevendorff wrote:
I went through all Flink and Beam documentation I could find to see if I overlooked something, but I could not get the text input source to unfuse the file input splits. This creates a huge input bottleneck, because one worker is busy reading records from a huge input file while 99 others wait for input and I can only shuffle the generated records, not the actual file input splits.
To fix it, I wrote a custom PTransform that globs files, optionally shuffles the file names, generates fixed-size split markers, shuffles the markers, and then reads the lines from these splits. This works well, but it feels like a hack, because I think Beam should be doing that out of the box. My implementation of the file reader is also much simpler and relies on IOBase.readline(), which keeps the code short and concise, but also loses a lot of flexibility compared to the Beam file reader implementation (such as support for custom line delimiters).
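
For illustration, the split-marker idea can be sketched in plain Python. This is not the actual implementation from my pipeline, only a minimal sketch under stated assumptions: the names make_splits, read_split and split_size are hypothetical, splits are byte ranges, lines end in "\n", and a line belongs to the split in which its first byte lies.

```python
import os


def make_splits(path, split_size):
    """Yield (path, start, end) byte ranges that together cover the file.

    Hypothetical helper: split_size is the fixed split length in bytes.
    """
    size = os.path.getsize(path)
    for start in range(0, size, split_size):
        yield path, start, min(start + split_size, size)


def read_split(path, start, end):
    """Yield every line whose first byte lies in [start, end)."""
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard everything up to and including
            # the next newline. If byte start-1 is itself a newline, only
            # that byte is consumed and the line starting at `start` is kept.
            # Otherwise the discarded partial line started in an earlier
            # split, which reads it in full.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break  # end of file
            # Strip the trailing line terminator and decode as UTF-8.
            yield line.rstrip(b"\r\n").decode("utf-8")
```

Because every marker is an independent (path, start, end) tuple, the markers can be shuffled across workers before reading. In a Beam pipeline this would presumably correspond to a FlatMap over make_splits, a Reshuffle, and a FlatMap over read_split, but that wiring is an assumption and not shown here.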
Any other ideas how I can solve this without writing custom PTransforms?

Janek


On 08/03/2022 14:11, Janek Bevendorff wrote:
Hey there,
According to the docs, when using a FileBasedSource or a splittable DoFn, the runner is free to initiate splits that can be run in parallel. As far as I can tell, the former is actually happening on my Apache Flink cluster, but the latter is not. This causes a single Taskmanager to process all splits of an input text file. Is this known behaviour and how can I fix this?
I have a pipeline that looks like this (Python SDK):

(pipeline
    | 'Read Input File' >> textio.ReadFromText(input_glob, min_bundle_size=1)
    | 'Reshuffle Lines' >> beam.Reshuffle()
    | 'Map Records' >> beam.ParDo(map_func))

The input file is a large, uncompressed plaintext file from a shared drive containing millions of newline-separated data records. I am running this job with a parallelism of 100, but it is bottlenecked by a single worker running ReadFromText(). The reshuffling in between was added to force Beam/Flink to parallelize the processing, but this has no effect on the preceding stage. Only the following map operation is being run in parallel. The stage itself is marked as having a parallelism of 100, but 99 workers finish immediately.
I had the same issue earlier with another input source, in which I match a bunch of WARC file globs and then iterate over them in a splittable DoFn. I solved the missing parallelism by adding an explicit reshuffle in between matching input globs and actually reading the individual files:

class WarcInput(beam.PTransform):
    def expand(self, pcoll):
        return (pcoll
                | MatchFiles(self._file_pattern)
                | beam.Reshuffle()
                | beam.ParDo(WarcReader()))

This way I can at least achieve parallelism on file level. This doesn't work with a single splittable input file, of course, for which I would have to reshuffle somewhere inside of ReadFromText(). Do I really have to write a custom PTransform that generates initial splits, shuffles them, and then reads from those splits? I consider this somewhat essential functionality.
Any hints appreciated.

Janek

--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de
