There are two "kinds" of splits in SDF - one splits the restriction *before* it is processed and the other *during* processing. The first one is supported (it is needed for correctness), while the second is, in the bounded case, only an optimization and is not currently supported. It seems to me that it should be possible to pre-split on filenames, which could then be processed in parallel. Unfortunately, I'm not that familiar with the Python SDK's fileio, so I'd rather leave the more detailed answer to someone else. But otherwise what you say makes sense to me.

 Jan

On 3/9/22 17:26, Janek Bevendorff wrote:
Thanks for the response! That's what I feared was going on.

I consider this a huge shortcoming, particularly because it does not only affect users with large files, as you suggested. The same thing happens with many small files, because file globs are also fused onto a single worker. The only way to process files in parallel is to write a PTransform that does MatchFiles(file_glob) | Reshuffle() | ReadAllFromText(); a simple ReadFromText(file_glob) is not run in parallel.
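
Concretely, such a wrapper could look something like this (an untested sketch; the transform name is made up, and the Map step is there because ReadAllFromText expects file names rather than the FileMetadata objects that MatchFiles emits):

import apache_beam as beam
from apache_beam.io.fileio import MatchFiles
from apache_beam.io.textio import ReadAllFromText

class ReadTextInParallel(beam.PTransform):
    """Match a glob, reshuffle the matched files, then read them in parallel."""
    def __init__(self, file_glob):
        self._file_glob = file_glob

    def expand(self, pcoll):
        return (pcoll
                | MatchFiles(self._file_glob)
                | beam.Map(lambda metadata: metadata.path)  # ReadAllFromText takes file names
                | beam.Reshuffle()  # break fusion so files land on different workers
                | ReadAllFromText())

Applied at the pipeline root, this can stand in for ReadFromText(file_glob).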

In fact, if you feed multiple text file sources to your job, not only does each of these inputs process its files on one worker, but the inputs themselves are also fused together. So even if you resolved the glob locally and then added one input for each individual file, all of that would still run sequentially.


On 09/03/2022 17:14, Jan Lukavský wrote:
Hi Janek,

I think you hit a deficiency in the FlinkRunner's SDF implementation. AFAIK the runner is unable to do dynamic splitting, which is what you are probably looking for. What you describe essentially works in the model, but the FlinkRunner does not implement the complete contract needed to split a large file into multiple parts and process them in parallel. I'm afraid there is currently no simple solution other than what you described. Dynamic splitting might be introduced in some future release, so that it starts working as you expect out of the box. This should mostly affect users with a few large files; if you can parallelize on the files themselves, it should work fine (which is what you observe).

 Jan

On 3/9/22 16:44, Janek Bevendorff wrote:
I went through all the Flink and Beam documentation I could find to see if I had overlooked something, but I could not get the text input source to unfuse the file input splits. This creates a huge input bottleneck, because one worker is busy reading records from a huge input file while 99 others wait for input, and I can only shuffle the generated records, not the actual file input splits.

To fix it, I wrote a custom PTransform that globs files, optionally shuffles the file names, generates fixed-size split markers, shuffles the markers, and then reads the lines from these splits. This works well, but it feels like a hack, because I think Beam should be doing that out of the box. My implementation of the file reader is also much simpler and relies on IOBase.readline(), which keeps the code short and concise, but also loses a lot of flexibility compared to the Beam file reader implementation (such as support for custom line delimiters).
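
For reference, the approach boils down to something like the following trimmed-down sketch (the names and the fixed split size are made up; it assumes an uncompressed input, plain newline delimiters, and a seekable FileSystems stream, and skips all error handling):

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

SPLIT_SIZE = 64 * 1024 * 1024  # illustrative fixed split size (64 MiB)

class GenerateSplits(beam.DoFn):
    """Expand a file pattern into (path, start, end) byte-range markers."""
    def process(self, file_pattern):
        for metadata in FileSystems.match([file_pattern])[0].metadata_list:
            for start in range(0, metadata.size_in_bytes, SPLIT_SIZE):
                yield metadata.path, start, min(start + SPLIT_SIZE, metadata.size_in_bytes)

class ReadSplit(beam.DoFn):
    """Emit the lines whose first byte falls inside the given split."""
    def process(self, split):
        path, start, end = split
        with FileSystems.open(path) as f:
            if start > 0:
                f.seek(start)
                f.readline()  # the partial line belongs to the previous split
            pos = f.tell()
            while pos < end:
                line = f.readline()
                if not line:
                    break
                yield line.rstrip(b'\r\n').decode('utf-8')
                pos = f.tell()

class ParallelReadFromText(beam.PTransform):
    def __init__(self, file_pattern):
        self._file_pattern = file_pattern

    def expand(self, pcoll):
        return (pcoll
                | beam.Create([self._file_pattern])
                | beam.ParDo(GenerateSplits())
                | beam.Reshuffle()  # distribute the split markers across workers
                | beam.ParDo(ReadSplit()))

The reshuffle between generating and reading the splits is what actually breaks the fusion, so the reads end up on different workers.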

Any other ideas for how I can solve this without writing custom PTransforms?

Janek


On 08/03/2022 14:11, Janek Bevendorff wrote:
Hey there,

According to the docs, when using a FileBasedSource or a splittable DoFn, the runner is free to initiate splits that can be run in parallel. As far as I can tell, the splitting is actually happening on my Apache Flink cluster, but the parallel execution is not. This causes a single TaskManager to process all splits of an input text file. Is this known behaviour, and how can I fix it?

I have a pipeline that looks like this (Python SDK):

import apache_beam as beam
from apache_beam.io import textio

(pipeline
 | 'Read Input File' >> textio.ReadFromText(input_glob, min_bundle_size=1)
 | 'Reshuffle Lines' >> beam.Reshuffle()
 | 'Map Records' >> beam.ParDo(map_func))

The input file is a large, uncompressed plaintext file on a shared drive containing millions of newline-separated data records. I am running this job with a parallelism of 100, but it is bottlenecked by a single worker running ReadFromText(). The reshuffle in between was added to force Beam/Flink to parallelize the processing, but it has no effect on the preceding read stage; only the subsequent map operation is run in parallel. The read stage itself is marked as having a parallelism of 100, but 99 of its workers finish immediately.

I had the same issue earlier with another input source, in which I match a bunch of WARC file globs and then iterate over the matches in a splittable DoFn. I solved the missing parallelism by adding an explicit reshuffle between matching the input globs and actually reading the individual files:

class WarcInput(beam.PTransform):
    def __init__(self, file_pattern):
        self._file_pattern = file_pattern

    def expand(self, pcoll):
        return pcoll | MatchFiles(self._file_pattern) | beam.Reshuffle() | beam.ParDo(WarcReader())

This way I can at least achieve parallelism at the file level. That doesn't work for a single splittable input file, of course, for which I would have to reshuffle somewhere inside ReadFromText(). Do I really have to write a custom PTransform that generates initial splits, shuffles them, and then reads from those splits? I consider this somewhat essential functionality.

Any hints appreciated.

Janek
