Hi,

I'm writing Java Beam code that needs to run on both Dataflow and Spark. The
input files are in TFRecord format and live in multiple directories. Java's
TFRecordIO doesn't have a readAll that takes a list of files, so what I'm
doing is:

for (String dir : listOfDirs) {
    p.apply(TFRecordIO.read().from(dir))
     .apply(ParDo.of(new BatchElements()))
     .apply(ParDo.of(new Process()))
     .apply(Combine.globally(new CombineResult()))
     .apply(TextIO.write().to(dir));
}
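
For completeness, the surrounding setup looks roughly like this (a trimmed
sketch: BatchElements, Process, and CombineResult below are shape-only
stand-ins for my real transforms, and the directory list is a placeholder).
Everything is built on one Pipeline object and submitted with a single run():

import java.util.Arrays;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TFRecordIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;

public class PerDirectoryPipeline {

  // Shape-only stand-in for my real batching DoFn.
  static class BatchElements extends DoFn<byte[], String> {
    @ProcessElement
    public void processElement(@Element byte[] record, OutputReceiver<String> out) {
      out.output(Integer.toString(record.length));
    }
  }

  // Shape-only stand-in for my real processing DoFn.
  static class Process extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String element, OutputReceiver<String> out) {
      out.output(element);
    }
  }

  // Shape-only stand-in for my real combiner.
  static class CombineResult implements SerializableFunction<Iterable<String>, String> {
    @Override
    public String apply(Iterable<String> input) {
      return String.join(",", input);
    }
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Placeholder directories; the real list is longer.
    List<String> listOfDirs = Arrays.asList("gs://bucket/a/*", "gs://bucket/b/*");

    // Each iteration adds an independent branch to the same pipeline graph;
    // unique step names per directory avoid name collisions across iterations.
    int i = 0;
    for (String dir : listOfDirs) {
      p.apply("Read" + i, TFRecordIO.read().from(dir))
       .apply("Batch" + i, ParDo.of(new BatchElements()))
       .apply("Process" + i, ParDo.of(new Process()))
       .apply("Combine" + i, Combine.globally(new CombineResult()))
       .apply("Write" + i, TextIO.write().to(dir));
      i++;
    }

    // One run() covers all the independent branches.
    p.run().waitUntilFinish();
  }
}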

These directories are fairly independent, and I only need the result for each
directory. When running on Dataflow, the directories are processed
concurrently. But when running with Spark, I saw that the Spark jobs and
stages are sequential: all steps for one directory finish before the next
directory starts. What's the way to make multiple independent transforms run
concurrently with the SparkRunner?
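
In case the launch setup matters, I create the Spark run roughly like this (a
minimal sketch; the master URL is a placeholder for my real cluster):

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SparkLaunch {
  public static void main(String[] args) {
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    options.setSparkMaster("local[4]"); // placeholder; the real job points at the cluster

    Pipeline p = Pipeline.create(options);
    // ... same per-directory branches as above ...
    p.run().waitUntilFinish();
  }
}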
