Hello everyone, I am working on the Dataflow Template for (imports to) Neo4j and we are currently in the middle of revisiting the whole Beam pipeline logic.
The main concepts are: - data sources: typically data from a SQL query or a text file - import targets: a target is linked to a single source and defines mappings from the source data to Neo4j - actions (one-off / fire-and-forget actions): typically HTTP requests or Cypher (~= Neo4j's SQL) queries The first two easily map to Beam. Data sources are PTransform<PBegin, PCollection<Row>>. Import targets would ideally be PTransform<PCollection<Row>, PDone> since they're sinks, processing the data flowing from sources in batches. (They're not defined like that at the moment, but that's a topic for another question I am going to post soon.) Targets can also depend on other targets, and we rely on Wait.on for this. I am just not sure how to model actions. Contrary to targets, they must run only exactly once during the whole pipeline execution. Like targets, they can depend on a set of targets (actions can also depend on a set of sources). Does it make sense to define actions as a PTransform? If so, what would be the input and output types be? Is there a way to make sure the action runs at most globally once? Thanks a lot for your help and guidance, Florent BIVILLE