Hello everyone,

I am working on the Dataflow Template for (imports to) Neo4j and we are
currently in the middle of revisiting the whole Beam pipeline logic.

 The main concepts are:

  - data sources: typically data from a SQL query or a text file
  - import targets: a target is linked to a single source and defines
mappings from the source data to Neo4j
  - actions (one-off / fire-and-forget actions): typically HTTP requests or
Cypher (~= Neo4j's SQL) queries

The first two easily map to Beam.

Data sources are PTransform<PBegin, PCollection<Row>>.

Import targets would ideally be PTransform<PCollection<Row>, PDone> since
they're sinks, processing the data flowing from sources in batches.
(They're not defined like that at the moment, but that's a topic for
another question I am going to post soon.)
Targets can also depend on other targets, and we rely on Wait.on for this.

I am just not sure how to model actions.
Contrary to targets, they must run only exactly once during the whole
pipeline execution.
Like targets, they can depend on a set of targets (actions can also depend
on a set of sources).

Does it make sense to define actions as a PTransform? If so, what would be
the input and output types be? Is there a way to make sure the action runs
at most globally once?

Thanks a lot for your help and guidance,
Florent BIVILLE

Reply via email to