Having the sink output something I think is a good option.

My use case looks something like this:
1) Read a bunch of avro files (all with the same schema, but the schema is
not known before hand)
2) Use the avro schema + data to generate bigtable mutations
3) write the mutations
4) *once all files are processed and written *-> update a "checkpoint"
marker in another bigtable table, which also depends on the schema from (1).

I've hacked this up by making the flow in step 4 rely on an output from
step 2 that goes through a GroupBy to ensure that all records are at least
processed by step 2 before step 4 runs, but there's still a race condition
between the last record being emitted by step 2 and the write in step 3
completing.

If as you said, the sink emitted a record when it completed, that'd solve
the race condition.

In summary: right now the flow looks like this (terrible ASCII attempt):

Read Avro Files (extract schema + data) (1)
 |
V
Generate mutations (2)
 |----------------------------> [GroupBy -> Take first -> Generate mutation
-> Bigtable write] (4)
V
Write mutations (3)


On Fri, Aug 25, 2017 at 11:53 AM, Eugene Kirpichov <
[email protected]> wrote:

> I'd like to know more about your both use cases, can you clarify? I think
> making sinks output something that can be waited on by another pipeline
> step is a reasonable request, but more details would help refine this
> suggestion.
>
> On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <[email protected]>
> wrote:
>
> > Can you do this from the program that runs the Beam job, after job is
> > complete (you might have to use a blocking runner or poll for the status
> of
> > the job) ?
> >
> > - Cham
> >
> > On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <[email protected]>
> wrote:
> >
> > > I also have a similar use case (but with BigTable) that I feel like I
> had
> > > to hack up to make work.  It'd be great to hear if there is a way to do
> > > something like this already, or if there are plans in the future.
> > >
> > > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <[email protected]>
> wrote:
> > >
> > > > Hi,
> > > >   I have a few piplines that are an ETL from different systems to
> > > bigquery.
> > > > I would like to write the status of the ETL after all records have
> > > > been updated to the bigquery.
> > > > The problem is that writing to bigquery is a sink and you cannot have
> > > > any other steps after the sink.
> > > > I tried a sideoutput, but this is called in no correlation to the
> > > > writing to bigquery, so i don't know if it succeeded or failed.
> > > >
> > > >
> > > > any ideas?
> > > > chaim
> > > >
> > >
> >
>

Reply via email to