Thanks. A couple of questions:
- For step 1 (read Avro files with unknown schema), I presume you're using
AvroIO.parseGenericRecords()?
- For step 4: why do you need the checkpoint? Is it because you actually
want to continuously ingest new Avro files as they keep appearing? In that
case you might want to take advantage of
https://github.com/apache/beam/pull/3725 when it's in.
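E.g., something along these lines (a rough, untested sketch; MyRecord and
MyRecord.fromAvro() are placeholders for whatever type you parse into):

  PCollection<MyRecord> records =
      p.apply(
          AvroIO.parseGenericRecords(
                  new SerializableFunction<GenericRecord, MyRecord>() {
                    @Override
                    public MyRecord apply(GenericRecord record) {
                      // record.getSchema() is available here, so the schema
                      // doesn't need to be known up front
                      return MyRecord.fromAvro(record);
                    }
                  })
              // .withCoder(...) if the output coder can't be inferred
              .from("gs://my-bucket/input/*.avro"));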

On Fri, Aug 25, 2017 at 10:39 AM Steve Niemitz <[email protected]> wrote:

> I think having the sink output something is a good option.
>
> My use case looks something like this:
> 1) Read a bunch of Avro files (all with the same schema, but the schema is
> not known beforehand)
> 2) Use the Avro schema + data to generate Bigtable mutations (rough sketch
> after this list)
> 3) Write the mutations
> 4) *once all files are processed and written* -> update a "checkpoint"
> marker in another Bigtable table, which also depends on the schema from
> (1).
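>
> Step (2) looks roughly like the following, if it helps (simplified,
> untested sketch; the "cf" column family and rowKeyOf() are made up, and
> every value is written as a string):
>
>   import com.google.bigtable.v2.Mutation;
>   import com.google.protobuf.ByteString;
>   import java.util.ArrayList;
>   import java.util.List;
>   import org.apache.avro.Schema;
>   import org.apache.avro.generic.GenericRecord;
>   import org.apache.beam.sdk.transforms.DoFn;
>   import org.apache.beam.sdk.values.KV;
>
>   class ToMutations extends DoFn<GenericRecord, KV<ByteString, Iterable<Mutation>>> {
>     @ProcessElement
>     public void process(ProcessContext c) {
>       GenericRecord record = c.element();
>       List<Mutation> mutations = new ArrayList<>();
>       // one SetCell per field of the schema discovered in (1)
>       for (Schema.Field field : record.getSchema().getFields()) {
>         mutations.add(Mutation.newBuilder()
>             .setSetCell(Mutation.SetCell.newBuilder()
>                 .setFamilyName("cf")
>                 .setColumnQualifier(ByteString.copyFromUtf8(field.name()))
>                 .setValue(ByteString.copyFromUtf8(
>                     String.valueOf(record.get(field.name())))))
>             .build());
>       }
>       c.output(KV.of(rowKeyOf(record), mutations));  // rowKeyOf() is hypothetical
>     }
>   }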
>
> I've hacked this up by making the flow in step 4 rely on an output from
> step 2 that goes through a GroupBy to ensure that all records are at least
> processed by step 2 before step 4 runs, but there's still a race condition
> between the last record being emitted by step 2 and the write in step 3
> completing.
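>
> Concretely, the gate in (4) is roughly this (again an untested sketch;
> buildCheckpointMutations() and the "checkpoints" table are made up):
>
>   PCollection<KV<ByteString, Iterable<Mutation>>> mutations = ...;  // output of (2)
>
>   mutations
>       .apply(WithKeys.of("gate"))        // everything onto one key
>       .apply(GroupByKey.create())        // only fires once (2) has emitted everything
>       .apply(ParDo.of(
>           new DoFn<KV<String, Iterable<KV<ByteString, Iterable<Mutation>>>>,
>                    KV<ByteString, Iterable<Mutation>>>() {
>             @ProcessElement
>             public void process(ProcessContext c) {
>               // emit a single checkpoint row; (3)'s Bigtable writes can still
>               // be in flight here, which is the race I mentioned
>               c.output(KV.of(ByteString.copyFromUtf8("checkpoint"),
>                   buildCheckpointMutations()));
>             }
>           }))
>       .apply("WriteCheckpoint",
>           BigtableIO.write().withTableId("checkpoints") /* + instance config */);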
>
> If, as you said, the sink emitted a record when it completed, that would
> solve the race condition.
>
> In summary: right now the flow looks like this (terrible ASCII attempt):
>
> Read Avro files (extract schema + data) (1)
>   |
>   V
> Generate mutations (2)
>   |-------> [GroupBy -> take first -> generate checkpoint mutation ->
>   |          Bigtable write] (4)
>   V
> Write mutations (3)
>
>
> On Fri, Aug 25, 2017 at 11:53 AM, Eugene Kirpichov <[email protected]> wrote:
>
> > I'd like to know more about both of your use cases; can you clarify? I
> > think making sinks output something that can be waited on by another
> > pipeline step is a reasonable request, but more details would help
> > refine this suggestion.
> >
> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <[email protected]>
> > wrote:
> >
> > > Can you do this from the program that runs the Beam job, after the job
> > > is complete (you might have to use a blocking runner or poll for the
> > > status of the job)?
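> > >
> > > E.g. (rough sketch; writeEtlStatus() stands in for whatever records the
> > > status, outside the pipeline):
> > >
> > >   PipelineResult result = pipeline.run();
> > >   result.waitUntilFinish();  // blocks until the job finishes; or poll getState()
> > >   if (result.getState() == PipelineResult.State.DONE) {
> > >     writeEtlStatus();  // hypothetical post-job status write
> > >   }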
> > >
> > > - Cham
> > >
> > > On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <[email protected]> wrote:
> > >
> > > > I also have a similar use case (but with Bigtable) that I feel like I
> > > > had to hack up to make work.  It'd be great to hear if there is a way
> > > > to do something like this already, or if there are plans for the
> > > > future.
> > > >
> > > > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >   I have a few pipelines that are an ETL from different systems to
> > > > > BigQuery.
> > > > > I would like to write the status of the ETL after all records have
> > > > > been written to BigQuery.
> > > > > The problem is that writing to BigQuery is a sink, and you cannot
> > > > > have any other steps after the sink.
> > > > > I tried a side output, but it is invoked with no correlation to the
> > > > > BigQuery write, so I don't know whether it succeeded or failed.
> > > > >
> > > > >
> > > > > any ideas?
> > > > > chaim
> > > > >
> > > >
> > >
> >
>
