I like the idea of having sinks return PCollection<Void>, which is very similar to how I changed the BigtableIO.Write interface. I can submit a pull request for BigtableIO which does that once I test it out a little more.
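For reference, a minimal sketch of what such a sink-with-an-output could look like. The names here are hypothetical and the actual client call is elided; this is not the real BigtableIO change, just the shape of the idea:

import org.apache.beam.sdk.coders.VoidCoder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical sink that returns PCollection<Void> instead of PDone, so that
// downstream transforms can be sequenced after the write.
public class WriteWithResult<T> extends PTransform<PCollection<T>, PCollection<Void>> {
  @Override
  public PCollection<Void> expand(PCollection<T> input) {
    return input
        .apply("WriteAndSignal", ParDo.of(new DoFn<T, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // writeToSink(c.element());  // the actual client call, e.g. a Bigtable mutation
            c.output(null);  // one Void per written element, in its window and pane
          }
        }))
        .setCoder(VoidCoder.of());
  }
}

A consumer could then combine the Voids per window to get a single completion signal per pane, along the lines of what Eugene suggests below.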
On Sat, Sep 9, 2017 at 1:46 PM, Eugene Kirpichov <[email protected]> wrote:

Hi Steve,
Unfortunately for BigQuery it's more complicated than that. Rows aren't written to BigQuery one by one (unless you're using streaming inserts, which are way more expensive and are usually used only in streaming pipelines); they are written to files, and then a BigQuery import job, or several import jobs if there are too many files, picks them up. We can declare writing complete when all of the BigQuery import jobs have successfully completed.

However, the method of writing is an implementation detail of BigQuery, so we need to create an API that works regardless of the method (import jobs vs. streaming inserts).

Another complication is triggering: windows can fire multiple times. This rules out any approaches that sequence using side inputs, because side inputs don't have triggering.

I think a common approach could be to return a PCollection<Void>, containing a Void in every window and pane that has been successfully written. This could be implemented in both modes and could be a general design pattern for this sort of thing. It just isn't easy to implement, so I didn't have time to take it on. It also could turn out to have other complications we haven't thought of yet.

That said, if somebody tried to implement this for some connectors (not necessarily BigQuery) and pioneered the approach, it would be a great contribution.

On Sat, Sep 9, 2017 at 9:41 AM Steve Niemitz <[email protected]> wrote:

I wonder if it makes sense to start simple and go from there. For example, I enhanced BigtableIO.Write to output the number of rows written in finishBundle(), simply into the global window with the current timestamp. This was more than enough to unblock me, but doesn't support more complicated scenarios with windowing.

However, as I said, it was more than enough to solve the general batch use case, and I imagine it could be enhanced to support windowing by keeping track of which windows were written per bundle. (Can there even ever be more than one window per bundle?)
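A rough sketch of the finishBundle() approach Steve describes (the write call itself is elided and the class name is hypothetical; this is not the actual BigtableIO code):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.joda.time.Instant;

// Tallies successful writes per bundle and emits the count from
// finishBundle() into the global window at the current timestamp.
class WriteAndCountFn<T> extends DoFn<T, Long> {
  private transient long rowsWritten;

  @StartBundle
  public void startBundle() {
    rowsWritten = 0;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // writeToSink(c.element());  // the actual client call, e.g. a Bigtable mutation
    rowsWritten++;
  }

  @FinishBundle
  public void finishBundle(FinishBundleContext c) {
    // Emitting into GlobalWindow.INSTANCE only makes sense for globally
    // windowed input, hence the windowing limitation noted above.
    c.output(rowsWritten, Instant.now(), GlobalWindow.INSTANCE);
  }
}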
On Fri, Sep 8, 2017 at 2:32 PM, Eugene Kirpichov <[email protected]> wrote:

Hi,
I was going to implement this, but discussed it with +Reuven Lax <[email protected]> and it appears to be quite difficult to do properly, or even to define what it means at all, especially if you're using the streaming inserts write method. So for now there is no workaround except programmatically waiting for your whole pipeline to finish (pipeline.run().waitUntilFinish()).

On Fri, Sep 8, 2017 at 2:19 AM Chaim Turkel <[email protected]> wrote:

Is there a way around this for now? How can I get a snapshot version?

chaim

On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov <[email protected]> wrote:

Oh I see! Okay, this should be easy to fix. I'll take a look.

On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <[email protected]> wrote:

WriteResult does not support apply -> that is the problem.

On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov <[email protected]> wrote:

Hi,
Sorry for the delay. So it sounds like you want to do something after writing a window of data to BigQuery is complete. I think this should be possible: the expansion of BigQueryIO.write() returns a WriteResult, and you can apply other transforms to it. Have you tried that?
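For illustration, the pattern Eugene describes would look roughly like this, assuming a Beam version where WriteResult exposes a PCollection to build on (getFailedInserts() is one such hook for streaming inserts; as Chaim notes above, WriteResult did not support apply at the time of this thread). The table name is a placeholder:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class AfterWriteExample {
  // rows: the PCollection<TableRow> being written to BigQuery.
  static void writeAndReact(PCollection<TableRow> rows) {
    WriteResult result = rows.apply("WriteToBQ",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")  // hypothetical table
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));

    // With streaming inserts, failed rows come back as a PCollection that
    // further transforms can consume; for load jobs there is, as of this
    // thread, no equivalent per-window completion signal.
    PCollection<TableRow> failed = result.getFailedInserts();
    failed.apply("HandleFailures", ParDo.of(new DoFn<TableRow, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // e.g. log or route the failed row somewhere for inspection
        System.err.println("Failed insert: " + c.element());
      }
    }));
  }
}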
On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <[email protected]> wrote:

I have documents from a mongo db that I need to migrate to bigquery. Since it is mongodb I do not know the schema ahead of time, so I have two pipelines: one runs over the documents and updates the bigquery schema, then waits a few minutes (it can take a while for bigquery to be able to use the new schema), and then the other pipeline copies all the documents. To know how far I got with the different pipelines, I have a status table, so that at the start I know from where to continue. So I need the option to update the status table with the success of the copy and some time value of the last copied document.

chaim

On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov <[email protected]> wrote:

I'd like to know more about both of your use cases; can you clarify? I think making sinks output something that can be waited on by another pipeline step is a reasonable request, but more details would help refine this suggestion.

On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <[email protected]> wrote:

Can you do this from the program that runs the Beam job, after the job is complete (you might have to use a blocking runner or poll for the status of the job)?

- Cham

On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <[email protected]> wrote:

I also have a similar use case (but with BigTable) that I feel like I had to hack up to make work. It'd be great to hear if there is a way to do something like this already, or if there are plans in the future.

On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <[email protected]> wrote:

Hi,
I have a few pipelines that ETL from different systems to bigquery. I would like to write the status of the ETL after all records have been written to bigquery. The problem is that writing to bigquery is a sink, and you cannot have any other steps after the sink. I tried a side output, but it is called with no correlation to the writing to bigquery, so I don't know if it succeeded or failed.

Any ideas?
chaim
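For completeness, the workaround suggested by Cham and Eugene above (wait for the job from the launching program, then update the status table there) could look like this; the status-table update itself is left as a placeholder:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunAndRecordStatus {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // ... build the ETL pipeline here ...

    // Block until the job finishes; on a non-blocking runner,
    // waitUntilFinish() polls the job status.
    PipelineResult.State state = pipeline.run().waitUntilFinish();

    if (state == PipelineResult.State.DONE) {
      // Placeholder: update the status table from the launching program,
      // e.g. record the timestamp of the last copied document.
      System.out.println("Job succeeded; updating status table.");
    } else {
      System.err.println("Job ended in state " + state + "; status not updated.");
    }
  }
}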
