I'd like to elaborate on why getFailedInserts is easier to implement than
returning a sequenceable indicator of success.

getFailedInserts returns individual rows. In a streaming pipeline, it will
return them one by one, as each individual row fails. In a batch pipeline -
or in a streaming pipeline with BigQueryIO.write() configured to use the
FILE_LOADS write method - it will be empty, as Reuven said.
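
To make this concrete, here is roughly what consuming it looks like (the
table name and the "rows" PCollection below are placeholders):

  import com.google.api.services.bigquery.model.TableRow;
  import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
  import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
  import org.apache.beam.sdk.values.PCollection;

  // Streaming inserts, so per-row failures are reported back.
  WriteResult result =
      rows.apply(
          BigQueryIO.writeTableRows()
              .to("my-project:my_dataset.my_table")  // placeholder table
              .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));

  // Each failed row arrives individually, as it fails; with FILE_LOADS
  // this collection is simply empty.
  PCollection<TableRow> failed = result.getFailedInserts();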

This is reasonable because we assume that the number of failed rows will be
small, and because the exact contents of the failed rows are valuable: we
want to preserve them for further investigation to avoid data loss.

Returning individual "succeeded" rows would be inefficient and rather odd:
in most cases it would simply duplicate the input PCollection (which can
also be very large, so returning it would make the pipeline much slower),
and in all cases the caller would ignore the contents of the rows.

I still believe the right approach is a PCollection<Void> with windows and
panes, as suggested above. It's not easy to implement, especially with
streaming inserts, but I think it gives the right
batch/streaming-independent semantics and should be doable. I'll keep
thinking.
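
To illustrate the shape I have in mind (entirely hypothetical -
getSuccessfulWrites is a made-up name, nothing like it exists today):

  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
  import org.apache.beam.sdk.values.PCollection;

  // Hypothetical: a windowed PCollection<Void> whose panes fire as the
  // write for each window completes.
  PCollection<Void> done = writeResult.getSuccessfulWrites();
  done.apply(ParDo.of(new DoFn<Void, Void>() {
    @ProcessElement
    public void processElement(ProcessContext c, BoundedWindow window) {
      // Runs once the write for this window/pane has completed, e.g. to
      // update an external status table.
    }
  }));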

On Sun, Sep 10, 2017 at 8:59 AM Reuven Lax <[email protected]> wrote:

> No, windowing is supported on BigQuery.
>
> The BigQueryIO transform is divided into two parts. The first part takes
> your input elements and decides which tables they should be written to.
> This can be based on the windows that the elements are in (as well as the
> data in the elements themselves if you want).
>
> The second part of the transform actually writes the data to BigQuery.
> Since we've already used the windowing information to decide which tables
> to write the elements to, this second part does not need windowing
> information any more. It is written in terms of Beam primitives (e.g.
> GroupByKey) that we don't want affected by windowing; at this point we want
> to write the data to the target BigQuery tables as fast as possible. So
> this second half of the transform rewindows all the data in GlobalWindows
> before writing to BigQuery.
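
[A rough sketch of that rewindowing step, for concreteness - illustrative
only, not the exact BigQueryIO internals; taggedRows, the KV types, and
the trigger shown are all just plausible placeholders:

  import com.google.api.services.bigquery.model.TableRow;
  import org.apache.beam.sdk.transforms.windowing.AfterPane;
  import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
  import org.apache.beam.sdk.transforms.windowing.Repeatedly;
  import org.apache.beam.sdk.transforms.windowing.Window;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.joda.time.Duration;

  // Each element is already tagged with its destination table, so the
  // original windowing has served its purpose; collapse everything into
  // the global window before the GroupByKey-based write.
  PCollection<KV<String, TableRow>> rewindowed =
      taggedRows.apply(
          Window.<KV<String, TableRow>>into(new GlobalWindows())
              .triggering(
                  Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
              .withAllowedLateness(Duration.ZERO)
              .discardingFiredPanes());
]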
>
> As for getFailedInserts, it is only supported for the streaming path. The
> reason is that it does not make sense for the batch path. The batch path
> works by writing all elements to files in GCS, and then telling BigQuery
> to import all those files. If there's a failure, we don't get any
> information about which row failed; all we know is that the entire import
> job failed. In streaming, on the other hand, we insert the rows one by
> one, and when there's a failure we get detailed information on which row
> failed and can return it in getFailedInserts.
>
> Reuven
>
> On Sat, Sep 9, 2017 at 10:25 PM, Chaim Turkel <[email protected]> wrote:
>
> > So what you are saying is that windowing is not supported on
> > BigQuery? That does not make sense, since I am using it for the table
> > partition, and that works fine.
> >
> > On Sat, Sep 9, 2017 at 7:40 PM, Steve Niemitz <[email protected]> wrote:
> > > I wonder if it makes sense to start simple and go from there.  For
> > > example, I enhanced BigtableIO.Write to output the number of rows
> > > written in finishBundle(), simply into the global window with the
> > > current timestamp.  This was more than enough to unblock me, but
> > > doesn't support more complicated scenarios with windowing.
> > >
> > > However, as I said, it was more than enough to solve the general
> > > batch use case, and I imagine it could be enhanced to support
> > > windowing by keeping track of which windows were written per
> > > bundle. (Can there ever be more than one window per bundle?)
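
[A sketch of the finishBundle() counting approach Steve describes - the
names are placeholders, this is not the actual BigtableIO change:

  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
  import org.joda.time.Instant;

  // Counts elements per bundle and emits one total per bundle from
  // finishBundle(), into the global window at the current timestamp.
  class CountingWriteFn extends DoFn<MyRow, Long> {  // MyRow: placeholder
    private long count;

    @StartBundle
    public void startBundle() {
      count = 0;
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      // ... perform the actual write of c.element() here ...
      count++;
    }

    @FinishBundle
    public void finishBundle(FinishBundleContext c) {
      c.output(count, Instant.now(), GlobalWindow.INSTANCE);
    }
  }
]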
> > >
> > > On Fri, Sep 8, 2017 at 2:32 PM, Eugene Kirpichov <[email protected]> wrote:
> > >
> > >> Hi,
> > >> I was going to implement this, but discussed it with +Reuven Lax
> > >> <[email protected]> and it appears to be quite difficult to do
> > >> properly, or even to define what it means at all, especially if
> > >> you're using the streaming inserts write method. So for now there
> > >> is no workaround except programmatically waiting for your whole
> > >> pipeline to finish (pipeline.run().waitUntilFinish()).
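
[Spelled out, that workaround is just the standard Beam API:

  import org.apache.beam.sdk.PipelineResult;

  PipelineResult result = pipeline.run();
  result.waitUntilFinish();  // blocks until the whole pipeline completes
  // Only after this returns is it safe to, e.g., update a status table.
]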
> > >>
> > >> On Fri, Sep 8, 2017 at 2:19 AM Chaim Turkel <[email protected]> wrote:
> > >>
> > >> > is there a way around this for now?
> > >> > how can i get a snapshot version?
> > >> >
> > >> > chaim
> > >> >
> > >> > On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
> > >> > <[email protected]> wrote:
> > >> > > Oh I see! Okay, this should be easy to fix. I'll take a look.
> > >> > >
> > >> > > On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <[email protected]> wrote:
> > >> > >
> > >> > >> WriteResult does not support apply -> that is the problem
> > >> > >>
> > >> > >> On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
> > >> > >> <[email protected]> wrote:
> > >> > >> > Hi,
> > >> > >> >
> > >> > >> > Sorry for the delay. So it sounds like you want to do
> > >> > >> > something after writing a window of data to BigQuery is
> > >> > >> > complete. I think this should be possible: the expansion of
> > >> > >> > BigQueryIO.write() returns a WriteResult and you can apply
> > >> > >> > other transforms to it. Have you tried that?
> > >> > >> >
> > >> > >> > On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <[email protected]> wrote:
> > >> > >> >
> > >> > >> >> I have documents from a MongoDB database that I need to
> > >> > >> >> migrate to BigQuery. Since it is MongoDB, I do not know the
> > >> > >> >> schema ahead of time, so I have two pipelines: one runs
> > >> > >> >> over the documents and updates the BigQuery schema, then
> > >> > >> >> waits a few minutes (it can take a while for BigQuery to be
> > >> > >> >> able to use the new schema), and then the other pipeline
> > >> > >> >> copies all the documents.
> > >> > >> >> To know how far I got with the different pipelines, I have
> > >> > >> >> a status table, so that at the start I know from where to
> > >> > >> >> continue. So I need the option to update the status table
> > >> > >> >> with the success of the copy and some time value of the
> > >> > >> >> last copied document.
> > >> > >> >>
> > >> > >> >>
> > >> > >> >> chaim
> > >> > >> >>
> > >> > >> >> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
> > >> > >> >> <[email protected]> wrote:
> > >> > >> >> > I'd like to know more about both of your use cases, can
> > >> > >> >> > you clarify? I think making sinks output something that
> > >> > >> >> > can be waited on by another pipeline step is a reasonable
> > >> > >> >> > request, but more details would help refine this
> > >> > >> >> > suggestion.
> > >> > >> >> >
> > >> > >> >> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <[email protected]> wrote:
> > >> > >> >> >
> > >> > >> >> >> Can you do this from the program that runs the Beam
> > >> > >> >> >> job, after the job is complete (you might have to use a
> > >> > >> >> >> blocking runner or poll for the status of the job)?
> > >> > >> >> >>
> > >> > >> >> >> - Cham
> > >> > >> >> >>
> > >> > >> >> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <[email protected]> wrote:
> > >> > >> >> >>
> > >> > >> >> >> > I also have a similar use case (but with Bigtable)
> > >> > >> >> >> > that I feel like I had to hack up to make work.  It'd
> > >> > >> >> >> > be great to hear if there is a way to do something
> > >> > >> >> >> > like this already, or if there are plans in the
> > >> > >> >> >> > future.
> > >> > >> >> >> >
> > >> > >> >> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <[email protected]> wrote:
> > >> > >> >> >> >
> > >> > >> >> >> > > Hi,
> > >> > >> >> >> > >   I have a few pipelines that are an ETL from
> > >> > >> >> >> > > different systems to BigQuery.
> > >> > >> >> >> > > I would like to write the status of the ETL after
> > >> > >> >> >> > > all records have been written to BigQuery.
> > >> > >> >> >> > > The problem is that writing to BigQuery is a sink,
> > >> > >> >> >> > > and you cannot have any other steps after the sink.
> > >> > >> >> >> > > I tried a side output, but it is called with no
> > >> > >> >> >> > > correlation to the writing to BigQuery, so I don't
> > >> > >> >> >> > > know if it succeeded or failed.
> > >> > >> >> >> > >
> > >> > >> >> >> > >
> > >> > >> >> >> > > any ideas?
> > >> > >> >> >> > > chaim
> > >> > >> >> >> > >
> > >> > >> >> >> >
> > >> > >> >> >>
> > >> > >> >>
> > >> > >>
> > >> >
> > >>
> >
>
