I like the idea of having sinks return PCollection<Void>, which is very similar to how I changed the BigtableIO.Write interface. I can submit a pull request for BigtableIO which does that once I test it out a little more.
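For reference, a minimal sketch of what such a sink-with-an-output could look like. The names here are hypothetical and the actual client call is elided; this is not the real BigtableIO change, just the shape of the idea:

import org.apache.beam.sdk.coders.VoidCoder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical sink that returns PCollection<Void> instead of PDone, so that
// downstream transforms can be sequenced after the write.
public class WriteWithResult<T> extends PTransform<PCollection<T>, PCollection<Void>> {
  @Override
  public PCollection<Void> expand(PCollection<T> input) {
    return input
        .apply("WriteAndSignal", ParDo.of(new DoFn<T, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // writeToSink(c.element());  // the actual client call, e.g. a Bigtable mutation
            c.output(null);  // one Void per written element, in its window and pane
          }
        }))
        .setCoder(VoidCoder.of());
  }
}

A consumer could then combine the Voids per window to get a single completion signal per pane, along the lines of what Eugene suggests below.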
On Sat, Sep 9, 2017 at 1:46 PM, Eugene Kirpichov <[email protected]> wrote:

Hi Steve,
Unfortunately for BigQuery it's more complicated than that. Rows aren't written to BigQuery one by one (unless you're using streaming inserts, which are way more expensive and are usually used only in streaming pipelines); they are written to files, and then a BigQuery import job, or several import jobs if there are too many files, picks them up. We can declare writing complete when all of the BigQuery import jobs have successfully completed.

However, the method of writing is an implementation detail of BigQuery, so we need to create an API that works regardless of the method (import jobs vs. streaming inserts).

Another complication is triggering: windows can fire multiple times. This rules out any approaches that sequence using side inputs, because side inputs don't have triggering.

I think a common approach could be to return a PCollection<Void>, containing a Void in every window and pane that has been successfully written. This could be implemented in both modes and could be a general design pattern for this sort of thing. It just isn't easy to implement, so I didn't have time to take it on. It also could turn out to have other complications we haven't thought of yet.

That said, if somebody tried to implement this for some connectors (not necessarily BigQuery) and pioneered the approach, it would be a great contribution.

On Sat, Sep 9, 2017 at 9:41 AM Steve Niemitz <[email protected]> wrote:

I wonder if it makes sense to start simple and go from there. For example, I enhanced BigtableIO.Write to output the number of rows written in finishBundle(), simply into the global window with the current timestamp. This was more than enough to unblock me, but doesn't support more complicated scenarios with windowing.

However, as I said, it was more than enough to solve the general batch use case, and I imagine it could be enhanced to support windowing by keeping track of which windows were written per bundle. (Can there even ever be more than one window per bundle?)
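A rough sketch of the finishBundle() approach Steve describes (the write call itself is elided and the class name is hypothetical; this is not the actual BigtableIO code):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.joda.time.Instant;

// Tallies successful writes per bundle and emits the count from
// finishBundle() into the global window at the current timestamp.
class WriteAndCountFn<T> extends DoFn<T, Long> {
  private transient long rowsWritten;

  @StartBundle
  public void startBundle() {
    rowsWritten = 0;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // writeToSink(c.element());  // the actual client call, e.g. a Bigtable mutation
    rowsWritten++;
  }

  @FinishBundle
  public void finishBundle(FinishBundleContext c) {
    // Emitting into GlobalWindow.INSTANCE only makes sense for globally
    // windowed input, hence the windowing limitation noted above.
    c.output(rowsWritten, Instant.now(), GlobalWindow.INSTANCE);
  }
}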
On Fri, Sep 8, 2017 at 2:32 PM, Eugene Kirpichov <[email protected]> wrote:

Hi,
I was going to implement this, but discussed it with +Reuven Lax <[email protected]> and it appears to be quite difficult to do properly, or even to define what it means at all, especially if you're using the streaming inserts write method. So for now there is no workaround except programmatically waiting for your whole pipeline to finish (pipeline.run().waitUntilFinish()).

On Fri, Sep 8, 2017 at 2:19 AM Chaim Turkel <[email protected]> wrote:

Is there a way around this for now? How can I get a snapshot version?

chaim

On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov <[email protected]> wrote:

Oh I see! Okay, this should be easy to fix. I'll take a look.

On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <[email protected]> wrote:

WriteResult does not support apply -> that is the problem.

On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov <[email protected]> wrote:

Hi,
Sorry for the delay. So it sounds like you want to do something after writing a window of data to BigQuery is complete. I think this should be possible: the expansion of BigQueryIO.write() returns a WriteResult, and you can apply other transforms to it. Have you tried that?
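For illustration, the pattern Eugene describes would look roughly like this, assuming a Beam version where WriteResult exposes a PCollection to build on (getFailedInserts() is one such hook for streaming inserts; as Chaim notes above, WriteResult did not support apply at the time of this thread). The table name is a placeholder:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class AfterWriteExample {
  // rows: the PCollection<TableRow> being written to BigQuery.
  static void writeAndReact(PCollection<TableRow> rows) {
    WriteResult result = rows.apply("WriteToBQ",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")  // hypothetical table
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));

    // With streaming inserts, failed rows come back as a PCollection that
    // further transforms can consume; for load jobs there is, as of this
    // thread, no equivalent per-window completion signal.
    PCollection<TableRow> failed = result.getFailedInserts();
    failed.apply("HandleFailures", ParDo.of(new DoFn<TableRow, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // e.g. log or route the failed row somewhere for inspection
        System.err.println("Failed insert: " + c.element());
      }
    }));
  }
}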
On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <[email protected]> wrote:

I have documents from a mongo db that I need to migrate to bigquery. Since it is mongodb I do not know the schema ahead of time, so I have two pipelines: one runs over the documents and updates the bigquery schema, then waits a few minutes (it can take a while for bigquery to be able to use the new schema), and then the other pipeline copies all the documents. To know how far I got with the different pipelines, I have a status table, so that at the start I know from where to continue. So I need the option to update the status table with the success of the copy and some time value of the last copied document.

chaim

On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov <[email protected]> wrote:

I'd like to know more about both of your use cases; can you clarify? I think making sinks output something that can be waited on by another pipeline step is a reasonable request, but more details would help refine this suggestion.

On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <[email protected]> wrote:

Can you do this from the program that runs the Beam job, after the job is complete (you might have to use a blocking runner or poll for the status of the job)?

- Cham

On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <[email protected]> wrote:

I also have a similar use case (but with BigTable) that I feel like I had to hack up to make work. It'd be great to hear if there is a way to do something like this already, or if there are plans in the future.

On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <[email protected]> wrote:

Hi,
I have a few pipelines that ETL from different systems to bigquery. I would like to write the status of the ETL after all records have been written to bigquery. The problem is that writing to bigquery is a sink, and you cannot have any other steps after the sink. I tried a side output, but it is called with no correlation to the writing to bigquery, so I don't know if it succeeded or failed.

Any ideas?
chaim
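For completeness, the workaround suggested by Cham and Eugene above (wait for the job from the launching program, then update the status table there) could look like this; the status-table update itself is left as a placeholder:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunAndRecordStatus {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // ... build the ETL pipeline here ...

    // Block until the job finishes; on a non-blocking runner,
    // waitUntilFinish() polls the job status.
    PipelineResult.State state = pipeline.run().waitUntilFinish();

    if (state == PipelineResult.State.DONE) {
      // Placeholder: update the status table from the launching program,
      // e.g. record the timestamp of the last copied document.
      System.out.println("Job succeeded; updating status table.");
    } else {
      System.err.println("Job ended in state " + state + "; status not updated.");
    }
  }
}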
