What is your concern about this job? It doesn't seem that slow, and it isn't really bottlenecked by writing to BigQuery (less than 50% of the wall-clock time is spent in the step that writes to BigQuery).
On Thu, Sep 28, 2017 at 12:38 AM Chaim Turkel <[email protected]> wrote:

> you can see my job at:
> https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2017-09-26_03_17_44-4821512213867199289?project=ordinal-ember-163410
>
> On Wed, Sep 27, 2017 at 10:47 PM, Reuven Lax <[email protected]> wrote:
> > There are a couple of options, and if you provide a job id (since you
> > are using the Dataflow runner) we can better advise.
> >
> > If you are writing to more than 2000 partitions, this won't work -
> > BigQuery has a hard quota of 1000 partition updates per table per day.
> >
> > If you have fewer than 1000 jobs, there are a few possibilities. It's
> > possible that BigQuery is taking a while to schedule some of those
> > jobs; they'll end up sitting in a queue waiting to be scheduled. We can
> > look at one of the jobs in detail to see if that's happening. Eugene's
> > suggestion of using your pipeline to load into a single table might be
> > the best one. You can write the date into a separate column, and then
> > write a shell script to copy each date to its own partition (see
> > https://cloud.google.com/bigquery/docs/creating-partitioned-tables#update-with-query
> > for some examples).
> >
> > On Wed, Sep 27, 2017 at 11:39 AM, Eugene Kirpichov <[email protected]> wrote:
> > > I see. Then Reuven's answer above applies.
> > > Maybe you could write to a non-partitioned table, and then split it
> > > into smaller partitioned tables. See
> > > https://stackoverflow.com/a/39001706/278042 for a discussion of the
> > > current options - granted, it seems like there currently don't exist
> > > very good options for creating a very large number of table
> > > partitions from existing data.
> > >
> > > On Wed, Sep 27, 2017 at 4:01 AM Chaim Turkel <[email protected]> wrote:
> > > > Thank you for your detailed response.
> > > > Currently I am a bit stuck.
> > > > I need to migrate data from Mongo to BigQuery; we have about 1 TB
> > > > of data. It is historical data, so I want to use BigQuery
> > > > partitions. It seems that the IO connector creates a job per
> > > > partition, so it takes a very long time and I hit BigQuery's quota
> > > > on the number of jobs per day.
> > > > I would like to use streaming, but you cannot stream data older
> > > > than 30 days.
> > > >
> > > > So I thought of partitions to see if I can get more parallelism.
> > > >
> > > > chaim
> > > >
> > > > On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov <[email protected]> wrote:
> > > > > Okay, I see - there are about 3 different meanings of the word
> > > > > "partition" that could have been involved here (BigQuery
> > > > > partitions, runner-specific bundles, and the Partition
> > > > > transform), hence my request for clarification.
> > > > >
> > > > > If you mean the Partition transform - then I'm confused by what
> > > > > you mean by BigQueryIO "supporting" it. The Partition transform
> > > > > takes a PCollection and produces a bunch of PCollections; these
> > > > > are ordinary PCollections and you can apply any Beam transforms
> > > > > to them, and BigQueryIO.write() is no exception - you can apply
> > > > > it too.
> > > > >
> > > > > To answer whether using Partition would improve your
> > > > > performance, I'd need to understand exactly what you're
> > > > > comparing against what. I suppose you're comparing the
> > > > > following:
> > > > > 1) Applying BigQueryIO.write() to a PCollection, writing to a
> > > > > single table
> > > > > 2) Splitting a PCollection into several smaller PCollections
> > > > > using Partition, and applying BigQueryIO.write() to each of
> > > > > them, writing to different tables, I suppose? (or do you want to
> > > > > write to different BigQuery partitions of the same table using a
> > > > > table partition decorator?)
> > > > > I would expect #2 to perform strictly worse than #1, because it
> > > > > writes the same amount of data but increases the number of
> > > > > BigQuery load jobs involved (thus increases per-job overhead and
> > > > > consumes BigQuery quota).
> > > > >
> > > > > On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <[email protected]> wrote:
> > > > > > https://beam.apache.org/documentation/programming-guide/#partition
> > > > > >
> > > > > > On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov <[email protected]> wrote:
> > > > > > > What do you mean by Beam partitions?
> > > > > > >
> > > > > > > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <[email protected]> wrote:
> > > > > > > > By the way, currently the performance on BigQuery
> > > > > > > > partitions is very bad.
> > > > > > > > Is there a repository where I can test with 2.2.0?
> > > > > > > >
> > > > > > > > chaim
> > > > > > > >
> > > > > > > > On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <[email protected]> wrote:
> > > > > > > > > Do you mean BigQuery partitions? Yes, however 2.1.0 has
> > > > > > > > > a bug if the table containing the partitions is not
> > > > > > > > > pre-created (fixed in 2.2.0).
> > > > > > > > >
> > > > > > > > > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <[email protected]> wrote:
> > > > > > > > > > Hi,
> > > > > > > > > > Does BigQueryIO support partitions when writing? Will
> > > > > > > > > > it improve my performance?
> > > > > > > > > >
> > > > > > > > > > chaim
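For reference, the staging-table approach Reuven describes in the quoted thread (load everything into a single table with a date column, then use a shell script to copy each date into its own partition via the `table$YYYYMMDD` decorator) could be sketched roughly as below. This is a hedged sketch, not code from the thread: the dataset, table, and column names (`mydataset`, `staging_table`, `events`, `event_date`) are placeholders, and the exact `bq` flags should be checked against current BigQuery documentation.

```shell
#!/usr/bin/env bash
# Sketch: rewrite each day from a staging table (which has an extra
# event_date column) into the matching partition of a pre-created
# date-partitioned target table, using BigQuery's "$YYYYMMDD" decorator.
# All dataset/table/column names are placeholders.

DATASET="mydataset"        # placeholder dataset
STAGING="staging_table"    # staging table holding all rows + event_date
TARGET="events"            # pre-created date-partitioned target table

# Build the bq invocation that rewrites one day into its partition.
partition_copy_cmd() {
  local day="$1"               # e.g. 2017-09-26
  local suffix="${day//-/}"    # partition decorator suffix: 20170926
  echo "bq query --use_legacy_sql=false --replace" \
       "--destination_table='${DATASET}.${TARGET}\$${suffix}'" \
       "\"SELECT * EXCEPT(event_date) FROM ${DATASET}.${STAGING} WHERE event_date = '${day}'\""
}

# Dry run: print one bq command per day (pipe to bash to actually run them).
for day in 2017-09-25 2017-09-26; do
  partition_copy_cmd "$day"
done
```

Note that each `bq query` with a destination table is itself a job, so this only helps if the number of distinct dates stays under the per-table daily quota Reuven mentions.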
