You can see my job at:
https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2017-09-26_03_17_44-4821512213867199289?project=ordinal-ember-163410
On Wed, Sep 27, 2017 at 10:47 PM, Reuven Lax <[email protected]> wrote:
> There are a couple of options, and if you provide a job id (since you are
> using the Dataflow runner) we can better advise.
>
> If you are writing to more than 2000 partitions, this won't work - BigQuery
> has a hard quota of 1000 partition updates per table per day.
>
> If you have fewer than 1000 jobs, there are a few possibilities. It's
> possible that BigQuery is taking a while to schedule some of those jobs;
> they'll end up sitting in a queue waiting to be scheduled. We can look at
> one of the jobs in detail to see if that's happening. Eugene's suggestion
> of using your pipeline to load into a single table might be the best one.
> You can write the date into a separate column, and then write a shell
> script to copy each date to its own partition (see
> https://cloud.google.com/bigquery/docs/creating-partitioned-tables#update-with-query
> for some examples).
>
> On Wed, Sep 27, 2017 at 11:39 AM, Eugene Kirpichov
> <[email protected]> wrote:
>
>> I see. Then Reuven's answer above applies.
>> Maybe you could write to a non-partitioned table, and then split it into
>> smaller partitioned tables. See https://stackoverflow.com/a/39001706/278042
>> for a discussion of the current options - granted, it seems like there
>> currently don't exist very good options for creating a very large number
>> of table partitions from existing data.
>>
>> On Wed, Sep 27, 2017 at 4:01 AM Chaim Turkel <[email protected]> wrote:
>>
>> > Thank you for your detailed response.
>> > Currently I am a bit stuck.
>> > I need to migrate data from MongoDB to BigQuery; we have about 1 terabyte
>> > of data. It is historical data, so I want to use BigQuery partitions.
>> > It seems that the IO connector creates a job per partition, so it takes
>> > a very long time, and I hit the BigQuery quota on the number of jobs
>> > per day.
>> > I would like to use streaming, but you cannot stream data older than
>> > 30 days.
>> >
>> > So I thought of partitions to see if I can get more parallelism.
>> >
>> > chaim
>> >
>> > On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov
>> > <[email protected]> wrote:
>> > > Okay, I see - there are about three different meanings of the word
>> > > "partition" that could have been involved here (BigQuery partitions,
>> > > runner-specific bundles, and the Partition transform), hence my
>> > > request for clarification.
>> > >
>> > > If you mean the Partition transform - then I'm confused by what you
>> > > mean by BigQueryIO "supporting" it. The Partition transform takes a
>> > > PCollection and produces a bunch of PCollections; these are ordinary
>> > > PCollections and you can apply any Beam transforms to them, and
>> > > BigQueryIO.write() is no exception - you can apply it too.
>> > >
>> > > To answer whether using Partition would improve your performance, I'd
>> > > need to understand exactly what you're comparing against what. I
>> > > suppose you're comparing the following:
>> > > 1) Applying BigQueryIO.write() to a PCollection, writing to a single
>> > > table.
>> > > 2) Splitting a PCollection into several smaller PCollections using
>> > > Partition, and applying BigQueryIO.write() to each of them, writing
>> > > to different tables, I suppose? (Or do you want to write to different
>> > > BigQuery partitions of the same table using a table partition
>> > > decorator?)
>> > > I would expect #2 to perform strictly worse than #1, because it
>> > > writes the same amount of data but increases the number of BigQuery
>> > > load jobs involved (thus increasing per-job overhead and consuming
>> > > BigQuery quota).
>> > > On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <[email protected]>
>> > > wrote:
>> > >
>> > >> https://beam.apache.org/documentation/programming-guide/#partition
>> > >>
>> > >> On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
>> > >> <[email protected]> wrote:
>> > >> > What do you mean by Beam partitions?
>> > >> >
>> > >> > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <[email protected]> wrote:
>> > >> >
>> > >> >> By the way, currently the performance on BigQuery partitions is
>> > >> >> very bad. Is there a repository where I can test with 2.2.0?
>> > >> >>
>> > >> >> chaim
>> > >> >>
>> > >> >> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <[email protected]>
>> > >> >> wrote:
>> > >> >> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug
>> > >> >> > if the table containing the partitions is not pre-created
>> > >> >> > (fixed in 2.2.0).
>> > >> >> >
>> > >> >> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <[email protected]>
>> > >> >> > wrote:
>> > >> >> >
>> > >> >> >> Hi,
>> > >> >> >>
>> > >> >> >> Does BigQueryIO support partitions when writing? Will it
>> > >> >> >> improve my performance?
>> > >> >> >>
>> > >> >> >> chaim
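[Editor's note: Reuven's suggestion earlier in the thread - load everything into a single staging table that carries the date as a column, then copy each date into its own partition via a partition decorator - can be sketched as a small shell loop. This is only a sketch, not code from the thread: the dataset, table, and column names are hypothetical, and the `bq` commands are echoed (dry run) rather than executed.]

```shell
#!/bin/sh
# Sketch of the "load to one table, then copy each date into its own
# partition" approach from the thread. All names below are hypothetical.
DATASET="mydataset"          # hypothetical dataset
STAGING="events_staging"     # staging table with a date column, e.g. event_day
TARGET="events_partitioned"  # date-partitioned destination table

# Echo the bq command that would copy one day's rows into the matching
# partition; "table$YYYYMMDD" is BigQuery's partition decorator syntax.
# Drop the leading "echo" to actually run the copies.
copy_day() {
  day="$1"
  echo bq query --allow_large_results \
    --destination_table "${DATASET}.${TARGET}\$${day}" \
    "\"SELECT * FROM [${DATASET}.${STAGING}] WHERE event_day = '${day}'\""
}

# One bq job per day, so the number of days must stay under the daily
# per-table quota discussed above.
for day in 20170924 20170925 20170926; do
  copy_day "$day"
done
```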
