I see. Then Reuven's answer above applies. Maybe you could write to a non-partitioned table and then split it into smaller partitioned tables. See https://stackoverflow.com/a/39001706/278042 for a discussion of the current options. Granted, there don't currently seem to be very good options for creating a very large number of table partitions from existing data.
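To make the "split it into partitions" step a bit more concrete, here is a minimal sketch of how rows could be mapped to per-day partition targets using BigQuery's `table$YYYYMMDD` partition decorator. It only computes the target names and groups rows in memory; it does not talk to BigQuery or Beam. The table name `history.events` and the field name `created_at` are made up for illustration, and bucketing by UTC day is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_decorator(table: str, ts: datetime) -> str:
    """Return 'table$YYYYMMDD' for the UTC day that contains ts.

    ts is expected to be timezone-aware; a naive datetime would be
    interpreted as local time by astimezone().
    """
    day = ts.astimezone(timezone.utc).strftime("%Y%m%d")
    return f"{table}${day}"


def group_by_partition(table, rows, ts_field="created_at"):
    """Group rows by the partition decorator they would be written to."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_decorator(table, row[ts_field])].append(row)
    return dict(groups)


# Hypothetical sample rows; 'history.events' and 'created_at' are assumptions.
rows = [
    {"id": 1, "created_at": datetime(2017, 9, 26, 6, 40, tzinfo=timezone.utc)},
    {"id": 2, "created_at": datetime(2017, 9, 26, 23, 35, tzinfo=timezone.utc)},
    {"id": 3, "created_at": datetime(2017, 9, 27, 4, 1, tzinfo=timezone.utc)},
]
groups = group_by_partition("history.events", rows)
for target, batch in sorted(groups.items()):
    print(target, len(batch))
```

Each distinct key in `groups` corresponds to one day partition, which is also roughly the number of separate writes a per-partition approach would need.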
On Wed, Sep 27, 2017 at 4:01 AM Chaim Turkel <[email protected]> wrote:

> Thank you for your detailed response.
> Currently I am a bit stuck.
> I need to migrate data from Mongo to BigQuery; we have about 1 terabyte
> of data. It is history data, so I want to use BigQuery partitions.
> It seems that the IO connector creates a job per partition, so it takes
> a very long time, and I hit the BigQuery quota on the number of
> jobs per day.
> I would like to use streaming, but you cannot stream data older than
> 30 days.
>
> So I thought of partitions, to see if I can get more parallelism.
>
> chaim
>
>
> On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov
> <[email protected]> wrote:
> > Okay, I see. There are about three different meanings of the word
> > "partition" that could have been involved here (BigQuery partitions,
> > runner-specific bundles, and the Partition transform), hence my request
> > for clarification.
> >
> > If you mean the Partition transform, then I'm confused about what you
> > mean by BigQueryIO "supporting" it. The Partition transform takes a
> > PCollection and produces a bunch of PCollections; these are ordinary
> > PCollections and you can apply any Beam transforms to them, and
> > BigQueryIO.write() is no exception to this: you can apply it too.
> >
> > To answer whether using Partition would improve your performance, I'd
> > need to understand exactly what you're comparing against what. I suppose
> > you're comparing the following:
> > 1) Applying BigQueryIO.write() to a PCollection, writing to a single
> > table
> > 2) Splitting a PCollection into several smaller PCollections using
> > Partition, and applying BigQueryIO.write() to each of them, writing to
> > different tables I suppose? (Or do you want to write to different
> > BigQuery partitions of the same table using a table partition decorator?)
> > I would expect #2 to perform strictly worse than #1, because it writes
> > the same amount of data but increases the number of BigQuery load jobs
> > involved (thus increases per-job overhead and consumes BigQuery quota).
> >
> > On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <[email protected]> wrote:
> >
> >> https://beam.apache.org/documentation/programming-guide/#partition
> >>
> >> On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
> >> <[email protected]> wrote:
> >> > What do you mean by Beam partitions?
> >> >
> >> > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <[email protected]> wrote:
> >> >
> >> >> By the way, currently the performance on BigQuery partitions is
> >> >> very bad.
> >> >> Is there a repository where I can test with 2.2.0?
> >> >>
> >> >> chaim
> >> >>
> >> >> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <[email protected]>
> >> >> wrote:
> >> >> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if
> >> >> > the table containing the partitions is not pre-created (fixed in
> >> >> > 2.2.0).
> >> >> >
> >> >> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <[email protected]>
> >> >> > wrote:
> >> >> >
> >> >> >> Hi,
> >> >> >>
> >> >> >> Does BigQueryIO support Partitions when writing? Will it
> >> >> >> improve my performance?
> >> >> >>
> >> >> >>
> >> >> >> chaim
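
The comparison between options #1 and #2 quoted above can be put into rough numbers. This is only back-of-the-envelope arithmetic under stated assumptions: one load job per write target, three years of daily partitions, and a daily job limit of 1000. That limit is a placeholder chosen for illustration, not a value taken from this thread, so please check the current BigQuery quota documentation for the real figure.

```python
import math


def load_jobs_needed(num_targets: int, jobs_per_target: int = 1) -> int:
    """Total load jobs if every write target needs jobs_per_target jobs."""
    return num_targets * jobs_per_target


def days_under_quota(total_jobs: int, jobs_per_day: int) -> int:
    """Minimum number of days needed to run total_jobs under a daily limit."""
    if jobs_per_day <= 0:
        raise ValueError("jobs_per_day must be positive")
    return math.ceil(total_jobs / jobs_per_day)


# Placeholder limit, NOT an official BigQuery quota value.
ASSUMED_JOBS_PER_DAY = 1000

single_table_jobs = load_jobs_needed(1)         # option #1: one target
per_partition_jobs = load_jobs_needed(3 * 365)  # option #2: one target per day

print(single_table_jobs, days_under_quota(single_table_jobs, ASSUMED_JOBS_PER_DAY))
print(per_partition_jobs, days_under_quota(per_partition_jobs, ASSUMED_JOBS_PER_DAY))
```

The amount of data is the same in both cases; only the job count changes, which is why the per-partition approach runs into the daily limit first.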
