What is your concern about this job? It doesn't seem that slow, and it isn't really bottlenecked by writing to BigQuery (less than 50% of the wall-clock time is spent in the step that writes to BigQuery).
On Thu, Sep 28, 2017 at 12:38 AM Chaim Turkel <[email protected]> wrote:

> you can see my job at:
> https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2017-09-26_03_17_44-4821512213867199289?project=ordinal-ember-163410
>
> On Wed, Sep 27, 2017 at 10:47 PM, Reuven Lax <[email protected]> wrote:
> > There are a couple of options, and if you provide a job id (since you
> > are using the Dataflow runner) we can better advise.
> >
> > If you are writing to more than 2000 partitions, this won't work -
> > BigQuery has a hard quota of 1000 partition updates per table per day.
> >
> > If you have fewer than 1000 jobs, there are a few possibilities. It's
> > possible that BigQuery is taking a while to schedule some of those
> > jobs; they'll end up sitting in a queue waiting to be scheduled. We can
> > look at one of the jobs in detail to see if that's happening. Eugene's
> > suggestion of using your pipeline to load into a single table might be
> > the best one. You can write the date into a separate column, and then
> > write a shell script to copy each date to its own partition (see
> > https://cloud.google.com/bigquery/docs/creating-partitioned-tables#update-with-query
> > for some examples).
> >
> > On Wed, Sep 27, 2017 at 11:39 AM, Eugene Kirpichov <[email protected]> wrote:
> > > I see. Then Reuven's answer above applies.
> > > Maybe you could write to a non-partitioned table, and then split it
> > > into smaller partitioned tables. See
> > > https://stackoverflow.com/a/39001706/278042 for a discussion of the
> > > current options - granted, it seems like there currently don't exist
> > > very good options for creating a very large number of table
> > > partitions from existing data.
> > >
> > > On Wed, Sep 27, 2017 at 4:01 AM Chaim Turkel <[email protected]> wrote:
> > > > Thank you for your detailed response.
> > > > Currently I am a bit stuck.
> > > > I need to migrate data from Mongo to BigQuery; we have about 1 TB
> > > > of data. It is historical data, so I want to use BigQuery
> > > > partitions. It seems that the IO connector creates a job per
> > > > partition, so it takes a very long time and I hit BigQuery's quota
> > > > on the number of jobs per day.
> > > > I would like to use streaming, but you cannot stream data older
> > > > than 30 days.
> > > >
> > > > So I thought of partitions to see if I can get more parallelism.
> > > >
> > > > chaim
> > > >
> > > > On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov <[email protected]> wrote:
> > > > > Okay, I see - there are about 3 different meanings of the word
> > > > > "partition" that could have been involved here (BigQuery
> > > > > partitions, runner-specific bundles, and the Partition
> > > > > transform), hence my request for clarification.
> > > > >
> > > > > If you mean the Partition transform - then I'm confused by what
> > > > > you mean by BigQueryIO "supporting" it. The Partition transform
> > > > > takes a PCollection and produces a bunch of PCollections; these
> > > > > are ordinary PCollections and you can apply any Beam transforms
> > > > > to them, and BigQueryIO.write() is no exception - you can apply
> > > > > it too.
> > > > >
> > > > > To answer whether using Partition would improve your
> > > > > performance, I'd need to understand exactly what you're
> > > > > comparing against what. I suppose you're comparing the
> > > > > following:
> > > > > 1) Applying BigQueryIO.write() to a PCollection, writing to a
> > > > > single table
> > > > > 2) Splitting a PCollection into several smaller PCollections
> > > > > using Partition, and applying BigQueryIO.write() to each of
> > > > > them, writing to different tables, I suppose? (or do you want to
> > > > > write to different BigQuery partitions of the same table using a
> > > > > table partition decorator?)
> > > > > I would expect #2 to perform strictly worse than #1, because it
> > > > > writes the same amount of data but increases the number of
> > > > > BigQuery load jobs involved (thus increases per-job overhead and
> > > > > consumes BigQuery quota).
> > > > >
> > > > > On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <[email protected]> wrote:
> > > > > > https://beam.apache.org/documentation/programming-guide/#partition
> > > > > >
> > > > > > On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov <[email protected]> wrote:
> > > > > > > What do you mean by Beam partitions?
> > > > > > >
> > > > > > > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <[email protected]> wrote:
> > > > > > > > By the way, currently the performance on BigQuery
> > > > > > > > partitions is very bad.
> > > > > > > > Is there a repository where I can test with 2.2.0?
> > > > > > > >
> > > > > > > > chaim
> > > > > > > >
> > > > > > > > On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <[email protected]> wrote:
> > > > > > > > > Do you mean BigQuery partitions? Yes, however 2.1.0 has
> > > > > > > > > a bug if the table containing the partitions is not
> > > > > > > > > pre-created (fixed in 2.2.0).
> > > > > > > > >
> > > > > > > > > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <[email protected]> wrote:
> > > > > > > > > > Hi,
> > > > > > > > > > Does BigQueryIO support partitions when writing? Will
> > > > > > > > > > it improve my performance?
> > > > > > > > > >
> > > > > > > > > > chaim
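For reference, the staging-table approach Reuven describes in the quoted thread (load everything into a single table with a date column, then use a shell script to copy each date into its own partition via the `table$YYYYMMDD` decorator) could be sketched roughly as below. This is a hedged sketch, not code from the thread: the dataset, table, and column names (`mydataset`, `staging_table`, `events`, `event_date`) are placeholders, and the exact `bq` flags should be checked against current BigQuery documentation.

```shell
#!/usr/bin/env bash
# Sketch: rewrite each day from a staging table (which has an extra
# event_date column) into the matching partition of a pre-created
# date-partitioned target table, using BigQuery's "$YYYYMMDD" decorator.
# All dataset/table/column names are placeholders.

DATASET="mydataset"        # placeholder dataset
STAGING="staging_table"    # staging table holding all rows + event_date
TARGET="events"            # pre-created date-partitioned target table

# Build the bq invocation that rewrites one day into its partition.
partition_copy_cmd() {
  local day="$1"               # e.g. 2017-09-26
  local suffix="${day//-/}"    # partition decorator suffix: 20170926
  echo "bq query --use_legacy_sql=false --replace" \
       "--destination_table='${DATASET}.${TARGET}\$${suffix}'" \
       "\"SELECT * EXCEPT(event_date) FROM ${DATASET}.${STAGING} WHERE event_date = '${day}'\""
}

# Dry run: print one bq command per day (pipe to bash to actually run them).
for day in 2017-09-25 2017-09-26; do
  partition_copy_cmd "$day"
done
```

Note that each `bq query` with a destination table is itself a job, so this only helps if the number of distinct dates stays under the per-table daily quota Reuven mentions.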
