Re: BigQueryIO Partitions

Eugene Kirpichov Wed, 27 Sep 2017 11:40:24 -0700

I see. Then Reuven's answer above applies.
Maybe you could write to a non-partitioned table, and then split it into
smaller partitioned tables. See https://stackoverflow.com/a/39001706/278042
for a discussion of the
current options - granted, it seems like there currently don't exist very
good options for creating a very large number of table partitions from
existing data.
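
To make the job-count concern concrete, here is a minimal Python sketch (not
Beam code). It assumes, as reported further down in this thread, that writing
with partition decorators costs one load job per day partition. The table name
"history" and the helper names are hypothetical, chosen only for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def partition_decorator(table: str, day: date) -> str:
    """Build a BigQuery partition decorator such as 'history$20170927'."""
    return f"{table}${day:%Y%m%d}"


def rows_per_partition(table: str, days) -> Counter:
    """Count rows per day partition. Under the one-job-per-partition
    assumption, each distinct key would need its own load job."""
    return Counter(partition_decorator(table, d) for d in days)


# Hypothetical history data: one row per day for three years,
# plus one extra row that falls on the first day again.
start = date(2015, 1, 1)
days = [start + timedelta(days=i) for i in range(3 * 365)] + [start]

counts = rows_per_partition("history", days)
# Distinct partitions, i.e. load jobs needed under the assumption above.
print(len(counts))
```

The number of distinct keys grows with the date range, not with the data
volume, which is why a long history can run into a per-day job quota even
when the data itself is modest.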


On Wed, Sep 27, 2017 at 4:01 AM Chaim Turkel <[email protected]> wrote:

> Thank you for your detailed response.
> Currently I am a bit stuck.
> I need to migrate data from Mongo to BigQuery; we have about 1 terabyte
> of data. It is history data, so I want to use BigQuery partitions.
> It seems that the IO connector creates a job per partition, so it takes
> a very long time, and I hit the BigQuery quota on the number of
> jobs per day.
> I would like to use streaming, but you cannot stream data older than
> 30 days.
>
> So I thought of partitions to see if I can get more parallelism.
>
> chaim
>
>
> On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov
> <[email protected]> wrote:
> > Okay, I see - there's about 3 different meanings of the word "partition"
> > that could have been involved here (BigQuery partitions, runner-specific
> > bundles, and the Partition transform), hence my request for
> > clarification.
> >
> > If you mean the Partition transform - then I'm confused about what you
> > mean by BigQueryIO "supporting" it. The Partition transform takes a
> > PCollection and produces a bunch of PCollections; these are ordinary
> > PCollections and you can apply any Beam transforms to them, and
> > BigQueryIO.write() is no exception to this - you can apply it too.
> >
> > To answer whether using Partition would improve your performance, I'd
> > need to understand exactly what you're comparing against what. I suppose
> > you're comparing the following:
> > 1) Applying BigQueryIO.write() to a PCollection, writing to a single
> > table
> > 2) Splitting a PCollection into several smaller PCollections using
> > Partition, and applying BigQueryIO.write() to each of them, writing to
> > different tables I suppose? (or do you want to write to different
> > BigQuery partitions of the same table using a table partition decorator?)
> > I would expect #2 to perform strictly worse than #1, because it writes
> > the same amount of data but increases the number of BigQuery load jobs
> > involved (thus increases per-job overhead and consumes BigQuery quota).
> >
> > On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <[email protected]> wrote:
> >
> >> https://beam.apache.org/documentation/programming-guide/#partition
> >>
> >> On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
> >> <[email protected]> wrote:
> >> > What do you mean by Beam partitions?
> >> >
> >> > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <[email protected]> wrote:
> >> >
> >> >> by the way currently the performance on bigquery partitions is very bad.
> >> >> Is there a repository where i can test with 2.2.0?
> >> >>
> >> >> chaim
> >> >>
> >> >> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <[email protected]> wrote:
> >> >> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if the
> >> >> > table containing the partitions is not pre created (fixed in 2.2.0).
> >> >> >
> >> >> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <[email protected]> wrote:
> >> >> >
> >> >> >> Hi,
> >> >> >>
> >> >> >>    Does BigQueryIO support Partitions when writing? will it improve
> >> >> >> my performance?
> >> >> >>
> >> >> >>
> >> >> >> chaim
> >> >> >>
> >> >>
> >>
>
