you can see my job at:
https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2017-09-26_03_17_44-4821512213867199289?project=ordinal-ember-163410


On Wed, Sep 27, 2017 at 10:47 PM, Reuven Lax <[email protected]> wrote:
> There are a couple of options, and if you provide a job id (since you are
> using the Dataflow runner) we can better advise.
>
> If you are writing to more than 1000 partitions, this won't work - BigQuery
> has a hard quota of 1000 partition updates per table per day.
>
> If you have fewer than 1000 jobs, there are a few possibilities. It's
> possible that BigQuery is taking a while to schedule some of those jobs;
> they'll end up sitting in a queue waiting to be scheduled. We can look at
> one of the jobs in detail to see if that's happening. Eugene's suggestion
> of using your pipeline to load into a single table might be the best one.
> You can write the date into a separate column, and then write a shell
> script to copy each date to its own partition (see
> https://cloud.google.com/bigquery/docs/creating-partitioned-tables#update-with-query
> for some examples).
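Reuven's suggestion above - load everything into a single table with a date column, then copy each date into its own partition - can be sketched as a small shell script. All names here (mydataset.staging, mydataset.partitioned, the ds date column) are hypothetical placeholders, and the bq invocation is printed rather than executed so it can be reviewed before running for real:

```shell
#!/bin/bash
# Sketch: copy one day's rows from a staging table into the matching
# partition of a date-partitioned table, using a partition decorator.
# Table and column names are made-up placeholders.

copy_day() {
  local day="$1"                 # e.g. 2017-09-26
  local suffix="${day//-/}"      # partition decorator, e.g. 20170926
  # Printed instead of executed; drop the leading 'echo' to run it.
  echo bq query --use_legacy_sql=false --batch \
    --destination_table "mydataset.partitioned\$${suffix}" \
    "SELECT * EXCEPT(ds) FROM mydataset.staging WHERE ds = '${day}'"
}

copy_day 2017-09-26
```

Note that each copy is itself a job touching one partition, so the daily partition-update quota Reuven mentions still applies to these copies; the win is that the initial ingest becomes a single load job.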
>
> On Wed, Sep 27, 2017 at 11:39 AM, Eugene Kirpichov <[email protected]>
> wrote:
>
>> I see. Then Reuven's answer above applies.
>> Maybe you could write to a non-partitioned table, and then split it into
>> smaller partitioned tables. See https://stackoverflow.com/a/39001706/278042
>> for a discussion of the
>> current options - granted, it seems like there currently don't exist very
>> good options for creating a very large number of table partitions from
>> existing data.
>>
>> On Wed, Sep 27, 2017 at 4:01 AM Chaim Turkel <[email protected]> wrote:
>>
>> > thank you for your detailed response.
>> > Currently I am a bit stuck.
>> > I need to migrate data from Mongo to BigQuery; we have about 1 terabyte
>> > of data. It is historical data, so I want to use BigQuery partitions.
>> > It seems that the IO connector creates a job per partition, so it takes
>> > a very long time, and I hit the BigQuery quota on the number of jobs
>> > per day.
>> > I would like to use streaming, but you cannot stream data older than 30
>> > days.
>> >
>> > So I thought of partitions to see if I can get more parallelism.
>> >
>> > chaim
>> >
>> >
>> > On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov
>> > <[email protected]> wrote:
>> > > Okay, I see - there are about 3 different meanings of the word
>> > > "partition" that could have been involved here (BigQuery partitions,
>> > > runner-specific bundles, and the Partition transform), hence my
>> > > request for clarification.
>> > >
>> > > If you mean the Partition transform - then I'm confused by what you
>> > > mean by BigQueryIO "supporting" it. The Partition transform takes a
>> > > PCollection and produces a bunch of PCollections; these are ordinary
>> > > PCollections, and you can apply any Beam transforms to them -
>> > > BigQueryIO.write() is no exception, so you can apply it too.
>> > >
>> > > To answer whether using Partition would improve your performance, I'd
>> > > need to understand exactly what you're comparing against what. I
>> > > suppose you're comparing the following:
>> > > 1) Applying BigQueryIO.write() to a PCollection, writing to a single
>> > > table
>> > > 2) Splitting a PCollection into several smaller PCollections using
>> > > Partition, and applying BigQueryIO.write() to each of them, writing
>> > > to different tables, I suppose? (or do you want to write to different
>> > > BigQuery partitions of the same table using a table partition
>> > > decorator?)
>> > > I would expect #2 to perform strictly worse than #1, because it
>> > > writes the same amount of data but increases the number of BigQuery
>> > > load jobs involved (thus increasing per-job overhead and consuming
>> > > BigQuery quota).
>> > >
>> > > On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <[email protected]>
>> > > wrote:
>> > >
>> > >> https://beam.apache.org/documentation/programming-guide/#partition
>> > >>
>> > >> On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
>> > >> <[email protected]> wrote:
>> > >> > What do you mean by Beam partitions?
>> > >> >
>> > >> > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <[email protected]>
>> > >> > wrote:
>> > >> >
>> > >> >> By the way, currently the performance on BigQuery partitions is
>> > >> >> very bad.
>> > >> >> Is there a repository where I can test with 2.2.0?
>> > >> >>
>> > >> >> chaim
>> > >> >>
>> > >> >> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <[email protected]>
>> > >> >> wrote:
>> > >> >> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug
>> > >> >> > if the table containing the partitions is not pre-created
>> > >> >> > (fixed in 2.2.0).
>> > >> >> >
>> > >> >> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <[email protected]>
>> > >> >> > wrote:
>> > >> >> >
>> > >> >> >> Hi,
>> > >> >> >>
>> > >> >> >>    Does BigQueryIO support Partitions when writing? Will it
>> > >> >> >> improve my performance?
>> > >> >> >>
>> > >> >> >>
>> > >> >> >> chaim
>> > >> >> >>
>> > >> >>
>> > >>
>> >
>>
