Thank you for your detailed response.
Currently I am a bit stuck.
I need to migrate data from MongoDB to BigQuery; we have about 1 terabyte
of data. It is historical data, so I want to use BigQuery partitions.
It seems that the IO connector creates a job per partition, so it takes
a very long time, and I hit the BigQuery quota on the number of
jobs per day.
I would like to use streaming, but you cannot stream data that is more
than 30 days old.

So I thought of using Beam's Partition transform to see if I can get more
parallelism.
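
To illustrate why the job count grows: if the connector really does issue
one job per day partition (my observation above, not something I have
confirmed in the code), then the number of jobs equals the number of
distinct days in the data. Below is a minimal Python sketch of that
counting. The table name and sample timestamps are hypothetical, and
bucketing by UTC day is an assumption; `table$YYYYMMDD` is the BigQuery
decorator syntax for addressing a single day partition.

```python
from collections import Counter
from datetime import datetime, timezone


def partition_decorator(table: str, event_time: datetime) -> str:
    """Return 'table$YYYYMMDD', the decorator for the record's day partition.

    Assumption: partitions are bucketed by UTC calendar day.
    """
    return f"{table}${event_time.astimezone(timezone.utc):%Y%m%d}"


def rows_per_partition(table, event_times):
    """Count how many rows would land in each day partition."""
    return Counter(partition_decorator(table, t) for t in event_times)


# Hypothetical sample: three history records spread over two days.
sample = [
    datetime(2015, 3, 1, 8, 30, tzinfo=timezone.utc),
    datetime(2015, 3, 1, 23, 59, tzinfo=timezone.utc),
    datetime(2015, 3, 2, 0, 1, tzinfo=timezone.utc),
]

counts = rows_per_partition("mydataset.history", sample)

# Under the one-job-per-partition assumption, this is the number of jobs.
load_jobs = len(counts)

for decorator, rows in sorted(counts.items()):
    print(decorator, rows)
print("jobs needed:", load_jobs)
```

With several years of history, that is on the order of a thousand or more
day partitions, so one job per partition adds up quickly against a daily
job quota.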

chaim


On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov
<[email protected]> wrote:
> Okay, I see - there's about 3 different meanings of the word "partition"
> that could have been involved here (BigQuery partitions, runner-specific
> bundles, and the Partition transform), hence my request for clarification.
>
> If you mean the Partition transform - then I'm confused what do you mean by
> BigQueryIO "supporting" it? The Partition transform takes a PCollection and
> produces a bunch of PCollections; these are ordinary PCollection's and you
> can apply any Beam transforms to them, and BigQueryIO.write() is no
> exception to this - you can apply it too.
>
> To answer whether using Partition would improve your performance, I'd need
> to understand exactly what you're comparing against what. I suppose you're
> comparing the following:
> 1) Applying BigQueryIO.write() to a PCollection, writing to a single table
> 2) Splitting a PCollection into several smaller PCollection's using
> Partition, and applying BigQueryIO.write() to each of them, writing to
> different tables I suppose? (or do you want to write to different BigQuery
> partitions of the same table using a table partition decorator?)
> I would expect #2 to perform strictly worse than #1, because it writes the
> same amount of data but increases the number of BigQuery load jobs involved
> (thus increases per-job overhead and consumes BigQuery quota).
>
> On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <[email protected]> wrote:
>
>> https://beam.apache.org/documentation/programming-guide/#partition
>>
>> On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
>> <[email protected]> wrote:
>> > What do you mean by Beam partitions?
>> >
>> > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <[email protected]> wrote:
>> >
>> >> by the way currently the performance on bigquery partitions is very bad.
>> >> Is there a repository where i can test with 2.2.0?
>> >>
>> >> chaim
>> >>
>> >> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <[email protected]> wrote:
>> >> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if the
>> >> > table containing the partitions is not pre created (fixed in 2.2.0).
>> >> >
>> >> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <[email protected]> wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >>    Does BigQueryIO support Partitions when writing? will it improve
>> >> >> my performance?
>> >> >>
>> >> >>
>> >> >> chaim
>> >> >>
>> >>
>>
