Hi Martijn,
The tables are partitioned on a timestamp, just like in Hive; they could be
range-partitioned as well, but that doesn't change the problem. Option two in
the first email already covers creating one split per partition. Are you
suggesting something different?

Thanks

On Tue, 18 Oct 2022 at 15:28, Martijn Visser <martijnvis...@apache.org> wrote:

> Hi Lavkesh,
>
> I'm not familiar with BigQuery, but when looking through the BQ API I
> noticed that the `Table` resource provides both a timePartitioning and a
> rangePartitioning [1]. Couldn't you use that?
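>
> For example, something along these lines could read that metadata. This is
> only a rough sketch, assuming the google-cloud-bigquery Java client;
> "my_dataset" and "my_table" are placeholders:
>
> import com.google.cloud.bigquery.BigQuery;
> import com.google.cloud.bigquery.BigQueryOptions;
> import com.google.cloud.bigquery.RangePartitioning;
> import com.google.cloud.bigquery.StandardTableDefinition;
> import com.google.cloud.bigquery.Table;
> import com.google.cloud.bigquery.TableId;
> import com.google.cloud.bigquery.TimePartitioning;
>
> public class PartitioningProbe {
>     public static void main(String[] args) {
>         BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
>         // "my_dataset" and "my_table" are placeholder names.
>         Table table = bigquery.getTable(TableId.of("my_dataset", "my_table"));
>         StandardTableDefinition definition = table.getDefinition();
>
>         // Time-based partitioning, if the table uses it.
>         TimePartitioning timePartitioning = definition.getTimePartitioning();
>         if (timePartitioning != null) {
>             System.out.println("Time partitioned on " + timePartitioning.getField()
>                     + " with type " + timePartitioning.getType());
>         }
>
>         // Integer range partitioning, if the table uses that instead.
>         RangePartitioning rangePartitioning = definition.getRangePartitioning();
>         if (rangePartitioning != null) {
>             System.out.println("Range partitioned on " + rangePartitioning.getField());
>         }
>     }
> }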
>
> Best regards,
>
> Martijn
>
> [1] https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#Table
>
> On Tue, Oct 18, 2022 at 3:44 AM yuxia <luoyu...@alumni.sjtu.edu.cn> wrote:
>
> > I'm familiar with the Hive source but don't have much knowledge of
> > BigQuery. From my side, though, approach number three sounds the most
> > reasonable.
> >
> > Option 1 sounds a little complex and may be time-consuming when generating splits.
> > Option 2 seems inflexible and too coarse-grained.
> > Option 4 needs extra effort to write the data out again.
> >
> > Best regards,
> > Yuxia
> >
> > ----- Original Message -----
> > From: "Lavkesh Lahngir" <lavk...@linux.com>
> > To: "dev" <dev@flink.apache.org>
> > Sent: Monday, October 17, 2022, 10:42:29 PM
> > Subject: SplitEnumerator for Bigquery Source.
> >
> > Hi everybody,
> > We are trying to implement a Google BigQuery source for Flink and are
> > thinking of taking the time partition and column information as config. I
> > am trying to work out how to parallelize the source and how to generate
> > splits. I read the code of the Hive source, where Hadoop file splits are
> > generated based on partitions, but there is no way to access file-level
> > information in BQ.
> > What would be a good way to generate splits for a BQ source?
> >
> > Currently, most of our tables are partitioned daily. Assuming the columns
> > and time range are taken as config, some ideas for generating splits:
> > 1. Calculate the approximate number of rows and total size and divide them
> > equally. This requires some way to add a marker for the division points.
> > 2. Create one split per daily partition.
> > 3. Take the time partition granularity (minute/hour/day) as config and
> > make buckets. For example, with hour granularity and 7 days of data this
> > yields 7 * 24 splits. A CustomSplit class can store the start and end
> > timestamps for the reader to execute (see the sketch after this list).
> > 4. Scan all the data into a distributed file system such as HDFS or GCS
> > and then just use the file splitter.
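> >
> > A rough sketch of what the CustomSplit class from option three could look
> > like; the names are illustrative, and the only thing it stores is the
> > bucket boundaries, derived purely from config:
> >
> > import java.time.Duration;
> > import java.time.Instant;
> > import java.util.ArrayList;
> > import java.util.List;
> >
> > import org.apache.flink.api.connector.source.SourceSplit;
> >
> > /** One split covering a [start, end) timestamp bucket of the table. */
> > public class CustomSplit implements SourceSplit {
> >     private final Instant start;
> >     private final Instant end;
> >
> >     public CustomSplit(Instant start, Instant end) {
> >         this.start = start;
> >         this.end = end;
> >     }
> >
> >     @Override
> >     public String splitId() {
> >         return start + "_" + end;
> >     }
> >
> >     public Instant getStart() { return start; }
> >
> >     public Instant getEnd() { return end; }
> >
> >     /** E.g. 7 days at hour granularity -> 7 * 24 splits, from config alone. */
> >     public static List<CustomSplit> buckets(Instant from, Instant to, Duration granularity) {
> >         List<CustomSplit> splits = new ArrayList<>();
> >         for (Instant cur = from; cur.isBefore(to); cur = cur.plus(granularity)) {
> >             Instant bucketEnd = cur.plus(granularity);
> >             splits.add(new CustomSplit(cur, bucketEnd.isAfter(to) ? to : bucketEnd));
> >         }
> >         return splits;
> >     }
> > }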
> >
> > I am leaning toward approach number three, because the split calculation
> > is purely config-based and does not require reading any data, unlike
> > option four, for example.
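> >
> > A corresponding SplitEnumerator could then just hand out one bucket per
> > split request, again without ever touching the data. This is only a sketch
> > under the same illustrative names as above:
> >
> > import java.io.IOException;
> > import java.time.Duration;
> > import java.time.Instant;
> > import java.util.ArrayDeque;
> > import java.util.ArrayList;
> > import java.util.List;
> > import java.util.Queue;
> >
> > import org.apache.flink.api.connector.source.SplitEnumerator;
> > import org.apache.flink.api.connector.source.SplitEnumeratorContext;
> >
> > /** Hands out one time bucket per split request; never reads table data. */
> > public class TimeBucketSplitEnumerator
> >         implements SplitEnumerator<CustomSplit, List<CustomSplit>> {
> >
> >     private final SplitEnumeratorContext<CustomSplit> context;
> >     private final Queue<CustomSplit> remaining;
> >
> >     public TimeBucketSplitEnumerator(
> >             SplitEnumeratorContext<CustomSplit> context,
> >             Instant from, Instant to, Duration granularity) {
> >         this.context = context;
> >         this.remaining = new ArrayDeque<>(CustomSplit.buckets(from, to, granularity));
> >     }
> >
> >     @Override
> >     public void start() {
> >         // Splits are derived purely from config, so nothing to discover here.
> >     }
> >
> >     @Override
> >     public void handleSplitRequest(int subtaskId, String requesterHostname) {
> >         CustomSplit next = remaining.poll();
> >         if (next != null) {
> >             context.assignSplit(next, subtaskId);
> >         } else {
> >             context.signalNoMoreSplits(subtaskId);
> >         }
> >     }
> >
> >     @Override
> >     public void addSplitsBack(List<CustomSplit> splits, int subtaskId) {
> >         // Splits from a failed reader go back into the queue.
> >         remaining.addAll(splits);
> >     }
> >
> >     @Override
> >     public void addReader(int subtaskId) {
> >         // Readers pull work via handleSplitRequest, so no eager assignment.
> >     }
> >
> >     @Override
> >     public List<CustomSplit> snapshotState(long checkpointId) {
> >         return new ArrayList<>(remaining);
> >     }
> >
> >     @Override
> >     public void close() throws IOException {
> >         // No external resources to release.
> >     }
> > }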
> >
> > Any suggestions are welcome.
> >
> > Thank you!
> > ~lav
> >
>
