Hi everybody,
We are trying to implement a Google BigQuery source for Flink, and we were
thinking of taking the time partition and column information as config. I
have been thinking about how to parallelize the source and how to generate
splits. I read the code of the Hive source, where Hadoop file splits can be
generated based on partitions, but there is no way to access file-level
information in BigQuery. What would be a good way to generate splits for a
BigQuery source?

Currently, most of our tables are partitioned daily. Assume the columns and
the time range are taken as config.
Some ideas from my side for generating splits:
1. Calculate the approximate number of rows and the table size, and divide
the data into equal parts. This would require some way to add a marker for
the division boundaries.
2. Create one split per daily partition.
3. Take the time-partition granularity (minute/hour/day) as config and build
buckets from it. For example, hour granularity over 7 days of data yields
7 * 24 = 168 splits. In a custom split class we can store the start and end
timestamps for the reader to query (see the sketch after this list).
4. Export all the data to a distributed file system such as HDFS or GCS,
then reuse an existing file splitter.
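
To make option three concrete, here is a minimal sketch of what such a
split could look like, assuming the FLIP-27 source API
(org.apache.flink.api.connector.source.SourceSplit). The class name
BigQueryTimeSplit and its fields are just placeholders, not an existing
connector class:

import org.apache.flink.api.connector.source.SourceSplit;

import java.time.Instant;

/** One time bucket of the configured range; the reader queries [start, end). */
public class BigQueryTimeSplit implements SourceSplit {

    private final String splitId;
    private final Instant startInclusive; // bucket start, inclusive
    private final Instant endExclusive;   // bucket end, exclusive

    public BigQueryTimeSplit(String splitId, Instant startInclusive, Instant endExclusive) {
        this.splitId = splitId;
        this.startInclusive = startInclusive;
        this.endExclusive = endExclusive;
    }

    @Override
    public String splitId() {
        return splitId;
    }

    public Instant startInclusive() {
        return startInclusive;
    }

    public Instant endExclusive() {
        return endExclusive;
    }
}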

I am leaning towards approach number three: the split calculation is purely
config based, so it does not require reading any data up front, unlike
option four for example.
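
For illustration, bucketing the configured time range by the configured
granularity could look roughly like this (a sketch only; SplitGenerator is
a hypothetical helper the split enumerator would call):

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public final class SplitGenerator {

    /** Cuts [rangeStart, rangeEnd) into buckets of the given granularity. */
    public static List<BigQueryTimeSplit> generateSplits(
            Instant rangeStart, Instant rangeEnd, Duration granularity) {
        List<BigQueryTimeSplit> splits = new ArrayList<>();
        Instant bucketStart = rangeStart;
        int index = 0;
        while (bucketStart.isBefore(rangeEnd)) {
            Instant bucketEnd = bucketStart.plus(granularity);
            if (bucketEnd.isAfter(rangeEnd)) {
                bucketEnd = rangeEnd; // clamp the last bucket to the configured range
            }
            splits.add(new BigQueryTimeSplit("split-" + index++, bucketStart, bucketEnd));
            bucketStart = bucketEnd;
        }
        return splits;
    }
}

Each reader would then turn its split into a time-bounded query, e.g. a
WHERE clause on _PARTITIONTIME (for ingestion-time partitioned tables) or
on the partitioning column, so no data has to be read during split planning.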

Any suggestions are welcome.

Thank you!
~lav
