Hi everybody, we are trying to implement a Google BigQuery source on Flink. We are thinking of taking the time partition and column information as config, and I have been thinking about how to parallelize the source and how to generate splits. I read the code of the Hive source, where splits are generated as Hadoop file splits based on partitions, but there is no way to access file-level information on BigQuery. What would be a good way to generate splits for a BQ source?
Currently, most of our tables are partitioned daily. Assuming the columns and time range are taken as config, here are some ideas from me for generating splits:

1. Calculate the approximate number of rows and total size and divide them equally. This would require some way to add a marker for the division points.
2. Create one split per daily partition.
3. Take the time partition granularity (minute/hour/day) as config and make buckets. For example, with hour granularity and 7 days of data, this produces 7 * 24 = 168 splits. In the custom split class we can store the start and end timestamps for the reader to execute (see the sketch after this list).
4. Scan all the data into a distributed file system like HDFS or GCS, then just use a file splitter.

I am leaning towards approach three, because the split calculation is purely config based and doesn't require reading any data up front, unlike option four for example.
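To make approach three concrete, here is a minimal sketch of what the split class and the config-driven enumeration could look like. Everything here is my assumption, not a finished design: the class name BigQueryTimeRangeSplit, the enumerate helper, and the toPredicate format are hypothetical, and in a real connector the split class would implement Flink's SourceSplit interface and the enumeration would live in the SplitEnumerator.

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical split: just the [start, end) time range one reader handles.
// A real connector would implement org.apache.flink.api.connector.source.SourceSplit.
public class BigQueryTimeRangeSplit {
    private final String splitId;
    private final Instant startInclusive;
    private final Instant endExclusive;

    public BigQueryTimeRangeSplit(String splitId, Instant startInclusive, Instant endExclusive) {
        this.splitId = splitId;
        this.startInclusive = startInclusive;
        this.endExclusive = endExclusive;
    }

    public String splitId() {
        return splitId;
    }

    // WHERE clause the reader could attach to its query; assumes BigQuery
    // accepts ISO-8601 timestamp strings (verify the exact literal format).
    public String toPredicate(String timestampColumn) {
        return String.format("%s >= TIMESTAMP('%s') AND %s < TIMESTAMP('%s')",
                timestampColumn, startInclusive, timestampColumn, endExclusive);
    }

    // Enumerates splits purely from config: one split per bucket of the
    // configured granularity between [rangeStart, rangeEnd). No data is read.
    public static List<BigQueryTimeRangeSplit> enumerate(
            Instant rangeStart, Instant rangeEnd, Duration granularity) {
        List<BigQueryTimeRangeSplit> splits = new ArrayList<>();
        Instant cursor = rangeStart;
        int i = 0;
        while (cursor.isBefore(rangeEnd)) {
            Instant next = cursor.plus(granularity);
            if (next.isAfter(rangeEnd)) {
                next = rangeEnd; // clamp the last bucket to the configured end
            }
            splits.add(new BigQueryTimeRangeSplit("split-" + i++, cursor, next));
            cursor = next;
        }
        return splits;
    }
}

With hour granularity over 7 days, enumerate(start, start.plus(Duration.ofDays(7)), Duration.ofHours(1)) yields exactly the 168 splits from the example above, and since each split carries only two timestamps, the enumerator never has to touch BigQuery.

Any suggestions are welcome. Thank you!

~lav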