Hi Beam, I've been happily developing a Python Beam batch pipeline that writes to Bigtable for a while, and the time has finally come to catch up on all our existing data. The data consists of readings from many smart meters, taken every 15 minutes, organized in files that arrive continuously over time - but for now, I want to process these files in batch mode. I currently have ~6T entries, each with 5-10 values (floats) that I want to save.
Our Bigtable uses time buckets where all data for a day is bucketed together, with 5-10 columns filled out per row and a (packed) float in each. I've been trying to set up my Beam job so that the writes to Bigtable are efficient. The naive way is to write one cell at a time, which is, as expected, quite inefficient, and the writes can land all over the table.

First, I tried to optimize by grouping my entries and writing an entire row (96 timestamps in a day * ~5-10 columns) at a time - but that fails as soon as I get to the WriteToBigtable stage, with messages like the following on every worker, whether I run a big job or a smaller data set:

  File "/usr/local/lib/python3.9/site-packages/apache_beam/io/gcp/bigtableio.py", line 184, in process
    self.batcher.mutate(row)
  File "/usr/local/lib/python3.9/site-packages/google/cloud/bigtable/batcher.py", line 98, in mutate
    self.flush()
  File "/usr/local/lib/python3.9/site-packages/apache_beam/io/gcp/bigtableio.py", line 81, in flush
    raise Exception(
  *Exception: Failed to write a batch of 285 records*
  [while running 'Write to BigTable/ParDo(_BigTableWriteFn)-ptransform-27']

I believe I'm hitting this bug <https://github.com/googleapis/python-bigtable/issues/485>, as there is no error message from Bigtable.

*Question 1*: Is there a workaround for this, or at least a guess at what might be going wrong? The error is consistent. I estimate each row I try to write to be on the order of ~10-15 kB. I tried some smaller batches, but eventually reverted to letting Beam write one cell at a time, and it seems I can still improve that a bit with some careful grouping of entries before the BT write. That works OK, but I end up with a Beam job that has a whole bunch of relatively idle CPUs while the BT nodes (currently 3) are overloaded. I can, and perhaps should, allow BT to scale to more nodes, but isn't there a better way?
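For reference, here's a simplified sketch of the grouping I'm doing (the names, key scheme, and column layout are simplified stand-ins, not my real schema; the actual job turns each grouped row into a DirectRow for WriteToBigtable):

```python
import struct
from collections import defaultdict
from datetime import datetime, timezone


def row_key(meter_id, day):
    # Hypothetical key scheme: meter id plus the day bucket, e.g. "meter42#2023-06-01".
    return f"{meter_id}#{day}".encode("utf-8")


def group_readings(readings):
    """Group (meter_id, timestamp, {column: float}) readings into one
    row per meter per day, each float packed as 8 big-endian bytes.

    Returns {row_key: {column_qualifier: packed_value}}; the qualifier
    encodes the 15-minute slot, so a full day yields up to
    96 timestamps * n columns worth of cells in a single row.
    """
    rows = defaultdict(dict)
    for meter_id, ts, values in readings:
        day = ts.date().isoformat()
        slot = (ts.hour * 60 + ts.minute) // 15  # 0..95 within the day
        for column, value in values.items():
            qualifier = f"{column}#{slot:02d}".encode("utf-8")
            rows[row_key(meter_id, day)][qualifier] = struct.pack(">d", value)
    return dict(rows)


# In the pipeline, each grouped row would become one DirectRow with a
# set_cell() per packed value before it reaches WriteToBigtable, roughly:
#
#   row = DirectRow(row_key=key)
#   for qualifier, packed in cells.items():
#       row.set_cell("readings", qualifier, packed)
```

It's one of these grouped rows (a few hundred cells, ~10-15 kB) that the batcher fails on.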
*Question 2*: Is there a way to tell Beam to adapt its write rate to Bigtable, as well as to lower its number of workers?

*Question 3*/perhaps an alternative: I don't actually have to run on all the data at once. It's useful, for performance reasons, to group entries for the same meter across time to some extent, but there are no dependencies across timestamps. Would it make sense to let my job run in sequenced batches instead, and is there an easy way of doing so? I was thinking of using fixed windows, but that doesn't affect the initial read, and it's also unclear whether it buys me much more than a GroupByKey with carefully chosen keys in batch mode.

That's a lot of questions, I know, but it seems like there could be many ways to solve my performance problem!

Thanks,
-Lina
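P.S. To make Question 3 concrete, this is the kind of sequenced batching I had in mind - a plain sketch with hypothetical names, where the sorted input file list is split into chunks and each chunk becomes its own bounded pipeline run:

```python
def chunk_files(paths, batch_size):
    """Split an ordered list of input files into sequential batches.

    Hypothetical helper: paths are assumed sorted by date, so each batch
    covers a contiguous time range, which keeps writes for the same
    meters and day buckets close together.
    """
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


# Each batch would then be its own bounded pipeline run, roughly:
#
#   for batch in chunk_files(all_paths, batch_size=500):
#       # max_num_workers would also cap the write pressure (Question 2).
#       opts = PipelineOptions(runner="DataflowRunner", max_num_workers=8)
#       with beam.Pipeline(options=opts) as p:
#           (p
#            | beam.Create(batch)
#            | beam.io.ReadAllFromText()
#            | ...)  # parse, group, write to Bigtable
```

Is there anything built in that does this better than a driver loop like the above?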