ahmedabu98 commented on PR #38149: URL: https://github.com/apache/beam/pull/38149#issuecomment-4253764009
I understand wanting to prevent closed thread-level Catalogs from leaving dead Tables in a VM-wide cache. I'm a little worried about moving to a per-thread Table cache, though. There will be a lot of "get table" calls at the beginning of a job's lifetime, which can max out quota pretty quickly (we’ve seen that with users already). It gets worse on some streaming runners like Dataflow, where (# threads) >> (# vCPUs).

What if we move in the opposite direction and keep a static cache of Catalogs, so that all threads in a VM share one Catalog instance and perform all Table operations through it? Realistically, only one thread will need to create the Catalog. That same thread will likely create/load the Table and also store it in a static cache.

Thoughts on this approach? It would also improve our current Catalog management, which I’m noticing eagerly creates a Catalog per thread whether or not it actually gets used.
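
To make the proposal concrete, here is a rough sketch of a VM-wide cache in which the Catalog is created lazily by whichever thread asks first and every other thread reuses it. The `Catalog` and `Table` interfaces, the `SharedCatalogCache` class, and the string keys are hypothetical stand-ins for illustration, not the actual Iceberg or Beam types, and the sketch does not deal with closing or evicting entries.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical stand-in for the real catalog type. */
interface Catalog {
    Table loadTable(String tableName);
}

/** Hypothetical stand-in for the real table type. */
interface Table {
    String name();
}

/** One Catalog per catalog name and one Table per identifier, shared by all threads in the JVM. */
final class SharedCatalogCache {
    private static final Map<String, Catalog> CATALOGS = new ConcurrentHashMap<>();
    private static final Map<String, Table> TABLES = new ConcurrentHashMap<>();

    private SharedCatalogCache() {}

    /**
     * Lazily creates the catalog. ConcurrentHashMap.computeIfAbsent applies the
     * factory at most once per key, so only the first caller pays for creation.
     */
    static Catalog catalog(String catalogName, Function<String, Catalog> factory) {
        return CATALOGS.computeIfAbsent(catalogName, factory);
    }

    /** Loads the table through the shared catalog, at most once per JVM. */
    static Table table(String catalogName, String tableName, Function<String, Catalog> factory) {
        String key = catalogName + "." + tableName;
        // A separate map is used for tables, so the nested computeIfAbsent
        // on CATALOGS is not a recursive update of the same map.
        return TABLES.computeIfAbsent(
            key, k -> catalog(catalogName, factory).loadTable(tableName));
    }
}
```

With something like this, the number of "get table" calls per VM is bounded by the number of distinct tables rather than by the number of threads, and no Catalog is created until a thread actually needs one.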
