Re: Threading question .. am I doing this right?
Chris Angelico wrote:
> I'm still curious as to the workload (requests per second), as it might still
> be worth going for the feeder model. But if your current system works, then
> it may be simplest to debug that rather than change.

It is by all accounts a low-traffic situation, maybe one request per second. But
the view in question opens four plots on one page, generating four separate
requests. So with only two clients and a blocking DB connection, the whole
application with eight uwsgi worker threads comes down.

Now with the "extra load thread" modification, the app worked fine for several
days with only two threads.

Out of curiosity I tried the "feeder thread" approach with a dummy thread that
just sleeps and logs something every few seconds, ten times total. For some
reason it sometimes hangs after eight or nine loops, and then uwsgi cannot
restart gracefully, probably because it is still waiting for that thread to
finish.

Also, my web app is built around setting up the DB connections in the request
context, so using an extra thread outside that context would require doubling
some DB infrastructure. Probably not worth it at this point.

--
https://mail.python.org/mailman/listinfo/python-list
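[Editor's note: the hang-on-restart problem described above is typical of a feeder thread blocked in a plain sleep. A hypothetical sketch of one way to make such a thread interruptible, using threading.Event instead of time.sleep so a shutdown hook can wake it immediately; the feeder body here is a stand-in, not code from the thread:]

```python
import threading

stop_event = threading.Event()

def feeder(interval=5.0):
    """Dummy feeder loop: wakes every `interval` seconds until asked to stop."""
    # Event.wait() returns True as soon as the event is set, so the loop
    # exits promptly on shutdown instead of sleeping out its full interval.
    while not stop_event.wait(timeout=interval):
        print("feeder tick")  # placeholder for the real refresh work

# daemon=True means the interpreter will not wait for this thread on exit,
# so it cannot block a graceful uwsgi restart even if the stop hook is missed
t = threading.Thread(target=feeder, daemon=True)
t.start()

# on shutdown (e.g. from an atexit handler or a uwsgi shutdown hook):
stop_event.set()
t.join(timeout=1.0)
```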
Re: Threading question .. am I doing this right?
On Sat, 26 Feb 2022 at 05:16, Robert Latest via Python-list wrote:
> Chris Angelico wrote:
> > Depending on your database, this might be counter-productive. A
> > PostgreSQL database running on localhost, for instance, has its own
> > caching, and data transfers between two apps running on the same
> > computer can be pretty fast. The complexity you add in order to do
> > your own caching might be giving you negligible benefit, or even a
> > penalty. I would strongly recommend benchmarking the naive "keep going
> > back to the database" approach first, as a baseline, and only testing
> > these alternatives when you've confirmed that the database really is a
> > bottleneck.
>
> "Depending on your database" is the key phrase. This is not "my" database
> that is running on localhost. It is an external MSSQL server that I have no
> control over and whose requests frequently time out.

Okay, cool. That's crucial to know.

I'm still curious as to the workload (requests per second), as it might still
be worth going for the feeder model. But if your current system works, then it
may be simplest to debug that rather than change.

ChrisA
Re: Threading question .. am I doing this right?
Greg Ewing wrote:
> * If more than one thread calls get_data() during the initial
>   cache filling, it looks like only one of them will wait for
>   the thread -- the others will skip waiting altogether and
>   immediately return None.

Right. But that needs to be dealt with somehow. No data is no data.

> * Also if the first call to get_data() times out it will
>   return None (although maybe that's acceptable if the caller
>   is expecting it).

Right. Needs to be dealt with.

> * The caller of get_data() is getting an object that could
>   be changed under it by a future update.

I don't think that's a problem. If it turns out to be one, I'll create a copy
of the data while I hold the lock and pass that back.
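[Editor's note: the copy-while-holding-the-lock idea mentioned above could look roughly like this. A hypothetical sketch reusing the MyCache names from the original post; the real class has the update-thread machinery as well, omitted here:]

```python
import copy
from threading import Lock

class MyCache:
    def __init__(self):
        self.cache = None
        self.thread_lock = Lock()

    def get_data(self):
        with self.thread_lock:
            if self.cache is None:
                return None
            # Hand back a shallow copy so a later _update() extending
            # self.cache cannot change the list a caller already holds.
            return copy.copy(self.cache)
```

A shallow copy is enough here because updates only extend the list; the record objects themselves are never mutated.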
Re: Threading question .. am I doing this right?
Chris Angelico wrote:
> Depending on your database, this might be counter-productive. A
> PostgreSQL database running on localhost, for instance, has its own
> caching, and data transfers between two apps running on the same
> computer can be pretty fast. The complexity you add in order to do
> your own caching might be giving you negligible benefit, or even a
> penalty. I would strongly recommend benchmarking the naive "keep going
> back to the database" approach first, as a baseline, and only testing
> these alternatives when you've confirmed that the database really is a
> bottleneck.

"Depending on your database" is the key phrase. This is not "my" database that
is running on localhost. It is an external MSSQL server that I have no control
over and whose requests frequently time out.

> Hmm, it's complicated. There is another approach, and that's to
> completely invert your thinking: instead of "request wants data, so
> let's get data", have a thread that periodically updates your cache
> from the database, and then all requests return from the cache,
> without pinging the requester. Downside: It'll be requesting fairly
> frequently. Upside: Very simple, very easy, no difficulties debugging.

I'm using a similar approach in other places, but there I actually have a
separate process that feeds my local, fast DB with unwieldy data. That process
is not merely replicating, though: it preprocesses and "adds value" to the
data, and the data is worth retaining on my server. I didn't want to take that
approach in this instance because it is a bit too much overhead for
essentially "throwaway" stuff. I do like the idea of starting a separate
"timed" thread in the same application, though. Need to think about that.

Background: The clients are SBCs that display data on screens distributed
throughout a manufacturing facility, refreshing every few minutes.
Occasionally the requests would pile up waiting for the database, so that some
screens displayed error messages for a minute or two. Nobody cares, but my
pride was piqued and the error logs filled up.

I've had my proposed solution running for a few days now without errors. For
me that's enough, but I wanted to ask you all whether I made any logical
mistakes.
Re: Threading question .. am I doing this right?
On 25/02/22 1:08 am, Robert Latest wrote:
> My question is: Is this a solid approach? Am I forgetting something?

I can see a few problems:

* If more than one thread calls get_data() during the initial
  cache filling, it looks like only one of them will wait for
  the thread -- the others will skip waiting altogether and
  immediately return None.

* Also if the first call to get_data() times out it will
  return None (although maybe that's acceptable if the caller
  is expecting it).

* The caller of get_data() is getting an object that could
  be changed under it by a future update.

--
Greg
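[Editor's note: the first point can be addressed by having every caller wait on a threading.Event that is set once the first batch has arrived, instead of only the spawning caller joining the thread. A hypothetical sketch, not from the thread; query_external_database is stubbed in, and the timeout handling is simplified to a keyword argument:]

```python
from threading import Event, Lock, Thread

def query_external_database():
    # stand-in for the slow MSSQL query in the original post
    return [{"ts": 1, "value": 42}]

class MyCache:
    def __init__(self):
        self.cache = None
        self.thread_lock = Lock()
        self.update_thread = None
        self.first_fill = Event()  # set once the cache holds any data at all

    def _update(self):
        new_records = query_external_database()
        if self.cache is None:
            self.cache = new_records
        else:
            self.cache.extend(new_records)
        self.first_fill.set()

    def get_data(self, timeout=0.5, first_timeout=10):
        with self.thread_lock:
            if self.update_thread is None or not self.update_thread.is_alive():
                self.update_thread = Thread(target=self._update)
                self.update_thread.start()
            t = self.update_thread
        # Every caller -- not just the one that spawned the thread --
        # blocks here during the initial fill, so nobody returns None
        # just because another request got in first.
        if not self.first_fill.wait(timeout=first_timeout):
            return None
        t.join(timeout)
        return self.cache
```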
Re: Threading question .. am I doing this right?
On Fri, 25 Feb 2022 at 06:54, Robert Latest via Python-list wrote:
> I have a multi-threaded application (a web service) where several threads
> need data from an external database. That data is quite a lot, but it is
> almost always the same. Between incoming requests, timestamped records get
> added to the DB.
>
> So I decided to keep an in-memory cache of the DB records that gets only
> "topped up" with the most recent records on each request:

Depending on your database, this might be counter-productive. A PostgreSQL
database running on localhost, for instance, has its own caching, and data
transfers between two apps running on the same computer can be pretty fast.
The complexity you add in order to do your own caching might be giving you
negligible benefit, or even a penalty. I would strongly recommend benchmarking
the naive "keep going back to the database" approach first, as a baseline, and
only testing these alternatives when you've confirmed that the database really
is a bottleneck.

> Since it is better to quickly serve the client with slightly outdated data
> than not at all, I came up with the "impatient" solution below. The idea is
> that an incoming request triggers an update query in another thread, waits
> for a short timeout for that thread to finish and then returns either
> updated or old data.
>
> class MyCache():
>     def __init__(self):
>         self.cache = None
>         self.thread_lock = Lock()
>         self.update_thread = None
>
>     def _update(self):
>         new_records = query_external_database()
>         if self.cache is None:
>             self.cache = new_records
>         else:
>             self.cache.extend(new_records)
>
>     def get_data(self):
>         if self.cache is None:
>             timeout = 10  # allow more time to get initial batch of data
>         else:
>             timeout = 0.5
>         with self.thread_lock:
>             if self.update_thread is None or not self.update_thread.is_alive():
>                 self.update_thread = Thread(target=self._update)
>                 self.update_thread.start()
>             self.update_thread.join(timeout)
>
>         return self.cache
>
> my_cache = MyCache()
>
> My question is: Is this a solid approach? Am I forgetting something? For
> instance, I believe that I don't need another lock to guard
> self.cache.extend() because _update() can only ever run in one thread at a
> time. But maybe I'm overlooking something.

Hmm, it's complicated. There is another approach, and that's to completely
invert your thinking: instead of "request wants data, so let's get data", have
a thread that periodically updates your cache from the database, and then all
requests return from the cache, without pinging the requester. Downside: It'll
be requesting fairly frequently. Upside: Very simple, very easy, no
difficulties debugging.

How many requests per second does your service process? (By "requests", I mean
things that require this particular database lookup.) What's average
throughput, what's peak throughput? And importantly, what sorts of idle times
do you have? For instance, if you might have to handle 100 requests/second,
but there could be hours-long periods with no requests at all (eg if your
clients are all in the same timezone and don't operate at night), that's a
very different workload from 10 r/s constantly throughout the day.

ChrisA
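[Editor's note: the inverted "feeder" model Chris describes could be sketched roughly as follows. This is a hypothetical illustration, not code from the thread; query_external_database stands in for the real MSSQL query, and the refresh interval is arbitrary:]

```python
import threading

def query_external_database():
    # stand-in for the real (slow) external query
    return [{"ts": 1}]

class PolledCache:
    """A background thread refreshes the cache; requests only ever read it."""

    def __init__(self, interval=60.0):
        self._lock = threading.Lock()
        self._cache = []
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._loop, args=(interval,), daemon=True)
        self._thread.start()

    def _loop(self, interval):
        while not self._stop.is_set():
            records = query_external_database()
            with self._lock:
                self._cache.extend(records)
            # wait() doubles as an interruptible sleep for clean shutdown
            self._stop.wait(interval)

    def get_data(self):
        # Never blocks on the database: worst case, the data is one
        # refresh interval old.
        with self._lock:
            return list(self._cache)

    def stop(self):
        self._stop.set()
        self._thread.join()
```

Requests become a plain lock-protected read, which removes the timeout logic from the request path entirely; the trade-off is that the thread polls even when nobody is asking.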
Threading question .. am I doing this right?
I have a multi-threaded application (a web service) where several threads need
data from an external database. That data is quite a lot, but it is almost
always the same. Between incoming requests, timestamped records get added to
the DB.

So I decided to keep an in-memory cache of the DB records that gets only
"topped up" with the most recent records on each request:

from threading import Lock, Thread

class MyCache():
    def __init__(self):
        self.cache = None
        self.cache_lock = Lock()

    def _update(self):
        new_records = query_external_database()
        if self.cache is None:
            self.cache = new_records
        else:
            self.cache.extend(new_records)

    def get_data(self):
        with self.cache_lock:
            self._update()
            return self.cache

my_cache = MyCache()  # module level

This works, but even those "small" queries can sometimes hang for a long time,
causing incoming requests to pile up at the "with self.cache_lock" block.

Since it is better to quickly serve the client with slightly outdated data
than not at all, I came up with the "impatient" solution below. The idea is
that an incoming request triggers an update query in another thread, waits for
a short timeout for that thread to finish and then returns either updated or
old data.

class MyCache():
    def __init__(self):
        self.cache = None
        self.thread_lock = Lock()
        self.update_thread = None

    def _update(self):
        new_records = query_external_database()
        if self.cache is None:
            self.cache = new_records
        else:
            self.cache.extend(new_records)

    def get_data(self):
        if self.cache is None:
            timeout = 10  # allow more time to get initial batch of data
        else:
            timeout = 0.5
        with self.thread_lock:
            if self.update_thread is None or not self.update_thread.is_alive():
                self.update_thread = Thread(target=self._update)
                self.update_thread.start()
            self.update_thread.join(timeout)

        return self.cache

my_cache = MyCache()

My question is: Is this a solid approach? Am I forgetting something? For
instance, I believe that I don't need another lock to guard
self.cache.extend() because _update() can only ever run in one thread at a
time. But maybe I'm overlooking something.
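[Editor's note: one detail worth knowing about the pattern above is that Thread.join(timeout) returns None whether or not the thread finished; the only way to tell a completed update from a timed-out one is to check is_alive() afterwards. A minimal demonstration:]

```python
import threading
import time

t = threading.Thread(target=time.sleep, args=(1.0,))
t.start()

result = t.join(timeout=0.1)  # returns before the thread is done
print(result)                 # None -- join() never signals success or failure
print(t.is_alive())           # True -- the only way to detect the timeout

t.join()                      # now wait for real
print(t.is_alive())           # False
```

In the cache above this distinction is deliberately ignored (stale data is acceptable), but any caller that needs to know whether it got fresh data would have to make that is_alive() check itself.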