Re: Threading question .. am I doing this right?

2022-02-28 Thread Robert Latest via Python-list
Chris Angelico wrote:
> I'm still curious as to the workload (requests per second), as it might still
> be worth going for the feeder model. But if your current system works, then
> it may be simplest to debug that rather than change.

It is by all accounts a low-traffic situation, maybe one request/second. But
the view in question opens four plots on one page, generating four separate
requests. So with only two clients and a blocking DB connection, the whole
application with eight uwsgi worker threads comes down. Now with the "extra
load thread" modification, the app worked fine for several days with only two
threads.

Out of curiosity I tried the "feeder thread" approach with a dummy thread that
just sleeps and logs something every few seconds, ten times total. For some
reason it sometimes hangs after eight or nine loops, and then uwsgi cannot
restart gracefully, probably because it is still waiting for that thread to
finish. Also my web app is built around setting up the DB connections in the
request context, so using an extra thread outside that context would require
doubling some DB infrastructure. Probably not worth it at this point.
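
One way to keep a thread like that from blocking a graceful restart is to make
it a daemon and have it sleep on an Event instead of time.sleep(), so it can be
woken early at shutdown. A sketch (how the stop event gets set from uwsgi's
shutdown machinery is left open):

from threading import Event, Thread
import logging

stop = Event()  # a shutdown hook would call stop.set()

def feeder():
    for i in range(10):
        if stop.wait(5):  # sleeps 5 s, but returns True at once on stop
            break
        logging.info("feeder tick %d", i)

Thread(target=feeder, daemon=True).start()  # daemon: won't block interpreter exit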
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Threading question .. am I doing this right?

2022-02-25 Thread Chris Angelico
On Sat, 26 Feb 2022 at 05:16, Robert Latest via Python-list wrote:
>
> Chris Angelico wrote:
> > Depending on your database, this might be counter-productive. A
> > PostgreSQL database running on localhost, for instance, has its own
> > caching, and data transfers between two apps running on the same
> > computer can be pretty fast. The complexity you add in order to do
> > your own caching might be giving you negligible benefit, or even a
> > penalty. I would strongly recommend benchmarking the naive "keep going
> > back to the database" approach first, as a baseline, and only testing
> > these alternatives when you've confirmed that the database really is a
> > bottleneck.
>
> "Depending on your database" is the key phrase. This is not "my" database that
> is running on localhost. It is an external MSSQL server that I have no control
> over, and queries against it frequently time out.
>

Okay, cool. That's crucial to know.

I'm still curious as to the workload (requests per second), as it
might still be worth going for the feeder model. But if your current
system works, then it may be simplest to debug that rather than
change.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Threading question .. am I doing this right?

2022-02-25 Thread Robert Latest via Python-list
Greg Ewing wrote:
> * If more than one thread calls get_data() during the initial
> cache filling, it looks like only one of them will wait for
> the thread -- the others will skip waiting altogether and
> immediately return None.

Right. But that needs to be dealt with somehow. No data is no data.

> * Also if the first call to get_data() times out it will
> return None (although maybe that's acceptable if the caller
> is expecting it).

Right. Needs to be dealt with.

> * The caller of get_data() is getting an object that could
> be changed under it by a future update.

I don't think that's a problem. If it turns out to be one, I'll create a copy
of the data while I hold the lock and pass that back.
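
That change would be confined to get_data(), along these lines (a sketch based
on the code from the original post; for a true snapshot, _update() would also
have to take the lock while mutating self.cache):

    def get_data(self):
        timeout = 10 if self.cache is None else 0.5
        with self.thread_lock:
            if self.update_thread is None or not self.update_thread.is_alive():
                self.update_thread = Thread(target=self._update)
                self.update_thread.start()
            self.update_thread.join(timeout)
            # copy while the lock is held; callers are insulated from later
            # extend() calls (the record objects themselves are still shared)
            return list(self.cache) if self.cache is not None else None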
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Threading question .. am I doing this right?

2022-02-25 Thread Robert Latest via Python-list
Chris Angelico wrote:
> Depending on your database, this might be counter-productive. A
> PostgreSQL database running on localhost, for instance, has its own
> caching, and data transfers between two apps running on the same
> computer can be pretty fast. The complexity you add in order to do
> your own caching might be giving you negligible benefit, or even a
> penalty. I would strongly recommend benchmarking the naive "keep going
> back to the database" approach first, as a baseline, and only testing
> these alternatives when you've confirmed that the database really is a
> bottleneck.

"Depending on your database" is the key phrase. This is not "my" database that
is running on localhost. It is an external MSSQL server that I have no control
over, and queries against it frequently time out.

> Hmm, it's complicated. There is another approach, and that's to
> completely invert your thinking: instead of "request wants data, so
> let's get data", have a thread that periodically updates your cache
> from the database, and then all requests return from the cache,
> without pinging the database. Downside: It'll be requesting fairly
> frequently. Upside: Very simple, very easy, no difficulties debugging.

I'm using a similar approach in other places, but there I actually have a
separate process that feeds my local, fast DB with unwieldy data. But that is
not merely replicating, it actually preprocesses and "adds value" to the data,
and the data is worth retaining on my server. I didn't want to take that
approach in this instance because it is a bit too much overhead for essentially
"throwaway" stuff. I like the idea of starting a separated "timed" thread in
the same application. Need to think about that.

Background: The clients are SBCs that display data on screens distributed
throughout a manufacturing facility. They refresh every few minutes.
Occasionally the requests would pile up waiting for the database, so that some
screens displayed error messages for a minute or two. Nobody cares, but my
pride was piqued and the error logs filled up.

I've had my proposed solution running for a few days now without errors. For me
that's enough, but I wanted to ask you guys whether I've made any logical
mistakes.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Threading question .. am I doing this right?

2022-02-25 Thread Greg Ewing

On 25/02/22 1:08 am, Robert Latest wrote:
> My question is: Is this a solid approach? Am I forgetting something?


I can see a few problems:

* If more than one thread calls get_data() during the initial
cache filling, it looks like only one of them will wait for
the thread -- the others will skip waiting altogether and
immediately return None. (One possible fix is sketched below.)

* Also if the first call to get_data() times out it will
return None (although maybe that's acceptable if the caller
is expecting it).

* The caller of get_data() is getting an object that could
be changed under it by a future update.
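
One way around the first point is an Event that _update() sets once the
cache holds data, so every early caller waits for the same initial fill
instead of returning None. A rough sketch, untested, reusing the names and
the query_external_database() placeholder from the original post:

from threading import Event, Lock, Thread

class MyCache():
    def __init__(self):
        self.cache = None
        self.thread_lock = Lock()
        self.update_thread = None
        self.filled = Event()  # set once the first batch has arrived

    def _update(self):
        new_records = query_external_database()
        if self.cache is None:
            self.cache = new_records
        else:
            self.cache.extend(new_records)
        self.filled.set()

    def get_data(self):
        with self.thread_lock:
            if self.update_thread is None or not self.update_thread.is_alive():
                self.update_thread = Thread(target=self._update)
                self.update_thread.start()
        # every caller waits here for the initial fill, not just the one
        # that happened to start the thread; once set, wait() is instant
        self.filled.wait(timeout=10)
        return self.cache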

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Threading question .. am I doing this right?

2022-02-24 Thread Chris Angelico
On Fri, 25 Feb 2022 at 06:54, Robert Latest via Python-list wrote:
>
> I have a multi-threaded application (a web service) where several threads need
> data from an external database. That data is quite a lot, but it is almost
> always the same. Between incoming requests, timestamped records get added to
> the DB.
>
> So I decided to keep an in-memory cache of the DB records that gets only
> "topped up" with the most recent records on each request:

Depending on your database, this might be counter-productive. A
PostgreSQL database running on localhost, for instance, has its own
caching, and data transfers between two apps running on the same
computer can be pretty fast. The complexity you add in order to do
your own caching might be giving you negligible benefit, or even a
penalty. I would strongly recommend benchmarking the naive "keep going
back to the database" approach first, as a baseline, and only testing
these alternatives when you've confirmed that the database really is a
bottleneck.

> Since it is better to quickly serve the client with slightly outdated data than
> not at all, I came up with the "impatient" solution below. The idea is that an
> incoming request triggers an update query in another thread, waits for a short
> timeout for that thread to finish and then returns either updated or old data.
>
> class MyCache():
>     def __init__(self):
>         self.cache = None
>         self.thread_lock = Lock()
>         self.update_thread = None
>
>     def _update(self):
>         new_records = query_external_database()
>         if self.cache is None:
>             self.cache = new_records
>         else:
>             self.cache.extend(new_records)
>
>     def get_data(self):
>         if self.cache is None:
>             timeout = 10 # allow more time to get initial batch of data
>         else:
>             timeout = 0.5
>         with self.thread_lock:
>             if self.update_thread is None or not self.update_thread.is_alive():
>                 self.update_thread = Thread(target=self._update)
>                 self.update_thread.start()
>             self.update_thread.join(timeout)
>
>         return self.cache
>
> my_cache = MyCache()
>
> My question is: Is this a solid approach? Am I forgetting something? For
> instance, I believe that I don't need another lock to guard self.cache.extend()
> because _update() can only ever run in one thread at a time. But maybe I'm
> overlooking something.

Hmm, it's complicated. There is another approach, and that's to
completely invert your thinking: instead of "request wants data, so
let's get data", have a thread that periodically updates your cache
from the database, and then all requests return from the cache,
without pinging the database. Downside: It'll be requesting fairly
frequently. Upside: Very simple, very easy, no difficulties debugging.
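
A minimal sketch of that inverted approach (the 60-second interval and the
bare exception handling are placeholders; query_external_database() is the
stand-in from the original post):

import time
from threading import Thread

class PeriodicCache():
    def __init__(self, interval=60):
        self.cache = None
        t = Thread(target=self._refresh_loop, args=(interval,), daemon=True)
        t.start()

    def _refresh_loop(self, interval):
        while True:
            try:
                # rebinding the attribute is atomic: readers see either the
                # old list or the new one, never a half-built one
                self.cache = query_external_database()
            except Exception:
                pass  # keep serving stale data if the DB times out
            time.sleep(interval)

    def get_data(self):
        return self.cache  # never blocks on the database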

How many requests per second does your service process? (By
"requests", I mean things that require this particular database
lookup.) What's average throughput, what's peak throughput? And
importantly, what sorts of idle times do you have? For instance, if
you might have to handle 100 requests/second, but there could be
hours-long periods with no requests at all (eg if your clients are all
in the same timezone and don't operate at night), that's a very
different workload from 10 r/s constantly throughout the day.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Threading question .. am I doing this right?

2022-02-24 Thread Robert Latest via Python-list
I have a multi-threaded application (a web service) where several threads need
data from an external database. That data is quite a lot, but it is almost
always the same. Between incoming requests, timestamped records get added to
the DB.

So I decided to keep an in-memory cache of the DB records that gets only
"topped up" with the most recent records on each request:


from threading import Lock, Thread


class MyCache():
    def __init__(self):
        self.cache = None
        self.cache_lock = Lock()

    def _update(self):
        new_records = query_external_database()
        if self.cache is None:
            self.cache = new_records
        else:
            self.cache.extend(new_records)

    def get_data(self):
        with self.cache_lock:
            self._update()

        return self.cache

my_cache = MyCache() # module level


This works, but even those "small" queries can sometimes hang for a long time,
causing incoming requests to pile up at the "with self.cache_lock" block.

Since it is better to quickly serve the client with slightly outdated data than
not at all, I came up with the "impatient" solution below. The idea is that an
incoming request triggers an update query in another thread, waits for a short
timeout for that thread to finish and then returns either updated or old data.

class MyCache():
    def __init__(self):
        self.cache = None
        self.thread_lock = Lock()
        self.update_thread = None

    def _update(self):
        new_records = query_external_database()
        if self.cache is None:
            self.cache = new_records
        else:
            self.cache.extend(new_records)

    def get_data(self):
        if self.cache is None:
            timeout = 10 # allow more time to get initial batch of data
        else:
            timeout = 0.5
        with self.thread_lock:
            if self.update_thread is None or not self.update_thread.is_alive():
                self.update_thread = Thread(target=self._update)
                self.update_thread.start()
            self.update_thread.join(timeout)

        return self.cache

my_cache = MyCache()
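
For context, a view using this cache might look like the following (Flask is
an assumption here; the thread only mentions a web service behind uwsgi):

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/data")
def data():
    records = my_cache.get_data()
    if records is None:
        return "no data yet", 503  # first fill timed out
    return jsonify(records)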

My question is: Is this a solid approach? Am I forgetting something? For
instance, I believe that I don't need another lock to guard self.cache.extend()
because _update() can only ever run in one thread at a time. But maybe I'm
overlooking something.

-- 
https://mail.python.org/mailman/listinfo/python-list