Re: Recommendations in terms of threading, multi-threading and/or asynchronous processes/programming?

2023-01-08 Thread Peter J. Holzer
On 2023-01-08 13:49:38 +0200, jacob kruger wrote:
> Ok, the specific usage case right now is that I need to set up a process
> pulling contents of e-mail messages from an IMAP protocol mail server, which
> I then populate into a postgresql database, and, since this is the inbox of
> a relatively large-scale CRM/support system, there are currently over 2.5
> million e-mails in the inbox, but, it can grow by over 5 per day.

This is probably I/O-bound. You will likely spend much more time waiting
for the IMAP server or the database than parsing the messages. So you
probably don't need multi-processing just to utilize all your cores.
On the other hand you have some nicely separated tasks which can be
parallelized, so multi-threading should help (async probably would work
just as well or as badly as multi-threading but I find that harder to
understand so I would discard it at this point).
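
A minimal sketch of the multi-threading approach for I/O-bound work,
using the standard library's concurrent.futures. The fetch_and_store
function and the worker count are hypothetical; time.sleep only stands
in for waiting on the IMAP server or the database, so this shows the
shape of the solution, not the real imap_tools or PostgreSQL calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_and_store(uid):
    """Hypothetical stand-in: fetch one message and insert it into the database."""
    time.sleep(0.05)  # simulates waiting on the network or the database
    return uid


def run_batch(uids, workers=8):
    """Process all uids on a pool of threads; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_and_store, uids))


if __name__ == "__main__":
    start = time.perf_counter()
    done = run_batch(range(40))
    elapsed = time.perf_counter() - start
    print(len(done), "messages handled in", round(elapsed, 2), "seconds")
```

Because the threads spend nearly all their time waiting, several of
them can overlap their waits even though only one runs Python code at
any moment.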

I might be mistaken, though: Depending on how much processing you need
to do on these messages it might be worth it to split the work across
multiple processes. Check the CPU-usage of your process: If it's close
to 100% you will probably gain significantly from multi-processing.
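
One rough way to check this from inside Python, as a sketch: compare
CPU time with wall-clock time for a representative piece of work. The
helper name cpu_fraction is made up for illustration. A value near 1.0
suggests CPU-bound work, a value near 0 suggests the process is mostly
waiting on I/O.

```python
import time


def cpu_fraction(func, *args):
    """Return CPU time divided by wall-clock time for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of the whole process
    func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


if __name__ == "__main__":
    # Sleeping uses almost no CPU, so this behaves like an I/O-bound task.
    print("sleep:", round(cpu_fraction(time.sleep, 0.2), 2))
    # Summing a large range keeps one core busy the whole time.
    print("sum:  ", round(cpu_fraction(sum, range(5_000_000)), 2))
```

An external tool such as top gives the same information without
touching the code.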


> I already have the basic process operating, using imap_tools, but, wanted to
> enable you to query the process during run-time, without needing to either
> check logs, or query the database itself while it is on-the-go
[...]
> Also wanted to offer the ability to either pause, or terminate processes
> while it's busy batch processing large chunks of e-mail messages

So that would be an HTTP (or other socket-based) interface? That should
also be possible to add as an additional thread (or process).
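
A sketch of such an interface using only the standard library's
http.server, run in a daemon thread next to the workers. The status
dictionary, its keys, and the start_status_server helper are
assumptions for illustration; the workers would update the dictionary
under the lock as they go.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Shared state that the worker threads would update (hypothetical keys).
status = {"processed": 0, "state": "running"}
status_lock = threading.Lock()


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Snapshot the status under the lock, then answer with JSON.
        with status_lock:
            body = json.dumps(status).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default per-request logging to stderr


def start_status_server(port=0):
    """Serve the status on localhost in a background thread (port 0 = any free port)."""
    server = ThreadingHTTPServer(("127.0.0.1", port), StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_status_server()
    print("status available on port", server.server_address[1])
    server.shutdown()
    server.server_close()
```

Any HTTP client pointed at that port can then query progress while
the batch is running.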


> So, I think that for now, threading is probably the simplest to look into.

I agree with that assessment.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Recommendations in terms of threading, multi-threading and/or asynchronous processes/programming?

2023-01-08 Thread jacob kruger
Ok, the specific usage case right now is that I need to set up a process 
pulling contents of e-mail messages from an IMAP protocol mail server, 
which I then populate into a postgresql database, and, since this is the 
inbox of a relatively large-scale CRM/support system, there are 
currently over 2.5 million e-mails in the inbox, but, it can grow by 
over 5 per day.



I already have the basic process operating, using imap_tools, but, 
wanted to enable you to query the process during run-time, without 
needing to either check logs, or query the database itself while it is 
on-the-go - even if this is just for initial population time-period, 
since later on I will just set up code to run under a form of cron job, 
or handling time-based repeats itself on a separate machine.



Also wanted to offer the ability to either pause, or terminate processes 
while it's busy batch processing large chunks of e-mail messages - 
either send a message to the thread, or set a global variable to tell it 
to end the run after the current process item has finished off, just in 
case.
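
A minimal sketch of that pause/terminate idea using threading.Event
instead of a bare global variable. The BatchWorker class and its
method names are hypothetical, and time.sleep stands in for handling
one e-mail. The worker checks the flags between items, so the current
item always finishes before a pause or stop takes effect.

```python
import threading
import time


class BatchWorker:
    """Processes items in a background thread; can be paused, resumed or stopped."""

    def __init__(self, items):
        self.items = list(items)
        self.processed = []
        self.stop_requested = threading.Event()
        self.may_run = threading.Event()
        self.may_run.set()  # not paused initially
        self.thread = threading.Thread(target=self._run)

    def _run(self):
        for item in self.items:
            self.may_run.wait()  # blocks here while paused
            if self.stop_requested.is_set():
                break
            time.sleep(0.01)  # placeholder for processing one message
            self.processed.append(item)

    def start(self):
        self.thread.start()

    def pause(self):
        self.may_run.clear()

    def resume(self):
        self.may_run.set()

    def stop(self):
        self.stop_requested.set()
        self.may_run.set()  # wake a paused worker so it can see the stop flag
        self.thread.join()


if __name__ == "__main__":
    worker = BatchWorker(range(1000))
    worker.start()
    time.sleep(0.1)
    worker.stop()
    print("stopped after", len(worker.processed), "items")
```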



So, I think that for now, threading is probably the simplest to look into.


Later on, was also considering forms of low-level monitoring for UI 
elements, but, this is not really related to initial task, but, could 
almost relate to forms of non-visual gaming interfaces, for blind/VI 
individuals - I am myself 100% blind, but, that's not really relevant in 
this context.



Stay well


Jacob Kruger
+2782 413 4791
"Resistance is futile...but, acceptance is versatile..."




Re: Recommendations in terms of threading, multi-threading and/or asynchronous processes/programming?

2023-01-06 Thread Chris Angelico
On Sat, 7 Jan 2023 at 04:54, jacob kruger  wrote:
>
> I am just trying to make up my mind with regards to what I should look
> into working with/making use of in terms of what have put in subject line?
>
>
> As in, if want to be able to trigger multiple/various threads/processes
> to run in the background, possibly monitoring their states, either via
> interface, or via global variables, but, possibly while processing other
> forms of user interaction via the normal/main process, what would be
> recommended?
>

Any. All. Whatever suits your purpose.

They all have different goals, different tradeoffs. Threads are great
for I/O bound operations; they're easy to work with (especially in
Python), behave pretty much like just having multiple things running
concurrently, and generally are the easiest to use. But you'll run
into limits as your thread count climbs (with a simple test, I started
seeing delays at about 10,000 threads, with more serious problems at
100,000), so it's not well-suited for huge scaling. Also, only one
thread at a time can run Python code, which limits them to I/O-bound
tasks like networking.

Multiple processes take a lot more management. You have to carefully
define your communication channels (for instance, a
multiprocessing.Queue() to collect results), but they can do CPU-bound
tasks in parallel. So multiprocessing is a good way to saturate all of
your CPU cores. Big downsides include it being much harder to share
information between the processes, and much MUCH higher resource usage
than threads (with the same test as the above, I ran into limitations
at just over 500 processes - way fewer than the 10,000 threads!).
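
As a rough sketch of that pattern: each process does a CPU-bound job
and reports back through a multiprocessing.Queue. The count_primes
task, the worker function and the limits are placeholders chosen for
illustration only.

```python
import multiprocessing as mp


def count_primes(limit):
    """CPU-bound placeholder: count the primes below limit by trial division."""
    return sum(
        1
        for n in range(2, limit)
        if all(n % d for d in range(2, int(n ** 0.5) + 1))
    )


def worker(limit, results):
    # Each process sends its result back through the shared queue.
    results.put((limit, count_primes(limit)))


def run(limits):
    results = mp.Queue()
    procs = [mp.Process(target=worker, args=(lim, results)) for lim in limits]
    for p in procs:
        p.start()
    # Drain the queue before joining, one result per process.
    collected = dict(results.get() for _ in procs)
    for p in procs:
        p.join()
    return collected


if __name__ == "__main__":
    print(run([10_000, 20_000, 30_000]))
```

The __main__ guard matters here: on platforms that start child
processes by re-importing the main module, unguarded code would run
again in every child.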

Asynchronous I/O runs a single thread in a single process. So like
multithreading, it's only good for I/O bound tasks like networking.
It's harder to work with, though, since you have to be very careful to
include proper await points, and you can stall out the entire event
loop with one mistake (common culprits being synchronous disk I/O, and
gethostbyname). But the upside is that you get near-infinite tasks,
basically just limited by available memory (or other resources).
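
A small asyncio sketch of both points, with made-up function names:
many tasks share one thread as long as each one awaits, and a blocking
call is handed to a worker thread with asyncio.to_thread (Python 3.9
or later) so that it cannot stall the event loop.

```python
import asyncio
import time


async def fetch(i):
    # The await point lets every other task run while this one waits.
    await asyncio.sleep(0.1)
    return i


def blocking_lookup(i):
    # Stands in for synchronous disk I/O or gethostbyname.
    time.sleep(0.1)
    return i


async def main(n=100):
    start = time.perf_counter()
    # All n tasks wait concurrently on a single thread.
    results = await asyncio.gather(*(fetch(i) for i in range(n)))
    # Blocking work is moved to a thread so the loop stays responsive.
    extra = await asyncio.to_thread(blocking_lookup, n)
    return results, extra, time.perf_counter() - start


if __name__ == "__main__":
    results, extra, elapsed = asyncio.run(main())
    print(len(results), "tasks plus lookup", extra, "in", round(elapsed, 2), "seconds")
```

Calling blocking_lookup directly inside a coroutine instead would
freeze every other task for the duration of the call.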

Use whichever one is right for your needs.

ChrisA


Re: Recommendations in terms of threading, multi-threading and/or asynchronous processes/programming?

2023-01-06 Thread Peter J. Holzer
On 2023-01-06 10:18:24 +0200, jacob kruger wrote:
> I am just trying to make up my mind with regards to what I should look into
> working with/making use of in terms of what have put in subject line?
> 
> 
> As in, if want to be able to trigger multiple/various threads/processes to
> run in the background, possibly monitoring their states, either via
> interface, or via global variables, but, possibly while processing other
> forms of user interaction via the normal/main process, what would be
> recommended?

This depends very much on what you want to do and what the constraints
and requirements are and is completely impossible to answer in the
abstract.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Recommendations in terms of threading, multi-threading and/or asynchronous processes/programming?

2023-01-06 Thread jacob kruger
I am just trying to make up my mind with regards to what I should look 
into working with/making use of, in terms of what I have put in the 
subject line.



As in, if I want to be able to trigger multiple/various threads/processes 
to run in the background, possibly monitoring their states, either via 
an interface or via global variables, while possibly also processing 
other forms of user interaction via the normal/main process, what would 
be recommended?


As in, for example, the following page mentions some possibilities, like 
threading, asyncio, etc., but, without going into too much detail:

https://itnext.io/practical-guide-to-async-threading-multiprocessing-958e57d7bbb8

I have played around with threading in the past, and was looking 
into asyncio now, but thought I would rather first ask for 
recommendations/suggestions here.



For reference, I am currently working with Python 3.11, or might roll back 
to 3.10 if relevant, but the main thing is that I just want to get an idea 
of what's simplest to make use of in this context.


Thanks in advance

--

Jacob Kruger
+2782 413 4791
"Resistance is futile...but, acceptance is versatile..."

