On 2/25/2023 4:41 PM, Skip Montanaro wrote:
Thanks for the responses.

Peter wrote:

Which OS is this?

macOS Ventura 13.1, M1 MacBook Pro (eight cores).

Thomas wrote:

 > I'm no expert on locks, but you don't usually want to keep a lock while
 > some long-running computation goes on.  You want the computation to be
 > done by a separate thread, put its results somewhere, and then notify
 > the choreographing thread that the result is ready.

In this case I'm extracting the noun phrases from the body of an email message (returned as a list). I have a collection of email messages organized by month (typically 1000 to 3000 messages per month). I'm using concurrent.futures.ThreadPoolExecutor() with the default number of workers (min(32, os.cpu_count() + 4), i.e. 12 threads on my system) to process each month, so 12 active threads at a time. Given that the process is pretty much CPU-bound, maybe reducing the number of workers to the CPU count would make sense. Processing of each email message enters that with block once. That's about as minimal as I can make it.

I thought for a bit about pushing the textblob stuff into a separate worker thread, but it wasn't obvious how to set up queues to handle the communication between the threads created by ThreadPoolExecutor() and the worker thread. Maybe I'll think about it harder. (I have a related problem with SQLite, since an open database connection can't be used from multiple threads by default. That makes much of the program's end-of-run processing single-threaded.)
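Something like this rough, untested sketch is what I have in mind for the separate-worker-thread idea. A single-worker ThreadPoolExecutor stands in for explicit queues, and the names here (process_message(), the sample messages list) are made up for illustration:

import concurrent.futures

from textblob import TextBlob
from textblob.np_extractors import ConllExtractor

ext = ConllExtractor()

# Only this one worker thread ever touches the extractor, so no lock is needed.
extractor_pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def noun_phrases(text):
    # Runs in the dedicated extractor thread.
    return TextBlob(text, np_extractor=ext).noun_phrases

def process_message(text):
    # Runs in one of the regular pool threads; blocks only that thread
    # while the extractor thread works through its backlog.
    return extractor_pool.submit(noun_phrases, text).result()

# Stand-in for the real per-month message bodies.
messages = ["First sample message body.", "Second sample message body."]

with concurrent.futures.ThreadPoolExecutor() as pool:
    for phrases in pool.map(process_message, messages):
        print(phrases)   # the real code would do the SQLite work here

extractor_pool.shutdown()

Presumably the same single-worker trick could serialize the SQLite work too, since only one thread would ever touch the connection.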

If the noun extractor is single-threaded (which I think you mentioned), no amount of parallel access is going to help. The best you can do is to queue up requests so that as soon as the noun extractor returns from one call, it gets handed another blob. The CPU will be busy all the time running the noun-extraction code.

If that's the case, you might as well eliminate the threads and do it sequentially in the most obvious and simple manner.

It would be worthwhile to try that approach and see what happens to the CPU usage and overall computation time.
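Roughly, the sequential version would look like this (an untested sketch; store_phrases() and the sample messages are stand-ins for whatever your real code does):

from textblob import TextBlob
from textblob.np_extractors import ConllExtractor

def store_phrases(phrases):
    # Stand-in for the real per-message work (SQLite inserts, etc.).
    print(phrases)

def process_month(messages):
    ext = ConllExtractor()   # create the extractor once so its training cost is paid once
    for text in messages:
        phrases = TextBlob(text, np_extractor=ext).noun_phrases
        store_phrases(phrases)

process_month(["First sample message body.", "Second sample message body."])

If that pegs one core and finishes in about the same wall-clock time as the threaded version, the lock really is serializing everything and the thread machinery is pure overhead.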

 > This link may be helpful -
 >
 > https://anandology.com/blog/using-iterators-and-generators/

I don't think that's where my problem is. The lock protects the generation of the noun phrases; the loop which does the yielding operates outside of that lock's control. The version of the code I posted is my latest, in which I tossed out a bunch of phrase-processing code (effectively dead-end ideas for processing the phrases). Replacing the for loop with a simple return seems not to have any effect. In any case, the caller which uses the phrases does a fair amount of extra work with them, populating a SQLite database, so I don't think the time it takes to process a single email message is dominated by the phrase generation.
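In outline the relevant part is shaped roughly like this (a simplified sketch, not the actual code):

import threading

from textblob import TextBlob
from textblob.np_extractors import ConllExtractor

NLP_LOCK = threading.Lock()
EXTRACTOR = ConllExtractor()

def noun_phrases(text):
    with NLP_LOCK:            # the extractor isn't thread-safe
        phrases = TextBlob(text, np_extractor=EXTRACTOR).noun_phrases
    # The lock is released before the caller consumes anything.
    for phrase in phrases:
        yield phrase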

Here's timeit output for the noun_phrases code:

% python -m timeit -s 'text = """`python -m timeit --help`""" ; from textblob import TextBlob ; from textblob.np_extractors import ConllExtractor ; ext = ConllExtractor() ; phrases = TextBlob(text, np_extractor=ext).noun_phrases' 'phrases = TextBlob(text, np_extractor=ext).noun_phrases'
5000 loops, best of 5: 98.7 usec per loop

The text being processed is timeit's help message, which looks to be about the same length as a typical email message, certainly the same order of magnitude. Also, note that I call it once in the setup to eliminate the initial training of the ConllExtractor instance. I don't know if ~100 usec qualifies as long-running or not; at that rate, even 3000 messages would only be around 0.3 seconds of extraction per month, if the measurement is representative.

I'll keep messing with it.

Skip
