Hello everyone,

I'm having a hard time getting my head around threads, so I was hoping
someone with a better understanding of their underlying functionality
could lend me a helping hand. In particular, I'd like to know how threads
interact with each other when using thread.join() and a Semaphore created
with a maximum value. I'll try to keep it as clear and concise as possible,
but please don't hesitate to ask if anything about my approach is unclear
or, frankly, awful.

I'm writing a script that performs a couple of I/O operations and CLI
commands for each element in a list of IDs. The whole process takes a while
and its duration varies with the ID, so threading sounded like the best
fit, since the next ID can start as soon as a slot has freed up. I'm pasting
an extract of my code below and will explain underneath what I can't
properly understand.

Note: Please ignore any syntax typos, I'm rewriting the code to make it
suitable for posting here.


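file1.py
---------
import threading
import file2

ids = [<IDs listed here>]
threadsPool = []
for id in ids:
  thread = threading.Thread(target=file2.runStuff, name=str(id), args=(id,))
  threadsPool.append(thread)
# start() launches every thread right away; the Semaphore in file2 decides
# which of them actually get to do work.
for thread in threadsPool:
  thread.start()
for thread in threadsPool:
  print(threading.enumerate())  # module-level function, not a Thread method
  print("Queuing thread - " + str(thread))
  thread.join()  # only waits for this thread to finish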

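file2.py
----------
import threading
import file3
import file4

queue = threading.Semaphore(2)  # at most 2 threads inside runStuff at once
def runStuff(id):
  # "with" releases the Semaphore even if one of the calls below raises
  with queue:
    print("Lock acquired for " + str(id))
    file3.doMoreStuff()
    file4.evenMoreStuff()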


Onto my confusion. As long as I don't try to print information about the
thread that's being queued, or the total number of threads using
.enumerate(), the script works flawlessly: each thread that doesn't have a
lock waits until it acquires one and then moves on. I decided it would be
nice to log which thread starts next and how many threads are active right
now (each can take a different amount of time). When I tried to do that,
however, my log showed some pretty funky output, which at first made me
believe I had messed up all my threads. Example:


<<  2018-11-19 15:01:38,094 file2 [ID09] INFO - Lock acquired for
ID09                 <---- this is from file2.py
------ some time later and other logs in here ---------
[<_MainThread(MainThread, started 140431033562880)>, <Thread(ID09, started
140430614177536)>] <---- output from thread.enumerate(), file1.py
<<  2018-11-19 15:01:38,103 file1 [MainThread] DEBUG - Queuing thread -
<Thread(ID09, started 140430614177536)> <---- output from print() right
after thread.enumerate()


After some head scratching, I believe I've finally tracked down the reason
for my confusion:

The .start() loop starts the threads, and the first 2 acquire a lock
immediately and start running. As I understood it, the .join() loop later
puts the rest in waiting for a lock, and that's fine. What I didn't realize,
of course, is that the .join() loop also goes through the threads that were
already kicked off by the .start() loop (the first 2, since the Semaphore
allows 2 locks). My print in that loop then tells me those threads are
being queued, except they aren't, since they are already running. It's only
my log text that says so, because I wasn't smart enough to realize what was
about to happen, as seen below:

<<  2018-11-19 15:01:33,094 file1.py [MainThread] DEBUG - Queuing thread -
<Thread(ID02, stopped 140430666626816)> <--- makes it clear the thread has
already even finished
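
To check where the waiting actually happens, here is a minimal sketch. It
assumes a stand-in worker that only sleeps, and every name in it (slots,
worker, inside, peak) is made up for the example rather than taken from my
script. As far as I can tell, start() launches every thread straight away
and the threads beyond the Semaphore limit block inside the acquire, while
join() only makes the main thread wait for one thread to finish.

```python
import threading
import time

slots = threading.Semaphore(2)      # same limit as in file2.py
counter_lock = threading.Lock()     # protects the two counters below
inside = 0                          # workers currently holding a slot
peak = 0                            # highest value "inside" ever reached

def worker(item_id):
    global inside, peak
    with slots:                     # extra threads block here, not in join()
        with counter_lock:
            inside += 1
            peak = max(peak, inside)
        time.sleep(0.05)            # stand-in for the real work
        with counter_lock:
            inside -= 1

threads = [
    threading.Thread(target=worker, name=f"ID{n:02d}", args=(n,))
    for n in range(1, 6)
]
for t in threads:
    t.start()

# Every thread has been started by now, whether or not it holds a slot yet.
all_started = all(t.ident is not None for t in threads)

for t in threads:
    t.join()                        # waits only; it does not queue anything

print("all started before join():", all_started)
print("most workers inside the Semaphore at once:", peak)
```

If that reading is right, the "Queuing thread" message in my .join() loop
is simply printed at the wrong moment.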

Which finally gets me to my cry for help. I know I can't modify the
threadsPool list on the fly to remove the threads that have already
started, so that only the ones still pending are left for the 2nd loop. But
for the life of me I can't think of a proper way to extract information
about which threads are still going (or rather, which have finished, since
threading.enumerate() shows both running and queued threads).
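
One option I can think of is Thread.is_alive(), which reports whether a
started thread has finished yet. Below is a minimal sketch, assuming a
hypothetical work() function that just sleeps in place of my real calls. It
can't tell a thread that is waiting on the Semaphore from one that is
actually working, since both count as alive.

```python
import threading
import time

def work(delay):
    time.sleep(delay)               # stand-in for the real I/O and CLI calls

threads = [
    threading.Thread(target=work, name=f"ID{n:02d}", args=(0.05 * n,))
    for n in range(1, 4)
]
for t in threads:
    t.start()

snapshots = []                      # (still going, finished) name lists
while True:
    # Ask each thread once per pass so it cannot land in both lists.
    alive_by_name = {t.name: t.is_alive() for t in threads}
    going = [name for name, alive in alive_by_name.items() if alive]
    done = [name for name, alive in alive_by_name.items() if not alive]
    snapshots.append((going, done))
    print("still going:", going, "| finished:", done)
    if not going:
        break
    time.sleep(0.02)

for t in threads:
    t.join()                        # returns at once, all threads are done
```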

I have the feeling that extracting this information in the .join() loop is
a very wrong approach, since the loop only moves on once a thread has
finished, but at the same time it feels like the perfect timing. I feel
like (and I might be very wrong) if I could have only the threads that are
actually being queued in there (getting rid of the ones started initially),
my print(thread) would show all the information I want to display.
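
Another idea is to let each thread report its own state at the moment it
changes, instead of guessing from the .join() loop. This is only a sketch
under my own assumptions: states, history, set_state and run_stuff are
hypothetical names, and the sleep stands in for the real work.

```python
import threading
import time

slots = threading.Semaphore(2)      # same limit as in file2.py
state_lock = threading.Lock()       # protects states and history
states = {}      # thread name -> "waiting", "running" or "finished"
history = []     # (name, new state, names running at that moment)

def set_state(state):
    name = threading.current_thread().name
    with state_lock:
        states[name] = state
        running = sorted(n for n, s in states.items() if s == "running")
        history.append((name, state, running))
        print(f"{name} is now {state}; running: {running}")

def run_stuff(item_id):
    set_state("waiting")
    with slots:
        set_state("running")
        try:
            time.sleep(0.05)        # stand-in for the real work
        finally:
            set_state("finished")   # recorded before the slot is released

threads = [
    threading.Thread(target=run_stuff, name=f"ID{n:02d}", args=(n,))
    for n in range(1, 6)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Marking a thread as finished before it leaves the "with" block keeps the
list of running names from ever showing more than 2 entries.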

And just in case you are wondering why I have my threads starting in
file1.py and my Semaphore queue in file2.py, it's because I wanted to split
the runStuff(id) function into a separate module due to its length. I don't
know if it's a good way to do it, but thankfully the Python interpreter is
smart enough to see through my ignorance.

I'm also really sorry for the wall of text, I just hope the information
provided is enough to clear up the situation I'm in and what I'm struggling
with.

Thank you in advance and with kindest regards,
Dimitar