Hi folks,

I'd like to get some feedback on a multi-threading interface I've been
thinking about and using for the past year or so. I won't bury the lede: see
my approach here
<https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-my_example-py>.

*Background / problem:*

A couple of years ago, I inherited my company's codebase for getting data
into our data warehouse using an ELT approach (extract-and-loads done in
Python, transforms done in dbt/SQL). The codebase has dozens of Python
scripts that integrate first-party and third-party data from databases,
FTPs, and APIs, and that run on a scheduler (typically daily or hourly). The
scripts I inherited were single-threaded procedural scripts that looked like
glue code and spent most of their time in network I/O (see example
<https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-unthreaded_example-py>).
This got my company pretty far!

As my team and I added more and more integrations with more and more data,
we wanted to have faster and faster scripts to reduce our dev cycles and
reduce our multi-hour nightly jobs to minutes. Because our scripts were
network-bound, multi-threading was a good way to accomplish this, and so I
looked into concurrent.futures (example
<https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-concurrent_futures_example-py>)
and asyncio (example
<https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-asyncio_example-py>),
but I decided against these options because:

1. It wasn't immediately apparent how to adapt my codebase to use these
libraries without some combination of fundamental changes to our execution
platform, reworking our scripts from the ground up, or adding a significant
amount of multi-threading code to each script.

2. I couldn't wrap my head around the async/await and future constructs
particularly quickly, and I was concerned that my team would also struggle
with this change.

3. I believe the procedural style glue code we have is quite easy to
comprehend, which I think has a positive impact on scale.

*Solution:*

And so, as mentioned at the top, I designed a different interface to
concurrent.futures.ThreadPoolExecutor that we are successfully using for
our extract-and-load pattern; see a basic example here
<https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-my_example-py>.
The design considerations of this interface include:

- Usage is minimally invasive to the original unthreaded approach of the
codebase. (And so, teaching the library to team members has been fairly
straightforward despite the multi-threaded paradigm shift.)

- The @parallel.task decorator should wrap a method that does the same kind
of work for different parameters. The body of the method should be primarily
I/O, since that is where Python multi-threading yields concurrency gains.

- If no parallel.threads context manager has been entered, the
@parallel.task decorator acts as a no-op (and the code runs serially).

- If an environment variable is set to disable the context manager, the
@parallel.task decorator acts as a no-op (and the code runs serially).

- There is also an environment variable to change the number of workers
provided by parallel.threads (if not hard-coded).
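
To make the bullets above concrete, here is a minimal sketch of how a
decorator and context manager with these properties could be built on
ThreadPoolExecutor. It is not the code from the gist: the environment
variable names (PARALLEL_DISABLE, PARALLEL_MAX_WORKERS), the default of 4
workers, the module-level state, and the extract_and_load function are all
assumptions made for illustration.

```python
import contextlib
import functools
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical environment variable names, chosen for this sketch only.
_DISABLE_ENV = "PARALLEL_DISABLE"
_WORKERS_ENV = "PARALLEL_MAX_WORKERS"

# Module-level state: the active pool (if any) and the futures submitted to it.
_state = {"executor": None, "futures": []}


@contextlib.contextmanager
def threads(max_workers=None):
    """Run @task calls on a thread pool for the duration of the block."""
    if os.environ.get(_DISABLE_ENV):
        yield  # disabled (any non-empty value): @task calls run serially
        return
    if max_workers is None:
        # Not hard-coded by the caller, so consult the environment.
        max_workers = int(os.environ.get(_WORKERS_ENV, "4"))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        _state["executor"], _state["futures"] = executor, []
        try:
            yield
            # Wait for every submitted task; re-raise the first failure.
            for future in _state["futures"]:
                future.result()
        finally:
            _state["executor"] = None


def task(func):
    """Submit the call to the active pool, or run it inline if there is none."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        executor = _state["executor"]
        if executor is None:
            return func(*args, **kwargs)  # no pool entered: plain serial call
        future = executor.submit(func, *args, **kwargs)
        _state["futures"].append(future)
        return future
    return wrapper


@task
def extract_and_load(table, loaded):
    # Stand-in for network I/O; a real script would call an API or database.
    loaded.append(table)


loaded = []
with threads(max_workers=4):
    for table in ["users", "orders", "events"]:
        extract_and_load(table, loaded)
# Leaving the block guarantees that all three calls have completed.
```

In this sketch a decorated call returns a Future inside the context manager
and the plain return value outside it, which is one more reason to treat
tasks as start-and-complete units of work.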

While it's possible to return a value from a @parallel.task method, I
encourage my team to use the decorator to start-and-complete work; think of
writing "embarrassingly parallel" methods that can be "mapped".

A couple of other things we've implemented include a "thread barrier" for
the case where we want one set of tasks to complete before another set of
tasks starts, and a decorator for factory methods to produce cached
thread-local objects (helpful for ensuring thread-safe access to network
clients that are not thread-safe).
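
As a rough illustration of those two helpers, the sketch below uses only the
standard library. The names barrier, thread_local_cached, and get_client
are hypothetical and are not taken from the gist; the barrier here simply
waits on a list of futures, which may differ from how our library does it.

```python
import functools
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def barrier(pending):
    """Block until every future in `pending` finishes, then clear the list."""
    done, _ = wait(pending)
    pending.clear()
    for future in done:
        future.result()  # re-raise any exception raised in a worker


def thread_local_cached(factory):
    """Cache one object per thread for a zero-argument factory function."""
    local = threading.local()

    @functools.wraps(factory)
    def wrapper():
        try:
            return local.instance
        except AttributeError:
            # First call on this thread: build and remember the object.
            local.instance = factory()
            return local.instance
    return wrapper


@thread_local_cached
def get_client():
    # Stand-in for a network client that is not thread-safe.
    return object()


order = []
with ThreadPoolExecutor(max_workers=3) as executor:
    # First set of tasks: each one fetches its thread's own client.
    pending = [
        executor.submit(lambda i=i: order.append(("extract", i, get_client())))
        for i in range(3)
    ]
    barrier(pending)  # all extracts are finished past this line
    executor.submit(order.append, ("load",))
```

Each worker thread builds its client once and reuses it on later calls, so
no client object is ever shared between threads.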

*Your feedback:*

- I'd love to hear your thoughts on my problem and solution.

- I've done a bit of research into existing libraries on PyPI and into PEPs,
but I don't see any similar libraries; are you aware of anything?

- What do you suggest I do next? I'm considering publishing it, but could
use some tips on what to do here!

Thanks!

Sean McIntyre
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/KGSMCQT4JIVFEPXULKIYMQOIZLQZUWW5/
Code of Conduct: http://python.org/psf/codeofconduct/
