[ https://issues.apache.org/jira/browse/BEAM-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977923#comment-16977923 ]
Valentyn Tymofieiev edited comment on BEAM-8651 at 11/19/19 11:19 PM:
----------------------------------------------------------------------

While investigating the example from [https://github.com/tensorflow/tfx/issues/928], I observe the following failure mode:

The SDK worker starts multiple threads. Each thread unpickles some payload, which calls
{code:java}
dill.loads [1]
{code}
Dill calls
{code:java}
pickle.Unpickler.find_class [2]
{code}
which calls
{code:java}
__import__ [3]
{code}
Concurrent import calls cause a deadlock on Python 3 (checked on Python 3.7.5rc1), but not on Python 2.7.

The following snippet, with calls extracted from the Chicago taxi pipeline,
{code:python}
import threading

def t1():
  return __import__("tensorflow_transform.beam.analyzer_impls")

def t2():
  return __import__("tensorflow_transform.tf_metadata.metadata_io")

def t3():
  return __import__("tensorflow.core.example.example_pb2")

threads = []
threads.append(threading.Thread(target=t1))
threads.append(threading.Thread(target=t2))
threads.append(threading.Thread(target=t3))

for thread in threads:
  thread.start()
{code}
fails with
{noformat}
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "dealock_repro.py", line 4, in t1
    return __import__("tensorflow_transform.beam.analyzer_impls", level=0)
  File "/home/valentyn/tmp/tfx_master_py37/lib/python3.7/site-packages/tensorflow_transform/__init__.py", line 23, in <module>
    from tensorflow_transform.output_wrapper import TFTransformOutput
  File "/home/valentyn/tmp/tfx_master_py37/lib/python3.7/site-packages/tensorflow_transform/output_wrapper.py", line 29, in <module>
    from tensorflow_transform.tf_metadata import metadata_io
  File "<frozen importlib._bootstrap>", line 980, in _find_and_load
  File "<frozen importlib._bootstrap>", line 149, in __enter__
  File "<frozen importlib._bootstrap>", line 94, in acquire
_frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('tensorflow_transform.tf_metadata.metadata_io') at 140070103765456
{noformat}

[1] [https://github.com/apache/beam/blob/cba445c8da93d9bdd01b30b2f54e9c3b52a98b7d/sdks/python/apache_beam/internal/pickler.py#L261]
[2] [https://github.com/uqfoundation/dill/blob/76e8472502a656f3ab6973cd8375cf7847f33842/dill/_dill.py#L462]
[3] [https://github.com/python/cpython/blob/4ffc569b47bef9f95e443f3c56f7e7e32cb440c0/Lib/pickle.py#L1426]


> Python 3 portable pipelines sometimes fail with errors in
> StockUnpickler.find_class()
> -------------------------------------------------------------------------------------
>
>                 Key: BEAM-8651
>                 URL: https://issues.apache.org/jira/browse/BEAM-8651
>             Project: Beam
>          Issue Type: Sub-task
>          Components: sdk-py-core
>            Reporter: Valentyn Tymofieiev
>            Assignee: Valentyn Tymofieiev
>            Priority: Blocker
>         Attachments: beam8651.py
>
>
> Several Beam users [1,2] reported an error which happens on Python 3 in
> StockUnpickler.find_class.
> So far I've seen reports of the error on Python 3.5, 3.6, and 3.7.1, on the
> Flink and Dataflow runners. On the Dataflow runner I have so far seen this
> only in streaming pipelines, which use the portable SDK worker.
> Typical stack trace:
> {noformat}
> File "python3.5/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1148, in _create_pardo_operation
>   dofn_data = pickler.loads(serialized_fn)
> File "python3.5/site-packages/apache_beam/internal/pickler.py", line 265, in loads
>   return dill.loads(s)
> File "python3.5/site-packages/dill/_dill.py", line 317, in loads
>   return load(file, ignore)
> File "python3.5/site-packages/dill/_dill.py", line 305, in load
>   obj = pik.load()
> File "python3.5/site-packages/dill/_dill.py", line 474, in find_class
>   return StockUnpickler.find_class(self, module, name)
> AttributeError: Can't get attribute 'ClassName' on <module 'ModuleName' from 'python3.5/site-packages/filename.py'>
> {noformat}
> According to Guenther from [1]:
> {quote}
> This looks exactly like a race condition that we've encountered on Python
> 3.7.1: There's a bug in some older 3.7.x releases that breaks the
> thread-safety of the unpickler, as concurrent unpickle threads can access a
> module before it has been fully imported. See
> https://bugs.python.org/issue34572 for more information.
> The traceback shows a Python 3.6 venv so this could be a different issue
> (the unpickle bug was introduced in version 3.7). If it's the same bug then
> upgrading to Python 3.7.3 or higher should fix that issue. One potential
> workaround is to ensure that all of the modules get imported during the
> initialization of the sdk_worker, as this bug only affects imports done by
> the unpickler.
> {quote}
> Opening this for visibility. Current open questions are:
> 1. Find a minimal example to reproduce this issue.
> 2. Figure out whether users are still affected by this issue on Python 3.7.3.
> 3. Communicate workarounds to the 3.5 and 3.6 users affected by this.
> [1]
> https://lists.apache.org/thread.html/5581ddfcf6d2ae10d25b834b8a61ebee265ffbcf650c6ec8d1e69408@%3Cdev.beam.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
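
For illustration only, here is a minimal sketch of the workaround quoted from Guenther above: import the needed modules on one thread during initialization, before any threads that unpickle payloads are started. This is not Beam's actual sdk_worker code. The names MODULES_TO_PRELOAD, preload_modules and import_in_thread are hypothetical, and standard-library modules stand in for a pipeline's real dependencies (such as the tensorflow_transform modules in the repro) so that the sketch runs anywhere. Once a module is already in sys.modules, a later import from a worker thread should not need to execute module code, which is where the nested imports that form the lock cycle come from.

```python
import importlib
import sys
import threading

# Hypothetical list: stdlib modules stand in for the pipeline's own
# dependencies (an assumption made so the sketch is self-contained).
MODULES_TO_PRELOAD = ["json.decoder", "email.mime.text", "xml.dom.minidom"]


def preload_modules(names):
    """Import each module on the calling thread; return the names that failed."""
    failed = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            failed.append(name)
    return failed


def import_in_thread(name, results):
    # Stands in for an import triggered by the unpickler in a worker thread.
    # The module was preloaded, so this returns the entry cached in sys.modules.
    results[name] = importlib.import_module(name)


# Step 1: preload sequentially, before any worker threads exist.
failed = preload_modules(MODULES_TO_PRELOAD)

# Step 2: only now start the threads that would otherwise import concurrently.
results = {}
threads = [
    threading.Thread(target=import_in_thread, args=(name, results))
    for name in MODULES_TO_PRELOAD
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Whether this is sufficient for a given pipeline depends on knowing the full set of modules the pickled payloads refer to, which this sketch simply assumes is available up front.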