[ https://issues.apache.org/jira/browse/BEAM-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977923#comment-16977923 ]

Valentyn Tymofieiev edited comment on BEAM-8651 at 11/19/19 11:19 PM:
----------------------------------------------------------------------

While investigating the example from 
[https://github.com/tensorflow/tfx/issues/928] I observe the following failure 
mode:

The SDK worker starts multiple threads. Each thread happens to unpickle some 
payload, which calls {{dill.loads}} [1]. Dill calls 
{{pickle.Unpickler.find_class}} [2], which in turn calls {{__import__}} [3].

Concurrent import calls cause a deadlock on Python 3 (checked on Python 
3.7.5rc1), but not on Python 2.7. The following snippet, with calls extracted 
from the Chicago taxi pipeline,
{code:python}
import threading

def t1():
    return __import__("tensorflow_transform.beam.analyzer_impls")

def t2():
    return __import__("tensorflow_transform.tf_metadata.metadata_io")

def t3():
    return __import__("tensorflow.core.example.example_pb2")

threads = []
threads.append(threading.Thread(target=t1))
threads.append(threading.Thread(target=t2))
threads.append(threading.Thread(target=t3))

for thread in threads:
    thread.start()

for thread in threads:
    thread.join()
{code}
fails with
{noformat}
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "dealock_repro.py", line 4, in t1
    return __import__("tensorflow_transform.beam.analyzer_impls", level=0)
  File "/home/valentyn/tmp/tfx_master_py37/lib/python3.7/site-packages/tensorflow_transform/__init__.py", line 23, in <module>
    from tensorflow_transform.output_wrapper import TFTransformOutput
  File "/home/valentyn/tmp/tfx_master_py37/lib/python3.7/site-packages/tensorflow_transform/output_wrapper.py", line 29, in <module>
    from tensorflow_transform.tf_metadata import metadata_io
  File "<frozen importlib._bootstrap>", line 980, in _find_and_load
  File "<frozen importlib._bootstrap>", line 149, in __enter__
  File "<frozen importlib._bootstrap>", line 94, in acquire
_frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('tensorflow_transform.tf_metadata.metadata_io') at 140070103765456
{noformat}
[1] [https://github.com/apache/beam/blob/cba445c8da93d9bdd01b30b2f54e9c3b52a98b7d/sdks/python/apache_beam/internal/pickler.py#L261]
[2] [https://github.com/uqfoundation/dill/blob/76e8472502a656f3ab6973cd8375cf7847f33842/dill/_dill.py#L462]
[3] [https://github.com/python/cpython/blob/4ffc569b47bef9f95e443f3c56f7e7e32cb440c0/Lib/pickle.py#L1426]
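A minimal sketch of the pre-import workaround applied to a repro of this shape. The module names are stdlib stand-ins for the tensorflow_transform submodules above, so the snippet runs without TFX installed:

```python
import importlib
import sys
import threading

# Stand-in module names; the real repro used tensorflow_transform submodules.
MODULES = ["json.decoder", "xml.etree.ElementTree", "csv"]

# Import everything serially on the main thread first...
for name in MODULES:
    importlib.import_module(name)

# ...so that later concurrent __import__ calls from worker threads are
# satisfied straight from sys.modules and never contend on per-module
# import locks.
threads = [threading.Thread(target=__import__, args=(name,))
           for name in MODULES]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert all(name in sys.modules for name in MODULES)
```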



> Python 3 portable pipelines sometimes fail with errors in 
> StockUnpickler.find_class()
> -------------------------------------------------------------------------------------
>
>                 Key: BEAM-8651
>                 URL: https://issues.apache.org/jira/browse/BEAM-8651
>             Project: Beam
>          Issue Type: Sub-task
>          Components: sdk-py-core
>            Reporter: Valentyn Tymofieiev
>            Assignee: Valentyn Tymofieiev
>            Priority: Blocker
>         Attachments: beam8651.py
>
>
> Several Beam users [1,2] reported an error which happens on Python 3 in 
> StockUnpickler.find_class.
> So far I've seen reports of the error on Python 3.5, 3.6, and 3.7.1, on the 
> Flink and Dataflow runners. On the Dataflow runner I have so far seen this 
> only in streaming pipelines, which use the portable SDK worker.
> Typical stack trace:                                                    
> {noformat}
>   File "python3.5/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1148, in _create_pardo_operation
>     dofn_data = pickler.loads(serialized_fn)
>   File "python3.5/site-packages/apache_beam/internal/pickler.py", line 265, in loads
>     return dill.loads(s)
>   File "python3.5/site-packages/dill/_dill.py", line 317, in loads
>     return load(file, ignore)
>   File "python3.5/site-packages/dill/_dill.py", line 305, in load
>     obj = pik.load()
>   File "python3.5/site-packages/dill/_dill.py", line 474, in find_class
>     return StockUnpickler.find_class(self, module, name)
> AttributeError: Can't get attribute 'ClassName' on <module 'ModuleName' from 'python3.5/site-packages/filename.py'>
> {noformat}
> According to Guenther from [1]:
> {quote}
> This looks exactly like a race condition that we've encountered on Python
> 3.7.1: There's a bug in some older 3.7.x releases that breaks the
> thread-safety of the unpickler, as concurrent unpickle threads can access a
> module before it has been fully imported. See
> https://bugs.python.org/issue34572 for more information.
> The traceback shows a Python 3.6 venv so this could be a different issue
> (the unpickle bug was introduced in version 3.7). If it's the same bug then
> upgrading to Python 3.7.3 or higher should fix that issue. One potential
> workaround is to ensure that all of the modules get imported during the
> initialization of the sdk_worker, as this bug only affects imports done by
> the unpickler.
> {quote}
> Opening this for visibility. Current open questions are:
> 1. Find a minimal example to reproduce this issue.
> 2. Figure out whether users are still affected by this issue on Python 3.7.3.
> 3. Communicate workarounds to Python 3.5 and 3.6 users affected by this.
> [1] 
> https://lists.apache.org/thread.html/5581ddfcf6d2ae10d25b834b8a61ebee265ffbcf650c6ec8d1e69408@%3Cdev.beam.apache.org%3E
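For users stuck on affected interpreters, a hedged sketch of one mitigation direction: serialize unpickling itself, so that imports triggered inside Unpickler.find_class() never run concurrently. The lock and the safe_loads wrapper are illustrative names, not part of Beam or dill:

```python
import pickle
import threading

# Illustrative, not a Beam API: a process-wide lock guarding unpickling.
_UNPICKLE_LOCK = threading.Lock()

def safe_loads(payload):
    # At most one thread performs a pickle-triggered import at a time,
    # sidestepping the concurrent-import race.
    with _UNPICKLE_LOCK:
        return pickle.loads(payload)

# Usage: worker threads call safe_loads() instead of dill.loads()/pickle.loads().
payload = pickle.dumps({"pipeline": "chicago-taxi", "step": 3})
results = []
threads = [threading.Thread(target=lambda: results.append(safe_loads(payload)))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert results == [{"pipeline": "chicago-taxi", "step": 3}] * 4
```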



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
