Hm, interesting. Are we using pybind anywhere? I didn't see any references
to it. I can give it a try on Python 3.8 too, though.
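For what it's worth, one quick way to check whether pybind11 sneaks in
transitively is to probe for it in the environment; this is just my own
sketch and assumes the package would be importable as "pybind11" if it were
present:

import importlib.util

# Probe for pybind11 without importing it; find_spec returns None if the
# package isn't installed in this environment.
spec = importlib.util.find_spec("pybind11")
print("pybind11:", spec.origin if spec else "not found")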
On Wed, Dec 8, 2021 at 9:19 AM Brian Hulette wrote:
> A Google search for "__import__ not found" turned up an issue filed with
> pybind [1]. I can't deduce a root cause from the discussion there, but it
> looks like they didn't experience the issue in Python 3.8 - it could be
> interesting to see if your problem goes away there.
>
> It looks like +Charles Chen added the __import__('re')
> workaround in [2], maybe he remembers what was going on?
>
> [1] https://github.com/pybind/pybind11/issues/2557
> [2] https://github.com/apache/beam/pull/5071
>
> On Wed, Dec 8, 2021 at 5:30 AM Steve Niemitz wrote:
>
>> Yeah, I can't imagine this is a "normal" problem.
>>
>> I'm on Linux with Python 3.7. My script does have an `if __name__ == '__main__':`
>> block.
>>
>> On Wed, Dec 8, 2021 at 12:38 AM Ning Kang wrote:
>>
>>> I tried a pipeline:
>>>
>>> p = beam.Pipeline(DataflowRunner(), options=options)
>>> text = p | beam.Create(['Hello World, Hello You'])
>>>
>>> def tokenize(x):
>>>     import re
>>>     return re.findall('Hello', x)
>>>
>>> flatten = text | 'Split' >> (beam.FlatMap(tokenize).with_output_types(str))
>>> pipeline_result = p.run()
>>>
>>> Didn't run into the issue.
>>>
>>> What OS and Python version are you using? Does your script have an
>>> `if __name__ == '__main__':` block?
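>>> To be clear, I mean a guard roughly like the sketch below (placeholder
>>> names, just an illustration):
>>>
>>> import apache_beam as beam
>>>
>>> def run():
>>>     with beam.Pipeline() as p:
>>>         p | beam.Create(['Hello World, Hello You']) | beam.FlatMap(str.split)
>>>
>>> if __name__ == '__main__':
>>>     # Only launch the pipeline when the script is executed directly,
>>>     # not when the module is merely imported.
>>>     run()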
>>>
>>> On Tue, Dec 7, 2021 at 6:58 PM Steve Niemitz wrote:
>>>
I have a fairly simple Python word count job (although the packaging is
a little more complicated) that I'm trying to run. (Note: I'm explicitly
NOT using save_main_session.)
It has a function that tokenizes the incoming text into words, modeled on
how the wordcount example works:
def tokenize(row):
    import re
    return re.findall(r'[A-Za-z\']+', row.text)

which is then used as the function for a FlatMap:

| 'Split' >> (
    beam.FlatMap(tokenize).with_output_types(str))
However, if I run this job on Dataflow (2.33), the Python runner fails
with a bizarre error:
INFO:apache_beam.runners.dataflow.dataflow_runner:2021-12-07T20:59:59.704Z: JOB_MESSAGE_ERROR: Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1232, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 572, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/tmp/tmpq_8l154y/wordcount_test.py", line 75, in tokenize
ImportError: __import__ not found
I was able to find an example in the streaming wordcount snippet that
does something similar, but in a very strange way [1]:
| 'ExtractWords' >> beam.FlatMap(
    lambda x: __import__('re').findall(r'[A-Za-z\']+', x))
For whatever reason, this actually fixed the issue in my job as well. I
can't for the life of me understand why this works, or why the normal
import fails. Someone else must have run into this same issue, though, for
that streaming wordcount example to be written like that. Any ideas what's
going on here?
[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py#L692
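For completeness, the job boils down to roughly the sketch below; the real
input source and pipeline options are more involved, and beam.Row with a
hard-coded element is just a stand-in to keep the snippet self-contained:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def tokenize(row):
    import re  # this is the import that blows up on the Dataflow worker
    return re.findall(r'[A-Za-z\']+', row.text)


def run():
    # save_main_session is deliberately left off (False).
    options = PipelineOptions(save_main_session=False)
    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create([beam.Row(text='Hello World, Hello You')])
         | 'Split' >> beam.FlatMap(tokenize).with_output_types(str))


if __name__ == '__main__':
    run()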