hm, interesting, are we using pybind anywhere?  I didn't see any references
to it.  I can give it a try on python 3.8 too though.

On Wed, Dec 8, 2021 at 9:19 AM Brian Hulette <bhule...@google.com> wrote:

> A google search for "__import__ not found" turned up an issue filed with
> pybind [1]. I can't deduce a root cause from the discussion there, but it
> looks like they didn't experience the issue in Python 3.8 - it could be
> interesting to see if your problem goes away there.
>
> It looks like +Charles Chen <c...@google.com> added the __import__('re')
> workaround in [2], maybe he remembers what was going on?
>
> [1] https://github.com/pybind/pybind11/issues/2557
> [2] https://github.com/apache/beam/pull/5071
>
> On Wed, Dec 8, 2021 at 5:30 AM Steve Niemitz <sniem...@apache.org> wrote:
>
>> Yeah, I can't imagine this is a "normal" problem.
>>
>> I'm on linux w/ py 3.7.  My script does have a __name__ == '__main__'
>> block.
>>
>> On Wed, Dec 8, 2021 at 12:38 AM Ning Kang <ni...@google.com> wrote:
>>
>>> I tried a pipeline:
>>>
>>> p = beam.Pipeline(DataflowRunner(), options=options)
>>> text = p | beam.Create(['Hello World, Hello You'])
>>>
>>>
>>> def tokenize(x):
>>>     import re
>>>     return re.findall('Hello', x)
>>>
>>>
>>> flatten = text | 'Split' >>
>>> (beam.FlatMap(tokenize).with_output_types(str))
>>> pipeline_result = p.run()
>>>
>>>
>>> Didn't run into the issue.
>>>
>>> What OS and Python version are you using? Does your script come with a
>>> `if __name__ == '__main__': `?
>>>
>>> On Tue, Dec 7, 2021 at 6:58 PM Steve Niemitz <sniem...@apache.org>
>>> wrote:
>>>
>>>> I have a fairly simple python word count job (although the packaging is
>>>> a little more complicated) that I'm trying to run.  (note: I'm explicitly
>>>> NOT using save_main_session.)
>>>>
>>>> In it is a method to tokenize the incoming text to words, and I used
>>>> something similar to how the wordcount example worked.
>>>>
>>>> def tokenize(row):
>>>>   import re
>>>>   return re.findall(r'[A-Za-z\']+', row.text)
>>>>
>>>> which is then used as the function for a FlatMap:
>>>> | 'Split' >> (
>>>>         beam.FlatMap(tokenize).with_output_types(str))
>>>>
>>>> However, if I run this job on dataflow (2.33), the python runner fails
>>>> with a bizarre error:
>>>> INFO:apache_beam.runners.dataflow.dataflow_runner:2021-12-07T20:59:59.704Z:
>>>> JOB_MESSAGE_ERROR: Traceback (most recent call last):
>>>>   File "apache_beam/runners/common.py", line 1232, in
>>>> apache_beam.runners.common.DoFnRunner.process
>>>>   File "apache_beam/runners/common.py", line 572, in
>>>> apache_beam.runners.common.SimpleInvoker.invoke_process
>>>>   File "/tmp/tmpq_8l154y/wordcount_test.py", line 75, in tokenize
>>>> ImportError: __import__ not found
>>>>
>>>> I was able to find an example in the streaming wordcount snippet that
>>>> did something similar, but very strange [1]:
>>>>         | 'ExtractWords' >>
>>>>         beam.FlatMap(lambda x: __import__('re').findall(r'[A-Za-z\']+',
>>>> x))
>>>>
>>>> For whatever reason this actually fixed the issue in my job as well.  I
>>>> can't for the life of me understand why this works, or why the normal
>>>> import fails.  Someone else must have run into this same issue though for
>>>> that streaming wordcount example to be like that.  Any ideas what's going
>>>> on here?
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py#L692
>>>>
>>>

Reply via email to