Yeah, I can't imagine this is a "normal" problem. I'm on linux w/ py 3.7. My script does have a __name__ == '__main__' block.
On Wed, Dec 8, 2021 at 12:38 AM Ning Kang <ni...@google.com> wrote: > I tried a pipeline: > > p = beam.Pipeline(DataflowRunner(), options=options) > text = p | beam.Create(['Hello World, Hello You']) > > > def tokenize(x): > import re > return re.findall('Hello', x) > > > flatten = text | 'Split' >> (beam.FlatMap(tokenize).with_output_types(str)) > pipeline_result = p.run() > > > Didn't run into the issue. > > What OS and Python version are you using? Does your script come with a `if > __name__ == '__main__': `? > > On Tue, Dec 7, 2021 at 6:58 PM Steve Niemitz <sniem...@apache.org> wrote: > >> I have a fairly simple python word count job (although the packaging is a >> little more complicated) that I'm trying to run. (note: I'm explicitly NOT >> using save_main_session.) >> >> In it is a method to tokenize the incoming text to words, and I used >> something similar to how the wordcount example worked. >> >> def tokenize(row): >> import re >> return re.findall(r'[A-Za-z\']+', row.text) >> >> which is then used as the function for a FlatMap: >> | 'Split' >> ( >> beam.FlatMap(tokenize).with_output_types(str)) >> >> However, if I run this job on dataflow (2.33), the python runner fails >> with a bizarre error: >> INFO:apache_beam.runners.dataflow.dataflow_runner:2021-12-07T20:59:59.704Z: >> JOB_MESSAGE_ERROR: Traceback (most recent call last): >> File "apache_beam/runners/common.py", line 1232, in >> apache_beam.runners.common.DoFnRunner.process >> File "apache_beam/runners/common.py", line 572, in >> apache_beam.runners.common.SimpleInvoker.invoke_process >> File "/tmp/tmpq_8l154y/wordcount_test.py", line 75, in tokenize >> ImportError: __import__ not found >> >> I was able to find an example in the streaming wordcount snippet that did >> something similar, but very strange [1]: >> | 'ExtractWords' >> >> beam.FlatMap(lambda x: __import__('re').findall(r'[A-Za-z\']+', >> x)) >> >> For whatever reason this actually fixed the issue in my job as well. I >> can't for the life of me understand why this works, or why the normal >> import fails. Someone else must have run into this same issue though for >> that streaming wordcount example to be like that. Any ideas what's going >> on here? >> >> [1] >> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py#L692 >> >