Hi Levan,

Great to hear that your issue is resolved!

For the follow-up question, I am not very familiar with AWS EMR's Flink configuration, but judging from the error you attached, it looks like the PyFlink bundled in the Flink binary zip may not ship some of its 'google' dependencies, so it falls back to looking for them in your Python environment. cc @hxbks...@gmail.com

For now, to manage complex Python dependencies, the typical way to run PyFlink on a multi-node cluster in production is to create your own venv and use it in your `flink run` command or in the Python code. You can refer to this doc <https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/python/faq/#preparing-python-virtual-environment> for details.
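In case it helps, the workflow from that FAQ looks roughly like the commands below. This is only a sketch: the archive name `venv.zip`, the pinned apache-flink version and the interpreter paths are illustrative, and the helper script in the FAQ builds a more portable environment than a plain venv zipped by hand.

# build a virtual environment containing apache-flink and zip it
# (illustrative names; match the apache-flink version to your Flink version)
python3 -m venv venv
./venv/bin/pip install apache-flink==1.15.2
zip -r venv.zip venv

# -pyarch ships the zipped environment with the job, -pyexec picks the
# interpreter inside the extracted archive for the Python workers, and
# -pyclientexec picks the interpreter used on the client side
./bin/flink run \
    -pyarch venv.zip \
    -pyexec venv.zip/venv/bin/python3 \
    -pyclientexec venv/bin/python3 \
    -py examples/python/datastream/word_count.py
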
Best,
Biao Geng

Levan Huyen <lvhu...@gmail.com> wrote on Thu, 20 Oct 2022 at 14:11:

> Hi Biao,
>
> Thanks for your help. That solved my issue. It turned out that in setup1 (on EMR) I had apache-flink installed, but the package (and its dependencies) were not in the directory `/usr/lib/python3.7/site-packages` (the one corresponding to the Python binary in `/usr/bin/python3`). For some reason, the packages were in the current user's location (`~/.local/...`), which Flink did not look at.
>
> BTW, is there any way to use the pyflink shipped with the Flink binary zip file that I downloaded from Apache's site? On EMR, that package is included, and it feels awkward to have to install another version with `pip install`. It is also confusing where the dependency jars should then be added.
>
> Thanks and regards,
> Levan Huyen
>
> On Thu, 20 Oct 2022 at 02:25, Biao Geng <biaoge...@gmail.com> wrote:
>
>> Hi Levan,
>>
>> For your setup1 & 2, it looks like the Python environment is not ready. Have you tried `python -m pip install apache-flink` for the first two setups?
>> For your setup3, since you are using the `flink run ...` command, it will try to connect to a launched Flink cluster, but I guess you did not launch one. You can run `start-cluster.sh` first to launch a standalone Flink cluster and then try the `flink run ...` command again.
>> For your setup4, it works because the job runs on the default mini cluster, so it works even though you have not started a standalone cluster.
>>
>> Best,
>> Biao Geng
>>
>> Levan Huyen <lvhu...@gmail.com> wrote on Wed, 19 Oct 2022 at 17:07:
>>
>>> Hi,
>>>
>>> I'm new to PyFlink, and I couldn't run a basic example that ships with Flink. This is the command I tried:
>>>
>>> ./bin/flink run -py examples/python/datastream/word_count.py
>>>
>>> Below are the results I got with different setups:
>>>
>>> 1. On AWS EMR 6.8.0 (Flink 1.15.1): *Error: No module named 'google'*. I tried both the Flink shipped with EMR and the binary v1.15.1/v1.15.2 downloaded from the Flink site, and got the same error message in all cases.
>>>
>>> Traceback (most recent call last):
>>>   File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
>>>     "__main__", mod_spec)
>>>   File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
>>>     exec(code, run_globals)
>>>   File "/tmp/pyflink/622f19cc-6616-45ba-b7cb-20a1462c4c3f/a29eb7a4-fff1-4d98-9b38-56b25f97b70a/word_count.py", line 134, in <module>
>>>     word_count(known_args.input, known_args.output)
>>>   File "/tmp/pyflink/622f19cc-6616-45ba-b7cb-20a1462c4c3f/a29eb7a4-fff1-4d98-9b38-56b25f97b70a/word_count.py", line 89, in word_count
>>>     ds = ds.flat_map(split) \
>>>   File "/usr/lib/flink/opt/python/pyflink.zip/pyflink/datastream/data_stream.py", line 333, in flat_map
>>>   File "/usr/lib/flink/opt/python/pyflink.zip/pyflink/datastream/data_stream.py", line 557, in process
>>>   File "/usr/lib/flink/opt/python/pyflink.zip/pyflink/fn_execution/flink_fn_execution_pb2.py", line 23, in <module>
>>> ModuleNotFoundError: No module named 'google'
>>>
>>> org.apache.flink.client.program.ProgramAbortException: java.lang.RuntimeException: Python process exits with code: 1
>>>
>>> 2. On my Mac, without a virtual environment (so `-pyclientexec python3` is included in the run command): I got the same error as on EMR, but there is a stdout line from the `print()` in the Python script:
>>>
>>>   File "/Users/lvh/dev/tmp/flink-1.15.2/opt/python/pyflink.zip/pyflink/datastream/data_stream.py", line 557, in process
>>>   File "<frozen zipimport>", line 259, in load_module
>>>   File "/Users/lvh/dev/tmp/flink-1.15.2/opt/python/pyflink.zip/pyflink/fn_execution/flink_fn_execution_pb2.py", line 23, in <module>
>>> ModuleNotFoundError: No module named 'google'
>>>
>>> Executing word_count example with default input data set.
>>> Use --input to specify file input.
>>>
>>> org.apache.flink.client.program.ProgramAbortException: java.lang.RuntimeException: Python process exits with code: 1
>>>
>>> 3. On my Mac, with a virtual environment and the Python package `apache-flink` installed: Flink tried to connect to localhost:8081 (I don't know why) and failed with 'connection refused'.
>>>
>>> Caused by: org.apache.flink.util.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
>>>         at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:395)
>>>         ... 21 more
>>> Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8081
>>>
>>> 4. If I run that same example job directly with Python (`python word_count.py`), it runs well.
>>>
>>> I tried with both v1.15.2 and v1.15.1, and with Python 3.7 & 3.8, and got the same result.
>>>
>>> Could someone please help?
>>>
>>> Thanks.
>>
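
P.S. For completeness, here are the earlier suggestions as concrete commands. Again only a sketch: the interpreter path and the version pin are assumptions about your environment, so please adjust them to your setup.

# setups 1 & 2: install apache-flink for the interpreter that Flink actually
# invokes (assumed here to be /usr/bin/python3), not only into ~/.local
sudo /usr/bin/python3 -m pip install apache-flink==1.15.1

# setup 3: `flink run` submits to a running cluster, so start one first
./bin/start-cluster.sh
./bin/flink run -py examples/python/datastream/word_count.py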