Well, now I know what the problem could be.

You are trying to execute a job on a cluster (== not local), but have set the local flag to true.
    env.execute(local=True)

Due to this flag the files are only copied into the tmp directory of the node where you execute the plan, and are thus not accessible from other worker nodes.

In order to use the Python API on a cluster you *must* have a filesystem that is accessible by all workers (like HDFS) to which the files can be copied. From there they can be distributed to the nodes via the Distributed Cache (DC).
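
For illustration, the same DC mechanism is available from the user-facing Java API; below is a minimal sketch (not the Python API's internal code), where the HDFS path and file name are placeholders:

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;

    import java.io.File;

    public class SharedFsCacheExample {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Placeholder path: must be on a filesystem every worker can read.
            env.registerCachedFile("hdfs:///shared/some-file.txt", "someFile");

            env.fromElements("a", "b", "c")
                .map(new RichMapFunction<String, String>() {
                    @Override
                    public void open(Configuration parameters) throws Exception {
                        // Blocks until the file has been copied onto this worker.
                        File f = getRuntimeContext().getDistributedCache().getFile("someFile");
                        System.out.println("cached copy: " + f.getAbsolutePath());
                    }

                    @Override
                    public String map(String value) {
                        return value;
                    }
                })
                .print(); // print() triggers execution in the batch API
        }
    }

Note that getFile() inside open() is exactly the call that fails in your stack traces.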

On 17.07.2016 17:33, Geoffrey Mon wrote:
I haven't yet figured out how to write a Java job to test DistributedCache functionality between machines; I've only gotten worker nodes to create caches from local files (on the same worker nodes), rather than from files on the master node. The DistributedCache test I've been using (based on the DistributedCacheTest unit test) is here:
https://gist.github.com/GEOFBOT/041d76b47f08919305493f57ebdde0f7

I realized that this test was only exercising local files: I kept getting an error that the cache's source file could not be found, until I created that file on the worker node at the location specified in the plan.
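
In sketch form, the test boils down to something like the following (a rough reconstruction, not the gist's exact code; the probe file path is a placeholder). The comment marks the pitfall:

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class CacheProbe {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Pitfall: with a file:// path, each taskmanager looks for
            // /tmp/probe.txt on its *own* disk, so the job can succeed even
            // when nothing was actually distributed across machines.
            env.registerCachedFile("file:///tmp/probe.txt", "probe");

            env.fromElements(1, 2, 3)
                .map(new RichMapFunction<Integer, String>() {
                    private String firstLine;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        try (BufferedReader r = new BufferedReader(new FileReader(
                                getRuntimeContext().getDistributedCache().getFile("probe")))) {
                            firstLine = r.readLine();
                        }
                    }

                    @Override
                    public String map(Integer value) {
                        return value + ": " + firstLine;
                    }
                })
                .print();
        }
    }

So a test like this only demonstrates cross-machine distribution if the registered path does not already exist on the worker nodes.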

I've been trying to run a simple Python example that does word counting:
https://gist.github.com/GEOFBOT/dbdc30120fb4d71383d9e3eff5f93c1f

I've tried three different setups so far: virtual machines, AWS EC2 instances, and physical machines. With each setup, I get the same errors.

With all three of these setups, basic Java jobs (like WordCount and PageRank) run fine, but Python programs cannot be run because the files needed to run them are not properly distributed to the worker nodes. I've found that although the master node reads the Python libraries and plan files (presumably to send them to the workers), the worker node never writes any of those files to disk, even though the files are added to the distributed cache's file list via DistributedCache.writeFileInfoToConfig (which I found via remote debugging).

When a Python program is run via pyflink, it starts executing but crashes as soon as it reaches any mapping operation. The following exception is thrown:

2016-07-17 09:39:50,857 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - MapPartition (PythonFlatMap -> PythonMap) (1/1) (12fbc21f0424ab87f3ef579fbe73b0b3) switched from RUNNING to FAILED
2016-07-17 09:39:50,863 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job 3685c180d7eb004154e8f9d94996e0cf (Flink Java Job at Sun Jul 17 09:39:49 EDT 2016) changed to FAILING.
java.lang.Exception: The user defined 'open()' method caused an exception: An error occurred while copying the file.
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:481)
<snip>
Caused by: java.lang.RuntimeException: An error occurred while copying the file.
at org.apache.flink.api.common.cache.DistributedCache.getFile(DistributedCache.java:78)
at org.apache.flink.python.api.streaming.data.PythonStreamer.startPython(PythonStreamer.java:102)
<snip>
Caused by: java.io.FileNotFoundException: File file:/tmp/flink does not exist or the user running Flink ('gmon') has insufficient permissions to access it.
at org.apache.flink.core.fs.local.LocalFileSystem.getFileStatus(LocalFileSystem.java:109)
at org.apache.flink.runtime.filecache.FileCache.copy(FileCache.java:242)
at org.apache.flink.runtime.filecache.FileCache$CopyProcess.call(FileCache.java:322)
<snip>
... 1 more

If the pyflink library is manually copied into place at /tmp/flink, that error will be replaced by the following:

2016-07-17 00:10:54,342 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - MapPartition (PythonFlatMap -> PythonMap) (1/1) (23591303d5b571a6b3e9b68ef51c5a8e) switched from RUNNING to FAILED
2016-07-17 00:10:54,348 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job e072403ffec32bd14b54416b53cb46ae (Flink Java Job at Sun Jul 17 00:10:51 EDT 2016) changed to FAILING.
java.lang.Exception: The user defined 'open()' method caused an exception: External process for task MapPartition (PythonFlatMap -> PythonMap) terminated prematurely.
python3: can't open file '/tmp/flink-dist-cache-724a9274-f984-429c-a0bf-4f5f23c5cfbc/e072403ffec32bd14b54416b53cb46ae/flink/plan.py': [Errno 2] No such file or directory
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:481)
<snip>
Caused by: java.lang.RuntimeException: External process for task MapPartition (PythonFlatMap -> PythonMap) terminated prematurely.
python3: can't open file '/tmp/flink-dist-cache-724a9274-f984-429c-a0bf-4f5f23c5cfbc/e072403ffec32bd14b54416b53cb46ae/flink/plan.py': [Errno 2] No such file or directory
at org.apache.flink.python.api.streaming.data.PythonStreamer.startPython(PythonStreamer.java:144)
at org.apache.flink.python.api.streaming.data.PythonStreamer.open(PythonStreamer.java:92)
at org.apache.flink.python.api.functions.PythonMapPartition.open(PythonMapPartition.java:48)
<snip>
... 3 more

Both of these exceptions point to Flink not properly copying the requested files. Has anyone else successfully run Python jobs on a Flink cluster, or is there a bug preventing successful operation? Unfortunately, I am relying on a Flink cluster to run a Python job on some scientific data, and that work needs to be completed soon.

Thanks for your assistance,
Geoffrey

On Sun, Jul 17, 2016 at 4:04 AM Chesnay Schepler <ches...@apache.org> wrote:

    Please also post the job you're trying to run.


    On 17.07.2016 08:43, Geoffrey Mon wrote:
    The Java program I used to test DistributedCache was faulty since
    it actually created the cache from files on the machine on which
    the program was running (i.e. the worker node).

    I tried setting up a cluster again, this time using two actual
    machines instead of virtual machines. I found the same error:
    the Python libraries and plan file are not found in the
    temporary directory. Has anyone else been able to successfully
    set up a Flink cluster to run Python jobs? I'm beginning to
    suspect that there may be a bug in Flink that prevents Python
    jobs from running on clusters.

    Cheers,
    Geoffrey

    On Fri, Jul 15, 2016 at 11:28 AM Geoffrey Mon <geof...@gmail.com> wrote:

        I wrote a simple Java plan that reads a file in the
        distributed cache and uses the first line from that file in a
        map operation. Sure enough, it works locally, but fails when
        the job is sent to a taskmanager on a worker node. Since
        DistributedCache seems to work for everyone else, I'm
        thinking that maybe some file permissions are not set
        properly, so that Flink is not able to write the
        distributed cache files.

        I used inotify-tools to watch the temporary files directory
        on both the master node and worker node. While the plan was
        being prepared, the jobmanager node wrote the Python modules
        and plan file to its temporary files directory. However, on
        the worker node, the directory tree was created, but the job
        failed before Flink even attempted to write any of the
        module or plan files. Interestingly enough, there were no
        error messages or warnings about the cache.

        Cheers,
        Geoffrey

        On Fri, Jul 15, 2016 at 4:15 AM Chesnay Schepler <ches...@apache.org> wrote:

            Could you write a Java job that uses the Distributed
            Cache to distribute files?

            If this fails, then the DC is faulty; if it doesn't,
            something in the Python API is wrong.


            On 15.07.2016 08:06, Geoffrey Mon wrote:
            I've come across similar issues when trying to set up
            Flink on Amazon EC2 instances. Is there perhaps
            something wrong with my setup? Here is the
            flink-conf.yaml I am using:
            https://gist.githubusercontent.com/GEOFBOT/3ffc9b21214174ae750cc3fdb2625b71/raw/flink-conf.yaml

            Thanks,
            Geoffrey

            On Wed, Jul 13, 2016 at 1:15 PM Geoffrey Mon <geof...@gmail.com> wrote:

                Hello,

                Here is the TaskManager log on pastebin:
                http://pastebin.com/XAJ56gn4

                I will look into whether the files were created.

                By the way, the cluster is made with virtual
                machines running on BlueData EPIC. I don't know if
                that might be related to the problem.

                Thanks,
                Geoffrey


                On Wed, Jul 13, 2016 at 6:12 AM Chesnay Schepler <ches...@apache.org> wrote:

                    Hello Geoffrey,

                    How often does this occur?

                    Flink distributes the user-code and the python
                    library using the Distributed Cache.

                    Either the file is deleted right after being
                    created for some reason, or the DC returns a
                    file name before the file was created (which
                    shouldn't happen; it should block until the
                    file is available).

                    If you are up to debugging this, I would suggest
                    looking into the FileCache class and verifying
                    whether the file in question is in fact created.

                    The logs of the TaskManager on which the
                    exception occurs could be of interest too; could
                    you send them to me?

                    Regards,
                    Chesnay


                    On 13.07.2016 04:11, Geoffrey Mon wrote:
                    Hello all,

                    I've set up Flink on a very small cluster of
                    one master node and five worker nodes,
                    following the instructions in the documentation
                    (https://ci.apache.org/projects/flink/flink-docs-master/setup/cluster_setup.html).
                    I can run the included examples like WordCount
                    and PageRank across the entire cluster, but
                    when I try to run simple Python examples, I
                    sometimes get a strange error on the first
                    PythonMapPartition about the temporary folders
                    that contain the streams of data between Python
                    and Java.

                    If I run jobs on only the taskmanager on the
                    master node, Python examples run fine. However,
                    if the jobs use the worker nodes, then I get
                    the following error:

                    org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Job execution failed.
                    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:378)
                    <snip>
                    Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
                    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:806)
                    <snip>
                    Caused by: java.lang.Exception: The user defined 'open()' method caused an exception: External process for task MapPartition (PythonMap) terminated prematurely.
                    python: can't open file '/home/bluedata/flink/tmp/flink-dist-cache-d1958549-3d58-4554-b286-cb4b1cdb9060/52feefa8892e61ba9a187f7684c84ada/flink/plan.py': [Errno 2] No such file or directory
                    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:481)
                    <snip>
                    Caused by: java.lang.RuntimeException: External process for task MapPartition (PythonMap) terminated prematurely.
                    python: can't open file '/home/bluedata/flink/tmp/flink-dist-cache-d1958549-3d58-4554-b286-cb4b1cdb9060/52feefa8892e61ba9a187f7684c84ada/flink/plan.py': [Errno 2] No such file or directory
                    at org.apache.flink.python.api.streaming.data.PythonStreamer.startPython(PythonStreamer.java:144)
                    at org.apache.flink.python.api.streaming.data.PythonStreamer.open(PythonStreamer.java:92)
                    at org.apache.flink.python.api.functions.PythonMapPartition.open(PythonMapPartition.java:48)
                    at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:38)
                    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:477)
                    ... 5 more

                    I suspect this issue has something to do with
                    how data is sent between the master and the
                    workers, but I haven't been able to find any
                    solutions. Presumably the temporary files
                    weren't received properly and thus were never
                    created?

                    Thanks in advance.

                    Cheers,
                    Geoffrey




