Hi all,

We're working on a project where we're limited to one big development
machine for now. We want to start developing data processing pipelines in
Python, which should eventually be ported to a currently unknown setup on a
separate cluster or cloud, so we went with Beam for its portability.

For the development setup, we wanted to have the least amount of overhead
possible, so we deployed a one node flink cluster with docker-compose. The
whole setup is defined by the following docker-compose.yml:

```
version: "2.1"
services:
  flink-jobmanager:
    image: flink:1.9
    network_mode: host
    command: jobmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=localhost

  flink-taskmanager:
    image: flink:1.9
    network_mode: host
    depends_on:
      - flink-jobmanager
    command: taskmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=localhost
    volumes:
      - staging-dir:/tmp/beam-artifact-staging
      - /usr/bin/docker:/usr/bin/docker
      - /var/run/docker.sock:/var/run/docker.sock
    user: flink:${DOCKER_GID}

  beam-jobserver:
    image: apache/beam_flink1.9_job_server:2.20.0
    network_mode: host
    command: --flink-master=localhost:8081
    volumes:
      - staging-dir:/tmp/beam-artifact-staging

volumes:
  staging-dir:
```

We can submit and run pipelines with the following options:
```
'runner': 'PortableRunner',
'job_endpoint': 'localhost:8099',
```
The environment type for the SDK Harness is configured to the default
'docker'.

However, we cannot write output files to the host system. To fix this,
I tried to mount a host directory to the Beam SDK Container (I had to
rebuild the Beam Job Server jar and image to do this). This seems to have
worked, as the output file is created on the host system. However the
pipeline silently fails, and the output file remains empty. Running the
pipeline with DirectRunner confirms that the pipeline is working.

Looking at the output logs, the following error is thrown in the Flink Task
Manager:
flink-taskmanager_1  | java.lang.NoClassDefFoundError:
org/apache/beam/vendor/grpc/v1p26p0/io/netty/buffer/PoolArena$1
I don't know if this is a result of me rebuilding the Job Server, or caused
by another issue.

We currently do not have a distributed file system available. Is there any
way to make writing to the host system possible?

Kind regards,
Robbe

 [image: https://ml6.eu] <https://ml6.eu>

Robbe Sneyders

ML6 Gent
<https://www.google.be/maps/place/ML6/@51.037408,3.7044893,17z/data=!3m1!4b1!4m5!3m4!1s0x47c37161feeca14b:0xb8f72585fdd21c90!8m2!3d51.037408!4d3.706678?hl=nl>

M: +32 474 71 31 08

Reply via email to