I am creating a setup to process packets from a single Kafka topic in parallel. For example, I have 3 containers (let's say 4 cores each) on one VM, and from the one Kafka topic stream I create 10 jobs depending on the packet source. Each packet carries only a small workload.
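
To make the fan-out concrete, here is a minimal sketch of what I mean by "creating jobs by packet source". It assumes the kafka-python client and a hypothetical `source_id` field in each packet; neither is from my actual code, it is just for illustration.

```python
# Sketch only: group packets from one topic into per-source jobs.
# Assumptions: kafka-python client, JSON payloads, a "source_id" field.
import json
from collections import defaultdict

from kafka import KafkaConsumer  # assumption: kafka-python client

consumer = KafkaConsumer(
    "packets",                                              # hypothetical topic name
    bootstrap_servers="kafka:9092",                         # hypothetical broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def build_jobs(batch):
    """Group packets by source so each group becomes one job (~10 sources)."""
    jobs = defaultdict(list)
    for packet in batch:
        jobs[packet["source_id"]].append(packet)
    return jobs

# poll() returns {TopicPartition: [messages]}; flatten it into one batch
records = consumer.poll(timeout_ms=1000)
batch = [msg.value for msgs in records.values() for msg in msgs]
jobs = build_jobs(batch)
```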
These are the options I am considering:

1. Install Dask in each container and execute a set of tasks in parallel, so each container with 4 cores can run 4 jobs at a time. This adds computation and scheduling overhead to the execution time. (A sketch of this option is below.)
2. Run a stand-alone Spark app in each container and execute a set of tasks in parallel. Since the packets are small, I won't be able to harness Spark's distributed computing power.
3. Have a Spark master node, distribute the set of tasks across worker containers, and execute 1 task per core there. Is such scheduling possible in Spark? (A sketch of the per-core scheduling idea is also below.)

I have not yet done a POC to measure resource cost, computation overhead, etc., so I would appreciate your opinion. Which is the best solution for my use case? I am leaning towards the stand-alone Spark app, but is it an efficient choice if I use it only for parallelism? I hope I have explained everything in enough detail.
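
For option 1, this is roughly what I picture with Dask inside a single container. `LocalCluster` is sized to the 4 cores and `process_job` is a placeholder for the real per-source work, so treat it as a sketch under those assumptions, not working code.

```python
# Sketch of option 1: one Dask local cluster per container, one worker per core.
from dask.distributed import Client, LocalCluster

def process_job(source_id, packets):
    # placeholder for the real per-source processing
    return source_id, len(packets)

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)  # 4 cores -> 4 parallel jobs
    client = Client(cluster)

    # stand-in for the per-source job dict built from the Kafka batch
    jobs = {f"source-{i}": [{"payload": i}] for i in range(10)}

    futures = [client.submit(process_job, src, pkts) for src, pkts in jobs.items()]
    results = client.gather(futures)
    print(results)
```

For option 3, my understanding is that Spark already schedules one core per task by default (`spark.task.cpus=1`), so giving the RDD one partition per job should spread the 10 jobs across the worker cores. Here is a minimal PySpark sketch of that idea; the master URL, executor sizing, and `process_job` are assumptions for illustration.

```python
# Sketch of option 3: one partition per per-source job, one core per task.
from pyspark.sql import SparkSession

def process_job(item):
    source_id, packets = item
    return source_id, len(packets)  # placeholder for the real work

if __name__ == "__main__":
    spark = (
        SparkSession.builder
        .master("spark://master:7077")        # hypothetical master URL
        .appName("per-source-jobs")
        .config("spark.executor.cores", "4")  # 4 cores per worker container
        .config("spark.task.cpus", "1")       # default: one core per task
        .getOrCreate()
    )
    sc = spark.sparkContext

    jobs = {f"source-{i}": [{"payload": i}] for i in range(10)}  # stand-in batch
    results = (
        sc.parallelize(list(jobs.items()), numSlices=len(jobs))  # 10 partitions -> 10 tasks
          .map(process_job)
          .collect()
    )
    print(results)
    spark.stop()
```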