I am creating a setup to process packets from a single Kafka topic in parallel. For example, I have 3 containers (let's say 4 cores each) on one VM, and from the one Kafka topic stream I create 10 jobs depending on the packet source. Each packet carries only a small workload.
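
To make the fan-out concrete, here is a minimal sketch of what I mean by "creating jobs by packet source". It assumes the kafka-python client and a hypothetical `source_id` field in each packet; neither is from my actual code, it is just for illustration.

```python
# Sketch only: group packets from one topic into per-source jobs.
# Assumptions: kafka-python client, JSON payloads, a "source_id" field.
import json
from collections import defaultdict

from kafka import KafkaConsumer  # assumption: kafka-python client

consumer = KafkaConsumer(
    "packets",                                              # hypothetical topic name
    bootstrap_servers="kafka:9092",                         # hypothetical broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def build_jobs(batch):
    """Group packets by source so each group becomes one job (~10 sources)."""
    jobs = defaultdict(list)
    for packet in batch:
        jobs[packet["source_id"]].append(packet)
    return jobs

# poll() returns {TopicPartition: [messages]}; flatten it into one batch
records = consumer.poll(timeout_ms=1000)
batch = [msg.value for msgs in records.values() for msg in msgs]
jobs = build_jobs(batch)
```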
These are the options I am considering:

1. Install Dask in each container and execute a set of tasks in parallel, so each container with 4 cores can run 4 jobs at a time. This adds computation and scheduling overhead to the execution time. (A sketch of this option is below.)
2. Run a stand-alone Spark app in each container and execute a set of tasks in parallel. Since the packets are small, I won't be able to harness Spark's distributed computing power.
3. Have a Spark master node, distribute the set of tasks across worker containers, and execute 1 task per core there. Is such scheduling possible in Spark? (A sketch of the per-core scheduling idea is also below.)

I have not yet done a POC to measure resource cost, computation overhead, etc., so I would appreciate your opinion. Which is the best solution for my use case? I am leaning towards the stand-alone Spark app, but is it an efficient choice if I use it only for parallelism? I hope I have explained everything in enough detail.
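
For option 1, this is roughly what I picture with Dask inside a single container. `LocalCluster` is sized to the 4 cores and `process_job` is a placeholder for the real per-source work, so treat it as a sketch under those assumptions, not working code.

```python
# Sketch of option 1: one Dask local cluster per container, one worker per core.
from dask.distributed import Client, LocalCluster

def process_job(source_id, packets):
    # placeholder for the real per-source processing
    return source_id, len(packets)

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)  # 4 cores -> 4 parallel jobs
    client = Client(cluster)

    # stand-in for the per-source job dict built from the Kafka batch
    jobs = {f"source-{i}": [{"payload": i}] for i in range(10)}

    futures = [client.submit(process_job, src, pkts) for src, pkts in jobs.items()]
    results = client.gather(futures)
    print(results)
```

For option 3, my understanding is that Spark already schedules one core per task by default (`spark.task.cpus=1`), so giving the RDD one partition per job should spread the 10 jobs across the worker cores. Here is a minimal PySpark sketch of that idea; the master URL, executor sizing, and `process_job` are assumptions for illustration.

```python
# Sketch of option 3: one partition per per-source job, one core per task.
from pyspark.sql import SparkSession

def process_job(item):
    source_id, packets = item
    return source_id, len(packets)  # placeholder for the real work

if __name__ == "__main__":
    spark = (
        SparkSession.builder
        .master("spark://master:7077")        # hypothetical master URL
        .appName("per-source-jobs")
        .config("spark.executor.cores", "4")  # 4 cores per worker container
        .config("spark.task.cpus", "1")       # default: one core per task
        .getOrCreate()
    )
    sc = spark.sparkContext

    jobs = {f"source-{i}": [{"payload": i}] for i in range(10)}  # stand-in batch
    results = (
        sc.parallelize(list(jobs.items()), numSlices=len(jobs))  # 10 partitions -> 10 tasks
          .map(process_job)
          .collect()
    )
    print(results)
    spark.stop()
```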