Hi All,

I am running a Spark program where one part uses Spark as a scheduler rather than as a data management framework. That is, the job can be described as an RDD[String], where each string describes an operation to perform that may be cheap or expensive (process an object which may have a small or large number of records associated with it).
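Concretely, the pattern is roughly this (runJob is a stand-in for my actual per-string logic):

    import org.apache.spark.rdd.RDD

    val jobs: RDD[String] = sc.parallelize(jobStrings)
    jobs.foreach(desc => runJob(desc))  // the work itself happens on the executors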
Leaving things at their defaults, I get poor job balancing. I am wondering which approach I should take:

1. Write a custom Partitioner and use partitionBy to balance the partitions ahead of time by the number of records each string's job needs to process (rough sketch in the P.S. below).
2. Call repartition to get many small partitions (I have ~1700 strings acting as jobs to run, so maybe 1-5 jobs per partition).

My question here is: does Spark re-schedule/steal work if there are executors/worker processes sitting idle? The second option would be easier, and since I am not shuffling much data around it would work fine for me, but I can't find a definitive answer on whether Spark does this kind of re-scheduling/stealing.

Thanks,

--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni
ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience
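P.S. Here is the rough sketch of option 1 I mentioned above. recordCountFor is a hypothetical stand-in for however I would look up the per-string record counts; the partitioner itself just does greedy bin-packing once on the driver:

    import org.apache.spark.Partitioner

    // Greedy bin-packing, computed once on the driver: sort jobs by
    // descending cost, then drop each one into the currently lightest
    // partition. costs maps each job string to its estimated work.
    class BalancedPartitioner(costs: Map[String, Long],
                              override val numPartitions: Int) extends Partitioner {
      private val assignment: Map[String, Int] = {
        val loads = Array.fill(numPartitions)(0L)
        costs.toSeq.sortBy { case (_, c) => -c }.map { case (job, c) =>
          val part = loads.indexOf(loads.min)  // lightest partition so far
          loads(part) += c
          job -> part
        }.toMap
      }
      override def getPartition(key: Any): Int =
        assignment(key.asInstanceOf[String])
    }

    // Key each job by its string so partitionBy can route it, then run.
    val costs = jobStrings.map(s => s -> recordCountFor(s)).toMap
    sc.parallelize(jobStrings)
      .map(s => (s, ()))
      .partitionBy(new BalancedPartitioner(costs, numPartitions = 64))
      .foreach { case (s, _) => runJob(s) }

Option 2 would just be sc.parallelize(jobStrings, 400).foreach(runJob) (or .repartition on an existing RDD), which only helps if Spark keeps handing the many small partitions to whichever executors go idle — hence my question.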