Hello all,

We are getting a stream of input data from a Kafka queue using the Spark Streaming API. For each data element we want to run parallel threads to compute a set of feature lists (nearly 100 features or more). Since the feature computations are independent of each other, we would like to execute them in parallel on each input element we get from the Kafka queue.
Our question:

1. Should we write a thread pool ourselves and manage the execution of these features on different threads in parallel? Our only concern is data locality: because each input element from the Kafka stream is assigned to one node, we are confined to that node and cannot leverage the other nodes in the cluster to process the features for a single input element.

2. Or, since we hold the feature list as a JavaRDD, will the feature execution be managed internally by the Spark executors?

Thanks,
Rachana
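P.S. To make option 1 concrete, here is a minimal single-JVM sketch of what we have in mind, using only `java.util.concurrent` (no Spark). The class name `FeatureThreadPool` and the two feature functions are hypothetical placeholders; the real features would come from our feature lists:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class FeatureThreadPool {

    // Submit every feature function against the same input element,
    // then collect the results in feature-list order.
    public static List<Double> computeFeatures(String input,
                                               List<Function<String, Double>> features,
                                               ExecutorService pool) throws Exception {
        List<Future<Double>> futures = new ArrayList<>();
        for (Function<String, Double> f : features) {
            futures.add(pool.submit(() -> f.apply(input)));
        }
        List<Double> results = new ArrayList<>();
        for (Future<Double> fut : futures) {
            results.add(fut.get()); // blocks until that feature finishes
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-ins for the ~100 feature computations.
        List<Function<String, Double>> features = List.of(
                s -> (double) s.length(),
                s -> (double) s.chars().filter(Character::isDigit).count()
        );
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // "event-42" stands in for one element pulled from the Kafka stream.
            List<Double> result = computeFeatures("event-42", features, pool);
            System.out.println(result);
        } finally {
            pool.shutdown();
        }
    }
}
```

As noted above, this runs entirely on the node holding the input element, which is exactly the data-locality limitation we are asking about.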