Hello all,

We are getting a stream of input data from a Kafka queue using the Spark Streaming 
API. For each data element we want to run parallel threads to process a set of 
feature lists (nearly 100 features or more). Since the feature list computations are 
independent of each other, we would like to execute them in parallel on the input 
data that we get from the Kafka queue.

Our question is:

1. Should we write a thread pool and manage the feature execution on separate 
threads in parallel (see the first sketch after this list)? Our only concern is 
that, because of data locality, we are confined to the node that receives the 
input data from the Kafka stream, so we cannot leverage distributed nodes to 
process these features for a single input record.

2. Or, since we are using a JavaRDD as the feature list, will the feature 
execution be managed internally by the Spark executors (see the second sketch 
after this list)?
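To make the two options concrete, here is a minimal sketch of what we have in 
mind for each. These are only sketches under some assumptions: kafkaStream 
stands for the JavaDStream<Record> we already create from Kafka (e.g. via 
KafkaUtils.createDirectStream; details omitted), and Record, FeatureResult, 
computeFeature(record, featureId) and NUM_FEATURES are placeholders for our own 
types and feature logic. The iterator-returning signatures below are the 
Spark 2.x Java API; on 1.x the flat-map functions return an Iterable instead.

For option 1, a thread pool local to each partition, so the ~100 feature 
computations for one record run concurrently on the executor that holds the 
record:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import scala.Tuple2;
    import org.apache.spark.streaming.api.java.JavaDStream;

    JavaDStream<FeatureResult> localResults = kafkaStream.mapPartitions(records -> {
        // One pool per partition, shared by all records in it.
        ExecutorService pool = Executors.newFixedThreadPool(8); // tune to executor cores
        List<FeatureResult> out = new ArrayList<>();
        try {
            while (records.hasNext()) {
                Record record = records.next();
                // Submit all feature computations for this record concurrently.
                List<Future<FeatureResult>> futures = new ArrayList<>();
                for (int f = 0; f < NUM_FEATURES; f++) {
                    final int featureId = f;
                    futures.add(pool.submit(() -> computeFeature(record, featureId)));
                }
                // Collect results; get() rethrows any feature failure.
                for (Future<FeatureResult> fut : futures) {
                    out.add(fut.get());
                }
            }
        } finally {
            pool.shutdown();
        }
        return out.iterator();
    });

For option 2, our understanding is that an RDD can only be operated on from the 
driver, not from inside another task, so we cannot nest a feature JavaRDD per 
record. The only shape we see that lets Spark itself distribute the per-record 
feature work is to fan each record out into (record, featureId) pairs and 
repartition:

    JavaDStream<FeatureResult> distributedResults = kafkaStream
        .flatMapToPair(record -> {
            // Pair the record with every feature id so Spark can schedule
            // each (record, feature) computation independently.
            List<Tuple2<Record, Integer>> pairs = new ArrayList<>(NUM_FEATURES);
            for (int f = 0; f < NUM_FEATURES; f++) {
                pairs.add(new Tuple2<>(record, f));
            }
            return pairs.iterator();
        })
        .repartition(64) // spread the pairs across the cluster; tune this
        .map(pair -> computeFeature(pair._1(), pair._2()));

Our reading of the trade-off: option 1 avoids any shuffle but caps per-record 
parallelism at one executor's cores, while option 2 can use the whole cluster 
but ships each record to many nodes, so it presumably only pays off when a 
single feature computation is expensive relative to the record size. Does that 
match how Spark would actually schedule this?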

Thanks,

Rachana
