Hi, I am curious how records are distributed to tasks, since, as you can see in the screenshot below, one specific executor handles more tasks than the others. The setup is this:
- Spark version: 2.3.1
- Spark streaming job runs on Spark Standalone with the following configuration:
  - spark.cores.max: 105
  - executor-memory: 4G
  - driver-memory: 2G
  - spark.memory.storageFraction: 0.1
  - spark.streaming.kafka.maxRatePerPartition: 15000
  - batch duration: 20 seconds
- Each batch finishes in ~9 seconds and consumes ~800k records
- The Spark Standalone cluster contains:
  - workers: 15 (8 cores, 30G memory per worker)
  - cores: 120
  - memory: 455.6G
- Consumes a Kafka topic with 60 partitions

The streaming job consumes records from Kafka using org.apache.spark.streaming.kafka010.KafkaUtils; the record format is JSON. It applies map and filter transformations (the data type being transformed is a class with 50 fields), does no repartitioning, sinks the result to another topic with 60 partitions, and at the end maps to pairs (timestamp as key, class as value) -> countByValue -> sorts by the counts and prints the top 10 records. I would like to do some tuning and enhancements, and I hope someone can explain why the tasks are skewed and point out where I should look (simplified sketches of the job are in the P.S. below).

[image: screencapture-aratupstream201-prod-hnd2-bdd-local-4040-executors-2019-06-04-18_28_59.png]

Thanks in advance!
A
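P.S. For reference, here is roughly how the stream is created. This is a simplified sketch, not the actual code; the broker address, group id, and topic name are placeholders. The location strategy is the part I suspect is relevant to the skew: PreferConsistent is supposed to spread the 60 Kafka partitions evenly over executors, and a high spark.locality.wait (default 3s) can otherwise let tasks queue on a few executors.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("kafka-stream-job")
val ssc = new StreamingContext(conf, Seconds(20)) // 20-second batch duration

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",  // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "stream-job-group",       // placeholder
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// PreferConsistent assigns the topic's partitions evenly across the
// available executors; each of the 60 partitions becomes one task.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("input-topic"), kafkaParams)
)
```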
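And here is a rough sketch of the transformation and top-10 part. The record class and parseJson are stand-ins for our real 50-field class and JSON decoder, and the sink to the 60-partition output topic is omitted. One variation I am considering: using top(10) with an Ordering on the count instead of a full sort followed by take(10), since top avoids sorting the whole RDD.

```scala
import org.apache.spark.streaming.dstream.DStream

// Stand-in for the real class with ~50 fields; only two are shown.
case class Record(timestamp: Long, field1: String /* ..., ~50 fields */)

// Stand-in for whatever JSON decoder the job actually uses.
def parseJson(value: String): Record = ???

val records: DStream[Record] =
  stream.map(cr => parseJson(cr.value())).filter(_.field1 != null)

// (timestamp, record) pairs -> countByValue -> print the top 10 counts.
records
  .map(r => (r.timestamp, r))
  .countByValue()
  .foreachRDD { rdd =>
    // top(10) only keeps the 10 largest elements per partition and merges
    // them on the driver, which is cheaper than sortBy(...).take(10).
    rdd.top(10)(Ordering.by(_._2)).foreach(println)
  }

ssc.start()
ssc.awaitTermination()
```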