I have a single source of data. The processing of records has to be directed
to multiple destinations, i.e.:
1. read the source data
2. based on a condition, route records to the following destinations:
   1. Kafka for error records
   2. store success records that meet a certain condition in an S3 bucket,
      bucket name : "A
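The split described above can be sketched in plain Python. This is only an illustration of the routing logic: the record shape and the error predicate are assumptions, not anything from the original question. In Spark itself the same split is usually expressed as two filters on the one DataFrame, with the error branch written via the Kafka sink and the success branch written to an s3 path; each branch is its own write.

```python
# Sketch of conditional routing: one source, two destinations.
# Record shape and the is_error predicate are hypothetical.

def is_error(record):
    # Assumed condition: a record with a missing "value" is an error.
    return record.get("value") is None

def route(records):
    """Split records into (error_records, success_records)."""
    errors, successes = [], []
    for rec in records:
        (errors if is_error(rec) else successes).append(rec)
    return errors, successes

records = [
    {"id": 1, "value": 10},
    {"id": 2, "value": None},
    {"id": 3, "value": 30},
]
errors, successes = route(records)
# errors would be published to the Kafka topic;
# successes would be written to the S3 bucket.
```

In DataFrame terms the equivalent is roughly `df.filter(cond)` for each branch, since Spark does not let a single write fan out to two sinks from a batch job.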
Thank you. This was helpful. I have follow-up questions.
1. How does Spark know the data size is 5 million?
2. Are there any books or documentation that take one simple job and go
deeper into what happens under the hood?
--
Sent from: http://apache-spark-user-list.10015
Hello,
What happens when a job is submitted to a cluster? I know the 10,000-foot
overview of the Spark architecture, but I need the minute details: how
Spark estimates the resources to ask YARN for, what YARN's response is,
etc. I need a *step by step* understanding of the complete
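On the resource question above, one concrete detail worth knowing: in the common case Spark does not estimate resources at all. Unless dynamic allocation is enabled, the driver asks YARN for exactly the container count and sizes given at submission time. A minimal sketch of such an explicit request, with a placeholder jar and class name:

```shell
# Sketch: explicit resource request to YARN via spark-submit.
# my-job.jar and com.example.MyJob are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --driver-memory 2g \
  --class com.example.MyJob \
  my-job.jar
# Spark asks YARN for 4 executor containers, each with 2 vcores and
# 4g of heap plus off-heap memory overhead, plus one container for
# the ApplicationMaster.
```

With `spark.dynamicAllocation.enabled=true` the executor count instead grows and shrinks with the pending task backlog, but the per-container sizes are still the configured values, not an estimate.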