I have a Spark job that reads data from a database. After increasing the submit parameter `--driver-memory 25g`, the job runs without a problem locally, but not in the prod environment, because the prod master does not have enough capacity.
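For context, here is a minimal sketch of roughly what the job does (the JDBC URL, table name, and case class schema below are placeholders, not my real ones):

```scala
import org.apache.spark.sql.SparkSession

case class Event(id: Long, kind: String, payload: String) // hypothetical schema

val spark = SparkSession.builder().appName("db-to-hdfs").getOrCreate()
import spark.implicits._

// Read from the database over JDBC (connection details are placeholders)
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "events")
  .option("user", "user")
  .option("password", "password")
  .load()

// The only transformations so far
val events = df.as[Event]
  .filter(_.kind == "click")
  .map(e => e.copy(payload = e.payload.trim))
```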
So I have a few questions:

- Which functions, such as `collect()`, cause data to be sent back to the driver program? My job so far uses only `as`, `filter`, and `map`.
- Is it possible to write data (in Parquet format, for instance) to HDFS directly from the executors? If so, how can I do it? Any code snippet, doc reference, or keyword to search for would help; I can't find anything with e.g. `spark direct executor hdfs write`.

Thanks
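To make the second question concrete, this is the kind of write I'm hoping is possible (the output path is a placeholder):

```scala
// Write the transformed Dataset as Parquet to HDFS.
// My understanding is that each executor writes its own part files
// in parallel, so nothing is funneled through the driver -- is that right?
events.write
  .mode("overwrite")
  .parquet("hdfs://namenode:8020/user/me/events_parquet")
```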