Hi all, I have written a small Spark ETL application that reads data from GCS, transforms it, and writes it back to another GCS bucket. I am trying to run this application for different IDs on a Spark cluster in Google's Dataproc, tweaking only the default configuration to use the FAIR scheduler with a FIFO queue, by setting these properties in /etc/hadoop/conf/yarn-site.xml:

'''
yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
yarn.scheduler.fair.allocation.file = /etc/hadoop/conf/fair-scheduler.xml
yarn.scheduler.fair.user-as-default-queue = false
'''

and in /etc/hadoop/conf/fair-scheduler.xml, the allocation:

'''
<queueMaxAppsDefault>1</queueMaxAppsDefault>
'''
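For completeness, here is a sketch of how I believe those settings look in full XML form. Note that yarn-site.xml uses `<property>`/`<name>`/`<value>` syntax, and the fair scheduler's allocation file requires an `<allocations>` root element; the FairScheduler class name is my assumption of what `yarn.resourcemanager.scheduler.class` should point at.

```xml
<!-- /etc/hadoop/conf/yarn-site.xml (relevant properties only) -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>/etc/hadoop/conf/fair-scheduler.xml</value>
</property>
<property>
  <name>yarn.scheduler.fair.user-as-default-queue</name>
  <value>false</value>
</property>

<!-- /etc/hadoop/conf/fair-scheduler.xml -->
<allocations>
  <queueMaxAppsDefault>1</queueMaxAppsDefault>
</allocations>
```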
The cluster has 1 master (2 cores, 4 GB RAM) and 2 workers (4 cores, 16 GB RAM each).

I tested with 5 spark submissions and everything worked as expected: all the applications ran one after the other without any exceptions. When I tried the same exercise with 100 submissions, some of the submissions failed with out-of-memory errors. When I re-ran the OOM submissions individually, they completed without any error.

Here is the log of a submission that hit the out-of-memory error:

'''
20/06/05 19:44:23 INFO org.spark_project.jetty.util.log: Logging initialized @5463ms
20/06/05 19:44:24 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/06/05 19:44:24 INFO org.spark_project.jetty.server.Server: Started @5599ms
20/06/05 19:44:24 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/06/05 19:44:24 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
20/06/05 19:44:24 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
20/06/05 19:44:24 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
20/06/05 19:44:24 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
20/06/05 19:44:24 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@723f98fa{HTTP/1.1,[http/1.1]}{0.0.0.0:4045}
20/06/05 19:44:24 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
20/06/05 19:44:26 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at airf-m-2c-w-4c-4-faff-m/10.160.0.156:8032
20/06/05 19:44:27 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at airf-m-2c-w-4c-4-faff-m/10.160.0.156:10200
20/06/05 19:44:29 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1591383928453_0047
20/06/05 19:46:34 WARN org.apache.spark.sql.SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
20/06/05 19:46:41 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Repairing batch of 24 missing directories.
20/06/05 19:46:44 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Successfully repaired 24/24 implicit directories.
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000098200000, 46661632, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 46661632 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/9e22ca5b-5bf8-47b7-12ee-69cd9e37e7c8_spark_submit_20200605_82b0375c/hs_err_pid9917.log
Job output is complete
'''

Also, when I test-ran a single application, I never saw this log line:

'''
Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
'''

I am very new to Spark, so I don't know which configurations might help debug this, and this log didn't help either. I also lost the hs_err file when the cluster was deleted. What can I do to debug this? Thanks for taking the time to read this.
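One thing I plan to try so I don't lose the hs_err file next time: pointing the JVM's fatal-error report at a fixed path via the HotSpot `-XX:ErrorFile` option and copying it off the cluster before deletion. This is just a sketch; the script name and bucket below are hypothetical placeholders, not from my actual setup.

```shell
# Write JVM fatal-error logs to a predictable location.
# %p is expanded by the JVM to the PID of the crashing process.
# my_etl_job.py and gs://my-debug-bucket are hypothetical names.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:ErrorFile=/tmp/hs_err_pid%p.log" \
  --conf "spark.executor.extraJavaOptions=-XX:ErrorFile=/tmp/hs_err_pid%p.log" \
  my_etl_job.py

# Before deleting the cluster, copy any crash reports to GCS:
gsutil cp /tmp/hs_err_pid*.log gs://my-debug-bucket/hs_err/
```

(If the crash happens on a worker rather than the driver, the file would land on that worker's local disk, so it would need to be collected from each node.)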