[ https://issues.apache.org/jira/browse/SPARK-15937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408217#comment-15408217 ]
Marcelo Vanzin commented on SPARK-15937:
----------------------------------------

[~subrotosanyal] from the logs you posted it doesn't seem like a Spark context is even being instantiated by your application. Can you provide more details about what your application is doing (full logs, and ideally some sample code)?

> Spark declares a succeeding job to be failed in yarn-cluster mode if the job takes very little time (~ < 10 seconds) to finish
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15937
>                 URL: https://issues.apache.org/jira/browse/SPARK-15937
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>            Reporter: Subroto Sanyal
>            Priority: Critical
>
> h5. Problem:
> A Spark job fails in yarn-cluster mode if it takes less than about 10 seconds to finish. The job execution itself is successful, but the Spark framework declares it failed.
> {noformat}
> 16/06/13 10:50:29 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
> 16/06/13 10:50:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 16/06/13 10:50:31 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1465791692084_0078_000001
> 16/06/13 10:50:32 INFO spark.SecurityManager: Changing view acls to: subroto
> 16/06/13 10:50:32 INFO spark.SecurityManager: Changing modify acls to: subroto
> 16/06/13 10:50:32 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(subroto); users with modify permissions: Set(subroto)
> 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
> 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context initialization
> 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Initializing plugin registry on cluster...
> 16/06/13 10:50:33 INFO util.DefaultTimeZone: Loading default time zone of US/Eastern
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.big-decimal.precision=32
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.default-timezone=US/Eastern
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.security.conductor.properties.keysLocation=etc/securePropertiesKeys
> 16/06/13 10:50:33 INFO util.DefaultTimeZone: Changing default time zone of from US/Eastern to US/Eastern
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: --- JVM Information ---
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM: Java HotSpot(TM) 64-Bit Server VM, 1.7 (Oracle Corporation)
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM arguments: -Xmx1024m -Djava.io.tmpdir=/mnt/hadoop/yarn/usercache/subroto/appcache/application_1465791692084_0078/container_1465791692084_0078_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop/yarn/application_1465791692084_0078/container_1465791692084_0078_01_000001 -XX:MaxPermSize=256m
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Log4j: 'file:/mnt/hadoop/yarn/usercache/subroto/filecache/103/__spark_conf__6826322497897602970.zip/log4j.properties' (default classpath)
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Max memory : 910.5 MB
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Free memory: 831.8 MB, before Plugin Registry start-up: 847.5 MB
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: ---------------------
> 16/06/13 10:50:34 INFO graphv2.ClusterTaskRuntime: Initializing cluster task configuration...
> 16/06/13 10:50:34 INFO util.LoggingUtil: Setting root logger level for hadoop task to DEBUG:
> 16/06/13 10:50:35 INFO cluster.JobProcessor: Processing JobInput{_jobName=Import job (76): BookorderHS2ImportJob_SparkCluster#import(Identity)}
> 16/06/13 10:50:35 DEBUG security.UserGroupInformation: hadoop login
> 16/06/13 10:50:35 INFO cluster.JobProcessor: Writing job output to hdfs://ip-10-195-43-46.eu-west-1.compute.internal:8020/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827.
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: /user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827: masked=rw-r--r--
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #2
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #2
> 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: create took 2ms
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: computePacketChunkSize: src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, chunkSize=516, chunksPerPacket=127, packetSize=65532
> 16/06/13 10:50:35 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [DFSClient_NONMAPREDUCE_1004172348_1] with renew id 1 started
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient writeChunk allocating new packet seqno=0, src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, packetSize=65532, chunksPerPacket=127, bytesCurBlock=0
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 0
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 1
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Allocating new block
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Waiting for ack for: 1
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #3
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #3
> 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: addBlock took 1ms
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.126.43.144:1004
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.195.43.46:1004
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Connecting to datanode 10.126.43.144:1004
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Send buf size 131071
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet packet seqno:0 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock: 330
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 0 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 510884
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet packet seqno:1 offsetInBlock:330 lastPacketInBlock:true lastByteOffsetInBlock: 330
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 1 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 1449877
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #4
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #4
> 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: complete took 1ms
> 16/06/13 10:50:35 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
> 16/06/13 10:50:35 DEBUG yarn.ApplicationMaster: Done running users class
> 16/06/13 10:50:42 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 500000 ms. Please check earlier log output for errors. Failing the application.
> 16/06/13 10:50:42 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
> 16/06/13 10:50:42 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1465791692084_0078
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #5
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #5
> 16/06/13 10:50:42 DEBUG ipc.ProtobufRpcEngine: Call: delete took 2ms
> 16/06/13 10:50:42 INFO util.ShutdownHookManager: Shutdown hook called
> 16/06/13 10:50:42 DEBUG ipc.Client: Stopping client
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: closed
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: stopped, remaining connections 0
> {noformat}
> h5. Root Cause
> The logic in _org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized()_ that checks whether a SparkContext is available has a race condition: it does not account for the job finishing within the wait window (here, under 10 seconds).
> From the above log snippet (the ApplicationMaster log) we can see that the user class finishes executing and the application is marked finished with status _SUCCEEDED_ and exit code _0_, but immediately afterwards an error claims the SparkContext wasn't initialised after 500 seconds (the job here actually takes a little less than 5 seconds to complete).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
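The race described in the root-cause section can be sketched as a guarded wait loop. This is an illustrative stand-alone model, not Spark's actual ApplicationMaster code: all names (`WaitRaceSketch`, `contextReady`, `userClassDone`) are hypothetical. The buggy variant only watches for a SparkContext, so a fast user class that finishes (or never registers a context) before the timeout is reported as failed; the fixed variant also exits the wait when the user class completes.

```java
// Hypothetical model of the wait-loop race; not Spark's real implementation.
public class WaitRaceSketch {
    private final Object lock = new Object();
    private volatile boolean contextReady = false;  // stand-in for "SparkContext created"
    private volatile boolean userClassDone = false; // stand-in for "user class returned"

    void signalUserClassDone() {
        synchronized (lock) {
            userClassDone = true;
            lock.notifyAll();
        }
    }

    // respectUserDone=false mimics the buggy loop: it keeps waiting for a
    // context even after the user class has finished, so a short successful
    // job is declared failed once the timeout fires.
    boolean waitForContext(long timeoutMs, boolean respectUserDone)
            throws InterruptedException {
        synchronized (lock) {
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (!contextReady && !(respectUserDone && userClassDone)
                    && System.currentTimeMillis() < deadline) {
                lock.wait(50); // re-check the guards on every wakeup
            }
            return contextReady || (respectUserDone && userClassDone);
        }
    }

    // Simulate a fast user job (~100 ms) that never creates a context.
    static boolean run(boolean respectUserDone) throws InterruptedException {
        WaitRaceSketch am = new WaitRaceSketch();
        Thread user = new Thread(() -> {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            am.signalUserClassDone();
        });
        user.start();
        boolean ok = am.waitForContext(500, respectUserDone);
        user.join();
        return ok;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("buggy wait -> " + run(false)); // false: timed out
        System.out.println("fixed wait -> " + run(true));  // true: saw userClassDone
    }
}
```

Under this model the fix is simply an extra guard in the wait condition; the actual change in Spark also has to reconcile the final app status reported to YARN with the user class's exit code, which the log above shows disagreeing (`SUCCEEDED, exitCode: 0` followed by "Failing the application").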