[ https://issues.apache.org/jira/browse/SPARK-15937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408217#comment-15408217 ]

Marcelo Vanzin commented on SPARK-15937:
----------------------------------------

[~subrotosanyal] from the logs you posted, it doesn't seem like a SparkContext 
is even being instantiated by your application. Can you provide more details 
about what your application is doing (e.g. full logs and, ideally, some sample 
code)?

> Spark declares a successful job to be failed in yarn-cluster mode if the job 
> takes a very short time (~ < 10 seconds) to finish
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15937
>                 URL: https://issues.apache.org/jira/browse/SPARK-15937
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>            Reporter: Subroto Sanyal
>            Priority: Critical
>
> h5. Problem:
> A Spark job fails in yarn-cluster mode if it takes less than 10 seconds to 
> run. The job execution here is successful, but the Spark framework declares 
> it failed.
> {noformat}
> 16/06/13 10:50:29 INFO yarn.ApplicationMaster: Registered signal handlers for 
> [TERM, HUP, INT]
> 16/06/13 10:50:30 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 16/06/13 10:50:31 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
> appattempt_1465791692084_0078_000001
> 16/06/13 10:50:32 INFO spark.SecurityManager: Changing view acls to: subroto
> 16/06/13 10:50:32 INFO spark.SecurityManager: Changing modify acls to: subroto
> 16/06/13 10:50:32 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(subroto); users 
> with modify permissions: Set(subroto)
> 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Starting the user application 
> in a separate Thread
> 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context 
> initialization
> 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context 
> initialization ... 
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Initializing plugin 
> registry on cluster...
> 16/06/13 10:50:33 INFO util.DefaultTimeZone: Loading default time zone of 
> US/Eastern
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property 
> das.big-decimal.precision=32
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property 
> das.default-timezone=US/Eastern
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property 
> das.security.conductor.properties.keysLocation=etc/securePropertiesKeys
> 16/06/13 10:50:33 INFO util.DefaultTimeZone: Changing default time zone of 
> from US/Eastern to US/Eastern
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: --- JVM Information ---
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM: Java HotSpot(TM) 64-Bit 
> Server VM, 1.7 (Oracle Corporation)
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM arguments: -Xmx1024m 
> -Djava.io.tmpdir=/mnt/hadoop/yarn/usercache/subroto/appcache/application_1465791692084_0078/container_1465791692084_0078_01_000001/tmp
>  
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop/yarn/application_1465791692084_0078/container_1465791692084_0078_01_000001
>  -XX:MaxPermSize=256m 
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Log4j:  
> 'file:/mnt/hadoop/yarn/usercache/subroto/filecache/103/__spark_conf__6826322497897602970.zip/log4j.properties'
>  (default classpath)
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Max memory : 910.5 MB
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Free memory: 831.8 MB, before 
> Plugin Registry start-up: 847.5 MB
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: ---------------------
> 16/06/13 10:50:34 INFO graphv2.ClusterTaskRuntime: Initializing cluster task 
> configuration...
> 16/06/13 10:50:34 INFO util.LoggingUtil: Setting root logger level for hadoop 
> task to DEBUG: 
> 16/06/13 10:50:35 INFO cluster.JobProcessor: Processing 
> JobInput{_jobName=Import job (76): 
> BookorderHS2ImportJob_SparkCluster#import(Identity)}
> 16/06/13 10:50:35 DEBUG security.UserGroupInformation: hadoop login
> 16/06/13 10:50:35 INFO cluster.JobProcessor: Writing job output to 
> hdfs://ip-10-195-43-46.eu-west-1.compute.internal:8020/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827.
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: 
> /user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827: 
> masked=rw-r--r--
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
> sending #2
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
> value #2
> 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: create took 2ms
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: computePacketChunkSize: 
> src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, 
> chunkSize=516, chunksPerPacket=127, packetSize=65532
> 16/06/13 10:50:35 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for 
> [DFSClient_NONMAPREDUCE_1004172348_1] with renew id 1 started
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient writeChunk allocating new 
> packet seqno=0, 
> src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, 
> packetSize=65532, chunksPerPacket=127, bytesCurBlock=0
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 0
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 1
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Allocating new block
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Waiting for ack for: 1
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
> sending #3
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
> value #3
> 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: addBlock took 1ms
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.126.43.144:1004
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.195.43.46:1004
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Connecting to datanode 
> 10.126.43.144:1004
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Send buf size 131071
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block 
> BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet 
> packet seqno:0 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock: 
> 330
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 0 status: SUCCESS 
> status: SUCCESS downstreamAckTimeNanos: 510884
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block 
> BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet 
> packet seqno:1 offsetInBlock:330 lastPacketInBlock:true 
> lastByteOffsetInBlock: 330
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 1 status: SUCCESS 
> status: SUCCESS downstreamAckTimeNanos: 1449877
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
> sending #4
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
> value #4
> 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: complete took 1ms
> 16/06/13 10:50:35 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 16/06/13 10:50:35 DEBUG yarn.ApplicationMaster: Done running users class
> 16/06/13 10:50:42 ERROR yarn.ApplicationMaster: SparkContext did not 
> initialize after waiting for 500000 ms. Please check earlier log output for 
> errors. Failing the application.
> 16/06/13 10:50:42 INFO yarn.ApplicationMaster: Unregistering 
> ApplicationMaster with SUCCEEDED
> 16/06/13 10:50:42 INFO yarn.ApplicationMaster: Deleting staging directory 
> .sparkStaging/application_1465791692084_0078
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
> sending #5
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
> value #5
> 16/06/13 10:50:42 DEBUG ipc.ProtobufRpcEngine: Call: delete took 2ms
> 16/06/13 10:50:42 INFO util.ShutdownHookManager: Shutdown hook called
> 16/06/13 10:50:42 DEBUG ipc.Client: Stopping client
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: 
> closed
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: 
> stopped, remaining connections 0
> {noformat}
> h5. Root Cause
> There is a race condition in the logic of 
> _org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized()_: 
> the check it uses to decide whether the SparkContext is available is not 
> sufficient, and it does not account for the fact that the job can finish 
> within 10 seconds.
> From the above log snippet (the Application Master log) we can see that the 
> user class finishes executing and is marked as finished with app status 
> _SUCCEEDED_ and exit code _0_, but immediately afterwards an error states 
> that the SparkContext wasn't initialised after 500 seconds (the job here 
> actually takes a little less than 5 seconds to complete).
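> A minimal sketch of the kind of wait loop that can lose this race (this is 
> illustrative Scala, not the actual Spark source; {{sparkContextRef}} and 
> {{waitTimeMs}} are stand-ins):
> {noformat}
> object AmWaitSketch {
>   // The AM launches the user class in a separate thread, then blocks here
>   // waiting for that thread to publish a SparkContext. If the user class
>   // runs to completion without ever creating a context, this wait simply
>   // times out and the application is reported as failed, even though the
>   // final app status was already set to SUCCEEDED.
>   val sparkContextRef = new java.util.concurrent.atomic.AtomicReference[AnyRef](null)
> 
>   def waitForSparkContextInitialized(waitTimeMs: Long): Option[AnyRef] = {
>     sparkContextRef.synchronized {
>       val deadline = System.currentTimeMillis() + waitTimeMs
>       while (sparkContextRef.get() == null && System.currentTimeMillis() < deadline) {
>         sparkContextRef.wait(1000L)  // never notified if the user class exits without a context
>       }
>       // None => "SparkContext did not initialize", application marked failed
>       Option(sparkContextRef.get())
>     }
>   }
> }
> {noformat}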


