Hello,

(2014/06/19 23:43), Xiangrui Meng wrote:
The execution was slow for more large KDD cup 2012, Track 2 dataset (235M+ 
records of 16.7M+ (2^24) sparse features in about 33.6GB) due to the sequential 
aggregation of dense vectors on a single driver node.

It took about 7.6m for aggregation for an iteration.

When running the above test, I got another error at the beginning of the 2nd iteration when enabling iterations.

It works fine for the first iteration but the 2nd iteration always fails.

It seems that akka connections are suddenly disassociated when GC happens on the driver node. Two possible causes can be considered: 1) The driver is under a heavy load because of GC; so executors cannot connect to the driver. Changing akka timeout setting did not resolve the issue.
2) akka oddly released valid connections on GC.

I'm using spark 1.0.1 and timeout setting of akka as follows did not resolve the problem.

[spark-defaults.conf]
spark.akka.frameSize     50
spark.akka.timeout       120
spark.akka.askTimeout    120
spark.akka.lookupTimeout 120
spark.akka.heartbeat.pauses     600

It seems this issue is related to one previously discussed in
http://markmail.org/message/p2i34frtf4iusdfn

Are there any preferred configurations or workaround for this issue?

Thanks,
Makoto

--------------------------------------------
[The error log of the driver]

14/07/14 18:11:32 INFO scheduler.TaskSetManager: Serialized task 4.0:117 as 25300254 bytes in 35 ms 666.108: [GC [PSYoungGen: 6540914K->975362K(7046784K)] 12419091K->7792529K(23824000K), 5.2157830 secs] [Times: user=0.00 sys=68.43, real=5.22 secs] 14/07/14 18:11:38 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(dc09.mydomain.org,34565) 14/07/14 18:11:38 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(dc09.mydomain.org,34565) 14/07/14 18:11:38 INFO client.AppClient$ClientActor: Executor updated: app-20140714180032-0010/8 is now EXITED (Command exited with code 1) 14/07/14 18:11:38 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found 14/07/14 18:11:38 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140714180032-0010/8 removed: Command exited with code 1 14/07/14 18:11:38 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(dc30.mydomain.org,59016) 14/07/14 18:11:38 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(dc30.mydomain.org,59016) 14/07/14 18:11:38 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found 672.596: [GC [PSYoungGen: 6642785K->359202K(6059072K)] 13459952K->8065935K(22836288K), 2.8260220 secs] [Times: user=2.83 sys=33.72, real=2.83 secs] 14/07/14 18:11:41 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(dc03.mydomain.org,43278) 14/07/14 18:11:41 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(dc03.mydomain.org,43278) 14/07/14 18:11:41 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(dc02.mydomain.org,54538) 14/07/14 18:11:41 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(dc18.mydomain.org,58100) 14/07/14 18:11:41 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(dc18.mydomain.org,58100) 14/07/14 18:11:41 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(dc18.mydomain.org,58100)

The full log is uploaded on
https://dl.dropboxusercontent.com/u/13123103/driver.log

--------------------------------------------
[The error log of a worker]
14/07/14 18:11:38 INFO worker.Worker: Executor app-20140714180032-0010/8 finished with state EXITED message Command exited with code 1 exitStatus 1 14/07/14 18:11:38 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.0.1.9%3A60601-39#1322474303] was not delivered. [13] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 14/07/14 18:11:38 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkwor...@dc09.mydomain.org:39578] -> [akka.tcp://sparkexecu...@dc09.mydomain.org:33886]: Error [Association failed with [akka.tcp://sparkexecu...@dc09.mydomain.org:33886]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkexecu...@dc09.mydomain.org:33886] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: dc09.mydomain.org/10.0.1.9:33886] 14/07/14 18:11:38 INFO worker.Worker: Asked to launch executor app-20140714180032-0010/32 for Spark shell 14/07/14 18:11:38 WARN worker.CommandUtils: SPARK_JAVA_OPTS was set on the worker. It is deprecated in Spark 1.0. 14/07/14 18:11:38 WARN worker.CommandUtils: Set SPARK_LOCAL_DIRS for node-specific storage locations. 14/07/14 18:11:38 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkwor...@dc09.mydomain.org:39578] -> [akka.tcp://sparkexecu...@dc09.mydomain.org:33886]: Error [Association failed with [akka.tcp://sparkexecu...@dc09.mydomain.org:33886]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkexecu...@dc09.mydomain.org:33886] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: dc09.mydomain.org/10.0.1.9:33886]

Reply via email to