Hi Makoto,

I don't remember writing that, but thanks for bringing this issue up!
There are two important settings to check: 1) driver memory (you can
see it in the executor tab of the web UI), 2) the number of partitions
(try a small number of partitions). I opened two PRs to fix the problem:

1) use broadcast in task closure: https://github.com/apache/spark/pull/1427
2) use treeAggregate to get the result:
https://github.com/apache/spark/pull/1110
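To illustrate why (1) helps, here is a minimal, self-contained Python sketch. It is not Spark code: the dict `broadcast_store` and the id `"bc_0"` are made-up stand-ins for what `SparkContext.broadcast` does behind the scenes, and a small list stands in for the dense weight vector.

```python
import pickle

# Sketch of idea (1): broadcast the weight vector instead of capturing it
# in every task closure. With 2^24 features, a dense double vector is
# roughly 128 MB; a small list stands in for it here.
weights = list(range(100_000))

# Without broadcast: each serialized task carries the full vector
# (cf. "Serialized task 4.0:117 as 25300254 bytes" in the driver log).
task_plain = {"partition": 0, "weights": weights}

# With broadcast: the vector ships once, out of band; each task closure
# carries only a tiny handle.
broadcast_store = {"bc_0": weights}
task_bc = {"partition": 0, "weights_bc_id": "bc_0"}

# Executor side: resolve the handle back to the shared value.
resolved = broadcast_store[task_bc["weights_bc_id"]]

size_plain = len(pickle.dumps(task_plain))
size_bc = len(pickle.dumps(task_bc))
print(size_plain, size_bc)  # the broadcast closure is orders of magnitude smaller
```

With many tasks per stage, shipping the vector once instead of once per task also cuts the serialization work the driver does between GC pauses.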

They are still under review. Once merged, the problem should be fixed.
I will test the KDDB dataset and report back. Thanks!
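The treeAggregate change in (2) can be sketched in plain Python. This is an illustration of the idea, not MLlib's implementation; `tree_aggregate`, `fan_in`, and the toy partial gradients are all assumptions for the example.

```python
from functools import reduce

def tree_aggregate(partials, combine, fan_in=4):
    """Combine groups of `fan_in` partial results in rounds, so the final
    merge sees only a handful of results instead of one per partition."""
    while len(partials) > fan_in:
        partials = [
            reduce(combine, partials[i:i + fan_in])
            for i in range(0, len(partials), fan_in)
        ]
    return reduce(combine, partials)  # cheap final merge on the driver

# 64 partitions, each contributing a partial gradient (a toy 2-d vector).
partials = [[float(p), float(2 * p)] for p in range(64)]
add = lambda a, b: [x + y for x, y in zip(a, b)]

print(tree_aggregate(partials, add))  # same result as a flat reduce
```

The point is that a flat `aggregate` makes the driver sequentially merge one dense vector per partition (the 7.6-minute step mentioned above), while the tree pushes the intermediate merges into extra rounds and leaves the driver only a few final combines.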

Best,
Xiangrui

On Tue, Jul 15, 2014 at 10:48 PM, Makoto Yui <yuin...@gmail.com> wrote:
> Hello,
>
> (2014/06/19 23:43), Xiangrui Meng wrote:
>>>
>>> Execution was slow for the larger KDD Cup 2012, Track 2 dataset
>>> (235M+ records with 16.7M+ (2^24) sparse features, about 33.6GB) due to
>>> the sequential aggregation of dense vectors on a single driver node.
>>>
>>> Aggregation took about 7.6 minutes per iteration.
>
>
> When running the above test with multiple iterations, I got another error
> at the beginning of the 2nd iteration.
>
> It works fine for the first iteration, but the 2nd iteration always fails.
>
> It seems that akka connections are suddenly disassociated when GC happens on
> the driver node. Two possible causes can be considered:
> 1) The driver is under heavy load because of GC, so executors cannot
> connect to the driver. Changing the akka timeout settings did not resolve
> the issue.
> 2) akka oddly released valid connections on GC.
>
> I'm using Spark 1.0.1, and the following akka timeout settings did not
> resolve the problem.
>
> [spark-defaults.conf]
> spark.akka.frameSize     50
> spark.akka.timeout       120
> spark.akka.askTimeout    120
> spark.akka.lookupTimeout 120
> spark.akka.heartbeat.pauses     600
>
> It seems this issue is related to one previously discussed in
> http://markmail.org/message/p2i34frtf4iusdfn
>
> Are there any preferred configurations or workaround for this issue?
>
> Thanks,
> Makoto
>
> --------------------------------------------
> [The error log of the driver]
>
> 14/07/14 18:11:32 INFO scheduler.TaskSetManager: Serialized task 4.0:117 as
> 25300254 bytes in 35 ms
> 666.108: [GC [PSYoungGen: 6540914K->975362K(7046784K)]
> 12419091K->7792529K(23824000K), 5.2157830 secs] [Times: user=0.00 sys=68.43,
> real=5.22 secs]
> 14/07/14 18:11:38 INFO network.ConnectionManager: Removing SendingConnection
> to ConnectionManagerId(dc09.mydomain.org,34565)
> 14/07/14 18:11:38 INFO network.ConnectionManager: Removing
> ReceivingConnection to ConnectionManagerId(dc09.mydomain.org,34565)
> 14/07/14 18:11:38 INFO client.AppClient$ClientActor: Executor updated:
> app-20140714180032-0010/8 is now EXITED (Command exited with code 1)
> 14/07/14 18:11:38 ERROR network.ConnectionManager: Corresponding
> SendingConnectionManagerId not found
> 14/07/14 18:11:38 INFO cluster.SparkDeploySchedulerBackend: Executor
> app-20140714180032-0010/8 removed: Command exited with code 1
> 14/07/14 18:11:38 INFO network.ConnectionManager: Removing SendingConnection
> to ConnectionManagerId(dc30.mydomain.org,59016)
> 14/07/14 18:11:38 INFO network.ConnectionManager: Removing
> ReceivingConnection to ConnectionManagerId(dc30.mydomain.org,59016)
> 14/07/14 18:11:38 ERROR network.ConnectionManager: Corresponding
> SendingConnectionManagerId not found
> 672.596: [GC [PSYoungGen: 6642785K->359202K(6059072K)]
> 13459952K->8065935K(22836288K), 2.8260220 secs] [Times: user=2.83 sys=33.72,
> real=2.83 secs]
> 14/07/14 18:11:41 INFO network.ConnectionManager: Removing
> ReceivingConnection to ConnectionManagerId(dc03.mydomain.org,43278)
> 14/07/14 18:11:41 INFO network.ConnectionManager: Removing SendingConnection
> to ConnectionManagerId(dc03.mydomain.org,43278)
> 14/07/14 18:11:41 INFO network.ConnectionManager: Removing SendingConnection
> to ConnectionManagerId(dc02.mydomain.org,54538)
> 14/07/14 18:11:41 INFO network.ConnectionManager: Removing
> ReceivingConnection to ConnectionManagerId(dc18.mydomain.org,58100)
> 14/07/14 18:11:41 INFO network.ConnectionManager: Removing SendingConnection
> to ConnectionManagerId(dc18.mydomain.org,58100)
> 14/07/14 18:11:41 INFO network.ConnectionManager: Removing SendingConnection
> to ConnectionManagerId(dc18.mydomain.org,58100)
>
> The full log is uploaded on
> https://dl.dropboxusercontent.com/u/13123103/driver.log
>
> --------------------------------------------
> [The error log of a worker]
> 14/07/14 18:11:38 INFO worker.Worker: Executor app-20140714180032-0010/8
> finished with state EXITED message Command exited with code 1 exitStatus 1
> 14/07/14 18:11:38 INFO actor.LocalActorRef: Message
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
> Actor[akka://sparkWorker/deadLetters] to
> Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.0.1.9%3A60601-39#1322474303]
> was not delivered. [13] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 14/07/14 18:11:38 ERROR remote.EndpointWriter: AssociationError
> [akka.tcp://sparkwor...@dc09.mydomain.org:39578] ->
> [akka.tcp://sparkexecu...@dc09.mydomain.org:33886]: Error [Association
> failed with [akka.tcp://sparkexecu...@dc09.mydomain.org:33886]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkexecu...@dc09.mydomain.org:33886]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: dc09.mydomain.org/10.0.1.9:33886]
> 14/07/14 18:11:38 INFO worker.Worker: Asked to launch executor
> app-20140714180032-0010/32 for Spark shell
> 14/07/14 18:11:38 WARN worker.CommandUtils: SPARK_JAVA_OPTS was set on the
> worker. It is deprecated in Spark 1.0.
> 14/07/14 18:11:38 WARN worker.CommandUtils: Set SPARK_LOCAL_DIRS for
> node-specific storage locations.
> 14/07/14 18:11:38 ERROR remote.EndpointWriter: AssociationError
> [akka.tcp://sparkwor...@dc09.mydomain.org:39578] ->
> [akka.tcp://sparkexecu...@dc09.mydomain.org:33886]: Error [Association
> failed with [akka.tcp://sparkexecu...@dc09.mydomain.org:33886]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkexecu...@dc09.mydomain.org:33886]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: dc09.mydomain.org/10.0.1.9:33886]
