I have 3 datasets; in all of them the average file size is 10-12 KB.
I am able to run my code on the dataset with 70K files, but I am not able
to run it on the datasets with 1.1M and 3.8M files.
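
A quick back-of-the-envelope check (assuming the 10-12 KB average holds
across all three datasets) shows the raw data volume is modest; what grows
roughly 50x is the file count, and with it the per-file split metadata the
driver has to track:

# Rough dataset sizes at ~11 KB per file (midpoint of the 10-12 KB average):
for n in (70000, 1100000, 3800000):
    print("%9d files ~= %.1f GB" % (n, n * 11.0 / (1024 * 1024)))
# -> roughly 0.7 GB, 11.5 GB, and 39.9 GB of raw JSON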
______________________________________________________________


On Sun, Nov 2, 2014 at 1:35 AM,  <jan.zi...@centrum.cz> wrote:
Hi,

I am using Spark on Yarn, particularly Spark in Python. I am trying to run:

myrdd = sc.textFile("s3n://mybucket/files/*/*/*.json")

How many files do you have, and what is the average size of each file?

myrdd.getNumPartitions()
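
For context: textFile creates at least one Hadoop split per input file, so
calling getNumPartitions on a glob matching millions of small files forces
the driver to list every object and hold all of the split metadata in its
heap. A minimal sketch of an alternative read path, assuming the
(path, content) pairs returned by wholeTextFiles are acceptable for the job:

# Hedged sketch: wholeTextFiles is backed by CombineFileInputFormat, so it
# packs many small files into each partition instead of one split per file.
# The driver still lists every object, but the resulting partition count
# (and downstream task count) stays manageable. minPartitions is only a
# hint; 1000 here is illustrative, not a recommendation.
pairs = sc.wholeTextFiles("s3n://mybucket/files/*/*/*.json", minPartitions=1000)
myrdd = pairs.values()  # drop the file-path keys, keep the JSON text
print(myrdd.getNumPartitions())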

Unfortunately it seems that Spark tries to load everything into RAM, or at
least after a while of running everything slows down and then I get the
errors in the log below. Everything works fine for datasets smaller than
RAM, but I would expect Spark to handle this without storing everything in
RAM. So I would like to ask whether I am missing some setting in Spark on
Yarn?
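
One setting worth checking is the driver heap: the "GC overhead limit
exceeded" errors below all come from the sparkDriver ActorSystem. Note
that in yarn-client mode the driver JVM is already running by the time
SparkConf is read, so spark.driver.memory has to be passed at submit time
rather than set in code. A minimal sketch (my_job.py and the 4g value are
illustrative, not recommendations):

# Hedged sketch: raise the driver heap so the split metadata for millions
# of input files fits; tune the value to the cluster.
spark-submit \
  --master yarn-client \
  --driver-memory 4g \
  my_job.py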


Thank you in advance for any help.


14/11/01 22:06:57 ERROR actor.ActorSystemImpl: Uncaught fatal error from
thread [sparkDriver-akka.actor.default-dispatcher-375] shutting down
ActorSystem [sparkDriver]

java.lang.OutOfMemoryError: GC overhead limit exceeded

14/11/01 22:06:57 ERROR actor.ActorSystemImpl: Uncaught fatal error from
thread [sparkDriver-akka.actor.default-dispatcher-381] shutting down
ActorSystem [sparkDriver]

java.lang.OutOfMemoryError: GC overhead limit exceeded

11744,575: [Full GC 1194515K->1192839K(1365504K), 2,2367150 secs]

11746,814: [Full GC 1194507K->1193186K(1365504K), 2,1788150 secs]

11748,995: [Full GC 1194507K->1193278K(1365504K), 1,3511480 secs]

11750,347: [Full GC 1194507K->1193263K(1365504K), 2,2735350 secs]

11752,622: [Full GC 1194506K->1193192K(1365504K), 1,2700110 secs]

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

  File "/home/hadoop/spark/python/pyspark/rdd.py", line 391, in
getNumPartitions

    return self._jrdd.partitions().size()

  File
"/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__

  File
"/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
300, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling o112.partitions.

: java.lang.OutOfMemoryError: GC overhead limit exceeded

14/11/01 22:07:07 INFO scheduler.DAGScheduler: Failed to run saveAsTextFile
at NativeMethodAccessorImpl.java:-2



11753,896: [Full GC 1194506K->947839K(1365504K), 2,1483780 secs]

14/11/01 22:07:09 INFO remote.RemoteActorRefProvider$RemotingTerminator:
Shutting down remote daemon.

14/11/01 22:07:09 ERROR actor.ActorSystemImpl: Uncaught fatal error from
thread [sparkDriver-akka.actor.default-dispatcher-381] shutting down
ActorSystem [sparkDriver]

java.lang.OutOfMemoryError: GC overhead limit exceeded

14/11/01 22:07:09 ERROR actor.ActorSystemImpl: Uncaught fatal error from
thread [sparkDriver-akka.actor.default-dispatcher-309] shutting down
ActorSystem [sparkDriver]

java.lang.OutOfMemoryError: GC overhead limit exceeded

14/11/01 22:07:09 INFO remote.RemoteActorRefProvider$RemotingTerminator:
Remote daemon shut down; proceeding with flushing remote transports.

14/11/01 22:07:09 INFO Remoting: Remoting shut down

14/11/01 22:07:09 INFO remote.RemoteActorRefProvider$RemotingTerminator:
Remoting shut down.

14/11/01 22:07:09 INFO network.ConnectionManager: Removing
ReceivingConnection to
ConnectionManagerId(ip-172-31-18-35.us-west-2.compute.internal,55871)

14/11/01 22:07:09 INFO network.ConnectionManager: Removing SendingConnection
to ConnectionManagerId(ip-172-31-18-35.us-west-2.compute.internal,55871)

14/11/01 22:07:09 INFO network.ConnectionManager: Removing SendingConnection
to ConnectionManagerId(ip-172-31-18-35.us-west-2.compute.internal,55871)

14/11/01 22:07:09 INFO network.ConnectionManager: Key not valid ?
sun.nio.ch.SelectionKeyImpl@5ca1c790

14/11/01 22:07:09 INFO network.ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@5ca1c790

java.nio.channels.CancelledKeyException

at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)

at
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)

14/11/01 22:07:09 INFO network.ConnectionManager: Removing SendingConnection
to ConnectionManagerId(ip-172-31-18-35.us-west-2.compute.internal,52768)

14/11/01 22:07:09 INFO network.ConnectionManager: Removing
ReceivingConnection to
ConnectionManagerId(ip-172-31-18-35.us-west-2.compute.internal,52768)

14/11/01 22:07:09 ERROR network.ConnectionManager: Corresponding
SendingConnection to
ConnectionManagerId(ip-172-31-18-35.us-west-2.compute.internal,52768) not
found

14/11/01 22:07:10 ERROR cluster.YarnClientSchedulerBackend: Yarn application
already ended: FINISHED

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/metrics/json,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/stages/stage/kill,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/static,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/executors/json,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/executors,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/environment/json,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/environment,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/storage/rdd/json,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/storage/rdd,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/storage/json,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/storage,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/stages/pool/json,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/stages/pool,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/stages/stage/json,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/stages/stage,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/stages/json,null}

14/11/01 22:07:10 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/stages,null}

14/11/01 22:07:10 INFO ui.SparkUI: Stopped Spark web UI at
http://ip-172-31-20-69.us-west-2.compute.internal:4040

14/11/01 22:07:10 INFO scheduler.DAGScheduler: Stopping DAGScheduler

14/11/01 22:07:10 INFO cluster.YarnClientSchedulerBackend: Shutting down all
executors

14/11/01 22:07:10 INFO cluster.YarnClientSchedulerBackend: Stopped

14/11/01 22:07:11 ERROR spark.MapOutputTrackerMaster: Error communicating
with MapOutputTracker

akka.pattern.AskTimeoutException:
Recipient[Actor[akka://sparkDriver/user/MapOutputTracker#132167931]] had
already been terminated.

at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)

at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:106)

at org.apache.spark.MapOutputTracker.sendTracker(MapOutputTracker.scala:117)

at org.apache.spark.MapOutputTrackerMaster.stop(MapOutputTracker.scala:324)

at org.apache.spark.SparkEnv.stop(SparkEnv.scala:80)

at org.apache.spark.SparkContext.stop(SparkContext.scala:1024)

at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:131)

Exception in thread "Yarn Application State Checker"
org.apache.spark.SparkException: Error communicating with MapOutputTracker

at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:111)

at org.apache.spark.MapOutputTracker.sendTracker(MapOutputTracker.scala:117)

at org.apache.spark.MapOutputTrackerMaster.stop(MapOutputTracker.scala:324)

at org.apache.spark.SparkEnv.stop(SparkEnv.scala:80)

at org.apache.spark.SparkContext.stop(SparkContext.scala:1024)

at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:131)

Caused by: akka.pattern.AskTimeoutException:
Recipient[Actor[akka://sparkDriver/user/MapOutputTracker#132167931]] had
already been terminated.

at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)

at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:106)

... 5 more







---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
