Hi folks,

I've run into a very weird problem and really need some help. Here is my
situation:
--------------------------------------------------------------------
Case: Assign keys to two datasets (one 96GB with 2.7 billion records, the
other 1.5GB with 30k records) via mapPartitions first, then join them
together on those keys.
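
Roughly, the job looks like the sketch below (the paths, the names, and
makeKey are simplified placeholders, not my real code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object KeyAndJoin {
  // Placeholder key extraction; the real logic is more involved
  def makeKey(line: String): String = line.split('\t')(0)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("key-and-join"))

    // ~96GB, 2.7 billion records
    val big = sc.textFile("hdfs:///data/big")
      .mapPartitions(iter => iter.map(line => (makeKey(line), line)))

    // ~1.5GB, 30k records
    val small = sc.textFile("hdfs:///data/small")
      .mapPartitions(iter => iter.map(line => (makeKey(line), line)))

    // The join on the keys is where the executors get lost
    big.join(small).saveAsTextFile("hdfs:///data/joined")

    sc.stop()
  }
}
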
--------------------------------------------------------------------
Environment:

Standalone Spark on Amazon EC2
Master * 1: 13GB, 8 cores
Worker * 16: 13GB, 8 cores each


(After hitting this problem, I switched to
Worker * 16: 59GB, 8 cores each)


Read and write on HDFS (same cluster)
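
For what it's worth, the context is set up roughly like this (the master
host and the memory value are illustrative placeholders, not my exact
settings):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative setup for the standalone cluster above
val conf = new SparkConf()
  .setAppName("key-and-join")
  .setMaster("spark://<master-host>:7077")
  .set("spark.executor.memory", "10g")   // leaves some headroom on a 13GB worker
val sc = new SparkContext(conf)
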
----------------------------------------------------------------------
Problem:

At the beginning:-----------------------

The mapPartitions stage looks fine, but when Spark does the join of the two
datasets, the console says:

*"ERROR TaskSchedulerImpl: Lost executor 4 on
ip-172-31-27-174.us-west-2.compute.internal: remote Akka client
disassociated"*

Then I went back to that worker and checked its log.

There is something like "Master said remote Akka client disassociated and
asked to kill executor ***", and then the worker killed that executor.

(Sorry, I deleted that log and am quoting it from memory.)

There are no other errors before the Akka client disassociated (in either
the master or the worker logs).

Then -------------------------------------------

I tried a 62GB dataset with the 1.5GB dataset, and the job ran
smoothly. *HOWEVER,
I found one thing: if I set spark.shuffle.memoryFraction to zero, the same
error happens on this 62GB dataset.*
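
(For reference, that test was just the same conf with the fraction forced
down, roughly like this:)

import org.apache.spark.SparkConf

// Reproducing the failure on the 62GB dataset: force the shuffle memory
// fraction to zero so shuffle data spills to disk aggressively
// (the Spark 1.x default is 0.2, if I read the docs correctly)
val conf = new SparkConf()
  .setAppName("key-and-join")
  .set("spark.shuffle.memoryFraction", "0")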

Then ---------------------------------------------------

I switched to 16 workers with 59GB and 8 cores each. Now the error even
happens while Spark is doing the mapPartitions!

Some metrics I found ----------------------------------------------

*When I do the mapPartitions or the join on the 96GB data, its shuffle write
is around 100GB. And when I cache the 96GB dataset, its in-memory size is
around 530GB.*

*Garbage collection time for the 96GB dataset while Spark does the map or
join is around 12 seconds.*
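
The caching uses the default storage level; I assume the jump from 96GB on
disk to ~530GB in memory is the overhead of deserialized Java objects
(sketch below, reusing the "big" RDD name from the earlier sketch;
MEMORY_ONLY_SER is something I have not tried):

import org.apache.spark.storage.StorageLevel

// How the 96GB dataset is cached (sketch): the default MEMORY_ONLY level
// keeps deserialized Java objects, which I believe explains the ~530GB size
big.persist(StorageLevel.MEMORY_ONLY)   // equivalent to big.cache()

// A serialized level would be more compact, but I have not tested it here
// big.persist(StorageLevel.MEMORY_ONLY_SER)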

My analysis----------------------------------------------

This problem might be caused by the large shuffle write. The large shuffle
write causes heavy disk I/O, and if the shuffle write cannot finish within
some timeout, the master thinks the executor has disassociated.
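
If that is the case, the knobs I have been looking at are the
connection/Akka timeouts below (names taken from the Spark 1.x configuration
page; the values are untested guesses, and I am not sure these are the right
settings):

import org.apache.spark.SparkConf

// Candidate timeout settings (Spark 1.x standalone); values are guesses
val conf = new SparkConf()
  .set("spark.core.connection.ack.wait.timeout", "600")  // seconds; default 60
  .set("spark.akka.timeout", "300")                       // seconds; default 100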

But I don't know how to solve this problem.

-------------------------------------------------------------------


Any help would be appreciated!

Thanks,
Jia
