Hi, I am using sc.textFile("shared_dir/*") to load all the files in a directory on a shared partition. The total size of the files in this directory is 1.2 TB. We have a 16-node cluster with 3 TB of memory (1 node is the driver, 15 nodes are workers). But the loading fails after around 1 TB of data has been read: there is no further progress in the mapPartitions stage after 1 TB of input. The cluster seems to have sufficient memory, so I am not sure why the program gets stuck.
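The load itself looks roughly like this (a minimal sketch; the identity mapPartitions is a stand-in for my actual per-partition logic, and the app name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("LoadSharedDir")  // placeholder app name
    val sc = new SparkContext(conf)
    val lines = sc.textFile("shared_dir/*")                 // ~1.2 TB of input files
    val mapped = lines.mapPartitions(iter => iter)          // stand-in for the real per-partition work
    println(mapped.count())                                 // this stage stalls after ~1 TB is read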
1.2 TB of data divided across 15 worker nodes would require each node to hold about 80 GB. Every node in the cluster is allocated around 170 GB of memory. According to the Spark documentation, the default storage fraction for RDDs is 60% of the allocated memory. I have increased that to 0.8 (by setting --conf spark.storage.memoryFraction=0.8), so each node should have around 136 GB of memory for storing RDDs.

So I am not sure why the program is failing in the mapPartitions stage, where it seems to be loading the data. I don't have a good idea about the Spark internals and would appreciate any help in fixing this issue.

thanks
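P.S. For reference, the fraction can also be set programmatically instead of via --conf; a minimal sketch of the equivalent SparkConf (app name again a placeholder):

    val conf = new SparkConf()
      .setAppName("LoadSharedDir")                 // placeholder app name
      .set("spark.storage.memoryFraction", "0.8")  // default is 0.6; 170 GB * 0.8 = 136 GB per node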