Hello, I have a large data calculation in Spark, distributed across several nodes. In the end, I want to write to a single output file.
For this I do: output.coalesce(1, false).saveAsTextFile(filename). What happens is that all the data from the workers flows to a single worker, and that one writes the data. If the data is small enough, it all goes well. However, for an RDD above a certain size, I get a lot of the following messages (see below). From what I understand, ExternalAppendOnlyMap spills the data to disk when it can't hold it in memory. Is there a way to tell it to stream the data right to disk, instead of spilling each block slowly?

14/11/24 12:54:59 INFO MapOutputTrackerWorker: Got the output locations
14/11/24 12:54:59 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/11/24 12:54:59 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 69 non-empty blocks out of 90 blocks
14/11/24 12:54:59 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 3 remote fetches in 22 ms
14/11/24 12:55:11 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/11/24 12:55:11 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 70 non-empty blocks out of 90 blocks
14/11/24 12:55:11 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 3 remote fetches in 4 ms
14/11/24 12:55:11 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 13 MB to disk (1 time so far)
14/11/24 12:55:11 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 12 MB to disk (2 times so far)
14/11/24 12:55:11 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 12 MB to disk (3 times so far)
[...trimmed...]
14/11/24 13:13:28 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/11/24 13:13:28 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 69 non-empty blocks out of 90 blocks
14/11/24 13:13:28 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 3 remote fetches in 2 ms
14/11/24 13:13:28 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 15 MB to disk (1 time so far)
14/11/24 13:13:28 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 16 MB to disk (2 times so far)
14/11/24 13:13:28 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 14 MB to disk (3 times so far)
[...trimmed...]
14/11/24 13:13:32 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 13 MB to disk (33 times so far)
14/11/24 13:13:32 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 13 MB to disk (34 times so far)
14/11/24 13:13:32 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 13 MB to disk (35 times so far)
14/11/24 13:13:40 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/11/24 13:13:40 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 69 non-empty blocks out of 90 blocks
14/11/24 13:13:40 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 3 remote fetches in 4 ms
14/11/24 13:13:40 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 10 MB to disk (1 time so far)
14/11/24 13:13:41 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 10 MB to disk (2 times so far)
14/11/24 13:13:41 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 9 MB to disk (3 times so far)
[...trimmed...]
14/11/24 13:13:45 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 12 MB to disk (36 times so far)
14/11/24 13:13:45 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 11 MB to disk (37 times so far)
14/11/24 13:13:56 INFO FileOutputCommitter: Saved output of task 'attempt_201411241250_0000_m_000000_90' to s3n://mybucket/mydir/output

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
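
For reference, the write step described above can be sketched roughly as follows. Only the coalesce(1, false).saveAsTextFile(...) call comes from this post; the object name, the command-line paths, and the placeholder map step are hypothetical stand-ins for the real job, which is not shown. The commented-out variant uses the shuffle flag of RDD.coalesce, which the Spark API documentation mentions for drastic coalesces such as numPartitions = 1; whether it changes the spilling behaviour in this case has not been tested.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver; the master URL is expected to come from spark-submit.
object SingleFileWrite {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0) // assumption: input location passed as first argument
    val filename  = args(1) // output location, e.g. an s3n:// path as in the log

    val sc = new SparkContext(new SparkConf().setAppName("single-file-write"))

    // Placeholder for the real distributed calculation.
    val output = sc.textFile(inputPath).map(_.trim)

    // As in the post: merge everything into one partition without a shuffle,
    // so a single task computes and writes all the data.
    output.coalesce(1, shuffle = false).saveAsTextFile(filename)

    // Untested alternative: add a shuffle step so that the upstream work
    // keeps its original parallelism and only the final write is single-task.
    // output.coalesce(1, shuffle = true).saveAsTextFile(filename)

    sc.stop()
  }
}
```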