Our job is creating what appears to be an inordinate number of very small tasks, which blow out our os inode and file limits. Rather than continually upping those limits, we are seeking to understand whether our real problem is that too many tasks are running, perhaps because we are mis-configured or we are coding incorrectly.
Rather than posting our actual code I have re-created the essence of the matter in the shell with a directory of files simulating the data we deal with. We have three servers, each with 8G RAM. Given 1,000 files, each containing a string of 100 characters, in the myfiles directory: val data = sc.textFile("/user/foo/myfiles/*") val c = data.count The count operation produces 1,000 tasks. Is this normal? val cart = data.cartesian(data) cart.count The cartesian operation produces 1M tasks. I understand that the cartesian product of 1,000 items against itself is 1M, however, it seems the overhead of all this task creation and file I/O of all these tiny files outweighs the gains of distributed computing. What am I missing here? Below is the truncated output of the count operation, if this helps indicate a configuration problem. Thank you. scala> data.count 14/12/16 07:40:46 INFO FileInputFormat: Total input paths to process : 1000 14/12/16 07:40:47 INFO SparkContext: Starting job: count at <console>:15 14/12/16 07:40:47 INFO DAGScheduler: Got job 0 (count at <console>:15) with 1000 output partitions (allowLocal=false) 14/12/16 07:40:47 INFO DAGScheduler: Final stage: Stage 0(count at <console>:15) 14/12/16 07:40:47 INFO DAGScheduler: Parents of final stage: List() 14/12/16 07:40:47 INFO DAGScheduler: Missing parents: List() 14/12/16 07:40:47 INFO DAGScheduler: Submitting Stage 0 (/user/ds/randomfiles/* MappedRDD[3] at textFile at <console>:12), which has no missing parents 14/12/16 07:40:47 INFO MemoryStore: ensureFreeSpace(2400) called with curMem=507154, maxMem=278019440 14/12/16 07:40:47 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.3 KB, free 264.7 MB) 14/12/16 07:40:47 INFO MemoryStore: ensureFreeSpace(1813) called with curMem=509554, maxMem=278019440 14/12/16 07:40:47 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1813.0 B, free 264.7 MB) 14/12/16 07:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on dev-myserver-1.abc.cloud:44041 (size: 1813.0 B, free: 265.1 MB) 14/12/16 07:40:47 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0 14/12/16 07:40:47 INFO DAGScheduler: Submitting 1000 missing tasks from Stage 0 (/user/ds/randomfiles/* MappedRDD[3] at textFile at <console>:12) 14/12/16 07:40:47 INFO TaskSchedulerImpl: Adding task set 0.0 with 1000 tasks 14/12/16 07:40:47 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 0, dev-myserver-2.abc.cloud, NODE_LOCAL, 1202 bytes) 14/12/16 07:40:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 1, dev-myserver-3.abc.cloud, NODE_LOCAL, 1201 bytes) 14/12/16 07:40:47 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, dev-myserver-1.abc.cloud, NODE_LOCAL, 1203 bytes) 14/12/16 07:40:47 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 3, dev-myserver-2.abc.cloud, NODE_LOCAL, 1203 bytes) 14/12/16 07:40:47 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 4, dev-myserver-3.abc.cloud, NODE_LOCAL, 1204 bytes) 14/12/16 07:40:47 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, dev-myserver-1.abc.cloud, NODE_LOCAL, 1203 bytes) 14/12/16 07:40:47 INFO ConnectionManager: Accepted connection from [dev-myserver-3.abc.cloud/10.40.13.192:36133] 14/12/16 07:40:47 INFO ConnectionManager: Accepted connection from [dev-myserver-2.abc.cloud/10.40.13.195:35716] 14/12/16 07:40:47 INFO ConnectionManager: Accepted connection from [dev-myserver-1.abc.cloud/10.40.13.194:33728] 14/12/16 07:40:47 INFO SendingConnection: Initiating connection to [dev-myserver-1.abc.cloud/10.40.13.194:49458] 14/12/16 07:40:47 INFO SendingConnection: Initiating connection to [dev-myserver-3.abc.cloud/10.40.13.192:58579] 14/12/16 07:40:47 INFO SendingConnection: Initiating connection to [dev-myserver-2.abc.cloud/10.40.13.195:52502] 14/12/16 07:40:47 INFO SendingConnection: Connected to [dev-myserver-3.abc.cloud/10.40.13.192:58579], 1 messages pending 14/12/16 07:40:47 INFO SendingConnection: Connected to [dev-myserver-1.abc.cloud/10.40.13.194:49458], 1 messages pending 14/12/16 07:40:47 INFO SendingConnection: Connected to [dev-myserver-2.abc.cloud/10.40.13.195:52502], 1 messages pending 14/12/16 07:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on dev-myserver-1.abc.cloud:49458 (size: 1813.0 B, free: 1060.0 MB) 14/12/16 07:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on dev-myserver-3.abc.cloud:58579 (size: 1813.0 B, free: 1060.0 MB) 14/12/16 07:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on dev-myserver-2.abc.cloud:52502 (size: 1813.0 B, free: 1060.0 MB) 14/12/16 07:40:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on dev-myserver-3.abc.cloud:58579 (size: 32.5 KB, free: 1060.0 MB) 14/12/16 07:40:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on dev-myserver-2.abc.cloud:52502 (size: 32.5 KB, free: 1060.0 MB) 14/12/16 07:40:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on dev-myserver-1.abc.cloud:49458 (size: 32.5 KB, free: 1060.0 MB) 14/12/16 07:40:49 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, dev-myserver-1.abc.cloud, NODE_LOCAL, 1203 bytes) 14/12/16 07:40:49 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, dev-myserver-1.abc.cloud, NODE_LOCAL, 1203 bytes) 14/12/16 07:40:49 INFO TaskSetManager: Starting task 19.0 in stage 0.0 (TID 8, dev-myserver-2.abc.cloud, NODE_LOCAL, 1203 bytes) 14/12/16 07:40:49 INFO TaskSetManager: Starting task 23.0 in stage 0.0 (TID 9, dev-myserver-2.abc.cloud, NODE_LOCAL, 1203 bytes) 14/12/16 07:40:49 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 10, dev-myserver-3.abc.cloud, NODE_LOCAL, 1203 bytes) 14/12/16 07:40:49 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 11, dev-myserver-3.abc.cloud, NODE_LOCAL, 1203 bytes) 14/12/16 07:40:49 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 1964 ms on dev-myserver-1.abc.cloud (1/1000) 14/12/16 07:40:49 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 0) in 2000 ms on dev-myserver-2.abc.cloud (2/1000) 14/12/16 07:40:49 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 1975 ms on dev-myserver-1.abc.cloud (3/1000) .... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-so-many-tasks-tp20712.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org