Hi, I am new to Spark. I am running an HDFS file system on a remote cluster, while my Spark workers are on another cluster. When my textFile RDD is executed, do the Spark workers read the file according to HDFS block boundaries, task by task, or do they read it once when the BlockManager is set up after the first task starts, and then distribute it across the memory of the Spark cluster?
I ask because I have a situation where, when only one worker executes a job, the run time per task (as shown in the history server) is lower than when two workers execute the same job in parallel, even though the total duration is almost the same. I am running a simple grep application with no shuffles within the cluster. The text file is on a remote HDFS cluster and is 813 MB, distributed into 7 chunks of 128 MB, with the last chunk being the leftover size. Thanks
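For what it's worth, with the default Hadoop input format, textFile typically creates one partition per HDFS block, and each task reads only its own split over the network rather than the whole file being fetched once up front. The partition count for the file described above can be sketched as follows (a minimal illustration in plain Python; the 128 MB default HDFS block size is taken from the post):

```python
import math

def partition_count(file_size_mb: float, block_size_mb: float = 128) -> int:
    """One input split per full HDFS block, plus one for any leftover bytes."""
    return math.ceil(file_size_mb / block_size_mb)

# The 813 MB file from the post: 6 full 128 MB blocks + 1 leftover chunk.
print(partition_count(813))  # -> 7 partitions, i.e. 7 tasks in the grep stage
```

Since each of the 7 tasks pulls its split independently, per-task time can vary with how many executors share the remote HDFS link, which may be related to the timing difference observed.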