Thanks Mayur. So without Hadoop or any other distributed file system, running:

  val doc = sc.textFile("/home/scalatest.txt", 5)
  doc.count

gives us parallelism only within the machine where the file is stored, not across the other machines in the cluster (Spark cannot automatically replicate the file to the other machines in the cluster). Is this understanding correct? Thank you.
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-about-how-hadoop-works-tp4638p4734.html Sent from the Apache Spark User List mailing list archive at Nabble.com.