Hello, I need to process a significant amount of data every day, about 4 TB. This will be processed in batches of about 140 GB. The cluster this will run on doesn't have enough memory to hold the whole dataset at once, so I am trying to understand how Spark reads the data internally.
When using textFile to read an HDFS folder (containing multiple files), I understand that the number of partitions created is equal to the number of HDFS blocks, correct? Are those partitions created lazily? I mean, if the number of blocks/partitions is larger than the number of cores/threads the Spark driver was launched with (N), are N partitions created initially and the rest only when required? Or are all the partitions created up front? I want to avoid reading the whole dataset into memory only to have it spilled to disk because there is not enough memory. Thanks!
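To put rough numbers on the question, here is a back-of-the-envelope sketch in plain Python (no Spark involved). It assumes a 128 MB HDFS block size and 16 cores, both of which are placeholder values I picked for illustration, and it assumes one partition per block, which is exactly the part I am asking to have confirmed. The function name `estimate` is just a hypothetical helper.

```python
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4


def ceil_div(a, b):
    """Integer ceiling division, avoids floating point rounding."""
    return -(-a // b)


def estimate(daily_bytes, batch_bytes, block_bytes, cores):
    """Rough sizing, ASSUMING one partition per HDFS block.

    Returns (batches per day, partitions per batch,
             task waves per batch, bytes being read concurrently).
    """
    batches = ceil_div(daily_bytes, batch_bytes)
    partitions = ceil_div(batch_bytes, block_bytes)
    # If only `cores` partitions are processed at a time (the lazy case),
    # a batch is worked through in this many waves of tasks.
    waves = ceil_div(partitions, cores)
    in_flight = cores * block_bytes
    return batches, partitions, waves, in_flight


# Assumed values: 128 MB blocks and 16 cores are illustrative only.
batches, partitions, waves, in_flight = estimate(4 * TB, 140 * GB, 128 * MB, 16)
print(batches, partitions, waves, in_flight // GB)
```

Under these assumptions a 140 GB batch would map to far more partitions than cores, so whether the partitions are materialized all at once or only as tasks run makes a large difference to the memory needed.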