Hello,

I need to process a significant amount of data every day, about 4TB, in
batches of about 140GB. The cluster this will be running on doesn't have
enough memory to hold the dataset at once, so I am trying to understand how
this works internally.

When using textFile to read an HDFS folder (containing multiple files), I
understand that the number of partitions created is equal to the number of
HDFS blocks, correct? Are those partitions created lazily? That is, if the
number of blocks/partitions is larger than the number of cores/threads the
Spark driver was launched with (N), are N partitions created initially and
the rest only when required? Or are all of the partitions created up front?
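For concreteness, here is a minimal sketch of the kind of read I mean (the
app name and HDFS path are made up; the only calls are textFile,
partitions.length and count):

import org.apache.spark.{SparkConf, SparkContext}

object BatchRead {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BatchIngest")
    val sc = new SparkContext(conf)

    // One ~140GB batch folder on HDFS (path is hypothetical).
    val batch = sc.textFile("hdfs:///data/batch-001")

    // partitions.length only computes the input splits (block metadata
    // from the NameNode); it does not read any file contents.
    println(s"Planned partitions: ${batch.partitions.length}")

    // File data is only read, block by block, when an action runs.
    println(s"Lines: ${batch.count()}")

    sc.stop()
  }
}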

I want to avoid reading all of the data into memory only to spill it to
disk when there is not enough memory.

Thanks! 
