We are tasked with loading a large file (possibly 2 TB) into a data warehouse. To do this efficiently, we need to split the file into smaller files.
I don't believe there is a way to do this with Spark, because for Spark to distribute the file to the worker nodes, it first has to be split up, right? We ended up using a single machine with a single thread to do the splitting. I just want to make sure I'm not missing something obvious. Thanks!

-- Paul Henry Tremblay
Attunix
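For concreteness, the single-threaded splitting step could be sketched roughly like this (assuming newline-delimited input; the function name, chunk-naming scheme, and default chunk size here are illustrative, not the exact code we ran):

```python
import os

def split_file(src_path, out_dir, lines_per_chunk=1_000_000):
    """Split a newline-delimited file into chunks of at most
    `lines_per_chunk` lines, single-threaded, streaming line by line
    so the whole file never has to fit in memory."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    out = None
    with open(src_path, "rb") as src:
        for i, line in enumerate(src):
            # Start a new chunk file every `lines_per_chunk` lines.
            if i % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"chunk_{len(chunk_paths):05d}")
                chunk_paths.append(path)
                out = open(path, "wb")
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

(On Linux, GNU coreutils `split -l` does essentially the same thing.)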